performance r nfsd with RWF_DONTCACHE and larger wsizes


* performance r nfsd with RWF_DONTCACHE and larger wsizes
@ 2025-05-06 17:40 Jeff Layton
  2025-05-06 18:16 ` Chuck Lever
  2025-05-06 22:31 ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Jeff Layton @ 2025-05-06 17:40 UTC
  To: linux-fsdevel, linux-nfs
  Cc: Chuck Lever, Mike Snitzer, Trond Myklebust, Jens Axboe,
	Chris Mason, Anna Schumaker

FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
patches for nfsd [1]. Those add a module param that makes all reads and
writes use RWF_DONTCACHE.

I had one host that was running knfsd with an XFS export, and a second
that was acting as NFS client. Both machines have tons of memory, so
pagecache utilization is irrelevant for this test.

I tested sequential writes using the fio-seq-write.fio test, both with
and without the module param enabled.

These numbers are from one run each, but they were pretty stable over
several runs:

# fio /usr/share/doc/fio/examples/fio-seq-write.fio

wsize=1M:

Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec

DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 37%
slower. Memory consumption was down, but these boxes have oodles of
memory, so I didn't notice much change there.

Chris suggested that the write sizes were too small in this test, so I
grabbed Chuck's patches to increase the max RPC payload size [2] to 4M,
and patched the client to allow a wsize that big:

wsize=4M:

Normal:       WRITE: bw=1053MiB/s (1104MB/s), 1053MiB/s-1053MiB/s (1104MB/s-1104MB/s), io=930GiB (999GB), run=904526-904526msec
DONTCACHE:    WRITE: bw=1191MiB/s (1249MB/s), 1191MiB/s-1191MiB/s (1249MB/s-1249MB/s), io=1050GiB (1127GB), run=902781-902781msec

Not much change with normal buffered I/O here, but DONTCACHE is faster
with a 4M wsize. My suspicion (unconfirmed) is that the dropbehind flag
ends up causing partially-written large folios in the pagecache to get
written back too early, and that slows down later writes to the same
folios.
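
For reference, the relative differences implied by the fio summaries
above can be worked out with a few lines of Python. This is only
arithmetic on the quoted bandwidth figures; the dictionary layout and
the helper name are illustrative, not part of any tool used in the
test.

```python
# Throughput figures (MiB/s) copied from the fio summaries above.
results = {
    "1M": {"normal": 1034, "dontcache": 649},
    "4M": {"normal": 1053, "dontcache": 1191},
}


def relative_change(baseline: float, value: float) -> float:
    """Return the change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


for wsize, bw in results.items():
    delta = relative_change(bw["normal"], bw["dontcache"])
    print(f"wsize={wsize}: DONTCACHE vs. normal = {delta:+.1f}%")
```

With these inputs the 1M case comes out at roughly -37% and the 4M
case at roughly +13%.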

I wonder if we need some heuristic that makes generic_write_sync() only
kick off writeback immediately if the whole folio is dirty so we have
more time to gather writes before kicking off writeback?
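
To make the idea concrete, here is a toy model of such a heuristic in
Python. It is not kernel code and does not reflect how
generic_write_sync() works today; the function name and the
(offset, length) representation of dirty ranges are assumptions made
purely for illustration.

```python
def folio_fully_dirty(folio_start, folio_size, dirty_ranges):
    """Return True if the dirty ranges cover the whole folio.

    dirty_ranges is a list of (offset, length) tuples in bytes.
    Hypothetical model: a real implementation would consult per-folio
    dirty state rather than a list of ranges.
    """
    end = folio_start + folio_size
    covered = folio_start
    for off, length in sorted(dirty_ranges):
        if off > covered:
            break  # gap before this range: part of the folio is clean
        covered = max(covered, off + length)
        if covered >= end:
            return True
    return covered >= end


# Example: a 2 MiB folio receiving two 1 MiB writes.
MiB = 1024 * 1024
print(folio_fully_dirty(0, 2 * MiB, [(0, MiB)]))              # half dirty
print(folio_fully_dirty(0, 2 * MiB, [(0, MiB), (MiB, MiB)]))  # fully dirty
```

Under this model, writeback would be kicked off only when the function
returns True, and deferred to normal background writeback otherwise.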

This might also be a good reason to think about a larger rsize/wsize
limit in the client.

I'd like to also test reads with this flag, but I'm currently getting
back an EOPNOTSUPP error when I try to test them.

[1]: https://lore.kernel.org/linux-nfs/20250220171205.12092-1-snitzer@kernel.org/
[2]: https://lore.kernel.org/linux-nfs/20250428193702.5186-15-cel@kernel.org/
-- 
Jeff Layton <jlayton@kernel.org>


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-06 17:40 performance r nfsd with RWF_DONTCACHE and larger wsizes Jeff Layton
@ 2025-05-06 18:16 ` Chuck Lever
  2025-05-06 18:30   ` Jeff Layton
  2025-05-06 22:31 ` Dave Chinner
  1 sibling, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2025-05-06 18:16 UTC
  To: Jeff Layton, linux-fsdevel, linux-nfs
  Cc: Mike Snitzer, Trond Myklebust, Jens Axboe, Chris Mason,
	Anna Schumaker

On 5/6/25 1:40 PM, Jeff Layton wrote:
> FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
> patches for nfsd [1]. Those add a module param that make all reads and
> writes use RWF_DONTCACHE.
> 
> I had one host that was running knfsd with an XFS export, and a second
> that was acting as NFS client. Both machines have tons of memory, so
> pagecache utilization is irrelevant for this test.
> 
> I tested sequential writes using the fio-seq_write.fio test, both with
> and without the module param enabled.
> 
> These numbers are from one run each, but they were pretty stable over
> several runs:
> 
> # fio /usr/share/doc/fio/examples/fio-seq-write.fio
> 
> wsize=1M:
> 
> Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
> DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
> 
> DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30%
> slower. Memory consumption was down, but these boxes have oodles of
> memory, so I didn't notice much change there.
> 
> Chris suggested that the write sizes were too small in this test, so I
> grabbed Chuck's patches to increase the max RPC payload size [2] to 4M,
> and patched the client to allow a wsize that big:
> 
> wsize=4M:
> 
> Normal:       WRITE: bw=1053MiB/s (1104MB/s), 1053MiB/s-1053MiB/s (1104MB/s-1104MB/s), io=930GiB (999GB), run=904526-904526msec
> DONTCACHE:    WRITE: bw=1191MiB/s (1249MB/s), 1191MiB/s-1191MiB/s (1249MB/s-1249MB/s), io=1050GiB (1127GB), run=902781-902781msec
> 
> Not much change with normal buffered I/O here, but DONTCACHE is faster
> with a 4M wsize. My suspicion (unconfirmed) is that the dropbehind flag
> ends up causing partially-written large folios in the pagecache to get
> written back too early, and that slows down later writes to the same
> folios.

My feeling is that the NFSD read and write paths are not currently
tuned for large folios -- they break every I/O into single pages.


> I wonder if we need some heuristic that makes generic_write_sync() only
> kick off writeback immediately if the whole folio is dirty so we have
> more time to gather writes before kicking off writeback?

Mike has suggested that NFSD should limit the use of RWF_UNCACHED to
WRITE requests with large payloads (for some arbitrary definition of
"large").


> This might also be a good reason to think about a larger rsize/wsize
> limit in the client.
> 
> I'd like to also test reads with this flag, but I'm currently getting
> back that EOPNOTSUPP error when I try to test them.

That's expected for that patch series.

But I have to ask: what problem do you expect RWF_UNCACHED to solve?


> [1]: https://lore.kernel.org/linux-nfs/20250220171205.12092-1-snitzer@kernel.org/
> [2]: https://lore.kernel.org/linux-nfs/20250428193702.5186-15-cel@kernel.org/


-- 
Chuck Lever


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-06 18:16 ` Chuck Lever
@ 2025-05-06 18:30   ` Jeff Layton
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff Layton @ 2025-05-06 18:30 UTC
  To: Chuck Lever, linux-fsdevel, linux-nfs
  Cc: Mike Snitzer, Trond Myklebust, Jens Axboe, Chris Mason,
	Anna Schumaker

On Tue, 2025-05-06 at 14:16 -0400, Chuck Lever wrote:
> On 5/6/25 1:40 PM, Jeff Layton wrote:
> > FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
> > patches for nfsd [1]. Those add a module param that make all reads and
> > writes use RWF_DONTCACHE.
> > 
> > I had one host that was running knfsd with an XFS export, and a second
> > that was acting as NFS client. Both machines have tons of memory, so
> > pagecache utilization is irrelevant for this test.
> > 
> > I tested sequential writes using the fio-seq_write.fio test, both with
> > and without the module param enabled.
> > 
> > These numbers are from one run each, but they were pretty stable over
> > several runs:
> > 
> > # fio /usr/share/doc/fio/examples/fio-seq-write.fio
> > 
> > wsize=1M:
> > 
> > Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
> > DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
> > 
> > DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30%
> > slower. Memory consumption was down, but these boxes have oodles of
> > memory, so I didn't notice much change there.
> > 
> > Chris suggested that the write sizes were too small in this test, so I
> > grabbed Chuck's patches to increase the max RPC payload size [2] to 4M,
> > and patched the client to allow a wsize that big:
> > 
> > wsize=4M:
> > 
> > Normal:       WRITE: bw=1053MiB/s (1104MB/s), 1053MiB/s-1053MiB/s (1104MB/s-1104MB/s), io=930GiB (999GB), run=904526-904526msec
> > DONTCACHE:    WRITE: bw=1191MiB/s (1249MB/s), 1191MiB/s-1191MiB/s (1249MB/s-1249MB/s), io=1050GiB (1127GB), run=902781-902781msec
> > 
> > Not much change with normal buffered I/O here, but DONTCACHE is faster
> > with a 4M wsize. My suspicion (unconfirmed) is that the dropbehind flag
> > ends up causing partially-written large folios in the pagecache to get
> > written back too early, and that slows down later writes to the same
> > folios.
> 
> My feeling is that at this point, the NFSD read and write paths are not
> currently tuned for large folios -- they break every I/O into single
> pages.
> 

*nod*

> 
> > I wonder if we need some heuristic that makes generic_write_sync() only
> > kick off writeback immediately if the whole folio is dirty so we have
> > more time to gather writes before kicking off writeback?
> 
> Mike has suggested that NFSD should limit the use RWF_UNCACHED to
> WRITE requests with large payloads (for some arbitrary definition of
> "large").
> 

Yeah. I think we need something along those lines.

> 
> > This might also be a good reason to think about a larger rsize/wsize
> > limit in the client.
> > 
> > I'd like to also test reads with this flag, but I'm currently getting
> > back that EOPNOTSUPP error when I try to test them.
> 
> That's expected for that patch series.
> 

Yep, I figured.

> But I have to ask: what problem do you expect RWF_UNCACHED to solve?
> 

I don't have a problem to solve, per se. I was mainly just wondering
what sort of effect RWF_DONTCACHE and larger payloads would have on
performance.

> 
> > [1]: https://lore.kernel.org/linux-nfs/20250220171205.12092-1-snitzer@kernel.org/
> > [2]: https://lore.kernel.org/linux-nfs/20250428193702.5186-15-cel@kernel.org/
> 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-06 17:40 performance r nfsd with RWF_DONTCACHE and larger wsizes Jeff Layton
  2025-05-06 18:16 ` Chuck Lever
@ 2025-05-06 22:31 ` Dave Chinner
  2025-05-07  0:06   ` Jeff Layton
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2025-05-06 22:31 UTC
  To: Jeff Layton
  Cc: linux-fsdevel, linux-nfs, Chuck Lever, Mike Snitzer,
	Trond Myklebust, Jens Axboe, Chris Mason, Anna Schumaker

On Tue, May 06, 2025 at 01:40:35PM -0400, Jeff Layton wrote:
> FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
> patches for nfsd [1]. Those add a module param that make all reads and
> writes use RWF_DONTCACHE.
> 
> I had one host that was running knfsd with an XFS export, and a second
> that was acting as NFS client. Both machines have tons of memory, so
> pagecache utilization is irrelevant for this test.

Does RWF_DONTCACHE result in server side STABLE write requests from
the NFS client, or are they still unstable and require a post-write
completion COMMIT operation from the client to trigger server side
writeback before the client can discard the page cache?

> I tested sequential writes using the fio-seq_write.fio test, both with
> and without the module param enabled.
> 
> These numbers are from one run each, but they were pretty stable over
> several runs:
> 
> # fio /usr/share/doc/fio/examples/fio-seq-write.fio

$ cat /usr/share/doc/fio/examples/fio-seq-write.fio
cat: /usr/share/doc/fio/examples/fio-seq-write.fio: No such file or directory
$

What are the fio control parameters of the IO you are doing? (e.g.
is this single threaded IO, does it use the psync, libaio or io_uring
engine, etc)

> wsize=1M:
> 
> Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
> DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
> 
> DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30%
> slower. Memory consumption was down, but these boxes have oodles of
> memory, so I didn't notice much change there.

So what is the IO pattern that the NFSD is sending to the underlying
XFS filesystem?

Is it sending 1M RWF_DONTCACHE buffered IOs to XFS as well (i.e. one
buffered write IO per NFS client write request), or is DONTCACHE
only being used on the NFS client side?

> I wonder if we need some heuristic that makes generic_write_sync() only
> kick off writeback immediately if the whole folio is dirty so we have
> more time to gather writes before kicking off writeback?

You're doing aligned 1MB IOs - there should be no partially dirty
large folios in either the client or the server page caches.

That said, this is part of the reason I asked about both whether the
client side write is STABLE and whether RWF_DONTCACHE is being used on
the server side. i.e. using either of those will trigger writeback
on the server side immediately; in the case of the former it will
also complete before returning to the client and not require a
subsequent COMMIT RPC to wait for server side IO completion...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-06 22:31 ` Dave Chinner
@ 2025-05-07  0:06   ` Jeff Layton
  2025-05-07  2:50     ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2025-05-07  0:06 UTC
  To: Dave Chinner
  Cc: linux-fsdevel, linux-nfs, Chuck Lever, Mike Snitzer,
	Trond Myklebust, Jens Axboe, Chris Mason, Anna Schumaker

On Wed, 2025-05-07 at 08:31 +1000, Dave Chinner wrote:
> On Tue, May 06, 2025 at 01:40:35PM -0400, Jeff Layton wrote:
> > FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
> > patches for nfsd [1]. Those add a module param that make all reads and
> > writes use RWF_DONTCACHE.
> > 
> > I had one host that was running knfsd with an XFS export, and a second
> > that was acting as NFS client. Both machines have tons of memory, so
> > pagecache utilization is irrelevant for this test.
> 
> Does RWF_DONTCACHE result in server side STABLE write requests from
> the NFS client, or are they still unstable and require a post-write
> completion COMMIT operation from the client to trigger server side
> writeback before the client can discard the page cache?
> 

The latter. I didn't change the client at all here (other than to allow
it to do bigger writes on the wire). It's just doing bog-standard
buffered I/O. nfsd is adding RWF_DONTCACHE to every write via Mike's
patch.

> > I tested sequential writes using the fio-seq_write.fio test, both with
> > and without the module param enabled.
> > 
> > These numbers are from one run each, but they were pretty stable over
> > several runs:
> > 
> > # fio /usr/share/doc/fio/examples/fio-seq-write.fio
> 
> $ cat /usr/share/doc/fio/examples/fio-seq-write.fio
> cat: /usr/share/doc/fio/examples/fio-seq-write.fio: No such file or directory
> $
> 
> What are the fio control parameters of the IO you are doing? (e.g.
> is this single threaded IO, does it use the psync, libaio or iouring
> engine, etc)
> 


; fio-seq-write.job for fiotest

[global]
name=fio-seq-write
filename=fio-seq-write
rw=write
bs=256K
direct=0
numjobs=1
time_based
runtime=900

[file1]
size=10G
ioengine=libaio
iodepth=16


> > wsize=1M:
> > 
> > Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
> > DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
> > 
> > DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30%
> > slower. Memory consumption was down, but these boxes have oodles of
> > memory, so I didn't notice much change there.
> 
> So what is the IO pattern that the NFSD is sending to the underlying
> XFS filesystem?
> 
> Is it sending 1M RWF_DONTCACHE buffered IOs to XFS as well (i.e. one
> buffered write IO per NFS client write request), or is DONTCACHE
> only being used on the NFS client side?
> 

It should be sequential I/O, though the writes would be coming in
from different nfsd threads. nfsd just does standard buffered I/O. The
WRITE handler calls nfsd_vfs_write(), which calls vfs_iter_write().
With the module parameter enabled, it also adds RWF_DONTCACHE.
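
As a rough userspace analogue of what the server-side flag does, the
sketch below issues a buffered write with RWF_DONTCACHE through
pwritev2 (via Python's os.pwritev) and falls back to a plain buffered
write when the kernel or filesystem rejects the flag, which is the
EOPNOTSUPP case mentioned earlier. The helper name is made up for
illustration, and the numeric flag value is an assumption taken from
the v6.14+ uapi headers; check it against your own <linux/fs.h>.

```python
import errno
import os

# Assumption: RWF_DONTCACHE as defined in the Linux uapi headers (v6.14+).
RWF_DONTCACHE = 0x00000080


def pwrite_dontcache(fd: int, data: bytes, offset: int) -> tuple[int, bool]:
    """Write `data` at `offset`, preferring an uncached buffered write.

    Returns (bytes_written, used_dontcache). Short writes are possible,
    as with any pwritev call; callers that care should loop.
    """
    try:
        return os.pwritev(fd, [data], offset, RWF_DONTCACHE), True
    except NotImplementedError:
        pass  # Python was built without pwritev2 support
    except OSError as exc:
        # Kernels or filesystems without DONTCACHE support reject the flag.
        if exc.errno not in (errno.EOPNOTSUPP, errno.EINVAL, errno.ENOSYS):
            raise
    # Fall back to an ordinary buffered write.
    return os.pwritev(fd, [data], offset), False
```

Calling pwrite_dontcache(fd, buf, 0) on a file opened for writing
returns the byte count and whether the flag was accepted, so a test
harness can record which path was actually exercised.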

DONTCACHE is only being used on the server side. To be clear, the
protocol doesn't support that flag (yet), so we have no way to project
DONTCACHE from the client to the server (yet). This is just early
exploration to see whether DONTCACHE offers any benefit to this
workload.

> > I wonder if we need some heuristic that makes generic_write_sync() only
> > kick off writeback immediately if the whole folio is dirty so we have
> > more time to gather writes before kicking off writeback?
> 
> You're doing aligned 1MB IOs - there should be no partially dirty
> large folios in either the client or the server page caches.
> 

Interesting. I wonder what accounts for the slowdown with 1M writes? It
seems likely to be related to the more aggressive writeback with
DONTCACHE enabled, but it'd be good to understand this.

> That said, this is part of the reason I asked about both whether the
> client side write is STABLE and  whether RWF_DONTCACHE on
> the server side. i.e. using either of those will trigger writeback
> on the serer side immediately; in the case of the former it will
> also complete before returning to the client and not require a
> subsequent COMMIT RPC to wait for server side IO completion...
> 

I need to go back and sniff traffic to be sure, but I'm fairly certain
the client is issuing regular UNSTABLE writes and following up with a
later COMMIT, at least for most of them. The occasional STABLE write
might end up getting through, but that should be fairly rare.

-- 
Jeff Layton <jlayton@kernel.org>


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-07  0:06   ` Jeff Layton
@ 2025-05-07  2:50     ` Dave Chinner
  2025-05-07 13:43       ` Chuck Lever
  2025-05-07 21:50       ` Mike Snitzer
  0 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2025-05-07  2:50 UTC
  To: Jeff Layton
  Cc: linux-fsdevel, linux-nfs, Chuck Lever, Mike Snitzer,
	Trond Myklebust, Jens Axboe, Chris Mason, Anna Schumaker

On Tue, May 06, 2025 at 08:06:51PM -0400, Jeff Layton wrote:
> On Wed, 2025-05-07 at 08:31 +1000, Dave Chinner wrote:
> > On Tue, May 06, 2025 at 01:40:35PM -0400, Jeff Layton wrote:
> > > FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
> > > patches for nfsd [1]. Those add a module param that make all reads and
> > > writes use RWF_DONTCACHE.
> > > 
> > > I had one host that was running knfsd with an XFS export, and a second
> > > that was acting as NFS client. Both machines have tons of memory, so
> > > pagecache utilization is irrelevant for this test.
> > 
> > Does RWF_DONTCACHE result in server side STABLE write requests from
> > the NFS client, or are they still unstable and require a post-write
> > completion COMMIT operation from the client to trigger server side
> > writeback before the client can discard the page cache?
> > 
> 
> The latter. I didn't change the client at all here (other than to allow
> it to do bigger writes on the wire). It's just doing bog-standard
> buffered I/O. nfsd is adding RWF_DONTCACHE to every write via Mike's
> patch.

Ok, it wasn't clear that it was only server side RWF_DONTCACHE.

I have some more context from a different (internal) discussion
thread about how poorly the NFSD read side performs with
RWF_DONTCACHE compared to O_DIRECT. This is because there is massive
page allocator spin lock contention due to all the concurrent reads
being serviced.

The buffered write path locking is different, but I suspect
something similar is occurring and I'm going to ask you to confirm
it...

> > > I tested sequential writes using the fio-seq_write.fio test, both with
> > > and without the module param enabled.
> > > 
> > > These numbers are from one run each, but they were pretty stable over
> > > several runs:
> > > 
> > > # fio /usr/share/doc/fio/examples/fio-seq-write.fio
> > 
> > $ cat /usr/share/doc/fio/examples/fio-seq-write.fio
> > cat: /usr/share/doc/fio/examples/fio-seq-write.fio: No such file or directory
> > $
> > 
> > What are the fio control parameters of the IO you are doing? (e.g.
> > is this single threaded IO, does it use the psync, libaio or iouring
> > engine, etc)
> > 
> 
> 
> ; fio-seq-write.job for fiotest
> 
> [global]
> name=fio-seq-write
> filename=fio-seq-write
> rw=write
> bs=256K
> direct=0
> numjobs=1
> time_based
> runtime=900
> 
> [file1]
> size=10G
> ioengine=libaio
> iodepth=16

Ok, so we are doing AIO writes on the client side, so we have ~16
writes on the wire from the client at any given time.

This also means they are likely not being received by the NFS server
in sequential order, and the NFS server is going to be processing
roughly 16 write RPCs to the same file concurrently using
RWF_DONTCACHE IO.

These are not going to be exactly sequential - the server side IO
pattern to the filesystem is quasi-sequential, with random IOs being
out of order and leaving temporary holes in the file until the
out-of-order write is processed.

XFS should handle this fine via the speculative preallocation beyond
EOF that is triggered by extending writes (it was designed to
mitigate the fragmentation this NFS behaviour causes). However, we
should always keep in mind that while client side IO is sequential,
what the server is doing to the underlying filesystem needs to be
treated as "concurrent IO to a single file" rather than "sequential
IO".
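
A small Python model illustrates the "temporary holes" effect. It
treats the file as numbered, equally sized blocks and counts how many
blocks below the highest written block are still missing after each
arrival. The function name and the block-number abstraction are
assumptions for illustration only; nothing here models XFS
preallocation.

```python
def holes_over_time(arrival_order):
    """Count unwritten blocks below the highest written block after each arrival.

    arrival_order is the sequence of distinct block numbers in the order
    the server processes them.
    """
    written = set()
    holes = []
    for block in arrival_order:
        written.add(block)
        high = max(written)
        # Blocks 0..high exist logically; any not yet written is a hole.
        holes.append(high + 1 - len(written))
    return holes


print(holes_over_time([0, 1, 2, 3]))  # strictly sequential arrival
print(holes_over_time([1, 0, 3, 2]))  # pairwise reordering on the wire
```

With in-order arrival the hole count stays at zero throughout; with the
reordered sequence a hole appears and is then filled each time a late
write is processed.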

> > > wsize=1M:
> > > 
> > > Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
> > > DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
> > > 
> > > DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30%
> > > slower. Memory consumption was down, but these boxes have oodles of
> > > memory, so I didn't notice much change there.
> > 
> > So what is the IO pattern that the NFSD is sending to the underlying
> > XFS filesystem?
> > 
> > Is it sending 1M RWF_DONTCACHE buffered IOs to XFS as well (i.e. one
> > buffered write IO per NFS client write request), or is DONTCACHE
> > only being used on the NFS client side?
> > 
> 
> It's should be sequential I/O, though the writes would be coming in
> from different nfsd threads. nfsd just does standard buffered I/O. The
> WRITE handler calls nfsd_vfs_write(), which calls vfs_write_iter().
> With the module parameter enabled, it also adds RWF_DONTCACHE.

Ok, so buffered writes (even with RWF_DONTCACHE) are not processed
concurrently by XFS - there's an exclusive lock on the inode that
will be serialising all the buffered write IO.

Given that most of the work that XFS will be doing during the write
will not require releasing the CPU, there is a good chance that
there is spin contention on the i_rwsem from the 15 other write
waiters.

That may be a contributing factor to poor performance, so kernel
profiles from the NFS server for both the normal buffered write path
and the RWF_DONTCACHE buffered write path would be useful. Having
some idea of the total CPU usage of the nfsds during the workload
would also be useful.

> DONTCACHE is only being used on the server side. To be clear, the
> protocol doesn't support that flag (yet), so we have no way to project
> DONTCACHE from the client to the server (yet). This is just early
> exploration to see whether DONTCACHE offers any benefit to this
> workload.

The nfs client largely aligns all of the page cache based IO, so I'd
think that O_DIRECT on the server side would be much more performant
than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
writes all the way down to the storage.....

> > > I wonder if we need some heuristic that makes generic_write_sync() only
> > > kick off writeback immediately if the whole folio is dirty so we have
> > > more time to gather writes before kicking off writeback?
> > 
> > You're doing aligned 1MB IOs - there should be no partially dirty
> > large folios in either the client or the server page caches.
> 
> Interesting. I wonder what accounts for the slowdown with 1M writes? It
> seems likely to be related to the more aggressive writeback with
> DONTCACHE enabled, but it'd be good to understand this.

What I suspect is that block layer IO submission latency has
increased significantly with RWF_DONTCACHE and that is slowing down
the rate at which it can service buffered writes to a single file.

The difference between normal buffered writes and RWF_DONTCACHE is
that the write() context will marshall the dirty folios into bios
and submit them to the block layer (via generic_write_sync()). If
the underlying device queues are full, then the bio submission will
be throttled to wait for IO completion.

At this point, all NFSD write processing to that file stalls. All
the other nfsds are blocked on the i_rwsem, and that can't be
released until the holder is released by the block layer throttling.
Hence any time the underlying device queue fills, nfsd processing of
incoming writes stalls completely.

When doing normal buffered writes, this IO submission stalling does
not occur because there is no direct writeback occurring in the
write() path.

Remember the bad old days of balance_dirty_pages() doing dirty
throttling by submitting dirty pages for IO directly in the write()
context? And how much better buffered write performance and write()
submission latency became when we started deferring that IO to the
writeback threads and waiting on completions?

We're essentially going back to the bad old days with buffered
RWF_DONTCACHE writes. Instead of one nicely formed background
writeback stream that can be throttled at the block layer without
adversely affecting incoming write throughput, we end up with every
write() context submitting IO synchronously and being randomly
throttled by the block layer throttle....

There are a lot of reasons the current RWF_DONTCACHE implementation
is sub-optimal for common workloads. This IO spraying and submission
side throttling problem is one of the reasons why I suggested very
early on that an async write-behind window (similar in concept to
async readahead windows) would likely be a much better generic
solution for RWF_DONTCACHE writes. This would retain the "one nicely
formed background writeback stream" behaviour that is desirable for
buffered writes, but still allow rapid reclaim of DONTCACHE folios
as IO cleans them...
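
One way to picture a write-behind window is the toy Python model
below. It assumes strictly sequential writes to one file and simply
records which byte ranges would be handed to a background writeback
context once the amount of dirty data behind the write pointer exceeds
the window. The class, its fields and the window/chunk parameters are
all hypothetical; no such interface exists in the kernel today.

```python
class WriteBehindWindow:
    """Toy model of an async write-behind window for sequential writes."""

    def __init__(self, window_bytes, chunk_bytes):
        self.window_bytes = window_bytes  # dirty data allowed to accumulate
        self.chunk_bytes = chunk_bytes    # size of each background submission
        self.flushed_to = 0               # below this, data is queued for writeback
        self.dirty_to = 0                 # end of the contiguous dirty extent
        self.queued = []                  # (offset, length) given to background writeback

    def write(self, offset, length):
        # Only contiguous (sequential or overlapping) writes extend the extent.
        if offset <= self.dirty_to:
            self.dirty_to = max(self.dirty_to, offset + length)
        # The writer never submits IO itself; it only queues ranges.
        while self.dirty_to - self.flushed_to > self.window_bytes:
            n = min(self.chunk_bytes, self.dirty_to - self.flushed_to)
            self.queued.append((self.flushed_to, n))
            self.flushed_to += n


MiB = 1024 * 1024
wb = WriteBehindWindow(window_bytes=4 * MiB, chunk_bytes=MiB)
for i in range(6):
    wb.write(i * MiB, MiB)
print(wb.queued)
```

In this model the write() caller only does bookkeeping, so a single
background stream remains the one place where block layer throttling
would be felt.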

> > That said, this is part of the reason I asked about both whether the
> > client side write is STABLE and  whether RWF_DONTCACHE on
> > the server side. i.e. using either of those will trigger writeback
> > on the serer side immediately; in the case of the former it will
> > also complete before returning to the client and not require a
> > subsequent COMMIT RPC to wait for server side IO completion...
> > 
> 
> I need to go back and sniff traffic to be sure, but I'm fairly certain
> the client is issuing regular UNSTABLE writes and following up with a
> later COMMIT, at least for most of them. The occasional STABLE write
> might end up getting through, but that should be fairly rare.

Yeah, I don't think that's an issue given that only the server side
is using RWF_DONTCACHE. The COMMIT will effectively just be a
journal and/or device cache flush as all the dirty data has already
been written back to storage....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-07  2:50     ` Dave Chinner
@ 2025-05-07 13:43       ` Chuck Lever
  2025-05-08  1:13         ` Dave Chinner
  2025-05-07 21:50       ` Mike Snitzer
  1 sibling, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2025-05-07 13:43 UTC
  To: Dave Chinner, Jeff Layton
  Cc: linux-fsdevel, linux-nfs, Mike Snitzer, Trond Myklebust,
	Jens Axboe, Chris Mason, Anna Schumaker

On 5/6/25 10:50 PM, Dave Chinner wrote:
> On Tue, May 06, 2025 at 08:06:51PM -0400, Jeff Layton wrote:
>> On Wed, 2025-05-07 at 08:31 +1000, Dave Chinner wrote:
>>> On Tue, May 06, 2025 at 01:40:35PM -0400, Jeff Layton wrote:
>>>> FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
>>>> patches for nfsd [1]. Those add a module param that make all reads and
>>>> writes use RWF_DONTCACHE.
>>>>
>>>> I had one host that was running knfsd with an XFS export, and a second
>>>> that was acting as NFS client. Both machines have tons of memory, so
>>>> pagecache utilization is irrelevant for this test.
>>>
>>> Does RWF_DONTCACHE result in server side STABLE write requests from
>>> the NFS client, or are they still unstable and require a post-write
>>> completion COMMIT operation from the client to trigger server side
>>> writeback before the client can discard the page cache?
>>>
>>
>> The latter. I didn't change the client at all here (other than to allow
>> it to do bigger writes on the wire). It's just doing bog-standard
>> buffered I/O. nfsd is adding RWF_DONTCACHE to every write via Mike's
>> patch.
> 
> Ok, that wasn't clear that it was only server side RWF_DONTCACHE.
> 
> I have some more context from a different (internal) discussion
> thread about how poorly the NFSD read side performs with
> RWF_DONTCACHE compared to O_DIRECT. This is because there is massive
> page allocator spin lock contention due to all the concurrent reads
> being serviced.
> 
> The buffered write path locking is different, but I suspect
> something similar is occurring and I'm going to ask you to confirm
> it...
> 
>>>> I tested sequential writes using the fio-seq_write.fio test, both with
>>>> and without the module param enabled.
>>>>
>>>> These numbers are from one run each, but they were pretty stable over
>>>> several runs:
>>>>
>>>> # fio /usr/share/doc/fio/examples/fio-seq-write.fio
>>>
>>> $ cat /usr/share/doc/fio/examples/fio-seq-write.fio
>>> cat: /usr/share/doc/fio/examples/fio-seq-write.fio: No such file or directory
>>> $
>>>
>>> What are the fio control parameters of the IO you are doing? (e.g.
>>> is this single threaded IO, does it use the psync, libaio or iouring
>>> engine, etc)
>>>
>>
>>
>> ; fio-seq-write.job for fiotest
>>
>> [global]
>> name=fio-seq-write
>> filename=fio-seq-write
>> rw=write
>> bs=256K
>> direct=0
>> numjobs=1
>> time_based
>> runtime=900
>>
>> [file1]
>> size=10G
>> ioengine=libaio
>> iodepth=16
> 
> Ok, so we are doing AIO writes on the client side, so we have ~16
> writes on the wire from the client at any given time.
> 
> This also means they are likely not being received by the NFS server
> in sequential order, and the NFS server is going to be processing
> roughly 16 write RPCs to the same file concurrently using
> RWF_DONTCACHE IO.
> 
> These are not going to be exactly sequential - the server side IO
> pattern to the filesystem is quasi-sequential, with random IOs being
> out of order and leaving temporary holes in the file until the OO
> write is processed.
> 
> XFS should handle this fine via the speculative preallocation beyond
> EOF that is triggered by extending writes (it was designed to
> mitigate the fragmentation this NFS behaviour causes). However, we
> should always keep in mind that while client side IO is sequential,
> what the server is doing to the underlying filesystem needs to be
> treated as "concurrent IO to a single file" rather than "sequential
> IO".
> 
>>>> wsize=1M:
>>>>
>>>> Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
>>>> DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
>>>>
>>>> DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30%
>>>> slower. Memory consumption was down, but these boxes have oodles of
>>>> memory, so I didn't notice much change there.
>>>
>>> So what is the IO pattern that the NFSD is sending to the underlying
>>> XFS filesystem?
>>>
>>> Is it sending 1M RWF_DONTCACHE buffered IOs to XFS as well (i.e. one
>>> buffered write IO per NFS client write request), or is DONTCACHE
>>> only being used on the NFS client side?
>>>
>>
>> It should be sequential I/O, though the writes would be coming in
>> from different nfsd threads. nfsd just does standard buffered I/O. The
>> WRITE handler calls nfsd_vfs_write(), which calls vfs_write_iter().
>> With the module parameter enabled, it also adds RWF_DONTCACHE.
> 
> Ok, so buffered writes (even with RWF_DONTCACHE) are not processed
> concurrently by XFS - there's an exclusive lock on the inode that
> will be serialising all the buffered write IO.
> 
> Given that most of the work that XFS will be doing during the write
> will not require releasing the CPU, there is a good chance that
> there is spin contention on the i_rwsem from the 15 other write
> waiters.

This observation echoes my experience with a client pushing 16MB
writes via 1MB NFS WRITEs to one file. They are serialized on the server
by the i_rwsem (or a similar generic per-file lock). The first NFS WRITE
to be emitted by the client is as fast as can be expected, but the RTT
of the last NFS WRITE to be emitted by the client is almost exactly 16
times longer.
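
As a rough illustration of that 16x figure, here is a minimal Python
sketch of fully serialised writes. It assumes every WRITE arrives at
the same instant and takes the same time under the per-file lock; the
function name `completion_times` and the 1.0 time unit are hypothetical
and are not taken from any measurement in this thread.

```python
def completion_times(n_writes, service_time):
    """Completion time of each write when all of them arrive at t=0
    and are serviced one at a time under an exclusive per-file lock.

    Assumption: identical service time per write, no network delay.
    """
    return [(i + 1) * service_time for i in range(n_writes)]


# 16 x 1MB WRITEs, each taking one (hypothetical) time unit under the lock.
times = completion_times(16, 1.0)
print("first WRITE completes at", times[0])
print("last WRITE completes at ", times[-1])
print("ratio last/first        ", times[-1] / times[0])
```

In this model the ratio is simply the number of writes queued behind
the lock, which matches the "almost exactly 16 times longer" symptom.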

I've wanted to drill into this for some time, but unfortunately (for me)
I always seem to have higher priority issues to deal with.

Comparing performance with a similar patch series that implements
uncached server-side I/O with O_DIRECT rather than RWF_DONTCACHE might
be illuminating.


> That may be a contributing factor to poor performance, so kernel
> profiles from the NFS server for both the normal buffered write path
> and the RWF_DONTCACHE buffered write path would be useful. Having
> some idea of the total CPU usage of the nfsds during the workload
> would also help.
> 
>> DONTCACHE is only being used on the server side. To be clear, the
>> protocol doesn't support that flag (yet), so we have no way to project
>> DONTCACHE from the client to the server (yet). This is just early
>> exploration to see whether DONTCACHE offers any benefit to this
>> workload.
> 
> The nfs client largely aligns all of the page cache based IO, so I'd
> think that O_DIRECT on the server side would be much more performant
> than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
> writes all the way down to the storage.....
> 
>>>> I wonder if we need some heuristic that makes generic_write_sync() only
>>>> kick off writeback immediately if the whole folio is dirty so we have
>>>> more time to gather writes before kicking off writeback?
>>>
>>> You're doing aligned 1MB IOs - there should be no partially dirty
>>> large folios in either the client or the server page caches.
>>
>> Interesting. I wonder what accounts for the slowdown with 1M writes? It
>> seems likely to be related to the more aggressive writeback with
>> DONTCACHE enabled, but it'd be good to understand this.
> 
> What I suspect is that block layer IO submission latency has
> increased significantly  with RWF_DONTCACHE and that is slowing down
> the rate at which it can service buffered writes to a single file.
> 
> The difference between normal buffered writes and RWF_DONTCACHE is
> that the write() context will marshall the dirty folios into bios
> and submit them to the block layer (via generic_write_sync()). If
> the underlying device queues are full, then the bio submission will
> be throttled to wait for IO completion.
> 
> At this point, all NFSD write processing to that file stalls. All
> the other nfsds are blocked on the i_rwsem, and that can't be
> released until the holder is released by the block layer throttling.
> Hence any time the underlying device queue fills, nfsd processing of
> incoming writes stalls completely.
> 
> When doing normal buffered writes, this IO submission stalling does
> not occur because there is no direct writeback occurring in the
> write() path.
> 
> Remember the bad old days of balance_dirty_pages() doing dirty
> throttling by submitting dirty pages for IO directly in the write()
> context? And how much better buffered write performance and write()
> submission latency became when we started deferring that IO to the
> writeback threads and waiting on completions?
> 
> We're essentially going back to the bad old days with buffered
> RWF_DONTCACHE writes. Instead of one nicely formed background
> writeback stream that can be throttled at the block layer without
> adversely affecting incoming write throughput, we end up with every
> write() context submitting IO synchronously and being randomly
> throttled by the block layer throttle....
> 
> There are a lot of reasons the current RWF_DONTCACHE implementation
> is sub-optimal for common workloads. This IO spraying and submission
> side throttling problem is one of the reasons why I suggested very
> early on that an async write-behind window (similar in concept to
> async readahead windows) would likely be a much better generic
> solution for RWF_DONTCACHE writes. This would retain the "one nicely
> formed background writeback stream" behaviour that is desirable for
> buffered writes, but still allow rapid reclaim of DONTCACHE folios
> as IO cleans them...
> 
>>> That said, this is part of the reason I asked about both whether the
>>> client side write is STABLE and whether RWF_DONTCACHE is used on
>>> the server side. i.e. using either of those will trigger writeback
>>> on the server side immediately; in the case of the former it will
>>> also complete before returning to the client and not require a
>>> subsequent COMMIT RPC to wait for server side IO completion...
>>>
>>
>> I need to go back and sniff traffic to be sure, but I'm fairly certain
>> the client is issuing regular UNSTABLE writes and following up with a
>> later COMMIT, at least for most of them. The occasional STABLE write
>> might end up getting through, but that should be fairly rare.
> 
> Yeah, I don't think that's an issue given that only the server side
> is using RWF_DONTCACHE. The COMMIT will effectively just be a
> journal and/or device cache flush as all the dirty data has already
> been written back to storage....
> 
> -Dave.


-- 
Chuck Lever


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-07 13:43       ` Chuck Lever
@ 2025-05-08  1:13         ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2025-05-08  1:13 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, linux-fsdevel, linux-nfs, Mike Snitzer,
	Trond Myklebust, Jens Axboe, Chris Mason, Anna Schumaker

On Wed, May 07, 2025 at 09:43:05AM -0400, Chuck Lever wrote:
> On 5/6/25 10:50 PM, Dave Chinner wrote:
> > Ok, so buffered writes (even with RWF_DONTCACHE) are not processed
> > concurrently by XFS - there's an exclusive lock on the inode that
> > will be serialising all the buffered write IO.
> > 
> > Given that most of the work that XFS will be doing during the write
> > will not require releasing the CPU, there is a good chance that
> > there is spin contention on the i_rwsem from the 15 other write
> > waiters.
> 
> This observation echoes my experience with a client pushing 16MB
> writes via 1MB NFS WRITEs to one file. They are serialized on the server
> by the i_rwsem (or a similar generic per-file lock). The first NFS WRITE
> to be emitted by the client is as fast as can be expected, but the RTT
> of the last NFS WRITE to be emitted by the client is almost exactly 16
> times longer.

Yes, that is the symptom that will be visible if you just batch
write IO 16 at a time. If you allow AIO submission up to a depth
of 16 (i.e. first 16 submit in a batch, then submit new IO in
completion batch sizes) then there are always 16 writes on the wire
instead of it trailing off like 16 -> 0, 16 -> 0, 16 -> 0.

This would at least keep the pipeline full, but it does nothing to
address the IO latency of the server side serialisation.
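
The two submission patterns can be compared with a toy Python model.
This is only a sketch under stated assumptions: the server completes
exactly one write at a time (the serialised case described above),
network latency is ignored, and the function names are hypothetical.

```python
def inflight_batched(depth, total):
    """In-flight count sampled just before each completion when the
    client submits `depth` writes, waits for all of them to finish,
    and only then submits the next batch."""
    samples = []
    remaining = total
    while remaining > 0:
        batch = min(depth, remaining)
        # A serialised server drains the batch one write at a time.
        samples.extend(range(batch, 0, -1))
        remaining -= batch
    return samples


def inflight_windowed(depth, total):
    """In-flight count sampled just before each completion when every
    completion immediately triggers a new submission (AIO-style)."""
    samples = []
    submitted = min(depth, total)
    inflight = submitted
    for _ in range(total):
        samples.append(inflight)
        inflight -= 1
        if submitted < total:
            # Refill the window as soon as a slot frees up.
            submitted += 1
            inflight += 1
    return samples


batched = inflight_batched(16, 160)
windowed = inflight_windowed(16, 160)
print("batched mean in flight: ", sum(batched) / len(batched))
print("windowed mean in flight:", sum(windowed) / len(windowed))
```

The windowed variant holds the queue at the full depth until the tail
of the run, but each individual write still waits its turn behind the
server-side lock, so per-write latency is unchanged.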

There is some work in progress to allow concurrent buffered writes
in XFS, and this would largely solve this issue for the NFS
server...

> I've wanted to drill into this for some time, but unfortunately (for me)
> I always seem to have higher priority issues to deal with.

It's really an XFS thing, not an NFS server problem...

> Comparing performance with a similar patch series that implements
> uncached server-side I/O with O_DIRECT rather than RWF_DONTCACHE might
> be illuminating.

Yes, that will directly compare concurrent vs serialised submission,
but O_DIRECT will also include IO completion latency in the write
RTT, so overall write throughput can still go down.

In my experience, improving NFS IO throughput is all about
maximising the number of OTW requests in flight (client side) whilst
simultaneously minimising the latency of individual IO operations
(server side). RWF_DONTCACHE makes the latency of individual
operations somewhat worse, O_DIRECT makes the latency quite a bit
worse. O_DIRECT, however, can mitigate IO latency via concurrency,
but RWF_DONTCACHE cannot (yet).

Hence it is no surprise to me that, everything else being equal,
these server side options actually reduce throughput rather than
improve it...
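
The trade-off between per-operation latency and concurrency can be
sketched with Little's law (operations in flight = throughput x
latency). The Python below is illustrative only: the latencies and the
effective concurrency values are made-up numbers chosen to show the
shape of the argument, not measurements from this thread, and
`throughput_mib_s` is a hypothetical helper.

```python
def throughput_mib_s(concurrency, io_size_mib, latency_ms):
    """Steady-state throughput from Little's law.

    concurrency: operations the server actually services in parallel
    io_size_mib: size of each operation in MiB
    latency_ms:  service latency of one operation in milliseconds
    """
    return concurrency * io_size_mib * 1000 / latency_ms


# Hypothetical figures for 1 MiB writes to a single file.
cases = {
    # Serialised by the inode lock: effective concurrency of 1.
    "buffered (serialised)": throughput_mib_s(1, 1, 1),
    # Same serialisation, somewhat higher per-write latency.
    "RWF_DONTCACHE (serialised)": throughput_mib_s(1, 1, 2),
    # Much higher per-write latency, but 16 writes proceed in parallel.
    "O_DIRECT (concurrent)": throughput_mib_s(16, 1, 4),
}
for name, mib_s in cases.items():
    print(f"{name}: {mib_s:.0f} MiB/s")
```

With serialised submission any added latency comes straight off the
throughput; only the concurrent case has a way to hide it.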

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-07  2:50     ` Dave Chinner
  2025-05-07 13:43       ` Chuck Lever
@ 2025-05-07 21:50       ` Mike Snitzer
  2025-05-08  0:09         ` Jeff Layton
  2025-05-08  1:50         ` Dave Chinner
  1 sibling, 2 replies; 12+ messages in thread
From: Mike Snitzer @ 2025-05-07 21:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, linux-fsdevel, linux-nfs, Chuck Lever,
	Trond Myklebust, Jens Axboe, Chris Mason, Anna Schumaker

Hey Dave,

Thanks for providing your thoughts on all this.  More inlined below.

On Wed, May 07, 2025 at 12:50:20PM +1000, Dave Chinner wrote:
> On Tue, May 06, 2025 at 08:06:51PM -0400, Jeff Layton wrote:
> > On Wed, 2025-05-07 at 08:31 +1000, Dave Chinner wrote:
> > > On Tue, May 06, 2025 at 01:40:35PM -0400, Jeff Layton wrote:
> > > > FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
> > > > patches for nfsd [1]. Those add a module param that make all reads and
> > > > writes use RWF_DONTCACHE.
> > > > 
> > > > I had one host that was running knfsd with an XFS export, and a second
> > > > that was acting as NFS client. Both machines have tons of memory, so
> > > > pagecache utilization is irrelevant for this test.
> > > 
> > > Does RWF_DONTCACHE result in server side STABLE write requests from
> > > the NFS client, or are they still unstable and require a post-write
> > > completion COMMIT operation from the client to trigger server side
> > > writeback before the client can discard the page cache?
> > > 
> > 
> > The latter. I didn't change the client at all here (other than to allow
> > it to do bigger writes on the wire). It's just doing bog-standard
> > buffered I/O. nfsd is adding RWF_DONTCACHE to every write via Mike's
> > patch.
> 
> Ok, that wasn't clear that it was only server side RWF_DONTCACHE.
> 
> I have some more context from a different (internal) discussion
> thread about how poorly the NFSD read side performs with
> RWF_DONTCACHE compared to O_DIRECT. This is because there is massive
> page allocator spin lock contention due to all the concurrent reads
> being serviced.

That discussion started with: it's a very chaotic workload "read a
bunch of large files that cause memory to be oversubscribed 2.5x
across 8 servers".  Many knfsd threads (~240) per server handling 1MB
IO to 8 XFS filesystems on NVMe (so 8 servers, each with 8 NVMe devices).

For others' benefit here is the flamegraph for this heavy
nfsd.nfsd_dontcache=Y read workload as seen on 1 of the 8 servers:
https://original.art/dontcache_read.svg

Dave offered this additional analysis:
"the flame graph indicates massive lock contention in the page
allocator (i.e. on the page free lists). There's a chunk of time in
data copying (copy_page_to_iter), but 70% of the CPU usage looks to be
page allocator spinlock contention."

All this causes RWF_DONTCACHE reads to be considerably slower than
normal buffered reads (reaching only 40-66% of normal buffered read
performance; the worse results occur when the system is less loaded).
How knfsd is handling the IO seems to be contributing to the 100% CPU
usage.  If fio is used (with pvsync2 and uncached=1) directly against
a single XFS filesystem then CPU usage is ~50%.

(Jeff: not following why you were seeing EOPNOTSUPP for RWF_DONTCACHE
reads, is that somehow due to the rsize/wsize patches from Chuck?
RWF_DONTCACHE reads work with my patch you quoted as "[1]").

> The buffered write path locking is different, but I suspect
> something similar is occurring and I'm going to ask you to confirm
> it...

With knfsd to XFS on NVMe, the favorable difference for RWF_DONTCACHE
writes is that, despite also seeing 100% CPU usage due to lock
contention et al, RWF_DONTCACHE does perform 0-54% better than
normal buffered writes that exceed the system's memory by 2.5x
(largest gains seen with the most extreme load).

Without RWF_DONTCACHE the system gets pushed to reclaim and the
associated work really hurts.

As tested with knfsd we've been generally unable to see the
reduced CPU usage that is documented in Jens' commit headers:
  for reads:  https://git.kernel.org/linus/8026e49bff9b
  for writes: https://git.kernel.org/linus/d47c670061b5
But as mentioned above, eliminating knfsd and testing XFS directly
with fio does generally reflect what Jens documented.

So more work needed to address knfsd RWF_DONTCACHE inefficiencies.

> > > > I tested sequential writes using the fio-seq_write.fio test, both with
> > > > and without the module param enabled.
> > > > 
> > > > These numbers are from one run each, but they were pretty stable over
> > > > several runs:
> > > > 
> > > > # fio /usr/share/doc/fio/examples/fio-seq-write.fio
> > > 
> > > $ cat /usr/share/doc/fio/examples/fio-seq-write.fio
> > > cat: /usr/share/doc/fio/examples/fio-seq-write.fio: No such file or directory
> > > $
> > > 
> > > What are the fio control parameters of the IO you are doing? (e.g.
> > > is this single threaded IO, does it use the psync, libaio or iouring
> > > engine, etc)
> > > 
> > 
> > 
> > ; fio-seq-write.job for fiotest
> > 
> > [global]
> > name=fio-seq-write
> > filename=fio-seq-write
> > rw=write
> > bs=256K
> > direct=0
> > numjobs=1
> > time_based
> > runtime=900
> > 
> > [file1]
> > size=10G
> > ioengine=libaio
> > iodepth=16
>
> Ok, so we are doing AIO writes on the client side, so we have ~16
> writes on the wire from the client at any given time.

Jeff's workload is really underwhelming given he is operating well
within available memory (so avoiding reclaim, etc).  As such this test
is really not testing what RWF_DONTCACHE is meant to address (and to
answer Chuck's question of "what do you hope to get from
RWF_DONTCACHE?"): the ability to reach steady state where even if
memory is oversubscribed the network pipes and NVMe devices are as
close to 100% utilization as possible.

> This also means they are likely not being received by the NFS server
> in sequential order, and the NFS server is going to be processing
> roughly 16 write RPCs to the same file concurrently using
> RWF_DONTCACHE IO.
> 
> These are not going to be exactly sequential - the server side IO
> pattern to the filesystem is quasi-sequential, with random IOs being
> out of order and leaving temporary holes in the file until the OO
> write is processed.
> 
> XFS should handle this fine via the speculative preallocation beyond
> EOF that is triggered by extending writes (it was designed to
> mitigate the fragmentation this NFS behaviour causes). However, we
> should always keep in mind that while client side IO is sequential,
> what the server is doing to the underlying filesystem needs to be
> treated as "concurrent IO to a single file" rather than "sequential
> IO".

Hammerspace has definitely seen that 1MB IO coming off the wire is
fragmented by the time XFS issues it to underlying storage; so much
so that IOPS-bound devices (e.g. AWS devices that are capped at ~10K
IOPS) are choking due to all the small IO.
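
To put rough numbers on why fragmentation hurts an IOPS-capped device,
here is a small Python calculation. The ~10K IOPS cap comes from the
paragraph above; the 64 KiB and 16 KiB fragment sizes are assumptions
picked for illustration, and `bw_ceiling_mib_s` is a hypothetical
helper, not something from the patches under discussion.

```python
def bw_ceiling_mib_s(iops_cap, io_size_kib):
    """Upper bound on bandwidth for a device limited to `iops_cap`
    IOs per second when every IO is `io_size_kib` KiB."""
    return iops_cap * io_size_kib / 1024


IOPS_CAP = 10_000  # cap mentioned above

# 1024 KiB is the on-the-wire write size; smaller sizes are assumed
# fragment sizes after the writes have interleaved on the server.
for size_kib in (1024, 64, 16):
    ceiling = bw_ceiling_mib_s(IOPS_CAP, size_kib)
    print(f"{size_kib:5d} KiB IOs -> at most {ceiling:.2f} MiB/s")
```

Once the IOs reaching the device are a small fraction of 1MB, the IOPS
cap rather than the device bandwidth becomes the limit.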

So yeah, minimizing the fragmentation is critical (and largely *not*
solved at this point... hacks like a sync mount on the NFS client or
using O_DIRECT at the client, which sets the sync bit, help reduce the
fragmentation, but as soon as you go full buffered the N=16+ IOs on the
wire will fragment each other).

Do you recommend any particular tuning to help XFS's speculative
preallocation work for many competing "sequential" IO threads?  Like
would having 32 AGs allow for 32 speculative preallocation engines?  Or
is it only possible to split across AGs for different inodes?
(Sorry, I really do aim to get more well-versed with XFS... it's only
been ~17 years that it has featured in IO stacks I've had to
engineer, ugh...).

> > > > wsize=1M:
> > > > 
> > > > Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
> > > > DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
> > > > 
> > > > DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30%
> > > > slower. Memory consumption was down, but these boxes have oodles of
> > > > memory, so I didn't notice much change there.
> > > 
> > > So what is the IO pattern that the NFSD is sending to the underlying
> > > XFS filesystem?
> > > 
> > > Is it sending 1M RWF_DONTCACHE buffered IOs to XFS as well (i.e. one
> > > buffered write IO per NFS client write request), or is DONTCACHE
> > > only being used on the NFS client side?
> > > 
> > 
> > It should be sequential I/O, though the writes would be coming in
> > from different nfsd threads. nfsd just does standard buffered I/O. The
> > WRITE handler calls nfsd_vfs_write(), which calls vfs_write_iter().
> > With the module parameter enabled, it also adds RWF_DONTCACHE.
> 
> Ok, so buffered writes (even with RWF_DONTCACHE) are not processed
> concurrently by XFS - there's an exclusive lock on the inode that
> will be serialising all the buffered write IO.
> 
> Given that most of the work that XFS will be doing during the write
> will not require releasing the CPU, there is a good chance that
> there is spin contention on the i_rwsem from the 15 other write
> waiters.
> 
> That may be a contributing factor to poor performance, so kernel
> profiles from the NFS server for both the normal buffered write path
> and the RWF_DONTCACHE buffered write path would be useful. Having
> some idea of the total CPU usage of the nfsds during the workload
> would also help.
> 
> > DONTCACHE is only being used on the server side. To be clear, the
> > protocol doesn't support that flag (yet), so we have no way to project
> > DONTCACHE from the client to the server (yet). This is just early
> > exploration to see whether DONTCACHE offers any benefit to this
> > workload.
> 
> The nfs client largely aligns all of the page cache based IO, so I'd
> think that O_DIRECT on the server side would be much more performant
> than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
> writes all the way down to the storage.....

Yes.  We really need to add full-blown O_DIRECT support to knfsd.  And
Hammerspace wants me to work on it ASAP.  But I welcome all the help I
can get, I have ideas but look forward to discussing next week at
Bakeathon and/or in this thread...

The first hurdle is coping with the head and/or tail of IO being
misaligned relative to the underlying storage's logical_block_size.
We need to cull off the misaligned head and tail and use RWF_DONTCACHE
for those, with O_DIRECT for the aligned middle.
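
The head/middle/tail split can be sketched in a few lines of Python.
This is a model of the arithmetic only, not the planned knfsd or
LOCALIO code: `split_for_dio` is a hypothetical name, it considers file
offset alignment alone (real O_DIRECT also has memory alignment
requirements), and it treats an IO with no fully aligned block as
entirely "head".

```python
def split_for_dio(offset, length, block_size):
    """Split the byte range [offset, offset + length) into three
    (offset, length) pieces: misaligned head, block-aligned middle,
    misaligned tail. Any piece may have length 0."""
    end = offset + length
    mid_start = -(-offset // block_size) * block_size  # round up
    mid_end = (end // block_size) * block_size         # round down
    if mid_start >= mid_end:
        # No whole block inside the range: nothing suitable for
        # O_DIRECT, so the entire IO goes down the buffered path.
        return (offset, length), (end, 0), (end, 0)
    head = (offset, mid_start - offset)
    middle = (mid_start, mid_end - mid_start)
    tail = (mid_end, end - mid_end)
    return head, middle, tail


# A 1 MiB write starting 1000 bytes into the file, 4096-byte blocks.
head, middle, tail = split_for_dio(1000, 1 << 20, 4096)
print("head   (RWF_DONTCACHE):", head)
print("middle (O_DIRECT):     ", middle)
print("tail   (RWF_DONTCACHE):", tail)
```

The three lengths always sum to the original IO length, so the caller
can issue the pieces independently without losing or duplicating bytes.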

I aim to deal with that for NFS LOCALIO first (NFS client issues
IO direct to XFS, bypassing knfsd) and then reuse it for knfsd's
O_DIRECT support.

> > > > I wonder if we need some heuristic that makes generic_write_sync() only
> > > > kick off writeback immediately if the whole folio is dirty so we have
> > > > more time to gather writes before kicking off writeback?
> > > 
> > > You're doing aligned 1MB IOs - there should be no partially dirty
> > > large folios in either the client or the server page caches.
> > 
> > Interesting. I wonder what accounts for the slowdown with 1M writes? It
> > seems likely to be related to the more aggressive writeback with
> > DONTCACHE enabled, but it'd be good to understand this.
> 
> What I suspect is that block layer IO submission latency has
> increased significantly  with RWF_DONTCACHE and that is slowing down
> the rate at which it can service buffered writes to a single file.
> 
> The difference between normal buffered writes and RWF_DONTCACHE is
> that the write() context will marshall the dirty folios into bios
> and submit them to the block layer (via generic_write_sync()). If
> the underlying device queues are full, then the bio submission will
> be throttled to wait for IO completion.
> 
> At this point, all NFSD write processing to that file stalls. All
> the other nfsds are blocked on the i_rwsem, and that can't be
> released until the holder is released by the block layer throttling.
> Hence any time the underlying device queue fills, nfsd processing of
> incoming writes stalls completely.
> 
> When doing normal buffered writes, this IO submission stalling does
> not occur because there is no direct writeback occurring in the
> write() path.
> 
> Remember the bad old days of balance_dirty_pages() doing dirty
> throttling by submitting dirty pages for IO directly in the write()
> context? And how much better buffered write performance and write()
> submission latency became when we started deferring that IO to the
> writeback threads and waiting on completions?
> 
> We're essentially going back to the bad old days with buffered
> RWF_DONTCACHE writes. Instead of one nicely formed background
> writeback stream that can be throttled at the block layer without
> adversely affecting incoming write throughput, we end up with every
> write() context submitting IO synchronously and being randomly
> throttled by the block layer throttle....
> 
> There are a lot of reasons the current RWF_DONTCACHE implementation
> is sub-optimal for common workloads. This IO spraying and submission
> side throttling problem is one of the reasons why I suggested very
> early on that an async write-behind window (similar in concept to
> async readahead windows) would likely be a much better generic
> solution for RWF_DONTCACHE writes. This would retain the "one nicely
> formed background writeback stream" behaviour that is desirable for
> buffered writes, but still allow rapid reclaim of DONTCACHE folios
> as IO cleans them...

I recall you voicing this concern and nobody really seizing on it.
Could be that Jens is open to changing the RWF_DONTCACHE implementation
if/when more proof of the need is provided?

> > > That said, this is part of the reason I asked about both whether the
> > > client side write is STABLE and whether RWF_DONTCACHE is used on
> > > the server side. i.e. using either of those will trigger writeback
> > > on the server side immediately; in the case of the former it will
> > > also complete before returning to the client and not require a
> > > subsequent COMMIT RPC to wait for server side IO completion...
> > > 
> > 
> > I need to go back and sniff traffic to be sure, but I'm fairly certain
> > the client is issuing regular UNSTABLE writes and following up with a
> > later COMMIT, at least for most of them. The occasional STABLE write
> > might end up getting through, but that should be fairly rare.
> 
> Yeah, I don't think that's an issue given that only the server side
> is using RWF_DONTCACHE. The COMMIT will effectively just be a
> journal and/or device cache flush as all the dirty data has already
> been written back to storage....

FYI, most of Hammerspace RWF_DONTCACHE testing has been using O_DIRECT
for client IO and nfsd.nfsd_dontcache=Y on the server.

Thanks for the interesting discussion!
Mike


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-07 21:50       ` Mike Snitzer
@ 2025-05-08  0:09         ` Jeff Layton
  2025-05-08  2:05           ` Dave Chinner
  2025-05-08  1:50         ` Dave Chinner
  1 sibling, 1 reply; 12+ messages in thread
From: Jeff Layton @ 2025-05-08  0:09 UTC (permalink / raw)
  To: Mike Snitzer, Dave Chinner
  Cc: linux-fsdevel, linux-nfs, Chuck Lever, Trond Myklebust,
	Jens Axboe, Chris Mason, Anna Schumaker

On Wed, 2025-05-07 at 17:50 -0400, Mike Snitzer wrote:
> Hey Dave,
> 
> Thanks for providing your thoughts on all this.  More inlined below.
> 
> On Wed, May 07, 2025 at 12:50:20PM +1000, Dave Chinner wrote:
> > On Tue, May 06, 2025 at 08:06:51PM -0400, Jeff Layton wrote:
> > > On Wed, 2025-05-07 at 08:31 +1000, Dave Chinner wrote:
> > > > On Tue, May 06, 2025 at 01:40:35PM -0400, Jeff Layton wrote:
> > > > > FYI I decided to try and get some numbers with Mike's RWF_DONTCACHE
> > > > > patches for nfsd [1]. Those add a module param that make all reads and
> > > > > writes use RWF_DONTCACHE.
> > > > > 
> > > > > I had one host that was running knfsd with an XFS export, and a second
> > > > > that was acting as NFS client. Both machines have tons of memory, so
> > > > > pagecache utilization is irrelevant for this test.
> > > > 
> > > > Does RWF_DONTCACHE result in server side STABLE write requests from
> > > > the NFS client, or are they still unstable and require a post-write
> > > > completion COMMIT operation from the client to trigger server side
> > > > writeback before the client can discard the page cache?
> > > > 
> > > 
> > > The latter. I didn't change the client at all here (other than to allow
> > > it to do bigger writes on the wire). It's just doing bog-standard
> > > buffered I/O. nfsd is adding RWF_DONTCACHE to every write via Mike's
> > > patch.
> > 
> > Ok, that wasn't clear that it was only server side RWF_DONTCACHE.
> > 
> > I have some more context from a different (internal) discussion
> > thread about how poorly the NFSD read side performs with
> > RWF_DONTCACHE compared to O_DIRECT. This is because there is massive
> > page allocator spin lock contention due to all the concurrent reads
> > being serviced.
> 
> That discussion started with: it's a very chaotic workload "read a
> bunch of large files that cause memory to be oversubscribed 2.5x
> across 8 servers".  Many knfsd threads (~240) per server handling 1MB
> IO to 8 XFS filesystems on NVMe (so 8 servers, each with 8 NVMe devices).
> 
> For others' benefit here is the flamegraph for this heavy
> nfsd.nfsd_dontcache=Y read workload as seen on 1 of the 8 servers:
> https://original.art/dontcache_read.svg
> 
> Dave offered this additional analysis:
> "the flame graph indicates massive lock contention in the page
> allocator (i.e. on the page free lists). There's a chunk of time in
> data copying (copy_page_to_iter), but 70% of the CPU usage looks to be
> page allocator spinlock contention."
> 
> All this causes RWF_DONTCACHE reads to be considerably slower than
> normal buffered reads (reaching only 40-66% of normal buffered read
> performance; the worse results occur when the system is less loaded).
> How knfsd is handling the IO seems to be contributing to the 100% CPU
> usage.  If fio is used (with pvsync2 and uncached=1) directly against
> a single XFS filesystem then CPU usage is ~50%.
> 
> (Jeff: not following why you were seeing EOPNOTSUPP for RWF_DONTCACHE
> reads, is that somehow due to the rsize/wsize patches from Chuck?
> RWF_DONTCACHE reads work with my patch you quoted as "[1]").
> 

Possibly. I'm not sure either. I hit that error on reads with
RWF_DONTCACHE enabled and decided to focus on writes for the moment.
I'll run it down when I get a chance.

> > The buffered write path locking is different, but I suspect
> > something similar is occurring and I'm going to ask you to confirm
> > it...
> 

I started collecting perf traces today, but I'm having trouble getting
meaningful reports out of it. So, I'm working on it, but stay tuned.

> With knfsd to XFS on NVMe, the favorable difference for RWF_DONTCACHE
> writes is that, despite also seeing 100% CPU usage due to lock
> contention et al, RWF_DONTCACHE does perform 0-54% better than
> normal buffered writes that exceed the system's memory by 2.5x
> (largest gains seen with the most extreme load).
> 
> Without RWF_DONTCACHE the system gets pushed to reclaim and the
> associated work really hurts.
> 

That makes total sense. The boxes I've been testing on have gobs of
memory. The system never gets pushed into reclaim. It sounds like I
need to do some testing with small memory sizes (maybe in VMs).

> As tested with knfsd we've been generally unable to see the
> reduced CPU usage that is documented in Jens' commit headers:
>   for reads:  https://git.kernel.org/linus/8026e49bff9b
>   for writes: https://git.kernel.org/linus/d47c670061b5
> But as mentioned above, eliminating knfsd and testing XFS directly
> with fio does generally reflect what Jens documented.
> 
> So more work needed to address knfsd RWF_DONTCACHE inefficiencies.
> 

Agreed.

> > > > > I tested sequential writes using the fio-seq_write.fio test, both with
> > > > > and without the module param enabled.
> > > > > 
> > > > > These numbers are from one run each, but they were pretty stable over
> > > > > several runs:
> > > > > 
> > > > > # fio /usr/share/doc/fio/examples/fio-seq-write.fio
> > > > 
> > > > $ cat /usr/share/doc/fio/examples/fio-seq-write.fio
> > > > cat: /usr/share/doc/fio/examples/fio-seq-write.fio: No such file or directory
> > > > $
> > > > 
> > > > What are the fio control parameters of the IO you are doing? (e.g.
> > > > is this single threaded IO, does it use the psync, libaio or iouring
> > > > engine, etc)
> > > > 
> > > 
> > > 
> > > ; fio-seq-write.job for fiotest
> > > 
> > > [global]
> > > name=fio-seq-write
> > > filename=fio-seq-write
> > > rw=write
> > > bs=256K
> > > direct=0
> > > numjobs=1
> > > time_based
> > > runtime=900
> > > 
> > > [file1]
> > > size=10G
> > > ioengine=libaio
> > > iodepth=16
> > 
> > Ok, so we are doing AIO writes on the client side, so we have ~16
> > writes on the wire from the client at any given time.
> 
> Jeff's workload is really underwhelming given he is operating well
> within available memory (so avoiding reclaim, etc).  As such this test
> is really not testing what RWF_DONTCACHE is meant to address (and to
> answer Chuck's question of "what do you hope to get from
> RWF_DONTCACHE?"): the ability to reach steady state where even if
> memory is oversubscribed the network pipes and NVMe devices are as
> close to 100% utilization as possible.
> 

I'll see about setting up something more memory-constrained on the
server side. That would be more interesting for sure.

> > This also means they are likely not being received by the NFS server
> > in sequential order, and the NFS server is going to be processing
> > roughly 16 write RPCs to the same file concurrently using
> > RWF_DONTCACHE IO.
> > 
> > These are not going to be exactly sequential - the server side IO
> > pattern to the filesystem is quasi-sequential, with random IOs being
> > out of order and leaving temporary holes in the file until the OO
> > write is processed.
> > 
> > XFS should handle this fine via the speculative preallocation beyond
> > EOF that is triggered by extending writes (it was designed to
> > mitigate the fragmentation this NFS behaviour causes). However, we
> > should always keep in mind that while client side IO is sequential,
> > what the server is doing to the underlying filesystem needs to be
> > treated as "concurrent IO to a single file" rather than "sequential
> > IO".
> 
> Hammerspace has definitely seen that 1MB IO coming off the wire is
> fragmented by the time XFS issues it to underlying storage; so much
> so that IOPs bound devices (e.g. AWS devices that are capped at ~10K
> IOPs) are choking due to all the small IO.
> 
> So yeah, minimizing the fragmentation is critical (and largely *not*
> solved at this point... hacks like sync mount from NFS client or using
> O_DIRECT at the client, which sets sync bit, help reduce the
> fragmentation but as soon as you go full buffered the N=16+ IOs on the
> wire will fragment each other).
> 
> Do you recommend any particular tuning to help XFS's speculative
> preallocation work for many competing "sequential" IO threads?  Like
> would having 32 AG allow for 32 speculative preallocation engines?  Or
> is it only possible to split across AG for different inodes?
> (Sorry, I really do aim to get more well-versed with XFS... it's only
> been ~17 years that it has featured in IO stacks I've had to
> engineer, ugh...).
> 
> > > > > wsize=1M:
> > > > > 
> > > > > Normal:      WRITE: bw=1034MiB/s (1084MB/s), 1034MiB/s-1034MiB/s (1084MB/s-1084MB/s), io=910GiB (977GB), run=901326-901326msec
> > > > > DONTCACHE:   WRITE: bw=649MiB/s (681MB/s), 649MiB/s-649MiB/s (681MB/s-681MB/s), io=571GiB (613GB), run=900001-900001msec
> > > > > 
> > > > > DONTCACHE with a 1M wsize vs. recent (v6.14-ish) knfsd was about 30%
> > > > > slower. Memory consumption was down, but these boxes have oodles of
> > > > > memory, so I didn't notice much change there.
> > > > 
> > > > So what is the IO pattern that the NFSD is sending to the underlying
> > > > XFS filesystem?
> > > > 
> > > > Is it sending 1M RWF_DONTCACHE buffered IOs to XFS as well (i.e. one
> > > > buffered write IO per NFS client write request), or is DONTCACHE
> > > > only being used on the NFS client side?
> > > > 
> > > 
> > > It should be sequential I/O, though the writes would be coming in
> > > from different nfsd threads. nfsd just does standard buffered I/O. The
> > > WRITE handler calls nfsd_vfs_write(), which calls vfs_write_iter().
> > > With the module parameter enabled, it also adds RWF_DONTCACHE.
> > 
> > Ok, so buffered writes (even with RWF_DONTCACHE) are not processed
> > concurrently by XFS - there's an exclusive lock on the inode that
> > will be serialising all the buffered write IO.
> > 
> > Given that most of the work that XFS will be doing during the write
> > will not require releasing the CPU, there is a good chance that
> > there is spin contention on the i_rwsem from the 15 other write
> > waiters.
> > 
> > That may be a contributing factor to poor performance, so kernel
> > profiles from the NFS server for both the normal buffered write path
> > and the RWF_DONTCACHE buffered write path would be useful. Having
> > some idea of the total CPU usage of the nfsds during the workload
> > would also be useful.
> > 
> > > DONTCACHE is only being used on the server side. To be clear, the
> > > protocol doesn't support that flag (yet), so we have no way to project
> > > DONTCACHE from the client to the server (yet). This is just early
> > > exploration to see whether DONTCACHE offers any benefit to this
> > > workload.
> > 
> > The nfs client largely aligns all of the page cache based IO, so I'd
> > think that O_DIRECT on the server side would be much more performant
> > than RWF_DONTCACHE. Especially as XFS will do concurrent O_DIRECT
> > writes all the way down to the storage.....
> 
> Yes.  We really need to add full-blown O_DIRECT support to knfsd.  And
> Hammerspace wants me to work on it ASAP.  But I welcome all the help I
> can get, I have ideas but look forward to discussing next week at
> Bakeathon and/or in this thread...
> 
> The first hurdle is coping with the head and/or tail of IO being
> misaligned relative to the underlying storage's logical_block_size.
> Need to cull off misaligned IO and use RWF_DONTCACHE for those, while
> using O_DIRECT for the aligned middle.
>
> I aim to deal with that for NFS LOCALIO first (NFS client issues
> IO direct to XFS, bypassing knfsd) and then reuse it for knfsd's
> O_DIRECT support.
>

I'll be interested to hear your thoughts on this!
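
For anyone following along, the head/middle/tail split described above
can be sketched in a few lines. This is only an illustration in Python,
not knfsd or LOCALIO code: `split_for_dio` and its parameters are
hypothetical names, and it only considers file-offset alignment against
the logical block size (buffer memory alignment is a separate O_DIRECT
requirement that is ignored here).

```python
def split_for_dio(offset, length, block_size):
    """Split the byte range [offset, offset + length) into three
    (offset, length) pieces: a misaligned head, a block-aligned middle
    and a misaligned tail. Hypothetical helper for illustration only.

    The head and tail would go through the buffered (RWF_DONTCACHE)
    path, the middle through O_DIRECT. Any piece may have length 0.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a positive power of two")

    end = offset + length
    mask = ~(block_size - 1)
    mid_start = (offset + block_size - 1) & mask  # round start up
    mid_end = end & mask                          # round end down

    if mid_start >= mid_end:
        # No whole aligned block is covered: everything stays buffered.
        return (offset, length), (end, 0), (end, 0)

    head = (offset, mid_start - offset)
    middle = (mid_start, mid_end - mid_start)
    tail = (mid_end, end - mid_end)
    return head, middle, tail
```

With an aligned 1MB write and a 4K logical block size, the head and
tail come back empty and the whole IO is eligible for O_DIRECT, which
matches the expectation that the NFS client's IO is mostly aligned.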

> > > > > I wonder if we need some heuristic that makes generic_write_sync() only
> > > > > kick off writeback immediately if the whole folio is dirty so we have
> > > > > more time to gather writes before kicking off writeback?
> > > > 
> > > > You're doing aligned 1MB IOs - there should be no partially dirty
> > > > large folios in either the client or the server page caches.
> > > 
> > > Interesting. I wonder what accounts for the slowdown with 1M writes? It
> > > seems likely to be related to the more aggressive writeback with
> > > DONTCACHE enabled, but it'd be good to understand this.
> > 
> > What I suspect is that block layer IO submission latency has
> > increased significantly with RWF_DONTCACHE and that is slowing down
> > the rate at which it can service buffered writes to a single file.
> > 
> > The difference between normal buffered writes and RWF_DONTCACHE is
> > that the write() context will marshall the dirty folios into bios
> > and submit them to the block layer (via generic_write_sync()). If
> > the underlying device queues are full, then the bio submission will
> > be throttled to wait for IO completion.
> > 
> > At this point, all NFSD write processing to that file stalls. All
> > the other nfsds are blocked on the i_rwsem, and that can't be
> > released until the holder is released by the block layer throttling.
> > Hence any time the underlying device queue fills, nfsd processing of
> > incoming writes stalls completely.
> > 
> > When doing normal buffered writes, this IO submission stalling does
> > not occur because there is no direct writeback occurring in the
> > write() path.
> > 
> > Remember the bad old days of balance_dirty_pages() doing dirty
> > throttling by submitting dirty pages for IO directly in the write()
> > context? And how much better buffered write performance and write()
> > submission latency became when we started deferring that IO to the
> > writeback threads and waiting on completions?
> > 
> > We're essentially going back to the bad old days with buffered
> > RWF_DONTCACHE writes. Instead of one nicely formed background
> > writeback stream that can be throttled at the block layer without
> > adversely affecting incoming write throughput, we end up with every
> > write() context submitting IO synchronously and being randomly
> > throttled by the block layer throttle....
> > 
> > There are a lot of reasons the current RWF_DONTCACHE implementation
> > is sub-optimal for common workloads. This IO spraying and submission
> > side throttling problem is one of the reasons why I suggested very
> > early on that an async write-behind window (similar in concept to
> > async readahead windows) would likely be a much better generic
> > solution for RWF_DONTCACHE writes. This would retain the "one nicely
> > formed background writeback stream" behaviour that is desirable for
> > buffered writes, but still allow rapid reclaim of DONTCACHE folios
> > as IO cleans them...
> 
> I recall you voicing this concern and nobody really seizing on it.
> Could be that Jens is open to changing the RWF_DONTCACHE implementation
> if/when more proof is made for the need?
> 

It does seem like using RWF_DONTCACHE currently leads to a lot of
fragmented I/O. I suspect that doing filemap_fdatawrite_range_kick()
after every DONTCACHE write is the main problem on the write side. We
probably need to come up with a way to make it flush more optimally
when there are streaming DONTCACHE writes.

An async writebehind window could be a solution. How would we implement
that? Some sort of delay before we kick off writeback (and hopefully
for larger ranges)?

> > > > That said, this is part of the reason I asked about both whether the
> > > > client side write is STABLE and whether RWF_DONTCACHE is used on
> > > > the server side. i.e. using either of those will trigger writeback
> > > > on the server side immediately; in the case of the former it will
> > > > also complete before returning to the client and not require a
> > > > subsequent COMMIT RPC to wait for server side IO completion...
> > > > 
> > > 
> > > I need to go back and sniff traffic to be sure, but I'm fairly certain
> > > the client is issuing regular UNSTABLE writes and following up with a
> > > later COMMIT, at least for most of them. The occasional STABLE write
> > > might end up getting through, but that should be fairly rare.
> > 
> > Yeah, I don't think that's an issue given that only the server side
> > is using RWF_DONTCACHE. The COMMIT will effectively just be a
> > journal and/or device cache flush as all the dirty data has already
> > been written back to storage....
> 
> FYI, most of Hammerspace RWF_DONTCACHE testing has been using O_DIRECT
> for client IO and nfsd.nfsd_dontcache=Y on the server.

Good to know. I'll switch my testing to O_DIRECT as well. The client-
side pagecache isn't adding any benefit to this.
--
Jeff Layton <jlayton@kernel.org>


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-08  0:09         ` Jeff Layton
@ 2025-05-08  2:05           ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2025-05-08  2:05 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Mike Snitzer, linux-fsdevel, linux-nfs, Chuck Lever,
	Trond Myklebust, Jens Axboe, Chris Mason, Anna Schumaker

On Wed, May 07, 2025 at 08:09:33PM -0400, Jeff Layton wrote:
> On Wed, 2025-05-07 at 17:50 -0400, Mike Snitzer wrote:
> > Hey Dave,
> > 
> > Thanks for providing your thoughts on all this.  More inlined below.
> > 
> > On Wed, May 07, 2025 at 12:50:20PM +1000, Dave Chinner wrote:
> > > Remember the bad old days of balance_dirty_pages() doing dirty
> > > throttling by submitting dirty pages for IO directly in the write()
> > > context? And how much better buffered write performance and write()
> > > submission latency became when we started deferring that IO to the
> > > writeback threads and waiting on completions?
> > > 
> > > We're essentially going back to the bad old days with buffered
> > > RWF_DONTCACHE writes. Instead of one nicely formed background
> > > writeback stream that can be throttled at the block layer without
> > > adversely affecting incoming write throughput, we end up with every
> > > write() context submitting IO synchronously and being randomly
> > > throttled by the block layer throttle....
> > > 
> > > There are a lot of reasons the current RWF_DONTCACHE implementation
> > > is sub-optimal for common workloads. This IO spraying and submission
> > > side throttling problem is one of the reasons why I suggested very
> > > early on that an async write-behind window (similar in concept to
> > > async readahead windows) would likely be a much better generic
> > > solution for RWF_DONTCACHE writes. This would retain the "one nicely
> > > formed background writeback stream" behaviour that is desirable for
> > > buffered writes, but still allow rapid reclaim of DONTCACHE folios
> > > as IO cleans them...
> > 
> > I recall you voicing this concern and nobody really seizing on it.
> > Could be that Jens is open to changing the RWF_DONTCACHE implementation
> > if/when more proof is made for the need?
> 
> It does seem like using RWF_DONTCACHE currently leads to a lot of
> fragmented I/O. I suspect that doing filemap_fdatawrite_range_kick()
> after every DONTCACHE write is the main problem on the write side. We
> probably need to come up with a way to make it flush more optimally
> when there are streaming DONTCACHE writes.
> 
> An async writebehind window could be a solution. How would we implement
> that? Some sort of delay before we kick off writeback (and hopefully
> for larger ranges)?

My thoughts on this are as follows...

When we mark the inode dirty, we currently put it on the list of
dirty inodes for writeback. We could change how we mark an inode
dirty for RWF_DONTCACHE writes to say "dirty for write-through" and
put it on a new write-through inode list.  Then we can kick an expedited
write-through worker thread that writes back all the dirty
write-through inodes on its list.

In this case, a delay of a few milliseconds is probably long enough
to allow decent write-through IO sizes to build up without
causing excessive page cache memory usage for dirty DONTCACHE
folios...

The other thing that this could allow is throttling incoming
RWF_DONTCACHE IOs in balance_dirty_pages_ratelimited. e.g. if more
than 16MB of DONTCACHE folios are built up on a BDI, kick the
write-through worker and wait for the DONTCACHE folio count to drop.
This then gives some control (and potential admin control) of how
much dirty page cache is allowed to accrue for DONTCACHE write
IOs...
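
To make the idea a bit more concrete, here is a toy user-space model of
it. This is a sketch in Python, not kernel code: the class and function
names are made up for illustration, the 2ms default delay is just one
reading of "a few milliseconds", and the synchronous flush() stands in
for "kick the write-through worker and wait". Only the 16MB threshold
comes from the text above.

```python
def merge_extents(extents):
    """Coalesce (offset, length) extents that touch or overlap."""
    merged = []
    for off, length in sorted(extents):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return merged


class WriteThroughQueue:
    """Toy model of a per-BDI write-through list (hypothetical)."""

    def __init__(self, delay_ms=2, throttle_bytes=16 * 1024 * 1024):
        self.delay_ms = delay_ms
        self.throttle_bytes = throttle_bytes
        self.dirty = {}        # inode -> list of (offset, length)
        self.dirty_bytes = 0   # simple sum; overlapping writes count twice
        self.deadline = None   # time at which the worker should run
        self.submitted = []    # (inode, offset, length) IOs issued

    def write(self, now_ms, inode, offset, length):
        """Record a DONTCACHE write. Returns True if the writer was
        throttled, i.e. it had to wait for a flush."""
        self.dirty.setdefault(inode, []).append((offset, length))
        self.dirty_bytes += length
        if self.deadline is None:
            self.deadline = now_ms + self.delay_ms
        if self.dirty_bytes > self.throttle_bytes:
            self.flush()
            return True
        return False

    def tick(self, now_ms):
        """Run the worker if the write-behind delay has expired."""
        if self.deadline is not None and now_ms >= self.deadline:
            self.flush()

    def flush(self):
        """Write back every dirty inode as merged, larger IOs."""
        for inode, extents in self.dirty.items():
            for off, length in merge_extents(extents):
                self.submitted.append((inode, off, length))
        self.dirty.clear()
        self.dirty_bytes = 0
        self.deadline = None
```

The point of the model is that out-of-order 1MB writes landing inside
the delay window leave the worker as one large IO per inode, instead of
each write() context submitting its own small IO.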

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: performance r nfsd with RWF_DONTCACHE and larger wsizes
  2025-05-07 21:50       ` Mike Snitzer
  2025-05-08  0:09         ` Jeff Layton
@ 2025-05-08  1:50         ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2025-05-08  1:50 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Layton, linux-fsdevel, linux-nfs, Chuck Lever,
	Trond Myklebust, Jens Axboe, Chris Mason, Anna Schumaker

On Wed, May 07, 2025 at 05:50:14PM -0400, Mike Snitzer wrote:
> Hey Dave,
> 
> Thanks for providing your thoughts on all this.  More inlined below.
> 
> On Wed, May 07, 2025 at 12:50:20PM +1000, Dave Chinner wrote:
> > On Tue, May 06, 2025 at 08:06:51PM -0400, Jeff Layton wrote:
> > > On Wed, 2025-05-07 at 08:31 +1000, Dave Chinner wrote:
> > > > What are the fio control parameters of the IO you are doing? (e.g.
> > > > is this single threaded IO, does it use the psync, libaio or iouring
> > > > engine, etc)
> > > > 
> > > 
> > > 
> > > ; fio-seq-write.job for fiotest
> > > 
> > > [global]
> > > name=fio-seq-write
> > > filename=fio-seq-write
> > > rw=write
> > > bs=256K
> > > direct=0
> > > numjobs=1
> > > time_based
> > > runtime=900
> > > 
> > > [file1]
> > > size=10G
> > > ioengine=libaio
> > > iodepth=16
> >
> > Ok, so we are doing AIO writes on the client side, so we have ~16
> > writes on the wire from the client at any given time.
> 
> Jeff's workload is really underwhelming given he is operating well
> within available memory (so avoiding reclaim, etc).  As such this test
> is really not testing what RWF_DONTCACHE is meant to address (and to
> answer Chuck's question of "what do you hope to get from
> RWF_DONTCACHE?"): the ability to reach steady state where even if
> memory is oversubscribed the network pipes and NVMe devices are as
> close to 100% utilization as possible.

Right.

However, one of the things that has to be kept in mind is that we
don't have 100% of the CPU dedicated to servicing RWF_DONTCACHE IO
like the fio microbenchmarks have.

Applications are going to take a chunk of CPU time to
create/marshall/process the data that we are doing IO on, so any
time we spend on doing IO is less time that the applications have to
do their work. If you can saturate the storage without saturating
CPUs, then RWF_DONTCACHE should allow that steady state to be
maintained indefinitely.

However, RWF_DONTCACHE does not remove the data copy overhead of
buffered IO, whilst it adds IO submission overhead to each IO. Hence
it will require more CPU time to saturate the storage devices than
normal buffered IO. If you've got CPU to spare, great. If you don't,
then overall performance will be reduced.

> > This also means they are likely not being received by the NFS server
> > in sequential order, and the NFS server is going to be processing
> > roughly 16 write RPCs to the same file concurrently using
> > RWF_DONTCACHE IO.
> > 
> > These are not going to be exactly sequential - the server side IO
> > pattern to the filesystem is quasi-sequential, with random IOs being
> > out of order and leaving temporary holes in the file until the OO
> > write is processed.
> > 
> > XFS should handle this fine via the speculative preallocation beyond
> > EOF that is triggered by extending writes (it was designed to
> > mitigate the fragmentation this NFS behaviour causes). However, we
> > should always keep in mind that while client side IO is sequential,
> > what the server is doing to the underlying filesystem needs to be
> > treated as "concurrent IO to a single file" rather than "sequential
> > IO".
> 
> Hammerspace has definitely seen that 1MB IO coming off the wire is
> fragmented by the time XFS issues it to underlying storage; so much
> so that IOPs bound devices (e.g. AWS devices that are capped at ~10K
> IOPs) are choking due to all the small IO.

That should not happen in the general case. Can you start a separate
thread to triage the issue so we can try to understand why that is
happening?

> So yeah, minimizing the fragmentation is critical (and largely *not*
> solved at this point... hacks like sync mount from NFS client or using
> O_DIRECT at the client, which sets sync bit, help reduce the
> fragmentation but as soon as you go full buffered the N=16+ IOs on the
> wire will fragment each other).

Fragmentation mitigation for NFS server IO is generally only
addressable at the filesystem level - it's not really something you
can mitigate at the NFS server or client.

> Do you recommend any particular tuning to help XFS's speculative
> preallocation work for many competing "sequential" IO threads?

I can suggest lots of things, but without knowing the IO pattern,
the fragmentation pattern, the filesystem state, what triggers the
fragmentation, etc, I'd just be guessing as to which knob might make
the problem go away (hence the request to separate that out).

-Dave.
-- 
Dave Chinner
david@fromorbit.com



Thread overview: 12+ messages
2025-05-06 17:40 performance r nfsd with RWF_DONTCACHE and larger wsizes Jeff Layton
2025-05-06 18:16 ` Chuck Lever
2025-05-06 18:30   ` Jeff Layton
2025-05-06 22:31 ` Dave Chinner
2025-05-07  0:06   ` Jeff Layton
2025-05-07  2:50     ` Dave Chinner
2025-05-07 13:43       ` Chuck Lever
2025-05-08  1:13         ` Dave Chinner
2025-05-07 21:50       ` Mike Snitzer
2025-05-08  0:09         ` Jeff Layton
2025-05-08  2:05           ` Dave Chinner
2025-05-08  1:50         ` Dave Chinner
