Linux NFS development
* Async client v4 mount results in unexpected number of extents on the server
@ 2025-04-10 12:55 Holger Hoffstätte
  2025-04-11 13:29 ` Benjamin Coddington
From: Holger Hoffstätte @ 2025-04-10 12:55 UTC (permalink / raw)
  To: linux-nfs

Story time!

For quite some time now (at least two years) I've noticed that
client NFS I/O generates "weird" extent sizes for large files on
my NFS exports (async, XFS, plenty of unfragmented space, 10GbE).
It happens from several machines, with both 1GbE and 10GbE NICs,
all running up-to-date userland and kernels.

While I normally use autofs, the following demonstrates the issue
with a regular mount to my server:

$mount.nfs4 -o async tux:/var/cache/distfiles /var/cache/distfiles
$mount | grep distfiles
tux:/var/cache/distfiles on /var/cache/distfiles type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,
proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.100.223,
local_lock=none,addr=192.168.100.222)

Now we do a buffered write to the mount point using rsync.
This also happens e.g. with a GUI file manager or coreutils' cp,
but I currently have the latter patched with forced fallocate for
large files, so rsync it is:

$rsync -av --progress /tmp/linux-6.14.1.tar /var/cache/distfiles/
sending incremental file list
linux-6.14.1.tar
   1,561,466,880 100%    1.73GB/s    0:00:00 (xfr#1, to-chk=0/1)

sent 1,561,684,835 bytes  received 35 bytes  446,195,677.14 bytes/sec
total size is 1,561,466,880  speedup is 1.00

So a nice and fast async write, minus the expected delay on close.
This results (repeatably) in something like the following, after
writeback has settled on the server:

$ssh tux "filefrag -ek /var/cache/distfiles/linux-6.14.1.tar"
Filesystem type is: 58465342
File size of /var/cache/distfiles/linux-6.14.1.tar is 1561466880 (1524872 blocks of 1024 bytes)
  ext:     logical_offset:        physical_offset: length:   expected: flags:
    0:        0..    4031:  504215344.. 504219375:   4032:
    1:     4032..   36863:  512042080.. 512074911:  32832:  504219376:
    2:    36864..   53183:  511952912.. 511969231:  16320:  512074912:
    3:    53184..   69567:  511991704.. 512008087:  16384:  511969232:
    4:    69568..  292863:  516378400.. 516601695: 223296:  512008088:
    5:   292864..  331775:  516679456.. 516718367:  38912:  516601696:
    6:   331776..  462783:  515329204.. 515460211: 131008:  516718368:
    7:   462784..  525311:  515541100.. 515603627:  62528:  515460212:
    8:   525312..  552959:  515460212.. 515487859:  27648:  515603628:
    9:   552960..  683967:  515763228.. 515894235: 131008:  515487860:
   10:   683968..  782271:  516280096.. 516378399:  98304:  515894236:
   11:   782272..  806911:  516980448.. 517005087:  24640:  516378400:
   12:   806912..  884671:  516601696.. 516679455:  77760:  517005088:
   13:   884672.. 1048575:  517037124.. 517201027: 163904:  516679456:
   14:  1048576.. 1310655:  516718368.. 516980447: 262080:  517201028:
   15:  1310656.. 1524871:  517201028.. 517415243: 214216:  516980448: last,eof
/var/cache/distfiles/linux-6.14.1.tar: 16 extents found

The target filesystem is completely unfragmented and there is no good
reason why XFS's buffered I/O should result in this roller-coaster layout.
The extent sizes obviously vary from run to run, but way too many times
the first few extents are small - ~4k blocks (≈4 MB), as in extent 0 above.
Sometimes they then quickly double in size (as XFS's speculative
preallocation does), sometimes they do not. There is no memory pressure
anywhere - many GBs of RAM free, etc.
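
For comparison, preallocating the target (which is effectively what my
patched cp does) avoids the problem for me; a minimal sketch - and since
the mount is v4.2, fallocate should map to ALLOCATE over the wire:

$fallocate -l 1561466880 /var/cache/distfiles/linux-6.14.1.tar
$dd if=/tmp/linux-6.14.1.tar of=/var/cache/distfiles/linux-6.14.1.tar bs=1M conv=notrunc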

Increasing or reducing writeback sysctls on either the client or server
has accomplished exactly nothing useful, other than varying the timing.
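
For the record, the writeback knobs I mean are along these lines (the
values here are just examples of the ranges I tried):

$sysctl -w vm.dirty_writeback_centisecs=3000
$sysctl -w vm.dirty_expire_centisecs=6000
$sysctl -w vm.dirty_background_bytes=536870912
$sysctl -w vm.dirty_bytes=2147483648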

None of this happens when the client explicitly mounts in sync mode;
the resulting file is always a nice contiguous extent, despite the fact
that the export is in async mode and does buffered I/O. So the symptom
seems to be caused by the client when collecting pages for writeback.
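
For reference, the sync variant of the mount is simply:

$mount.nfs4 -o sync tux:/var/cache/distfiles /var/cache/distfiles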

Unfortunately sync writes have very poor performance: ~55 MB/s on my
laptop (1GbE), ~330 MB/s on my workstation (10GbE). Sad.

Now I'm trying to understand why client writeback results in these
irregular extents. I *think* all this started ~2 years back, but I do
not remember a particular kernel release (xarray? maple tree?).
I only noticed it after the fact, when I copied files with a GUI and
always ended up with more extents than before.

Does this behaviour seem familiar to anybody?

I realize this is "not a bug" (all data is safe and sound etc.) but
somehow it seems that various layers are not working together as one
might expect. It's possible that my expectation is wrong. :)

Thanks for reading!

Holger


* Re: Async client v4 mount results in unexpected number of extents on the server
  2025-04-10 12:55 Async client v4 mount results in unexpected number of extents on the server Holger Hoffstätte
@ 2025-04-11 13:29 ` Benjamin Coddington
  2025-04-12 12:36   ` Holger Hoffstätte
From: Benjamin Coddington @ 2025-04-11 13:29 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-nfs

On 10 Apr 2025, at 8:55, Holger Hoffstätte wrote:
> ...
> Does this behaviour seem familiar to anybody?
>
> I realize this is "not a bug" (all data is safe and sound etc.) but
> somehow it seems that various layers are not working together as one
> might expect. It's possible that my expectation is wrong. :)

My first impression is that the writes are being processed out-of-order on
the server, so XFS is using a range of allocation sizes while growing the
file extents.

Ben



* Re: Async client v4 mount results in unexpected number of extents on the server
  2025-04-11 13:29 ` Benjamin Coddington
@ 2025-04-12 12:36   ` Holger Hoffstätte
  2025-04-12 17:28     ` Benjamin Coddington
From: Holger Hoffstätte @ 2025-04-12 12:36 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: linux-nfs

On 2025-04-11 15:29, Benjamin Coddington wrote:
> On 10 Apr 2025, at 8:55, Holger Hoffstätte wrote:
>> ...
>> Does this behaviour seem familiar to anybody?
>>
>> I realize this is "not a bug" (all data is safe and sound etc.) but
>> somehow it seems that various layers are not working together as one
>> might expect. It's possible that my expectation is wrong. :)
> 
> My first impression is that the writes are being processed out-of-order on
> the server, so XFS is using a range of allocation sizes while growing the
> file extents.

Thanks for reading!

Yes, that's true, but the server is not the cause of the reordering - it
just does what it's told. The reordering has to come from the client in
async mode.

Compare the following packet traces, first with a sync mount on the client:

holger>tshark -r client-sync.pcap | grep Offset | head -n 20
   148 6.349609085 192.168.100.221 → 192.168.100.222 NFS 1826 V4 Call WRITE StateID: 0x7cf4 Offset: 0 Len: 262144
   223 6.351592937 192.168.100.221 → 192.168.100.222 NFS 38026 V4 Call WRITE StateID: 0x7cf4 Offset: 262144 Len: 262144
   329 6.354313478 192.168.100.221 → 192.168.100.222 NFS 24994 V4 Call WRITE StateID: 0x7cf4 Offset: 524288 Len: 262144
   452 6.357495813 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 786432 Len: 262144
   534 6.359567456 192.168.100.221 → 192.168.100.222 NFS 38026 V4 Call WRITE StateID: 0x7cf4 Offset: 1048576 Len: 262144
   638 6.362138118 192.168.100.221 → 192.168.100.222 NFS 26442 V4 Call WRITE StateID: 0x7cf4 Offset: 1310720 Len: 262144
   762 6.365241741 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 1572864 Len: 262144
   866 6.367883641 192.168.100.221 → 192.168.100.222 NFS 1826 V4 Call WRITE StateID: 0x7cf4 Offset: 1835008 Len: 262144
   962 6.370450322 192.168.100.221 → 192.168.100.222 NFS 3274 V4 Call WRITE StateID: 0x7cf4 Offset: 2097152 Len: 262144
  1035 6.372527063 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 2359296 Len: 262144
  1132 6.375268418 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 2621440 Len: 262144
  1229 6.377913391 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 2883584 Len: 262144
  1326 6.380523792 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 3145728 Len: 262144
  1423 6.383116733 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 3407872 Len: 262144
  1519 6.385729789 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 3670016 Len: 262144
  1616 6.388442438 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 3932160 Len: 262144
  1713 6.391023297 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 4194304 Len: 262144
  1809 6.393621476 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 4456448 Len: 262144
  1906 6.396189134 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 4718592 Len: 262144
  2004 6.398795974 192.168.100.221 → 192.168.100.222 NFS 13410 V4 Call WRITE StateID: 0x7cf4 Offset: 4980736 Len: 262144

The offset is increasing linearly.

Now with the regular async mount:

holger>tshark -r client-async.pcap | grep Offset | head -n 20
   349 1.939181198 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 0 Len: 1048576
   713 1.948199991 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 1048576 Len: 1048576
  1078 1.957190498 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 2097152 Len: 1048576
  1438 1.966163614 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 3145728 Len: 1048576
  1781 1.974626188 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 528482304 Len: 1048576
  2140 1.983642048 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 4194304 Len: 1048576
  2499 1.992622009 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 529530880 Len: 1048576
  2862 2.001627951 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 5242880 Len: 1048576
  3224 2.010608959 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 530579456 Len: 1048576
  3584 2.019595764 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 7340032 Len: 1048576
  3951 2.028614348 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 531628032 Len: 1048576
  4290 2.037056179 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 6291456 Len: 1048576
  4656 2.046079372 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 532676608 Len: 1048576
  5029 2.055044526 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 8388608 Len: 1048576
  5399 2.064067160 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 533725184 Len: 1048576
  5772 2.073040626 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 9437184 Len: 1048576
  6142 2.082045031 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 534773760 Len: 1048576
  6497 2.090504113 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 10485760 Len: 1048576
  6868 2.099502093 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 535822336 Len: 1048576
  7229 2.108506848 192.168.100.221 → 192.168.100.222 NFS 62330 V4 Call WRITE StateID: 0x9abe Offset: 11534336 Len: 1048576

The offset is all over the place, with a write at offset 528482304
(~504 MiB) right after the first four sequential ones.
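
A quick way to quantify the reordering (just a sketch, counting backward
jumps in the Offset column of the trace above):

holger>tshark -r client-async.pcap | grep 'Call WRITE' | \
  awk '{for (i=1; i<NF; i++) if ($i == "Offset:") off = $(i+1) + 0;
        if (off < prev) n++; prev = off} END {print n+0, "backward jumps"}'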

So far I have tried:
- disabling the write congestion handling (client side)
- increasing dirty writeback thresholds to unreasonable levels: >30 seconds,
several GBs

Except for bloating writeback none of this really made a difference.
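
To be concrete, by "disabling the write congestion handling" I mean
something along these lines - a huge value effectively turns the
client-side throttle off:

$sysctl -w fs.nfs.nfs_congestion_kb=10485760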

I've been poking around in fs/nfs/write.c and fs/nfs/pagelist.c but
unfortunately did not get very far. Looking at the async case above it
*looks* as if multiple writebacks are initiated and interleaved, or at
least their page ranges are. That's with vm.dirty_background_bytes=536870912
(512 MB), and the second stream starting near that mark (528482304 ≈ 504 MiB)
could be a coincidence.
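
To convince myself that arrival order alone can produce such a layout, a
local experiment helps (scratch XFS path assumed, and this is not an exact
model of the NFS traffic): forcing allocation after each out-of-order write
with fsync typically ends up with discontiguous extents, too:

$xfs_io -f -c "pwrite 0 4m" -c fsync \
        -c "pwrite 504m 4m" -c fsync \
        -c "pwrite 4m 4m" -c fsync /mnt/scratch/ooo.bin
$filefrag -ek /mnt/scratch/ooo.bin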

If it helps I can make the client packet traces available for download;
I didn't see anything interesting or unusual in there. The server traces
match the client traces 1:1 with respect to Offset handling, so they are
not really interesting.

cheers
Holger


* Re: Async client v4 mount results in unexpected number of extents on the server
  2025-04-12 12:36   ` Holger Hoffstätte
@ 2025-04-12 17:28     ` Benjamin Coddington
  2025-04-12 19:19       ` Holger Hoffstätte
From: Benjamin Coddington @ 2025-04-12 17:28 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-nfs

On 12 Apr 2025, at 8:36, Holger Hoffstätte wrote:

> On 2025-04-11 15:29, Benjamin Coddington wrote:
>> On 10 Apr 2025, at 8:55, Holger Hoffstätte wrote:
>>> ...
>>> Does this behaviour seem familiar to anybody?
>>>
>>> I realize this is "not a bug" (all data is safe and sound etc.) but
>>> somehow it seems that various layers are not working together as one
>>> might expect. It's possible that my expectation is wrong. :)
>>
>> My first impression is that the writes are being processed out-of-order on
>> the server, so XFS is using a range of allocation sizes while growing the
>> file extents.
>
> Thanks for reading!
>
> Yes, that's true but not the cause of the re-ordering - the server just
> does what it's being told. The reordering has to come from the client in
> async mode.

There isn't any guarantee of write processing order for async writes, is
there?  I also don't think there's any practical impact.  So I'm wondering
what the expected behavior is, and what problem you're trying to fix?

Ben



* Re: Async client v4 mount results in unexpected number of extents on the server
  2025-04-12 17:28     ` Benjamin Coddington
@ 2025-04-12 19:19       ` Holger Hoffstätte
From: Holger Hoffstätte @ 2025-04-12 19:19 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: linux-nfs

On 2025-04-12 19:28, Benjamin Coddington wrote:
> There isn't any guarantee of write processing order for async writes, is
> there?  I also don't think there's any practical impact.  So I'm wondering
> what's the expectation of behavior and what problem you're trying to fix?

This definitely does not happen with regular local page writeback.
As I said before, my expectation that NFS behaves the same way might be
overly pedantic, but then again I'm still curious why this happens and whether
it's a bug or just accidental behaviour. *shrug*
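
For reference, the local control test is simply (target path assumed):

$dd if=/tmp/linux-6.14.1.tar of=/mnt/scratch/test.tar bs=1M
$sync
$filefrag -ek /mnt/scratch/test.tar

which here comes back with one or two nicely contiguous extents.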
Anyway, thanks for taking a look.

Holger

