public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* I/O stalls when merging qcow2 snapshots on nfs
@ 2024-05-05 11:29 Thomas Glanzmann
  2024-05-06 11:25 ` Benjamin Coddington
  2024-05-06 13:47 ` Trond Myklebust
  0 siblings, 2 replies; 4+ messages in thread
From: Thomas Glanzmann @ 2024-05-05 11:29 UTC (permalink / raw)
  To: kvm, linux-nfs

Hello,
I often take snapshots in order to move kvm VMs from one nfs share to
another while they're running, or to take backups. Sometimes I have very
large VMs (1.1 TB) which take a very long time (40 minutes - 2 hours) to
back up or move. They also write between 20 - 60 GB of data while being
backed up or moved. Once the backup or move is done, the dirty snapshot
data needs to be merged into the parent disk. While doing this I often
experience I/O stalls within the VMs in the range of 1 - 20 seconds,
sometimes worse. But I have some very latency-sensitive VMs which crash
or misbehave after 15-second I/O stalls. So I would like to know if there
is some tuning I can do to make these I/O stalls shorter.

- I already tried setting vm.dirty_expire_centisecs=100, which appears to
  make it better, but not under 15 seconds. Ideally, I/O stalls would be
  no longer than 1 second.
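
For reference, that sysctl is applied and persisted like this (the
sysctl.d filename is arbitrary, just the usual convention):

```shell
# 100 centisecs: dirty data becomes eligible for writeback after 1 second.
sysctl -w vm.dirty_expire_centisecs=100
# Persist across reboots.
echo 'vm.dirty_expire_centisecs = 100' > /etc/sysctl.d/90-dirty-expire.conf
```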

This is how you can reproduce the issue:

- NFS Server:
mkdir /ssd
apt install -y nfs-kernel-server
echo '/ssd 0.0.0.0/0.0.0.0(rw,no_root_squash,no_subtree_check,sync)' > /etc/exports
exportfs -ra

- NFS Client / KVM Host:
mount server:/ssd /mnt
# Put a VM on /mnt and start it.
# Create a snapshot:
virsh snapshot-create-as --domain testy guest-state1 --diskspec vda,file=/mnt/overlay.qcow2 --disk-only --atomic --no-metadata

- In the VM:

# Write some data (in my case 6 GB of data are written in 60 seconds due
# to the nfs client being connected with a 1 Gbit/s link)
fio --ioengine=libaio --filesize=32G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=write --blocksize=1m --iodepth=1 --readwrite=write --unlink=1
# Do some synchronous I/O
while true; do date | tee -a date.log; sync; sleep 1; done
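
The stall lists further down can be reproduced from date.log with a small
post-processing sketch (assumes GNU date for `-d`; with the loop's 1-second
cadence, any gap of 2 seconds or more between lines counts as a stall):

```shell
# Turn consecutive date.log timestamps into stall lengths.
stalls() {
    prev=""
    while IFS= read -r line; do
        # Drop trailing annotations such as "< here I started virsh blockcommit".
        ts=${line%% <*}
        cur=$(date -d "$ts" +%s 2>/dev/null) || continue
        if [ -n "$prev" ] && [ $((cur - prev)) -ge 2 ]; then
            echo "stall: $((cur - prev)) seconds"
        fi
        prev=$cur
    done
}
if [ -f date.log ]; then stalls < date.log; fi
```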

- On the NFS Client / KVM host:
# Merge the snapshot into the parent disk
time virsh blockcommit testy vda --active --pivot --delete

Successfully pivoted

real    1m4.666s
user    0m0.017s
sys     0m0.007s

I exported the nfs share with sync on purpose because I often use drbd
in sync mode (protocol c) to replicate the data on the nfs server to a
site which is 200 km away using a 10 Gbit/s link.

The result is:
(testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
Sun May  5 12:53:36 CEST 2024
Sun May  5 12:53:37 CEST 2024
Sun May  5 12:53:38 CEST 2024
Sun May  5 12:53:39 CEST 2024
Sun May  5 12:53:40 CEST 2024
Sun May  5 12:53:41 CEST 2024 < here I started virsh blockcommit
Sun May  5 12:53:45 CEST 2024
Sun May  5 12:53:50 CEST 2024
Sun May  5 12:53:59 CEST 2024
Sun May  5 12:54:04 CEST 2024
Sun May  5 12:54:22 CEST 2024
Sun May  5 12:54:23 CEST 2024
Sun May  5 12:54:27 CEST 2024
Sun May  5 12:54:32 CEST 2024
Sun May  5 12:54:40 CEST 2024
Sun May  5 12:54:42 CEST 2024
Sun May  5 12:54:45 CEST 2024
Sun May  5 12:54:46 CEST 2024
Sun May  5 12:54:47 CEST 2024
Sun May  5 12:54:48 CEST 2024
Sun May  5 12:54:49 CEST 2024

This is with 'vm.dirty_expire_centisecs=100'; with the default value
'vm.dirty_expire_centisecs=3000' it is worse.

I/O stalls:
- 4 seconds
- 9 seconds
- 5 seconds
- 18 seconds
- 4 seconds
- 5 seconds
- 8 seconds
- 2 seconds
- 3 seconds

With the default vm.dirty_expire_centisecs=3000 I get something like this:

(testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
Sun May  5 11:51:33 CEST 2024
Sun May  5 11:51:34 CEST 2024
Sun May  5 11:51:35 CEST 2024
Sun May  5 11:51:37 CEST 2024
Sun May  5 11:51:38 CEST 2024
Sun May  5 11:51:39 CEST 2024
Sun May  5 11:51:40 CEST 2024 << virsh blockcommit
Sun May  5 11:51:49 CEST 2024
Sun May  5 11:52:07 CEST 2024
Sun May  5 11:52:08 CEST 2024
Sun May  5 11:52:27 CEST 2024
Sun May  5 11:52:45 CEST 2024
Sun May  5 11:52:47 CEST 2024
Sun May  5 11:52:48 CEST 2024
Sun May  5 11:52:49 CEST 2024

I/O stalls:

- 9 seconds
- 18 seconds
- 19 seconds
- 18 seconds
- 1 second

I'm open to any suggestions which improve the situation. I often have a 10
Gbit/s network and a lot of dirty buffer cache, but at the same time I
often replicate synchronously to a second site 200 km away, which only
gives me around 100 MB/s write performance.

With vm.dirty_expire_centisecs=10 it is even worse:

(testy) [~] while true; do date | tee -a date.log; sync; sleep 1; done
Sun May  5 13:25:31 CEST 2024
Sun May  5 13:25:32 CEST 2024
Sun May  5 13:25:33 CEST 2024
Sun May  5 13:25:34 CEST 2024
Sun May  5 13:25:35 CEST 2024
Sun May  5 13:25:36 CEST 2024
Sun May  5 13:25:37 CEST 2024 < virsh blockcommit
Sun May  5 13:26:00 CEST 2024
Sun May  5 13:26:01 CEST 2024
Sun May  5 13:26:06 CEST 2024
Sun May  5 13:26:11 CEST 2024
Sun May  5 13:26:40 CEST 2024
Sun May  5 13:26:42 CEST 2024
Sun May  5 13:26:43 CEST 2024
Sun May  5 13:26:44 CEST 2024

I/O stalls:

- 23 seconds
- 5 seconds
- 5 seconds
- 29 seconds
- 1 second

Cheers,
        Thomas

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: I/O stalls when merging qcow2 snapshots on nfs
  2024-05-05 11:29 I/O stalls when merging qcow2 snapshots on nfs Thomas Glanzmann
@ 2024-05-06 11:25 ` Benjamin Coddington
  2024-05-06 17:21   ` Thomas Glanzmann
  2024-05-06 13:47 ` Trond Myklebust
  1 sibling, 1 reply; 4+ messages in thread
From: Benjamin Coddington @ 2024-05-06 11:25 UTC (permalink / raw)
  To: Thomas Glanzmann; +Cc: kvm, linux-nfs

On 5 May 2024, at 7:29, Thomas Glanzmann wrote:

> Hello,
> I often take snapshots in order to move kvm VMs from one nfs share to
> another while they're running or to take backups. Sometimes I have very
> large VMs (1.1 TB) which take a very long time (40 minutes - 2 hours) to
> backup or move. They also write between 20 - 60 GB of data while being
> backed up or moved. Once the backup or move is done the dirty snapshot
> data needs to be merged to the parent disk. While doing this I often
> experience I/O stalls within the VMs in the range of 1 - 20 seconds.
> Sometimes worse. But I have some very latency sensitive VMs which crash
> or misbehave after 15 seconds I/O stalls. So I would like to know if there
> is some tuning I can do to make these I/O stalls shorter.
>
> - I already tried to set vm.dirty_expire_centisecs=100 which appears to
>   make it better, but not under 15 seconds. Perfect would be I/O stalls
>   no more than 1 second.
>
> This is how you can reproduce the issue:
>
> - NFS Server:
> mkdir /ssd
> apt install -y nfs-kernel-server
> echo '/ssd 0.0.0.0/0.0.0.0(rw,no_root_squash,no_subtree_check,sync)' > /etc/exports
> exportfs -ra
>
> - NFS Client / KVM Host:
> mount server:/ssd /mnt
> # Put a VM on /mnt and start it.
> # Create a snapshot:
> virsh snapshot-create-as --domain testy guest-state1 --diskspec vda,file=/mnt/overlay.qcow2 --disk-only --atomic --no-metadata

What NFS version ends up getting mounted here?  You might eliminate some
head-of-line blocking issues with the "nconnect=16" mount option to open
additional TCP connections.
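
A hedged sketch of such a mount (the export path and NFS version mirror
this thread's setup, not confirmed values; nconnect requires a Linux 5.3+
client):

```shell
# 16 TCP connections to the same server; RPCs are spread across them, so a
# large writeback stream is less likely to block the guest's other I/O.
mount -t nfs -o vers=4.2,nconnect=16 server:/ssd /mnt
```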

My view of what could be happening is that the IO from your guest's process
is congesting with the IO from your 'virsh blockcommit' process, and we
don't currently have a great way to classify and queue IO from various
sources in various ways.

Ben


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: I/O stalls when merging qcow2 snapshots on nfs
  2024-05-05 11:29 I/O stalls when merging qcow2 snapshots on nfs Thomas Glanzmann
  2024-05-06 11:25 ` Benjamin Coddington
@ 2024-05-06 13:47 ` Trond Myklebust
  1 sibling, 0 replies; 4+ messages in thread
From: Trond Myklebust @ 2024-05-06 13:47 UTC (permalink / raw)
  To: kvm@vger.kernel.org, thomas@glanzmann.de,
	linux-nfs@vger.kernel.org

On Sun, 2024-05-05 at 13:29 +0200, Thomas Glanzmann wrote:
> Hello,
> I often take snapshots in order to move kvm VMs from one nfs share to
> another while they're running or to take backups. Sometimes I have
> very
> large VMs (1.1 TB) which take a very long time (40 minutes - 2 hours)
> to
> backup or move. They also write between 20 - 60 GB of data while
> being
> backed up or moved. Once the backup or move is done the dirty
> snapshot
> data needs to be merged to the parent disk. While doing this I often
> experience I/O stalls within the VMs in the range of 1 - 20 seconds.
> Sometimes worse. But I have some very latency sensitive VMs which
> crash
> or misbehave after 15 seconds I/O stalls. So I would like to know if
> there
> is some tuning I can do to make these I/O stalls shorter.
> 
> - I already tried to set vm.dirty_expire_centisecs=100 which appears
> to
>   make it better, but not under 15 seconds. Perfect would be I/O
> stalls
>   no more than 1 second.
> 
> This is how you can reproduce the issue:
> 
> - NFS Server:
> mkdir /ssd
> apt install -y nfs-kernel-server
> echo '/ssd 0.0.0.0/0.0.0.0(rw,no_root_squash,no_subtree_check,sync)'
> > /etc/exports
> exportfs -ra
> 
> - NFS Client / KVM Host:
> mount server:/ssd /mnt
> # Put a VM on /mnt and start it.
> # Create a snapshot:
> virsh snapshot-create-as --domain testy guest-state1 --diskspec
> vda,file=/mnt/overlay.qcow2 --disk-only --atomic --no-metadata
> 
> - In the VM:
> 
> # Write some data (in my case 6 GB of data are written in 60 seconds
> due
> # to the nfs client being connected with a 1 Gbit/s link)
> fio --ioengine=libaio --filesize=32G --ramp_time=2s --runtime=1m --
> numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --
> directory=/mnt --name=write --blocksize=1m --iodepth=1 --
> readwrite=write --unlink=1
> # Do some synchronous I/O
> while true; do date | tee -a date.log; sync; sleep 1; done
> 
> - On the NFS Client / KVM host:
> # Merge the snapshot into the parent disk
> time virsh blockcommit testy vda --active --pivot --delete
> 
> Successfully pivoted
> 
> real    1m4.666s
> user    0m0.017s
> sys     0m0.007s
> 
> I exported the nfs share with sync on purpose because I often use
> drbd
> in sync mode (protocol c) to replicate the data on the nfs server to
> a
> site which is 200 km away using a 10 Gbit/s link.
> 
> The result is:
> (testy) [~] while true; do date | tee -a date.log; sync; sleep 1;
> done
> Sun May  5 12:53:36 CEST 2024
> Sun May  5 12:53:37 CEST 2024
> Sun May  5 12:53:38 CEST 2024
> Sun May  5 12:53:39 CEST 2024
> Sun May  5 12:53:40 CEST 2024
> Sun May  5 12:53:41 CEST 2024 < here I started virsh blockcommit
> Sun May  5 12:53:45 CEST 2024
> Sun May  5 12:53:50 CEST 2024
> Sun May  5 12:53:59 CEST 2024
> Sun May  5 12:54:04 CEST 2024
> Sun May  5 12:54:22 CEST 2024
> Sun May  5 12:54:23 CEST 2024
> Sun May  5 12:54:27 CEST 2024
> Sun May  5 12:54:32 CEST 2024
> Sun May  5 12:54:40 CEST 2024
> Sun May  5 12:54:42 CEST 2024
> Sun May  5 12:54:45 CEST 2024
> Sun May  5 12:54:46 CEST 2024
> Sun May  5 12:54:47 CEST 2024
> Sun May  5 12:54:48 CEST 2024
> Sun May  5 12:54:49 CEST 2024
> 
> This is with 'vm.dirty_expire_centisecs=100'; with the default value
> 'vm.dirty_expire_centisecs=3000' it is worse.
> 
> I/O stalls:
> - 4 seconds
> - 9 seconds
> - 5 seconds
> - 18 seconds
> - 4 seconds
> - 5 seconds
> - 8 seconds
> - 2 seconds
> - 3 seconds
> 
> With the default vm.dirty_expire_centisecs=3000 I get something like
> that:
> 
> (testy) [~] while true; do date | tee -a date.log; sync; sleep 1;
> done
> Sun May  5 11:51:33 CEST 2024
> Sun May  5 11:51:34 CEST 2024
> Sun May  5 11:51:35 CEST 2024
> Sun May  5 11:51:37 CEST 2024
> Sun May  5 11:51:38 CEST 2024
> Sun May  5 11:51:39 CEST 2024
> Sun May  5 11:51:40 CEST 2024 << virsh blockcommit
> Sun May  5 11:51:49 CEST 2024
> Sun May  5 11:52:07 CEST 2024
> Sun May  5 11:52:08 CEST 2024
> Sun May  5 11:52:27 CEST 2024
> Sun May  5 11:52:45 CEST 2024
> Sun May  5 11:52:47 CEST 2024
> Sun May  5 11:52:48 CEST 2024
> Sun May  5 11:52:49 CEST 2024
> 
> I/O stalls:
> 
> - 9 seconds
> - 18 seconds
> - 19 seconds
> - 18 seconds
> - 1 second
> 
> I'm open to any suggestions which improve the situation. I often have
> 10
> Gbit/s network and a lot of dirty buffer cache, but at the same time
> I
> often replicate synchronously to a second site 200 km away which
> only
> gives me around 100 MB/s write performance.
> 
> With vm.dirty_expire_centisecs=10 it is even worse:
> 
> (testy) [~] while true; do date | tee -a date.log; sync; sleep 1;
> done
> Sun May  5 13:25:31 CEST 2024
> Sun May  5 13:25:32 CEST 2024
> Sun May  5 13:25:33 CEST 2024
> Sun May  5 13:25:34 CEST 2024
> Sun May  5 13:25:35 CEST 2024
> Sun May  5 13:25:36 CEST 2024
> Sun May  5 13:25:37 CEST 2024 < virsh blockcommit
> Sun May  5 13:26:00 CEST 2024
> Sun May  5 13:26:01 CEST 2024
> Sun May  5 13:26:06 CEST 2024
> Sun May  5 13:26:11 CEST 2024
> Sun May  5 13:26:40 CEST 2024
> Sun May  5 13:26:42 CEST 2024
> Sun May  5 13:26:43 CEST 2024
> Sun May  5 13:26:44 CEST 2024
> 
> I/O stalls:
> 
> - 23 seconds
> - 5 seconds
> - 5 seconds
> - 29 seconds
> - 1 second
> 
> Cheers,
>         Thomas
> 

Two suggestions:
   1. Try mounting the NFS partition on which these VMs reside with the
      "write=eager" mount option. That ensures that the kernel kicks
      off the write of the block immediately once QEMU has scheduled it
      for writeback. Note, however that the kernel does not wait for
      that write to complete (i.e. these writes are all asynchronous).
   2. Alternatively, try playing with the 'vm.dirty_ratio' or
      'vm.dirty_bytes' values in order to trigger writeback at an
      earlier time. With the default value of vm.dirty_ratio=20, you
      can end up caching up to 20% of your total memory's worth of
      dirty data before the VM triggers writeback over that 1Gbit link.
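
A minimal sketch of suggestion 2 using absolute byte limits (the values
below are illustrative, not recommendations; with 128 GB of RAM, the
default vm.dirty_ratio=20 would otherwise allow ~25.6 GB of dirty data):

```shell
# Setting the *_bytes sysctls overrides the corresponding *_ratio ones.
# vm.dirty_bytes: hard limit past which writers block and flush synchronously.
# vm.dirty_background_bytes: level at which background writeback starts.
cat > /etc/sysctl.d/90-nfs-writeback.conf <<'EOF'
vm.dirty_bytes = 268435456
vm.dirty_background_bytes = 67108864
EOF
sysctl -p /etc/sysctl.d/90-nfs-writeback.conf
```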


-- 
Trond Myklebust Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: I/O stalls when merging qcow2 snapshots on nfs
  2024-05-06 11:25 ` Benjamin Coddington
@ 2024-05-06 17:21   ` Thomas Glanzmann
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Glanzmann @ 2024-05-06 17:21 UTC (permalink / raw)
  To: Benjamin Coddington, Trond Myklebust; +Cc: kvm, linux-nfs

Hello Ben and Trond,

> On 5 May 2024, at 7:29, Thomas Glanzmann wrote paraphrased:

> When committing 20 - 60 GB snapshots of kvm VMs which are stored on NFS,
> I get 20+ second I/O stalls.

> When doing backups and migrations with kvm on NFS I get I/O stalls in
> the guest. How to avoid that?

* Benjamin Coddington <bcodding@redhat.com> [2024-05-06 13:25]:
> What NFS version ends up getting mounted here?

NFS 4.2 (the output below already includes your and Trond's mount options):

172.31.0.1:/nfs on /mnt type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,nconnect=16,timeo=600,retrans=2,sec=sys,clientaddr=172.31.0.6,local_lock=none,write=eager,addr=172.31.0.1)

> You might eliminate some head-of-line blocking issues with the
> "nconnect=16" mount option to open additional TCP connections.

> My view of what could be happening is that the IO from your guest's process
> is congesting with the IO from your 'virsh blockcommit' process, and we
> don't currently have a great way to classify and queue IO from various
> sources in various ways.

Thank you for reminding me of nconnect. I evaluated it with VMware ESX, saw
no benefit when benchmarking with a single VM, and dismissed it. But of
course it makes sense when there is more than one concurrent I/O stream.

* Trond Myklebust <trondmy@hammerspace.com> [2024-05-06 15:47]:
> Two suggestions:
>    1. Try mounting the NFS partition on which these VMs reside with the
>       "write=eager" mount option. That ensures that the kernel kicks
>       off the write of the block immediately once QEMU has scheduled it
>       for writeback. Note, however that the kernel does not wait for
>       that write to complete (i.e. these writes are all asynchronous).
>    2. Alternatively, try playing with the 'vm.dirty_ratio' or
>       'vm.dirty_bytes' values in order to trigger writeback at an
>       earlier time. With the default value of vm.dirty_ratio=20, you
>       can end up caching up to 20% of your total memory's worth of
>       dirty data before the VM triggers writeback over that 1Gbit link.

Thank you for the write=eager option; I was not aware of it. I often run
into problems where a 10 Gbit/s network pipe fills up my buffer cache,
which then has to destage 128 GB * 0.2 = 25.6 GB to the disk. The disk
can't keep up in my case, resulting in long I/O stalls. Usually my disks
can take between 100 MB/s (synchronously replicated drbd link, 200 km) and
500 MB/s (SATA SSDs). I tried to tell the kernel to destage faster
(vm.dirty_expire_centisecs=100), which improved some workloads but not all.

So, I think I found a solution to my problem by doing the following:

- Increase NFSD threads to 128:

cat > /etc/nfs.conf.d/storage.conf <<'EOF'
[nfsd]
threads = 128

[mountd]
threads = 8
EOF
echo 128 > /proc/fs/nfsd/threads

- Mount the nfs volume with -o nconnect=16,write=eager

- Use iothreads and cache=none (in the domain XML, <iothreads> is a direct
  child of <domain>; the <driver> element goes inside the disk's <disk>
  definition):

  <iothreads>2</iothreads>
  <driver name='qemu' type='qcow2' cache='none' discard='unmap' iothread='1'/>

By doing the above I no longer see any I/O stalls longer than one second (at
most a 2-second time difference in my date loop).

Thank you two again for helping me out with this.

Cheers,
	Thomas

PS: With cache=writethrough and without I/O threads, I/O stalls for the entire
time blockcommit executes.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-05-06 17:21 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-05 11:29 I/O stalls when merging qcow2 snapshots on nfs Thomas Glanzmann
2024-05-06 11:25 ` Benjamin Coddington
2024-05-06 17:21   ` Thomas Glanzmann
2024-05-06 13:47 ` Trond Myklebust

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox