* [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)
@ 2014-01-30 18:23 Neil Skrypuch
  2014-02-28 20:14 ` Neil Skrypuch
  ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread

From: Neil Skrypuch @ 2014-01-30 18:23 UTC (permalink / raw)
To: qemu-devel

First, let me briefly outline the way we use live migration, as it is
probably not typical. We use live migration (with block migration) to make
backups of VMs with zero downtime. The basic process goes like this:

1) migrate src VM -> dest VM
2) migration completes
3) cont src VM
4) gracefully shut down dest VM
5) dest VM's disk image is now a valid backup

In general, this works very well.

Up until now we have been using qemu-kvm 1.1.2 and have not had any issues
with the above process. I am now attempting to upgrade us to a newer
version of qemu, but all newer versions I've tried occasionally result in
the virtio-net device ceasing to function on the src VM after step 3.

I am able to reproduce this reliably (given enough iterations); it happens
in roughly 2% of all migrations.
Here is the complete qemu command line for the src VM:

/usr/bin/qemu-system-x86_64 -machine accel=kvm -drive file=/var/lib/kvm/testbackup.polldev.com.img,if=virtio -m 2048 -smp 4,cores=4,sockets=1,threads=1 -net nic,macaddr=52:54:98:00:00:00,model=virtio -net tap,script=/etc/qemu-ifup-br2,downscript=no -curses -name "testbackup.polldev.com",process=testbackup.polldev.com -monitor unix:/var/lib/kvm/monitor/testbackup,server,nowait

The dest VM:

/usr/bin/qemu-system-x86_64 -machine accel=kvm -drive file=/backup/testbackup.polldev.com.img.bak20140129,if=virtio -m 2048 -smp 4,cores=4,sockets=1,threads=1 -net nic,macaddr=52:54:98:00:00:00,model=virtio -net tap,script=no,downscript=no -curses -name "testbackup.polldev.com",process=testbackup.polldev.com -monitor unix:/var/lib/kvm/monitor/testbackup.bak,server,nowait -incoming tcp:0:4444

The migration is performed like so:

echo "migrate -b tcp:localhost:4444" | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
echo "migrate_set_speed 1G" | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
#wait
echo cont | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup

The guest in question is a minimal install of CentOS 6.5.

I have observed this issue across the following qemu versions:

qemu 1.4.2
qemu 1.6.0
qemu 1.6.1
qemu 1.7.0

I also attempted to test qemu 1.5.3, but live migration flat out crashed
there (a totally different issue).

I have also tested a number of other scenarios with qemu 1.6.0, all of
which exhibit the same failure mode:

qemu 1.6.0 + host kernel 3.1.0
qemu 1.6.0 + host kernel 3.10.7
qemu 1.6.0 + host kernel 3.10.17
qemu 1.6.0 + virtio with -netdev/-device syntax
qemu 1.6.0 + accel=tcg

The one case I have found that works properly is the following:

qemu 1.6.0 + e1000

It is worth noting that when the virtio-net device ceases to function in
the guest, removing and reinserting the virtio-net kernel module results
in the device working again (except in 1.4.2, where this had no effect).
As mentioned above I can reproduce this with minimal effort, and am willing
to test out any patches or provide further details as necessary.

- Neil
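The "#wait" step in the commands above can be made explicit by polling the monitor until it reports that migration has finished. A minimal sketch, assuming the human-monitor socket path from the commands above and the usual `info migrate` status text (the exact wording can vary between qemu versions, and QMP would be more robust for scripting); only the status parser is exercised below, the polling loop itself needs a live monitor socket:

```shell
# Return success when "info migrate" output reports a completed migration.
migration_completed() {
    printf '%s\n' "$1" | grep -q "Migration status: completed"
}

# Poll the given monitor socket once per second until migration completes.
# Not run here: it needs a live VM with a monitor socket at $1.
wait_for_migration() {
    sock=$1
    while :; do
        out=$(echo "info migrate" | socat STDIO "UNIX-CONNECT:$sock")
        if migration_completed "$out"; then
            break
        fi
        sleep 1
    done
}

# Example of the status text the parser matches (format is an assumption
# based on typical "info migrate" output of this era):
sample="capabilities: xbzrle: off
Migration status: completed
total time: 12345 milliseconds"
migration_completed "$sample" && echo "done"   # prints: done
```

In the backup script above, `wait_for_migration /var/lib/kvm/monitor/testbackup` would replace the bare `#wait` comment before issuing `cont`.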
* Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)

From: Neil Skrypuch @ 2014-02-28 20:14 UTC (permalink / raw)
To: qemu-devel

On Thursday 30 January 2014 13:23:04 Neil Skrypuch wrote:
> First, let me briefly outline the way we use live migration, as it is
> probably not typical. We use live migration (with block migration) to make
> backups of VMs with zero downtime.
> [...]
> As mentioned above I can reproduce this with minimal effort, and am willing
> to test out any patches or provide further details as necessary.
>
> - Neil

Ok, I was able to narrow this down to somewhere in between 1.2.2 (or
rather, 1.2.0) and 1.3.0. Migration in 1.3.0 is broken; however, I was able
to cherry-pick d7cd369, d5f1f28, and 9ee0cb2 on top of 1.3.0 to fix the
unrelated migration bug and confirm that the bug from this thread is still
present in 1.3.0.

I started a git bisect on 1.2.2..1.3.0 but didn't get very far before
running into several unrelated bugs which kept migration from working.

I also tested out the latest master code (d844a7b) and it fails in the same
way as 1.7.0.

- Neil
* Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)

From: 陈梁 @ 2014-03-01 2:34 UTC (permalink / raw)
To: Neil Skrypuch; +Cc: 陈梁, qemu-devel

> On Thursday 30 January 2014 13:23:04 Neil Skrypuch wrote:
>> First, let me briefly outline the way we use live migration, as it is
>> probably not typical. We use live migration (with block migration) to make
>> backups of VMs with zero downtime.
>> [...]
>
> Ok, I was able to narrow this down to somewhere in between 1.2.2 (or
> rather, 1.2.0) and 1.3.0.
> [...]
> I also tested out the latest master code (d844a7b) and it fails in the
> same way as 1.7.0.
>
> - Neil

Hi, have you tried pinging from the VM to another host after the migration?
* Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)

From: Neil Skrypuch @ 2014-03-03 20:15 UTC (permalink / raw)
To: 陈梁; +Cc: qemu-devel

On Saturday 01 March 2014 10:34:03 陈梁 wrote:
> [...]
> Hi, have you tried pinging from the VM to another host after the migration?

Yes, pings from the VM to anywhere result in "Destination Host Unreachable";
it's not the usual MAC-address-moved problem with migration. Note that the
problem occurs on the *source* VM, not the destination VM; the destination
VM is intentionally configured with an unconnected network interface
(script=no).

Also, I had a closer look at the source VM's state after the network stops
working.
If I initiate a ping from inside the VM, via tcpdump I can see ARP traffic
on the host's corresponding tap and bridge adaptors (both the request and
the response); however, tcpdump from inside the guest does not see either
of these. I can see the TX count on eth0 inside the guest is increasing,
but the RX count is not moving.

On the host, I can see the RX count on the tap is increasing, but the TX is
not. Similarly, the dropped count on the tap is rising rapidly:

tap0      Link encap:Ethernet  HWaddr fe:19:99:0a:9b:07
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:452 errors:0 dropped:0 overruns:0 frame:0
          TX packets:62626 errors:0 dropped:954919 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:29104 (28.4 KiB)  TX bytes:8726592 (8.3 MiB)

If I try to ping the guest from an external host, I can see the ICMP
request reach the tap adaptor on the host, but never a response and nothing
in the guest.

It seems like the TX side is working properly. Is it possible that the RX
side of the virtio-net adaptor is in a confused state, thus resulting in
dropped packets?

- Neil
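The rising drop counter described above can also be watched directly through the kernel's per-interface statistics instead of parsing ifconfig output. A small sketch; the sysfs layout is the standard Linux one, while the interface name and counter values are taken from the quoted output and demonstrated against a fake tree so the snippet is self-contained:

```shell
# read_stat <sysfs-root> <ifname> <counter>
# e.g. read_stat /sys tap0 tx_dropped on a real host.
read_stat() {
    cat "$1/class/net/$2/statistics/$3"
}

# Demonstrate against a fake sysfs tree (values from the ifconfig dump above)
# so the sketch runs anywhere:
fake=$(mktemp -d)
mkdir -p "$fake/class/net/tap0/statistics"
echo 954919 > "$fake/class/net/tap0/statistics/tx_dropped"
echo 452    > "$fake/class/net/tap0/statistics/rx_packets"

drops=$(read_stat "$fake" tap0 tx_dropped)
echo "tap0 tx_dropped: $drops"   # prints: tap0 tx_dropped: 954919
```

On a live host you would sample `read_stat /sys tap0 tx_dropped` before and after reproducing the hang; a counter that keeps rising while the guest sees no RX traffic matches the wedged-RX-ring theory above.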
* Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)

From: Andreas Färber @ 2014-03-05 15:59 UTC (permalink / raw)
To: Neil Skrypuch, qemu-devel; +Cc: Stefan Hajnoczi, Juan Quintela

Am 30.01.2014 19:23, schrieb Neil Skrypuch:
> First, let me briefly outline the way we use live migration, as it is
> probably not typical.
> [...]
> Up until now we have been using qemu-kvm 1.1.2 and have not had any issues
> with the above process. I am now attempting to upgrade us to a newer
> version of qemu, but all newer versions I've tried occasionally result in
> the virtio-net device ceasing to function on the src VM after step 3.

While I don't know this particular symptom, I can definitely tell you that
migrating from qemu-kvm to qemu is bound to fail unless you enable at least
a version_id change in piix4.c, possibly also in kvmvapic.c. Such errors
would lead to migration not successfully completing, though, with a cryptic
error on the dest side.

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
* Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)

From: Neil Skrypuch @ 2014-03-05 18:32 UTC (permalink / raw)
To: Andreas Färber; +Cc: qemu-devel, Stefan Hajnoczi, Juan Quintela

On Wednesday 05 March 2014 16:59:24 Andreas Färber wrote:
> [...]
> While I don't know this particular symptom, I can definitely tell you
> that migrating from qemu-kvm to qemu is bound to fail unless you enable
> at least a version_id change in piix4.c, possibly also in kvmvapic.c.

I should clarify: all of these migrations happen from same version to same
version (and same host to same host). So 1.7.0 -> 1.7.0, 1.6.0 -> 1.6.0,
etc. What we're looking for out of this is a clean copy of the disk image
(consistent and from a graceful shutdown) for backup purposes, without
having to shut down the VM. I expected cross-version migration to be dicey
and made a point of avoiding it.

- Neil
* Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)

From: Stefan Hajnoczi @ 2014-03-08 15:02 UTC (permalink / raw)
To: Neil Skrypuch; +Cc: qemu-devel

On Thu, Jan 30, 2014 at 7:23 PM, Neil Skrypuch <neil@tembosocial.com> wrote:
> As mentioned above I can reproduce this with minimal effort, and am
> willing to test out any patches or provide further details as necessary.

Hi Neil,

Thanks for all your efforts on IRC. I have sent a fix titled "[PATCH] tap:
avoid deadlocking rx". If your tests pass with the fix, please respond to
that email thread with:

Tested-by: Neil Skrypuch <neil@tembosocial.com>

Thanks,
Stefan