From mboxrd@z Thu Jan 1 00:00:00 1970 From: Diana Crisan Subject: Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI Date: Fri, 31 May 2013 09:34:23 +0100 Message-ID: <51A8608F.9000302@flexiant.com> References: <1717491994.10371605.1369131737226.JavaMail.root@zimbra002> <519B50C9.1000008@citrix.com> <519B577E.6070200@flexiant.com> <519B6D51.2060508@citrix.com> <951B3441BAE2324286D3AA6D@Ximines.local> <420439EA40B15FCBFDFF2BE3@nimrod.local> <1369557503.22605.11.camel@dagon.hellion.org.uk> <51A4C7EB.1010406@flexiant.com> <51A7767A.9030904@flexiant.com> <51A7791C.2020208@eu.citrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <51A7791C.2020208@eu.citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: George Dunlap Cc: Ian Campbell , Konrad Rzeszutek Wilk , "xen-devel@lists.xen.org" , David Vrabel , Alex Bligh , Anthony PERARD List-Id: xen-devel@lists.xenproject.org George, On 30/05/13 17:06, George Dunlap wrote: > On 05/30/2013 04:55 PM, Diana Crisan wrote: >> On 30/05/13 16:26, George Dunlap wrote: >>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan >>> wrote: >>>> Hi, >>>> >>>> >>>> On 26/05/13 09:38, Ian Campbell wrote: >>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote: >>>>>> George, >>>>>> >>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap >>>>>> >>>>>> wrote: >>>>>> >>>>>>>> FWIW it's reproducible on every host h/w platform we've tried >>>>>>>> (a total of 2). >>>>>>> Do you see the same effects if you do a local-host migrate? >>>>>> I hadn't even realised that was possible. That would have made >>>>>> testing >>>>>> live >>>>>> migrate easier! >>>>> That's basically the whole reason it is supported ;-) >>>>> >>>>>> How do you avoid the name clash in xen-store? >>>>> Most toolstacks receive the incoming migration into a domain named >>>>> FOO-incoming or some such and then rename to FOO upon completion. >>>>> Some >>>>> also rename the outgoing domain "FOO-migratedaway" towards the end so >>>>> that the bits of the final teardown which can safely happen after the >>>>> target have start can be done so. >>>>> >>>>> Ian. >>>>> >>>>> >>>> I am unsure what I am doing wrong, but I cannot seem to be able to >>>> do a >>>> localhost migrate. >>>> >>>> I created a domU using "xl create xl.conf" and once it fully booted I >>>> issued >>>> an "xl migrate 11 localhost". This fails and gives the output below. >>>> >>>> Would you please advise on how to get this working? >>>> >>>> Thanks, >>>> Diana >>>> >>>> >>>> root@ubuntu:~# xl migrate 11 localhost >>>> root@localhost's password: >>>> migration target: Ready to receive domain. >>>> Saving to migration stream new xl format (info 0x0/0x0/2344) >>>> Loading new save file (new xl fmt info >>>> 0x0/0x0/2344) >>>> Savefile contains xl domain config >>>> xc: progress: Reloading memory pages: 53248/1048575 5% >>>> xc: progress: Reloading memory pages: 105472/1048575 10% >>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12 >>>> device >>>> model: spawn failed (rc=-3) >>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device >>>> model >>>> did not start: -3 >>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device >>>> Model >>>> already exited >>>> migration target: Domain creation failed (code -3). >>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream >>>> truncated >>>> reading ready message from migration receiver stream >>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration >>>> target process [10934] exited with error status 3 >>>> Migration failed, resuming at sender. >>>> xc: error: Cannot resume uncooperative HVM guests: Internal error >>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume >>>> failed for >>>> domain 11: Success >>> Aha -- I managed to reproduce this one as well. >>> >>> Your problem is the "vncunused=0" -- that's instructing qemu "You must >>> use this exact port for the vnc server". But when you do the migrate, >>> that port is still in use by the "from" domain; so the qemu for the >>> "to" domain can't get it, and fails. >>> >>> Obviously this should fail a lot more gracefully, but that's a bit of >>> a lower-priority bug I think. >>> >>> -George >> Yes, I managed to get to the bottom of it too and got vms migrating on >> localhost on our end. >> >> I can confirm I did get the clock stuck problem while doing a localhost >> migrate. > > Does the script I posted earlier "work" for you (i.e., does it fail > after some number of migrations)? > I left your script running throughout the night and it seems that it does not always catch the problem. I see the following: 1. vm has the clock stuck 2. script is still running as it seems the vm is still ping-able. 3. migration fails on the basis that the vm is does not ack the suspend request (see below). libxl: error: libxl_dom.c:1063:libxl__domain_suspend_common_callback: guest didn't acknowledge suspend, cancelling request libxl: error: libxl_dom.c:1085:libxl__domain_suspend_common_callback: guest didn't acknowledge suspend, request cancelled xc: error: Suspend request failed: Internal error xc: error: Domain appears not to have suspended: Internal error libxl: error: libxl_dom.c:1370:libxl__xc_domain_save_done: saving domain: domain did not respond to suspend request: Invalid argument migration sender: libxl_domain_suspend failed (rc=-8) xc: error: 0-length read: Internal error xc: error: read_exact_timed failed (read rc: 0, errno: 0): Internal error xc: error: Error when reading batch size (0 = Success): Internal error xc: error: Error when reading batch (0 = Success): Internal error libxl: error: libxl_create.c:834:libxl__xc_domain_restore_done: restoring domain: Resource temporarily unavailable libxl: error: libxl_create.c:916:domcreate_rebuild_done: cannot (re-)build domain: -3 libxl: error: libxl.c:1378:libxl__destroy_domid: non-existant domain 111 libxl: error: libxl.c:1342:domain_destroy_callback: unable to destroy guest with domid 111 libxl: error: libxl_create.c:1225:domcreate_destruction_cb: unable to destroy domain 111 following failed creation migration target: Domain creation failed (code -3). libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration target process [7849] exited with error status 3 Migration failed, failed to suspend at sender. PING 172.16.1.223 (172.16.1.223) 56(84) bytes of data. 64 bytes from 172.16.1.223: icmp_req=1 ttl=64 time=0.339 ms 64 bytes from 172.16.1.223: icmp_req=2 ttl=64 time=0.569 ms 64 bytes from 172.16.1.223: icmp_req=3 ttl=64 time=0.535 ms 64 bytes from 172.16.1.223: icmp_req=4 ttl=64 time=0.544 ms 64 bytes from 172.16.1.223: icmp_req=5 ttl=64 time=0.529 ms > I've been using it to do a localhost migrate, using a nearly identical > config as the one you posted (only difference, I'm using blkback > rather than blktap), with an Ubuntu Precise VM using the > 3.2.0-39-virtual kernel, and I'm up to 20 migrates with no problems. > > Differences between my setup and yours at this point: > - probably hardware (I've got an old AMD box) > - dom0 kernel is Debian 2.6.32-5-xen > - not using blktap > > I've also been testing this on an Intel box, with the Debian > 3.2.0-4-686-pae kernel, with a Debian distro, and it's up to 103 > successful migrates. > > It's possible that it's a model-specific issue, but it's sort of hard > to see how the dom0 kernel, or blktap, could cause this. > > Do you have any special kernel config parameters you're passing in to > the guest? > > Also, could you try a generic Debian Wheezy install, just to see if > it's got something to do with the kernel? > > -George I reckon our code caught a separate problem with this issue as whenever the vm got its clock stuck, the network interface wasn't coming back up and I would see NO-CARRIER for the guest, which made it unreachable. -- Diana