From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
Date: Fri, 31 May 2013 11:54:09 +0100
Message-ID: <51A88151.3080001@eu.citrix.com>
References: <1717491994.10371605.1369131737226.JavaMail.root@zimbra002> <519B50C9.1000008@citrix.com> <519B577E.6070200@flexiant.com> <519B6D51.2060508@citrix.com> <951B3441BAE2324286D3AA6D@Ximines.local> <420439EA40B15FCBFDFF2BE3@nimrod.local> <1369557503.22605.11.camel@dagon.hellion.org.uk> <51A4C7EB.1010406@flexiant.com> <51A7767A.9030904@flexiant.com> <51A7791C.2020208@eu.citrix.com> <51A8608F.9000302@flexiant.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <51A8608F.9000302@flexiant.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Diana Crisan
Cc: Ian Campbell, Konrad Rzeszutek Wilk, "xen-devel@lists.xen.org", David Vrabel, Alex Bligh, Anthony PERARD
List-Id: xen-devel@lists.xenproject.org

On 31/05/13 09:34, Diana Crisan wrote:
> George,
> On 30/05/13 17:06, George Dunlap wrote:
>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>> On 30/05/13 16:26, George Dunlap wrote:
>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan wrote:
>>>>> Hi,
>>>>>
>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>> George,
>>>>>>>
>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap wrote:
>>>>>>>
>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>> (a total of 2).
>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>> I hadn't even realised that was possible. That would have made
>>>>>>> testing live migrate easier!
>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>
>>>>>>> How do you avoid the name clash in xenstore?
>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>> FOO-incoming or some such, and then rename it to FOO upon
>>>>>> completion. Some also rename the outgoing domain "FOO-migratedaway"
>>>>>> towards the end, so that the bits of the final teardown which can
>>>>>> safely happen after the target has started can be done then.
>>>>>>
>>>>>> Ian.
>>>>>>
>>>>> I am unsure what I am doing wrong, but I cannot seem to do a
>>>>> localhost migrate.
>>>>>
>>>>> I created a domU using "xl create xl.conf" and once it had fully
>>>>> booted I issued "xl migrate 11 localhost". This fails and gives the
>>>>> output below.
>>>>>
>>>>> Would you please advise on how to get this working?
>>>>>
>>>>> Thanks,
>>>>> Diana
>>>>>
>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>> root@localhost's password:
>>>>> migration target: Ready to receive domain.
>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>> Loading new save file (new xl fmt info 0x0/0x0/2344)
>>>>> Savefile contains xl domain config
>>>>> xc: progress: Reloading memory pages: 53248/1048575 5%
>>>>> xc: progress: Reloading memory pages: 105472/1048575 10%
>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>> device model: spawn failed (rc=-3)
>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>> model did not start: -3
>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device
>>>>> Model already exited
>>>>> migration target: Domain creation failed (code -3).
>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>> truncated reading ready message from migration receiver stream
>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus:
>>>>> migration target process [10934] exited with error status 3
>>>>> Migration failed, resuming at sender.
>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>> failed for domain 11: Success
>>>> Aha -- I managed to reproduce this one as well.
>>>>
>>>> Your problem is the "vncunused=0" -- that's instructing qemu "you
>>>> must use this exact port for the vnc server". But when you do the
>>>> migrate, that port is still in use by the "from" domain, so the qemu
>>>> for the "to" domain can't get it, and fails.
>>>>
>>>> Obviously this should fail a lot more gracefully, but that's a
>>>> lower-priority bug, I think.
>>>>
>>>>  -George
>>> Yes, I managed to get to the bottom of it too, and got VMs migrating
>>> on localhost on our end.
>>>
>>> I can confirm I did get the stuck-clock problem while doing a
>>> localhost migrate.
>>
>> Does the script I posted earlier "work" for you (i.e., does it fail
>> after some number of migrations)?
>>
>
> I left your script running throughout the night and it seems that it
> does not always catch the problem. I see the following:
>
> 1. The vm has its clock stuck.
> 2. The script is still running, as the vm is still ping-able.
> 3. The migration fails on the basis that the vm does not ack the
> suspend request (see below).

So I wrote a script to run "date", sleep for 2 seconds, and run "date" a
second time -- and eventually the *sleep* hung. The VM is still
responsive, and I can log in; if I run "date" manually several times in
a row I get an advancing clock, but if I type "sleep 1" it just hangs.

If you run "dmesg" in the guest, do you see the following line?

CE: Reprogramming failure. Giving up

 -George
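
[Editor's note: for readers who hit the same device-model spawn failure
on a localhost migrate, the fix discussed in the thread is to stop
pinning the VNC port. A minimal, illustrative xl guest-config fragment
(the rest of the guest settings are whatever you already use):]

```
# Illustrative xl config fragment: with vncunused=1, qemu picks any
# free VNC port, so the "to" domain of a localhost migration can start
# its own VNC server while the "from" domain still holds the old port.
# (vncunused=0 is what caused the spawn failure described above.)
vnc = 1
vncunused = 1
```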
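
[Editor's note: the date/sleep probe George describes could be sketched
roughly as below; the iteration count is a placeholder -- the original
presumably looped until the failure appeared.]

```shell
#!/bin/sh
# Sketch of the stuck-clock probe: print the epoch time, sleep 2
# seconds, print it again, and report the delta.  On a healthy guest
# the delta tracks the sleep; on a wedged guest (one logging
# "CE: Reprogramming failure." in dmesg) the sleep itself never returns.
for i in 1 2 3; do
    t1=$(date +%s)
    sleep 2
    t2=$(date +%s)
    echo "iteration $i: delta=$((t2 - t1))s"
done
```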