From mboxrd@z Thu Jan 1 00:00:00 1970 From: George Dunlap Subject: Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI Date: Mon, 3 Jun 2013 10:43:26 +0100 Message-ID: <51AC653E.8060206@eu.citrix.com> References: <420439EA40B15FCBFDFF2BE3@nimrod.local> <1369557503.22605.11.camel@dagon.hellion.org.uk> <51A4C7EB.1010406@flexiant.com> <51A7767A.9030904@flexiant.com> <51A7791C.2020208@eu.citrix.com> <51A8608F.9000302@flexiant.com> <51A88151.3080001@eu.citrix.com> <51A8828A.5090804@eu.citrix.com> <20130531213041.GG5362@phenom.dumpdata.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20130531213041.GG5362@phenom.dumpdata.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Konrad Rzeszutek Wilk Cc: Ian Campbell , Stefano Stabellini , "xen-devel@lists.xen.org" , David Vrabel , Alex Bligh , Anthony PERARD , Diana Crisan List-Id: xen-devel@lists.xenproject.org On 31/05/13 22:30, Konrad Rzeszutek Wilk wrote: > On Fri, May 31, 2013 at 11:59:22AM +0100, George Dunlap wrote: >> On 31/05/13 11:54, George Dunlap wrote: >>> On 31/05/13 09:34, Diana Crisan wrote: >>>> George, >>>> On 30/05/13 17:06, George Dunlap wrote: >>>>> On 05/30/2013 04:55 PM, Diana Crisan wrote: >>>>>> On 30/05/13 16:26, George Dunlap wrote: >>>>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan >>>>>>> wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> >>>>>>>> On 26/05/13 09:38, Ian Campbell wrote: >>>>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote: >>>>>>>>>> George, >>>>>>>>>> >>>>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap >>>>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried >>>>>>>>>>>> (a total of 2). >>>>>>>>>>> Do you see the same effects if you do a local-host migrate? >>>>>>>>>> I hadn't even realised that was possible. That would >>>>>>>>>> have made testing >>>>>>>>>> live >>>>>>>>>> migrate easier! >>>>>>>>> That's basically the whole reason it is supported ;-) >>>>>>>>> >>>>>>>>>> How do you avoid the name clash in xen-store? >>>>>>>>> Most toolstacks receive the incoming migration into a domain named >>>>>>>>> FOO-incoming or some such and then rename to FOO upon >>>>>>>>> completion. Some >>>>>>>>> also rename the outgoing domain "FOO-migratedaway" >>>>>>>>> towards the end so >>>>>>>>> that the bits of the final teardown which can safely >>>>>>>>> happen after the >>>>>>>>> target have start can be done so. >>>>>>>>> >>>>>>>>> Ian. >>>>>>>>> >>>>>>>>> >>>>>>>> I am unsure what I am doing wrong, but I cannot seem to >>>>>>>> be able to do a >>>>>>>> localhost migrate. >>>>>>>> >>>>>>>> I created a domU using "xl create xl.conf" and once it >>>>>>>> fully booted I >>>>>>>> issued >>>>>>>> an "xl migrate 11 localhost". This fails and gives the output below. >>>>>>>> >>>>>>>> Would you please advise on how to get this working? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Diana >>>>>>>> >>>>>>>> >>>>>>>> root@ubuntu:~# xl migrate 11 localhost >>>>>>>> root@localhost's password: >>>>>>>> migration target: Ready to receive domain. >>>>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344) >>>>>>>> Loading new save file (new xl fmt info >>>>>>>> 0x0/0x0/2344) >>>>>>>> Savefile contains xl domain config >>>>>>>> xc: progress: Reloading memory pages: 53248/1048575 5% >>>>>>>> xc: progress: Reloading memory pages: 105472/1048575 10% >>>>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12 >>>>>>>> device >>>>>>>> model: spawn failed (rc=-3) >>>>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device >>>>>>>> model >>>>>>>> did not start: -3 >>>>>>>> libxl: error: >>>>>>>> libxl_dm.c:1311:libxl__destroy_device_model: Device >>>>>>>> Model >>>>>>>> already exited >>>>>>>> migration target: Domain creation failed (code -3). >>>>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream >>>>>>>> truncated >>>>>>>> reading ready message from migration receiver stream >>>>>>>> libxl: info: >>>>>>>> libxl_exec.c:118:libxl_report_child_exitstatus: >>>>>>>> migration >>>>>>>> target process [10934] exited with error status 3 >>>>>>>> Migration failed, resuming at sender. >>>>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error >>>>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume >>>>>>>> failed for >>>>>>>> domain 11: Success >>>>>>> Aha -- I managed to reproduce this one as well. >>>>>>> >>>>>>> Your problem is the "vncunused=0" -- that's instructing >>>>>>> qemu "You must >>>>>>> use this exact port for the vnc server". But when you do >>>>>>> the migrate, >>>>>>> that port is still in use by the "from" domain; so the qemu for the >>>>>>> "to" domain can't get it, and fails. >>>>>>> >>>>>>> Obviously this should fail a lot more gracefully, but that's a bit of >>>>>>> a lower-priority bug I think. >>>>>>> >>>>>>> -George >>>>>> Yes, I managed to get to the bottom of it too and got vms migrating on >>>>>> localhost on our end. >>>>>> >>>>>> I can confirm I did get the clock stuck problem while doing >>>>>> a localhost >>>>>> migrate. >>>>> Does the script I posted earlier "work" for you (i.e., does it >>>>> fail after some number of migrations)? >>>>> >>>> I left your script running throughout the night and it seems >>>> that it does not always catch the problem. I see the following: >>>> >>>> 1. vm has the clock stuck >>>> 2. script is still running as it seems the vm is still ping-able. >>>> 3. migration fails on the basis that the vm is does not ack the >>>> suspend request (see below). >>> So I wrote a script to run "date", sleep for 2 seconds, and run >>> "date" a second time -- and eventually the *sleep* hung. >>> >>> The VM is still responsive, and I can log in; if I type "date" >>> manually successive times then I get an advancing clock, but if I >>> type "sleep 1" it just hangs. >>> >>> If you run "dmesg" in the guest, do you see the following line? >>> >>> CE: Reprogramming failure. Giving up >> I think this must be it; on my other box, I got the following messages: >> >> [ 224.732083] PM: late freeze of devices complete after 3.787 msecs >> [ 224.736062] Xen HVM callback vector for event delivery is enabled >> [ 224.736062] Xen Platform PCI: I/O protocol version 1 >> [ 224.736062] xen: --> irq=8, pirq=16 >> [ 224.736062] xen: --> irq=12, pirq=17 >> [ 224.736062] xen: --> irq=1, pirq=18 >> [ 224.736062] xen: --> irq=6, pirq=19 >> [ 224.736062] xen: --> irq=4, pirq=20 >> [ 224.736062] xen: --> irq=7, pirq=21 >> [ 224.736062] xen: --> irq=28, pirq=22 >> [ 224.736062] ata_piix 0000:00:01.1: restoring config space at >> offset 0x1 (was 0x2800001, writing 0x2800005) >> [ 224.736062] PM: early restore of devices complete after 5.854 msecs >> [ 224.739692] ata_piix 0000:00:01.1: setting latency timer to 64 >> [ 224.739782] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 >> (level, low) -> IRQ 28 >> [ 224.746900] PM: restore of devices complete after 7.540 msecs >> [ 224.758612] Setting capacity to 16777216 >> [ 224.758749] Setting capacity to 16777216 >> [ 224.898426] ata2.01: NODEV after polling detection >> [ 224.900941] ata2.00: configured for MWDMA2 >> [ 231.055978] CE: xen increased min_delta_ns to 150000 nsec >> [ 231.055986] hrtimer: interrupt took 14460 ns >> [ 247.893303] PM: freeze of devices complete after 2.168 msecs >> [ 247.893306] suspending xenstore... >> [ 247.896977] PM: late freeze of devices complete after 3.666 msecs >> [ 247.900067] Xen HVM callback vector for event delivery is enabled >> [ 247.900067] Xen Platform PCI: I/O protocol version 1 >> [ 247.900067] xen: --> irq=8, pirq=16 >> [ 247.900067] xen: --> irq=12, pirq=17 >> [ 247.900067] xen: --> irq=1, pirq=18 >> [ 247.900067] xen: --> irq=6, pirq=19 >> [ 247.900067] xen: --> irq=4, pirq=20 >> [ 247.900067] xen: --> irq=7, pirq=21 >> [ 247.900067] xen: --> irq=28, pirq=22 >> [ 247.900067] ata_piix 0000:00:01.1: restoring config space at >> offset 0x1 (was 0x2800001, writing 0x2800005) >> [ 247.900067] PM: early restore of devices complete after 4.612 msecs >> [ 247.906454] ata_piix 0000:00:01.1: setting latency timer to 64 >> [ 247.906558] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 >> (level, low) -> IRQ 28 >> [ 247.914770] PM: restore of devices complete after 8.762 msecs >> [ 247.926557] Setting capacity to 16777216 >> [ 247.926661] Setting capacity to 16777216 >> [ 248.066661] ata2.01: NODEV after polling detection >> [ 248.067326] CE: xen increased min_delta_ns to 225000 nsec >> [ 248.067344] CE: xen increased min_delta_ns to 337500 nsec >> [ 248.067361] CE: xen increased min_delta_ns to 506250 nsec >> [ 248.067378] CE: xen increased min_delta_ns to 759375 nsec >> [ 248.067396] CE: xen increased min_delta_ns to 1139062 nsec >> [ 248.067413] CE: xen increased min_delta_ns to 1708593 nsec >> [ 248.067428] CE: xen increased min_delta_ns to 2562889 nsec >> [ 248.067441] CE: xen increased min_delta_ns to 3844333 nsec >> [ 248.067453] CE: xen increased min_delta_ns to 4000000 nsec >> [ 248.067466] CE: Reprogramming failure. Giving up >> [ 248.068075] ata2.00: configured for MWDMA2 >> >> Note the "CE: xen increased min_delta_ns to 150000nsec" at 231 for >> the previous suspend, and now it's increasing it up to 4 >> milliseconds before giving up for this suspend. >> >> Konrad, stefano, any idea what's going on here? > VIRQ_TIMER not being delievered. Aka this commit > > bee980d9e9642e96351fa3ca9077b853ecf62f57 > xen/events: Handle VIRQ_TIMER before any other hardirq in event loop. > > should be back-ported but didn't yet. Let me put that > on my TODO list. Konrad, I don't understand how the VIRQ timer can be the issue. As far as I can tell, what's happening is this: 1. The kernel asks Xen timer for something N ns in the future. 2. The xen timer stuff in Linux calculates the current time using stuff from the shared info page, adds N ns, then asks Xen for an event to trigger at that time. 3. Unfortunately, that new time is in the past, and Xen returns an error. So how is the VIRQ_TIMER not being delivered causing the calculation to come up with a time in the past? -George