From: George Dunlap <george.dunlap@eu.citrix.com>
To: Diana Crisan <dcrisan@flexiant.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
David Vrabel <david.vrabel@citrix.com>,
Alex Bligh <alex@alex.org.uk>,
Anthony PERARD <anthony.perard@citrix.com>
Subject: Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
Date: Fri, 31 May 2013 11:59:22 +0100
Message-ID: <51A8828A.5090804@eu.citrix.com>
In-Reply-To: <51A88151.3080001@eu.citrix.com>
On 31/05/13 11:54, George Dunlap wrote:
> On 31/05/13 09:34, Diana Crisan wrote:
>> George,
>> On 30/05/13 17:06, George Dunlap wrote:
>>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>>> On 30/05/13 16:26, George Dunlap wrote:
>>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>>> George,
>>>>>>>>
>>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>>> (a total of 2).
>>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>>> I hadn't even realised that was possible. That would have made
>>>>>>>> testing
>>>>>>>> live
>>>>>>>> migrate easier!
>>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>>
>>>>>>>> How do you avoid the name clash in xen-store?
>>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>>> FOO-incoming or some such and then rename to FOO upon
>>>>>>> completion. Some
>>>>>>> also rename the outgoing domain "FOO-migratedaway" towards the
>>>>>>> end so
>>>>>>> that the bits of the final teardown which can safely happen
>>>>>>> after the
>>>>>>> target have start can be done so.
>>>>>>>
>>>>>>> Ian.
>>>>>>>
>>>>>>>
>>>>>> I am unsure what I am doing wrong, but I cannot seem to be able
>>>>>> to do a
>>>>>> localhost migrate.
>>>>>>
>>>>>> I created a domU using "xl create xl.conf" and once it fully
>>>>>> booted I
>>>>>> issued
>>>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>>>
>>>>>> Would you please advise on how to get this working?
>>>>>>
>>>>>> Thanks,
>>>>>> Diana
>>>>>>
>>>>>>
>>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>>> root@localhost's password:
>>>>>> migration target: Ready to receive domain.
>>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>>> Loading new save file <incoming migration stream> (new xl fmt info
>>>>>> 0x0/0x0/2344)
>>>>>> Savefile contains xl domain config
>>>>>> xc: progress: Reloading memory pages: 53248/1048575 5%
>>>>>> xc: progress: Reloading memory pages: 105472/1048575 10%
>>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>>> device
>>>>>> model: spawn failed (rc=-3)
>>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>>> model
>>>>>> did not start: -3
>>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device
>>>>>> Model
>>>>>> already exited
>>>>>> migration target: Domain creation failed (code -3).
>>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>>> truncated
>>>>>> reading ready message from migration receiver stream
>>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus:
>>>>>> migration
>>>>>> target process [10934] exited with error status 3
>>>>>> Migration failed, resuming at sender.
>>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>>> failed for
>>>>>> domain 11: Success
>>>>> Aha -- I managed to reproduce this one as well.
>>>>>
>>>>> Your problem is the "vncunused=0" -- that's instructing qemu "You
>>>>> must
>>>>> use this exact port for the vnc server". But when you do the
>>>>> migrate,
>>>>> that port is still in use by the "from" domain; so the qemu for the
>>>>> "to" domain can't get it, and fails.
>>>>>
>>>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>>>> a lower-priority bug I think.
>>>>>
>>>>> -George
>>>> Yes, I managed to get to the bottom of it too and got vms migrating on
>>>> localhost on our end.
>>>>
>>>> I can confirm I did get the clock stuck problem while doing a
>>>> localhost
>>>> migrate.
>>>
>>> Does the script I posted earlier "work" for you (i.e., does it fail
>>> after some number of migrations)?
>>>
>>
>> I left your script running throughout the night and it seems that it
>> does not always catch the problem. I see the following:
>>
>> 1. vm has the clock stuck
>> 2. script is still running as it seems the vm is still ping-able.
>> 3. migration fails on the basis that the vm is does not ack the
>> suspend request (see below).
>
> So I wrote a script to run "date", sleep for 2 seconds, and run "date"
> a second time -- and eventually the *sleep* hung.
>
> The VM is still responsive, and I can log in; if I type "date"
> manually successive times then I get an advancing clock, but if I type
> "sleep 1" it just hangs.
>
> If you run "dmesg" in the guest, do you see the following line?
>
> CE: Reprogramming failure. Giving up
I think this must be it; on my other box, I got the following messages:
[ 224.732083] PM: late freeze of devices complete after 3.787 msecs
[ 224.736062] Xen HVM callback vector for event delivery is enabled
[ 224.736062] Xen Platform PCI: I/O protocol version 1
[ 224.736062] xen: --> irq=8, pirq=16
[ 224.736062] xen: --> irq=12, pirq=17
[ 224.736062] xen: --> irq=1, pirq=18
[ 224.736062] xen: --> irq=6, pirq=19
[ 224.736062] xen: --> irq=4, pirq=20
[ 224.736062] xen: --> irq=7, pirq=21
[ 224.736062] xen: --> irq=28, pirq=22
[ 224.736062] ata_piix 0000:00:01.1: restoring config space at offset 0x1 (was 0x2800001, writing 0x2800005)
[ 224.736062] PM: early restore of devices complete after 5.854 msecs
[ 224.739692] ata_piix 0000:00:01.1: setting latency timer to 64
[ 224.739782] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
[ 224.746900] PM: restore of devices complete after 7.540 msecs
[ 224.758612] Setting capacity to 16777216
[ 224.758749] Setting capacity to 16777216
[ 224.898426] ata2.01: NODEV after polling detection
[ 224.900941] ata2.00: configured for MWDMA2
[ 231.055978] CE: xen increased min_delta_ns to 150000 nsec
[ 231.055986] hrtimer: interrupt took 14460 ns
[ 247.893303] PM: freeze of devices complete after 2.168 msecs
[ 247.893306] suspending xenstore...
[ 247.896977] PM: late freeze of devices complete after 3.666 msecs
[ 247.900067] Xen HVM callback vector for event delivery is enabled
[ 247.900067] Xen Platform PCI: I/O protocol version 1
[ 247.900067] xen: --> irq=8, pirq=16
[ 247.900067] xen: --> irq=12, pirq=17
[ 247.900067] xen: --> irq=1, pirq=18
[ 247.900067] xen: --> irq=6, pirq=19
[ 247.900067] xen: --> irq=4, pirq=20
[ 247.900067] xen: --> irq=7, pirq=21
[ 247.900067] xen: --> irq=28, pirq=22
[ 247.900067] ata_piix 0000:00:01.1: restoring config space at offset 0x1 (was 0x2800001, writing 0x2800005)
[ 247.900067] PM: early restore of devices complete after 4.612 msecs
[ 247.906454] ata_piix 0000:00:01.1: setting latency timer to 64
[ 247.906558] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
[ 247.914770] PM: restore of devices complete after 8.762 msecs
[ 247.926557] Setting capacity to 16777216
[ 247.926661] Setting capacity to 16777216
[ 248.066661] ata2.01: NODEV after polling detection
[ 248.067326] CE: xen increased min_delta_ns to 225000 nsec
[ 248.067344] CE: xen increased min_delta_ns to 337500 nsec
[ 248.067361] CE: xen increased min_delta_ns to 506250 nsec
[ 248.067378] CE: xen increased min_delta_ns to 759375 nsec
[ 248.067396] CE: xen increased min_delta_ns to 1139062 nsec
[ 248.067413] CE: xen increased min_delta_ns to 1708593 nsec
[ 248.067428] CE: xen increased min_delta_ns to 2562889 nsec
[ 248.067441] CE: xen increased min_delta_ns to 3844333 nsec
[ 248.067453] CE: xen increased min_delta_ns to 4000000 nsec
[ 248.067466] CE: Reprogramming failure. Giving up
[ 248.068075] ata2.00: configured for MWDMA2
Note the "CE: xen increased min_delta_ns to 150000 nsec" message at 231
seconds, after the previous suspend; on this suspend it keeps increasing
min_delta_ns, all the way up to 4 milliseconds (4000000 nsec), before
giving up.
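For what it's worth, the values in the log look like a simple back-off:
each step adds half of the current value (rounded down), and the result
is clamped at 4000000 nsec, after which the next attempt gives up. The
sketch below only reproduces that arithmetic. The growth rule and the
4000000 nsec limit are assumptions inferred from the log above, and the
function names are made up for illustration; none of this is taken from
the guest kernel's source.

```python
# Sketch of the back-off visible in the dmesg output above.
# ASSUMPTIONS (inferred from the log, not from kernel source):
#   - each step grows min_delta_ns by half of its current value
#   - the value is clamped at LIMIT_NS, and the attempt after that gives up
LIMIT_NS = 4_000_000  # assumed cap: the last value printed before "Giving up"


def increase_min_delta(min_delta_ns, limit_ns=LIMIT_NS):
    """Return the next min_delta_ns, or None if the limit was already hit."""
    if min_delta_ns >= limit_ns:
        return None  # corresponds to "CE: Reprogramming failure. Giving up"
    grown = min_delta_ns + (min_delta_ns >> 1)  # add half, rounding down
    return min(grown, limit_ns)


def escalation(start_ns, limit_ns=LIMIT_NS):
    """List every value min_delta_ns takes, starting after start_ns."""
    values = []
    current = increase_min_delta(start_ns, limit_ns)
    while current is not None:
        values.append(current)
        current = increase_min_delta(current, limit_ns)
    return values


if __name__ == "__main__":
    # 150000 is the value left over from the previous suspend (at 231 s).
    for value in escalation(150_000):
        print(f"CE: xen increased min_delta_ns to {value} nsec")
    print("CE: Reprogramming failure. Giving up")
```

Starting from 150000 this gives the same nine values as the log at
248.067, so the timer seems to have failed to program on every retry
rather than on just one.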
Konrad, Stefano, any idea what's going on here?
-George