Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI

xen-devel.lists.xenproject.org archive mirror

From: George Dunlap <george.dunlap@eu.citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	David Vrabel <david.vrabel@citrix.com>,
	Alex Bligh <alex@alex.org.uk>,
	Anthony PERARD <anthony.perard@citrix.com>,
	Diana Crisan <dcrisan@flexiant.com>
Subject: Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
Date: Mon, 3 Jun 2013 10:43:26 +0100
Message-ID: <51AC653E.8060206@eu.citrix.com>
In-Reply-To: <20130531213041.GG5362@phenom.dumpdata.com>

On 31/05/13 22:30, Konrad Rzeszutek Wilk wrote:
> On Fri, May 31, 2013 at 11:59:22AM +0100, George Dunlap wrote:
>> On 31/05/13 11:54, George Dunlap wrote:
>>> On 31/05/13 09:34, Diana Crisan wrote:
>>>> George,
>>>> On 30/05/13 17:06, George Dunlap wrote:
>>>>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>>>>> On 30/05/13 16:26, George Dunlap wrote:
>>>>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@flexiant.com>
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>>>>> George,
>>>>>>>>>>
>>>>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap
>>>>>>>>>> <George.Dunlap@eu.citrix.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>>>>> (a total of 2).
>>>>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>>>>> I hadn't even realised that was possible. That would
>>>>>>>>>> have made testing
>>>>>>>>>> live
>>>>>>>>>> migrate easier!
>>>>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>>>>
>>>>>>>>>> How do you avoid the name clash in xen-store?
>>>>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>>>>> FOO-incoming or some such and then rename to FOO upon completion. Some
>>>>>>>>> also rename the outgoing domain "FOO-migratedaway" towards the end so
>>>>>>>>> that the bits of the final teardown which can safely happen after the
>>>>>>>>> target has started can be done so.
>>>>>>>>>
>>>>>>>>> Ian.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I am unsure what I am doing wrong, but I cannot seem to
>>>>>>>> be able to do a
>>>>>>>> localhost migrate.
>>>>>>>>
>>>>>>>> I created a domU using "xl create xl.conf" and once it
>>>>>>>> fully booted I
>>>>>>>> issued
>>>>>>>> an "xl migrate 11 localhost". This fails and gives the output below.
>>>>>>>>
>>>>>>>> Would you please advise on how to get this working?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Diana
>>>>>>>>
>>>>>>>>
>>>>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>>>>> root@localhost's password:
>>>>>>>> migration target: Ready to receive domain.
>>>>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>>>>> Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/2344)
>>>>>>>>   Savefile contains xl domain config
>>>>>>>> xc: progress: Reloading memory pages: 53248/1048575 5%
>>>>>>>> xc: progress: Reloading memory pages: 105472/1048575 10%
>>>>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12 device model: spawn failed (rc=-3)
>>>>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device model did not start: -3
>>>>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model already exited
>>>>>>>> migration target: Domain creation failed (code -3).
>>>>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated reading ready message from migration receiver stream
>>>>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration target process [10934] exited with error status 3
>>>>>>>> Migration failed, resuming at sender.
>>>>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume failed for domain 11: Success
>>>>>>> Aha -- I managed to reproduce this one as well.
>>>>>>>
>>>>>>> Your problem is the "vncunused=0" -- that's instructing
>>>>>>> qemu "You must
>>>>>>> use this exact port for the vnc server".  But when you do
>>>>>>> the migrate,
>>>>>>> that port is still in use by the "from" domain; so the qemu for the
>>>>>>> "to" domain can't get it, and fails.
>>>>>>>
>>>>>>> Obviously this should fail a lot more gracefully, but that's a bit of
>>>>>>> a lower-priority bug I think.
>>>>>>>
>>>>>>>   -George
>>>>>> Yes, I managed to get to the bottom of it too and got vms migrating on
>>>>>> localhost on our end.
>>>>>>
>>>>>> I can confirm I did get the clock stuck problem while doing
>>>>>> a localhost
>>>>>> migrate.
>>>>> Does the script I posted earlier "work" for you (i.e., does it
>>>>> fail after some number of migrations)?
>>>>>
>>>> I left your script running throughout the night and it seems
>>>> that it does not always catch the problem. I see the following:
>>>>
>>>> 1. vm has the clock stuck
>>>> 2. script is still running as it seems the vm is still ping-able.
>>>> 3. migration fails on the basis that the vm does not ack the
>>>> suspend request (see below).
>>> So I wrote a script to run "date", sleep for 2 seconds, and run
>>> "date" a second time -- and eventually the *sleep* hung.
>>>
>>> The VM is still responsive, and I can log in; if I type "date"
>>> manually successive times then I get an advancing clock, but if I
>>> type "sleep 1" it just hangs.
>>>
>>> If you run "dmesg" in the guest, do you see the following line?
>>>
>>> CE: Reprogramming failure. Giving up
>> I think this must be it; on my other box, I got the following messages:
>>
>> [  224.732083] PM: late freeze of devices complete after 3.787 msecs
>> [  224.736062] Xen HVM callback vector for event delivery is enabled
>> [  224.736062] Xen Platform PCI: I/O protocol version 1
>> [  224.736062] xen: --> irq=8, pirq=16
>> [  224.736062] xen: --> irq=12, pirq=17
>> [  224.736062] xen: --> irq=1, pirq=18
>> [  224.736062] xen: --> irq=6, pirq=19
>> [  224.736062] xen: --> irq=4, pirq=20
>> [  224.736062] xen: --> irq=7, pirq=21
>> [  224.736062] xen: --> irq=28, pirq=22
>> [  224.736062] ata_piix 0000:00:01.1: restoring config space at offset 0x1 (was 0x2800001, writing 0x2800005)
>> [  224.736062] PM: early restore of devices complete after 5.854 msecs
>> [  224.739692] ata_piix 0000:00:01.1: setting latency timer to 64
>> [  224.739782] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
>> [  224.746900] PM: restore of devices complete after 7.540 msecs
>> [  224.758612] Setting capacity to 16777216
>> [  224.758749] Setting capacity to 16777216
>> [  224.898426] ata2.01: NODEV after polling detection
>> [  224.900941] ata2.00: configured for MWDMA2
>> [  231.055978] CE: xen increased min_delta_ns to 150000 nsec
>> [  231.055986] hrtimer: interrupt took 14460 ns
>> [  247.893303] PM: freeze of devices complete after 2.168 msecs
>> [  247.893306] suspending xenstore...
>> [  247.896977] PM: late freeze of devices complete after 3.666 msecs
>> [  247.900067] Xen HVM callback vector for event delivery is enabled
>> [  247.900067] Xen Platform PCI: I/O protocol version 1
>> [  247.900067] xen: --> irq=8, pirq=16
>> [  247.900067] xen: --> irq=12, pirq=17
>> [  247.900067] xen: --> irq=1, pirq=18
>> [  247.900067] xen: --> irq=6, pirq=19
>> [  247.900067] xen: --> irq=4, pirq=20
>> [  247.900067] xen: --> irq=7, pirq=21
>> [  247.900067] xen: --> irq=28, pirq=22
>> [  247.900067] ata_piix 0000:00:01.1: restoring config space at offset 0x1 (was 0x2800001, writing 0x2800005)
>> [  247.900067] PM: early restore of devices complete after 4.612 msecs
>> [  247.906454] ata_piix 0000:00:01.1: setting latency timer to 64
>> [  247.906558] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
>> [  247.914770] PM: restore of devices complete after 8.762 msecs
>> [  247.926557] Setting capacity to 16777216
>> [  247.926661] Setting capacity to 16777216
>> [  248.066661] ata2.01: NODEV after polling detection
>> [  248.067326] CE: xen increased min_delta_ns to 225000 nsec
>> [  248.067344] CE: xen increased min_delta_ns to 337500 nsec
>> [  248.067361] CE: xen increased min_delta_ns to 506250 nsec
>> [  248.067378] CE: xen increased min_delta_ns to 759375 nsec
>> [  248.067396] CE: xen increased min_delta_ns to 1139062 nsec
>> [  248.067413] CE: xen increased min_delta_ns to 1708593 nsec
>> [  248.067428] CE: xen increased min_delta_ns to 2562889 nsec
>> [  248.067441] CE: xen increased min_delta_ns to 3844333 nsec
>> [  248.067453] CE: xen increased min_delta_ns to 4000000 nsec
>> [  248.067466] CE: Reprogramming failure. Giving up
>> [  248.068075] ata2.00: configured for MWDMA2
>>
>> Note the "CE: xen increased min_delta_ns to 150000 nsec" at 231 for
>> the previous suspend, and now it's increasing it up to 4
>> milliseconds before giving up for this suspend.
>>
>> Konrad, Stefano, any idea what's going on here?
> VIRQ_TIMER not being delivered. Aka this commit
>
> bee980d9e9642e96351fa3ca9077b853ecf62f57
> xen/events: Handle VIRQ_TIMER before any other hardirq in event loop.
>
> should be back-ported but hasn't been yet. Let me put that
> on my TODO list.

Konrad,

I don't understand how the VIRQ timer can be the issue.

As far as I can tell, what's happening is this:

1. The kernel asks the Xen timer for an event N ns in the future.

2. The Xen timer code in Linux calculates the current time using the
time values from the shared info page, adds N ns, then asks Xen for an
event to trigger at that time.

3. Unfortunately, that new time is in the past, and Xen returns an error.
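
The three steps above can be sketched as a toy model. This is not the Linux or Xen code: every name below is hypothetical, and the idea that the guest's view of "now" lags the hypervisor's is only an assumption used to show one way step 3 could be reached.

```python
class DeadlineInPast(Exception):
    """Toy stand-in for the error the hypervisor returns in step 3."""


def hypervisor_set_timer(deadline_ns, hypervisor_now_ns):
    # Step 3: a deadline that has already passed is rejected.
    if deadline_ns <= hypervisor_now_ns:
        raise DeadlineInPast(
            f"deadline {deadline_ns} ns is not after now {hypervisor_now_ns} ns"
        )
    return deadline_ns


def guest_set_next_event(delta_ns, guest_now_ns, hypervisor_now_ns):
    # Steps 1 and 2: the guest adds the requested delta to its own idea of
    # "now" and asks the hypervisor for an event at that absolute time.
    deadline_ns = guest_now_ns + delta_ns
    return hypervisor_set_timer(deadline_ns, hypervisor_now_ns)


# Guest and hypervisor agree on the time: the request is accepted.
print(guest_set_next_event(150_000, 1_000_000, 1_000_000))

# Hypothetical case: the guest's "now" lags by more than the delta,
# so the computed deadline is already in the past and is rejected.
try:
    guest_set_next_event(150_000, 1_000_000, 2_000_000)
except DeadlineInPast as err:
    print("rejected:", err)
```

In this model the request fails whenever the guest's clock is behind by more than the requested delta, no matter how the event would later be delivered.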

So how would VIRQ_TIMER not being delivered cause the calculation to
come up with a time in the past?

  -George
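
For reference, the min_delta_ns values in the quoted dmesg output are consistent with a simple rule: grow the value by half on each failed attempt, clamp it at 4,000,000 ns, and give up once the clamp has been reached. The sketch below reproduces that progression. The rule and the limit are assumptions inferred from the log (4 ms would correspond to one tick at HZ=250), and the function name is hypothetical.

```python
def min_delta_steps(start_ns, limit_ns):
    """Return successive min_delta_ns values until the limit is reached."""
    steps = []
    current = start_ns
    while current < limit_ns:
        # Grow by half (integer shift), then clamp to the assumed limit.
        current += current >> 1
        current = min(current, limit_ns)
        steps.append(current)
    return steps


# Start from the 150000 ns value left over from the previous suspend.
for value in min_delta_steps(150_000, 4_000_000):
    print(f"CE: xen increased min_delta_ns to {value} nsec")
```

Starting from 150000 ns, this yields the same nine values as the log lines at 248.067326 through 248.067453, after which the log reports "CE: Reprogramming failure. Giving up".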
