From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
Date: Fri, 31 May 2013 11:54:09 +0100
Message-ID: <51A88151.3080001@eu.citrix.com>
References: <1717491994.10371605.1369131737226.JavaMail.root@zimbra002> <519B50C9.1000008@citrix.com> <519B577E.6070200@flexiant.com> <519B6D51.2060508@citrix.com> <951B3441BAE2324286D3AA6D@Ximines.local> <420439EA40B15FCBFDFF2BE3@nimrod.local> <1369557503.22605.11.camel@dagon.hellion.org.uk> <51A4C7EB.1010406@flexiant.com> <51A7767A.9030904@flexiant.com> <51A7791C.2020208@eu.citrix.com> <51A8608F.9000302@flexiant.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <51A8608F.9000302@flexiant.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Diana Crisan
Cc: Ian Campbell, Konrad Rzeszutek Wilk, "xen-devel@lists.xen.org", David Vrabel, Alex Bligh, Anthony PERARD
List-Id: xen-devel@lists.xenproject.org

On 31/05/13 09:34, Diana Crisan wrote:
> George,
> On 30/05/13 17:06, George Dunlap wrote:
>> On 05/30/2013 04:55 PM, Diana Crisan wrote:
>>> On 30/05/13 16:26, George Dunlap wrote:
>>>> On Tue, May 28, 2013 at 4:06 PM, Diana Crisan wrote:
>>>>> Hi,
>>>>>
>>>>> On 26/05/13 09:38, Ian Campbell wrote:
>>>>>> On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
>>>>>>> George,
>>>>>>>
>>>>>>> --On 24 May 2013 17:16:07 +0100 George Dunlap wrote:
>>>>>>>
>>>>>>>>> FWIW it's reproducible on every host h/w platform we've tried
>>>>>>>>> (a total of 2).
>>>>>>>> Do you see the same effects if you do a local-host migrate?
>>>>>>> I hadn't even realised that was possible. That would have made
>>>>>>> testing live migrate easier!
>>>>>> That's basically the whole reason it is supported ;-)
>>>>>>
>>>>>>> How do you avoid the name clash in xenstore?
>>>>>> Most toolstacks receive the incoming migration into a domain named
>>>>>> FOO-incoming or some such, and then rename it to FOO upon
>>>>>> completion. Some also rename the outgoing domain "FOO-migratedaway"
>>>>>> towards the end, so that the bits of the final teardown which can
>>>>>> safely happen after the target has started can be done then.
>>>>>>
>>>>>> Ian.
>>>>>>
>>>>> I am unsure what I am doing wrong, but I cannot seem to do a
>>>>> localhost migrate.
>>>>>
>>>>> I created a domU using "xl create xl.conf" and once it had fully
>>>>> booted I issued "xl migrate 11 localhost". This fails and gives the
>>>>> output below.
>>>>>
>>>>> Would you please advise on how to get this working?
>>>>>
>>>>> Thanks,
>>>>> Diana
>>>>>
>>>>> root@ubuntu:~# xl migrate 11 localhost
>>>>> root@localhost's password:
>>>>> migration target: Ready to receive domain.
>>>>> Saving to migration stream new xl format (info 0x0/0x0/2344)
>>>>> Loading new save file (new xl fmt info 0x0/0x0/2344)
>>>>> Savefile contains xl domain config
>>>>> xc: progress: Reloading memory pages: 53248/1048575 5%
>>>>> xc: progress: Reloading memory pages: 105472/1048575 10%
>>>>> libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
>>>>> device model: spawn failed (rc=-3)
>>>>> libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
>>>>> model did not start: -3
>>>>> libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device
>>>>> Model already exited
>>>>> migration target: Domain creation failed (code -3).
>>>>> libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
>>>>> truncated reading ready message from migration receiver stream
>>>>> libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus:
>>>>> migration target process [10934] exited with error status 3
>>>>> Migration failed, resuming at sender.
>>>>> xc: error: Cannot resume uncooperative HVM guests: Internal error
>>>>> libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
>>>>> failed for domain 11: Success
>>>> Aha -- I managed to reproduce this one as well.
>>>>
>>>> Your problem is the "vncunused=0" -- that's instructing qemu "you
>>>> must use this exact port for the vnc server". But when you do the
>>>> migrate, that port is still in use by the "from" domain, so the qemu
>>>> for the "to" domain can't get it, and fails.
>>>>
>>>> Obviously this should fail a lot more gracefully, but that's a
>>>> lower-priority bug, I think.
>>>>
>>>>  -George
>>> Yes, I managed to get to the bottom of it too, and got VMs migrating
>>> on localhost on our end.
>>>
>>> I can confirm I did get the stuck-clock problem while doing a
>>> localhost migrate.
>>
>> Does the script I posted earlier "work" for you (i.e., does it fail
>> after some number of migrations)?
>>
>
> I left your script running throughout the night and it seems that it
> does not always catch the problem. I see the following:
>
> 1. The vm has its clock stuck.
> 2. The script is still running, as the vm is still ping-able.
> 3. The migration fails on the basis that the vm does not ack the
> suspend request (see below).

So I wrote a script to run "date", sleep for 2 seconds, and run "date" a
second time -- and eventually the *sleep* hung. The VM is still
responsive, and I can log in; if I run "date" manually several times in
a row I get an advancing clock, but if I type "sleep 1" it just hangs.

If you run "dmesg" in the guest, do you see the following line?

CE: Reprogramming failure. Giving up

 -George
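
[Editor's note: for readers who hit the same device-model spawn failure
on a localhost migrate, the fix discussed in the thread is to stop
pinning the VNC port. A minimal, illustrative xl guest-config fragment
(the rest of the guest settings are whatever you already use):]

```
# Illustrative xl config fragment: with vncunused=1, qemu picks any
# free VNC port, so the "to" domain of a localhost migration can start
# its own VNC server while the "from" domain still holds the old port.
# (vncunused=0 is what caused the spawn failure described above.)
vnc = 1
vncunused = 1
```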
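
[Editor's note: the date/sleep probe George describes could be sketched
roughly as below; the iteration count is a placeholder -- the original
presumably looped until the failure appeared.]

```shell
#!/bin/sh
# Sketch of the stuck-clock probe: print the epoch time, sleep 2
# seconds, print it again, and report the delta.  On a healthy guest
# the delta tracks the sleep; on a wedged guest (one logging
# "CE: Reprogramming failure." in dmesg) the sleep itself never returns.
for i in 1 2 3; do
    t1=$(date +%s)
    sleep 2
    t2=$(date +%s)
    echo "iteration $i: delta=$((t2 - t1))s"
done
```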