From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alex Bligh <alex@alex.org.uk>
Subject: Re: HVM Migration of domU on Qemu-upstream DM causes
 stuck system clock with ACPI
Date: Fri, 31 May 2013 16:18:39 +0100
Message-ID: <37A4010683390E8CBA67A8BE@nimrod.local>
References: <1717491994.10371605.1369131737226.JavaMail.root@zimbra002>		
	<519B50C9.1000008@citrix.com> <519B577E.6070200@flexiant.com>		
	<519B6D51.2060508@citrix.com>
	<951B3441BAE2324286D3AA6D@Ximines.local>		
	<CAFLBxZbBz-vKSd9KHA9uLahk7=L5GrDAiWNzem+0PvCK8SmpNA@mail.gmail.com>		
	<420439EA40B15FCBFDFF2BE3@nimrod.local>		
	<1369557503.22605.11.camel@dagon.hellion.org.uk>		
	<51A4C7EB.1010406@flexiant.com>		
	<CAFLBxZYbzuhR3SK6dw0xsuF7QPG164h-0bJ+n+xEDKvzbYoHzw@mail.gmail.com>		
	<51A7767A.9030904@flexiant.com> <51A7791C.2020208@eu.citrix.com>		
	<51A8608F.9000302@flexiant.com> <51A88151.3080001@eu.citrix.com>		
	<0FE70400-1152-45F5-9BF9-973DF1DA9EE8@flexiant.com>		
	<BFF4FE32-0B18-4429-A7C7-C3BD0021F11A@flexiant.com>		
	<51A88E3E.5090208@eu.citrix.com>
	<A9BDBB961CCE37B70FCF175A@nimrod.local>	
	<1370004031.5199.133.camel@zakaz.uk.xensource.com>	
	<3C61B2368D479E44F6D5FACE@nimrod.local>
	<1370010963.5199.184.camel@zakaz.uk.xensource.com>
Reply-To: Alex Bligh <alex@alex.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <1370010963.5199.184.camel@zakaz.uk.xensource.com>
Content-Disposition: inline
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Anthony@alex.org.uk, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, George Dunlap <george.dunlap@eu.citrix.com>, xen-devel@lists.xen.org, David Vrabel <david.vrabel@citrix.com>, Alex Bligh <alex@alex.org.uk>, PERARD <anthony.perard@citrix.com>, Diana Crisan <dcrisan@flexiant.com>
List-Id: xen-devel@lists.xenproject.org

Ian,

> There's no such thing as a "migration" on physical hardware and a
> save/restore etc is under kernel control so it knows not to cache timer
> values etc.

Indeed, so it's the live migrate which is causing it!

>> If that's correct, and I've understood what George said, then
>> I /think/ the only quirky fix that needs doing is to change
>> the API between kernel driver and xen so that 'don't give me a time
>> in the past' means 'don't give me a time in the past unless you've
>> just done a live migrate'.
>
> What does "just" mean here? How do you determine it?

I'd suggest whatever time interval is required to resync. If you said
1 second, for instance, that would be a bodge, but would presumably
work unless the clocks were out by more than a second.
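
To make the suggestion concrete, here is a rough sketch of the check I
have in mind. It is Python purely for illustration and is not Xen code:
the class, the method names and the nanosecond clock are all made up,
and the 1 second window is just the example figure above. It only
decides whether a deadline is accepted; what happens to an accepted
deadline that is already in the past is left out.

```python
from dataclasses import dataclass
from typing import Optional

NS_PER_SEC = 1_000_000_000


@dataclass
class TimerPolicy:
    """Hypothetical model of the proposed rule; names are illustrative only."""

    grace_ns: int = NS_PER_SEC             # the "1 second" bodge
    last_migrate_ns: Optional[int] = None  # None until a live migrate happens

    def note_migration(self, now_ns: int) -> None:
        # Record when the guest arrived on the new host.
        self.last_migrate_ns = now_ns

    def accept_deadline(self, deadline_ns: int, now_ns: int) -> bool:
        # A deadline that is not in the past is always accepted.
        if deadline_ns >= now_ns:
            return True
        # A deadline in the past is rejected unless a live migrate has
        # happened within the grace window.
        if self.last_migrate_ns is None:
            return False
        return now_ns - self.last_migrate_ns <= self.grace_ns
```

Once the window has expired the behaviour is exactly what it is today,
so the exception is confined to the resync period.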

> I said "filling the hypervisor with lots of quirky exceptions", this is
> just one and in isolation maybe it isn't too bad. Now imagine we'd
> accumulated a dozen over the last 10 years, the semantics of our timer
> operation would be impossible to understand, do this unless A, otherwise
> if not B do something else, etc etc.
>
>>  If you really want giving a time in the
>> past to error under some circumstances, you can signal that another
>> way ('really don't give me a time in the past').
>
> That would be changing the behaviour of an existing ABI AFAICT, which is
> right out -- what if some other guest is relying on the current
> behaviour?

Well, Linux is sort of relying on it, so we might fix those guests too :-)

I suppose the result would be that if anyone relied on the failure of
the timer event in the one second following migration, then sometimes
that failure would not happen.

> But in any case until George (or someone else) has actually diagnosed
> what is going on this entire discussion is premature.
>
>>  Yes, it would be lovely if everyone always applied the latest
>> patches to their kernel and rebooted, but they don't.
>>
>> Otherwise the net result will be Xen4.3 does not reliably live migrate
>> a pile of Linux OS's unless running with a patched kernel. That is not
>> a great conclusion.
>
> Are you saying this didn't happen with Xen 4.2 and earlier? That would
> tend to lean towards this being a Xen bug.

It happens in 4.2.

We did not see it in 4.1, but we have not retested 4.1 as comprehensively.
In 4.1 we were also using a different device model (if that's relevant).

-- 
Alex Bligh