From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roger Pau Monné
Subject: Re: HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
Date: Fri, 31 May 2013 17:10:00 +0200
Message-ID: <51A8BD48.6060104@citrix.com>
References: <1717491994.10371605.1369131737226.JavaMail.root@zimbra002>
 <519B50C9.1000008@citrix.com>
 <519B577E.6070200@flexiant.com>
 <519B6D51.2060508@citrix.com>
 <951B3441BAE2324286D3AA6D@Ximines.local>
 <420439EA40B15FCBFDFF2BE3@nimrod.local>
 <1369557503.22605.11.camel@dagon.hellion.org.uk>
 <51A4C7EB.1010406@flexiant.com>
 <51A7767A.9030904@flexiant.com>
 <51A7791C.2020208@eu.citrix.com>
 <51A8608F.9000302@flexiant.com>
 <51A88151.3080001@eu.citrix.com>
 <0FE70400-1152-45F5-9BF9-973DF1DA9EE8@flexiant.com>
 <51A88E3E.5090208@eu.citrix.com>
 <1370004031.5199.133.camel@zakaz.uk.xensource.com>
 <51A8A0AC.1030301@eu.citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <51A8A0AC.1030301@eu.citrix.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: George Dunlap
Cc: Ian Campbell, Konrad Rzeszutek Wilk, xen-devel@lists.xen.org,
 David Vrabel, Alex Bligh, Anthony PERARD, Diana Crisan
List-Id: xen-devel@lists.xenproject.org

On 31/05/13 15:07, George Dunlap wrote:
> On 31/05/13 13:40, Ian Campbell wrote:
>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
>>> --On 31 May 2013 12:49:18 +0100 George Dunlap wrote:
>>>
>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?" And Xen is
>>>> saying, "No". So Linux is saying, "OK, how about 5us? 10us? 20us?" By
>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
>>>> is so bad that it can't give me an event within 4ms it just won't use
>>>> timers at all, thank you very much."
>>>>
>>>> The problem appears to be that Linux thinks it's asking for something
>>>> in the future, but is actually asking for something in the past. It
>>>> must look at its watch just before the final domain pause, and then
>>>> asks for the time just after the migration resumes on the other side.
>>>> So it doesn't realize that 10ms (or something) has already passed, and
>>>> that it's actually asking for a timer in the past. The Xen timer
>>>> driver in Linux specifically asks Xen for times set in the past to
>>>> return an error. Xen is returning an error because the time is in the
>>>> past, Linux thinks it's getting an error because the time is too close
>>>> in the future and tries asking a little further away.
>>>>
>>>> Unfortunately I think this is something which needs to be fixed on the
>>>> Linux side; I don't really see how we can work around it in Xen.
>>> I don't think fixing it only on the Linux side is a great idea, not
>>> least as it makes any current Linux image not live migrateable
>>> reliably. That's pretty horrible.
>> Ultimately though a guest bug is a guest bug, we don't really want to be
>> filling the hypervisor with lots of quirky exceptions to interfaces in
>> order to work around them, otherwise where does it end?
>>
>> A kernel side fix can be pushed to the distros fairly aggressively (it's
>> mostly just a case of getting an upstream stable backport then filing
>> bugs with the main ones, we've done it before) and for users upgrading
>> the kernel via the distros is really not so hard and mostly reuses the
>> process they must have in place for guest kernel security updates and
>> other important kernel bugs anyway.
>
> In any case, it seems I was wrong -- Linux does "look at its watch"
> every time it asks.
>
> The generic timer interface is "set me a timer N nanoseconds in the
> future"; the Xen timer implementation executes
> pvclock_clocksource_read() and adds the delta.
> So it may well actually be a bug in Xen.
>
> Stand by for further investigation...

I've also seen this on FreeBSD PVHVM when doing live migration, which
also uses the single-shot timer. It seems like the values in
vcpu_info->time are not updated as often as they should be after the
migration. I've implemented a back-off mechanism to cope with that, but
this clearly looks like a bug in Xen.
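
To illustrate the kind of back-off meant here, a rough standalone sketch
follows. It is not the actual FreeBSD code. The names fake_set_singleshot,
arm_with_backoff, real_now_ns and stale_now_ns are made up for the example,
and the "hypervisor" is just a stub that rejects deadlines that are already
in the past. The guest computes its deadline from a stale clock reading, so
short deltas are rejected until the doubled delta finally lands in the
future.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two views of time; not Xen's interface. */
static uint64_t real_now_ns;   /* time as the (stubbed) hypervisor sees it */
static uint64_t stale_now_ns;  /* time the guest reads from its stale copy */

/* Stub single-shot timer: refuses any deadline that is not in the future. */
static int fake_set_singleshot(uint64_t deadline_ns)
{
    return deadline_ns <= real_now_ns ? -1 : 0;
}

/*
 * Try to arm the timer delta_ns after the guest's view of "now".
 * On rejection, double the delta and retry, for at most max_tries attempts.
 * Returns the delta that was accepted, or 0 if every attempt was rejected.
 */
static uint64_t arm_with_backoff(uint64_t delta_ns, int max_tries)
{
    assert(delta_ns > 0 && max_tries > 0);

    for (int i = 0; i < max_tries; i++) {
        uint64_t deadline_ns = stale_now_ns + delta_ns;

        if (fake_set_singleshot(deadline_ns) == 0)
            return delta_ns;

        delta_ns *= 2;  /* ask for something further away next time */
    }
    return 0;
}
```

With the guest clock 10ms behind and an initial delta of 1ms, the 1, 2, 4
and 8ms requests are rejected and the 16ms one is accepted. This only hides
the symptom; it does nothing about the stale time values themselves.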