From mboxrd@z Thu Jan  1 00:00:00 1970
From: George Dunlap <george.dunlap@eu.citrix.com>
Subject: Re: HVM Migration of domU on Qemu-upstream DM causes
 stuck system clock with ACPI
Date: Mon, 3 Jun 2013 11:25:08 +0100
Message-ID: <51AC6F04.9030501@eu.citrix.com>
References: <1717491994.10371605.1369131737226.JavaMail.root@zimbra002>
	<519B50C9.1000008@citrix.com> <519B577E.6070200@flexiant.com>
	<519B6D51.2060508@citrix.com>
	<951B3441BAE2324286D3AA6D@Ximines.local>
	<CAFLBxZbBz-vKSd9KHA9uLahk7=L5GrDAiWNzem+0PvCK8SmpNA@mail.gmail.com>
	<420439EA40B15FCBFDFF2BE3@nimrod.local>
	<1369557503.22605.11.camel@dagon.hellion.org.uk>
	<51A4C7EB.1010406@flexiant.com>
	<CAFLBxZYbzuhR3SK6dw0xsuF7QPG164h-0bJ+n+xEDKvzbYoHzw@mail.gmail.com>
	<51A7767A.9030904@flexiant.com> <51A7791C.2020208@eu.citrix.com>
	<51A8608F.9000302@flexiant.com> <51A88151.3080001@eu.citrix.com>
	<0FE70400-1152-45F5-9BF9-973DF1DA9EE8@flexiant.com>
	<BFF4FE32-0B18-4429-A7C7-C3BD0021F11A@flexiant.com>
	<51A88E3E.5090208@eu.citrix.com>
	<A9BDBB961CCE37B70FCF175A@nimrod.local>
	<1370004031.5199.133.camel@zakaz.uk.xensource.com>
	<51A8A0AC.1030301@eu.citrix.com> <51A8BD48.6060104@citrix.com>
	<51AC55DD.7000507@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <51AC55DD.7000507@citrix.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, xen-devel@lists.xen.org, David Vrabel <david.vrabel@citrix.com>, Alex Bligh <alex@alex.org.uk>, Anthony PERARD <anthony.perard@citrix.com>, Diana Crisan <dcrisan@flexiant.com>
List-Id: xen-devel@lists.xenproject.org

On 03/06/13 09:37, Roger Pau Monné wrote:
> On 31/05/13 17:10, Roger Pau Monné wrote:
>> On 31/05/13 15:07, George Dunlap wrote:
>>> On 31/05/13 13:40, Ian Campbell wrote:
>>>> On Fri, 2013-05-31 at 12:57 +0100, Alex Bligh wrote:
>>>>> --On 31 May 2013 12:49:18 +0100 George Dunlap
>>>>> <george.dunlap@eu.citrix.com>
>>>>> wrote:
>>>>>
>>>>>> No -- Linux is asking, "Can you give me an alarm in 5ns?"  And Xen is
>>>>>> saying, "No".  So Linux is saying, "OK, how about 5us?  10us?
>>>>>> 20us?"  By
>>>>>> the time it reaches 4ms, Linux has had enough, and says, "If this timer
>>>>>> is so bad that it can't give me an event within 4ms it just won't use
>>>>>> timers at all, thank you very much."
>>>>>>
>>>>>> The problem appears to be that Linux thinks it's asking for
>>>>>> something in
>>>>>> the future, but is actually asking for something in the past.  It must
>>>>>> look at its watch just before the final domain pause, and then asks for
>>>>>> the time just after the migration resumes on the other side.  So it
>>>>>> doesn't realize that 10ms (or something) has already passed, and that
>>>>>> it's actually asking for a timer in the past.  The Xen timer driver in
>>>>>> Linux specifically asks Xen for times set in the past to return an
>>>>>> error.
>>>>>> Xen is returning an error because the time is in the past, Linux thinks
>>>>>> it's getting an error because the time is too close in the future and
>>>>>> tries asking a little further away.
>>>>>>
>>>>>> Unfortunately I think this is something which needs to be fixed on the
>>>>>> Linux side; I don't really see how we can work around it in Xen.
>>>>> I don't think fixing it only on the Linux side is a great idea, not
>>>>> least
>>>>> as it makes any current Linux image not live migrateable reliably.
>>>>> That's
>>>>> pretty horrible.
>>>> Ultimately though a guest bug is a guest bug, we don't really want to be
>>>> filling the hypervisor with lots of quirky exceptions to interfaces in
>>>> order to work around them, otherwise where does it end?
>>>>
>>>> A kernel side fix can be pushed to the distros fairly aggressively (it's
>>>> mostly just a case of getting an upstream stable backport then filing
>>>> bugs with the main ones, we've done it before) and for users upgrading
>>>> the kernel via the distros is really not so hard and mostly reuses the
>>>> process they must have in place for guest kernel security updates and
>>>> other important kernel bugs anyway.
>>> In any case, it seems I was wrong -- Linux does "look at its watch"
>>> every time it asks.
>>>
>>> The generic timer interface is "set me a timer N nanoseconds in the
>>> future"; the Xen timer implementation executes
>>> pvclock_clocksource_read() and adds the delta.  So it may well actually
>>> be a bug in Xen.
>>>
>>> Stand by for further investigation...
> I've been investigating further during the weekend, and although I'm not
> familiar with the timer code in Xen, I think the problem comes from the
> fact that in __update_vcpu_system_time when Xen detects that the guest
> is using a vtsc it adds offsets to the time passed to the guest, while
> in VCPUOP_set_singleshot_timer Xen compares the time passed from the
> guest using NOW(), which is just the Xen uptime, without taking into
> account any offsets.

All the code is really complicated, but it seems like the offset is
added because the offset is *subtracted* by the hardware when the HVM
guest does an RDTSC instruction -- and subtracted in a different way by
Xen when emulating the RDTSC instruction, if you've set tsc_mode to
"always_emulate".

Just to test some of this stuff, I set the TSC mode to "always_emulate",
and it has the exact same effect -- even though "always_emulate" will
emulate a 1GHz clock.
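
To make the suspected mismatch concrete, here is a toy model of it.
This is only a sketch of the theory quoted above, not Xen or Linux
code: the function names, the sign and size of the offset, and the
doubling back-off from 5us up to a 4ms limit are all assumptions made
for illustration.

```python
# Toy model only: every name and number here is hypothetical.
NS_PER_MS = 1_000_000


def guest_visible_time(xen_uptime_ns, guest_offset_ns):
    """Time the guest reads: Xen uptime plus whatever offset Xen applies."""
    return xen_uptime_ns + guest_offset_ns


def set_singleshot_timer(deadline_ns, xen_uptime_ns, compare_offset_ns):
    """Accept the timer only if the deadline is in the future.

    compare_offset_ns=0 models comparing against raw uptime; passing the
    guest's offset models a comparison done in the guest's time base.
    """
    return deadline_ns > xen_uptime_ns + compare_offset_ns


def program_with_backoff(xen_uptime_ns, guest_offset_ns, compare_offset_ns,
                         first_delta_ns=5_000, give_up_ns=4 * NS_PER_MS):
    """Retry with a doubled delta until accepted; None means 'gave up'."""
    delta = first_delta_ns
    while delta <= give_up_ns:
        # The guest re-reads its clock on every attempt, then adds delta.
        deadline = guest_visible_time(xen_uptime_ns, guest_offset_ns) + delta
        if set_singleshot_timer(deadline, xen_uptime_ns, compare_offset_ns):
            return delta
        delta *= 2
    return None


uptime = 100 * 1_000 * NS_PER_MS   # hypothetical host uptime: 100 s
offset = -10 * NS_PER_MS           # hypothetical guest offset: -10 ms

# Offset ignored in the comparison: every retry looks like the past.
print(program_with_backoff(uptime, offset, 0))
# Offset applied on both sides: the first request is accepted.
print(program_with_backoff(uptime, offset, offset))
```

In this model, re-reading the clock before each retry does not help the
guest, because the skew is in the time base rather than in a stale
reading; the retries only succeed once the same offset is applied on
both sides of the comparison.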

> This only happens after migration because Xen automatically switches to
> vtsc when it detects that the guest has been migrated. I'm currently
> setting up a Linux PVHVM on shared storage to perform some testing, but
> one possible solution might be to add tsc_mode="native_paravirt" to the
> PVHVM config file, and another one would be fixing
> VCPUOP_set_singleshot_timer to take into account the vtsc offsets and
> correctly translate the time passed from the guest.

So have you tested it with native_paravirt?  Does it work around the
problem?

  -George