From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 6 Nov 2008 10:12:06 +0200
From: Gleb Natapov
To: qemu-devel@nongnu.org
Reply-To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
Message-ID: <20081106081206.GD3820@redhat.com>
In-Reply-To: <4911CD42.2040803@codemonkey.ws>
References: <20081029152236.14831.15193.stgit@dhcp-1-237.local>
 <490B59BF.3000205@codemonkey.ws>
 <20081102130441.GD16809@redhat.com>
 <4911CD42.2040803@codemonkey.ws>
List-Id: qemu-devel.nongnu.org

On Wed, Nov 05, 2008 at 10:43:46AM -0600, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> So? I am raising them now. Have you tried the suggested scenario, and
>> were you able to reproduce the problem?
>
> Sorry, I mistyped. I meant to say, I don't think any of the problems
> raised when this was initially posted have been addressed. Namely, I
> asked for hard data on how much this helped things

Does my answer to Dor's mail provide enough data?

> and Paul complained that this fix only fixed things partially, and was
> very invasive to other architectures.

Paul complained that the initial version of the patch didn't compile on
non-x86, and I fixed that a long time ago. I don't remember him
complaining about the invasiveness of the patch, but there is nothing
that can be done about it. The patch changes a core API function that
is used all over the code, so every place that uses the old function
definition has to be amended (see the sketch below). The changes are
trivial, BTW. Maybe it is possible to add another API with the new
semantics and change the qemu_irq subsystem to use both in parallel,
but I personally don't like that. It is like the Microsoft approach of
adding new APIs instead of fixing old ones in order to maintain
backwards compatibility. QEMU doesn't have a backwards-compatibility
problem, so why use the inferior approach?

> Basically, there are two hurdles to overcome here. The first is that I
> don't think it's overwhelmingly obvious that this is the correct
> solution in all scenarios. We need to understand the scenarios it
> helps and by how much. We then probably need to make sure to limit
> this operation to those specific scenarios.

I think we can all agree that this is _not_ the correct solution in all
scenarios. For instance, the icount approach works for pure QEMU and
looks like a much cleaner approach; that is why interrupt de-coalescing
is disabled when icount is in use. We can have interrupt de-coalescing
disabled by default and enable it only with a command line option.
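
To make the API change mentioned above concrete, here is roughly its
shape. This is only a sketch: the stand-in types, the enum, and the
return convention are illustrative, not the actual patch.

#include <stdint.h>

/* Sketch only. The point is that the IRQ delivery path reports whether
 * the interrupt was actually delivered or was coalesced with one still
 * pending, so periodic timer models can account for missed ticks. */
typedef struct IRQState *qemu_irq;  /* stand-in for QEMU's opaque handle */

typedef enum {
    IRQ_DELIVERED,  /* interrupt reached the guest */
    IRQ_COALESCED   /* previous one still pending; this tick was lost */
} irq_status_t;

/* Old API:  void qemu_set_irq(qemu_irq irq, int level);
 * New API (sketch): same call sites, but now with a return value. */
irq_status_t qemu_set_irq(qemu_irq irq, int level);

/* A periodic timer model would then do something like: */
struct pit_channel {
    qemu_irq irq;
    int64_t  debt_ns;   /* guest time owed because of coalesced ticks */
};

static void pit_tick(struct pit_channel *s, int64_t period_ns)
{
    if (qemu_set_irq(s->irq, 1) == IRQ_COALESCED)
        s->debt_ns += period_ns;    /* one full tick period was lost */
}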
> The second is that this is not how hardware behaves normally. This
> makes it undesirable from an architectural perspective. If it's
> necessary, we need to find a way to minimize its impact in much the
> way -win2k-hack's impact is minimized.
>
>> The time drift is eliminated. If there is a spike in the load, time
>> may slow down, but after that it catches up (this happens only during
>> very high loads, though).
>
> How bad is time drift without it? Under workload X, we lose N seconds
> per Y hours, and with this patch, under the same workload, we lose M
> seconds per Y hours, and M << N.

See my reply to Dor, please.

> I strongly, strongly doubt that you'll be eliminating drift 100%. And
> please describe workload X in such a way that it is 100% reproducible.
> If you're using a multimedia file to do this, please provide a link to
> obtain the multimedia file.

I found a much simpler way to reproduce the problem, no multimedia
involved: just copy the Windows folder somewhere. The patch should
eliminate time drift 100%, assuming that on average the guest has
enough time to run.

>>> How does having a high resolution timer in the host affect the
>>> problem to begin with?
>>
>> My test machine has a relatively recent kernel that uses high
>> resolution timers for time keeping. Besides, the problem is that the
>> guest does not receive enough time to process the injected
>> interrupts. How can hr timers help here?
>
> If the host can awaken QEMU 1024 times a second and QEMU can deliver a
> timer interrupt each time, there is no need for time drift fixing.

It is not enough to wake the QEMU process 1024 times a second to signal
a timer interrupt. The guest should also have enough time to run
between interrupts to process each of them. If QEMU signals 1024 timer
interrupts a second and the guest processes only half of them, you'll
see time drift. And if the guest asks for a 1024Hz frequency and the
host can't provide that, no solution for time drift exists at all.

> I would think that with high res timers on the host, you would have to
> put the host under heavy load before drift began occurring.

I see time drift even with the guest using 100Hz timers, and I am
pretty sure my 2.6.25 host uses hr timers.

>>> How do Linux guests behave with this?
>>
>> Linux guests don't use the PIT or RTC for time keeping. They are
>> completely unaffected by these patches.
>
> They certainly can, under the right circumstances.

I know they can, but by default they are smarter than that.

>>> Even the Windows PV spec calls out three separate approaches to
>>> dealing with missed interrupts and provides an interface for the
>>> host to query the guest as to which one should be used. I don't
>>> think any solution that uses a single technique is going to be
>>> correct.
>>
>> This is what I found in the Microsoft docs:
>>
>> If a virtual processor is unavailable for a sufficiently long period
>> of time, a full timer period may be missed. In this case, the
>> hypervisor uses one of two techniques. The first technique involves
>> timer period modulation, in effect shortening the period until the
>> timer "catches up".
>>
>> If a significant number of timer signals have been missed, the
>> hypervisor may be unable to compensate by using period modulation. In
>> this case, some timer expiration signals may be skipped completely.
>>
>> For timers that are marked as lazy, the hypervisor uses a second
>> technique for dealing with the situation in which a virtual processor
>> is unavailable for a long period of time. In this case, the timer
>> signal is deferred until this virtual processor is available. If it
>> doesn't become available until shortly before the next timer is due
>> to expire, it is skipped entirely.
>>
>> The first technique is what I am trying to introduce with this patch
>> series.
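
For the record, the catch-up logic boils down to something like the
following. This is a sketch with made-up names, continuing the
pit_channel struct and debt_ns counter from the earlier sketch, not the
patch itself:

/* Period modulation sketch: while the guest owes us time because ticks
 * were coalesced, fire the timer at half the nominal period. Each
 * shortened tick advances the guest clock by a full period while only
 * half a period of real time passes, so half a period of debt is
 * repaid per tick. Back to the nominal period once the debt is gone. */
static int64_t pit_next_period(struct pit_channel *s, int64_t period_ns)
{
    if (s->debt_ns > 0) {
        int64_t shortened = period_ns / 2;
        s->debt_ns -= period_ns - shortened;  /* real time clawed back */
        return shortened;
    }
    return period_ns;
}

The 2x catch-up rate here is an arbitrary policy choice for the sketch;
the real trade-off is between how fast the guest catches up and how
much extra interrupt load it can absorb while already overloaded.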
> There is a third technique whereby the hypervisor is supposed to
> modulate the delivery of missed ticks by ensuring an even distribution
> of them across the next few time slices.

The third technique you describe looks to me exactly like the first one
in the text I quoted. My patches implement that approach for the PIT.
The RTC injects a missed interrupt as soon as the previous one is
acknowledged.

> The windows guest is supposed to be able to tell the hypervisor which
> technique it should be using.

Do you have a pointer to documentation that describes this? I am
especially interested in whether old guests like Windows XP and Windows
2003 support it. Surprisingly, many people are interested in those old
guests, not the shiny newer ones :)

--
			Gleb.