From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity <avi@redhat.com>
Subject: Re: gettimeofday "slow" in RHEL4 guests
Date: Mon, 29 Dec 2008 15:11:54 +0200
Message-ID: <4958CC9A.5050008@redhat.com>
References: <492AE8AC.2090502@cisco.com> <492B8204.5@cisco.com> <87d4gkcdsy.fsf@basil.nowhere.org> <426B9829-823B-40BE-9A7E-9F7EF2ED3412@suse.de> <20081125114815.GG6703@one.firstfloor.org> <CC19AD88-0576-47B2-AC76-A51F317964BD@suse.de> <20081125125259.GH6703@one.firstfloor.org> <20081228183807.GA3883@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Yang, Sheng" <sheng.yang@intel.com>,
	Alexander Graf <agraf@suse.de>,
	"David S. Ahern" <daahern@cisco.com>,
	kvm-devel <kvm@vger.kernel.org>,
	Glauber de Oliveira Costa <gcosta@redhat.com>,
	Gleb Natapov <gleb@redhat.com>,
	Dor Laor <dor.laor@qumranet.com>
To: Marcelo Tosatti <mtosatti@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx2.redhat.com ([66.187.237.31]:50979 "EHLO mx2.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751429AbYL2NLr (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 29 Dec 2008 08:11:47 -0500
In-Reply-To: <20081228183807.GA3883@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Marcelo Tosatti wrote:
> The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
> The algorithm uses the PIT count (latched) to measure the delay between
> interrupt generation and handling, and sums that value, on the next
> interrupt, to the TSC delta.
>
> Sheng investigated this problem in the discussions before in-kernel PIT
> was merged:
>
> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg13873.html
>
> The algorithm overcompensates for lost ticks and the guest time runs
> faster than the host's.
>
> There are two issues:
>
> 1) A bug in the in-kernel PIT which miscalculates the count value.
>
> 2) For the case where more than one interrupt is lost, and later
> reinjected, the value read from PIT count is meaningless for the purpose
> of the tsc algorithm. The count is interpreted as the delay until the
> next interrupt, which is not the case with reinjection.
>
> As Sheng mentioned in the thread above, Xen pulls back the TSC value
> when reinjecting interrupts. VMWare ESX has a notion of "virtual TSC",
> which I believe is similar in this context.
>
> For KVM I believe the best immediate solution (for now) is to provide an
> option to disable reinjection, behaving similarly to real hardware. The
> advantage is simplicity compared to virtualizing the time sources.
>
> The QEMU PIT emulation has a limit on the rate of interrupt reinjection,
> perhaps something similar should be investigated in the future.
>
> The following patch (which contains the bugfix for 1) and disabled
> reinjection) fixes the severe time drift on RHEL4 with "clock=tsc".
> What I'm proposing is to condition reinjection with an option
> (-kvm-pit-no-reinject or something).
>
> Comments or better ideas?
>
>
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index e665d1c..608af7b 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>  	if (!atomic_inc_and_test(&pt->pending))
>  		set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
>  
> +	if (atomic_read(&pt->pending) > 1)
> +		atomic_set(&pt->pending, 1);
> +
>   

Replace the atomic_inc_and_test() with atomic_set(&pt->pending, 1)
instead? One less test, and more importantly, the logic is less
scattered around the source.
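
To illustrate the difference, here is a rough userspace sketch, not the
i8254.c code. The names pit_model, tick_reinject(), tick_no_reinject()
and drain() are made up for this example, and C11 atomics stand in for
the kernel's atomic_t. It only models how many interrupts the guest
ends up seeing after several timer expiries go unhandled:

```c
#include <stdatomic.h>

/* Hypothetical stand-in for pt->pending; not the real kvm_kpit_timer. */
struct pit_model {
	atomic_int pending;	/* expiries not yet injected into the guest */
};

/* Reinjection: every expiry is counted and replayed to the guest later. */
static void tick_reinject(struct pit_model *pt)
{
	atomic_fetch_add(&pt->pending, 1);
}

/* No reinjection: an expiry only marks a single interrupt as pending. */
static void tick_no_reinject(struct pit_model *pt)
{
	atomic_store(&pt->pending, 1);
}

/* Guest gets to run: return how many interrupts it sees, then reset. */
static int drain(struct pit_model *pt)
{
	return atomic_exchange(&pt->pending, 0);
}
```

If five expiries pass before the guest runs, drain() reports five
interrupts in the reinjecting case and one in the other, which is the
"lost ticks" behaviour of real hardware that the old tsc clock expects.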

>  	if (vcpu0 && waitqueue_active(&vcpu0->wq))
>  		wake_up_interruptible(&vcpu0->wq);
>  
>  	hrtimer_add_expires_ns(&pt->timer, pt->period);
>  	pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
>  	if (pt->period)
> -		ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
> +		ps->channels[0].count_load_time = ktime_get();
>  
>  	return (pt->period == 0 ? 0 : 1);
>  }
>   

I don't like the idea of punting to the user, but it looks like we don't
have a choice.  Hopefully vendors will port kvmclock to these kernels
and release them as updates -- time simply doesn't work well with
virtualization, especially for Linux guests.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.