From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: gettimeofday "slow" in RHEL4 guests
Date: Mon, 29 Dec 2008 15:11:54 +0200
Message-ID: <4958CC9A.5050008@redhat.com>
References: <492AE8AC.2090502@cisco.com> <492B8204.5@cisco.com> <87d4gkcdsy.fsf@basil.nowhere.org> <426B9829-823B-40BE-9A7E-9F7EF2ED3412@suse.de> <20081125114815.GG6703@one.firstfloor.org> <20081125125259.GH6703@one.firstfloor.org> <20081228183807.GA3883@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Yang, Sheng" , Alexander Graf , "David S. Ahern" , kvm-devel , Glauber de Oliveira Costa , Gleb Natapov , Dor Laor
To: Marcelo Tosatti
Return-path:
Received: from mx2.redhat.com ([66.187.237.31]:50979 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751429AbYL2NLr (ORCPT ); Mon, 29 Dec 2008 08:11:47 -0500
In-Reply-To: <20081228183807.GA3883@amt.cnet>
Sender: kvm-owner@vger.kernel.org
List-ID:

Marcelo Tosatti wrote:
> The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
> The algorithm uses the PIT count (latched) to measure the delay between
> interrupt generation and handling, and sums that value, on the next
> interrupt, to the TSC delta.
>
> Sheng investigated this problem in the discussions before in-kernel PIT
> was merged:
>
> http://www.mail-archive.com/kvm-devel@lists.sourceforge.net/msg13873.html
>
> The algorithm overcompensates for lost ticks and the guest time runs
> faster than the hosts.
>
> There are two issues:
>
> 1) A bug in the in-kernel PIT which miscalculates the count value.
>
> 2) For the case where more than one interrupt is lost, and later
> reinjected, the value read from PIT count is meaningless for the purpose
> of the tsc algorithm. The count is interpreted as the delay until the
> next interrupt, which is not the case with reinjection.
>
> As Sheng mentioned in the thread above, Xen pulls back the TSC value
> when reinjecting interrupts. VMWare ESX has a notion of "virtual TSC",
> which I believe is similar in this context.
>
> For KVM I believe the best immediate solution (for now) is to provide an
> option to disable reinjection, behaving similarly to real hardware. The
> advantage is simplicity compared to virtualizing the time sources.
>
> The QEMU PIT emulation has a limit on the rate of interrupt reinjection,
> perhaps something similar should be investigated in the future.
>
> The following patch (which contains the bugfix for 1) and disabled
> reinjection) fixes the severe time drift on RHEL4 with "clock=tsc".
> What I'm proposing is to condition reinjection with an option
> (-kvm-pit-no-reinject or something).
>
> Comments or better ideas?
>
>
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index e665d1c..608af7b 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>  	if (!atomic_inc_and_test(&pt->pending))
>  		set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
>
> +	if (atomic_read(&pt->pending) > 1)
> +		atomic_set(&pt->pending, 1);
> +
>

Replace the atomic_inc() with atomic_set(, 1) instead? One less test
and, more importantly, the logic is less scattered around the source.

>  	if (vcpu0 && waitqueue_active(&vcpu0->wq))
>  		wake_up_interruptible(&vcpu0->wq);
>
>  	hrtimer_add_expires_ns(&pt->timer, pt->period);
>  	pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
>  	if (pt->period)
> -		ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
> +		ps->channels[0].count_load_time = ktime_get();
>
>  	return (pt->period == 0 ? 0 : 1);
>  }
>

I don't like the idea of punting to the user, but it looks like we don't
have a choice. Hopefully vendors will port kvmclock to these kernels and
release them as updates; time simply doesn't work well with
virtualization, especially with Linux guests.
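
To make the atomic_set() suggestion concrete, here is a minimal
user-space sketch, not kernel code. It assumes C11 <stdatomic.h> as a
stand-in for the kernel's atomic_t helpers. struct pit_sim, the tick_*()
functions, inject_one() and delivered_after_burst() are hypothetical
names invented for the illustration; none of them exist in i8254.c. It
only models how many interrupts get delivered after a burst of timer
expirations that the guest did not service in time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PIT timer state; not the kernel struct. */
struct pit_sim {
	atomic_int pending;	/* ticks waiting to be injected */
	bool request_pending;	/* models the KVM_REQ_PENDING_TIMER bit */
};

/* Reinjection: every expiration is queued and delivered later. */
static void tick_reinject(struct pit_sim *pt)
{
	atomic_fetch_add(&pt->pending, 1);
	pt->request_pending = true;
}

/* Shape of the posted patch: increment, then clamp the count to 1. */
static void tick_inc_then_clamp(struct pit_sim *pt)
{
	/* New value is old + 1; it is never 0 for a non-negative count. */
	if (atomic_fetch_add(&pt->pending, 1) + 1 != 0)
		pt->request_pending = true;
	/* The read and the store are two separate atomic operations. */
	if (atomic_load(&pt->pending) > 1)
		atomic_store(&pt->pending, 1);
}

/* Suggested simplification: store 1 directly, no increment or test. */
static void tick_set_one(struct pit_sim *pt)
{
	atomic_store(&pt->pending, 1);
	pt->request_pending = true;
}

/* Deliver one queued tick; returns false when nothing is pending. */
static bool inject_one(struct pit_sim *pt)
{
	int old = atomic_load(&pt->pending);

	while (old > 0) {
		/* On failure, old is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&pt->pending, &old, old - 1))
			return true;
	}
	return false;
}

/* Fire `fires` expirations back to back, then count the deliveries. */
static int delivered_after_burst(void (*tick)(struct pit_sim *), int fires)
{
	struct pit_sim pt;
	int delivered = 0;

	assert(fires >= 0);
	atomic_init(&pt.pending, 0);
	pt.request_pending = false;

	for (int i = 0; i < fires; i++)
		tick(&pt);
	while (inject_one(&pt))
		delivered++;
	return delivered;
}
```

With a burst of five missed expirations, the reinjecting variant
delivers five interrupts, while both no-reinject variants deliver one.
In this single-threaded model the two no-reinject variants behave the
same, which is the point of the suggestion; the sketch says nothing
about concurrent readers of the counter.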
-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.