From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <mtosatti@redhat.com>
Subject: Re: [PATCH] Fix almost infinite loop in APIC
Date: Sun, 11 Jan 2009 02:55:58 -0200
Message-ID: <20090111045557.GA3661@amt.cnet>
References: <1231432566-9864-1-git-send-email-agraf@suse.de> <200901091434.39200.sheng@linux.intel.com> <496749BA.3080007@suse.de> <20090110112113.GB14678@yukikaze>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: Alexander Graf <agraf@suse.de>, Sheng Yang <sheng@linux.intel.com>,
	kvm@vger.kernel.org, avi@redhat.com, Kevin Wolf <kwolf@suse.de>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx2.redhat.com ([66.187.237.31]:52847 "EHLO mx2.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750894AbZAKE4O (ORCPT <rfc822;kvm@vger.kernel.org>);
	Sat, 10 Jan 2009 23:56:14 -0500
Content-Disposition: inline
In-Reply-To: <20090110112113.GB14678@yukikaze>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote:
> On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> > Sheng Yang wrote:
> > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> > >   
> > >> While booting Linux in VMware ESX, I encountered a strange effect
> > >> in the in-kernel lapic implementation: time went backwards!
> > >>
> > >> While this should never occur, because of that the while loop that
> > >> is done after the invalid calculations caused my host system to hang.
> > >>
> > >> In order to make debugging easier, let's replace this as suggested
> > >> with a modulo function and not run into the danger of looping forever.
> > >>
> > >> To replace the nice hint this bug gave me that the values are broken,
> > >> I added a printk message so people encountering this can at least
> > >> see that something is fishy.
> > >>
> > >> Of course, the real issue needs to be fixed as well! I'm open to ideas
> > >> why now < last_update!
> > >>
> > >> (Thanks to Kevin for his help in debugging this)
> > >>
> > >> Signed-off-by: Alexander Graf <agraf@suse.de>
> > >> Signed-off-by: Kevin Wolf <kwolf@suse.de>
> > >> ---
> > >>     
> > >
> > > Hi Alexander
> > >
> > > I'm a little suspect here:
> > >
> > >   
> > >> 	if (unlikely(ktime_to_ns(now) <=
> > >> 		ktime_to_ns(apic->timer.last_update))) {
> > >> 		/* Wrap around */
> > >> 		passed = ktime_add(( {
> > >> 				    (ktime_t) {
> > >> 				    .tv64 = KTIME_MAX -
> > >> 				    (apic->timer.last_update).tv64}; }
> > >> 				   ), now);
> > >> 		apic_debug("time elapsed\n");
> > >> 	} else
> > >> 		passed = ktime_sub(now, apic->timer.last_update);
> > >>     
> > >
> > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is 
> > > really ktime_get() which is almost impossible to wrap around. If it's 
> > > overflow, at least we need a warning. I think this piece of code due to clock 
> > > source change.
> > >
> > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a 
> > > big wrap around operation.
> > >
> > > And the most suspect:
> > >
> > >   
> > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> > >> {
> > >> 	struct kvm_lapic *apic = vcpu->arch.apic;
> > >>
> > >> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> > >> 		apic->timer.last_update = ktime_add_ns(
> > >> 				apic->timer.last_update,
> > >> 				apic->timer.period);
> > >> }
> > >>     
> > >
> > > Not sure what's happening, have you tried this? (In fact, I am little willing 
> > > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() 
> > > as in pit.)
> > >   
> > 
> > Yes, this code is the culprit. Using that patch of yours now is always >
> > last_update. I'd still like to see that while loop go away ;-).

Sheng,

Will separate the discussions on items:


1) On the current code, last_update is assigned by kvm_apic_timer_intr_post
as the expiration time of the next interrupt, while apic_get_tmcct
interprets it as "countdown start":

    passed = ktime_sub(now, apic->timer.last_update);

This explains why your first patch against kvm_apic_timer_intr_post
fixed the bug Alexander was experiencing.

BTW, yes, the "overflow" handling is not necessary as you noted.

2) The current count calculation is not right (apart from any eventual
bugs), or I don't understand it. It uses the interrupt injection
time as "countdown start" (or rearm in Linux terminology), in
kvm_apic_timer_intr_post.

But the host hrtimer starts counting once rearmed (assuming periodic
configuration). So the inaccuracy of APIC_TMCCT depends on this delay
between hrtimer rearm and interrupt injection.

What is this scheme supposed to achieve? It seems much simpler, and
more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT
emulation.

3) And then there's interrupt reinjection. Once interrupts accumulate,
and we reinject, current count reads become completly bogus. This is
the reason for time drift on RHEL4-style kernels with "clock=tsc". The
algorithm calculates the interrupt handling delay by using the PIT
count (equivalent to APIC_TMCCT). But PIT count emulation uses the
hrtimer expiration time, which has nothing to do with the accumulated
interrupts.

So what I propose is to first switch lapic current count emulation to a
straightforward scaled hrtimer_get_remaining equivalent.

For the reinjection case, maintain an average of the delay between host
interrupt and interrupt injection (this can be generic, so both PIT
and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
emulation whenever pending > 2.

What you think?

And as Avi noted, there should be a distinction between accumulated
interrupts, we only want to reinject the ones caused by host load (and
there are some interesting optimizations that can be done later, such
as stopping hrtimer from ticking until guest acks one interrupt, and
calculating the missed ones). But better start with the basic.