From mboxrd@z Thu Jan  1 00:00:00 1970
From: Sheng Yang <sheng@linux.intel.com>
Subject: Re: [PATCH] Fix almost infinite loop in APIC
Date: Tue, 13 Jan 2009 15:47:49 +0800
Message-ID: <200901131547.50626.sheng@linux.intel.com>
References: <1231432566-9864-1-git-send-email-agraf@suse.de> <20090110112113.GB14678@yukikaze> <20090111045557.GA3661@amt.cnet>
Mime-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Cc: Alexander Graf <agraf@suse.de>, kvm@vger.kernel.org,
	avi@redhat.com, Kevin Wolf <kwolf@suse.de>
To: Marcelo Tosatti <mtosatti@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mga06.intel.com ([134.134.136.21]:50971 "EHLO
	orsmga101.jf.intel.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org
	with ESMTP id S1752569AbZAMHsN (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 13 Jan 2009 02:48:13 -0500
In-Reply-To: <20090111045557.GA3661@amt.cnet>
Content-Disposition: inline
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote:
> On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote:
> > On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> > > Sheng Yang wrote:
> > > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> > > >> While booting Linux in VMware ESX, I encountered a strange effect
> > > >> in the in-kernel lapic implementation: time went backwards!
> > > >>
> > > >> ---
> > > >
> > > > Hi Alexander
> > > >
> > > > I'm a little suspect here:
> > > >> 	if (unlikely(ktime_to_ns(now) <=
> > > >> 		ktime_to_ns(apic->timer.last_update))) {
> > > >> 		/* Wrap around */
> > > >> 		passed = ktime_add(( {
> > > >> 				    (ktime_t) {
> > > >> 				    .tv64 = KTIME_MAX -
> > > >> 				    (apic->timer.last_update).tv64}; }
> > > >> 				   ), now);
> > > >> 		apic_debug("time elapsed\n");
> > > >> 	} else
> > > >> 		passed = ktime_sub(now, apic->timer.last_update);
> > > >
> > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and
> > > > get_time() is really ktime_get() which is almost impossible to wrap
> > > > around. If it's overflow, at least we need a warning. I think this
> > > > piece of code due to clock source change.
> > > >
> > > > So I doubt: due to some reason, now <= apic->timer.last_update, which
> > > > cause a big wrap around operation.
> > > >
> > > > And the most suspect:
> > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> > > >> {
> > > >> 	struct kvm_lapic *apic = vcpu->arch.apic;
> > > >>
> > > >> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> > > >> 		apic->timer.last_update = ktime_add_ns(
> > > >> 				apic->timer.last_update,
> > > >> 				apic->timer.period);
> > > >> }
> > > >
> > > > Not sure what's happening, have you tried this? (In fact, I am little
> > > > willing to replace all apic->timer.dev.base->get_time() with more
> > > > explicit ktime_get() as in pit.)
> > >
> > > Yes, this code is the culprit. Using that patch of yours now is always
> > > > last_update. I'd still like to see that while loop go away ;-).
>
> Sheng,
>
> Will separate the discussions on items:
>
>
> 1) On the current code, last_update is assigned by kvm_apic_timer_intr_post
> as the expiration time of the next interrupt, while apic_get_tmcct
> interprets it as "countdown start":
>
>     passed = ktime_sub(now, apic->timer.last_update);
>
> This explains why your first patch against kvm_apic_timer_intr_post
> fixed the bug Alexander was experiencing.
>
> BTW, yes, the "overflow" handling is not necessary as you noted.
>
> 2) The current count calculation is not right (apart from any eventual
> bugs), or I don't understand it. It uses the interrupt injection
> time as "countdown start" (or rearm in Linux terminology), in
> kvm_apic_timer_intr_post.
>
> But the host hrtimer starts counting once rearmed (assuming periodic
> configuration). So the inaccuracy of APIC_TMCCT depends on this delay
> between hrtimer rearm and interrupt injection.
>
> What is this scheme supposed to achieve? It seems much simpler, and
> more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT
> emulation.

Marcelo,

Sorry to late reply...

After rechecked the code, got some thoughts. last_update can indicated rearm 
time as following:

The original apic_timer_intr_post() got:

>         if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
>                 apic->timer.last_update = ktime_add_ns(
>                                 apic->timer.last_update,
>                                 apic->timer.period);

I think the purpose is let guest see a reasonable TMCCT. Which means:

1. Timer start to fire at start_apic_timer(). last_update=now(). 

2. Every one LAPIC timer interrupt was injected into the guest, so the time at 
least elapsed timer.period(the first alarm set at "period" later), then 
last_update is increased by period. Notice last_update's base is set before 
timer fire, so it's not indicated the next one, but *the time this interrupt 
should be injected*(time can be delayed)... So last_update = n*period + base.

3. If there is pending interrupt, last_update won't be updated, so we have 
this:

>                         while (counter_passed > tmcct)
>                                 counter_passed -= tmcct;
>                         tmcct -= counter_passed;
(notice tmcct is TMICT here.)

And the returned tmcct value indicated (counter_passed % tmict), a offset 
regardless of pending interrupt numbers.

I think now the overflow seems OK, but I am not sure why last_update can be 
bigger than ktime_get()? Maybe due to vcpu migration? Suspect some racy or 
boundary condition existed...

And base on this, I don't think my quick fix is correct...

>
> 3) And then there's interrupt reinjection. Once interrupts accumulate,
> and we reinject, current count reads become completly bogus. This is
> the reason for time drift on RHEL4-style kernels with "clock=tsc". The
> algorithm calculates the interrupt handling delay by using the PIT
> count (equivalent to APIC_TMCCT). But PIT count emulation uses the
> hrtimer expiration time, which has nothing to do with the accumulated
> interrupts.
>
> So what I propose is to first switch lapic current count emulation to a
> straightforward scaled hrtimer_get_remaining equivalent.
>
> For the reinjection case, maintain an average of the delay between host
> interrupt and interrupt injection (this can be generic, so both PIT
> and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
> emulation whenever pending > 2.
>
> What you think?

1. If we simply use ktime_get() to update last_update, the interval can't be 
same between different last_update. I think we may got some problem here, but 
not for sure. For example, three delayed interrupt was injected one after one, 
last_update would show very little interval, and APIC_TMCCT may got trouble.  
Maintain an average of the delay seems a little tricky here, and I am not sure 
if it would help the problem...  Average is average at most, not the real 
affection at the time...

2. For the current mechanism, the interval is the same, so last_update always 
equal to n*period + base. If there are more than one pending interrupts, TMCCT 
should also can return the relative correct value.

So I am leaning toward to fix current problem, though I haven't find the root 
cause yet...

> And as Avi noted, there should be a distinction between accumulated
> interrupts, we only want to reinject the ones caused by host load (and
> there are some interesting optimizations that can be done later, such
> as stopping hrtimer from ticking until guest acks one interrupt, and
> calculating the missed ones). But better start with the basic.

Agree...

-- 
regards
Yang, Sheng