From mboxrd@z Thu Jan 1 00:00:00 1970 From: Marcelo Tosatti Subject: Re: [PATCH] Fix almost infinite loop in APIC Date: Sun, 11 Jan 2009 02:55:58 -0200 Message-ID: <20090111045557.GA3661@amt.cnet> References: <1231432566-9864-1-git-send-email-agraf@suse.de> <200901091434.39200.sheng@linux.intel.com> <496749BA.3080007@suse.de> <20090110112113.GB14678@yukikaze> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii To: Alexander Graf , Sheng Yang , kvm@vger.kernel.org, avi@redhat.com, Kevin Wolf Return-path: Received: from mx2.redhat.com ([66.187.237.31]:52847 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750894AbZAKE4O (ORCPT ); Sat, 10 Jan 2009 23:56:14 -0500 Content-Disposition: inline In-Reply-To: <20090110112113.GB14678@yukikaze> Sender: kvm-owner@vger.kernel.org List-ID: On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote: > On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote: > > Sheng Yang wrote: > > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote: > > > > > >> While booting Linux in VMware ESX, I encountered a strange effect > > >> in the in-kernel lapic implementation: time went backwards! > > >> > > >> While this should never occur, because of that the while loop that > > >> is done after the invalid calculations caused my host system to hang. > > >> > > >> In order to make debugging easier, let's replace this as suggested > > >> with a modulo function and not run into the danger of looping forever. > > >> > > >> To replace the nice hint this bug gave me that the values are broken, > > >> I added a printk message so people encountering this can at least > > >> see that something is fishy. > > >> > > >> Of course, the real issue needs to be fixed as well! I'm open to ideas > > >> why now < last_update! > > >> > > >> (Thanks to Kevin for his help in debugging this) > > >> > > >> Signed-off-by: Alexander Graf > > >> Signed-off-by: Kevin Wolf > > >> --- > > >> > > > > > > Hi Alexander > > > > > > I'm a little suspect here: > > > > > > > > >> if (unlikely(ktime_to_ns(now) <= > > >> ktime_to_ns(apic->timer.last_update))) { > > >> /* Wrap around */ > > >> passed = ktime_add(( { > > >> (ktime_t) { > > >> .tv64 = KTIME_MAX - > > >> (apic->timer.last_update).tv64}; } > > >> ), now); > > >> apic_debug("time elapsed\n"); > > >> } else > > >> passed = ktime_sub(now, apic->timer.last_update); > > >> > > > > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is > > > really ktime_get() which is almost impossible to wrap around. If it's > > > overflow, at least we need a warning. I think this piece of code due to clock > > > source change. > > > > > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a > > > big wrap around operation. > > > > > > And the most suspect: > > > > > > > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) > > >> { > > >> struct kvm_lapic *apic = vcpu->arch.apic; > > >> > > >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > > >> apic->timer.last_update = ktime_add_ns( > > >> apic->timer.last_update, > > >> apic->timer.period); > > >> } > > >> > > > > > > Not sure what's happening, have you tried this? (In fact, I am little willing > > > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() > > > as in pit.) > > > > > > > Yes, this code is the culprit. Using that patch of yours now is always > > > last_update. I'd still like to see that while loop go away ;-). Sheng, Will separate the discussions on items: 1) On the current code, last_update is assigned by kvm_apic_timer_intr_post as the expiration time of the next interrupt, while apic_get_tmcct interprets it as "countdown start": passed = ktime_sub(now, apic->timer.last_update); This explains why your first patch against kvm_apic_timer_intr_post fixed the bug Alexander was experiencing. BTW, yes, the "overflow" handling is not necessary as you noted. 2) The current count calculation is not right (apart from any eventual bugs), or I don't understand it. It uses the interrupt injection time as "countdown start" (or rearm in Linux terminology), in kvm_apic_timer_intr_post. But the host hrtimer starts counting once rearmed (assuming periodic configuration). So the inaccuracy of APIC_TMCCT depends on this delay between hrtimer rearm and interrupt injection. What is this scheme supposed to achieve? It seems much simpler, and more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT emulation. 3) And then there's interrupt reinjection. Once interrupts accumulate, and we reinject, current count reads become completly bogus. This is the reason for time drift on RHEL4-style kernels with "clock=tsc". The algorithm calculates the interrupt handling delay by using the PIT count (equivalent to APIC_TMCCT). But PIT count emulation uses the hrtimer expiration time, which has nothing to do with the accumulated interrupts. So what I propose is to first switch lapic current count emulation to a straightforward scaled hrtimer_get_remaining equivalent. For the reinjection case, maintain an average of the delay between host interrupt and interrupt injection (this can be generic, so both PIT and LAPIC timer can use it). Return that scaled average on APIC_TMCCT emulation whenever pending > 2. What you think? And as Avi noted, there should be a distinction between accumulated interrupts, we only want to reinject the ones caused by host load (and there are some interesting optimizations that can be done later, such as stopping hrtimer from ticking until guest acks one interrupt, and calculating the missed ones). But better start with the basic.