From: Marcelo Tosatti <mtosatti@redhat.com>
To: Alexander Graf <agraf@suse.de>,
Sheng Yang <sheng@linux.intel.com>,
kvm@vger.kernel.org, avi@redhat.com, Kevin Wolf <kwolf@suse.de>
Subject: Re: [PATCH] Fix almost infinite loop in APIC
Date: Sun, 11 Jan 2009 02:55:58 -0200 [thread overview]
Message-ID: <20090111045557.GA3661@amt.cnet> (raw)
In-Reply-To: <20090110112113.GB14678@yukikaze>
On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote:
> On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> > Sheng Yang wrote:
> > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> > >
> > >> While booting Linux in VMware ESX, I encountered a strange effect
> > >> in the in-kernel lapic implementation: time went backwards!
> > >>
> > >> While this should never occur, because of that the while loop that
> > >> is done after the invalid calculations caused my host system to hang.
> > >>
> > >> In order to make debugging easier, let's replace this as suggested
> > >> with a modulo function and not run into the danger of looping forever.
> > >>
> > >> To replace the nice hint this bug gave me that the values are broken,
> > >> I added a printk message so people encountering this can at least
> > >> see that something is fishy.
> > >>
> > >> Of course, the real issue needs to be fixed as well! I'm open to ideas
> > >> why now < last_update!
> > >>
> > >> (Thanks to Kevin for his help in debugging this)
> > >>
> > >> Signed-off-by: Alexander Graf <agraf@suse.de>
> > >> Signed-off-by: Kevin Wolf <kwolf@suse.de>
> > >> ---
> > >>
> > >
> > > Hi Alexander
> > >
> > > I'm a little suspect here:
> > >
> > >
> > >> if (unlikely(ktime_to_ns(now) <=
> > >> ktime_to_ns(apic->timer.last_update))) {
> > >> /* Wrap around */
> > >> passed = ktime_add(( {
> > >> (ktime_t) {
> > >> .tv64 = KTIME_MAX -
> > >> (apic->timer.last_update).tv64}; }
> > >> ), now);
> > >> apic_debug("time elapsed\n");
> > >> } else
> > >> passed = ktime_sub(now, apic->timer.last_update);
> > >>
> > >
> > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is
> > > really ktime_get() which is almost impossible to wrap around. If it's
> > > overflow, at least we need a warning. I think this piece of code due to clock
> > > source change.
> > >
> > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a
> > > big wrap around operation.
> > >
> > > And the most suspect:
> > >
> > >
> > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> > >> {
> > >> struct kvm_lapic *apic = vcpu->arch.apic;
> > >>
> > >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> > >> apic->timer.last_update = ktime_add_ns(
> > >> apic->timer.last_update,
> > >> apic->timer.period);
> > >> }
> > >>
> > >
> > > Not sure what's happening, have you tried this? (In fact, I am little willing
> > > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get()
> > > as in pit.)
> > >
> >
> > Yes, this code is the culprit. Using that patch of yours now is always >
> > last_update. I'd still like to see that while loop go away ;-).
Sheng,
Will separate the discussions on items:
1) On the current code, last_update is assigned by kvm_apic_timer_intr_post
as the expiration time of the next interrupt, while apic_get_tmcct
interprets it as "countdown start":
passed = ktime_sub(now, apic->timer.last_update);
This explains why your first patch against kvm_apic_timer_intr_post
fixed the bug Alexander was experiencing.
BTW, yes, the "overflow" handling is not necessary as you noted.
2) The current count calculation is not right (apart from any eventual
bugs), or I don't understand it. It uses the interrupt injection
time as "countdown start" (or rearm in Linux terminology), in
kvm_apic_timer_intr_post.
But the host hrtimer starts counting once rearmed (assuming periodic
configuration). So the inaccuracy of APIC_TMCCT depends on this delay
between hrtimer rearm and interrupt injection.
What is this scheme supposed to achieve? It seems much simpler, and
more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT
emulation.
3) And then there's interrupt reinjection. Once interrupts accumulate,
and we reinject, current count reads become completly bogus. This is
the reason for time drift on RHEL4-style kernels with "clock=tsc". The
algorithm calculates the interrupt handling delay by using the PIT
count (equivalent to APIC_TMCCT). But PIT count emulation uses the
hrtimer expiration time, which has nothing to do with the accumulated
interrupts.
So what I propose is to first switch lapic current count emulation to a
straightforward scaled hrtimer_get_remaining equivalent.
For the reinjection case, maintain an average of the delay between host
interrupt and interrupt injection (this can be generic, so both PIT
and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
emulation whenever pending > 2.
What you think?
And as Avi noted, there should be a distinction between accumulated
interrupts, we only want to reinject the ones caused by host load (and
there are some interesting optimizations that can be done later, such
as stopping hrtimer from ticking until guest acks one interrupt, and
calculating the missed ones). But better start with the basic.
next prev parent reply other threads:[~2009-01-11 4:56 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-08 16:36 [PATCH] Fix almost infinite loop in APIC Alexander Graf
2009-01-09 6:34 ` Sheng Yang
2009-01-09 10:49 ` Alexander Graf
2009-01-09 12:57 ` Alexander Graf
2009-01-10 11:21 ` Sheng Yang
2009-01-11 4:55 ` Marcelo Tosatti [this message]
2009-01-13 7:47 ` Sheng Yang
2009-01-13 22:01 ` Marcelo Tosatti
2009-01-14 9:17 ` Sheng Yang
2009-01-14 17:03 ` Marcelo Tosatti
2009-01-15 7:20 ` Sheng Yang
2009-01-16 5:01 ` Marcelo Tosatti
2009-01-20 10:41 ` Alexander Graf
2009-01-20 11:20 ` Sheng Yang
2009-01-20 12:09 ` Alexander Graf
2009-01-20 12:30 ` Sheng Yang
2009-01-20 13:43 ` Sheng Yang
2009-01-20 18:51 ` Marcelo Tosatti
2009-01-21 2:40 ` Sheng Yang
2009-01-21 4:23 ` Marcelo Tosatti
2009-01-21 5:11 ` Sheng Yang
2009-01-21 15:07 ` Marcelo Tosatti
2009-01-21 16:01 ` Alexander Graf
2009-01-21 16:03 ` Alexander Graf
2009-01-21 16:18 ` Alexander Graf
2009-01-21 16:55 ` Marcelo Tosatti
2009-01-22 13:08 ` Avi Kivity
2009-01-23 17:58 ` Alex Williamson
2009-01-10 11:25 ` Sheng Yang
2009-01-10 11:28 ` Sheng Yang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090111045557.GA3661@amt.cnet \
--to=mtosatti@redhat.com \
--cc=agraf@suse.de \
--cc=avi@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=kwolf@suse.de \
--cc=sheng@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox