From: Avi Kivity <avi@qumranet.com>
To: "Yang, Sheng" <sheng.yang@intel.com>
Cc: kvm-devel@lists.sourceforge.net
Subject: Re: The SMP RHEL 5.1 PAE guest can't boot up issue
Date: Fri, 22 Feb 2008 18:16:16 +0200
Message-ID: <47BEF550.8040803@qumranet.com>
In-Reply-To: <200802221657.34243.sheng.yang@intel.com>
[copying Thomas for a question about CONSTANT_TSC, below]
Yang, Sheng wrote:
> I believe I have found the root cause of the SMP RHEL 5.1 PAE guest boot
> failure. The problem was caused by
> kvm:6685637b211ad67bdce21bfd9f91bc888b3acb4f
> "KVM: VMX: Ensure vcpu time stamp counter is monotonous" (It didn't take me
> much time to find the solution, but a lot of time to find the proper
> explanation... :( )
>
>
Thanks for tackling this difficult issue. Many have tried and failed;
it looks like you finally nailed it :)
> As we guessed, the problem was the monotonicity of the TSC. I traced into
> the 2.6.18 PAE guest kernel and finally found that it is caused by an overflow
> in the loop of the function update_wall_time() (kernel/timer.c) when the TSC
> is used as the clocksource by default.
>
> The reason is that the patch "KVM: VMX: Ensure vcpu time stamp counter is
> monotonous" introduces a big gap between different VCPUs (a difference between
> their TSC_OFFSETs). Though I have proved that the patch does ensure
> monotonicity on each VCPU (which ruled out my first guess...), the patch
> has 2 problems:
>
> 1. It accumulates error. Each vcpu's TSC is monotonic, but gets slower and
> slower compared to the host. That's because the TSC is very accurate and the
> interval between TSC reads is big. But this is not very critical.
>
> 2. The critical one. Under normal conditions, VCPU0 is migrated much more
> frequently than the other VCPUs, and the patch adds another "delta" (always
> negative if the host TSC is stable) to TSC_OFFSET each time a VCPU is
> migrated. So after running for a while, VCPU0 becomes much slower than the
> others (in my test, VCPU0 was migrated about twice as often as the others,
> and was easily more than 100k cycles behind). In the guest kernel, the
> clocksource TSC is a global variable; the variable "cycle_last" may get
> VCPU1's TSC value and then be used on VCPU0. Because VCPU0's TSC_OFFSET is
> smaller than VCPU1's, "cycle_last" (from VCPU1) can be bigger than the
> current TSC value (from VCPU0) on the next tick. Then "u64 offset =
> clocksource_read() - cycle_last" overflows and causes the "infinite" loop.
> This also explains why Marcelo's patch doesn't work: it just reduces the
> rate at which the gap grows.
>
> The freeze didn't happen with the userspace IOAPIC simply because the qemu
> APIC doesn't implement real LOWPRI (or round_robin) to choose the CPU for
> delivery. It chooses VCPU0 every time if possible, so CPU1 in the guest never
> updates cycle_last. :(
>
> This freeze only occurred on RHEL5/5.1 PAE (kernel 2.6.18) because they
> set IO-APIC IRQ0's dest_mask to 0x3 (with 2 vcpus) and dest_mode to
> LOWEST_PRIORITY, so other vcpus had a chance to modify "cycle_last". In
> contrast, RHEL5/5.1 32e sets IRQ0's dest_mode to FIXED, to CPU0, and so
> doesn't have this problem. The same goes for RHEL4 (kernel 2.6.9).
>
> I don't know if the patch is still needed now, since it was posted long ago (I
> don't know which issue it solved). I'd like to post a revert patch if
> necessary.
>
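The wraparound described above can be sketched in a few lines. This is only an illustration, not the 2.6.18 guest code: `guest_clock`, `guest_tsc()` and `tick_delta()` are hypothetical names, and the offsets are made-up numbers chosen to mirror the "100k cycles slower" case.

```c
#include <stdint.h>

/* Hypothetical stand-in for the guest's global clocksource state. */
struct guest_clock {
    uint64_t cycle_last;   /* TSC value recorded at the previous tick */
};

/* What a vcpu reads as its TSC: host TSC plus that vcpu's TSC_OFFSET.
 * The offset is signed; a negative offset makes the vcpu lag the host. */
uint64_t guest_tsc(uint64_t host_tsc, int64_t tsc_offset)
{
    return host_tsc + (uint64_t)tsc_offset;
}

/* Mirrors "u64 offset = clocksource_read() - cycle_last".
 * The subtraction is unsigned, so if 'now' is behind cycle_last the
 * result wraps around to a value close to 2^64 instead of going negative. */
uint64_t tick_delta(struct guest_clock *clk, uint64_t now)
{
    uint64_t offset = now - clk->cycle_last;

    clk->cycle_last = now;
    return offset;
}
```

If one tick is handled on a vcpu with offset 0 and the next tick, 50k host cycles later, is handled on a vcpu whose offset is -100k, the second delta is not -50k but 2^64 - 50000, and a loop that consumes the delta one interval at a time effectively never finishes.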
I believe the patch is still necessary, since we still need to guarantee
that a vcpu's tsc is monotonic. I think there are three issues to be
addressed:
1. The majority of Intel machines don't need the offset adjustment since
they already have a constant-rate tsc that is synchronized on all cpus.
I think this is indicated by X86_FEATURE_CONSTANT_TSC (though I'm not
100% certain if it means that the rate is the same for all cpus, Thomas
can you clarify?)
This will improve tsc quality for those machines, but we can't depend on
it, since some machines don't have constant tsc. Further, I don't think
really large machines can have constant tsc since clock distribution
becomes difficult or impossible.
2. We should handle round robin and lowest priority delivery the way qemu
does, that is, deliver to vcpu 0 whenever possible. Xen does the same thing:
> /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
> #define IRQ0_SPECIAL_ROUTING 1
in arch/x86/hvm/vioapic.c, at least for irq 0.
3. The extra migrations on vcpu 0 are likely due to its role servicing
I/O on behalf of the entire virtual machine. We should move this extra
work to an independent thread. I have done some work in this area. It
is becoming more important as kvm becomes more scalable.
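
To make point 2 concrete, here is a rough sketch of a lowest-priority destination picker that special-cases irq 0. It is not taken from kvm, qemu or Xen: `pick_lowpri_vcpu()`, `rr_next` and the 32-bit `dest_mask` layout are assumptions for illustration, and only the `IRQ0_SPECIAL_ROUTING` name comes from the Xen snippet quoted above.

```c
#include <stdint.h>

#define IRQ0_SPECIAL_ROUTING 1

/* Hypothetical round-robin cursor: the next vcpu index to try first. */
static unsigned int rr_next;

/* Pick a target vcpu for a lowest-priority interrupt.
 * Bit n of dest_mask is set if vcpu n is an eligible destination.
 * Returns the chosen vcpu index, or -1 if the mask is empty. */
int pick_lowpri_vcpu(unsigned int irq, uint32_t dest_mask)
{
    if (dest_mask == 0)
        return -1;

#if IRQ0_SPECIAL_ROUTING
    /* Keep the timer interrupt on vcpu 0 so that only one vcpu's TSC
     * is ever compared against cycle_last. */
    if (irq == 0 && (dest_mask & 1u))
        return 0;
#endif

    /* Everything else is spread round-robin over the eligible vcpus. */
    for (unsigned int i = 0; i < 32; i++) {
        unsigned int v = (rr_next + i) % 32;

        if (dest_mask & (UINT32_C(1) << v)) {
            rr_next = (v + 1) % 32;
            return (int)v;
        }
    }
    return -1;
}
```

With a dest_mask of 0x3, irq 0 always lands on vcpu 0, while other irqs alternate between vcpu 0 and vcpu 1.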
--
Any sufficiently difficult bug is indistinguishable from a feature.