public inbox for kvm@vger.kernel.org
From: Farkas Levente <lfarkas@bppiac.hu>
To: Avi Kivity <avi@qumranet.com>
Cc: kvm-devel@lists.sourceforge.net
Subject: Re: The SMP RHEL 5.1 PAE guest can't boot up issue
Date: Sat, 23 Feb 2008 16:24:07 +0100
Message-ID: <47C03A97.9010008@bppiac.hu>
In-Reply-To: <47BEF550.8040803@qumranet.com>

Avi Kivity wrote:
> [copying Thomas for a question about CONSTANT_TSC, below]
> 
> Yang, Sheng wrote:
>> I believe I have found the root cause of the issue where an SMP RHEL5.1 PAE
>> guest can't boot. The problem was caused by
>> kvm:6685637b211ad67bdce21bfd9f91bc888b3acb4f
>> "KVM: VMX: Ensure vcpu time stamp counter is monotonous" (It didn't take me 
>> much time to find the solution, but a lot of time to find the proper 
>> explanation...  :( )
>>
>>   
> 
> Thanks for tackling this difficult issue.  Many have tried and failed, 
> looks like you finally nailed it :)
> 
> 
>> As we guessed, the problem was TSC monotonicity. I traced it into the
>> 2.6.18 PAE guest kernel and finally found that it is caused by an overflow
>> in the loop of the function update_wall_time() (kernel/timer.c) when the TSC
>> is used as the clocksource, which is the default.
>>
>> The reason is that the patch "KVM: VMX: Ensure vcpu time stamp counter is 
>> monotonous" introduces a big gap between different VCPUs (a difference
>> between their TSC_OFFSETs). Though I have proved that the patch does ensure
>> monotonicity on each VCPU (which ruled out my first thought...), the patch
>> has 2 problems:
>>
>> 1. It accumulates error. Each vcpu's TSC is monotonic, but gets slower and
>> slower compared to the host. That's because the TSC is very fine-grained
>> and the interval between the TSC reads is big. But this is not very
>> critical.
>>
>> 2. The critical one. Under normal conditions, VCPU0 is migrated much more
>> frequently than the other VCPUs, and the patch adds another "delta" (always
>> negative if the host TSC is stable) to TSC_OFFSET on each migration. So
>> after running for a while, VCPU0 becomes much slower than the others (in my
>> test, VCPU0 was migrated about twice as often as the others, and easily
>> ended up more than 100k cycles behind). In the guest kernel, the TSC
>> clocksource is a global variable, so the variable "cycle_last" may be set
>> from VCPU1's TSC value and then be used on VCPU0. Because VCPU0's TSC_OFFSET
>> is smaller than VCPU1's, "cycle_last" (from VCPU1) can be bigger than the
>> current TSC value (from VCPU0) on the next tick. Then "u64 offset =
>> clocksource_read() - cycle_last" wraps around and causes the "infinite"
>> loop. This also explains why Marcelo's patch doesn't work: it only reduces
>> the rate at which the gap grows.
>>
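The wraparound described above can be sketched in a few lines. This is only
an illustrative model, not the guest kernel code: the migration counts, the
per-migration delta, the host TSC value and the cycle_interval value are all
made-up numbers, and the helper names are hypothetical. Python integers are
masked to 64 bits to mimic C's u64 arithmetic.

```python
U64_MASK = (1 << 64) - 1  # emulate C's u64 wraparound in Python


def u64_sub(a, b):
    """Subtract as unsigned 64-bit, wrapping the way C's u64 does."""
    return (a - b) & U64_MASK


def guest_tsc(host_tsc, tsc_offset):
    """Guest-visible TSC: host TSC plus the per-VCPU TSC_OFFSET (mod 2**64)."""
    return (host_tsc + tsc_offset) & U64_MASK


def offset_after_migrations(migrations, delta_per_migration):
    """Accumulated TSC_OFFSET if every migration adds the same delta."""
    return migrations * delta_per_migration


def pending_intervals(now, cycle_last, cycle_interval):
    """Iterations of a loop shaped like
    `while offset >= cycle_interval: offset -= cycle_interval`
    where offset = now - cycle_last in u64 arithmetic."""
    return u64_sub(now, cycle_last) // cycle_interval


# Assumed numbers, chosen only for illustration.
HOST_TSC = 10_000_000_000
DELTA = -1_000              # negative delta added on each migration
CYCLE_INTERVAL = 2_000_000  # hypothetical cycles per timer tick

# VCPU0 migrated twice as often as VCPU1, so its offset is smaller.
vcpu0_offset = offset_after_migrations(200, DELTA)
vcpu1_offset = offset_after_migrations(100, DELTA)

# cycle_last is stored on VCPU1; the next tick runs slightly later on VCPU0.
cycle_last = guest_tsc(HOST_TSC, vcpu1_offset)
now = guest_tsc(HOST_TSC + 50_000, vcpu0_offset)

wrapped_offset = u64_sub(now, cycle_last)
iterations = pending_intervals(now, cycle_last, CYCLE_INTERVAL)
```

With these numbers VCPU0 reads a value 50,000 cycles below cycle_last, so
the unsigned difference lands just under 2**64 and the loop would have to
run trillions of times, which looks like a hang.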
>> The freeze didn't happen when using the userspace IOAPIC, simply because the
>> qemu APIC doesn't implement real LOWPRI (or round_robin) to choose the CPU
>> for delivery. It chooses VCPU0 every time if possible, so CPU1 in the guest
>> never updates cycle_last. :( 
>>
>> This freeze only occurs on RHEL5/5.1 PAE (kernel 2.6.18), because they set
>> IO-APIC IRQ0's dest_mask to 0x3 (with 2 vcpus) and dest_mode to
>> LOWEST_PRIORITY, so the other vcpus get a chance to modify "cycle_last". In
>> contrast, RHEL5/5.1 32e sets IRQ0's dest_mode to FIXED, targeting CPU0, so
>> it doesn't have this problem. The same goes for RHEL4 (kernel 2.6.9).
>>
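The difference between the delivery choices described above can be modelled
roughly as follows. This is a toy sketch under stated assumptions: lowest
priority delivery is approximated as plain round robin over the CPUs in
dest_mask, the userspace behaviour is approximated as "always the
lowest-numbered CPU in the mask", and all function names are hypothetical.

```python
def cpus_in_mask(dest_mask, nr_vcpus):
    """List the VCPU numbers whose bit is set in dest_mask."""
    return [cpu for cpu in range(nr_vcpus) if dest_mask & (1 << cpu)]


def deliver_round_robin(dest_mask, nr_vcpus, nr_irqs):
    """Assumed model of lowest priority delivery: rotate over the mask."""
    targets = cpus_in_mask(dest_mask, nr_vcpus)
    return [targets[i % len(targets)] for i in range(nr_irqs)]


def deliver_first_cpu(dest_mask, nr_vcpus, nr_irqs):
    """Assumed model of the userspace APIC: always pick the first CPU."""
    targets = cpus_in_mask(dest_mask, nr_vcpus)
    return [targets[0]] * nr_irqs


# dest_mask 0x3 with 2 vcpus, as in the RHEL5/5.1 PAE case.
pae_in_kernel = deliver_round_robin(0x3, 2, 4)
pae_userspace = deliver_first_cpu(0x3, 2, 4)
# A mask containing only CPU0, standing in for the FIXED-to-CPU0 case.
fixed_cpu0 = deliver_round_robin(0x1, 2, 4)
```

Only the first case lets the timer interrupt land on VCPU1, which is what
gives VCPU1 the chance to write "cycle_last".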
>> I don't know if the patch is still needed now, since it was posted long ago
>> (I don't know which issue it solved). I'd like to post a revert patch if
>> necessary.
>>   
> 
> I believe the patch is still necessary, since we still need to guarantee 
> that a vcpu's tsc is monotonic.  I think there are three issues to be 
> addressed:
> 
> 1. The majority of Intel machines don't need the offset adjustment since 
> they already have a constant rate tsc that is synchronized on all cpus.  
> I think this is indicated by X86_FEATURE_CONSTANT_TSC (though I'm not 
> 100% certain if it means that the rate is the same for all cpus, Thomas 
> can you clarify?)
> 
> This will improve tsc quality for those machines, but we can't depend on 
> it, since some machines don't have constant tsc.  Further, I don't think 
> really large machines can have constant tsc since clock distribution 
> becomes difficult or impossible.
> 
> 2. We should implement round robin and lowest priority like qemu does.  
> Xen does the same thing:
> 
>> /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
>> #define IRQ0_SPECIAL_ROUTING 1
> in arch/x86/hvm/vioapic.c, at least for irq 0.
> 
> 3. The extra migrations on vcpu 0 are likely due to its role servicing 
> I/O on behalf of the entire virtual machine.  We should move this extra 
> work to an independent thread.  I have done some work in this area.  It 
> is becoming more important as kvm becomes more scalable.
> 

Will there be a new release in the near future? Many of us are waiting for
this bug to be fixed on quad-core and other multi-core CPUs.

-- 
  Levente                               "Si vis pacem para bellum!"
