From: Marcelo Tosatti <marcelo@kvack.org>
To: Avi Kivity <avi@qumranet.com>
Cc: kvm-devel@lists.sourceforge.net
Subject: Re: The SMP RHEL 5.1 PAE guest can't boot up issue
Date: Fri, 22 Feb 2008 14:17:57 -0300
Message-ID: <20080222171756.GA10840@dmt>
In-Reply-To: <47BEF550.8040803@qumranet.com>

On Fri, Feb 22, 2008 at 06:16:16PM +0200, Avi Kivity wrote:
> > 2. The critical one. Under normal conditions, VCPU0 is migrated much more
> > frequently than the other VCPUs, and the patch adds another "delta" (always
> > negative if the host TSC is stable) to TSC_OFFSET on each migration.
> > So after the guest has been running for a while, VCPU0's TSC falls well
> > behind the others (in my test, VCPU0 was migrated about twice as often as
> > the others, and easily ended up more than 100k cycles behind). In the guest
> > kernel, the TSC clocksource is a global variable, so "cycle_last" may be set
> > from VCPU1's TSC value and then be used on VCPU0. Because VCPU0's TSC_OFFSET
> > is smaller than VCPU1's, "cycle_last" (from VCPU1) can be bigger than the
> > current TSC value (from VCPU0) at the next tick. Then "u64 offset =
> > clocksource_read() - cycle_last" overflows and causes the "infinite" loop.
> > This also explains why Marcelo's patch doesn't work: it only reduces the
> > rate at which the gap grows.
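
To make the failure mode above concrete, here is a minimal sketch in
plain C. It assumes the simple model "guest TSC = host TSC + TSC_OFFSET";
the helper names guest_tsc() and elapsed_cycles() are hypothetical and
are not taken from the guest kernel or from kvm.

```c
#include <stdint.h>

/* Assumed model: guest-visible TSC = host TSC + per-VCPU TSC_OFFSET. */
static uint64_t guest_tsc(uint64_t host_tsc, int64_t tsc_offset)
{
    return host_tsc + (uint64_t)tsc_offset;
}

/*
 * Same shape as "u64 offset = clocksource_read() - cycle_last".
 * Unsigned subtraction wraps modulo 2^64, so if now < cycle_last the
 * result is a huge positive value instead of a small negative one.
 */
static uint64_t elapsed_cycles(uint64_t now, uint64_t cycle_last)
{
    return now - cycle_last;
}
```

With illustrative numbers: if cycle_last was taken on VCPU1 as
guest_tsc(1000000, -50000) and the next tick runs on VCPU0 with
guest_tsc(1010000, -150000), then elapsed_cycles() returns a value just
below 2^64, and any loop that consumes the elapsed time one tick at a
time will appear to spin forever.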

Another source of problems in this area is that the TSC_OFFSET is
initialized to represent zero at different times for VCPU0 (at boot) and
the remaining ones (at APIC_DM_INIT).
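
As a rough illustration of that skew, the sketch below assumes the offset
is chosen so that the guest TSC reads zero at initialization time, under
the same "guest = host + offset" model. The function names are
hypothetical, not kvm's.

```c
#include <stdint.h>

/* Hypothetical helper: offset that makes the guest TSC read zero "now". */
static int64_t tsc_offset_for_zero(uint64_t host_tsc_now)
{
    return -(int64_t)host_tsc_now;
}

/* Guest-visible TSC under the assumed model guest = host + offset. */
static uint64_t guest_tsc_at(uint64_t host_tsc, int64_t tsc_offset)
{
    return host_tsc + (uint64_t)tsc_offset;
}

/* Skew in cycles between two VCPUs sampled at the same host TSC. */
static int64_t guest_tsc_skew(uint64_t host_tsc, int64_t off_a, int64_t off_b)
{
    return (int64_t)(guest_tsc_at(host_tsc, off_a) -
                     guest_tsc_at(host_tsc, off_b));
}
```

If VCPU0 is zeroed at host TSC 1000 and VCPU1 at host TSC 5000, the two
guest TSCs differ by 4000 cycles from then on, before any migration
delta is applied.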

> > The freeze didn't happen with the userspace IOAPIC, simply because the qemu
> > APIC doesn't implement real LOWPRI (or round-robin) selection of the
> > destination CPU. It chooses VCPU0 every time if possible, so CPU1 in the
> > guest never updates cycle_last. :(
> >
> > The freeze only occurs on RHEL5/5.1 PAE (kernel 2.6.18), because they
> > set IO-APIC IRQ0's dest_mask to 0x3 (with 2 vcpus) and dest_mode to
> > LOWEST_PRIORITY, so the other vcpus get a chance to modify "cycle_last". In
> > contrast, RHEL5/5.1 32e sets IRQ0's dest_mode to FIXED, targeting CPU0, and
> > so doesn't have this problem. Neither does RHEL4 (kernel 2.6.9).
> >
> > I don't know whether the patch is still needed, since it was posted long
> > ago (I don't know which issue it solved). I'd like to post a revert patch
> > if necessary.
> >   
> 
> I believe the patch is still necessary, since we still need to guarantee 
> that a vcpu's tsc is monotonic.  I think there are three issues to be 
> addressed:
> 
> 1. The majority of Intel machines don't need the offset adjustment since 
> they already have a constant-rate tsc that is synchronized on all cpus.  
> I think this is indicated by X86_FEATURE_CONSTANT_TSC (though I'm not 
> 100% certain whether it means that the rate is the same for all cpus; 
> Thomas, can you clarify?)

The TSC might be marked unstable for other reasons (C3 state, large
machines with clustered APIC, cpufreq).

> This will improve tsc quality for those machines, but we can't depend on 
> it, since some machines don't have constant tsc.  Further, I don't think 
> really large machines can have constant tsc since clock distribution 
> becomes difficult or impossible.

As discussed earlier, if the host kernel does not consider the TSC
stable, it needs to enforce a state in which the guest OS will not trust
the TSC. The easiest way to do that is to fake a C3 state. However, QEMU
does not emulate I/O-port-based wait. This appears to be the reason for
the high CPU usage on idle with Windows guests, fixed by disabling C3
reporting in rombios (commit cb98751267c2d79f5674301ccac6c6b5c2e0c6b5 of
kvm-userspace).

> 
> 2. We should implement round robin and lowest priority like qemu does.  
> Xen does the same thing:
> 
> > /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
> > #define IRQ0_SPECIAL_ROUTING 1
> in arch/x86/hvm/vioapic.c, at least for irq 0.
> 
> 3. The extra migrations on vcpu 0 are likely due to its role servicing 
> I/O on behalf of the entire virtual machine.  We should move this extra 
> work to an independent thread.  I have done some work in this area.  It 
> is becoming more important as kvm becomes more scalable.
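
For point 2, a rough sketch of the two delivery policies being compared:
pinning IRQ0 to VCPU0 (in the spirit of the Xen IRQ0_SPECIAL_ROUTING
hack quoted above) versus round-robin over the destination mask. The
function pick_dest() and its parameters are hypothetical and are not
the kvm, qemu or Xen implementation; it assumes at most 32 vcpus.

```c
#include <stdint.h>

#define IRQ0_SPECIAL_ROUTING 1

/*
 * Pick a destination vcpu for an interrupt.
 * dest_mask: bit i set means vcpu i is an allowed destination.
 * rr_next:   round-robin cursor, updated when round-robin is used.
 * Returns the vcpu index, or -1 if no vcpu is eligible.
 */
static int pick_dest(unsigned irq, uint32_t dest_mask, unsigned nvcpus,
                     unsigned *rr_next)
{
    unsigned i;

    if (dest_mask == 0 || nvcpus == 0 || nvcpus > 32)
        return -1;

    /* Keep the timer tick on VCPU0 so only one TSC feeds cycle_last. */
    if (IRQ0_SPECIAL_ROUTING && irq == 0 && (dest_mask & 1u))
        return 0;

    /* Otherwise rotate through the eligible vcpus. */
    for (i = 0; i < nvcpus; i++) {
        unsigned v = (*rr_next + i) % nvcpus;

        if (dest_mask & (1u << v)) {
            *rr_next = (v + 1) % nvcpus;
            return (int)v;
        }
    }
    return -1;
}
```

With dest_mask 0x3 and 2 vcpus, IRQ0 always lands on vcpu 0, while any
other irq alternates between vcpu 0 and vcpu 1.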


