* The SMP RHEL 5.1 PAE guest can't boot up issue
@ 2008-02-22 8:57 Yang, Sheng
2008-02-22 16:16 ` Avi Kivity
0 siblings, 1 reply; 16+ messages in thread
From: Yang, Sheng @ 2008-02-22 8:57 UTC (permalink / raw)
To: kvm-devel
I believe I have found the root cause of the issue where an SMP RHEL 5.1 PAE
guest can't boot up. The problem was caused by
kvm:6685637b211ad67bdce21bfd9f91bc888b3acb4f
"KVM: VMX: Ensure vcpu time stamp counter is monotonous". (It didn't take me
much time to find the solution, but it took a lot of time to find the proper
explanation... :( )
As we guessed, the problem was TSC monotonicity. I traced it into
the 2.6.18 PAE guest kernel, and finally found that it is caused by an overflow in the
loop of the function update_wall_time() (kernel/timer.c), when TSC is used as
the clocksource by default.
The reason is that the patch "KVM: VMX: Ensure vcpu time stamp counter is
monotonous" introduces a big gap between different VCPUs (an error between their
TSC_OFFSETs). Though I have proved that the patch does ensure monotonicity
on each VCPU (which refuted my first thought...), the patch
has 2 problems:
1. It accumulates error. Each vcpu's TSC is monotonic, but gets
slower and slower compared to the host. That's because the TSC is very
accurate and the interval between TSC reads is big. But this is not very
critical.
2. The critical one. Under normal conditions, VCPU0 is migrated much more
frequently than the other VCPUs, and the patch adds another "delta" (always negative
if the host TSC is stable) to TSC_OFFSET each
time a VCPU is migrated. So after the guest has run for a while, VCPU0 becomes much
slower than the others (in my test, VCPU0 was migrated about twice as often as the
others, and was easily more than 100k cycles behind). In the guest kernel,
the TSC clocksource is a global variable, so its "cycle_last" field may get
VCPU1's TSC value before the next update runs on VCPU0. Because VCPU0's TSC_OFFSET is
smaller than VCPU1's, it's possible for "cycle_last" (from VCPU1) to be
bigger than the current TSC value (from VCPU0) at the next tick. Then "u64 offset =
clocksource_read() - cycle_last" overflows and causes the "infinite" loop.
This also explains why Marcelo's patch doesn't work: it just reduces the
rate at which the gap grows.
The freeze didn't happen when using the userspace IOAPIC, simply because the qemu
APIC doesn't implement real LOWPRI (or round-robin) selection of the CPU for delivery.
It chooses VCPU0 every time if possible, so CPU1 in the guest never updates
cycle_last. :(
The freeze only occurred on RHEL5/5.1 PAE (kernel 2.6.18), because they
set IO-APIC IRQ0's dest_mask to 0x3 (with 2 vcpus) and dest_mode to
LOWEST_PRIORITY, so other vcpus have a chance to modify "cycle_last". In
contrast, RHEL5/5.1 32e sets IRQ0's dest_mode to FIXED, targeting CPU0, and so doesn't
have this problem. The same goes for RHEL4 (kernel 2.6.9).
I don't know if the patch is still needed now, since it was posted long ago (I
don't know which issue it solved). I'd like to post a revert patch if
necessary.
--
Thanks
Yang, Sheng
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-22 8:57 The SMP RHEL 5.1 PAE guest can't boot up issue Yang, Sheng
@ 2008-02-22 16:16 ` Avi Kivity
2008-02-22 17:17 ` Marcelo Tosatti
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Avi Kivity @ 2008-02-22 16:16 UTC (permalink / raw)
To: Yang, Sheng; +Cc: kvm-devel
[copying Thomas for a question about CONSTANT_TSC, below]
Yang, Sheng wrote:
> I believe I have found the root cause of SMP RHEL5.1 PAE guest can't boot up
> issue. The problem was caused by
> kvm:6685637b211ad67bdce21bfd9f91bc888b3acb4f
> "KVM: VMX: Ensure vcpu time stamp counter is monotonous" (It didn't take me
> much time to found the solution, but a lot of time to find the proper
> explanation... :( )
>
>
Thanks for tackling this difficult issue. Many have tried and failed,
looks like you finally nailed it :)
> As we guessed, the problem was the monotonous of TSC. I have traced to
> the 2.6.18 PAE guest kernel, and finally found it caused by a overflow in the
> loop of function update_wall_timer()(kernel/timer.c), when using TSC as
> clocksource by default.
>
> The reason is that the patch "KVM: VMX: Ensure vcpu time stamp counter is
> monotonous" bring big gap between different VCPUs (error between
> TSC_OFFSETs). Though I have proved that the patch can ensure the monotonous
> on each VCPU (which rejected my first thought...), the patch
> have 2 problems:
>
> 1. It have accumulated the error. Each vcpu's TSC is monotonous, but get
> slower and slower, compared to the host. That's because the TSC is very
> accuracy and the interval between reading TSC is big. But this is not very
> critical.
>
> 2. The critical one. In normal condition, VCPU0 migrated much more
> frequently than other VCPUs. And the patch add more "delta" (always negative
> if host TSC is stable) to TSC_OFFSET each
> time migrated. Then after boot for a while, VCPU0 became much
> slower than others (In my test, VCPU0 was migrated about two times than the
> others, and easily to be more than 100k cycles slower). In the guest kernel,
> clocksource TSC is global variable, the variable "cycle_last" may got the
> VCPU1's TSC value, then turn to VCPU0. For VCPU0's TSC_OFFSET is
> smaller than VCPU1's, so it's possible to got the "cycle_last" (from VCPU1)
> bigger than current TSC value (from VCPU0) in next tick. Then "u64 offset =
> clocksource_read() - cycle_last" overflowed and caused the "infinite" loop.
> And it can also explained why Marcelo's patch don't work - it just reduce the
> rate of gap increasing.
>
> The freezing didn't happen when using userspace IOAPIC, just because the qemu
> APIC didn't implement real LOWPRI(or round_robin) to choose CPU for delivery.
> It choose VCPU0 everytime if possible, so CPU1 in guest won't update
> cycle_last. :(
>
> This freezing only occurred on RHEL5/5.1 pae (kernel 2.6.18), because of they
> set IO-APIC IRQ0's dest_mask to 0x3 (with 2 vcpus) and dest_mode as
> LOWEST_PRIOITY, then other vcpus had chance to modify "cycle_last". In
> contrast, RHEL5/5.1 32e set IRQ0's dest_mode as FIXED, to CPU0, then don't
> have this problem. So does RHEL4(kernel 2.6.9).
>
> I don't know if the patch was still needed now, since it was posted long ago(I
> don't know which issue it solved). I'd like to post a revert patch if
> necessary.
>
I believe the patch is still necessary, since we still need to guarantee
that a vcpu's tsc is monotonic. I think there are three issues to be
addressed:
1. The majority of Intel machines don't need the offset adjustment since
they already have a constant rate tsc that is synchronized on all cpus.
I think this is indicated by X86_FEATURE_CONSTANT_TSC (though I'm not
100% certain if it means that the rate is the same for all cpus, Thomas
can you clarify?)
This will improve tsc quality for those machines, but we can't depend on
it, since some machines don't have constant tsc. Further, I don't think
really large machines can have constant tsc since clock distribution
becomes difficult or impossible.
2. We should implement round robin and lowest priority like qemu does.
Xen does the same thing:
> /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
> #define IRQ0_SPECIAL_ROUTING 1
in arch/x86/hvm/vioapic.c, at least for irq 0.
3. The extra migrations on vcpu 0 are likely due to its role servicing
I/O on behalf of the entire virtual machine. We should move this extra
work to an independent thread. I have done some work in this area. It
is becoming more important as kvm becomes more scalable.
--
Any sufficiently difficult bug is indistinguishable from a feature.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-22 16:16 ` Avi Kivity
@ 2008-02-22 17:17 ` Marcelo Tosatti
2008-02-22 18:45 ` Avi Kivity
2008-02-23 15:24 ` Farkas Levente
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Marcelo Tosatti @ 2008-02-22 17:17 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
On Fri, Feb 22, 2008 at 06:16:16PM +0200, Avi Kivity wrote:
> > 2. The critical one. In normal condition, VCPU0 migrated much more
> > frequently than other VCPUs. And the patch add more "delta" (always negative
> > if host TSC is stable) to TSC_OFFSET each
> > time migrated. Then after boot for a while, VCPU0 became much
> > slower than others (In my test, VCPU0 was migrated about two times than the
> > others, and easily to be more than 100k cycles slower). In the guest kernel,
> > clocksource TSC is global variable, the variable "cycle_last" may got the
> > VCPU1's TSC value, then turn to VCPU0. For VCPU0's TSC_OFFSET is
> > smaller than VCPU1's, so it's possible to got the "cycle_last" (from VCPU1)
> > bigger than current TSC value (from VCPU0) in next tick. Then "u64 offset =
> > clocksource_read() - cycle_last" overflowed and caused the "infinite" loop.
> > And it can also explained why Marcelo's patch don't work - it just reduce the
> > rate of gap increasing.
Another source of problems in this area is that the TSC_OFFSET is
initialized to represent zero at different times for VCPU0 (at boot) and
the remaining ones (at APIC_DM_INIT).
> > The freezing didn't happen when using userspace IOAPIC, just because the qemu
> > APIC didn't implement real LOWPRI(or round_robin) to choose CPU for delivery.
> > It choose VCPU0 everytime if possible, so CPU1 in guest won't update
> > cycle_last. :(
> >
> > This freezing only occurred on RHEL5/5.1 pae (kernel 2.6.18), because of they
> > set IO-APIC IRQ0's dest_mask to 0x3 (with 2 vcpus) and dest_mode as
> > LOWEST_PRIOITY, then other vcpus had chance to modify "cycle_last". In
> > contrast, RHEL5/5.1 32e set IRQ0's dest_mode as FIXED, to CPU0, then don't
> > have this problem. So does RHEL4(kernel 2.6.9).
> >
> > I don't know if the patch was still needed now, since it was posted long ago(I
> > don't know which issue it solved). I'd like to post a revert patch if
> > necessary.
> >
>
> I believe the patch is still necessary, since we still need to guarantee
> that a vcpu's tsc is monotonous. I think there are three issues to be
> addressed:
>
> 1. The majority of intel machines don't need the offset adjustment since
> they already have a constant rate tsc that is synchronized on all cpus.
> I think this is indicated by X86_FEATURE_CONSTANT_TSC (though I'm not
> 100% certain if it means that the rate is the same for all cpus, Thomas
> can you clarify?)
The TSC might be marked unstable for other reasons (C3 state, large
machines with clustered APIC, cpufreq).
> This will improve tsc quality for those machines, but we can't depend on
> it, since some machines don't have constant tsc. Further, I don't think
> really large machines can have constant tsc since clock distribution
> becomes difficult or impossible.
As discussed earlier, if the host kernel does not consider the TSC
stable, it needs to enforce a state in which the guest OS will not trust
the TSC. The easiest way to do that is to fake a C3 state. However, QEMU
does not emulate the I/O-port-based wait. This appears to be the reason for
the high CPU usage on idle with Windows guests, fixed by disabling C3
reporting in rombios (commit cb98751267c2d79f5674301ccac6c6b5c2e0c6b5 of
kvm-userspace).
>
> 2. We should implement round robin and lowest priority like qemu does.
> Xen does the same thing:
>
> > /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
> > #define IRQ0_SPECIAL_ROUTING 1
> in arch/x86/hvm/vioapic.c, at least for irq 0.
>
> 3. The extra migrations on vcpu 0 are likely due to its role servicing
> I/O on behalf of the entire virtual machine. We should move this extra
> work to an independent thread. I have done some work in this area. It
> is becoming more important as kvm becomes more scalable.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-22 17:17 ` Marcelo Tosatti
@ 2008-02-22 18:45 ` Avi Kivity
2008-02-22 20:12 ` Marcelo Tosatti
0 siblings, 1 reply; 16+ messages in thread
From: Avi Kivity @ 2008-02-22 18:45 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: kvm-devel
Marcelo Tosatti wrote:
> Another source of problems in this area is that the TSC_OFFSET is
> initialized to represent zero at different times for VCPU0 (at boot) and
> the remaining ones (at APIC_DM_INIT).
>
>
I added tsc sync in the guest bios some time ago, so this should be
solved now.
>> This will improve tsc quality for those machines, but we can't depend on
>> it, since some machines don't have constant tsc. Further, I don't think
>> really large machines can have constant tsc since clock distribution
>> becomes difficult or impossible.
>>
>
> As discussed earlier, in case the host kernel does not have the TSC
> stable, it needs to enforce a state which the guest OS will not trust
> the TSC. The easier way to do that is to fake a C3 state. However, QEMU
> does not emulate IO port based wait. This appears to be the reason for
> the high-CPU-usage-on-idle with Windows guests, fixed by disabling C3
> reporting on rombios (commit cb98751267c2d79f5674301ccac6c6b5c2e0c6b5 of
> kvm-userspace).
>
>
Oh. Can you point me at documentation for the io port wait thing?
--
Any sufficiently difficult bug is indistinguishable from a feature.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-22 18:45 ` Avi Kivity
@ 2008-02-22 20:12 ` Marcelo Tosatti
0 siblings, 0 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2008-02-22 20:12 UTC (permalink / raw)
To: Avi Kivity; +Cc: Marcelo Tosatti, kvm-devel
On Fri, Feb 22, 2008 at 08:45:00PM +0200, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> >Another source of problems in this area is that the TSC_OFFSET is
> >initialized to represent zero at different times for VCPU0 (at boot) and
> >the remaining ones (at APIC_DM_INIT).
> >
> >
>
> I added tsc sync in the guest bios some time ago, so this should be
> solved now.
>
> >>This will improve tsc quality for those machines, but we can't depend on
> >>it, since some machines don't have constant tsc. Further, I don't think
> >>really large machines can have constant tsc since clock distribution
> >>becomes difficult or impossible.
> >>
> >
> >As discussed earlier, in case the host kernel does not have the TSC
> >stable, it needs to enforce a state which the guest OS will not trust
> >the TSC. The easier way to do that is to fake a C3 state. However, QEMU
> >does not emulate IO port based wait. This appears to be the reason for
> >the high-CPU-usage-on-idle with Windows guests, fixed by disabling C3
> >reporting on rombios (commit cb98751267c2d79f5674301ccac6c6b5c2e0c6b5 of
> >kvm-userspace).
> >
> >
>
> Oh. Can you point me at documentation for the io port wait thing?
ACPI spec 3.0b section 4.7.3.5. Reading the P_LVL2 or P_LVL3 register will cause
the processor to enter the corresponding C state (C2 or C3).
See drivers/acpi/processor_idle.c::acpi_idle_do_entry.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-22 16:16 ` Avi Kivity
2008-02-22 17:17 ` Marcelo Tosatti
@ 2008-02-23 15:24 ` Farkas Levente
2008-02-24 8:51 ` Avi Kivity
2008-02-25 23:46 ` Dong, Eddie
2008-02-29 8:26 ` Zhao Forrest
3 siblings, 1 reply; 16+ messages in thread
From: Farkas Levente @ 2008-02-23 15:24 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
Avi Kivity wrote:
> [copying Thomas for a question about CONSTANT_TSC, below]
>
> Yang, Sheng wrote:
>> I believe I have found the root cause of SMP RHEL5.1 PAE guest can't boot up
>> issue. The problem was caused by
>> kvm:6685637b211ad67bdce21bfd9f91bc888b3acb4f
>> "KVM: VMX: Ensure vcpu time stamp counter is monotonous" (It didn't take me
>> much time to found the solution, but a lot of time to find the proper
>> explanation... :( )
>>
>>
>
> Thanks for tackling this difficult issue. Many have tried and failed,
> looks like you finally nailed it :)
>
>
>> As we guessed, the problem was the monotonous of TSC. I have traced to
>> the 2.6.18 PAE guest kernel, and finally found it caused by a overflow in the
>> loop of function update_wall_timer()(kernel/timer.c), when using TSC as
>> clocksource by default.
>>
>> The reason is that the patch "KVM: VMX: Ensure vcpu time stamp counter is
>> monotonous" bring big gap between different VCPUs (error between
>> TSC_OFFSETs). Though I have proved that the patch can ensure the monotonous
>> on each VCPU (which rejected my first thought...), the patch
>> have 2 problems:
>>
>> 1. It have accumulated the error. Each vcpu's TSC is monotonous, but get
>> slower and slower, compared to the host. That's because the TSC is very
>> accuracy and the interval between reading TSC is big. But this is not very
>> critical.
>>
>> 2. The critical one. In normal condition, VCPU0 migrated much more
>> frequently than other VCPUs. And the patch add more "delta" (always negative
>> if host TSC is stable) to TSC_OFFSET each
>> time migrated. Then after boot for a while, VCPU0 became much
>> slower than others (In my test, VCPU0 was migrated about two times than the
>> others, and easily to be more than 100k cycles slower). In the guest kernel,
>> clocksource TSC is global variable, the variable "cycle_last" may got the
>> VCPU1's TSC value, then turn to VCPU0. For VCPU0's TSC_OFFSET is
>> smaller than VCPU1's, so it's possible to got the "cycle_last" (from VCPU1)
>> bigger than current TSC value (from VCPU0) in next tick. Then "u64 offset =
>> clocksource_read() - cycle_last" overflowed and caused the "infinite" loop.
>> And it can also explained why Marcelo's patch don't work - it just reduce the
>> rate of gap increasing.
>>
>> The freezing didn't happen when using userspace IOAPIC, just because the qemu
>> APIC didn't implement real LOWPRI(or round_robin) to choose CPU for delivery.
>> It choose VCPU0 everytime if possible, so CPU1 in guest won't update
>> cycle_last. :(
>>
>> This freezing only occurred on RHEL5/5.1 pae (kernel 2.6.18), because of they
>> set IO-APIC IRQ0's dest_mask to 0x3 (with 2 vcpus) and dest_mode as
>> LOWEST_PRIOITY, then other vcpus had chance to modify "cycle_last". In
>> contrast, RHEL5/5.1 32e set IRQ0's dest_mode as FIXED, to CPU0, then don't
>> have this problem. So does RHEL4(kernel 2.6.9).
>>
>> I don't know if the patch was still needed now, since it was posted long ago(I
>> don't know which issue it solved). I'd like to post a revert patch if
>> necessary.
>>
>
> I believe the patch is still necessary, since we still need to guarantee
> that a vcpu's tsc is monotonous. I think there are three issues to be
> addressed:
>
> 1. The majority of intel machines don't need the offset adjustment since
> they already have a constant rate tsc that is synchronized on all cpus.
> I think this is indicated by X86_FEATURE_CONSTANT_TSC (though I'm not
> 100% certain if it means that the rate is the same for all cpus, Thomas
> can you clarify?)
>
> This will improve tsc quality for those machines, but we can't depend on
> it, since some machines don't have constant tsc. Further, I don't think
> really large machines can have constant tsc since clock distribution
> becomes difficult or impossible.
>
> 2. We should implement round robin and lowest priority like qemu does.
> Xen does the same thing:
>
>> /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
>> #define IRQ0_SPECIAL_ROUTING 1
> in arch/x86/hvm/vioapic.c, at least for irq 0.
>
> 3. The extra migrations on vcpu 0 are likely due to its role servicing
> I/O on behalf of the entire virtual machine. We should move this extra
> work to an independent thread. I have done some work in this area. It
> is becoming more important as kvm becomes more scalable.
>
Will there be a new release in the near future? Many of us are waiting for
this bug to be fixed on quad and other multi-core CPUs.
--
Levente "Si vis pacem para bellum!"
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-23 15:24 ` Farkas Levente
@ 2008-02-24 8:51 ` Avi Kivity
2008-02-25 4:09 ` Yang, Sheng
2008-02-25 18:03 ` Farkas Levente
0 siblings, 2 replies; 16+ messages in thread
From: Avi Kivity @ 2008-02-24 8:51 UTC (permalink / raw)
To: Farkas Levente; +Cc: kvm-devel
[-- Attachment #1: Type: text/plain, Size: 276 bytes --]
Farkas Levente wrote:
> will be a new release in the near future? since many of us waiting for
> this bug to be fixed on quad and other multi core cpus.
>
>
Certainly. Can you try out the attached patch?
--
error compiling committee.c: too many arguments to function
[-- Attachment #2: irq0-to-vcpu0.patch --]
[-- Type: text/x-patch, Size: 854 bytes --]
diff --git a/kernel/ioapic.c b/kernel/ioapic.c
index 317f8e2..4232fd7 100644
--- a/kernel/ioapic.c
+++ b/kernel/ioapic.c
@@ -211,6 +211,10 @@ static void ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
case IOAPIC_LOWEST_PRIORITY:
vcpu = kvm_get_lowest_prio_vcpu(ioapic->kvm, vector,
deliver_bitmask);
+#ifdef CONFIG_X86
+ if (irq == 0)
+ vcpu = ioapic->kvm->vcpus[0];
+#endif
if (vcpu != NULL)
ioapic_inj_irq(ioapic, vcpu, vector,
trig_mode, delivery_mode);
@@ -220,6 +224,10 @@ static void ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
deliver_bitmask, vector, IOAPIC_LOWEST_PRIORITY);
break;
case IOAPIC_FIXED:
+#ifdef CONFIG_X86
+ if (irq == 0)
+ deliver_bitmask = 1;
+#endif
for (vcpu_id = 0; deliver_bitmask != 0; vcpu_id++) {
if (!(deliver_bitmask & (1 << vcpu_id)))
continue;
[-- Attachment #4: Type: text/plain, Size: 158 bytes --]
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-24 8:51 ` Avi Kivity
@ 2008-02-25 4:09 ` Yang, Sheng
2008-02-25 18:03 ` Farkas Levente
1 sibling, 0 replies; 16+ messages in thread
From: Yang, Sheng @ 2008-02-25 4:09 UTC (permalink / raw)
To: kvm-devel; +Cc: Avi Kivity
On Sunday 24 February 2008 16:51:07 Avi Kivity wrote:
> Farkas Levente wrote:
> > will be a new release in the near future? since many of us waiting for
> > this bug to be fixed on quad and other multi core cpus.
>
> Certainly. Can you try out the attached patch?
OK on my side. At first I thought it might somehow affect 2.6.22, since that
kernel picks one of the VCPUs in turn to receive all the PIT interrupts.
But testing shows it's OK.
--
Thanks
Yang, Sheng
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-24 8:51 ` Avi Kivity
2008-02-25 4:09 ` Yang, Sheng
@ 2008-02-25 18:03 ` Farkas Levente
2008-02-25 18:12 ` Avi Kivity
1 sibling, 1 reply; 16+ messages in thread
From: Farkas Levente @ 2008-02-25 18:03 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
Avi Kivity wrote:
> Farkas Levente wrote:
>> will be a new release in the near future? since many of us waiting for
>> this bug to be fixed on quad and other multi core cpus.
>>
>>
>
> Certainly. Can you try out the attached patch?
thanks. it works!:-)))
we've been waiting for this in the last half year!
thanks again.
--
Levente "Si vis pacem para bellum!"
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-25 18:03 ` Farkas Levente
@ 2008-02-25 18:12 ` Avi Kivity
2008-02-25 18:24 ` Farkas Levente
0 siblings, 1 reply; 16+ messages in thread
From: Avi Kivity @ 2008-02-25 18:12 UTC (permalink / raw)
To: Farkas Levente; +Cc: kvm-devel
Farkas Levente wrote:
> Avi Kivity wrote:
>
>> Farkas Levente wrote:
>>
>>> will be a new release in the near future? since many of us waiting for
>>> this bug to be fixed on quad and other multi core cpus.
>>>
>>>
>>>
>> Certainly. Can you try out the attached patch?
>>
>
> thanks. it works!:-)))
> we've been waiting for this in the last half year!
> thanks again.
>
Well, it was a tough one.
The credit belongs to Sheng Yang for figuring it out. The patch was
easy once the problem was understood.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-25 18:12 ` Avi Kivity
@ 2008-02-25 18:24 ` Farkas Levente
0 siblings, 0 replies; 16+ messages in thread
From: Farkas Levente @ 2008-02-25 18:24 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
Avi Kivity wrote:
> Farkas Levente wrote:
>> Avi Kivity wrote:
>>
>>> Farkas Levente wrote:
>>>
>>>> will be a new release in the near future? since many of us waiting for
>>>> this bug to be fixed on quad and other multi core cpus.
>>>>
>>>>
>>> Certainly. Can you try out the attached patch?
>>>
>>
>> thanks. it works!:-)))
>> we've been waiting for this in the last half year!
>> thanks again.
>>
>
> Well, it was a tough one.
>
> The credit belongs to Sheng Yang for figuring it out. The patch was
> easy once the problem was understood.
This patch alone deserves a new release.
--
Levente "Si vis pacem para bellum!"
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-22 16:16 ` Avi Kivity
2008-02-22 17:17 ` Marcelo Tosatti
2008-02-23 15:24 ` Farkas Levente
@ 2008-02-25 23:46 ` Dong, Eddie
2008-02-26 10:28 ` Avi Kivity
2008-02-29 8:26 ` Zhao Forrest
3 siblings, 1 reply; 16+ messages in thread
From: Dong, Eddie @ 2008-02-25 23:46 UTC (permalink / raw)
To: Avi Kivity, Yang, Sheng; +Cc: kvm-devel
>> I don't know if the patch was still needed now, since it was posted
>> long ago(I don't know which issue it solved). I'd like to post a
>> revert patch if necessary.
>>
>
> I believe the patch is still necessary, since we still need to
> guarantee that a vcpu's tsc is monotonous. I think there are three
> issues to be addressed:
>
> 1. The majority of intel machines don't need the offset adjustment
> since they already have a constant rate tsc that is synchronized on
> all cpus. I think this is indicated by X86_FEATURE_CONSTANT_TSC
> (though I'm not 100% certain if it means that the rate is the same
> for all cpus, Thomas can you clarify?)
So why not make the TSC_OFFSET adjustment conditional?
The original patch doesn't bring any benefit on platforms
with a constant TSC, especially if they are the majority,
while the accumulated difference due to the patch will be very big,
which makes the guest timer worse.
>
> This will improve tsc quality for those machines, but we can't depend
> on it, since some machines don't have constant tsc. Further, I don't
> think really large machines can have constant tsc since clock
> distribution becomes difficult or impossible.
For NUMA machines this is an issue, but it depends on how you support
NUMA. One way is to bind all VCPUs of a guest to the same node if the guest is not
NUMA-aware; if this is the model, then we don't have an issue.
I think Xen is planning to go this way, and it is the same for KVM.
>
> 2. We should implement round robin and lowest priority like qemu does.
> Xen does the same thing:
>
>> /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
>> #define IRQ0_SPECIAL_ROUTING 1
> in arch/x86/hvm/vioapic.c, at least for irq 0.
We did the same thing in Xen a long time ago to avoid this issue.
It helps, but it is not perfect.
>
> 3. The extra migrations on vcpu 0 are likely due to its role servicing
> I/O on behalf of the entire virtual machine. We should move this
> extra work to an independent thread. I have done some work in this
> area. It is becoming more important as kvm becomes more scalable.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-25 23:46 ` Dong, Eddie
@ 2008-02-26 10:28 ` Avi Kivity
2008-02-29 4:35 ` Zhao Forrest
2008-03-04 11:38 ` Avi Kivity
0 siblings, 2 replies; 16+ messages in thread
From: Avi Kivity @ 2008-02-26 10:28 UTC (permalink / raw)
To: Dong, Eddie; +Cc: kvm-devel
Dong, Eddie wrote:
>>> I don't know if the patch was still needed now, since it was posted
>>> long ago(I don't know which issue it solved). I'd like to post a
>>> revert patch if necessary.
>>>
>>>
>> I believe the patch is still necessary, since we still need to
>> guarantee that a vcpu's tsc is monotonous. I think there are three
>> issues to be addressed:
>>
>> 1. The majority of intel machines don't need the offset adjustment
>> since they already have a constant rate tsc that is synchronized on
>> all cpus. I think this is indicated by X86_FEATURE_CONSTANT_TSC
>> (though I'm not 100% certain if it means that the rate is the same
>> for all cpus, Thomas can you clarify?)
>>
>
> So why not make the TSC_OFFSET adjustment conditional?
>
Yes, that's what I meant. We just need to be sure that this is what
X86_FEATURE_CONSTANT_TSC means.
>
>> This will improve tsc quality for those machines, but we can't depend
>> on it, since some machines don't have constant tsc. Further, I don't
>> think really large machines can have constant tsc since clock
>> distribution becomes difficult or impossible.
>>
>
> For NUMA machines, this is an issue, but depend on how you support
> NUMA. One way is to bind VCPUs of a guest to same node if guest is not
> NUMA, if this is the model, then we don't have issue.
> I think Xen is planning in this way and it is same for KVM.
>
>
>
This is a user decision: many small VMs or a few larger ones. To
support the "many small VMs" case, we need to be able to detect the tsc
stability groups. I don't think those are the same as NUMA nodes for
processors with on-board memory controllers (where each processor is a
node).
>> 2. We should implement round robin and lowest priority like qemu does.
>> Xen does the same thing:
>>
>>
>>> /* HACK: Route IRQ0 only to VCPU0 to prevent time jumps. */
>>> #define IRQ0_SPECIAL_ROUTING 1
>>>
>> in arch/x86/hvm/vioapic.c, at least for irq 0.
>>
>
> We did same thing in Xen long time ago to avoid this issue.
> It helps but not perfect.
>
An equivalent hack is now in kvm as well.
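
As a rough illustration of the routing idea quoted above (not the Xen or kvm code), the sketch below delivers irq 0 to vcpu 0 unconditionally and spreads every other interrupt round robin. The helper name pick_dest_vcpu and the rr_next state are hypothetical; only the IRQ0_SPECIAL_ROUTING macro name comes from the quoted Xen source.

```c
#define IRQ0_SPECIAL_ROUTING 1

/*
 * Hypothetical helper: choose the destination vcpu for an interrupt.
 * nr_vcpus must be non-zero; *rr_next is round-robin state kept by the
 * caller and is only advanced for interrupts that are actually spread.
 */
static unsigned int pick_dest_vcpu(unsigned int irq, unsigned int nr_vcpus,
                                   unsigned int *rr_next)
{
    unsigned int dest;

#if IRQ0_SPECIAL_ROUTING
    /* The timer interrupt always lands on vcpu 0, so timekeeping only
     * ever sees one vcpu's tsc. */
    if (irq == 0)
        return 0;
#endif

    /* Everything else rotates across the vcpus. */
    dest = *rr_next % nr_vcpus;
    *rr_next = (dest + 1) % nr_vcpus;
    return dest;
}
```

With the special case enabled, the guest's timer tick never compares tsc values taken on different vcpus, which is presumably why it helps without being a complete fix.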
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-26 10:28 ` Avi Kivity
@ 2008-02-29 4:35 ` Zhao Forrest
2008-03-04 11:38 ` Avi Kivity
1 sibling, 0 replies; 16+ messages in thread
From: Zhao Forrest @ 2008-02-29 4:35 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
Avi, Eddie,
I have a kernel-newbie question related to this thread. I think the case
Yang described, where the TSCs of different vcpus are out of sync, could
also happen with physical cpus. That is, an OS running on bare-metal
hardware needs to handle unsynced TSCs between physical cpus. But from
the discussion in this thread, it seems that the RHEL5.1 kernel cannot
handle an "unsynced TSC", right?
Thanks for your advice in advance,
Forrest
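
For reference, the failure mode reported at the start of this thread can be sketched in a few lines: the guest keeps one global "cycle_last", and if the next read comes from a cpu whose TSC is behind it, the unsigned subtraction wraps to a huge value. The function names below are hypothetical and this is not the 2.6.18 timekeeping code; the clamped variant is only an assumption about what a defensive version might look like.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the "cycles elapsed since the last tick"
 * computation. Both values are unsigned, so a reading that is behind
 * cycle_last does not produce a negative number; it wraps around.
 */
static uint64_t cycles_since_last(uint64_t now, uint64_t cycle_last)
{
    return now - cycle_last;
}

/* Assumed defensive variant: treat a backward step as "no time passed". */
static uint64_t cycles_since_last_clamped(uint64_t now, uint64_t cycle_last)
{
    return now >= cycle_last ? now - cycle_last : 0;
}
```

If cycle_last was taken on a cpu that is 100 cycles ahead of the one doing the next read, the first function returns a value just below 2^64 rather than a small delta, and any loop that consumes the delta one interval at a time will effectively never finish.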
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-22 16:16 ` Avi Kivity
` (2 preceding siblings ...)
2008-02-25 23:46 ` Dong, Eddie
@ 2008-02-29 8:26 ` Zhao Forrest
3 siblings, 0 replies; 16+ messages in thread
From: Zhao Forrest @ 2008-02-29 8:26 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel
>
> I believe the patch is still necessary, since we still need to guarantee
> that a vcpu's tsc is monotonous. I think there are three issues to be
> addressed:
>
> 1. The majority of intel machines don't need the offset adjustment since
> they already have a constant rate tsc that is synchronized on all cpus.
> I think this is indicated by X86_FEATURE_CONSTANT_TSC (though I'm not
> 100% certain if it means that the rate is the same for all cpus, Thomas
> can you clarify?)
>
> This will improve tsc quality for those machines, but we can't depend on
> it, since some machines don't have constant tsc. Further, I don't think
> really large machines can have constant tsc since clock distribution
> becomes difficult or impossible.
>
I have another newbie question: can the current Linux kernel handle
unsynced TSCs? If the kernel can't handle this case, there is still a
problem running Linux on hardware with unsynced TSCs.
Thanks,
Forrest
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: The SMP RHEL 5.1 PAE guest can't boot up issue
2008-02-26 10:28 ` Avi Kivity
2008-02-29 4:35 ` Zhao Forrest
@ 2008-03-04 11:38 ` Avi Kivity
1 sibling, 0 replies; 16+ messages in thread
From: Avi Kivity @ 2008-03-04 11:38 UTC (permalink / raw)
To: Dong, Eddie; +Cc: kvm-devel
Avi Kivity wrote:
> Dong, Eddie wrote:
>>>> I don't know if the patch was still needed now, since it was posted
>>>> long ago(I don't know which issue it solved). I'd like to post a
>>>> revert patch if necessary.
>>>>
>>> I believe the patch is still necessary, since we still need to
>>> guarantee that a vcpu's tsc is monotonous. I think there are three
>>> issues to be addressed:
>>>
>>> 1. The majority of intel machines don't need the offset adjustment
>>> since they already have a constant rate tsc that is synchronized on
>>> all cpus. I think this is indicated by X86_FEATURE_CONSTANT_TSC
>>> (though I'm not 100% certain if it means that the rate is the same
>>> for all cpus, Thomas can you clarify?)
>>>
>>
>> So why not make the TSC_OFFSET adjustment conditional?
>>
>
> Yes, that's what I meant. We just need to be sure that this is what
> X86_FEATURE_CONSTANT_TSC means.
I changed the tsc offset adjustment to allow only forward adjustment. Since
hosts with synced tsc never require a forward adjustment, they should now
have a better quality tsc.
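
A minimal sketch of the forward-only rule, assuming the guest tsc is the host tsc plus a per-vcpu offset; the struct and function names are hypothetical and this is not the actual vmx code:

```c
#include <stdint.h>

/* Hypothetical per-vcpu bookkeeping. */
struct vcpu_tsc {
    uint64_t host_tsc_at_unload; /* host tsc when the vcpu last left a cpu */
    uint64_t tsc_offset;         /* guest tsc = host tsc + tsc_offset */
};

/*
 * Called when the vcpu starts running on a different physical cpu.
 * The offset is only ever moved forward: if the new cpu's tsc is behind
 * the value recorded at unload, add the gap so the guest never sees its
 * tsc go backwards. If the new cpu is level or ahead, leave the offset
 * alone instead of pulling the guest tsc back.
 */
static void vcpu_migrated(struct vcpu_tsc *v, uint64_t host_tsc_now)
{
    if (host_tsc_now < v->host_tsc_at_unload)
        v->tsc_offset += v->host_tsc_at_unload - host_tsc_now;
}

/* Value the guest would read for a given host tsc. */
static uint64_t guest_tsc(const struct vcpu_tsc *v, uint64_t host_tsc_now)
{
    return host_tsc_now + v->tsc_offset;
}
```

On a host whose cpus are in sync, the condition is never true after a migration, so the offset stays fixed and no error accumulates.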
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2008-03-04 11:38 UTC | newest]
Thread overview: 16+ messages
2008-02-22 8:57 The SMP RHEL 5.1 PAE guest can't boot up issue Yang, Sheng
2008-02-22 16:16 ` Avi Kivity
2008-02-22 17:17 ` Marcelo Tosatti
2008-02-22 18:45 ` Avi Kivity
2008-02-22 20:12 ` Marcelo Tosatti
2008-02-23 15:24 ` Farkas Levente
2008-02-24 8:51 ` Avi Kivity
2008-02-25 4:09 ` Yang, Sheng
2008-02-25 18:03 ` Farkas Levente
2008-02-25 18:12 ` Avi Kivity
2008-02-25 18:24 ` Farkas Levente
2008-02-25 23:46 ` Dong, Eddie
2008-02-26 10:28 ` Avi Kivity
2008-02-29 4:35 ` Zhao Forrest
2008-03-04 11:38 ` Avi Kivity
2008-02-29 8:26 ` Zhao Forrest