public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* kvm guest loops_per_jiffy miscalibration under host load
@ 2008-07-02 16:40 Marcelo Tosatti
  2008-07-03 13:17 ` Glauber Costa
  2008-07-07 18:17 ` Daniel P. Berrange
  0 siblings, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2008-07-02 16:40 UTC (permalink / raw)
  To: kvm-devel; +Cc: gcosta, kraxel, chrisw, aliguori

Hello,

I have been discussing with Glauber and Gerd the problem where KVM
guests miscalibrate loops_per_jiffy if there's sufficient load on the
host.

calibrate_delay_direct() failed to get a good estimate for
loops_per_jiffy.
Probably due to long platform interrupts. Consider using "lpj=" boot
option.
Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)

While this particular host calculates lpj=1597041.

This means that udelay() can delay for less than what was asked for, with
fatal results such as:

..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
'noapic' kernel parameter

This bug is easily triggered with a CPU-hungry task at nice -20
running only during guest calibration (so that the timer check code in
io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).

The problem is that the calibration routines assume a stable relation
between timer interrupt frequency (PIT at this boot stage) and
TSC/execution frequency.

The emulated timer frequency is based on the host system time and
therefore virtually resistant against heavy load, while the execution
of these routines in the guest is susceptible to scheduling of the QEMU
process.

To fix this in a transparent way (without direct "lpj=" boot parameter
assignment or a paravirt equivalent), it would be necessary to base the
emulated timer frequency on guest execution time instead of host system
time. But this can introduce timekeeping issues (recent Linux guests
seem to handle lost/late interrupts fine as long as the clocksource is
reliable) and just sounds scary.

Possible solutions:

- Require the admin to preset "lpj=". Nasty, not user friendly.
- Pass the proper lpj value via a paravirt interface. Won't cover
  fullvirt guests.
- Have the management app guarantee a minimum amount of CPU required
  for proper calibration during guest initialization.

Comments, ideas?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-02 16:40 kvm guest loops_per_jiffy miscalibration under host load Marcelo Tosatti
@ 2008-07-03 13:17 ` Glauber Costa
  2008-07-04 22:51   ` Marcelo Tosatti
  2008-07-07  1:56   ` Anthony Liguori
  2008-07-07 18:17 ` Daniel P. Berrange
  1 sibling, 2 replies; 21+ messages in thread
From: Glauber Costa @ 2008-07-03 13:17 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel, kraxel, chrisw, aliguori

Marcelo Tosatti wrote:
> Hello,
> 
> I have been discussing with Glauber and Gerd the problem where KVM
> guests miscalibrate loops_per_jiffy if there's sufficient load on the
> host.
> 
> calibrate_delay_direct() failed to get a good estimate for
> loops_per_jiffy.
> Probably due to long platform interrupts. Consider using "lpj=" boot
> option.
> Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)
> 
> While this particular host calculates lpj=1597041.
> 
> This means that udelay() can delay for less than what was asked for, with
> fatal results such as:
> 
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
> 'noapic' kernel parameter
> 
> This bug is easily triggered with a CPU-hungry task at nice -20
> running only during guest calibration (so that the timer check code in
> io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).
> 
> The problem is that the calibration routines assume a stable relation
> between timer interrupt frequency (PIT at this boot stage) and
> TSC/execution frequency.
> 
> The emulated timer frequency is based on the host system time and
> therefore virtually resistant against heavy load, while the execution
> of these routines in the guest is susceptible to scheduling of the QEMU
> process.
> 
> To fix this in a transparent way (without direct "lpj=" boot parameter
> assignment or a paravirt equivalent), it would be necessary to base the
> emulated timer frequency on guest execution time instead of host system
> time. But this can introduce timekeeping issues (recent Linux guests
> seem to handle lost/late interrupts fine as long as the clocksource is
> reliable) and just sounds scary.
> 
> Possible solutions:
> 
> - Require the admin to preset "lpj=". Nasty, not user friendly.
> - Pass the proper lpj value via a paravirt interface. Won't cover
>   fullvirt guests.
> - Have the management app guarantee a minimum amount of CPU required
> for proper calibration during guest initialization.
I don't like any of these solutions, and won't champion any single one,
so no hard feelings. But IMHO the least bad among them is the paravirt
one. At least it goes in the general direction of "paravirt if you need
to scale beyond xyz".

I think passing lpj is out of the question, and giving the CPU resources
for that time is kind of a kludge.

Or maybe we could put the timer expiration alone in a separate thread, 
with maximum priority (maybe rt priority)? dunno...


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-03 13:17 ` Glauber Costa
@ 2008-07-04 22:51   ` Marcelo Tosatti
  2008-07-07  1:56   ` Anthony Liguori
  1 sibling, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2008-07-04 22:51 UTC (permalink / raw)
  To: Glauber Costa; +Cc: kvm-devel, kraxel, chrisw, aliguori

Hi Glauber,

On Thu, Jul 03, 2008 at 10:17:05AM -0300, Glauber Costa wrote:
>> - Require the admin to preset "lpj=". Nasty, not user friendly.
>> - Pass the proper lpj value via a paravirt interface. Won't cover
>>   fullvirt guests.
>> - Have the management app guarantee a minimum amount of CPU required
>> for proper calibration during guest initialization.
> I don't like any of these solutions, and won't champion any single one,
> so no hard feelings. But IMHO the least bad among them is the paravirt
> one. At least it goes in the general direction of "paravirt if you need
> to scale beyond xyz".

What is worse is that this problem can happen with a single guest, if
the qemu process is starved of CPU for long enough.

> I think passing lpj is out of the question, and giving the CPU resources
> for that time is kind of a kludge.

Yeah, but reserving CPU resources is the only automatic solution I can
think of for fullvirt guests.

> Or maybe we could put the timer expiration alone in a separate thread,  
> with maximum priority (maybe rt priority)? dunno...

The timer expiration already has high priority, since it's emulated with
host kernel timers, so it's pretty close to the real hardware timing.

The problem is when the guest is not given enough CPU power to run the
calibration routines.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-03 13:17 ` Glauber Costa
  2008-07-04 22:51   ` Marcelo Tosatti
@ 2008-07-07  1:56   ` Anthony Liguori
  2008-07-07 18:27     ` Glauber Costa
  1 sibling, 1 reply; 21+ messages in thread
From: Anthony Liguori @ 2008-07-07  1:56 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Marcelo Tosatti, kvm-devel, kraxel, chrisw

Glauber Costa wrote:
> Marcelo Tosatti wrote:
>> Hello,
>>
>> I have been discussing with Glauber and Gerd the problem where KVM
>> guests miscalibrate loops_per_jiffy if there's sufficient load on the
>> host.
>>
>> calibrate_delay_direct() failed to get a good estimate for
>> loops_per_jiffy.
>> Probably due to long platform interrupts. Consider using "lpj=" boot
>> option.
>> Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)
>>
>> While this particular host calculates lpj=1597041.
>>
>> This means that udelay() can delay for less than what was asked for, with
>> fatal results such as:
>>
>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
>> 'noapic' kernel parameter
>>
>> This bug is easily triggered with a CPU-hungry task at nice -20
>> running only during guest calibration (so that the timer check code in
>> io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).
>>
>> The problem is that the calibration routines assume a stable relation
>> between timer interrupt frequency (PIT at this boot stage) and
>> TSC/execution frequency.
>>
>> The emulated timer frequency is based on the host system time and
>> therefore virtually resistant against heavy load, while the execution
>> of these routines in the guest is susceptible to scheduling of the QEMU
>> process.
>>
>> To fix this in a transparent way (without direct "lpj=" boot parameter
>> assignment or a paravirt equivalent), it would be necessary to base the
>> emulated timer frequency on guest execution time instead of host system
>> time. But this can introduce timekeeping issues (recent Linux guests
>> seem to handle lost/late interrupts fine as long as the clocksource is
>> reliable) and just sounds scary.
>>
>> Possible solutions:
>>
>> - Require the admin to preset "lpj=". Nasty, not user friendly.
>> - Pass the proper lpj value via a paravirt interface. Won't cover
>>   fullvirt guests.
>> - Have the management app guarantee a minimum amount of CPU required
>> for proper calibration during guest initialization.
> I don't like any of these solutions, and won't champion any single one,
> so no hard feelings. But IMHO the least bad among them is the paravirt
> one. At least it goes in the general direction of "paravirt if you need
> to scale beyond xyz".

I agree.  A paravirt solution solves the problem.

> I think passing lpj is out of the question, and giving the CPU resources
> for that time is kind of a kludge.

It's all heuristics unfortunately.

> Or maybe we could put the timer expiration alone in a separate thread, 
> with maximum priority (maybe rt priority)? dunno...

But then if you have high load because a lot of guests are running, you 
defeat yourself.  Any attempt to guarantee time to a guest will be 
defeated by lots of guests all attempting calibration at the same time.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-02 16:40 kvm guest loops_per_jiffy miscalibration under host load Marcelo Tosatti
  2008-07-03 13:17 ` Glauber Costa
@ 2008-07-07 18:17 ` Daniel P. Berrange
  1 sibling, 0 replies; 21+ messages in thread
From: Daniel P. Berrange @ 2008-07-07 18:17 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel, gcosta, kraxel, chrisw, aliguori

On Wed, Jul 02, 2008 at 01:40:21PM -0300, Marcelo Tosatti wrote:
> I have been discussing with Glauber and Gerd the problem where KVM
> guests miscalibrate loops_per_jiffy if there's sufficient load on the
> host.
> 
> calibrate_delay_direct() failed to get a good estimate for
> loops_per_jiffy.
> Probably due to long platform interrupts. Consider using "lpj=" boot
> option.
> Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)
> 
> While this particular host calculates lpj=1597041.

[snip]

> Possible solutions:
> 
> - Require the admin to preset "lpj=". Nasty, not user friendly.
> - Pass the proper lpj value via a paravirt interface. Won't cover
>   fullvirt guests.
> - Have the management app guarantee a minimum amount of CPU required
> for proper calibration during guest initialization.

I talked with Marcelo about this latter option - the idea being to have
libvirt call setpriority(PRIO_PROCESS, $pid, -20) on the KVM process for
a short time at startup. The problem with this is that there's no easy
way to determine when, or for how long, to provide this priority boost.
During initial startup there is an arbitrary, unknown delay from the
BIOS: 3 seconds to choose the boot device, a further 20-60 seconds if
you choose PXE before the guest kernel is actually booted, an arbitrary
delay if booting from a CDROM that has no auto-boot timeout, and an
arbitrary user-configurable delay in GRUB to choose a kernel if booting
from hard disk.

So to stand any reasonable chance of working, libvirt would have to give
the -20 priority boost for a good 60 seconds. If we want to start 2 or
more guests at once, this soon stops being a viable workaround. And we
can't detect reboots triggered by the admin inside the guest. So I
don't see how we can reliably solve this from userspace management
tools, unless someone has better ideas than a priority boost for a short
while.

Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-07  1:56   ` Anthony Liguori
@ 2008-07-07 18:27     ` Glauber Costa
  2008-07-07 18:48       ` Marcelo Tosatti
  0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2008-07-07 18:27 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Glauber Costa, Marcelo Tosatti, kvm-devel, kraxel, chrisw

[-- Attachment #1: Type: text/plain, Size: 3864 bytes --]

On Sun, Jul 6, 2008 at 10:56 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> Glauber Costa wrote:
>>
>> Marcelo Tosatti wrote:
>>>
>>> Hello,
>>>
>>> I have been discussing with Glauber and Gerd the problem where KVM
>>> guests miscalibrate loops_per_jiffy if there's sufficient load on the
>>> host.
>>>
>>> calibrate_delay_direct() failed to get a good estimate for
>>> loops_per_jiffy.
>>> Probably due to long platform interrupts. Consider using "lpj=" boot
>>> option.
>>> Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)
>>>
>>> While this particular host calculates lpj=1597041.
>>>
>>> This means that udelay() can delay for less than what was asked for, with
>>> fatal results such as:
>>>
>>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>>> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
>>> 'noapic' kernel parameter
>>>
>>> This bug is easily triggered with a CPU-hungry task at nice -20
>>> running only during guest calibration (so that the timer check code in
>>> io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).
>>>
>>> The problem is that the calibration routines assume a stable relation
>>> between timer interrupt frequency (PIT at this boot stage) and
>>> TSC/execution frequency.
>>>
>>> The emulated timer frequency is based on the host system time and
>>> therefore virtually resistant against heavy load, while the execution
>>> of these routines in the guest is susceptible to scheduling of the QEMU
>>> process.
>>>
>>> To fix this in a transparent way (without direct "lpj=" boot parameter
>>> assignment or a paravirt equivalent), it would be necessary to base the
>>> emulated timer frequency on guest execution time instead of host system
>>> time. But this can introduce timekeeping issues (recent Linux guests
>>> seem to handle lost/late interrupts fine as long as the clocksource is
>>> reliable) and just sounds scary.
>>>
>>> Possible solutions:
>>>
>>> - Require the admin to preset "lpj=". Nasty, not user friendly.
>>> - Pass the proper lpj value via a paravirt interface. Won't cover
>>>  fullvirt guests.
>>> - Have the management app guarantee a minimum amount of CPU required
>>> for proper calibration during guest initialization.
>>
>> I don't like any of these solutions, and won't champion any single one,
>> so no hard feelings. But IMHO the least bad among them is the paravirt
>> one. At least it goes in the general direction of "paravirt if you need
>> to scale beyond xyz".
>
> I agree.  A paravirt solution solves the problem.

Please look at the patch I've attached.

It does __delay() with host help. This may have the nice side effect of
not busy-waiting for long delays.

It is _completely_ PoC, just to show the idea. It's ugly, broken, and
obviously has to go through pv-ops, etc.

Also, I intend to add an lpj field in the kvm clock memory area. We
could do just this later, do both, etc.

If we agree this is a viable solution, I'll start working on a patch.

>> I think passing lpj is out of the question, and giving the CPU resources
>> for that time is kind of a kludge.
>
> It's all heuristics unfortunately.
>
>> Or maybe we could put the timer expiration alone in a separate thread,
>> with maximum priority (maybe rt priority)? dunno...
>
> But then if you have high load because a lot of guests are running, you
> defeat yourself.  Any attempt to guarantee time to a guest will be defeated
> by lots of guests all attempting calibration at the same time.
>
> Regards,
>
> Anthony Liguori
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Glauber Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

[-- Attachment #2: proposal.patch --]
[-- Type: application/octet-stream, Size: 1753 bytes --]

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 63a77ca..4a137d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2539,6 +2539,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	case KVM_HC_MMU_OP:
 		r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
 		break;
+	case KVM_DELAY: {
+		u32 wait; /* make it 64 */
+		wait = a0 / cpu_data(raw_smp_processor_id()).loops_per_jiffy;
+		ret = schedule_timeout_uninterruptible(wait);
+		break;
+	}
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/arch/x86/lib/delay.c b/arch/x86/lib/delay.c
index f456860..28fd9a0 100644
--- a/arch/x86/lib/delay.c
+++ b/arch/x86/lib/delay.c
@@ -17,6 +17,7 @@
 #include <linux/preempt.h>
 #include <linux/delay.h>
 #include <linux/init.h>
+#include <linux/kvm_para.h>
 
 #include <asm/processor.h>
 #include <asm/delay.h>
@@ -84,6 +85,10 @@ static void delay_tsc(unsigned long loops)
 	preempt_enable();
 }
 
+static void delay_pv(unsigned long loops)
+{
+	kvm_hypercall1(KVM_DELAY, loops);
+}
 /*
  * Since we calibrate only once at boot, this
  * function should be set once at boot and not changed
@@ -106,7 +111,10 @@ int __devinit read_current_timer(unsigned long *timer_val)
 
 void __delay(unsigned long loops)
 {
-	delay_fn(loops);
+	if (paravirt_enabled())
+		delay_pv(loops);
+	else
+		delay_fn(loops);
 }
 EXPORT_SYMBOL(__delay);
 
diff --git a/include/asm-x86/kvm_para.h b/include/asm-x86/kvm_para.h
index bfd9900..00fc997 100644
--- a/include/asm-x86/kvm_para.h
+++ b/include/asm-x86/kvm_para.h
@@ -23,6 +23,7 @@
 #define KVM_MMU_OP_WRITE_PTE            1
 #define KVM_MMU_OP_FLUSH_TLB	        2
 #define KVM_MMU_OP_RELEASE_PT	        3
+#define KVM_DELAY			4
 
 /* Payload for KVM_HC_MMU_OP */
 struct kvm_mmu_op_header {

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-07 18:27     ` Glauber Costa
@ 2008-07-07 18:48       ` Marcelo Tosatti
  2008-07-07 19:21         ` Anthony Liguori
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2008-07-07 18:48 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Anthony Liguori, Glauber Costa, kvm-devel, kraxel, chrisw

On Mon, Jul 07, 2008 at 03:27:16PM -0300, Glauber Costa wrote:
> > I agree.  A paravirt solution solves the problem.
> 
> Please look at the patch I've attached.
> 
> It does __delay() with host help. This may have the nice side effect of
> not busy-waiting for long delays.
> 
> It is _completely_ PoC, just to show the idea. It's ugly, broken, and
> obviously has to go through pv-ops, etc.
> 
> Also, I intend to add an lpj field in the kvm clock memory area. We
> could do just this later, do both, etc.
> 
> If we agree this is a viable solution, I'll start working on a patch.

This stops interrupts from being processed during the delay. And also 
there are cases like this recently introduced break:

                /* Allow RT tasks to run */
                preempt_enable();
                rep_nop();
                preempt_disable();

I think it would be better to just pass the lpj value via paravirt and
let the guest busy-loop as usual.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-07 18:48       ` Marcelo Tosatti
@ 2008-07-07 19:21         ` Anthony Liguori
  2008-07-07 19:32           ` Glauber Costa
  0 siblings, 1 reply; 21+ messages in thread
From: Anthony Liguori @ 2008-07-07 19:21 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Glauber Costa, Glauber Costa, kvm-devel, kraxel, chrisw

Marcelo Tosatti wrote:
> On Mon, Jul 07, 2008 at 03:27:16PM -0300, Glauber Costa wrote:
>   
>>> I agree.  A paravirt solution solves the problem.
>>>       
>> Please look at the patch I've attached.
>>
>> It does __delay() with host help. This may have the nice side effect of
>> not busy-waiting for long delays.
>>
>> It is _completely_ PoC, just to show the idea. It's ugly, broken, and
>> obviously has to go through pv-ops, etc.
>>
>> Also, I intend to add an lpj field in the kvm clock memory area. We
>> could do just this later, do both, etc.
>>
>> If we agree this is a viable solution, I'll start working on a patch.
>>     
>
> This stops interrupts from being processed during the delay. And also 
> there are cases like this recently introduced break:
>
>                 /* Allow RT tasks to run */
>                 preempt_enable();
>                 rep_nop();
>                 preempt_disable();
>
> I think it would be better to just pass the lpj value via paravirt and
> let the guest busy-loop as usual.
>   

I agree.  VMI and Xen already pass a cpu_khz paravirt value.  Something 
similar would probably do the trick.

It may be worthwhile having udelay() or spinlocks call into KVM if 
they've been spinning long enough but I think that's a separate discussion.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-07 19:21         ` Anthony Liguori
@ 2008-07-07 19:32           ` Glauber Costa
  2008-07-07 21:35             ` Glauber Costa
  0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2008-07-07 19:32 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Marcelo Tosatti, Glauber Costa, kvm-devel, kraxel, chrisw

On Mon, Jul 7, 2008 at 4:21 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> Marcelo Tosatti wrote:
>>
>> On Mon, Jul 07, 2008 at 03:27:16PM -0300, Glauber Costa wrote:
>>
>>>>
>>>> I agree.  A paravirt solution solves the problem.
>>>>
>>>
>>> Please look at the patch I've attached.
>>>
>>> It does __delay() with host help. This may have the nice side effect of
>>> not busy-waiting for long delays.
>>>
>>> It is _completely_ PoC, just to show the idea. It's ugly, broken, and
>>> obviously has to go through pv-ops, etc.
>>>
>>> Also, I intend to add an lpj field in the kvm clock memory area. We
>>> could do just this later, do both, etc.
>>>
>>> If we agree this is a viable solution, I'll start working on a patch.
>>>
>>
>> This stops interrupts from being processed during the delay. And also
>> there are cases like this recently introduced break:
>>
>>                /* Allow RT tasks to run */
>>                preempt_enable();
>>                rep_nop();
>>                preempt_disable();
>>
>> I think it would be better to just pass the lpj value via paravirt and
>> let the guest busy-loop as usual.
>>
>
> I agree.  VMI and Xen already pass a cpu_khz paravirt value.  Something
> similar would probably do the trick.

Yeah, there is a pv-op for this, so I won't have to mess with the
clock interface. I'll draft a patch for it and send it.

> It may be worthwhile having udelay() or spinlocks call into KVM if they've
> been spinning long enough but I think that's a separate discussion.

I think it is, but I'd have to back it up with numbers. Measurements
are on the way.
> Regards,
>
> Anthony Liguori
>
>



-- 
Glauber Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-07 19:32           ` Glauber Costa
@ 2008-07-07 21:35             ` Glauber Costa
  2008-07-11 21:18               ` David S. Ahern
  0 siblings, 1 reply; 21+ messages in thread
From: Glauber Costa @ 2008-07-07 21:35 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Anthony Liguori, Marcelo Tosatti, kvm-devel, kraxel, chrisw

[-- Attachment #1: Type: text/plain, Size: 1749 bytes --]

Glauber Costa wrote:
> On Mon, Jul 7, 2008 at 4:21 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
>> Marcelo Tosatti wrote:
>>> On Mon, Jul 07, 2008 at 03:27:16PM -0300, Glauber Costa wrote:
>>>
>>>>> I agree.  A paravirt solution solves the problem.
>>>>>
>>>> Please look at the patch I've attached.
>>>>
>>>> It does __delay() with host help. This may have the nice side effect of
>>>> not busy-waiting for long delays.
>>>>
>>>> It is _completely_ PoC, just to show the idea. It's ugly, broken, and
>>>> obviously has to go through pv-ops, etc.
>>>>
>>>> Also, I intend to add an lpj field in the kvm clock memory area. We
>>>> could do just this later, do both, etc.
>>>>
>>>> If we agree this is a viable solution, I'll start working on a patch.
>>>>
>>> This stops interrupts from being processed during the delay. And also
>>> there are cases like this recently introduced break:
>>>
>>>                /* Allow RT tasks to run */
>>>                preempt_enable();
>>>                rep_nop();
>>>                preempt_disable();
>>>
>>> I think it would be better to just pass the lpj value via paravirt and
>>> let the guest busy-loop as usual.
>>>
>> I agree.  VMI and Xen already pass a cpu_khz paravirt value.  Something
>> similar would probably do the trick.
> 
> Yeah, there is a pv-op for this, so I won't have to mess with the
> clock interface. I'll draft a patch for it and send it.
> 
>> It may be worthwhile having udelay() or spinlocks call into KVM if they've
>> been spinning long enough but I think that's a separate discussion.
> 
> I think it is, but I'd have to back it up with numbers. Measurements
> are on the way.
>> Regards,
>>
>> Anthony Liguori
>>
>>
> 
> 
> 
How about this? RFC only for now

[-- Attachment #2: preset.patch --]
[-- Type: text/plain, Size: 2677 bytes --]

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 87edf1c..8514b04 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -78,6 +78,14 @@ static cycle_t kvm_clock_read(void)
 	return ret;
 }
 
+static unsigned long kvm_get_cpu_khz(void)
+{
+	struct pvclock_vcpu_time_info *info;
+
+	info = &per_cpu(hv_clock, 0);
+	return pv_cpu_khz(info);
+}
+
 static struct clocksource kvm_clock = {
 	.name = "kvm-clock",
 	.read = kvm_clock_read,
@@ -153,6 +161,7 @@ void __init kvmclock_init(void)
 		pv_time_ops.get_wallclock = kvm_get_wallclock;
 		pv_time_ops.set_wallclock = kvm_set_wallclock;
 		pv_time_ops.sched_clock = kvm_clock_read;
+		pv_time_ops.get_cpu_khz = kvm_get_cpu_khz;
 #ifdef CONFIG_X86_LOCAL_APIC
 		pv_apic_ops.setup_secondary_clock = kvm_setup_secondary_clock;
 #endif
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 05fbe9a..2d325ae 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -97,6 +97,18 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
 	return dst->version;
 }
 
+unsigned long pv_cpu_khz(struct pvclock_vcpu_time_info *info)
+{
+	u64 cpu_khz = 1000000ULL << 32;
+
+	do_div(cpu_khz, info->tsc_to_system_mul);
+	if (info->tsc_shift < 0)
+		cpu_khz <<= -info->tsc_shift;
+	else
+		cpu_khz >>= info->tsc_shift;
+	return cpu_khz;
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	struct pvclock_shadow_time shadow;
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 64f0038..074cabb 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -200,17 +200,10 @@ unsigned long long xen_sched_clock(void)
 /* Get the CPU speed from Xen */
 unsigned long xen_cpu_khz(void)
 {
-	u64 xen_khz = 1000000ULL << 32;
-	const struct pvclock_vcpu_time_info *info =
+	struct pvclock_vcpu_time_info *info =
 		&HYPERVISOR_shared_info->vcpu_info[0].time;
 
-	do_div(xen_khz, info->tsc_to_system_mul);
-	if (info->tsc_shift < 0)
-		xen_khz <<= -info->tsc_shift;
-	else
-		xen_khz >>= info->tsc_shift;
-
-	return xen_khz;
+	return pv_cpu_khz(info);
 }
 
 static cycle_t xen_clocksource_read(void)
diff --git a/include/asm-x86/pvclock.h b/include/asm-x86/pvclock.h
index 85b1bba..41d816f 100644
--- a/include/asm-x86/pvclock.h
+++ b/include/asm-x86/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+unsigned long pv_cpu_khz(struct pvclock_vcpu_time_info *info);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
 			    struct pvclock_vcpu_time_info *vcpu,
 			    struct timespec *ts);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-07 21:35             ` Glauber Costa
@ 2008-07-11 21:18               ` David S. Ahern
  2008-07-12 14:10                 ` Marcelo Tosatti
  0 siblings, 1 reply; 21+ messages in thread
From: David S. Ahern @ 2008-07-11 21:18 UTC (permalink / raw)
  To: Glauber Costa, Marcelo Tosatti; +Cc: kvm-devel

What's the status with this for full virt guests?

I am still seeing systematic time drifts in RHEL 3 and RHEL 4 guests,
which I've been digging into for the past few days. In the course of
this I have been launching guests with boosted priority (both nice -20
and realtime priority (RR 1)) on a nearly 100% idle host.

One host is a PowerEdge 2950 running RHEL5.2 with kvm-70. With the
realtime priority boost I have routinely seen bogomips values in the
guest which do not make sense, e.g.:

ksyms.2:bogomips        : 4639.94
ksyms.2:bogomips        : 4653.05
ksyms.2:bogomips        : 4653.05
ksyms.2:bogomips        : 24.52

and

ksyms.3:bogomips        : 4639.94
ksyms.3:bogomips        : 4653.05
ksyms.3:bogomips        : 16.33
ksyms.3:bogomips        : 12.87


Also, if I launch qemu with the "-no-kvm-pit -tdf" option, the guest
panics with the message Marcelo posted at the start of the thread:

----

Calibrating delay loop... 4653.05 BogoMIPS
CPU: L2 cache: 2048K
Intel machine check reporting enabled on CPU#2.
CPU2: Intel QEMU Virtual CPU version 0.9.1 stepping 03
Booting processor 3/3 eip 2000
Initializing CPU#3
masked ExtINT on CPU#3
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 19.60 BogoMIPS
CPU: L2 cache: 2048K
Intel machine check reporting enabled on CPU#3.
CPU3: Intel QEMU Virtual CPU version 0.9.1 stepping 03
Total of 4 processors activated (14031.20 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 4 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 4 ... ok.
..TIMER: vector=0x31 pin1=0 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...  failed.
...trying to set up timer as Virtual Wire IRQ... failed.
...trying to set up timer as ExtINT IRQ... failed :(.
Kernel panic: IO-APIC + timer doesn't work! pester mingo@redhat.com

----

I'm just looking for stable guest times. I'm not planning to keep the
boosted guest priority, just using it to ensure the guest is not
interrupted as I try to understand why the guest systematically drifts.

david


Glauber Costa wrote:
> Glauber Costa wrote:
>> On Mon, Jul 7, 2008 at 4:21 PM, Anthony Liguori <aliguori@us.ibm.com>
>> wrote:
>>> Marcelo Tosatti wrote:
>>>> On Mon, Jul 07, 2008 at 03:27:16PM -0300, Glauber Costa wrote:
>>>>
>>>>>> I agree.  A paravirt solution solves the problem.
>>>>>>
>>>>> Please, look at the patch I've attached.
>>>>>
>>>>> It does __delay() with host help. This may have the nice effect of
>>>>> not busy-waiting for long enough delays.
>>>>>
>>>>> It is _completely_ PoC, just to show the idea. It's ugly, broken, and
>>>>> obviously has to go through pv-ops, etc.
>>>>>
>>>>> Also, I intend to add an lpj field in the kvm clock memory area. We
>>>>> could do just this later, do both, etc.
>>>>>
>>>>> If we agree this is a viable solution, I'll start working on a patch
>>>>>
>>>> This stops interrupts from being processed during the delay. And there
>>>> are also cases like this recently introduced break:
>>>>
>>>>                /* Allow RT tasks to run */
>>>>                preempt_enable();
>>>>                rep_nop();
>>>>                preempt_disable();
>>>>
>>>> I think it would be better to just pass the lpj value via paravirt and
>>>> let the guest busy-loop as usual.
>>>>
>>> I agree.  VMI and Xen already pass a cpu_khz paravirt value.  Something
>>> similar would probably do the trick.
>>
>> yeah, there is a pv-op for this, so I won't have to mess with the
>> clock interface. I'll draft a patch for it, and send it.
>>
>>> It may be worthwhile having udelay() or spinlocks call into KVM if
>>> they've been spinning long enough, but I think that's a separate
>>> discussion.
>>
>> I think it is, but I'd have to back it up with numbers. Measurements
>> are on the way.
>>> Regards,
>>>
>>> Anthony Liguori
>>>
>>>
>>
>>
>>
> How about this? RFC only for now
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-11 21:18               ` David S. Ahern
@ 2008-07-12 14:10                 ` Marcelo Tosatti
  2008-07-12 19:28                   ` David S. Ahern
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2008-07-12 14:10 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Glauber Costa, kvm-devel

Hi David,

On Fri, Jul 11, 2008 at 03:18:54PM -0600, David S. Ahern wrote:
> What's the status with this for full virt guests?

The consensus seems to be that fullvirt guests need assistance from the
management app (libvirt) to have boosted priority during their boot
stage, so loops_per_jiffy calibration can be performed safely. As Daniel
pointed out this is tricky because you can't know for sure how long the
boot up will take, if for example PXE is used.

Glauber is working on some paravirt patches to remedy the situation.

But loops_per_jiffy is not directly related to clock drifts, so this
is a separate problem.

> I am still seeing systematic time drifts in RHEL 3 and RHEL4 guests
> which I've been digging into for the past few days.

All time drift issues we were aware of are fixed in kvm-70. Can you
please provide more details on how you see the time drifting with
RHEL3/4 guests? Does it drift slowly but continually, or are there
large drifts at once? Are they using TSC or the ACPI PM timer as
clocksource?

Also, most issues we've seen could only be replicated with dyntick
guests.

I'll try to reproduce it locally.

> In the course of it I have been launching guests with boosted priority
> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
> host.

Can you also see whacked bogomips without boosting the guest priority?

> 
> One host is a PowerEdge 2950 running RHEL5.2 with kvm-70. 
> With the realtime priority boot I have routinely seen bogomips in the
> guest which do not make sense. e.g.,
> 
> ksyms.2:bogomips        : 4639.94
> ksyms.2:bogomips        : 4653.05
> ksyms.2:bogomips        : 4653.05
> ksyms.2:bogomips        : 24.52
> 
> and
> 
> ksyms.3:bogomips        : 4639.94
> ksyms.3:bogomips        : 4653.05
> ksyms.3:bogomips        : 16.33
> ksyms.3:bogomips        : 12.87

I'll look into it.

> 
> 
> Also, if I launch qemu with the "-no-kvm-pit -tdf" options, the guest
> panics with the message Marcelo posted at the start of the thread:
> 
> ----
> 
> Calibrating delay loop... 4653.05 BogoMIPS
> 
> CPU: L2 cache: 2048K
> 
> Intel machine check reporting enabled on CPU#2.
> 
> CPU2: Intel QEMU Virtual CPU version 0.9.1 stepping 03
> 
> Booting processor 3/3 eip 2000
> 
> Initializing CPU#3
> 
> masked ExtINT on CPU#3
> 
> ESR value before enabling vector: 00000000
> 
> ESR value after enabling vector: 00000000
> 
> Calibrating delay loop... 19.60 BogoMIPS
> 
> CPU: L2 cache: 2048K
> 
> Intel machine check reporting enabled on CPU#3.
> 
> CPU3: Intel QEMU Virtual CPU version 0.9.1 stepping 03
> 
> Total of 4 processors activated (14031.20 BogoMIPS).
> 
> ENABLING IO-APIC IRQs
> 
> Setting 4 in the phys_id_present_map
> ...changing IO-APIC physical APIC ID to 4 ... ok.
> ..TIMER: vector=0x31 pin1=0 pin2=-1
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> ...trying to set up timer (IRQ0) through the 8259A ...  failed.
> ...trying to set up timer as Virtual Wire IRQ... failed.
> ...trying to set up timer as ExtINT IRQ... failed :(.
> Kernel panic: IO-APIC + timer doesn't work! pester mingo@redhat.com
> 
> ----
> 
> I'm just looking for stable guest times. I'm not planning to keep the
> boosted guest priority, just using it to ensure the guest is not
> interrupted as I try to understand why the guest systematically drifts.
> 
> david

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-12 14:10                 ` Marcelo Tosatti
@ 2008-07-12 19:28                   ` David S. Ahern
  0 siblings, 0 replies; 21+ messages in thread
From: David S. Ahern @ 2008-07-12 19:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Glauber Costa, kvm-devel

[-- Attachment #1: Type: text/plain, Size: 2992 bytes --]


Marcelo Tosatti wrote:
> Hi David,
> 
> On Fri, Jul 11, 2008 at 03:18:54PM -0600, David S. Ahern wrote:
>> What's the status with this for full virt guests?
> 
> The consensus seems to be that fullvirt guests need assistance from the
> management app (libvirt) to have boosted priority during their boot
> stage, so loops_per_jiffy calibration can be performed safely. As Daniel
> pointed out this is tricky because you can't know for sure how long the
> boot up will take, if for example PXE is used.

I boosted the priority of the guest to investigate an idea that maybe
some startup calibration in the guest was off slightly, leading to
systematic drifting. I was on vacation last week and I am still catching
up with traffic on this list. I just happened to see your first message
with the panic which aligned with one of my tests.

> 
> Glauber is working on some paravirt patches to remedy the situation.
> 
> But loops_per_jiffy is not directly related to clock drifts, so this
> is a separate problem.
> 
>> I am still seeing systematic time drifts in RHEL 3 and RHEL4 guests
>> which I've been digging into for the past few days.
> 
> All time drift issues we were aware of are fixed in kvm-70. Can you
> please provide more details on how you see the time drifting with
> RHEL3/4 guests? Does it drift slowly but continually, or are there
> large drifts at once? Are they using TSC or the ACPI PM timer as
> clocksource?

The attached file shows one example of the drift I am seeing. It's for a
4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
pinned to a physical cpu using taskset. The only activity on the host is
this one single guest; the guest is relatively idle -- about 4% activity
(~1% user, ~3% system time). Host is synchronized to an ntp server; the
guest is not. The guest is started with the -localtime parameter.  From
the file you can see the guest gains about 1-2 seconds every 5 minutes.

Since it's a RHEL3 guest I believe the PIT is the only choice (how to
confirm?), though it does read the TSC (i.e., use_tsc is 1).

> 
> Also, most issues we've seen could only be replicated with dyntick
> guests.
> 
> I'll try to reproduce it locally.
> 
>> In the course of it I have been launching guests with boosted priority
>> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
>> host.
> 
> Can you also see whacked bogomips without boosting the guest priority?

The whacked bogomips only show up when started with real-time priority.
With the 'nice -20' it's sane and close to what the host shows.

As another data point I restarted the RHEL3 guest using the -no-kvm-pit
and -tdf options (nice -20 priority boost). After 22 hours of uptime,
the guest is 29 seconds *behind* the host. Using the in-kernel pit the
guest time is always fast compared to the host.

I've seen similar drifting in RHEL4 guests, but I have not spent as much
time investigating it yet. On ESX adding clock=pit to the boot
parameters for RHEL4 guests helps immensely.

david

[-- Attachment #2: time-data --]
[-- Type: text/plain, Size: 5383 bytes --]

host-dt   host time    guest time    guest-host-diff
    300   1215748151   1215748262    111
    301   1215748452   1215748563    111
    300   1215748752   1215748865    113
    300   1215749052   1215749165    113
    300   1215749352   1215749466    114
    301   1215749653   1215749768    115
    300   1215749953   1215750069    116
    300   1215750253   1215750369    116
    300   1215750553   1215750671    118
    301   1215750854   1215750972    118
    300   1215751154   1215751273    119
    300   1215751454   1215751575    121
    300   1215751754   1215751875    121
    301   1215752055   1215752176    121
    300   1215752355   1215752477    122
    300   1215752655   1215752780    125
    300   1215752955   1215753083    128
    301   1215753256   1215753385    129
    300   1215753556   1215753686    130
    300   1215753856   1215753988    132
    300   1215754156   1215754289    133
    301   1215754457   1215754592    135
    300   1215754757   1215754894    137
    300   1215755057   1215755198    141
    300   1215755357   1215755499    142
    301   1215755658   1215755799    141
    300   1215755958   1215756101    143
    300   1215756258   1215756402    144
    300   1215756558   1215756702    144
    301   1215756859   1215757005    146
    300   1215757159   1215757307    148
    300   1215757459   1215757609    150
    301   1215757760   1215757910    150
    300   1215758060   1215758211    151
    300   1215758360   1215758515    155
    300   1215758660   1215758816    156
    301   1215758961   1215759118    157
    300   1215759261   1215759418    157
    300   1215759561   1215759720    159
    300   1215759861   1215760022    161
    301   1215760162   1215760323    161
    300   1215760462   1215760625    163
    300   1215760762   1215760927    165
    300   1215761062   1215761229    167
    301   1215761363   1215761532    169
    300   1215761663   1215761833    170
    300   1215761963   1215762136    173
    300   1215762263   1215762439    176
    301   1215762564   1215762741    177
    300   1215762864   1215763043    179
    300   1215763164   1215763348    184
    300   1215763464   1215763649    185
    301   1215763765   1215763950    185
    300   1215764065   1215764251    186
    300   1215764365   1215764552    187
    300   1215764665   1215764854    189
    301   1215764966   1215765154    188
    300   1215765266   1215765455    189
    300   1215765566   1215765757    191
    300   1215765866   1215766058    192
    300   1215766166   1215766358    192
    301   1215766467   1215766666    199
    300   1215766767   1215766966    199
    300   1215767067   1215767269    202
    300   1215767367   1215767571    204
    301   1215767668   1215767871    203
    300   1215767968   1215768173    205
    300   1215768268   1215768476    208
    300   1215768568   1215768777    209
    301   1215768869   1215769078    209
    300   1215769169   1215769379    210
    300   1215769469   1215769682    213
    300   1215769769   1215769984    215
    301   1215770070   1215770286    216
    300   1215770370   1215770589    219
    300   1215770670   1215770890    220
    301   1215770971   1215771191    220
    300   1215771271   1215771496    225
    300   1215771571   1215771798    227
    300   1215771871   1215772100    229
    301   1215772172   1215772403    231
    300   1215772472   1215772704    232
    300   1215772772   1215773007    235
    300   1215773072   1215773307    235
    301   1215773373   1215773613    240
    300   1215773673   1215773916    243
    300   1215773973   1215774220    247
    300   1215774273   1215774520    247
    301   1215774574   1215774822    248
    300   1215774874   1215775122    248
    300   1215775174   1215775425    251
    300   1215775474   1215775725    251
    301   1215775775   1215776027    252
    300   1215776075   1215776327    252
    300   1215776375   1215776628    253
    300   1215776675   1215776928    253
    301   1215776976   1215777230    254
    300   1215777276   1215777531    255
    300   1215777576   1215777833    257
    300   1215777876   1215778136    260
    301   1215778177   1215778435    258
    300   1215778477   1215778736    259
    300   1215778777   1215779037    260
    300   1215779077   1215779337    260
    301   1215779378   1215779640    262
    300   1215779678   1215779941    263
    300   1215779978   1215780241    263
    300   1215780278   1215780544    266
    301   1215780579   1215780843    264
    300   1215780879   1215781146    267
    300   1215781179   1215781448    269
    300   1215781479   1215781749    270
    301   1215782269   1215782544    275
    300   1215782569   1215782844    275
    300   1215782869   1215783145    276
    300   1215783169   1215783447    278
    301   1215783470   1215783749    279
    300   1215783770   1215784050    280
    300   1215784070   1215784352    282
    300   1215784370   1215784654    284
    301   1215784671   1215784955    284
    300   1215784971   1215785258    287
    300   1215785271   1215785559    288
    300   1215785571   1215785861    290
    301   1215785872   1215786162    290
    300   1215786172   1215786463    291
    300   1215786472   1215786767    295
    300   1215786772   1215787068    296
    301   1215787073   1215787369    296
    300   1215787373   1215787670    297

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
@ 2008-07-22  3:25 Marcelo Tosatti
  2008-07-22  8:22 ` Jan Kiszka
  2008-07-22 19:56 ` David S. Ahern
  0 siblings, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2008-07-22  3:25 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Glauber Costa, kvm-devel

On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
> > All time drift issues we were aware of are fixed in kvm-70. Can you
> > please provide more details on how you see the time drifting with
> > RHEL3/4 guests? Does it drift slowly but continually, or are there
> > large drifts at once? Are they using TSC or the ACPI PM timer as
> > clocksource?
> 
> The attached file shows one example of the drift I am seeing. It's for a
> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
> pinned to a physical cpu using taskset. The only activity on the host is
> this one single guest; the guest is relatively idle -- about 4% activity
> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
> guest is not. The guest is started with the -localtime parameter.  From
> the file you can see the guest gains about 1-2 seconds every 5 minutes.
> 
> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
> confirm?), though it does read the TSC (ie., use_tsc is 1).

Since it's an SMP guest, I believe it's using the PIT to generate
periodic timers and the ACPI pmtimer as a clocksource.

> > Also, most issues we've seen could only be replicated with dyntick
> > guests.
> > 
> > I'll try to reproduce it locally.
> > 
> >> In the course of it I have been launching guests with boosted priority
> >> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
> >> host.
> > 
> > Can you also see whacked bogomips without boosting the guest priority?
> 
> The whacked bogomips only show up when started with real-time priority.
> With the 'nice -20' it's sane and close to what the host shows.
> 
> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
> guest time is always fast compared to the host.
> 
> I've seen similar drifting in RHEL4 guests, but I have not spent as much
> time investigating it yet. On ESX adding clock=pit to the boot
> parameters for RHEL4 guests helps immensely.

The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is the
lost-tick and irq-latency adjustments, as mentioned in the VMware paper
(http://www.vmware.com/pdf/vmware_timekeeping.pdf). These kernels try to
detect lost ticks and compensate by advancing the clock. But the delay
between the host timer firing, the injection of the guest irq, and the
actual counter read (either tsc or pmtimer) fools these adjustments.
clock=pit has no such lost-tick detection, so it is susceptible to lost
ticks under load (in theory).

The reason qemu's emulation is less susceptible to the guest clock
running faster than it should is that the emulated PIT timer is rearmed
relative to alarm processing (next_expiration = current_time + count).
But that also means it is susceptible to host load, i.e. the frequency
is virtual.

The in-kernel PIT rearms relative to the host clock, so the frequency is
more reliable (next_expiration = prev_expiration + count).

So for RHEL4, clock=pit along with the following patch seems stable for
me, no drift either direction, even under guest/host load. Can you give
it a try with RHEL3? I'll be doing that shortly.


----------

Set the count load time to when the count is actually "loaded", not when
IRQ is injected.

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index c0f7872..b39b141 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
 
 	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
 	pt->scheduled = ktime_to_ns(pt->timer.expires);
+	ps->channels[0].count_load_time = pt->timer.expires;
 
 	return (pt->period == 0 ? 0 : 1);
 }
@@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
 		  arch->vioapic->redirtbl[0].fields.mask != 1))) {
 			ps->inject_pending = 1;
 			atomic_dec(&ps->pit_timer.pending);
-			ps->channels[0].count_load_time = ktime_get();
 		}
 	}
 }



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22  3:25 Marcelo Tosatti
@ 2008-07-22  8:22 ` Jan Kiszka
  2008-07-22 12:49   ` Marcelo Tosatti
  2008-07-22 19:56 ` David S. Ahern
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Kiszka @ 2008-07-22  8:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: David S. Ahern, Glauber Costa, kvm-devel

Marcelo Tosatti wrote:
> On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
>>> All time drift issues we were aware of are fixed in kvm-70. Can you
>>> please provide more details on how you see the time drifting with
>>> RHEL3/4 guests? Does it drift slowly but continually, or are there
>>> large drifts at once? Are they using TSC or the ACPI PM timer as
>>> clocksource?
>> The attached file shows one example of the drift I am seeing. It's for a
>> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
>> pinned to a physical cpu using taskset. The only activity on the host is
>> this one single guest; the guest is relatively idle -- about 4% activity
>> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
>> guest is not. The guest is started with the -localtime parameter.  From
>> the file you can see the guest gains about 1-2 seconds every 5 minutes.
>>
>> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
>> confirm?), though it does read the TSC (ie., use_tsc is 1).
> 
> Since it's an SMP guest, I believe it's using the PIT to generate
> periodic timers and the ACPI pmtimer as a clocksource.
> 
>>> Also, most issues we've seen could only be replicated with dyntick
>>> guests.
>>>
>>> I'll try to reproduce it locally.
>>>
>>>> In the course of it I have been launching guests with boosted priority
>>>> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
>>>> host.
>>> Can you also see whacked bogomips without boosting the guest priority?
>> The whacked bogomips only show up when started with real-time priority.
>> With the 'nice -20' it's sane and close to what the host shows.
>>
>> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
>> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
>> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
>> guest time is always fast compared to the host.
>>
>> I've seen similar drifting in RHEL4 guests, but I have not spent as much
>> time investigating it yet. On ESX adding clock=pit to the boot
>> parameters for RHEL4 guests helps immensely.
> 
> The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is the
> lost-tick and irq-latency adjustments, as mentioned in the VMware paper
> (http://www.vmware.com/pdf/vmware_timekeeping.pdf). These kernels try to
> detect lost ticks and compensate by advancing the clock. But the delay
> between the host timer firing, the injection of the guest irq, and the
> actual counter read (either tsc or pmtimer) fools these adjustments.
> clock=pit has no such lost-tick detection, so it is susceptible to lost
> ticks under load (in theory).
> 
> The reason qemu's emulation is less susceptible to the guest clock
> running faster than it should is that the emulated PIT timer is rearmed
> relative to alarm processing (next_expiration = current_time + count).
> But that also means it is susceptible to host load, i.e. the frequency
> is virtual.
> 
> The in-kernel PIT rearms relative to the host clock, so the frequency is
> more reliable (next_expiration = prev_expiration + count).

The same happens under plain QEMU:

static void pit_irq_timer_update(PITChannelState *s, int64_t current_time);

static void pit_irq_timer(void *opaque)
{
    PITChannelState *s = opaque;

    pit_irq_timer_update(s, s->next_transition_time);
}

In my experience QEMU's PIT suffers from lost ticks under load
(when some delay gets larger than 2*period).

I recently played a bit with QEMU's new icount feature. That one tracks
the guest's progress based on a virtual instruction pointer and derives
QEMU's virtual clock from it, but it also tries to keep that clock in
sync with the host by periodically adjusting its scaling factor (a kind
of virtual CPU frequency tuning to keep the TSC in sync with real
time). It works quite nicely, but my feeling is that the adjustment is
not 100% stable yet.

Maybe such a pattern could be applied to kvm as well, with tsc_vmexit -
tsc_vmentry serving as the "guest progress counter" (instead of icount,
which depends on QEMU's code translator).

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22  8:22 ` Jan Kiszka
@ 2008-07-22 12:49   ` Marcelo Tosatti
  2008-07-22 15:54     ` Jan Kiszka
  2008-07-22 22:00     ` Dor Laor
  0 siblings, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2008-07-22 12:49 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: David S. Ahern, Glauber Costa, kvm-devel

On Tue, Jul 22, 2008 at 10:22:00AM +0200, Jan Kiszka wrote:
> > The in-kernel PIT rearms relative to host clock, so the frequency is
> > more reliable (next_expiration = prev_expiration + count).
> 
> The same happens under plain QEMU:
> 
> static void pit_irq_timer_update(PITChannelState *s, int64_t current_time);
> 
> static void pit_irq_timer(void *opaque)
> {
>     PITChannelState *s = opaque;
> 
>     pit_irq_timer_update(s, s->next_transition_time);
> }

True. I misread "current_time".

> In my experience QEMU's PIT suffers from lost ticks under load
> (when some delay gets larger than 2*period).

Yes, with clock=pit on RHEL4 it's quite noticeable, even with -tdf. The
in-kernel timer seems immune to that under the load I was testing.

> I recently played a bit with QEMU's new icount feature. That one tracks
> the guest's progress based on a virtual instruction pointer and derives
> QEMU's virtual clock from it, but it also tries to keep that clock in
> sync with the host by periodically adjusting its scaling factor (a kind
> of virtual CPU frequency tuning to keep the TSC in sync with real
> time). It works quite nicely, but my feeling is that the adjustment is
> not 100% stable yet.
> 
> Maybe such a pattern could be applied to kvm as well, with tsc_vmexit -
> tsc_vmentry serving as the "guest progress counter" (instead of icount,
> which depends on QEMU's code translator).

I see. Do you have patches around?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22 12:49   ` Marcelo Tosatti
@ 2008-07-22 15:54     ` Jan Kiszka
  2008-07-22 22:00     ` Dor Laor
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Kiszka @ 2008-07-22 15:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: David S. Ahern, Glauber Costa, kvm-devel

Marcelo Tosatti wrote:
> On Tue, Jul 22, 2008 at 10:22:00AM +0200, Jan Kiszka wrote:
>>> The in-kernel PIT rearms relative to host clock, so the frequency is
>>> more reliable (next_expiration = prev_expiration + count).
>> The same happens under plain QEMU:
>>
>> static void pit_irq_timer_update(PITChannelState *s, int64_t current_time);
>>
>> static void pit_irq_timer(void *opaque)
>> {
>>     PITChannelState *s = opaque;
>>
>>     pit_irq_timer_update(s, s->next_transition_time);
>> }
> 
> True. I misread "current_time".
> 
>> In my experience QEMU's PIT suffers from lost ticks under load
>> (when some delay gets larger than 2*period).
> 
> Yes, with clock=pit on RHEL4 it's quite noticeable, even with -tdf. The
> in-kernel timer seems immune to that under the load I was testing.
> 
>> I recently played a bit with QEMU's new icount feature. That one tracks
>> the guest's progress based on a virtual instruction pointer and derives
>> QEMU's virtual clock from it, but it also tries to keep that clock in
>> sync with the host by periodically adjusting its scaling factor (a kind
>> of virtual CPU frequency tuning to keep the TSC in sync with real
>> time). It works quite nicely, but my feeling is that the adjustment is
>> not 100% stable yet.
>>
>> Maybe such a pattern could be applied to kvm as well, with tsc_vmexit -
>> tsc_vmentry serving as the "guest progress counter" (instead of icount,
>> which depends on QEMU's code translator).
> 
> I see. Do you have patches around?

Unfortunately, not. It's so far just a vague idea how it /may/ work -
I'm lacking time to study or even implement details ATM.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22  3:25 Marcelo Tosatti
  2008-07-22  8:22 ` Jan Kiszka
@ 2008-07-22 19:56 ` David S. Ahern
  2008-07-23  2:57   ` David S. Ahern
  2008-07-29 14:58   ` Marcelo Tosatti
  1 sibling, 2 replies; 21+ messages in thread
From: David S. Ahern @ 2008-07-22 19:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Glauber Costa, kvm-devel

I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
short of it is that all of them keep time quite well with 1 vcpu. In the
case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
smp kernels, again with only 1 vcpu (there's no up/smp distinction in
the kernels for RHEL5).

As soon as the number of vcpus is >1, time drifts systematically with
the guest *leading* the host. I see this on unloaded guests and hosts
(i.e., cpu usage on the host ~<5%). The drift is averaging around
0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
wall time).

This is very reproducible. All I am doing is installing stock RHEL3.8,
4.4, and 5.2, i386 versions, starting them, and watching the drift with
no time servers. In all of these recent cases the results are for the
in-kernel PIT.

More inline below.


Marcelo Tosatti wrote:
> On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
>>> All time drift issues we were aware of are fixed in kvm-70. Can you
>>> please provide more details on how you see the time drifting with
>>> RHEL3/4 guests? Does it drift slowly but continually, or are there
>>> large drifts at once? Are they using TSC or the ACPI PM timer as
>>> clocksource?
>> The attached file shows one example of the drift I am seeing. It's for a
>> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
>> pinned to a physical cpu using taskset. The only activity on the host is
>> this one single guest; the guest is relatively idle -- about 4% activity
>> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
>> guest is not. The guest is started with the -localtime parameter.  From
>> the file you can see the guest gains about 1-2 seconds every 5 minutes.
>>
>> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
>> confirm?), though it does read the TSC (ie., use_tsc is 1).
> 
> Since it's an SMP guest, I believe it's using the PIT to generate
> periodic timers and the ACPI pmtimer as a clocksource.

Since my last post, I've been reading up on timekeeping and going
through the kernel code -- focusing on RHEL3 at the moment. AFAICT the
PIT is used for timekeeping, and the local APIC timer interrupts are
used as well (supposedly just for per-cpu system accounting, though I
have not gone through all of the code yet). I do not see references in
dmesg data regarding pmtimer; I thought RHEL3 was not ACPI aware.

> 
>>> Also, most issues we've seen could only be replicated with dyntick
>>> guests.
>>>
>>> I'll try to reproduce it locally.
>>>
>>>> In the course of it I have been launching guests with boosted priority
>>>> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
>>>> host.
>>> Can you also see whacked bogomips without boosting the guest priority?
>> The whacked bogomips only show up when started with real-time priority.
>> With the 'nice -20' it's sane and close to what the host shows.
>>
>> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
>> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
>> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
>> guest time is always fast compared to the host.
>>
>> I've seen similar drifting in RHEL4 guests, but I have not spent as much
>> time investigating it yet. On ESX adding clock=pit to the boot
>> parameters for RHEL4 guests helps immensely.
> 
> The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is the
> lost-tick and irq-latency adjustments, as mentioned in the VMware paper
> (http://www.vmware.com/pdf/vmware_timekeeping.pdf). These kernels try to
> detect lost ticks and compensate by advancing the clock. But the delay
> between the host timer firing, the injection of the guest irq, and the
> actual counter read (either tsc or pmtimer) fools these adjustments.
> clock=pit has no such lost-tick detection, so it is susceptible to lost
> ticks under load (in theory).

I have read that document quite a few times; clock=pit is required on
esx for rhel4 guests to be sane.

> 
> The fact that qemu emulation is less susceptible to the guest clock
> running faster than it should is because the emulated PIT timer is
> rearmed relative to alarm processing (next_expiration = current_time +
> count). But that also means it is susceptible to host load, i.e. the
> frequency is virtual.
> 
> The in-kernel PIT rearms relative to host clock, so the frequency is
> more reliable (next_expiration = prev_expiration + count).
> 
> So for RHEL4, clock=pit along with the following patch seems stable for
> me, no drift either direction, even under guest/host load. Can you give
> it a try with RHEL3 ? I'll be doing that shortly.

I'll give it a shot and let you know.

david

> 
> 
> ----------
> 
> Set the count load time to when the count is actually "loaded", not when
> IRQ is injected.
> 
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index c0f7872..b39b141 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>  
>  	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
>  	pt->scheduled = ktime_to_ns(pt->timer.expires);
> +	ps->channels[0].count_load_time = pt->timer.expires;
>  
>  	return (pt->period == 0 ? 0 : 1);
>  }
> @@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
>  		  arch->vioapic->redirtbl[0].fields.mask != 1))) {
>  			ps->inject_pending = 1;
>  			atomic_dec(&ps->pit_timer.pending);
> -			ps->channels[0].count_load_time = ktime_get();
>  		}
>  	}
>  }
> 
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22 12:49   ` Marcelo Tosatti
  2008-07-22 15:54     ` Jan Kiszka
@ 2008-07-22 22:00     ` Dor Laor
  1 sibling, 0 replies; 21+ messages in thread
From: Dor Laor @ 2008-07-22 22:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Jan Kiszka, David S. Ahern, Glauber Costa, kvm-devel

Marcelo Tosatti wrote:
> On Tue, Jul 22, 2008 at 10:22:00AM +0200, Jan Kiszka wrote:
>   
>>> The in-kernel PIT rearms relative to host clock, so the frequency is
>>> more reliable (next_expiration = prev_expiration + count).
>>>       
>> The same happens under plain QEMU:
>>
>> static void pit_irq_timer_update(PITChannelState *s, int64_t current_time);
>>
>> static void pit_irq_timer(void *opaque)
>> {
>>     PITChannelState *s = opaque;
>>
>>     pit_irq_timer_update(s, s->next_transition_time);
>> }
>>     
>
> True. I misread "current_time".
>
>   
>> To my experience QEMU's PIT is suffering from lost ticks under load
>> (when some delay gets larger than 2*period).
>>     
>
> Yes, with clock=pit on RHEL4 it's quite noticeable, even with -tdf. The
>   
Note that -tdf works only when you use the userspace irqchip as well.
> in-kernel timer seems immune to that under the load I was testing.
>
>   
In the long run we should try to remove the in-kernel PIT. Currently it
does handle the PIT irq coalescing problem that leads to time drift.
The problem is that it's not yet 100% production level: migration with
it still has some issues, and in general we should not duplicate
userspace code unless there is a good reason (like performance).

There are patches floating around by Gleb Natapov for the PIT and
virtual RTC to prevent time drifts.
Hope they'll get accepted into QEMU.
>> I recently played a bit with QEMU's new icount feature. That one tracks
>> the guest progress based on a virtual instruction pointer, derives the
>> QEMU's virtual clock from it, but also tries to keep that clock in sync
>> with the host by periodically adjusting its scaling factor (kind of
>> virtual CPU frequency tuning to keep the TSC in sync with real time).
>> Works quite nicely, but my feeling is that the adjustment is not 100%
>> stable yet.
>>
>> Maybe such pattern could be applied on kvm as well with tsc_vmexit -
>> tsc_vmentry serving as "guest progress counter" (instead of icount which
>> depends on QEMU's code translator).
>>     
>
> I see. Do you have patches around?
>
>



* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22 19:56 ` David S. Ahern
@ 2008-07-23  2:57   ` David S. Ahern
  2008-07-29 14:58   ` Marcelo Tosatti
  1 sibling, 0 replies; 21+ messages in thread
From: David S. Ahern @ 2008-07-23  2:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel



David S. Ahern wrote:
>> The in-kernel PIT rearms relative to host clock, so the frequency is
>> more reliable (next_expiration = prev_expiration + count).
>>
>> So for RHEL4, clock=pit along with the following patch seems stable for
>> me, no drift either direction, even under guest/host load. Can you give
>> it a try with RHEL3 ? I'll be doing that shortly.
> 
> I'll give it a shot and let you know.

After 6 hours 46 minutes of uptime, my RHEL4 guest is only 7 seconds
ahead of the host. The RHEL3 guest is 17 seconds ahead. Both are
dramatic improvements with the patch.

david

>>
>> ----------
>>
>> Set the count load time to when the count is actually "loaded", not when
>> IRQ is injected.
>>
>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>> index c0f7872..b39b141 100644
>> --- a/arch/x86/kvm/i8254.c
>> +++ b/arch/x86/kvm/i8254.c
>> @@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>>  
>>  	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
>>  	pt->scheduled = ktime_to_ns(pt->timer.expires);
>> +	ps->channels[0].count_load_time = pt->timer.expires;
>>  
>>  	return (pt->period == 0 ? 0 : 1);
>>  }
>> @@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
>>  		  arch->vioapic->redirtbl[0].fields.mask != 1))) {
>>  			ps->inject_pending = 1;
>>  			atomic_dec(&ps->pit_timer.pending);
>> -			ps->channels[0].count_load_time = ktime_get();
>>  		}
>>  	}
>>  }
>>




* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22 19:56 ` David S. Ahern
  2008-07-23  2:57   ` David S. Ahern
@ 2008-07-29 14:58   ` Marcelo Tosatti
  1 sibling, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2008-07-29 14:58 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Glauber Costa, kvm-devel

On Tue, Jul 22, 2008 at 01:56:12PM -0600, David S. Ahern wrote:
> I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
> short of it is that all of them keep time quite well with 1 vcpu. In the
> case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
> smp kernels, again with only 1 vcpu (there's no up/smp distinction in
> the kernels for RHEL5).
> 
> As soon as the number of vcpus is >1, time drifts systematically with
> the guest *leading* the host. I see this on unloaded guests and hosts
> (ie., cpu usage on the host ~<5%). The drift is averaging around
> 0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
> wall time).
> 
> This is very reproducible. All I am doing is installing stock RHEL3.8,
> 4.4 and 5.2, i386 versions, starting them and watching the drift with
> no time servers. In all of these recent cases the results are for the
> in-kernel pit.

David,

You mentioned earlier problems with ntpd syncing the guest time? Can you
provide more details?

I find it _necessary_ to use the RR scheduling policy for any Linux
guest running at a static 1000Hz (no dynticks), otherwise timer
interrupts will invariably be missed. And reinjection plus lost tick
adjustment is always problematic (it will drift either way, depending
on which version of Linux). With the standard scheduling policy, _idle_
guests can wait up to 6-7 ms to run in my testing (thus 6-7 lost timer
events), which also means latency can be horrible.


end of thread, other threads:[~2008-07-29 14:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-02 16:40 kvm guest loops_per_jiffy miscalibration under host load Marcelo Tosatti
2008-07-03 13:17 ` Glauber Costa
2008-07-04 22:51   ` Marcelo Tosatti
2008-07-07  1:56   ` Anthony Liguori
2008-07-07 18:27     ` Glauber Costa
2008-07-07 18:48       ` Marcelo Tosatti
2008-07-07 19:21         ` Anthony Liguori
2008-07-07 19:32           ` Glauber Costa
2008-07-07 21:35             ` Glauber Costa
2008-07-11 21:18               ` David S. Ahern
2008-07-12 14:10                 ` Marcelo Tosatti
2008-07-12 19:28                   ` David S. Ahern
2008-07-07 18:17 ` Daniel P. Berrange
  -- strict thread matches above, loose matches on Subject: below --
2008-07-22  3:25 Marcelo Tosatti
2008-07-22  8:22 ` Jan Kiszka
2008-07-22 12:49   ` Marcelo Tosatti
2008-07-22 15:54     ` Jan Kiszka
2008-07-22 22:00     ` Dor Laor
2008-07-22 19:56 ` David S. Ahern
2008-07-23  2:57   ` David S. Ahern
2008-07-29 14:58   ` Marcelo Tosatti
