public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* Re: kvm guest loops_per_jiffy miscalibration under host load
@ 2008-07-22  3:25 Marcelo Tosatti
  2008-07-22  8:22 ` Jan Kiszka
  2008-07-22 19:56 ` David S. Ahern
  0 siblings, 2 replies; 10+ messages in thread
From: Marcelo Tosatti @ 2008-07-22  3:25 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Glauber Costa, kvm-devel

On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
> > All time drift issues we were aware of are fixed in kvm-70. Can you
> > please provide more details on how you see the time drifting with
> > RHEL3/4 guests? It slowly but continually drifts or there are large
> > drifts at once? Are they using TSC or ACPIPM as clocksource?
> 
> The attached file shows one example of the drift I am seeing. It's for a
> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
> pinned to a physical cpu using taskset. The only activity on the host is
> this one single guest; the guest is relatively idle -- about 4% activity
> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
> guest is not. The guest is started with the -localtime parameter.  From
> the file you can see the guest gains about 1-2 seconds every 5 minutes.
> 
> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
> confirm?), though it does read the TSC (ie., use_tsc is 1).

Since it's an SMP guest I believe it's using the PIT to generate periodic
timers and the ACPI pmtimer as a clock source.

> > Also, most issues we've seen could only be replicated with dyntick
> > guests.
> > 
> > I'll try to reproduce it locally.
> > 
> >> In the course of it I have been launching guests with boosted priority
> >> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
> >> host.
> > 
> > Can you also see wacked bogomips without boosting the guest priority?
> 
> The wacked bogomips only shows up when started with real-time priority.
> With the 'nice -20' it's sane and close to what the host shows.
> 
> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
> guest time is always fast compared to the host.
> 
> I've seen similar drifting in RHEL4 guests, but I have not spent as much
> time investigating it yet. On ESX adding clock=pit to the boot
> parameters for RHEL4 guests helps immensely.

The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is the
lost-tick and irq-latency adjustments mentioned in the VMware paper
(http://www.vmware.com/pdf/vmware_timekeeping.pdf). Those kernels try to
detect lost ticks and compensate by advancing the clock. But the delay
between the host timer firing, the injection of the guest irq, and the
actual counter read (either tsc or pmtimer) fools these adjustments.
clock=pit has no such lost-tick detection, so it is susceptible to lost
ticks under load (in theory).

The fact that qemu's emulation is less susceptible to the guest clock
running faster than it should is because the emulated PIT timer is
rearmed relative to alarm processing (next_expiration = current_time +
count). But that also means it is susceptible to host load, i.e. the
frequency is virtual.

The in-kernel PIT rearms relative to host clock, so the frequency is
more reliable (next_expiration = prev_expiration + count).

So for RHEL4, clock=pit along with the following patch seems stable for
me: no drift in either direction, even under guest/host load. Can you
give it a try with RHEL3? I'll be doing that shortly.


----------

Set the count load time to when the count is actually "loaded", not when
IRQ is injected.

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index c0f7872..b39b141 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
 
 	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
 	pt->scheduled = ktime_to_ns(pt->timer.expires);
+	ps->channels[0].count_load_time = pt->timer.expires;
 
 	return (pt->period == 0 ? 0 : 1);
 }
@@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
 		  arch->vioapic->redirtbl[0].fields.mask != 1))) {
 			ps->inject_pending = 1;
 			atomic_dec(&ps->pit_timer.pending);
-			ps->channels[0].count_load_time = ktime_get();
 		}
 	}
 }



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22  3:25 kvm guest loops_per_jiffy miscalibration under host load Marcelo Tosatti
@ 2008-07-22  8:22 ` Jan Kiszka
  2008-07-22 12:49   ` Marcelo Tosatti
  2008-07-22 19:56 ` David S. Ahern
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kiszka @ 2008-07-22  8:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: David S. Ahern, Glauber Costa, kvm-devel

Marcelo Tosatti wrote:
> On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
>>> All time drift issues we were aware of are fixed in kvm-70. Can you
>>> please provide more details on how you see the time drifting with
>>> RHEL3/4 guests? It slowly but continually drifts or there are large
>>> drifts at once? Are they using TSC or ACPIPM as clocksource?
>> The attached file shows one example of the drift I am seeing. It's for a
>> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
>> pinned to a physical cpu using taskset. The only activity on the host is
>> this one single guest; the guest is relatively idle -- about 4% activity
>> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
>> guest is not. The guest is started with the -localtime parameter.  From
>> the file you can see the guest gains about 1-2 seconds every 5 minutes.
>>
>> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
>> confirm?), though it does read the TSC (ie., use_tsc is 1).
> 
> Since its an SMP guest I believe its using PIT to generate periodic
> timers and ACPI pmtimer as a clock source.
> 
>>> Also, most issues we've seen could only be replicated with dyntick
>>> guests.
>>>
>>> I'll try to reproduce it locally.
>>>
>>>> In the course of it I have been launching guests with boosted priority
>>>> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
>>>> host.
>>> Can you also see wacked bogomips without boosting the guest priority?
>> The wacked bogomips only shows up when started with real-time priority.
>> With the 'nice -20' it's sane and close to what the host shows.
>>
>> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
>> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
>> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
>> guest time is always fast compared to the host.
>>
>> I've seen similar drifting in RHEL4 guests, but I have not spent as much
>> time investigating it yet. On ESX adding clock=pit to the boot
>> parameters for RHEL4 guests helps immensely.
> 
> The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is lost
> tick and irq latency adjustments, as mentioned in the VMWare paper
> (http://www.vmware.com/pdf/vmware_timekeeping.pdf). They try to detect
> this and compensate by advancing the clock. But the delay between the
> host time fire, injection of guest irq and actual count read (either
> tsc or pmtimer) fool these adjustments. clock=pit has no such lost tick
> detection, so is susceptible to lost ticks under load (in theory).
> 
> The fact that qemu emulation is less suspectible to guest clock running
> faster than it should is because the emulated PIT timer is rearmed
> relative to alarm processing (next_expiration = current_time + count).
> But that also means it is suspectible to host load, ie. the frequency is
> virtual.
> 
> The in-kernel PIT rearms relative to host clock, so the frequency is
> more reliable (next_expiration = prev_expiration + count).

The same happens under plain QEMU:

static void pit_irq_timer_update(PITChannelState *s, int64_t current_time);

static void pit_irq_timer(void *opaque)
{
    PITChannelState *s = opaque;

    pit_irq_timer_update(s, s->next_transition_time);
}

In my experience QEMU's PIT suffers from lost ticks under load (when
some delay gets larger than 2*period).

I recently played a bit with QEMU's new icount feature. That one tracks
the guest's progress based on a virtual instruction pointer, derives
QEMU's virtual clock from it, but also tries to keep that clock in sync
with the host by periodically adjusting its scaling factor (a kind of
virtual CPU frequency tuning to keep the TSC in sync with real time). It
works quite nicely, but my feeling is that the adjustment is not 100%
stable yet.

Maybe such a pattern could be applied to kvm as well, with tsc_vmexit -
tsc_vmentry serving as the "guest progress counter" (instead of icount,
which depends on QEMU's code translator).

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22  8:22 ` Jan Kiszka
@ 2008-07-22 12:49   ` Marcelo Tosatti
  2008-07-22 15:54     ` Jan Kiszka
  2008-07-22 22:00     ` Dor Laor
  0 siblings, 2 replies; 10+ messages in thread
From: Marcelo Tosatti @ 2008-07-22 12:49 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: David S. Ahern, Glauber Costa, kvm-devel

On Tue, Jul 22, 2008 at 10:22:00AM +0200, Jan Kiszka wrote:
> > The in-kernel PIT rearms relative to host clock, so the frequency is
> > more reliable (next_expiration = prev_expiration + count).
> 
> The same happens under plain QEMU:
> 
> static void pit_irq_timer_update(PITChannelState *s, int64_t current_time);
> 
> static void pit_irq_timer(void *opaque)
> {
>     PITChannelState *s = opaque;
> 
>     pit_irq_timer_update(s, s->next_transition_time);
> }

True. I misread "current_time".

> To my experience QEMU's PIT is suffering from lost ticks under load
> (when some delay gets larger than 2*period).

Yes, with clock=pit on RHEL4 it's quite noticeable, even with -tdf. The
in-kernel timer seems immune to that under the load I was testing.

> I recently played a bit with QEMU new icount feature. Than one tracks
> the guest progress based on a virtual instruction pointer, derives the
> QEMU's virtual clock from it, but also tries to keep that clock in sync
> with the host by periodically adjusting its scaling factor (kind of
> virtual CPU frequency tuning to keep the TSC in sync with real time).
> Works quite nicely, but my feeling is that the adjustment is not 100%
> stable yet.
> 
> Maybe such pattern could be applied on kvm as well with tsc_vmexit -
> tsc_vmentry serving as "guest progress counter" (instead of icount which
> depends on QEMU's code translator).

I see. Do you have patches around?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22 12:49   ` Marcelo Tosatti
@ 2008-07-22 15:54     ` Jan Kiszka
  2008-07-22 22:00     ` Dor Laor
  1 sibling, 0 replies; 10+ messages in thread
From: Jan Kiszka @ 2008-07-22 15:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: David S. Ahern, Glauber Costa, kvm-devel

Marcelo Tosatti wrote:
> On Tue, Jul 22, 2008 at 10:22:00AM +0200, Jan Kiszka wrote:
>>> The in-kernel PIT rearms relative to host clock, so the frequency is
>>> more reliable (next_expiration = prev_expiration + count).
>> The same happens under plain QEMU:
>>
>> static void pit_irq_timer_update(PITChannelState *s, int64_t current_time);
>>
>> static void pit_irq_timer(void *opaque)
>> {
>>     PITChannelState *s = opaque;
>>
>>     pit_irq_timer_update(s, s->next_transition_time);
>> }
> 
> True. I misread "current_time".
> 
>> To my experience QEMU's PIT is suffering from lost ticks under load
>> (when some delay gets larger than 2*period).
> 
> Yes, with clock=pit on RHEL4 its quite noticeable. Even with -tdf. The
> in-kernel timer seems immune to that under the load I was testing.
> 
>> I recently played a bit with QEMU new icount feature. Than one tracks
>> the guest progress based on a virtual instruction pointer, derives the
>> QEMU's virtual clock from it, but also tries to keep that clock in sync
>> with the host by periodically adjusting its scaling factor (kind of
>> virtual CPU frequency tuning to keep the TSC in sync with real time).
>> Works quite nicely, but my feeling is that the adjustment is not 100%
>> stable yet.
>>
>> Maybe such pattern could be applied on kvm as well with tsc_vmexit -
>> tsc_vmentry serving as "guest progress counter" (instead of icount which
>> depends on QEMU's code translator).
> 
> I see. Do you have patches around?

Unfortunately, not. So far it's just a vague idea of how it /may/ work -
I'm lacking the time to study or even implement the details ATM.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22  3:25 kvm guest loops_per_jiffy miscalibration under host load Marcelo Tosatti
  2008-07-22  8:22 ` Jan Kiszka
@ 2008-07-22 19:56 ` David S. Ahern
  2008-07-23  2:57   ` David S. Ahern
  2008-07-29 14:58   ` Marcelo Tosatti
  1 sibling, 2 replies; 10+ messages in thread
From: David S. Ahern @ 2008-07-22 19:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Glauber Costa, kvm-devel

I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
short of it is that all of them keep time quite well with 1 vcpu. In the
case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
smp kernels, again with only 1 vcpu (there's no up/smp distinction in
the kernels for RHEL5).

As soon as the number of vcpus is >1, time drifts systematically with
the guest *leading* the host. I see this on unloaded guests and hosts
(ie., cpu usage on the host ~<5%). The drift is averaging around
0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
wall time).

This is very reproducible. All I am doing is installing stock RHEL3.8,
4.4 and 5.2, i386 versions, starting them and watching the drift with no
time servers. In all of these recent cases the results are for the
in-kernel PIT.

More inline below.


Marcelo Tosatti wrote:
> On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
>>> All time drift issues we were aware of are fixed in kvm-70. Can you
>>> please provide more details on how you see the time drifting with
>>> RHEL3/4 guests? It slowly but continually drifts or there are large
>>> drifts at once? Are they using TSC or ACPIPM as clocksource?
>> The attached file shows one example of the drift I am seeing. It's for a
>> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
>> pinned to a physical cpu using taskset. The only activity on the host is
>> this one single guest; the guest is relatively idle -- about 4% activity
>> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
>> guest is not. The guest is started with the -localtime parameter.  From
>> the file you can see the guest gains about 1-2 seconds every 5 minutes.
>>
>> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
>> confirm?), though it does read the TSC (ie., use_tsc is 1).
> 
> Since its an SMP guest I believe its using PIT to generate periodic
> timers and ACPI pmtimer as a clock source.

Since my last post, I've been reading up on timekeeping and going
through the kernel code -- focusing on RHEL3 at the moment. AFAICT the
PIT is used for timekeeping, and the local APIC timer interrupts are
used as well (supposedly just for per-cpu system accounting, though I
have not gone through all of the code yet). I see no references to the
pmtimer in the dmesg data; I thought RHEL3 was not ACPI-aware.

> 
>>> Also, most issues we've seen could only be replicated with dyntick
>>> guests.
>>>
>>> I'll try to reproduce it locally.
>>>
>>>> In the course of it I have been launching guests with boosted priority
>>>> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
>>>> host.
>>> Can you also see wacked bogomips without boosting the guest priority?
>> The wacked bogomips only shows up when started with real-time priority.
>> With the 'nice -20' it's sane and close to what the host shows.
>>
>> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
>> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
>> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
>> guest time is always fast compared to the host.
>>
>> I've seen similar drifting in RHEL4 guests, but I have not spent as much
>> time investigating it yet. On ESX adding clock=pit to the boot
>> parameters for RHEL4 guests helps immensely.
> 
> The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is lost
> tick and irq latency adjustments, as mentioned in the VMWare paper
> (http://www.vmware.com/pdf/vmware_timekeeping.pdf). They try to detect
> this and compensate by advancing the clock. But the delay between the
> host time fire, injection of guest irq and actual count read (either
> tsc or pmtimer) fool these adjustments. clock=pit has no such lost tick
> detection, so is susceptible to lost ticks under load (in theory).

I have read that document quite a few times; clock=pit is required on
ESX for RHEL4 guests to be sane.

> 
> The fact that qemu emulation is less suspectible to guest clock running
> faster than it should is because the emulated PIT timer is rearmed
> relative to alarm processing (next_expiration = current_time + count).
> But that also means it is suspectible to host load, ie. the frequency is
> virtual.
> 
> The in-kernel PIT rearms relative to host clock, so the frequency is
> more reliable (next_expiration = prev_expiration + count).
> 
> So for RHEL4, clock=pit along with the following patch seems stable for
> me, no drift either direction, even under guest/host load. Can you give
> it a try with RHEL3 ? I'll be doing that shortly.

I'll give it a shot and let you know.

david

> 
> 
> ----------
> 
> Set the count load time to when the count is actually "loaded", not when
> IRQ is injected.
> 
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index c0f7872..b39b141 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>  
>  	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
>  	pt->scheduled = ktime_to_ns(pt->timer.expires);
> +	ps->channels[0].count_load_time = pt->timer.expires;
>  
>  	return (pt->period == 0 ? 0 : 1);
>  }
> @@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
>  		  arch->vioapic->redirtbl[0].fields.mask != 1))) {
>  			ps->inject_pending = 1;
>  			atomic_dec(&ps->pit_timer.pending);
> -			ps->channels[0].count_load_time = ktime_get();
>  		}
>  	}
>  }
> 
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22 12:49   ` Marcelo Tosatti
  2008-07-22 15:54     ` Jan Kiszka
@ 2008-07-22 22:00     ` Dor Laor
  1 sibling, 0 replies; 10+ messages in thread
From: Dor Laor @ 2008-07-22 22:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Jan Kiszka, David S. Ahern, Glauber Costa, kvm-devel

Marcelo Tosatti wrote:
> On Tue, Jul 22, 2008 at 10:22:00AM +0200, Jan Kiszka wrote:
>   
>>> The in-kernel PIT rearms relative to host clock, so the frequency is
>>> more reliable (next_expiration = prev_expiration + count).
>>>       
>> The same happens under plain QEMU:
>>
>> static void pit_irq_timer_update(PITChannelState *s, int64_t current_time);
>>
>> static void pit_irq_timer(void *opaque)
>> {
>>     PITChannelState *s = opaque;
>>
>>     pit_irq_timer_update(s, s->next_transition_time);
>> }
>>     
>
> True. I misread "current_time".
>
>   
>> To my experience QEMU's PIT is suffering from lost ticks under load
>> (when some delay gets larger than 2*period).
>>     
>
> Yes, with clock=pit on RHEL4 its quite noticeable. Even with -tdf. The
>   
Note that -tdf only works when you use the userspace irqchip as well.
> in-kernel timer seems immune to that under the load I was testing.
>
>   
In the long run we should try to remove the in-kernel PIT. Currently it
does handle the PIT irq coalescing problem that leads to time drift. The
problem is that it's not yet 100% production level -- migration with it
has some issues -- and basically we should not duplicate userspace code
unless there is a good reason (like performance).

There are floating patches by Gleb Natapov for the PIT and virtual RTC
to prevent time drifts. I hope they'll get accepted into qemu.
>> I recently played a bit with QEMU new icount feature. Than one tracks
>> the guest progress based on a virtual instruction pointer, derives the
>> QEMU's virtual clock from it, but also tries to keep that clock in sync
>> with the host by periodically adjusting its scaling factor (kind of
>> virtual CPU frequency tuning to keep the TSC in sync with real time).
>> Works quite nicely, but my feeling is that the adjustment is not 100%
>> stable yet.
>>
>> Maybe such pattern could be applied on kvm as well with tsc_vmexit -
>> tsc_vmentry serving as "guest progress counter" (instead of icount which
>> depends on QEMU's code translator).
>>     
>
> I see. Do you have patches around?
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22 19:56 ` David S. Ahern
@ 2008-07-23  2:57   ` David S. Ahern
  2008-07-29 14:58   ` Marcelo Tosatti
  1 sibling, 0 replies; 10+ messages in thread
From: David S. Ahern @ 2008-07-23  2:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-devel



David S. Ahern wrote:
>> The in-kernel PIT rearms relative to host clock, so the frequency is
>> more reliable (next_expiration = prev_expiration + count).
>>
>> So for RHEL4, clock=pit along with the following patch seems stable for
>> me, no drift either direction, even under guest/host load. Can you give
>> it a try with RHEL3 ? I'll be doing that shortly.
> 
> I'll give it a shot and let you know.

After 6:46 of uptime, my RHEL4 guest is only 7 seconds ahead of the
host. The RHEL3 guest is 17 seconds ahead. Both are dramatic
improvements with the patch.

david

>>
>> ----------
>>
>> Set the count load time to when the count is actually "loaded", not when
>> IRQ is injected.
>>
>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>> index c0f7872..b39b141 100644
>> --- a/arch/x86/kvm/i8254.c
>> +++ b/arch/x86/kvm/i8254.c
>> @@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>>  
>>  	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
>>  	pt->scheduled = ktime_to_ns(pt->timer.expires);
>> +	ps->channels[0].count_load_time = pt->timer.expires;
>>  
>>  	return (pt->period == 0 ? 0 : 1);
>>  }
>> @@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
>>  		  arch->vioapic->redirtbl[0].fields.mask != 1))) {
>>  			ps->inject_pending = 1;
>>  			atomic_dec(&ps->pit_timer.pending);
>> -			ps->channels[0].count_load_time = ktime_get();
>>  		}
>>  	}
>>  }
>>



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: kvm guest loops_per_jiffy miscalibration under host load
  2008-07-22 19:56 ` David S. Ahern
  2008-07-23  2:57   ` David S. Ahern
@ 2008-07-29 14:58   ` Marcelo Tosatti
  2008-07-29 16:06     ` PIT/ntp/timekeeping [was Re: kvm guest loops_per_jiffy miscalibration under host load] David S. Ahern
  1 sibling, 1 reply; 10+ messages in thread
From: Marcelo Tosatti @ 2008-07-29 14:58 UTC (permalink / raw)
  To: David S. Ahern; +Cc: Glauber Costa, kvm-devel

On Tue, Jul 22, 2008 at 01:56:12PM -0600, David S. Ahern wrote:
> I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
> short of it is that all of them keep time quite well with 1 vcpu. In the
> case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
> smp kernels, again with only 1 vcpu (there's no up/smp distinction in
> the kernels for RHEL5).
> 
> As soon as the number of vcpus is >1, time drifts systematically with
> the guest *leading* the host. I see this on unloaded guests and hosts
> (ie., cpu usage on the host ~<5%). The drift is averaging around
> 0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
> wall time).
> 
> This very reproducible. All I am doing is installing stock RHEL3.8, 4.4
> and 5.2, i386 versions, starting them and watching the drift with no
> time servers. In all of these recent cases the results are for in-kernel
> pit.

David,

You mentioned earlier problems with ntpd syncing the guest time? Can you
provide more details?

I find it _necessary_ to use the RR scheduling policy for any Linux
guest running at a static 1000Hz (no dynticks), otherwise timer
interrupts will invariably be missed. And reinjection plus lost-tick
adjustment is always problematic (it will drift either way, depending on
which version of Linux). With the standard batch scheduling policy,
_idle_ guests can wait up to 6-7 ms to run in my testing (thus 6-7 lost
timer events), which also means latency can be horrible.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* PIT/ntp/timekeeping [was Re: kvm guest loops_per_jiffy miscalibration under host load]
  2008-07-29 14:58   ` Marcelo Tosatti
@ 2008-07-29 16:06     ` David S. Ahern
  2008-07-29 17:29       ` David S. Ahern
  0 siblings, 1 reply; 10+ messages in thread
From: David S. Ahern @ 2008-07-29 16:06 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Glauber Costa, kvm-devel



Marcelo Tosatti wrote:
> On Tue, Jul 22, 2008 at 01:56:12PM -0600, David S. Ahern wrote:
>> I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
>> short of it is that all of them keep time quite well with 1 vcpu. In the
>> case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
>> smp kernels, again with only 1 vcpu (there's no up/smp distinction in
>> the kernels for RHEL5).
>>
>> As soon as the number of vcpus is >1, time drifts systematically with
>> the guest *leading* the host. I see this on unloaded guests and hosts
>> (ie., cpu usage on the host ~<5%). The drift is averaging around
>> 0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
>> wall time).
>>
>> This very reproducible. All I am doing is installing stock RHEL3.8, 4.4
>> and 5.2, i386 versions, starting them and watching the drift with no
>> time servers. In all of these recent cases the results are for in-kernel
>> pit.
> 
> David,
> 
> You mentioned earlier problems with ntpd syncing the guest time? Can you
> provide more details?
> 

It would lose sync often, and 'ntpq -c pe' would show a '*', indicating
a sync, when in fact the time in the guest was off by 5-10 seconds. It
may very well be a side effect of the drift due to repeated injection of
timer interrupts / lost interrupts.

With your PIT injection patches:

1. For a stock RHEL4.4 guest, ntpd synchronized quickly and saw no need
to adjust time after the initial startup tweak of 1.004620 sec by
ntpdate. After 40 hours it has maintained time very well with no
adjustments. Of course the guest is relatively idle -- it is only
keeping time.

2. For a stock RHEL3.8 guest, I cannot get ntpd to do anything. This
guest is running on the same host as the RHEL4 guest and using the same
time server. This guest has been around for a few weeks and has been
subjected to various tests -- like running with the -no-kvm-pit and -tdf
options. In light of 3. below I'll re-create this guest and see if the
problem goes away.

3. For a RHEL3.8 guest running a Cisco product, ntpd was able to
synchronize just fine. We are running ntpd with different arguments;
however, using the same syntax on the stock RHEL3 guest did not help.

As far as time updates go, over 21+ hours of uptime there have been 20
time resets -- adjustments ranging from -1.01 seconds to +0.75 seconds.
This is a remarkable improvement; before this PIT patch set I was seeing
time resets of 3-5 seconds every 15 minutes. This is a 2-vcpu guest
running a modest load (disk + network) that pushes cpu usage to ~25%.
The point being that the guest is keeping time reasonably well while
doing something useful. :-)

I am planning to install 4 vcpu guests for both RHEL3 and RHEL4 today
and again with modest loads to see how it holds up.

> I find it _necessary_ to use the RR scheduling policy for any Linux
> guest running at static 1000Hz (no dynticks), otherwise timer interrupts
> will invariably be missed. And reinjection plus lost tick adjustment is
> always problematic (will drift either way, depending which version of
> Linux). With the standard batch scheduling policy _idle_ guests can wait
> to run upto 6/7 ms in my testing (thus 6/7 lost timer events). Which
> also means latency can be horrible.
> 

Noted. I'd prefer not to start priority escalations, but if it's needed....

What about the RHEL4.7 kernel running at 250 HZ? I understand that with
4.7 you can pass a 'divider' command line option to run the clock at a
slower rate. In the past I've recompiled RHEL4 kernels to run at 250 HZ,
which was a trade-off between too fast (overhead of timer interrupts)
and too slow (need for better scheduling latency).


david

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: PIT/ntp/timekeeping [was Re: kvm guest loops_per_jiffy miscalibration under host load]
  2008-07-29 16:06     ` PIT/ntp/timekeeping [was Re: kvm guest loops_per_jiffy miscalibration under host load] David S. Ahern
@ 2008-07-29 17:29       ` David S. Ahern
  0 siblings, 0 replies; 10+ messages in thread
From: David S. Ahern @ 2008-07-29 17:29 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Glauber Costa, kvm-devel

[-- Attachment #1: Type: text/plain, Size: 5203 bytes --]

Forgot to add something in my last response: another time-based oddity
I'm seeing in multi-processor guests is the microseconds value changing
as a process is moved between vcpus.

The attached code exemplifies what I mean. In a RHEL3 VM with 2 vcpus,
start the program with an argument of 990000 (to get a wakeup every ~1
sec). Once started, lock it to a vcpu. You'll notice consistent output
like:
1217351975.261974
1217351976.262292
1217351977.262608
1217351978.262929
1217351979.263243
1217351980.263563
1217351981.263940

Then switch the affinity to the other vcpu. The microseconds value jumps:

1217351982.796132
1217351983.797411
1217351984.797719
1217351985.798041
1217351986.798368
1217351987.798788
1217351988.799025

Toggling the affinity or letting the process roam between the 2
processors causes the microseconds to jump. This means that data logged
using the microseconds value will show time jumping back and forth.

As I understand it, the root cause is the TSC-based updates to what is
returned by gettimeofday, so the fact that the values toggle means the 2
vcpus see different tsc counts. Is there any way to make the counts
coherent as processes roam between vcpus?

david


David S. Ahern wrote:
> 
> Marcelo Tosatti wrote:
>> On Tue, Jul 22, 2008 at 01:56:12PM -0600, David S. Ahern wrote:
>>> I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
>>> short of it is that all of them keep time quite well with 1 vcpu. In the
>>> case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
>>> smp kernels, again with only 1 vcpu (there's no up/smp distinction in
>>> the kernels for RHEL5).
>>>
>>> As soon as the number of vcpus is >1, time drifts systematically with
>>> the guest *leading* the host. I see this on unloaded guests and hosts
>>> (ie., cpu usage on the host ~<5%). The drift is averaging around
>>> 0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
>>> wall time).
>>>
>>> This is very reproducible. All I am doing is installing stock RHEL3.8, 4.4
>>> and 5.2, i386 versions, starting them and watching the drift with no
>>> time servers. In all of these recent cases the results are for in-kernel
>>> pit.
>> David,
>>
>> You mentioned earlier problems with ntpd syncing the guest time? Can you
>> provide more details?
>>
> 
> It would lose sync often, and 'ntpq -c pe' would show a '*' indicative
> of a sync when in fact time in the guest was off by 5-10 seconds. It may
> very well be a side effect of the drift caused by repeated reinjection
> of lost timer interrupts.
> 
> With your PIT injection patches:
> 
> 1. For a stock RHEL4.4 guest, ntpd synchronized quickly and saw no need
> to adjust time after the initial startup tweak of 1.004620 sec by
> ntpdate. After 40 hours it has maintained time very well with no
> adjustments. Of course the guest is relatively idle -- it is only
> keeping time.
> 
> 2. For a stock RHEL3.8 guest, I cannot get ntpd to do anything. This
> guest is running on the same host as the RHEL4 guest and using the same
> time server. This guest has been around for a few weeks and has been
> subjected to various tests -- like running with the no-kernel-pit and -tdf
> options. In light of 3. below I'll re-create this guest and see if the
> problem goes away.
> 
> 3. For a RHEL3.8 guest running a Cisco product, ntpd was able to
> synchronize just fine. We are running ntpd with different arguments;
> however, using the same syntax on the stock rhel3 guest did not help.
> 
> As far as time updates go, over 21+ hours of uptime there have been 20
> time resets -- adjustments ranging from -1.01 seconds to +0.75 seconds.
> This is a remarkable improvement. Before this PIT patch set I was seeing
> time resets of 3-5 seconds every 15 minutes. This is a 2 vcpu guest
> running a modest load (disk + network) that pushes cpu usage to ~25%.
> The point being that the guest is keeping time reasonably well while
> doing something useful. :-)
> 
> I am planning to install 4 vcpu guests for both RHEL3 and RHEL4 today
> and again with modest loads to see how it holds up.
> 
>> I find it _necessary_ to use the RR scheduling policy for any Linux
>> guest running at static 1000Hz (no dynticks), otherwise timer interrupts
>> will invariably be missed. And reinjection plus lost tick adjustment is
>> always problematic (will drift either way, depending which version of
>> Linux). With the standard batch scheduling policy _idle_ guests can wait
>> to run up to 6-7 ms in my testing (thus 6-7 lost timer events), which
>> also means latency can be horrible.
>>
> 
> Noted. I'd prefer not to start priority escalations, but if it's needed....
> 
> What about the RHEL4.7 kernel running at 250 HZ? I understand that with
> 4.7 you can pass a command-line divider option to run the clock at a
> slower rate. In the past I've recompiled RHEL4 kernels to run at 250 HZ
> as a trade-off between too fast (overhead of timer interrupts) and too
> slow (need for better scheduling latency).
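If the divider= boot option behaves as described (I haven't verified the
exact RHEL4.7 syntax, so treat the details as an assumption), the
grub.conf kernel line would look something like this (kernel version and
root device here are hypothetical):

```shell
# grub.conf sketch: divider=4 drops the effective timer rate from
# 1000 HZ to 250 HZ, trading timer-interrupt overhead against
# scheduling latency, without recompiling the kernel.
kernel /vmlinuz-2.6.9-78.EL ro root=/dev/VolGroup00/LogVol00 divider=4
```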
> 
> 
> david
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

[-- Attachment #2: showtime.c --]
[-- Type: text/x-csrc, Size: 545 bytes --]


#include <sys/time.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <libgen.h>

int main(int argc, char *argv[])
{
	unsigned long tsleep;
	struct timeval tv;

	if (argc != 2) {
		printf("usage: %s sleeptime\n", basename(argv[0]));
		return 1;
	}

	/* sleep time in microseconds; usleep() requires a value < 1000000 */
	tsleep = strtoul(argv[1], NULL, 10);
	if (tsleep == 0 || tsleep >= 1000000) {
		printf("usage: sleeptime must be between 1 and 999999 usec\n");
		return 2;
	}

	while(1) {
		if (gettimeofday(&tv, NULL) != 0)
			printf("gettimeofday failed\n");
		else
			/* zero-pad usec so e.g. 1974 usec prints as .001974 */
			printf("%ld.%06ld\n", tv.tv_sec, tv.tv_usec);

		usleep(tsleep);
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2008-07-29 17:29 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-22  3:25 kvm guest loops_per_jiffy miscalibration under host load Marcelo Tosatti
2008-07-22  8:22 ` Jan Kiszka
2008-07-22 12:49   ` Marcelo Tosatti
2008-07-22 15:54     ` Jan Kiszka
2008-07-22 22:00     ` Dor Laor
2008-07-22 19:56 ` David S. Ahern
2008-07-23  2:57   ` David S. Ahern
2008-07-29 14:58   ` Marcelo Tosatti
2008-07-29 16:06     ` PIT/ntp/timekeeping [was Re: kvm guest loops_per_jiffy miscalibration under host load] David S. Ahern
2008-07-29 17:29       ` David S. Ahern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox