public inbox for kvm@vger.kernel.org
* kvm guest loops_per_jiffy miscalibration under host load
From: Marcelo Tosatti @ 2008-07-02 16:40 UTC
  To: kvm-devel; +Cc: gcosta, kraxel, chrisw, aliguori

Hello,

I have been discussing with Glauber and Gerd the problem where KVM
guests miscalibrate loops_per_jiffy if there's sufficient load on the
host.

calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
Probably due to long platform interrupts. Consider using "lpj=" boot option.
Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)

The host itself, by contrast, calculates lpj=1597041.

This means that udelay() can delay for less time than was asked for,
with fatal results such as:

..MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter

This bug is easily triggered by running a CPU-hungry task at nice -20
only during guest calibration (so that the timer check code in
io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).
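
To put a number on the shortfall, here is a minimal sketch (an
illustrative model, not kernel code). It assumes a busy-wait delay whose
loop count is proportional to the calibrated lpj, so the wall-clock delay
scales by lpj_used / lpj_true. The function name effective_delay_us() is
made up for this example; the two lpj values are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model only: the delay loop spins a number of iterations
 * proportional to the calibrated lpj. If the CPU really executes
 * lpj_true loops per jiffy, the wall-clock time actually spent is the
 * requested time scaled by lpj_used / lpj_true (integer division).
 */
uint64_t effective_delay_us(uint64_t requested_us,
                            uint64_t lpj_used, uint64_t lpj_true)
{
	assert(lpj_true != 0);
	return requested_us * lpj_used / lpj_true;
}
```

With lpj_used=214016 and lpj_true=1597041, a requested 1000 us delay
comes out at roughly 134 us under this model, about 13% of what the
caller asked for.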

The problem is that the calibration routines assume a stable relation
between timer interrupt frequency (PIT at this boot stage) and
TSC/execution frequency.

The emulated timer frequency is based on the host system time and is
therefore largely immune to heavy load, while the execution of these
routines in the guest is susceptible to scheduling of the QEMU
process.
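
The mismatch can be sketched as follows. This is a toy model under
stated assumptions, not the kernel's calibration code: the jiffy length
follows host wall-clock time, but loops are only counted while the vcpu
is actually running. struct slice, measured_lpj() and the numbers in the
usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One scheduling slice: the vcpu runs, then is stalled by the host. */
struct slice {
	uint64_t run_us;
	uint64_t stalled_us;
};

/*
 * Count the loops executed during one jiffy of host wall-clock time.
 * Wall time advances during both run and stalled periods, but loops
 * accumulate only while the vcpu is running.
 */
uint64_t measured_lpj(const struct slice *s, size_t n,
                      uint64_t loops_per_us, uint64_t jiffy_us)
{
	uint64_t wall = 0, loops = 0;

	assert(s != NULL);
	for (size_t i = 0; i < n && wall < jiffy_us; i++) {
		uint64_t run = s[i].run_us;
		uint64_t stall = s[i].stalled_us;

		if (run > jiffy_us - wall)      /* clip to the jiffy boundary */
			run = jiffy_us - wall;
		loops += run * loops_per_us;
		wall += run;

		if (stall > jiffy_us - wall)
			stall = jiffy_us - wall;
		wall += stall;
	}
	return loops;
}
```

For example, with a 4000 us jiffy and 400 loops/us, an unpreempted vcpu
measures 1600000, while a vcpu that only gets 536 us of CPU in that jiffy
measures 214400, even though the timer ticked at the right rate.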

To fix this in a transparent way (without direct "lpj=" boot parameter
assignment or a paravirt equivalent), it would be necessary to base the
emulated timer frequency on guest execution time instead of host system
time. But this can introduce timekeeping issues (recent Linux guests
seem to handle lost/late interrupts fine as long as the clocksource is
reliable) and just sounds scary.

Possible solutions:

- Require the admin to preset "lpj=". Nasty, not user-friendly.
- Pass the proper lpj value via a paravirt interface. Won't cover
  fullvirt guests.
- Have the management app guarantee a minimum amount of CPU required
  for proper calibration during guest initialization.

Comments, ideas?


* Re: kvm guest loops_per_jiffy miscalibration under host load
From: Marcelo Tosatti @ 2008-07-22  3:25 UTC
  To: David S. Ahern; +Cc: Glauber Costa, kvm-devel

On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
> > All time drift issues we were aware of are fixed in kvm-70. Can you
> > please provide more details on how you see the time drifting with
> > RHEL3/4 guests? It slowly but continually drifts or there are large
> > drifts at once? Are they using TSC or ACPIPM as clocksource?
> 
> The attached file shows one example of the drift I am seeing. It's for a
> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
> pinned to a physical cpu using taskset. The only activity on the host is
> this one single guest; the guest is relatively idle -- about 4% activity
> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
> guest is not. The guest is started with the -localtime parameter.  From
> the file you can see the guest gains about 1-2 seconds every 5 minutes.
> 
> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
> confirm?), though it does read the TSC (ie., use_tsc is 1).

Since it's an SMP guest I believe it's using the PIT to generate
periodic timers and the ACPI pmtimer as a clock source.

> > Also, most issues we've seen could only be replicated with dyntick
> > guests.
> > 
> > I'll try to reproduce it locally.
> > 
> >> In the course of it I have been launching guests with boosted priority
> >> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
> >> host.
> > 
> > Can you also see wacked bogomips without boosting the guest priority?
> 
> The wacked bogomips only shows up when started with real-time priority.
> With the 'nice -20' it's sane and close to what the host shows.
> 
> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
> guest time is always fast compared to the host.
> 
> I've seen similar drifting in RHEL4 guests, but I have not spent as much
> time investigating it yet. On ESX adding clock=pit to the boot
> parameters for RHEL4 guests helps immensely.

The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is the
lost tick and irq latency adjustments, as mentioned in the VMware paper
(http://www.vmware.com/pdf/vmware_timekeeping.pdf). These kernels try to
detect lost ticks and compensate by advancing the clock. But the delay
between the host timer firing, injection of the guest irq and the actual
count read (either tsc or pmtimer) fools these adjustments. clock=pit
has no such lost tick detection, so it is susceptible to lost ticks
under load (in theory).
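
One way such an adjustment can over-count is sketched below. This is a
simplified model, not the actual guest code, and it assumes that the
hypervisor still delivers the delayed ticks afterwards; count_jiffies()
and the arrival times are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified lost-tick compensation: on each timer irq, read a
 * free-running counter (here already converted to microseconds) and,
 * if two or more tick periods elapsed since the previous irq, credit
 * the "missing" ticks in addition to the one being handled.
 */
uint64_t count_jiffies(const uint64_t *irq_time_us, size_t n,
                       uint64_t tick_us)
{
	uint64_t jiffies = 0, last_mark = 0;

	assert(tick_us != 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t elapsed_ticks = (irq_time_us[i] - last_mark) / tick_us;

		jiffies++;                      /* the tick being handled */
		if (elapsed_ticks >= 2)         /* looks like ticks were lost */
			jiffies += elapsed_ticks - 1;
		last_mark = irq_time_us[i];
	}
	return jiffies;
}
```

With a 1000 us tick and irqs arriving on time at 1000, 2000, 3000 and
4000 us, the guest counts 4 jiffies. If the first irq is only delivered
at 3000 us and the two delayed ticks follow at 3010 and 3020 us, the
guest counts 6 jiffies over the same 4000 us, i.e. its clock runs fast.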

The qemu emulation is less susceptible to the guest clock running
faster than it should because the emulated PIT timer is rearmed
relative to alarm processing (next_expiration = current_time + count).
But that also means it is susceptible to host load, i.e. the frequency
is virtual.

The in-kernel PIT rearms relative to host clock, so the frequency is
more reliable (next_expiration = prev_expiration + count).
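
The difference between the two rearm policies can be sketched like this.
It is a toy model: rearm_relative(), rearm_absolute() and the latency
values are made up for illustration, with latency[i] standing for how
late the i-th alarm is processed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Rearm relative to alarm processing:
 *   next_expiration = current_time + count
 * Every processing latency is added to the timeline for good.
 */
uint64_t rearm_relative(const uint64_t *latency, size_t n, uint64_t period)
{
	uint64_t expiry = period;

	assert(latency != NULL);
	for (size_t i = 0; i < n; i++) {
		uint64_t current_time = expiry + latency[i];

		expiry = current_time + period;
	}
	return expiry;
}

/*
 * Rearm relative to the previous expiration:
 *   next_expiration = prev_expiration + count
 * Processing latency delays delivery but never shifts the timeline.
 */
uint64_t rearm_absolute(const uint64_t *latency, size_t n, uint64_t period)
{
	uint64_t expiry = period;

	(void)latency;  /* latency does not influence the next expiry */
	for (size_t i = 0; i < n; i++)
		expiry += period;
	return expiry;
}
```

With a period of 1000 and latencies {50, 200, 0, 150}, the fifth expiry
lands at 5400 with the relative policy and at 5000 with the absolute
one: the relative timer has fallen behind by the sum of the latencies.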

So for RHEL4, clock=pit along with the following patch seems stable for
me: no drift in either direction, even under guest/host load. Can you
give it a try with RHEL3? I'll be doing that shortly.


----------

Set the count load time to when the count is actually "loaded", not when
IRQ is injected.

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index c0f7872..b39b141 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
 
 	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
 	pt->scheduled = ktime_to_ns(pt->timer.expires);
+	ps->channels[0].count_load_time = pt->timer.expires;
 
 	return (pt->period == 0 ? 0 : 1);
 }
@@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
 		  arch->vioapic->redirtbl[0].fields.mask != 1))) {
 			ps->inject_pending = 1;
 			atomic_dec(&ps->pit_timer.pending);
-			ps->channels[0].count_load_time = ktime_get();
 		}
 	}
 }
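
To illustrate why the load timestamp matters, here is a simplified model
of how a guest-visible PIT counter can be derived from the time elapsed
since count_load_time. It is a sketch under assumptions, not the code in
i8254.c: pit_count_remaining() is a hypothetical helper, it only models
a periodic down-counter, and it uses the nominal 1.193182 MHz PIT input
clock.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_FREQ_HZ  1193182ULL      /* nominal i8254 input clock */
#define NSEC_PER_SEC 1000000000ULL

/*
 * Value a periodic down-counter would show at now_ns, given the time it
 * was (re)loaded and its reload value. The multiplication is fine for
 * the short intervals used here; a real implementation needs a wider
 * intermediate to avoid overflow on long intervals.
 */
uint64_t pit_count_remaining(uint64_t now_ns, uint64_t count_load_time_ns,
                             uint64_t initial_count)
{
	assert(initial_count != 0 && now_ns >= count_load_time_ns);

	uint64_t ticks = (now_ns - count_load_time_ns) * PIT_FREQ_HZ
			 / NSEC_PER_SEC;

	return initial_count - ticks % initial_count;
}
```

With a reload value of 1193 and a read 500000 ns after the timer
expired, the counter shows 597 if the load time is the expiry itself. If
the load time is instead stamped 100000 ns later (say, at irq
injection), the same read shows 716, so the guest sees less elapsed time
than has really passed.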



