RFC: kvmclock / tsc server side fix

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Zachary Amsden <zamsden@redhat.com>
To: kvm <kvm@vger.kernel.org>, Avi Kivity <avi@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Glauber Costa <glommer@redhat.com>
Subject: RFC: kvmclock / tsc server side fix
Date: Fri, 14 May 2010 16:07:43 -1000	[thread overview]
Message-ID: <4BEE01EF.7000506@redhat.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 588 bytes --]

I believe this fixes the root cause of the kvmclock warp.  It's quite a 
plausible phenomenon, and explains why it was so easy to produce.

Currently it depends on some other patches; I can send a whole patchset, 
but with all the patch activity, it isn't clear what has been applied 
and to what trees.  Where have Glauber's recent patches been applied?

I am looking for comments if this is a reasonably good explanation and 
fix for the problem.

I realize I messed up the overshoot calculation, it is not converted to 
nsec, but the debug stats are just for debugging.

Thanks,

Zach

[-- Attachment #2: tsc-server.patch --]
[-- Type: text/plain, Size: 5292 bytes --]

commit 24e1f31a4cdb43a8e5cab6cfb95d710c7c7bf18a
Author: Zachary Amsden <zamsden@redhat.com>
Date:   Fri Feb 26 15:13:31 2010 -1000

    Fix a possible backwards warp of kvmclock
    
    Kernel time, which advances in discrete steps may progress much slower
    than TSC.  As a result, when kvmclock is adjusted to a new base, the
    apparent time to the guest, which runs at a much higher, nsec scaled
    rate based on the current TSC, may have already been observed to have
    a larger value (kernel_ns + scaled tsc) than the value to which we are
    setting it (kernel_ns + 0).
    
    We must instead compute the clock as potentially observed by the guest
    for kernel_ns to make sure it does not go backwards.
    
    Signed-off-by: Zachary Amsden <zamsden@redhat.com>

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 83df4db..ba765fa 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -453,6 +453,8 @@ struct kvm_vcpu_stat {
 	u32 hypercalls;
 	u32 irq_injections;
 	u32 nmi_injections;
+	u32 tsc_overshoot;
+	u32 tsc_ahead;
 };
 
 struct kvm_x86_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bb44f9e..2bf7e86 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -134,6 +134,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
 	{ "irq_injections", VCPU_STAT(irq_injections) },
 	{ "nmi_injections", VCPU_STAT(nmi_injections) },
+	{ "tsc_overshoot", VCPU_STAT(tsc_overshoot) },
+	{ "tsc_ahead", VCPU_STAT(tsc_ahead) },
 	{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
 	{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
 	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
@@ -849,35 +851,80 @@ static int kvm_recompute_guest_time(struct kvm_vcpu *v)
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	void *shared_kaddr;
 	unsigned long this_tsc_khz;
+	s64 kernel_ns, delta;
+	u64 tsc_timestamp;
+	bool upscale;
 
 	if ((!vcpu->time_page))
 		return 0;
 
-	this_tsc_khz = get_cpu_var(cpu_tsc_khz);
-	put_cpu_var(cpu_tsc_khz);
+	/*
+	 * The protection we require is simple: we must not be preempted from
+	 * the CPU between our read of the TSC khz and our read of the TSC.
+	 * Interrupt protection is not strictly required, but it does result in
+	 * greater accuracy for the TSC / kernel_ns measurement.
+	 */
+	local_irq_save(flags);
+	this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
+	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
+	ktime_get_ts(&ts);
+	monotonic_to_bootbased(&ts);
+	kernel_ns = timespec_to_ns(&ts);
+	local_irq_restore(flags);
+
 	if (unlikely(this_tsc_khz == 0)) {
 		kvm_request_guest_time_update(v);
 		return 1;
 	}
 
+	/*
+	 * Time as measured by the TSC may go backwards when resetting the base
+	 * tsc_timestamp.  The reason for this is that the TSC resolution is
+	 * higher than the resolution of the other clock scales.  Thus, many
+	 * possible measurments of the TSC correspond to one measurement of any
+	 * other clock, and so a spread of values is possible.  This is not a
+	 * problem for the computation of the nanosecond clock; with TSC rates
+	 * around 1GHZ, there can only be a few cycles which correspond to one
+	 * nanosecond value, and any path through this code will inevitably
+	 * take longer than that.  However, with the kernel_ns value itself,
+	 * the precision may be much lower, down to HZ granularity.  If the
+	 * first sampling of TSC against kernel_ns ends in the low part of the
+	 * range, and the second in the high end of the range, we can get:
+	 *
+	 * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
+	 *
+	 * As the sampling errors potentially range in the thousands of cycles,
+	 * it is possible such a time value has already been observed by the
+	 * guest.  To protect against this, we must compute the system time as
+	 * observed by the guest and ensure the new system time is greater.
+ 	 */
+	delta = native_read_tsc() - vcpu->hv_clock.tsc_timestamp;
+	delta = pvclock_scale_delta(delta, vcpu->hv_clock.tsc_to_system_mul,
+				    vcpu->hv_clock.tsc_shift);
+	delta += vcpu->hv_clock.system_time;
+
 	if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
+		upscale = this_tsc_khz > vcpu->hw_tsc_khz;
 		kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
 				   &vcpu->hv_clock.tsc_shift,
 				   &vcpu->hv_clock.tsc_to_system_mul);
 		vcpu->hw_tsc_khz = this_tsc_khz;
 	}
 
-	/* Keep irq disabled to prevent changes to the clock */
-	local_irq_save(flags);
-	kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
-	ktime_get_ts(&ts);
-	monotonic_to_bootbased(&ts);
-	local_irq_restore(flags);
+	if (delta > kernel_ns) {
+		s64 overshoot = delta - kernel_ns;
+		++v->stat.tsc_ahead;
+		if (upscale)
+			overshoot = overshoot * 9 / 10;
+		if (overshoot > 1000ULL * this_tsc_khz / HZ) {
+			++v->stat.tsc_overshoot;
+		}
+		kernel_ns = delta;
+	}
 
 	/* With all the info we got, fill in the values */
-
-	vcpu->hv_clock.system_time = ts.tv_nsec +
-				     (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
+	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
+	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
 
 	/*
 	 * The interface expects us to write an even number signaling that the

next             reply	other threads:[~2010-05-15  2:07 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-05-15  2:07 Zachary Amsden [this message]
2010-05-17 15:36 ` RFC: kvmclock / tsc server side fix Glauber Costa
2010-05-17 19:38   ` Zachary Amsden
2010-05-18 14:08     ` Glauber Costa
2010-05-18 15:00       ` Zachary Amsden

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:83df4db dfblob:ba765fa dfblob:bb44f9e dfblob:2bf7e86 )
 OR (
bs:"RFC: kvmclock / tsc server side fix" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4BEE01EF.7000506@redhat.com \
    --to=zamsden@redhat.com \
    --cc=avi@redhat.com \
    --cc=glommer@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=mtosatti@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.