* [patch 1/8] Infrastructure for software and hardware based TSC rate scaling
2012-02-03 17:43 [patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden) Marcelo Tosatti
@ 2012-02-03 17:43 ` Marcelo Tosatti
2012-02-08 14:28 ` Joerg Roedel
2012-02-03 17:43 ` [patch 2/8] Improve TSC offset matching Marcelo Tosatti
` (6 subsequent siblings)
7 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-03 17:43 UTC (permalink / raw)
To: kvm; +Cc: zamsden, joerg.roedel, Marcelo Tosatti
[-- Attachment #1: 01-infrastructure-for-software-and-hardware-based-tsc.patch --]
[-- Type: text/plain, Size: 9524 bytes --]
From: Zachary Amsden <zamsden@gmail.com>
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate. Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
hardware scaling is even necessary and updates all rate variables used
by common code. This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.
There is also now a threshold added to the TSC rate scaling; minor
differences and variations in the measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed. Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from the user-requested rate before scaling
is used. The default is 250ppm, which is half the threshold for
NTP adjustment, allowing for some hardware variation.
In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster-than-hardware
speed has been requested, but there is nothing available
yet for the reverse case; that requires a trap-and-emulate software
implementation for RDTSC, which is still forthcoming.
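To make the tolerance concrete, here is a standalone user-space sketch of
the threshold check (the constants mirror the kvm_set_tsc_khz() hunk below;
the sketch itself is illustrative and not part of the patch):

#include <stdio.h>
#include <stdint.h>

static int needs_scaling(uint32_t host_khz, uint32_t requested_khz,
                         uint32_t tolerance_ppm)
{
        uint32_t thresh_lo = (uint64_t)host_khz * (1000000 - tolerance_ppm) / 1000000;
        uint32_t thresh_hi = (uint64_t)host_khz * (1000000 + tolerance_ppm) / 1000000;

        return requested_khz < thresh_lo || requested_khz > thresh_hi;
}

int main(void)
{
        /* A 2.5 GHz host: +-250ppm leaves roughly +-625 kHz of slack. */
        printf("%d\n", needs_scaling(2500000, 2500300, 250)); /* 0: within tolerance */
        printf("%d\n", needs_scaling(2500000, 2600000, 250)); /* 1: scaling needed */
        return 0;
}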
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: kvm/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -422,10 +422,11 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
- u32 virtual_tsc_khz;
bool tsc_catchup;
- u32 tsc_catchup_mult;
- s8 tsc_catchup_shift;
+ bool tsc_always_catchup;
+ s8 virtual_tsc_shift;
+ u32 virtual_tsc_mult;
+ u32 virtual_tsc_khz;
atomic_t nmi_queued; /* unprocessed asynchronous NMIs */
unsigned nmi_pending; /* NMI queued after currently running handler */
@@ -651,7 +652,7 @@ struct kvm_x86_ops {
bool (*has_wbinvd_exit)(void);
- void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz);
+ void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale);
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
Index: kvm/arch/x86/kvm/svm.c
===================================================================
--- kvm.orig/arch/x86/kvm/svm.c
+++ kvm/arch/x86/kvm/svm.c
@@ -964,20 +964,25 @@ static u64 svm_scale_tsc(struct kvm_vcpu
return _tsc;
}
-static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
{
struct vcpu_svm *svm = to_svm(vcpu);
u64 ratio;
u64 khz;
- /* TSC scaling supported? */
- if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR))
+ /* Guest TSC same frequency as host TSC? */
+ if (!scale) {
+ svm->tsc_ratio = TSC_RATIO_DEFAULT;
return;
+ }
- /* TSC-Scaling disabled or guest TSC same frequency as host TSC? */
- if (user_tsc_khz == 0) {
- vcpu->arch.virtual_tsc_khz = 0;
- svm->tsc_ratio = TSC_RATIO_DEFAULT;
+ /* TSC scaling supported? */
+ if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+ if (user_tsc_khz > tsc_khz) {
+ vcpu->arch.tsc_catchup = 1;
+ vcpu->arch.tsc_always_catchup = 1;
+ } else
+ WARN(1, "user requested TSC rate below hardware speed\n");
return;
}
@@ -992,7 +997,6 @@ static void svm_set_tsc_khz(struct kvm_v
user_tsc_khz);
return;
}
- vcpu->arch.virtual_tsc_khz = user_tsc_khz;
svm->tsc_ratio = ratio;
}
Index: kvm/arch/x86/kvm/vmx.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx.c
+++ kvm/arch/x86/kvm/vmx.c
@@ -1817,13 +1817,19 @@ u64 vmx_read_l1_tsc(struct kvm_vcpu *vcp
}
/*
- * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
- * ioctl. In this case the call-back should update internal vmx state to make
- * the changes effective.
+ * Engage any workarounds for mis-matched TSC rates. Currently limited to
+ * software catchup for faster rates on slower CPUs.
*/
-static void vmx_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+static void vmx_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
{
- /* Nothing to do here */
+ if (!scale)
+ return;
+
+ if (user_tsc_khz > tsc_khz) {
+ vcpu->arch.tsc_catchup = 1;
+ vcpu->arch.tsc_always_catchup = 1;
+ } else
+ WARN(1, "user requested TSC rate below hardware speed\n");
}
/*
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -96,6 +96,10 @@ EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
u32 kvm_max_guest_tsc_khz;
EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
+/* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
+static u32 tsc_tolerance_ppm = 250;
+module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
+
#define KVM_NR_SHARED_MSRS 16
struct kvm_shared_msrs_global {
@@ -968,49 +972,43 @@ static inline u64 get_kernel_ns(void)
static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
unsigned long max_tsc_khz;
-static inline int kvm_tsc_changes_freq(void)
-{
- int cpu = get_cpu();
- int ret = !boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
- cpufreq_quick_get(cpu) != 0;
- put_cpu();
- return ret;
-}
-
-u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu)
-{
- if (vcpu->arch.virtual_tsc_khz)
- return vcpu->arch.virtual_tsc_khz;
- else
- return __this_cpu_read(cpu_tsc_khz);
-}
-
static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
{
- u64 ret;
-
- WARN_ON(preemptible());
- if (kvm_tsc_changes_freq())
- printk_once(KERN_WARNING
- "kvm: unreliable cycle conversion on adjustable rate TSC\n");
- ret = nsec * vcpu_tsc_khz(vcpu);
- do_div(ret, USEC_PER_SEC);
- return ret;
+ return pvclock_scale_delta(nsec, vcpu->arch.virtual_tsc_mult,
+ vcpu->arch.virtual_tsc_shift);
}
-static void kvm_init_tsc_catchup(struct kvm_vcpu *vcpu, u32 this_tsc_khz)
+static void kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 this_tsc_khz)
{
+ u32 thresh_lo, thresh_hi;
+ int use_scaling = 0;
+
/* Compute a scale to convert nanoseconds in TSC cycles */
kvm_get_time_scale(this_tsc_khz, NSEC_PER_SEC / 1000,
- &vcpu->arch.tsc_catchup_shift,
- &vcpu->arch.tsc_catchup_mult);
+ &vcpu->arch.virtual_tsc_shift,
+ &vcpu->arch.virtual_tsc_mult);
+ vcpu->arch.virtual_tsc_khz = this_tsc_khz;
+
+ /*
+ * Compute the variation in TSC rate which is acceptable
+ * within the range of tolerance and decide if the
+ * rate being applied is within those bounds of the hardware
+ * rate. If so, no scaling or compensation need be done.
+ */
+ thresh_lo = (u64)tsc_khz * (1000000 - tsc_tolerance_ppm) / 1000000;
+ thresh_hi = (u64)tsc_khz * (1000000 + tsc_tolerance_ppm) / 1000000;
+ if (this_tsc_khz < thresh_lo || this_tsc_khz > thresh_hi) {
+ pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", this_tsc_khz, thresh_lo, thresh_hi);
+ use_scaling = 1;
+ }
+ kvm_x86_ops->set_tsc_khz(vcpu, this_tsc_khz, use_scaling);
}
static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
{
u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.last_tsc_nsec,
- vcpu->arch.tsc_catchup_mult,
- vcpu->arch.tsc_catchup_shift);
+ vcpu->arch.virtual_tsc_mult,
+ vcpu->arch.virtual_tsc_shift);
tsc += vcpu->arch.last_tsc_write;
return tsc;
}
@@ -1077,7 +1075,7 @@ static int kvm_guest_time_update(struct
local_irq_save(flags);
tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
kernel_ns = get_kernel_ns();
- this_tsc_khz = vcpu_tsc_khz(v);
+ this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
if (unlikely(this_tsc_khz == 0)) {
local_irq_restore(flags);
kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
@@ -2804,26 +2802,21 @@ long kvm_arch_vcpu_ioctl(struct file *fi
u32 user_tsc_khz;
r = -EINVAL;
- if (!kvm_has_tsc_control)
- break;
-
user_tsc_khz = (u32)arg;
if (user_tsc_khz >= kvm_max_guest_tsc_khz)
goto out;
- kvm_x86_ops->set_tsc_khz(vcpu, user_tsc_khz);
+ if (user_tsc_khz == 0)
+ user_tsc_khz = tsc_khz;
+
+ kvm_set_tsc_khz(vcpu, user_tsc_khz);
r = 0;
goto out;
}
case KVM_GET_TSC_KHZ: {
- r = -EIO;
- if (check_tsc_unstable())
- goto out;
-
- r = vcpu_tsc_khz(vcpu);
-
+ r = vcpu->arch.virtual_tsc_khz;
goto out;
}
default:
@@ -5312,6 +5305,8 @@ static int vcpu_enter_guest(struct kvm_v
profile_hit(KVM_PROFILING, (void *)rip);
}
+ if (unlikely(vcpu->arch.tsc_always_catchup))
+ kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
kvm_lapic_sync_from_vapic(vcpu);
@@ -6004,7 +5999,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *
}
vcpu->arch.pio_data = page_address(page);
- kvm_init_tsc_catchup(vcpu, max_tsc_khz);
+ kvm_set_tsc_khz(vcpu, max_tsc_khz);
r = kvm_mmu_create(vcpu);
if (r < 0)
Index: kvm/arch/x86/kvm/lapic.c
===================================================================
--- kvm.orig/arch/x86/kvm/lapic.c
+++ kvm/arch/x86/kvm/lapic.c
@@ -731,7 +731,7 @@ static void start_apic_timer(struct kvm_
u64 guest_tsc, tscdeadline = apic->lapic_timer.tscdeadline;
u64 ns = 0;
struct kvm_vcpu *vcpu = apic->vcpu;
- unsigned long this_tsc_khz = vcpu_tsc_khz(vcpu);
+ unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
unsigned long flags;
if (unlikely(!tscdeadline || !this_tsc_khz))
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [patch 1/8] Infrastructure for software and hardware based TSC rate scaling
2012-02-03 17:43 ` [patch 1/8] Infrastructure for software and hardware based TSC rate scaling Marcelo Tosatti
@ 2012-02-08 14:28 ` Joerg Roedel
0 siblings, 0 replies; 13+ messages in thread
From: Joerg Roedel @ 2012-02-08 14:28 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: kvm, zamsden
On Fri, Feb 03, 2012 at 03:43:50PM -0200, Marcelo Tosatti wrote:
> + if (user_tsc_khz > tsc_khz) {
> + vcpu->arch.tsc_catchup = 1;
> + vcpu->arch.tsc_always_catchup = 1;
> + } else
> + WARN(1, "user requested TSC rate below hardware speed\n");
> return;
[...]
> + if (user_tsc_khz > tsc_khz) {
> + vcpu->arch.tsc_catchup = 1;
> + vcpu->arch.tsc_always_catchup = 1;
> + } else
> + WARN(1, "user requested TSC rate below hardware speed\n");
Is it a good idea to have a user-triggerable WARN?
Joerg
--
AMD Operating System Research Center
Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
* [patch 2/8] Improve TSC offset matching
2012-02-03 17:43 [patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden) Marcelo Tosatti
2012-02-03 17:43 ` [patch 1/8] Infrastructure for software and hardware based TSC rate scaling Marcelo Tosatti
@ 2012-02-03 17:43 ` Marcelo Tosatti
2012-02-03 17:43 ` [patch 3/8] Leave TSC synchronization window open with each new sync Marcelo Tosatti
` (5 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-03 17:43 UTC (permalink / raw)
To: kvm; +Cc: zamsden, joerg.roedel, Marcelo Tosatti
[-- Attachment #1: 02-improve-tsc-offset-matching.patch --]
[-- Type: text/plain, Size: 3894 bytes --]
From: Zachary Amsden <zamsden@gmail.com>
There are a few improvements that can be made to the TSC offset
matching code. First, we don't need to call the 128-bit multiply
(especially on a constant number); the code is much nicer when the
computation is done in nanosecond units.
Second, the way everything is set up with software TSC rate scaling,
we currently have per-cpu rates. Obviously this isn't desirable in
practice, but if for some reason we do change the rate of all VCPUs
at runtime and then reset the TSCs, we will only want to
match offsets for VCPUs running at the same rate.
Finally, for the case where we have an unstable host TSC, but
rate scaling is being done in hardware, we should call the platform
code to compute the TSC offset, so the math is reorganized to recompute
the base instead, then transform the base into an offset using the
existing API.
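To sketch the reorganized unstable-TSC path in isolation (a user-space
model with illustrative names; real vendor code may scale the host TSC
before the offset is applied):

#include <stdint.h>

/* khz counts kilocycles per second, i.e. cycles per millisecond */
static uint64_t nsec_to_cycles(uint64_t ns, uint32_t tsc_khz)
{
        return ns * tsc_khz / 1000000;  /* may overflow for very large ns */
}

/* without hardware scaling: guest_tsc = host_tsc + offset */
static uint64_t compute_tsc_offset(uint64_t host_tsc, uint64_t target_guest_tsc)
{
        return target_guest_tsc - host_tsc;
}

static uint64_t resync_offset(uint64_t host_tsc, uint64_t data,
                              uint64_t elapsed_ns, uint32_t tsc_khz)
{
        data += nsec_to_cycles(elapsed_ns, tsc_khz);  /* transform the base... */
        return compute_tsc_offset(host_tsc, data);    /* ...then derive the offset */
}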
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: kvm/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -513,6 +513,7 @@ struct kvm_arch {
u64 last_tsc_nsec;
u64 last_tsc_offset;
u64 last_tsc_write;
+ u32 last_tsc_khz;
struct kvm_xen_hvm_config xen_hvm_config;
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1018,33 +1018,39 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
struct kvm *kvm = vcpu->kvm;
u64 offset, ns, elapsed;
unsigned long flags;
- s64 sdiff;
+ s64 nsdiff;
raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
ns = get_kernel_ns();
elapsed = ns - kvm->arch.last_tsc_nsec;
- sdiff = data - kvm->arch.last_tsc_write;
- if (sdiff < 0)
- sdiff = -sdiff;
+
+ /* n.b - signed multiplication and division required */
+ nsdiff = data - kvm->arch.last_tsc_write;
+ nsdiff = (nsdiff * 1000) / vcpu->arch.virtual_tsc_khz;
+ nsdiff -= elapsed;
+ if (nsdiff < 0)
+ nsdiff = -nsdiff;
/*
- * Special case: close write to TSC within 5 seconds of
- * another CPU is interpreted as an attempt to synchronize
- * The 5 seconds is to accommodate host load / swapping as
- * well as any reset of TSC during the boot process.
- *
- * In that case, for a reliable TSC, we can match TSC offsets,
- * or make a best guest using elapsed value.
- */
- if (sdiff < nsec_to_cycles(vcpu, 5ULL * NSEC_PER_SEC) &&
- elapsed < 5ULL * NSEC_PER_SEC) {
+ * Special case: TSC write with a small delta (1 second) of virtual
+ * cycle time against real time is interpreted as an attempt to
+ * synchronize the CPU.
+ *
+ * For a reliable TSC, we can match TSC offsets, and for an unstable
+ * TSC, we add elapsed time in this computation. We could let the
+ * compensation code attempt to catch up if we fall behind, but
+ * it's better to try to match offsets from the beginning.
+ */
+ if (nsdiff < NSEC_PER_SEC &&
+ vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
if (!check_tsc_unstable()) {
offset = kvm->arch.last_tsc_offset;
pr_debug("kvm: matched tsc offset for %llu\n", data);
} else {
u64 delta = nsec_to_cycles(vcpu, elapsed);
- offset += delta;
+ data += delta;
+ offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
ns = kvm->arch.last_tsc_nsec;
@@ -1052,6 +1058,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
kvm->arch.last_tsc_offset = offset;
+ kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
kvm_x86_ops->write_tsc_offset(vcpu, offset);
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
^ permalink raw reply [flat|nested] 13+ messages in thread* [patch 3/8] Leave TSC synchronization window open with each new sync
2012-02-03 17:43 [patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden) Marcelo Tosatti
2012-02-03 17:43 ` [patch 1/8] Infrastructure for software and hardware based TSC rate scaling Marcelo Tosatti
2012-02-03 17:43 ` [patch 2/8] Improve TSC offset matching Marcelo Tosatti
@ 2012-02-03 17:43 ` Marcelo Tosatti
2012-02-03 17:43 ` [patch 4/8] Fix last_guest_tsc / tsc_offset semantics Marcelo Tosatti
` (4 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-03 17:43 UTC (permalink / raw)
To: kvm; +Cc: zamsden, joerg.roedel, Marcelo Tosatti
[-- Attachment #1: 03-leave-tsc-sync-window-open.patch --]
[-- Type: text/plain, Size: 2415 bytes --]
From: Zachary Amsden <zamsden@gmail.com>
Currently, when the TSC is written by the guest, the variable
ns is updated to force the current write to appear to have taken
place at the time of the first write in this sync phase. This
leaves a cliff at the end of the match window where updates will
fall off the end. There are two scenarios where this can be a
problem in practice - first, on a system with a large number of
VCPUs, the sync period may last for an extended period of time.
The second way this can happen is if the VM reboots very rapidly
and we catch a VCPU TSC synchronization just around the edge.
We may be unaware of the reboot, and thus the first VCPU might
synchronize with an old setting of the timer (at, say, 0.97 seconds
ago, when first powered on). The second VCPU can come in 0.04
seconds later to try to synchronize, but it misses the window
because it is just over the threshold.
Instead, stop doing this artificial setback of the ns variable
and just update it with every write of the TSC.
It may be observed that doing so causes values computed by
compute_guest_tsc to diverge slightly across CPUs - note that
the last_tsc_ns and last_tsc_write variable are used here, and
now they last_tsc_ns will be different for each VCPU, reflecting
the actual time of the update.
However, compute_guest_tsc is used only for guests which already
have TSC stability issues, and further, note that the previous
patch has caused last_tsc_write to be incremented by the difference
in nanoseconds, converted back into guest cycles. As such, only
boundary rounding errors should be visible, which given the
resolution in nanoseconds, is going to only be a few cycles and
only visible in cross-CPU consistency tests. The problem can be
fixed by adding a new set of variables to track the start offset
and start write value for the current sync cycle.
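A toy timeline (numbers invented) of the reboot scenario above, comparing
the old fixed window against the sliding one this patch introduces:

#include <stdio.h>

int main(void)
{
        double window = 1.0, first_write = 0.0;
        double writes[] = { 0.97, 1.01 };       /* two VCPUs syncing after reboot */
        double last_matched = first_write;

        for (int i = 0; i < 2; i++) {
                int fixed   = (writes[i] - first_write)  < window;
                int sliding = (writes[i] - last_matched) < window;

                printf("t=%.2f fixed:%s sliding:%s\n", writes[i],
                       fixed ? "match" : "miss", sliding ? "match" : "miss");
                if (sliding)
                        last_matched = writes[i];       /* window slides forward */
        }
        return 0;       /* the second write misses the fixed window by 0.01s */
}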
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1053,7 +1053,6 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
- ns = kvm->arch.last_tsc_nsec;
}
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
* [patch 4/8] Fix last_guest_tsc / tsc_offset semantics
2012-02-03 17:43 [patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden) Marcelo Tosatti
` (2 preceding siblings ...)
2012-02-03 17:43 ` [patch 3/8] Leave TSC synchronization window open with each new sync Marcelo Tosatti
@ 2012-02-03 17:43 ` Marcelo Tosatti
2012-02-03 17:43 ` [patch 5/8] Add last_host_tsc tracking back to KVM Marcelo Tosatti
` (3 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-03 17:43 UTC (permalink / raw)
To: kvm; +Cc: zamsden, joerg.roedel, Marcelo Tosatti
[-- Attachment #1: 04-fix-last-guest-tsc.patch --]
[-- Type: text/plain, Size: 2677 bytes --]
From: Zachary Amsden <zamsden@gmail.com>
The variable last_guest_tsc was being used as an ad-hoc indicator
that the guest TSC has been initialized and recorded correctly. However,
it may not have been; the guest TSC could have been set to some
large value, then back to a small value (by, say, a software reboot).
This defeats the logic and causes KVM to falsely assume that the
guest TSC has gone backwards, marking the host TSC unstable, which
is undesirable behavior.
In addition, rather than try to compute an offset adjustment for the
TSC on unstable platforms, just recompute the whole offset. This
allows us to get rid of one callsite for adjust_tsc_offset, which
is problematic because the units it takes are in guest units, but
here, the computation was originally being done in host units.
Doing this, and also recording last_guest_tsc when the TSC is written,
allows us to remove the tricky logic which depended on last_guest_tsc
being zero to indicate a reset or an uninitialized value.
Instead, we now have the guarantee that the guest TSC offset is
always at least something which will get us last_guest_tsc.
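The false positive is easy to reproduce in a toy model (numbers invented);
comparing guest TSC values across a guest-initiated reset makes the delta
negative even though the host counter never moved backwards:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t last_guest_tsc = 1ULL << 40;   /* recorded before the soft reboot */
        uint64_t guest_tsc_now = 1000;          /* guest wrote its TSC back to ~0 */
        int64_t delta = (int64_t)(guest_tsc_now - last_guest_tsc);

        if (delta < 0)
                printf("host TSC wrongly presumed unstable (delta=%lld)\n",
                       (long long)delta);
        return 0;
}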
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1065,6 +1065,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
vcpu->arch.hv_clock.tsc_timestamp = 0;
vcpu->arch.last_tsc_write = data;
vcpu->arch.last_tsc_nsec = ns;
+ vcpu->arch.last_guest_tsc = data;
}
EXPORT_SYMBOL_GPL(kvm_write_tsc);
@@ -1133,7 +1134,7 @@ static int kvm_guest_time_update(struct
* observed by the guest and ensure the new system time is greater.
*/
max_kernel_ns = 0;
- if (vcpu->hv_clock.tsc_timestamp && vcpu->last_guest_tsc) {
+ if (vcpu->hv_clock.tsc_timestamp) {
max_kernel_ns = vcpu->last_guest_tsc -
vcpu->hv_clock.tsc_timestamp;
max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
@@ -2243,13 +2244,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
u64 tsc;
tsc = kvm_x86_ops->read_l1_tsc(vcpu);
- tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
- tsc - vcpu->arch.last_guest_tsc;
+ tsc_delta = tsc - vcpu->arch.last_guest_tsc;
if (tsc_delta < 0)
mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable()) {
- kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
+ u64 offset = kvm_x86_ops->compute_tsc_offset(vcpu,
+ vcpu->arch.last_guest_tsc);
+ kvm_x86_ops->write_tsc_offset(vcpu, offset);
vcpu->arch.tsc_catchup = 1;
}
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
* [patch 5/8] Add last_host_tsc tracking back to KVM
2012-02-03 17:43 [patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden) Marcelo Tosatti
` (3 preceding siblings ...)
2012-02-03 17:43 ` [patch 4/8] Fix last_guest_tsc / tsc_offset semantics Marcelo Tosatti
@ 2012-02-03 17:43 ` Marcelo Tosatti
2012-02-03 17:43 ` [patch 6/8] Allow adjust_tsc_offset to be in host or guest cycles Marcelo Tosatti
` (2 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-03 17:43 UTC (permalink / raw)
To: kvm; +Cc: zamsden, joerg.roedel, Marcelo Tosatti
[-- Attachment #1: 05-add-last-host-tsc.patch --]
[-- Type: text/plain, Size: 2370 bytes --]
From: Zachary Amsden <zamsden@gmail.com>
The variable last_host_tsc was removed from upstream code. I am adding
it back for two reasons. First, it is unnecessary to use guest TSC
computation to conclude information about the host TSC. The guest may
set the TSC backwards (a case handled by the previous patch), but
the computation of the guest TSC (and fetching an MSR) is significantly more
work and complexity than simply reading the hardware counter. In addition,
we don't actually need the guest TSC for any part of the computation;
by always recomputing the offset, we can eliminate the need to deal with
the current offset and any scaling factors that may apply.
The second reason is that later on, we are going to be using the host
TSC value to restore TSC offsets after a host S4 suspend, so we need to
be reading the host values, not the guest values here.
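For reference, reading the host counter is a single unprivileged
instruction; a minimal sketch of what native_read_tsc() boils down to
(x86, GCC/Clang inline asm):

#include <stdint.h>

static inline uint64_t read_host_tsc(void)
{
        uint32_t lo, hi;

        /* RDTSC returns the 64-bit counter split across EDX:EAX */
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}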
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: kvm/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -422,6 +422,7 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
+ u64 last_host_tsc;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -2239,13 +2239,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
kvm_x86_ops->vcpu_load(vcpu, cpu);
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
- /* Make sure TSC doesn't go backwards */
- s64 tsc_delta;
- u64 tsc;
-
- tsc = kvm_x86_ops->read_l1_tsc(vcpu);
- tsc_delta = tsc - vcpu->arch.last_guest_tsc;
-
+ s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
+ native_read_tsc() - vcpu->arch.last_host_tsc;
if (tsc_delta < 0)
mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable()) {
@@ -2268,7 +2263,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
{
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
- vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+ vcpu->arch.last_host_tsc = native_read_tsc();
}
static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
* [patch 6/8] Allow adjust_tsc_offset to be in host or guest cycles
2012-02-03 17:43 [patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden) Marcelo Tosatti
` (4 preceding siblings ...)
2012-02-03 17:43 ` [patch 5/8] Add last_host_tsc tracking back to KVM Marcelo Tosatti
@ 2012-02-03 17:43 ` Marcelo Tosatti
2012-02-03 17:43 ` [patch 7/8] Dont mark TSC unstable due to S4 suspend Marcelo Tosatti
2012-02-03 17:43 ` [patch 8/8] Track TSC synchronization in generations Marcelo Tosatti
7 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-03 17:43 UTC (permalink / raw)
To: kvm; +Cc: zamsden, joerg.roedel, Marcelo Tosatti
[-- Attachment #1: 06-adjust-tsc-offset-guest-host.patch --]
[-- Type: text/plain, Size: 2903 bytes --]
Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.
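The distinction matters once hardware scaling is in play: an adjustment
expressed in host cycles has to be converted to guest cycles before it can
be folded into a TSC offset applied in guest units. A sketch of that
conversion (the 32-bit-fraction fixed-point format and the names are
illustrative, loosely modeled on SVM's TSC ratio; negative adjustments are
omitted, as the SVM hunk below WARNs on them):

#include <stdint.h>

/* ratio has 32 fractional bits; unsigned __int128 is a GCC/Clang extension */
static uint64_t scale_host_to_guest(uint64_t host_cycles, uint64_t ratio)
{
        return (uint64_t)(((unsigned __int128)host_cycles * ratio) >> 32);
}

static uint64_t adjust_tsc_offset(uint64_t offset, uint64_t adjustment,
                                  int host, uint64_t ratio)
{
        if (host)       /* convert host cycles to guest cycles first */
                adjustment = scale_host_to_guest(adjustment, ratio);
        return offset + adjustment;
}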
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: kvm/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -646,7 +646,7 @@ struct kvm_x86_ops {
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
- void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
+ void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment, bool host);
void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
@@ -676,6 +676,17 @@ struct kvm_arch_async_pf {
extern struct kvm_x86_ops *kvm_x86_ops;
+static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
+ s64 adjustment)
+{
+ kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, false);
+}
+
+static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
+{
+ kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, true);
+}
+
int kvm_mmu_module_init(void);
void kvm_mmu_module_exit(void);
Index: kvm/arch/x86/kvm/svm.c
===================================================================
--- kvm.orig/arch/x86/kvm/svm.c
+++ kvm/arch/x86/kvm/svm.c
@@ -1016,10 +1016,14 @@ static void svm_write_tsc_offset(struct
mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
}
-static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ WARN_ON(adjustment < 0);
+ if (host)
+ adjustment = svm_scale_tsc(vcpu, adjustment);
+
svm->vmcb->control.tsc_offset += adjustment;
if (is_guest_mode(vcpu))
svm->nested.hsave->control.tsc_offset += adjustment;
Index: kvm/arch/x86/kvm/vmx.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx.c
+++ kvm/arch/x86/kvm/vmx.c
@@ -1856,7 +1856,7 @@ static void vmx_write_tsc_offset(struct
}
}
-static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host)
{
u64 offset = vmcs_read64(TSC_OFFSET);
vmcs_write64(TSC_OFFSET, offset + adjustment);
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1102,7 +1102,7 @@ static int kvm_guest_time_update(struct
if (vcpu->tsc_catchup) {
u64 tsc = compute_guest_tsc(v, kernel_ns);
if (tsc > tsc_timestamp) {
- kvm_x86_ops->adjust_tsc_offset(v, tsc - tsc_timestamp);
+ adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
tsc_timestamp = tsc;
}
}
* [patch 7/8] Dont mark TSC unstable due to S4 suspend
2012-02-03 17:43 [patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden) Marcelo Tosatti
` (5 preceding siblings ...)
2012-02-03 17:43 ` [patch 6/8] Allow adjust_tsc_offset to be in host or guest cycles Marcelo Tosatti
@ 2012-02-03 17:43 ` Marcelo Tosatti
2012-02-08 15:18 ` Joerg Roedel
2012-02-03 17:43 ` [patch 8/8] Track TSC synchronization in generations Marcelo Tosatti
7 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-03 17:43 UTC (permalink / raw)
To: kvm; +Cc: zamsden, joerg.roedel, Marcelo Tosatti
[-- Attachment #1: 07-dont-mark-tsc-unstable-suspend.patch --]
[-- Type: text/plain, Size: 5829 bytes --]
From: Zachary Amsden <zamsden@gmail.com>
During a host suspend, TSC may go backwards, which KVM interprets
as an unstable TSC. Technically, KVM should not be marking the
TSC unstable, which causes the TSC clocksource to go bad, but we
need to be adjusting the TSC offsets in such a case.
Dealing with this issue is a little tricky as the only place we
can reliably do it is before much of the timekeeping infrastructure
is up and running. On top of this, we are not in a KVM thread
context, so we may not be able to safely access VCPU fields.
Instead, we compute our best known hardware offset at power-up and
stash it to be applied to all VCPUs when they actually start running.
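The shape of that fix, as a standalone model (names invented): hardware
re-enable only records how far backwards the counter jumped, and each VCPU
applies the accumulated adjustment the next time it is loaded, when
touching its state is safe:

#include <stdint.h>

struct vcpu_model {
        uint64_t tsc_offset;
        uint64_t tsc_offset_adjustment; /* accumulates across suspend cycles */
};

static void on_hardware_enable(struct vcpu_model *v, uint64_t max_seen_tsc,
                               uint64_t local_tsc)
{
        if (max_seen_tsc > local_tsc)   /* the counter went backwards */
                v->tsc_offset_adjustment += max_seen_tsc - local_tsc;
}

static void on_vcpu_load(struct vcpu_model *v)
{
        if (v->tsc_offset_adjustment) { /* apply and clear the stashed delta */
                v->tsc_offset += v->tsc_offset_adjustment;
                v->tsc_offset_adjustment = 0;
        }
}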
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: kvm/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -423,6 +423,7 @@ struct kvm_vcpu_arch {
u64 last_tsc_nsec;
u64 last_tsc_write;
u64 last_host_tsc;
+ u64 tsc_offset_adjustment;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -2238,6 +2238,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
}
kvm_x86_ops->vcpu_load(vcpu, cpu);
+
+ /* Apply any externally detected TSC adjustments (due to suspend) */
+ if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
+ adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+ vcpu->arch.tsc_offset_adjustment = 0;
+ set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+ }
+
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
native_read_tsc() - vcpu->arch.last_host_tsc;
@@ -5950,13 +5958,88 @@ int kvm_arch_hardware_enable(void *garba
struct kvm *kvm;
struct kvm_vcpu *vcpu;
int i;
+ int ret;
+ u64 local_tsc;
+ u64 max_tsc = 0;
+ bool stable, backwards_tsc = false;
kvm_shared_msr_cpu_online();
- list_for_each_entry(kvm, &vm_list, vm_list)
- kvm_for_each_vcpu(i, vcpu, kvm)
- if (vcpu->cpu == smp_processor_id())
- kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
- return kvm_x86_ops->hardware_enable(garbage);
+ ret = kvm_x86_ops->hardware_enable(garbage);
+ if (ret != 0)
+ return ret;
+
+ local_tsc = native_read_tsc();
+ stable = !check_tsc_unstable();
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (!stable && vcpu->cpu == smp_processor_id())
+ set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+ if (stable && vcpu->arch.last_host_tsc > local_tsc) {
+ backwards_tsc = true;
+ if (vcpu->arch.last_host_tsc > max_tsc)
+ max_tsc = vcpu->arch.last_host_tsc;
+ }
+ }
+ }
+
+ /*
+ * Sometimes, even reliable TSCs go backwards. This happens on
+ * platforms that reset TSC during suspend or hibernate actions, but
+ * maintain synchronization. We must compensate. Fortunately, we can
+ * detect that condition here, which happens early in CPU bringup,
+ * before any KVM threads can be running. Unfortunately, we can't
+ * bring the TSCs fully up to date with real time, as we aren't yet far
+ * enough into CPU bringup that we know how much real time has actually
+ * elapsed; our helper function, get_kernel_ns() will be using boot
+ * variables that haven't been updated yet.
+ *
+ * So we simply find the maximum observed TSC above, then record the
+ * adjustment to TSC in each VCPU. When the VCPU later gets loaded,
+ * the adjustment will be applied. Note that we accumulate
+ * adjustments, in case multiple suspend cycles happen before some VCPU
+ * gets a chance to run again. In the event that no KVM threads get a
+ * chance to run, we will miss the entire elapsed period, as we'll have
+ * reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
+ * lose cycle time. This isn't too big a deal, since the loss will be
+ * uniform across all VCPUs (not to mention the scenario is extremely
+ * unlikely). It is possible that a second hibernate recovery happens
+ * much faster than a first, causing the observed TSC here to be
+ * smaller; this would require additional padding adjustment, which is
+ * why we set last_host_tsc to the local tsc observed here.
+ *
+ * N.B. - this code below runs only on platforms with reliable TSC,
+ * as that is the only way backwards_tsc is set above. Also note
+ * that this runs for ALL vcpus, which is not a bug; all VCPUs should
+ * have the same delta_cyc adjustment applied if backwards_tsc
+ * is detected. Note further, this adjustment is only done once,
+ * as we reset last_host_tsc on all VCPUs to stop this from being
+ * called multiple times (one for each physical CPU bringup).
+ *
+ * Platforms with unreliable TSCs don't have to deal with this; they
+ * will be compensated by the logic in vcpu_load, which sets the TSC to
+ * catchup mode. This will catchup all VCPUs to real time, but cannot
+ * guarantee that they stay in perfect synchronization.
+ */
+ if (backwards_tsc) {
+ u64 delta_cyc = max_tsc - local_tsc;
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ vcpu->arch.tsc_offset_adjustment += delta_cyc;
+ vcpu->arch.last_host_tsc = local_tsc;
+ }
+
+ /*
+ * We have to disable TSC offset matching.. if you were
+ * booting a VM while issuing an S4 host suspend....
+ * you may have some problem. Solving this issue is
+ * left as an exercise to the reader.
+ */
+ kvm->arch.last_tsc_nsec = 0;
+ kvm->arch.last_tsc_write = 0;
+ }
+
+ }
+ return 0;
}
void kvm_arch_hardware_disable(void *garbage)
* Re: [patch 7/8] Dont mark TSC unstable due to S4 suspend
2012-02-03 17:43 ` [patch 7/8] Dont mark TSC unstable due to S4 suspend Marcelo Tosatti
@ 2012-02-08 15:18 ` Joerg Roedel
2012-02-08 15:56 ` Marcelo Tosatti
0 siblings, 1 reply; 13+ messages in thread
From: Joerg Roedel @ 2012-02-08 15:18 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: kvm, zamsden
On Fri, Feb 03, 2012 at 03:43:56PM -0200, Marcelo Tosatti wrote:
> + if (backwards_tsc) {
> + u64 delta_cyc = max_tsc - local_tsc;
> + list_for_each_entry(kvm, &vm_list, vm_list) {
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + vcpu->arch.tsc_offset_adjustment += delta_cyc;
> + vcpu->arch.last_host_tsc = local_tsc;
> + }
> +
> + /*
> + * We have to disable TSC offset matching.. if you were
> + * booting a VM while issuing an S4 host suspend....
> + * you may have some problem. Solving this issue is
> + * left as an exercise to the reader.
> + */
> + kvm->arch.last_tsc_nsec = 0;
> + kvm->arch.last_tsc_write = 0;
> + }
> +
> + }
> + return 0;
This is not going to work when tsc-scaling is enabled. The
adjust_tsc_offset_host() function just scales the offset the same way
the tsc is scaled. But that is broken because the tsc-offset is applied
_after_ the tsc-ratio by scaling hardware. So to get the desired
tsc-value in the guest the offset needs to be scaled in the opposite
direction as the tsc itself. This is rather complicated to implement.
I suggest to drop patch 6/8 and just let adjust_tsc_offset() be using
guest-tsc units. These loops have to be changed then to work with the
guest-tsc instead of the host-tsc.
Joerg
--
AMD Operating System Research Center
Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
* Re: [patch 7/8] Dont mark TSC unstable due to S4 suspend
2012-02-08 15:18 ` Joerg Roedel
@ 2012-02-08 15:56 ` Marcelo Tosatti
2012-02-09 14:28 ` Joerg Roedel
0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-08 15:56 UTC (permalink / raw)
To: Joerg Roedel; +Cc: kvm, zamsden
On Wed, Feb 08, 2012 at 04:18:48PM +0100, Joerg Roedel wrote:
> On Fri, Feb 03, 2012 at 03:43:56PM -0200, Marcelo Tosatti wrote:
> > + if (backwards_tsc) {
> > + u64 delta_cyc = max_tsc - local_tsc;
> > + list_for_each_entry(kvm, &vm_list, vm_list) {
> > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > + vcpu->arch.tsc_offset_adjustment += delta_cyc;
> > + vcpu->arch.last_host_tsc = local_tsc;
> > + }
> > +
> > + /*
> > + * We have to disable TSC offset matching.. if you were
> > + * booting a VM while issuing an S4 host suspend....
> > + * you may have some problem. Solving this issue is
> > + * left as an exercise to the reader.
> > + */
> > + kvm->arch.last_tsc_nsec = 0;
> > + kvm->arch.last_tsc_write = 0;
> > + }
> > +
> > + }
> > + return 0;
>
> This is not going to work when tsc-scaling is enabled. The
> adjust_tsc_offset_host() function just scales the offset the same way
> the tsc is scaled. But that is broken because the tsc-offset is applied
> _after_ the tsc-ratio by scaling hardware. So to get the desired
> tsc-value in the guest the offset needs to be scaled in the opposite
> direction as the tsc itself. This is rather complicated to implement.
By saying that "tsc-offset is applied _after_ the tsc-ratio by scaling
hardware" you mean that guest tsc is calculated as
tsc = host_tsc_value * tsc_ratio
tsc += tsc_offset
?
If so, that means the tsc_offset must be scaled to guest tsc units.
Which is what both adjust_tsc_offset(host=true) and compute_tsc_offset()
do.
What am i missing here?
> I suggest to drop patch 6/8 and just let adjust_tsc_offset() be using
> guest-tsc units. These loops have to be changed then to work with the
> guest-tsc instead of the host-tsc.
>
>
> Joerg
>
> --
> AMD Operating System Research Center
>
> Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
> General Managers: Alberto Bozzo
> Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
* Re: [patch 7/8] Dont mark TSC unstable due to S4 suspend
2012-02-08 15:56 ` Marcelo Tosatti
@ 2012-02-09 14:28 ` Joerg Roedel
0 siblings, 0 replies; 13+ messages in thread
From: Joerg Roedel @ 2012-02-09 14:28 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: kvm, zamsden
On Wed, Feb 08, 2012 at 01:56:46PM -0200, Marcelo Tosatti wrote:
> On Wed, Feb 08, 2012 at 04:18:48PM +0100, Joerg Roedel wrote:
> > This is not going to work when tsc-scaling is enabled. The
> > adjust_tsc_offset_host() function just scales the offset the same way
> > the tsc is scaled. But that is broken because the tsc-offset is applied
> > _after_ the tsc-ratio by scaling hardware. So to get the desired
> > tsc-value in the guest the offset needs to be scaled in the opposite
> > direction as the tsc itself. This is rather complicated to implement.
>
> By saying that "tsc-offset is applied _after_ the tsc-ratio by scaling
> hardware" you mean that guest tsc is calculated as
>
> tsc = host_tsc_value * tsc_ratio
> tsc += tsc_offset
>
> ?
>
> If so, that means the tsc_offset must be scaled to guest tsc units.
>
> Which is what both adjust_tsc_offset(host=true) and compute_tsc_offset()
> do.
>
> What am i missing here?
Nothing. You are right, bad math on my side.
Joerg
--
AMD Operating System Research Center
Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
* [patch 8/8] Track TSC synchronization in generations
2012-02-03 17:43 [patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden) Marcelo Tosatti
` (6 preceding siblings ...)
2012-02-03 17:43 ` [patch 7/8] Dont mark TSC unstable due to S4 suspend Marcelo Tosatti
@ 2012-02-03 17:43 ` Marcelo Tosatti
7 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2012-02-03 17:43 UTC (permalink / raw)
To: kvm; +Cc: zamsden, joerg.roedel, Marcelo Tosatti
[-- Attachment #1: 08-track-tsc-sync-in-generations.patch --]
[-- Type: text/plain, Size: 4120 bytes --]
From: Zachary Amsden <zamsden@gmail.com>
This allows us to track the original nanosecond and counter values
at each phase of TSC writing by the guest. This gets us perfect
offset matching for stable TSC systems, and perfect software-computed
TSC matching for machines with an unstable TSC.
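Condensed, the generation bookkeeping looks like this (standalone model;
field names follow the patch, everything else is illustrative): matched
writes join the current generation and reuse its original values, while an
unmatched write opens a new one:

#include <stdint.h>

struct tsc_sync_state {
        uint8_t cur_tsc_generation;
        uint64_t cur_tsc_nsec, cur_tsc_write, cur_tsc_offset;
};

static void record_tsc_write(struct tsc_sync_state *s, int matched,
                             uint64_t ns, uint64_t data, uint64_t offset)
{
        if (!matched) {                 /* open a new sync generation */
                s->cur_tsc_generation++;
                s->cur_tsc_nsec = ns;
                s->cur_tsc_write = data;
                s->cur_tsc_offset = offset;
        }
        /* matched writes keep the generation's original values */
}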
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Index: kvm/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -420,10 +420,11 @@ struct kvm_vcpu_arch {
u64 last_guest_tsc;
u64 last_kernel_ns;
- u64 last_tsc_nsec;
- u64 last_tsc_write;
u64 last_host_tsc;
u64 tsc_offset_adjustment;
+ u64 this_tsc_nsec;
+ u64 this_tsc_write;
+ u8 this_tsc_generation;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
@@ -513,9 +514,12 @@ struct kvm_arch {
s64 kvmclock_offset;
raw_spinlock_t tsc_write_lock;
u64 last_tsc_nsec;
- u64 last_tsc_offset;
u64 last_tsc_write;
u32 last_tsc_khz;
+ u64 cur_tsc_nsec;
+ u64 cur_tsc_write;
+ u64 cur_tsc_offset;
+ u8 cur_tsc_generation;
struct kvm_xen_hvm_config xen_hvm_config;
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1006,10 +1006,10 @@ static void kvm_set_tsc_khz(struct kvm_v
static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
{
- u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.last_tsc_nsec,
+ u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
vcpu->arch.virtual_tsc_mult,
vcpu->arch.virtual_tsc_shift);
- tsc += vcpu->arch.last_tsc_write;
+ tsc += vcpu->arch.this_tsc_write;
return tsc;
}
@@ -1045,7 +1045,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
if (nsdiff < NSEC_PER_SEC &&
vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
if (!check_tsc_unstable()) {
- offset = kvm->arch.last_tsc_offset;
+ offset = kvm->arch.cur_tsc_offset;
pr_debug("kvm: matched tsc offset for %llu\n", data);
} else {
u64 delta = nsec_to_cycles(vcpu, elapsed);
@@ -1053,20 +1053,45 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
+ } else {
+ /*
+ * We split periods of matched TSC writes into generations.
+ * For each generation, we track the original measured
+ * nanosecond time, offset, and write, so if TSCs are in
+ * sync, we can match exact offset, and if not, we can match
+ * exact software computation in compute_guest_tsc()
+ *
+ * These values are tracked in kvm->arch.cur_xxx variables.
+ */
+ kvm->arch.cur_tsc_generation++;
+ kvm->arch.cur_tsc_nsec = ns;
+ kvm->arch.cur_tsc_write = data;
+ kvm->arch.cur_tsc_offset = offset;
+ pr_debug("kvm: new tsc generation %u, clock %llu\n",
+ kvm->arch.cur_tsc_generation, data);
}
+
+ /*
+ * We also track the most recent recorded KHZ, write and time to
+ * allow the matching interval to be extended at each write.
+ */
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
- kvm->arch.last_tsc_offset = offset;
kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
- kvm_x86_ops->write_tsc_offset(vcpu, offset);
- raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
/* Reset of TSC must disable overshoot protection below */
vcpu->arch.hv_clock.tsc_timestamp = 0;
- vcpu->arch.last_tsc_write = data;
- vcpu->arch.last_tsc_nsec = ns;
vcpu->arch.last_guest_tsc = data;
+
+ /* Keep track of which generation this VCPU has synchronized to */
+ vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
+ vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
+ vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
+
+ kvm_x86_ops->write_tsc_offset(vcpu, offset);
+ raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
}
+
EXPORT_SYMBOL_GPL(kvm_write_tsc);
static int kvm_guest_time_update(struct kvm_vcpu *v)