* RFC: kvmclock / tsc server side fix
From: Zachary Amsden @ 2010-05-15 2:07 UTC
To: kvm, Avi Kivity, Marcelo Tosatti, Glauber Costa
I believe this fixes the root cause of the kvmclock warp. It's quite a
plausible phenomenon, and explains why it was so easy to produce.
Currently it depends on some other patches; I can send a whole patchset,
but with all the patch activity, it isn't clear what has been applied
and to what trees. Where have Glauber's recent patches been applied?
I am looking for comments on whether this is a reasonably good
explanation of, and fix for, the problem.
I realize I messed up the overshoot calculation: it is not converted to
nsec. But the debug stats are just for debugging.
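Converted, the check would presumably look something like the sketch
below (untested, and only a guess at the intent: the overshoot value,
delta - kernel_ns, is already in nanoseconds, so the one-tick threshold
should be expressed in nanoseconds too, rather than as the cycle count
the patch currently compares against):

	/* Hedged sketch, not the posted patch: one timer tick in nsec. */
	if (overshoot > NSEC_PER_SEC / HZ)
		++v->stat.tsc_overshoot;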
Thanks,
Zach
[-- Attachment #2: tsc-server.patch --]
commit 24e1f31a4cdb43a8e5cab6cfb95d710c7c7bf18a
Author: Zachary Amsden <zamsden@redhat.com>
Date: Fri Feb 26 15:13:31 2010 -1000
Fix a possible backwards warp of kvmclock
Kernel time, which advances in discrete steps, may progress much more
slowly than the TSC. As a result, when kvmclock is adjusted to a new
base, the apparent time to the guest, which runs at a much higher,
nsec-scaled rate based on the current TSC, may already have been
observed to have a larger value (kernel_ns + scaled tsc) than the value
to which we are setting it (kernel_ns + 0).
We must instead compute kernel_ns as the clock potentially observed by
the guest, to make sure it does not go backwards.
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
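In rough terms, separate from the diff below, the fix amounts to the
following sketch (illustrative only: the pvclock field and helper names
are the ones the patch itself uses, but the function is made up):

	static s64 clamped_system_time(struct pvclock_vcpu_time_info *hv,
				       u64 tsc_now, s64 kernel_ns)
	{
		/* Time the guest may already have observed under the
		 * old (tsc_timestamp, system_time) base. */
		s64 observed = hv->system_time +
			pvclock_scale_delta(tsc_now - hv->tsc_timestamp,
					    hv->tsc_to_system_mul,
					    hv->tsc_shift);

		/* Never let the new base move the clock backwards. */
		return observed > kernel_ns ? observed : kernel_ns;
	}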
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 83df4db..ba765fa 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -453,6 +453,8 @@ struct kvm_vcpu_stat {
u32 hypercalls;
u32 irq_injections;
u32 nmi_injections;
+ u32 tsc_overshoot;
+ u32 tsc_ahead;
};
struct kvm_x86_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bb44f9e..2bf7e86 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -134,6 +134,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
{ "irq_injections", VCPU_STAT(irq_injections) },
{ "nmi_injections", VCPU_STAT(nmi_injections) },
+ { "tsc_overshoot", VCPU_STAT(tsc_overshoot) },
+ { "tsc_ahead", VCPU_STAT(tsc_ahead) },
{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
@@ -849,35 +851,80 @@ static int kvm_recompute_guest_time(struct kvm_vcpu *v)
struct kvm_vcpu_arch *vcpu = &v->arch;
void *shared_kaddr;
unsigned long this_tsc_khz;
+ s64 kernel_ns, delta;
+ u64 tsc_timestamp;
+ bool upscale = false; /* read below even when the khz is unchanged */
if ((!vcpu->time_page))
return 0;
- this_tsc_khz = get_cpu_var(cpu_tsc_khz);
- put_cpu_var(cpu_tsc_khz);
+ /*
+ * The protection we require is simple: we must not be preempted from
+ * the CPU between our read of the TSC khz and our read of the TSC.
+ * Interrupt protection is not strictly required, but it does result in
+ * greater accuracy for the TSC / kernel_ns measurement.
+ */
+ local_irq_save(flags);
+ this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
+ kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
+ ktime_get_ts(&ts);
+ monotonic_to_bootbased(&ts);
+ kernel_ns = timespec_to_ns(&ts);
+ local_irq_restore(flags);
+
if (unlikely(this_tsc_khz == 0)) {
kvm_request_guest_time_update(v);
return 1;
}
+ /*
+ * Time as measured by the TSC may go backwards when resetting the base
+ * tsc_timestamp. The reason for this is that the TSC resolution is
+ * higher than the resolution of the other clock scales. Thus, many
+ * possible measurements of the TSC correspond to one measurement of any
+ * other clock, and so a spread of values is possible. This is not a
+ * problem for the computation of the nanosecond clock; with TSC rates
+ * around 1GHz, there can only be a few cycles which correspond to one
+ * nanosecond value, and any path through this code will inevitably
+ * take longer than that. However, with the kernel_ns value itself,
+ * the precision may be much lower, down to HZ granularity. If the
+ * first sampling of TSC against kernel_ns ends in the low part of the
+ * range, and the second in the high end of the range, we can get:
+ *
+ * (TSC - offset_low) * S + kns_old > (TSC - offset_high) * S + kns_new
+ *
+ * As the sampling errors potentially range in the thousands of cycles,
+ * it is possible such a time value has already been observed by the
+ * guest. To protect against this, we must compute the system time as
+ * observed by the guest and ensure the new system time is greater.
+ */
+ delta = native_read_tsc() - vcpu->hv_clock.tsc_timestamp;
+ delta = pvclock_scale_delta(delta, vcpu->hv_clock.tsc_to_system_mul,
+ vcpu->hv_clock.tsc_shift);
+ delta += vcpu->hv_clock.system_time;
+
if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
+ upscale = this_tsc_khz > vcpu->hw_tsc_khz;
kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
&vcpu->hv_clock.tsc_shift,
&vcpu->hv_clock.tsc_to_system_mul);
vcpu->hw_tsc_khz = this_tsc_khz;
}
- /* Keep irq disabled to prevent changes to the clock */
- local_irq_save(flags);
- kvm_get_msr(v, MSR_IA32_TSC, &vcpu->hv_clock.tsc_timestamp);
- ktime_get_ts(&ts);
- monotonic_to_bootbased(&ts);
- local_irq_restore(flags);
+ if (delta > kernel_ns) {
+ s64 overshoot = delta - kernel_ns;
+ ++v->stat.tsc_ahead;
+ if (upscale)
+ overshoot = overshoot * 9 / 10;
+ if (overshoot > 1000ULL * this_tsc_khz / HZ) {
+ ++v->stat.tsc_overshoot;
+ }
+ kernel_ns = delta;
+ }
/* With all the info we got, fill in the values */
-
- vcpu->hv_clock.system_time = ts.tv_nsec +
- (NSEC_PER_SEC * (u64)ts.tv_sec) + v->kvm->arch.kvmclock_offset;
+ vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
+ vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
/*
* The interface expects us to write an even number signaling that the
* Re: RFC: kvmclock / tsc server side fix
From: Glauber Costa @ 2010-05-17 15:36 UTC
To: Zachary Amsden; +Cc: kvm, Avi Kivity, Marcelo Tosatti
On Fri, May 14, 2010 at 04:07:43PM -1000, Zachary Amsden wrote:
> I believe this fixes the root cause of the kvmclock warp. It's
> quite a plausible phenomenon, and explains why it was so easy to
> produce.
>
You mean this is the case for both SMP and UP, or just UP as we talked
before?
I don't get the role of upscale in your patch. Frequency changes are
already handled by the cpufreq notifier.
* Re: RFC: kvmclock / tsc server side fix
From: Zachary Amsden @ 2010-05-17 19:38 UTC
To: Glauber Costa; +Cc: kvm, Avi Kivity, Marcelo Tosatti
On 05/17/2010 05:36 AM, Glauber Costa wrote:
> You mean this is the case for both SMP and UP, or just UP as we talked
> before?
>
It's possible on both SMP and UP, guest and host, although on a UP host
it is impossible unless special circumstances come into play (one of my
patches created those circumstances).
> I don't get the role of upscale in your patch. Frequency changes are
> already handled by the cpufreq notifier.
>
The only purpose of upscale is to downscale the delta measurement used
for the stats counters when the CPU frequency was raised since it was
last observed. Moving to a faster TSC rate means we might have counted
some cycles at the wrong rate while the rate was in transition. It
doesn't much matter, as the delta for which "overrun" is logged was
computed wrongly anyway.
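In code terms the adjustment amounts to roughly the following
(hypothetical helper; the flat 9/10 factor is the heuristic the patch
uses):

	static inline s64 discount_overshoot(s64 overshoot, bool upscaled)
	{
		/* When the TSC rate was just raised, part of the measured
		 * overshoot accumulated at the old, slower rate, so
		 * discount it before classifying it for the stats. */
		return upscaled ? overshoot * 9 / 10 : overshoot;
	}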
I'll clean up my patches and resend as a series today.
Zach
* Re: RFC: kvmclock / tsc server side fix
From: Glauber Costa @ 2010-05-18 14:08 UTC
To: Zachary Amsden; +Cc: kvm, Avi Kivity, Marcelo Tosatti
On Mon, May 17, 2010 at 09:38:30AM -1000, Zachary Amsden wrote:
> >I don't get the role of upscale in your patch. Frequency changes are
> >already handled by the cpufreq notifier.
>
> The only purpose of upscale is to downscale the measurement of delta
> used for counting stats if CPU frequency was raised since last
> observed. This is because moving to a faster TSC rate means we
> might have counted some cycles at the wrong rate while the rate was
> in transition. It doesn't much matter, as the delta for which
> "overrun" is logged was computed wrong anyway.
mind sending the stats part in a separate patch?
* Re: RFC: kvmclock / tsc server side fix
From: Zachary Amsden @ 2010-05-18 15:00 UTC
To: Glauber Costa; +Cc: kvm, Avi Kivity, Marcelo Tosatti
On 05/18/2010 04:08 AM, Glauber Costa wrote:
> mind sending the stats part in a separate patch?
>
Yeah, part of this was badly broken.
>
>> I'll clean up my patches and resend as a series today.
>>
>> Zach
>>
Or today. Nasty bug.