From: Zachary Amsden <zamsden@redhat.com>
To: Avi Kivity <avi@redhat.com>
Cc: mtosatti@redhat.com, glommer@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 11/17] Fix a possible backwards warp of kvmclock
Date: Tue, 15 Jun 2010 10:37:01 -1000 [thread overview]
Message-ID: <4C17E46D.5060000@redhat.com> (raw)
In-Reply-To: <4C173C6E.90503@redhat.com>
On 06/14/2010 10:40 PM, Avi Kivity wrote:
> On 06/15/2010 10:34 AM, Zachary Amsden wrote:
>> Kernel time, which advances in discrete steps may progress much slower
>> than TSC. As a result, when kvmclock is adjusted to a new base, the
>> apparent time to the guest, which runs at a much higher, nsec scaled
>> rate based on the current TSC, may have already been observed to have
>> a larger value (kernel_ns + scaled tsc) than the value to which we are
>> setting it (kernel_ns + 0).
>>
>> We must instead compute the clock as potentially observed by the guest
>> for kernel_ns to make sure it does not go backwards.
>>
>> @@ -455,6 +457,8 @@ struct kvm_vcpu_stat {
>> u32 hypercalls;
>> u32 irq_injections;
>> u32 nmi_injections;
>> + u32 tsc_overshoot;
>> + u32 tsc_ahead;
>> };
>
> Please don't add new stats, instead add tracepoints which can also be
> observed as stats.
>
> But does this really merit exposing? What would one do with this
> information?
>
>> struct kvm_vcpu_arch *vcpu =&v->arch;
>> void *shared_kaddr;
>> unsigned long this_tsc_khz;
>> + s64 kernel_ns, max_kernel_ns;
>> + u64 tsc_timestamp;
>>
>> if ((!vcpu->time_page))
>> return 0;
>>
>> - this_tsc_khz = get_cpu_var(cpu_tsc_khz);
>> - put_cpu_var(cpu_tsc_khz);
>> + /*
>> + * The protection we require is simple: we must not be preempted
>> from
>> + * the CPU between our read of the TSC khz and our read of the TSC.
>> + * Interrupt protection is not strictly required, but it does
>> result in
>> + * greater accuracy for the TSC / kernel_ns measurement.
>> + */
>> + local_irq_save(flags);
>> + this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
>> + kvm_get_msr(v, MSR_IA32_TSC,&tsc_timestamp);
>
> That's a slow path, since it has to go through kvm_get_msr()'s if
> tree. Could use its own accessor.
>
> But this isn't introduced by this patch, so it can be fixed by another.
>
>> + ktime_get_ts(&ts);
>> + monotonic_to_bootbased(&ts);
>> + kernel_ns = timespec_to_ns(&ts);
>> + local_irq_restore(flags);
>> +
>> if (unlikely(this_tsc_khz == 0)) {
>> kvm_request_guest_time_update(v);
>> return 1;
>> }
>>
>> + /*
>> + * Time as measured by the TSC may go backwards when resetting
>> the base
>> + * tsc_timestamp. The reason for this is that the TSC
>> resolution is
>> + * higher than the resolution of the other clock scales. Thus,
>> many
>> + * possible measurments of the TSC correspond to one measurement
>> of any
>> + * other clock, and so a spread of values is possible. This is
>> not a
>> + * problem for the computation of the nanosecond clock; with TSC
>> rates
>> + * around 1GHZ, there can only be a few cycles which correspond
>> to one
>> + * nanosecond value, and any path through this code will inevitably
>> + * take longer than that. However, with the kernel_ns value
>> itself,
>> + * the precision may be much lower, down to HZ granularity. If the
>> + * first sampling of TSC against kernel_ns ends in the low part
>> of the
>> + * range, and the second in the high end of the range, we can get:
>> + *
>> + * (TSC - offset_low) * S + kns_old> (TSC - offset_high) * S +
>> kns_new
>> + *
>> + * As the sampling errors potentially range in the thousands of
>> cycles,
>> + * it is possible such a time value has already been observed by
>> the
>> + * guest. To protect against this, we must compute the system
>> time as
>> + * observed by the guest and ensure the new system time is greater.
>> + */
>> + max_kernel_ns = 0;
>> + if (vcpu->hv_clock.tsc_timestamp) {
>> + max_kernel_ns = vcpu->last_guest_tsc -
>> + vcpu->hv_clock.tsc_timestamp;
>> + max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
>> + vcpu->hv_clock.tsc_to_system_mul,
>> + vcpu->hv_clock.tsc_shift);
>> + max_kernel_ns += vcpu->last_kernel_ns;
>> + }
>> +
>> if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>> - kvm_set_time_scale(this_tsc_khz,&vcpu->hv_clock);
>> + kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
>> + &vcpu->hv_clock.tsc_shift,
>> + &vcpu->hv_clock.tsc_to_system_mul);
>> vcpu->hw_tsc_khz = this_tsc_khz;
>> }
>>
>> - /* Keep irq disabled to prevent changes to the clock */
>> - local_irq_save(flags);
>> - kvm_get_msr(v, MSR_IA32_TSC,&vcpu->hv_clock.tsc_timestamp);
>> - ktime_get_ts(&ts);
>> - monotonic_to_bootbased(&ts);
>> - local_irq_restore(flags);
>> + if (max_kernel_ns> kernel_ns) {
>> + s64 overshoot = max_kernel_ns - kernel_ns;
>> + ++v->stat.tsc_ahead;
>> + if (overshoot> NSEC_PER_SEC / HZ) {
>> + ++v->stat.tsc_overshoot;
>> + if (printk_ratelimit())
>> + pr_debug("ns overshoot: %lld\n", overshoot);
>> + }
>
> A tracepoint here would allow recording both the number of overshoots
> and the value of the overshoot. But I don't think this is of much use
> day-to-day.
FWIW, I was using this to track how often this case would hit and by how
much. Originally, tsc_ahead was firing near 100% and tsc_overshoot near
0%, but moving the observation of last_guest_tsc into the exit path
decreased both number to near zero. Obviously it's a bit hardware
dependent, as it matters how high resolution the kernel clocksource is
(and how recent your kernel).
I'll rip the stats stuff for sure.
next prev parent reply other threads:[~2010-06-15 20:37 UTC|newest]
Thread overview: 70+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-06-15 7:34 TSC cleanups, fixes, documentation for KVM Zachary Amsden
2010-06-15 7:34 ` [PATCH 01/17] Eliminate duplicated timer code Zachary Amsden
2010-06-16 13:07 ` Glauber Costa
2010-06-15 7:34 ` [PATCH 02/17] Make cpu_tsc_khz updates use local CPU Zachary Amsden
2010-06-15 8:02 ` Avi Kivity
2010-06-15 7:34 ` [PATCH 03/17] Unify vendor TSC logic Zachary Amsden
2010-06-16 8:10 ` Jason Wang
2010-06-16 13:22 ` Glauber Costa
2010-06-17 8:03 ` Jason Wang
2010-06-16 18:42 ` Zachary Amsden
2010-06-17 8:15 ` Jason Wang
2010-06-17 20:30 ` Zachary Amsden
2010-06-15 7:34 ` [PATCH 04/17] Fix deep C-state TSC desynchronization Zachary Amsden
2010-06-15 8:09 ` Avi Kivity
2010-06-15 8:14 ` Zachary Amsden
2010-06-16 8:10 ` Jason Wang
2010-06-16 18:43 ` Zachary Amsden
2010-06-16 13:24 ` Glauber Costa
2010-06-16 19:20 ` Zachary Amsden
2010-06-15 7:34 ` [PATCH 05/17] Keep SMP VMs more in sync on unstable TSC Zachary Amsden
2010-06-15 8:11 ` Avi Kivity
2010-06-16 8:11 ` Jason Wang
2010-06-16 13:32 ` Glauber Costa
2010-06-16 21:15 ` Zachary Amsden
2010-06-15 7:34 ` [PATCH 06/17] Rename KVM_REQ_KVMCLOCK_UPDATE Zachary Amsden
2010-06-15 7:34 ` [PATCH 07/17] Perform hardware_enable in CPU_STARTING callback Zachary Amsden
2010-06-15 7:34 ` [PATCH 08/17] Add clock sync request to hardware enable Zachary Amsden
2010-06-15 8:24 ` Avi Kivity
2010-06-15 7:34 ` [PATCH 09/17] Move scale_delta into common header Zachary Amsden
2010-06-15 7:34 ` [PATCH 10/17] Make KVM clock computation work for other scales Zachary Amsden
2010-06-15 7:34 ` [PATCH 11/17] Fix a possible backwards warp of kvmclock Zachary Amsden
2010-06-15 8:40 ` Avi Kivity
2010-06-15 20:37 ` Zachary Amsden [this message]
2010-06-15 23:47 ` Marcelo Tosatti
2010-06-16 0:21 ` Zachary Amsden
2010-06-16 0:39 ` Marcelo Tosatti
2010-06-16 8:11 ` Jason Wang
2010-06-16 13:58 ` Glauber Costa
2010-06-16 14:13 ` Avi Kivity
2010-06-16 14:58 ` Glauber Costa
2010-06-16 22:38 ` Zachary Amsden
2010-06-16 19:36 ` Zachary Amsden
2010-06-15 7:34 ` [PATCH 12/17] Add helper function get_kernel_ns Zachary Amsden
2010-06-15 8:41 ` Avi Kivity
2010-06-15 21:03 ` Zachary Amsden
2010-06-15 21:13 ` john stultz
2010-06-16 8:12 ` Jason Wang
2010-06-16 14:03 ` Glauber Costa
2010-06-15 7:34 ` [PATCH 13/17] Add TSC offset tracking Zachary Amsden
2010-06-15 8:44 ` Avi Kivity
2010-06-15 7:34 ` [PATCH 14/17] Fix SVM VMCB reset Zachary Amsden
2010-06-15 7:34 ` [PATCH 15/17] Fix AMD C1 TSC desynchronization Zachary Amsden
2010-06-15 8:47 ` Avi Kivity
2010-06-15 9:21 ` Zachary Amsden
2010-06-15 14:46 ` Roedel, Joerg
2010-06-15 7:34 ` [PATCH 16/17] TSC reset compensation Zachary Amsden
2010-06-15 8:51 ` Avi Kivity
2010-06-15 20:32 ` Zachary Amsden
2010-06-16 0:27 ` Marcelo Tosatti
2010-06-16 0:32 ` Zachary Amsden
2010-06-16 13:52 ` Glauber Costa
2010-06-16 22:36 ` Zachary Amsden
2010-06-15 7:34 ` [PATCH 17/17] Add timekeeping documentation Zachary Amsden
2010-06-15 8:51 ` Avi Kivity
2010-06-15 20:27 ` Randy Dunlap
2010-06-16 23:59 ` Zachary Amsden
2010-06-17 8:55 ` Andi Kleen
2010-06-17 21:14 ` Zachary Amsden
2010-06-18 7:49 ` Andi Kleen
2010-06-18 16:33 ` Zachary Amsden
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4C17E46D.5060000@redhat.com \
--to=zamsden@redhat.com \
--cc=avi@redhat.com \
--cc=glommer@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mtosatti@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox