From: Paul Durrant <xadimgnik@gmail.com>
To: David Woodhouse <dwmw2@infradead.org>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [PATCH] KVM: x86/xen: improve accuracy of Xen timers
Date: Tue, 31 Oct 2023 10:42:42 +0000 [thread overview]
Message-ID: <1a679274-bbff-4549-a1ea-c7ea9f1707cc@xen.org> (raw)
In-Reply-To: <96da7273adfff2a346de9a4a27ce064f6fe0d0a1.camel@infradead.org>
On 30/10/2023 15:50, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> A test program such as http://david.woodhou.se/timerlat.c confirms user
> reports that timers are increasingly inaccurate as the lifetime of a
> guest increases. Reporting the actual delay observed when asking for
> 100µs of sleep, it starts off OK on a newly-launched guest but gets
> worse over time, giving incorrect sleep times:
>
> root@ip-10-0-193-21:~# ./timerlat -c -n 5
> 00000000 latency 103243/100000 (3.2430%)
> 00000001 latency 103243/100000 (3.2430%)
> 00000002 latency 103242/100000 (3.2420%)
> 00000003 latency 103245/100000 (3.2450%)
> 00000004 latency 103245/100000 (3.2450%)
>
> The biggest problem is that get_kvmclock_ns() returns inaccurate values
> when the guest TSC is scaled. The guest sees a TSC value scaled from the
> host TSC by a mul/shift conversion (hopefully done in hardware). The
> guest then converts that guest TSC value into nanoseconds using the
> mul/shift conversion given to it by the KVM pvclock information.
>
> But get_kvmclock_ns() performs only a single conversion directly from
> host TSC to nanoseconds, giving a different result. A test program at
> http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
> over a day.
>
> It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
> that. The actual guest hv_clock is per-CPU, and *theoretically* each
> vCPU could be running at a *different* frequency. But this patch is
> needed anyway because...
>
> The other issue with Xen timers was that the code would snapshot the
> host CLOCK_MONOTONIC at some point in time, and then... after a few
> interrupts may have occurred, some preemption perhaps... would also read
> the guest's kvmclock. Then it would proceed under the false assumption
> that those two happened at the *same* time. Any time which *actually*
> elapsed between reading the two clocks was introduced as inaccuracies
> in the time at which the timer fired.
>
> Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
> host TSC just *once*, then use the returned TSC value to calculate the
> kvmclock (making sure to do that the way the guest would instead of
> making the same mistake get_kvmclock_ns() does).
>
> Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
> timers still have to use CLOCK_MONOTONIC. In practice the difference
> between the two won't matter over the timescales involved, as the
> *absolute* values don't matter; just the delta.
>
> This does mean a new variant of kvm_get_time_and_clockread() is needed;
> called kvm_get_monotonic_and_clockread() because that's what it does.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> arch/x86/kvm/x86.c | 30 ++++++++++++
> arch/x86/kvm/x86.h | 1 +
> arch/x86/kvm/xen.c | 111 +++++++++++++++++++++++++++++++--------------
> 3 files changed, 109 insertions(+), 33 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 41cce5031126..aeede83d65dc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2863,6 +2863,25 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
> return mode;
> }
>
> +static int do_monotonic(s64 *t, u64 *tsc_timestamp)
> +{
> + struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> + unsigned long seq;
> + int mode;
> + u64 ns;
> +
> + do {
> + seq = read_seqcount_begin(>od->seq);
> + ns = gtod->clock.base_cycles;
> + ns += vgettsc(>od->clock, tsc_timestamp, &mode);
> + ns >>= gtod->clock.shift;
> + ns += ktime_to_ns(ktime_add(gtod->clock.offset, gtod->offs_boot));
> + } while (unlikely(read_seqcount_retry(>od->seq, seq)));
> + *t = ns;
> +
> + return mode;
> +}
> +
> static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
> {
> struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
> @@ -2895,6 +2914,17 @@ static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
> tsc_timestamp));
> }
>
> +/* returns true if host is using TSC based clocksource */
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
> +{
> + /* checked again under seqlock below */
> + if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
> + return false;
> +
> + return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
> + tsc_timestamp));
> +}
> +
> /* returns true if host is using TSC based clocksource */
> static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
> u64 *tsc_timestamp)
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1e7be1f6ab29..c08c6f729965 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -293,6 +293,7 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
> void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
>
> u64 get_kvmclock_ns(struct kvm *kvm);
> +bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
>
> int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
> gva_t addr, void *val, unsigned int bytes,
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 0ea6016ad132..00a1e924a717 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -24,6 +24,7 @@
> #include <xen/interface/sched.h>
>
> #include <asm/xen/cpuid.h>
> +#include <asm/pvclock.h>
>
> #include "cpuid.h"
> #include "trace.h"
> @@ -144,17 +145,87 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
> return HRTIMER_NORESTART;
> }
>
> -static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
> +static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
> + bool linux_wa)
> {
> + uint64_t guest_now;
> + int64_t kernel_now, delta;
> +
> + /*
> + * The guest provides the requested timeout in absolute nanoseconds
> + * of the KVM clock — as *it* sees it, based on the scaled TSC and
> + * the pvclock information provided by KVM.
> + *
> + * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
> + * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
> + * difference won't matter much as there is no cumulative effect.
> + *
> + * Calculate the time for some arbitrary point in time around "now"
> + * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
> + * delta between the kvmclock "now" value and the guest's requested
> + * timeout, apply the "Linux workaround" described below, and add
> + * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
> + * the absolute CLOCK_MONOTONIC time at which the timer should
> + * fire.
> + */
> + if (vcpu->kvm->arch.use_master_clock &&
> + static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> + uint64_t host_tsc, guest_tsc;
> +
> + if (!IS_ENABLED(CONFIG_64BIT) ||
> + !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
> + /*
> + * Don't fall back to get_kvmclock_ns() because it's
> + * broken; it has a systemic error in its results
> + * because it scales directly from host TSC to
> + * nanoseconds, and doesn't scale first to guest TSC
> + * and then* to nanoseconds as the guest does.
> + *
> + * There is a small error introduced here because time
> + * continues to elapse between the ktime_get() and the
> + * subsequent rdtsc(). But not the systemic drift due
> + * to get_kvmclock_ns().
> + */
> + kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
> + host_tsc = rdtsc();
> + }
> +
> + /* Calculate the guest kvmclock as the guest would do it. */
> + guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> + guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock, guest_tsc);
> + } else {
> + /* Without CONSTANT_TSC, get_kvmclock_ns() is the only option */
> + guest_now = get_kvmclock_ns(vcpu->kvm);
> + kernel_now = ktime_get();
> + }
> +
> + delta = guest_abs - guest_now;
> +
> + /* Xen has a 'Linux workaround' in do_set_timer_op() which
> + * checks for negative absolute timeout values (caused by
> + * integer overflow), and for values about 13 days in the
> + * future (2^50ns) which would be caused by jiffies
> + * overflow. For those cases, it sets the timeout 100ms in
> + * the future (not *too* soon, since if a guest really did
> + * set a long timeout on purpose we don't want to keep
> + * churning CPU time by waking it up).
> + */
> + if (linux_wa) {
> + if ((unlikely((int64_t)guest_abs < 0 ||
> + (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
> + delta = 100 * NSEC_PER_MSEC;
> + guest_abs = guest_now + delta;
> + }
> + }
> +
> atomic_set(&vcpu->arch.xen.timer_pending, 0);
> vcpu->arch.xen.timer_expires = guest_abs;
>
> - if (delta_ns <= 0) {
> + if (delta <= 0) {
> xen_timer_callback(&vcpu->arch.xen.timer);
> } else {
> - ktime_t ktime_now = ktime_get();
> hrtimer_start(&vcpu->arch.xen.timer,
> - ktime_add_ns(ktime_now, delta_ns),
> + ktime_add_ns(kernel_now, delta),
> HRTIMER_MODE_ABS_HARD);
> }
> }
> @@ -923,8 +994,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> /* Start the timer if the new value has a valid vector+expiry. */
> if (data->u.timer.port && data->u.timer.expires_ns)
> kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
> - data->u.timer.expires_ns -
> - get_kvmclock_ns(vcpu->kvm));
> + false);
There is no documented ordering requirement on setting
KVM_XEN_VCPU_ATTR_TYPE_TIMER versus KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or
KVM_XEN_ATTR_TYPE_SHARED_INFO but kvm_xen_start_timer() now needs the
vCPU's pvclock to be valid. Should actually starting the timer not be
deferred until then? (Or simply add a check here and have the attribute
setting fail if the pvclock is not valid).
Paul
>
> r = 0;
> break;
> @@ -1340,7 +1410,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
> {
> struct vcpu_set_singleshot_timer oneshot;
> struct x86_exception e;
> - s64 delta;
>
> if (!kvm_xen_timer_enabled(vcpu))
> return false;
> @@ -1374,13 +1443,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
> return true;
> }
>
> - delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
> - if ((oneshot.flags & VCPU_SSHOTTMR_future) && delta < 0) {
> - *r = -ETIME;
> - return true;
> - }
> -
> - kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
> + kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
> *r = 0;
> return true;
>
> @@ -1404,25 +1467,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
> return false;
>
> if (timeout) {
> - uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
> - int64_t delta = timeout - guest_now;
> -
> - /* Xen has a 'Linux workaround' in do_set_timer_op() which
> - * checks for negative absolute timeout values (caused by
> - * integer overflow), and for values about 13 days in the
> - * future (2^50ns) which would be caused by jiffies
> - * overflow. For those cases, it sets the timeout 100ms in
> - * the future (not *too* soon, since if a guest really did
> - * set a long timeout on purpose we don't want to keep
> - * churning CPU time by waking it up).
> - */
> - if (unlikely((int64_t)timeout < 0 ||
> - (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
> - delta = 100 * NSEC_PER_MSEC;
> - timeout = guest_now + delta;
> - }
> -
> - kvm_xen_start_timer(vcpu, timeout, delta);
> + kvm_xen_start_timer(vcpu, timeout, true);
> } else {
> kvm_xen_stop_timer(vcpu);
> }
next prev parent reply other threads:[~2023-10-31 10:42 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-30 15:50 [PATCH] KVM: x86/xen: improve accuracy of Xen timers David Woodhouse
2023-10-31 10:42 ` Paul Durrant [this message]
2023-10-31 11:42 ` David Woodhouse
2023-10-31 12:06 ` Paul Durrant
2023-10-31 12:11 ` David Woodhouse
2023-10-31 12:22 ` Paul Durrant
2023-10-31 16:34 ` David Woodhouse
2023-10-31 19:45 ` David Woodhouse
2023-10-31 22:38 ` David Woodhouse
2023-10-31 23:13 ` [PATCH v2] " David Woodhouse
2023-11-07 1:44 ` Dongli Zhang
2023-11-07 8:17 ` David Woodhouse
2023-11-07 23:07 ` Dongli Zhang
2023-11-07 23:24 ` David Woodhouse
2023-11-08 1:43 ` Dongli Zhang
2023-11-08 15:25 ` David Woodhouse
2023-11-08 18:22 ` Dongli Zhang
2023-11-08 19:14 ` David Woodhouse
[not found] <b5a974bdc330be91c2356f5bb2cc68ef1cc7ed40.camel@infradead.org>
2023-10-29 10:37 ` [PATCH] " kernel test robot
2023-11-02 22:11 ` kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1a679274-bbff-4549-a1ea-c7ea9f1707cc@xen.org \
--to=xadimgnik@gmail.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=dwmw2@infradead.org \
--cc=hpa@zytor.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=paul@xen.org \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox