From: Sean Christopherson <seanjc@google.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
Jonathan Corbet <corbet@lwn.net>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
Paul Durrant <paul@xen.org>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Valentin Schneider <vschneid@redhat.com>,
Shuah Khan <shuah@kernel.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
jalliste@amazon.co.uk, sveith@amazon.de, zide.chen@intel.com,
Dongli Zhang <dongli.zhang@oracle.com>,
Chenyi Qiang <chenyi.qiang@intel.com>
Subject: Re: [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock()
Date: Tue, 13 Aug 2024 19:58:16 -0700
Message-ID: <ZrwdSLvlhde6uaAB@google.com>
In-Reply-To: <20240522001817.619072-10-dwmw2@infradead.org>
On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
>
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
>
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
>
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
>
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
>
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
>
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
>
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.

This should be a separate patch. Actually, patches, plural. More below.

> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
...
> @@ -3100,36 +3131,49 @@ static unsigned long get_cpu_tsc_khz(void)
>  static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  {
>          struct kvm_arch *ka = &kvm->arch;
> -        struct pvclock_vcpu_time_info hv_clock;
> +
> +#ifdef CONFIG_X86_64
> +        uint64_t cur_tsc_khz = 0;
> +        struct timespec64 ts;
>
>          /* both __this_cpu_read() and rdtsc() should be on the same cpu */
>          get_cpu();
>
> -        data->flags = 0;
>          if (ka->use_master_clock &&
> -            (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
> -#ifdef CONFIG_X86_64
> -                struct timespec64 ts;
> +            (cur_tsc_khz = get_cpu_tsc_khz()) &&

That is mean. And if you push it inside the if-statement, the {get,put}_cpu()
can be avoided when the master clock isn't being used, e.g.

        if (ka->use_master_clock) {
                /*
                 * The RDTSC needs to happen on the same CPU whose frequency is
                 * used to compute kvmclock's time.
                 */
                get_cpu();

                cur_tsc_khz = get_cpu_tsc_khz();
                if (cur_tsc_khz &&
                    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
                        cur_tsc_khz = 0;

                put_cpu();
        }

However, the changelog essentially claims kvm_get_walltime_and_clockread() should
never fail when use_master_clock is enabled, which suggests a WARN is warranted.

  There was previously a code path in __get_kvmclock() which looked like
  it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
  even on 32-bit hosts. In practice that could never happen as the
  ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
  hosts it would never be set when the system clock isn't TSC-based. So
  that code path is now removed.

But, I think kvm_get_walltime_and_clockread() can fail when use_master_clock is
true, i.e. I don't think a WARN is viable as it could get false positives.

Ah, this is protected by pvclock_sc, so a stale use_master_clock should result
in a retry. What if we WARN on that?

Hrm, that requires plumbing in the original sequence count. Ah, but looking at
the patch as a whole, if we keep kvm_get_wall_clock_epoch()'s style, then it's
much easier. And FWIW, I like the existing kvm_get_wall_clock_epoch() style a
lot more than the get_kvmclock() => __get_kvmclock() approach.

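For reference, the split I'm referring to is roughly the below (quoting from
memory, so take the details with a grain of salt). The sequence count lives
entirely in the wrapper, so __get_kvmclock() can't WARN on a stale
use_master_clock without plumbing "seq" through:

  static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
  {
          struct kvm_arch *ka = &kvm->arch;
          unsigned seq;

          do {
                  seq = read_seqcount_begin(&ka->pvclock_sc);
                  __get_kvmclock(kvm, data);
          } while (read_seqcount_retry(&ka->pvclock_sc, seq));
  }
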
So, can we do this as prep patch #1?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9c14d0f5a684..98806a59e110 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3360,9 +3360,16 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 
                 local_tsc_khz = get_cpu_tsc_khz();
 
+                /*
+                 * The master clock depends on the pvclock being based on TSC,
+                 * so the only way kvm_get_walltime_and_clockread() can fail is
+                 * if the clocksource changed and use_master_clock is stale, in
+                 * which case a seqcount retry should be pending.
+                 */
                 if (local_tsc_khz &&
-                    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
-                        local_tsc_khz = 0; /* Fall back to old method */
+                    !kvm_get_walltime_and_clockread(&ts, &host_tsc) &&
+                    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
+                        local_tsc_khz = 0; /* Fall back to old method */
 
                 put_cpu();

And then as patch(es) 2..7 (give or take):

(2) fold __get_kvmclock() into get_kvmclock()
(3) add the same WARN on the seqcount in get_kvmclock() (but skimp on the
    comments); rough sketch of (2)+(3) below
(4) use get_kvmclock_base_ns() as the fallback in get_kvmclock(), i.e. delete
    the raw rdtsc() and setting of KVM_CLOCK_TSC_STABLE w/o KVM_CLOCK_REALTIME
(5) use get_cpu_tsc_khz() instead of open coding something similar
(6) scale TSC when computing kvmclock (the core of this patch)
(7) use get_kvmclock() in kvm_get_wall_clock_epoch(), as they will be 100%
    equivalent at this point.
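
Rough sketch of where (2)+(3) could land, purely to illustrate the shape;
untested, and with the actual kvmclock math elided:

  /* Untested sketch, not a patch; the master clock math is elided. */
  static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
  {
          struct kvm_arch *ka = &kvm->arch;
          u64 cur_tsc_khz;
          unsigned seq;

          do {
                  cur_tsc_khz = 0;
                  seq = read_seqcount_begin(&ka->pvclock_sc);

  #ifdef CONFIG_X86_64
                  if (ka->use_master_clock) {
                          struct timespec64 ts;

                          get_cpu();
                          cur_tsc_khz = get_cpu_tsc_khz();
                          if (cur_tsc_khz &&
                              !kvm_get_walltime_and_clockread(&ts, &data->host_tsc) &&
                              WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
                                  cur_tsc_khz = 0;
                          put_cpu();

                          if (cur_tsc_khz) {
                                  /* master clock path: fill data->clock/realtime/flags */
                          }
                  }
  #endif
                  if (!cur_tsc_khz) {
                          data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
                          data->flags = 0;
                  }
          } while (read_seqcount_retry(&ka->pvclock_sc, seq));
  }
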
> +            !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
> +                cur_tsc_khz = 0;
>
> -                if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> -                        data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> -                        data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> -                } else
> -#endif
> -                        data->host_tsc = rdtsc();
> -
> -                data->flags |= KVM_CLOCK_TSC_STABLE;
> -                hv_clock.tsc_timestamp = ka->master_cycle_now;
> -                hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> -                kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
> -                                   &hv_clock.tsc_shift,
> -                                   &hv_clock.tsc_to_system_mul);
> -                data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> -        } else {
> -                data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +        put_cpu();
> +
> +        if (cur_tsc_khz) {
> +                uint64_t tsc_cycles;
> +                uint32_t mul;
> +                int8_t shift;
> +
> +                tsc_cycles = data->host_tsc - ka->master_cycle_now;
> +
> +                if (kvm_caps.has_tsc_control)
> +                        tsc_cycles = kvm_scale_tsc(tsc_cycles,
> +                                                   ka->master_tsc_scaling_ratio);
> +
> +                if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +                        mul = ka->master_tsc_mul;
> +                        shift = ka->master_tsc_shift;
> +                } else {
> +                        kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
> +                                           &shift, &mul);
> +                }
> +                data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
> +                              pvclock_scale_delta(tsc_cycles, mul, shift);
> +                data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> +                data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
> +                return;
>          }
> +#endif
>
> -        put_cpu();
> +        data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +        data->flags = 0;
>  }
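
FWIW, for anyone skimming, the net effect of the hunk above is: scale the host
cycle delta into guest cycles first, then convert guest cycles to nanoseconds
with the guest's mul/shift pair. Toy helpers below just to spell out the
fixed-point math; they mirror kvm_scale_tsc() and the pvclock ABI conversion,
but aren't the kernel's exact implementations:

  /* guest_cycles = host_cycles * ratio / 2^frac_bits, a la kvm_scale_tsc() */
  static inline u64 toy_scale_tsc(u64 host_cycles, u64 ratio, unsigned int frac_bits)
  {
          return mul_u64_u64_shr(host_cycles, ratio, frac_bits);
  }

  /* ns = ((cycles << shift) * mul) >> 32, i.e. the pvclock fixed-point form */
  static inline u64 toy_cycles_to_ns(u64 cycles, u32 mul, s8 shift)
  {
          if (shift < 0)
                  cycles >>= -shift;
          else
                  cycles <<= shift;
          return (u64)(((unsigned __int128)cycles * mul) >> 32);
  }

  /*
   * kvmclock = master_kernel_ns + kvmclock_offset +
   *            toy_cycles_to_ns(toy_scale_tsc(host_tsc - master_cycle_now,
   *                                           master_tsc_scaling_ratio, frac_bits),
   *                             master_tsc_mul, master_tsc_shift);
   */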