kvm.vger.kernel.org archive mirror
* [RFC PATCH v3 00/21] Cleaning up the KVM clock mess
@ 2024-05-22  0:16 David Woodhouse
  2024-05-22  0:16 ` [RFC PATCH v3 01/21] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
                   ` (20 more replies)
  0 siblings, 21 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:16 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

Clean up the KVM clock mess somewhat so that it is either based on the guest
TSC ("master clock" mode), or on the host CLOCK_MONOTONIC_RAW in cases where
the TSC isn't usable.

Eliminate the third variant, in which the clock was based directly on the
*host* TSC due to bugs in e.g. __get_kvmclock().

Kill off the last vestiges of the KVM clock being based on CLOCK_MONOTONIC
instead of CLOCK_MONOTONIC_RAW and thus being subject to NTP skew.

Fix up migration support to allow the KVM clock to be saved/restored as an
arithmetic function of the guest TSC, since that's what it actually is in
the *common* case, so it can be migrated precisely, or at least to within
±1 ns, which is good enough, as discussed in
https://lore.kernel.org/kvm/c8dca08bf848e663f192de6705bf04aa3966e856.camel@infradead.org

In v2 of this series, TSC synchronization is improved and simplified a bit
too, and we allow masterclock mode to be used even when the guest TSCs are
out of sync, as long as they're running at the same *rate*. The different
*offset* shouldn't matter.

And the kvm_get_time_scale() function annoyed me by being entirely opaque,
so I studied it until my brain hurt and then added some comments.

In v2 I also dropped the commits which were removing the periodic clock
syncs. In v3 I put them back again but *only* for the non-masterclock
mode, along with cleaning up some other gratuitous clock jumps while in
masterclock mode. This version also includes Jack's patch to move the
pvclock structure to uapi.

I also fixed the bug pointed out by Chenyi Qiang, that I was failing to
set vcpu->arch.this_tsc_{nsec,write} after removing the cur_tsc_* fields.

I also included patches to fix advertised steal time going backwards, and
to make the guest more resilient to it. Those may end up being split out
and submitted under separate cover (with selftests).

Still needs more comprehensive selftests.

David Woodhouse (18):
      KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
      KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
      KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
      KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
      KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
      KVM: x86: Fix KVM clock precision in __get_kvmclock()
      KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
      KVM: x86: Simplify and comment kvm_get_time_scale()
      KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset()
      KVM: x86: Improve synchronization in kvm_synchronize_tsc()
      KVM: x86: Kill cur_tsc_{nsec,offset,write} fields
      KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
      KVM: x86: Factor out kvm_use_master_clock()
      KVM: x86: Avoid global clock update on setting KVM clock MSR
      KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load()
      KVM: x86: Avoid periodic KVM clock updates in master clock mode
      KVM: x86/xen: Prevent runstate times from becoming negative
      sched/cputime: Cope with steal time going backwards or negative

Jack Allister (3):
      KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
      UAPI: x86: Move pvclock-abi to UAPI for x86 platforms
      KVM: selftests: Add KVM/PV clock selftest to prove timer correction

 Documentation/virt/kvm/api.rst                    |  37 ++
 Documentation/virt/kvm/devices/vcpu.rst           | 115 +++-
 arch/x86/include/asm/kvm_host.h                   |  15 +-
 arch/x86/include/uapi/asm/kvm.h                   |   6 +
 arch/x86/include/{ => uapi}/asm/pvclock-abi.h     |  24 +-
 arch/x86/kvm/svm/svm.c                            |   3 +-
 arch/x86/kvm/vmx/vmx.c                            |   2 +-
 arch/x86/kvm/x86.c                                | 716 +++++++++++++++-------
 arch/x86/kvm/xen.c                                |  22 +-
 include/uapi/linux/kvm.h                          |   3 +
 kernel/sched/cputime.c                            |  20 +-
 tools/testing/selftests/kvm/Makefile              |   1 +
 tools/testing/selftests/kvm/x86_64/pvclock_test.c | 192 ++++++
 13 files changed, 884 insertions(+), 272 deletions(-)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 01/21] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
@ 2024-05-22  0:16 ` David Woodhouse
  2024-05-22  0:16 ` [RFC PATCH v3 02/21] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:16 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

The KVM clock is an interesting thing. It is defined as "nanoseconds
since the guest was created", but in practice it runs at two *different*
rates — or three different rates, if you count implementation bugs.

Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
of the host, with a delta of kvm->arch.kvmclock_offset.

But that version doesn't actually get used in the common case, where the
host has a reliable TSC and the guest TSCs are all running at the same
rate and in sync with each other, and kvm->arch.use_master_clock is set.

In that common case, definition B is used: There is a reference point in
time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
and a corresponding host TSC value kvm->arch.master_cycle_now. This
fixed point in time is converted to guest units (the time offset by
kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
value) and advertised to the guest in the pvclock structure. While in
this 'use_master_clock' mode, the fixed point in time never needs to be
changed, and the clock runs precisely in time with the guest TSC, at the
rate advertised in the pvclock structure.
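
For reference, the guest-side conversion defined by the pvclock ABI looks
roughly like this. It is only a sketch, equivalent to the kernel's
__pvclock_read_cycles(), with a compiler-provided 128-bit multiply standing
in for mul_u64_u32_shr():

static uint64_t pvclock_ns(const struct pvclock_vcpu_time_info *pvti,
			   uint64_t guest_tsc)
{
	/* Guest TSC cycles since the reference point in the structure */
	uint64_t delta = guest_tsc - pvti->tsc_timestamp;

	/* Scale to nanoseconds: ((delta << shift) * mul) >> 32 */
	if (pvti->tsc_shift >= 0)
		delta <<= pvti->tsc_shift;
	else
		delta >>= -pvti->tsc_shift;

	return pvti->system_time +
	       (uint64_t)(((unsigned __int128)delta *
			   pvti->tsc_to_system_mul) >> 32);
}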

The third definition C is implemented in kvm_get_wall_clock_epoch() and
__get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
but converting the *host* TSC cycles directly to a value in nanoseconds
instead of scaling via the guest TSC.

One might naïvely think that all three definitions are identical, since
CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
three are just the result of counting the host TSC at a known frequency,
or the scaled guest TSC at a known precise fraction of the host's
frequency. The problem is with arithmetic precision, and the way that
frequency scaling is done in a division-free way by multiplying by a
scale factor, then shifting right. In practice, all three ways of
calculating the KVM clock will suffer a systemic drift from each other.
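
To make that drift concrete, here is a standalone sketch (not from KVM
itself) which converts the same elapsed host TSC interval to nanoseconds
directly, and via a scaled guest TSC, using the same multiply-and-shift
scheme. The scale factors here are simplified (shift of zero, and a 48-bit
fractional ratio as on VMX); kvm_get_time_scale() chooses different ones,
but the independent round-downs drift apart in just the same way:

#include <stdio.h>
#include <stdint.h>

/* ns = (cycles * mul) >> 32, with mul = (1e9 << 32) / freq_hz */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t mul)
{
	return (uint64_t)(((unsigned __int128)cycles * mul) >> 32);
}

int main(void)
{
	const uint64_t host_hz = 3000000000ULL, guest_hz = 2500000000ULL;
	const uint64_t ratio = ((unsigned __int128)guest_hz << 48) / host_hz;
	const uint32_t mul_host = (1000000000ULL << 32) / host_hz;
	const uint32_t mul_guest = (1000000000ULL << 32) / guest_hz;
	uint64_t cycles;

	for (cycles = 1000000000ULL; cycles <= 10000000000000ULL; cycles *= 10) {
		/* Host cycles converted to guest cycles, then to nanoseconds */
		uint64_t gcyc = (uint64_t)(((unsigned __int128)cycles * ratio) >> 48);
		int64_t drift = (int64_t)(cycles_to_ns(cycles, mul_host) -
					  cycles_to_ns(gcyc, mul_guest));

		printf("%llu host cycles: %lld ns apart\n",
		       (unsigned long long)cycles, (long long)drift);
	}
	return 0;
}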

Eventually, definition C should just be eliminated. Commit 451a707813ae
("KVM: x86/xen: improve accuracy of Xen timers") worked around it for
the specific case of Xen timers, which are defined in terms of the KVM
clock and suffered from a continually increasing error in timer expiry
times. That commit notes that get_kvmclock_ns() is non-trivial to fix
and says "I'll come back to that", which remains true.

Definitions A and B do need to coexist, the former to handle the case
where the host or guest TSC is suboptimally configured. But KVM should
be more careful about switching between them, and the discontinuity in
guest time which could result.

In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
time as the reference in master_kernel_ns and master_cycle_now, yanking
the guest's clock back to match definition A at that moment.

When invoked while in 'use_master_clock' mode, kvm_update_masterclock()
should probably *adjust* kvm->arch.kvmclock_offset to account for the
drift, instead of yanking the clock back to definition A. But in the
meantime there are a bunch of places where it just doesn't need to be
invoked at all.

To start with: there is no need to do such an update when a Xen guest
populates the shared_info page. This seems to have been a hangover from
the very first implementation of shared_info which automatically
populated the vcpu_info structures at their default locations, but even
then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
expected to explicitly set the vcpu_info even in its default locations,
there's not even any need for that either.

Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/xen.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index f65b35a05d91..5a83a8154b79 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	wc->version = wc_version + 1;
 	read_unlock_irq(&gpc->lock);
 
-	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
-
 out:
 	srcu_read_unlock(&kvm->srcu, idx);
 	return ret;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 02/21] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
  2024-05-22  0:16 ` [RFC PATCH v3 01/21] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
@ 2024-05-22  0:16 ` David Woodhouse
  2024-08-13 17:50   ` Sean Christopherson
  2024-05-22  0:16 ` [RFC PATCH v3 03/21] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:16 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

The kvm_guest_time_update() function scales the host TSC frequency to
the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
scaling ratio previously calculated for that vCPU. It then calculates the
scaling factors for the KVM clock itself based on that guest TSC
frequency.

However, it uses kHz as the unit when scaling, and then multiplies by
1000 only at the end.

With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
the KVM clock advertised to the guest is based on a frequency of
2,499,999,000 Hz.

By using Hz as the unit from the beginning, the KVM clock would be based
on a more accurate frequency of 2,499,999,999 Hz in this example.
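
The truncation is easy to demonstrate in isolation. A standalone sketch,
doing the scaling the same way kvm_scale_tsc() does (128-bit multiply by a
fixed-point ratio, shifted right by the fractional bits; 48 of them as on
VMX), reproduces the numbers above:

#include <stdio.h>
#include <stdint.h>

/* Equivalent of kvm_scale_tsc(): (value * ratio) >> frac_bits */
static uint64_t scale(uint64_t value, uint64_t ratio, int frac_bits)
{
	return (uint64_t)(((unsigned __int128)value * ratio) >> frac_bits);
}

int main(void)
{
	const int frac_bits = 48;
	/* Ratio for a 2500000 kHz guest on a 3000000 kHz host */
	const uint64_t ratio = ((unsigned __int128)2500000 << frac_bits) / 3000000;

	/* Old behaviour: scale in kHz, multiply by 1000 at the end */
	printf("%llu Hz\n", (unsigned long long)
	       (scale(3000000, ratio, frac_bits) * 1000));	/* 2499999000 */

	/* New behaviour: scale in Hz from the start */
	printf("%llu Hz\n", (unsigned long long)
	       scale(3000000000ULL, ratio, frac_bits));		/* 2499999999 */
	return 0;
}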

Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/x86.c              | 17 +++++++++--------
 arch/x86/kvm/xen.c              |  2 +-
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 01c69840647e..8440c4081727 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -887,7 +887,7 @@ struct kvm_vcpu_arch {
 
 	gpa_t time;
 	struct pvclock_vcpu_time_info hv_clock;
-	unsigned int hw_tsc_khz;
+	unsigned int hw_tsc_hz;
 	struct gfn_to_pfn_cache pv_time;
 	/* set guest stopped flag in pvclock flags field */
 	bool pvclock_set_guest_stopped_request;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d2619d3eee4..23281c508c27 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3215,7 +3215,8 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
-	unsigned long flags, tgt_tsc_khz;
+	unsigned long flags;
+	uint64_t tgt_tsc_hz;
 	unsigned seq;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	struct kvm_arch *ka = &v->kvm->arch;
@@ -3252,8 +3253,8 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tgt_tsc_khz = get_cpu_tsc_khz();
-	if (unlikely(tgt_tsc_khz == 0)) {
+	tgt_tsc_hz = get_cpu_tsc_khz() * 1000LL;
+	if (unlikely(tgt_tsc_hz == 0)) {
 		local_irq_restore(flags);
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 		return 1;
@@ -3288,14 +3289,14 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	/* With all the info we got, fill in the values */
 
 	if (kvm_caps.has_tsc_control)
-		tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
-					    v->arch.l1_tsc_scaling_ratio);
+		tgt_tsc_hz = kvm_scale_tsc(tgt_tsc_hz,
+					   v->arch.l1_tsc_scaling_ratio);
 
-	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
-		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
+	if (unlikely(vcpu->hw_tsc_hz != tgt_tsc_hz)) {
+		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_hz,
 				   &vcpu->hv_clock.tsc_shift,
 				   &vcpu->hv_clock.tsc_to_system_mul);
-		vcpu->hw_tsc_khz = tgt_tsc_khz;
+		vcpu->hw_tsc_hz = tgt_tsc_hz;
 		kvm_xen_update_tsc_info(v);
 	}
 
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 5a83a8154b79..014048c22652 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -2273,7 +2273,7 @@ void kvm_xen_update_tsc_info(struct kvm_vcpu *vcpu)
 
 	entry = kvm_find_cpuid_entry_index(vcpu, function, 2);
 	if (entry)
-		entry->eax = vcpu->arch.hw_tsc_khz;
+		entry->eax = vcpu->arch.hw_tsc_hz / 1000;
 }
 
 void kvm_xen_init_vm(struct kvm *kvm)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 03/21] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
  2024-05-22  0:16 ` [RFC PATCH v3 01/21] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
  2024-05-22  0:16 ` [RFC PATCH v3 02/21] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
@ 2024-05-22  0:16 ` David Woodhouse
  2024-05-22  0:16 ` [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:16 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: Jack Allister <jalliste@amazon.com>

In the common case (where kvm->arch.use_master_clock is true), the KVM
clock is defined as a simple arithmetic function of the guest TSC, based on
a reference point stored in kvm->arch.master_kernel_ns and
kvm->arch.master_cycle_now.

The existing KVM_[GS]ET_CLOCK functionality does not allow for this
relationship to be precisely saved and restored by userspace. All it can
currently do is set the KVM clock at a given UTC reference time, which is
necessarily imprecise.

So on live update, the guest TSC can remain cycle accurate at precisely the
same offset from the host TSC, but there is no way for userspace to restore
the KVM clock accurately.

Even on live migration to a new host, where the accuracy of the guest time-
keeping is fundamentally limited by the accuracy of wallclock
synchronization between the source and destination hosts, the clock jump
experienced by the guest's TSC and its KVM clock should at least be
*consistent*. Even when the guest TSC suffers a discontinuity, its KVM
clock should still remain the *same* arithmetic function of the guest TSC,
and not suffer an *additional* discontinuity.

To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
save and restore the actual PV clock info in pvclock_vcpu_time_info.

The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
point in time just as kvm_update_masterclock() does, and calculating the
corresponding guest TSC value. This guest TSC value is then passed through
the user-provided pvclock structure to generate the *intended* KVM clock
value at that point in time, and through the *actual* KVM clock calculation.
Then kvm->arch.kvmclock_offset is adjusted to eliminate the difference.

Where kvm->arch.use_master_clock is false (because the host TSC is
unreliable, or the guest TSCs are configured strangely), the KVM clock
is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
returns an error. In this case, as documented, userspace shall use the
legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
case since the clocks are imprecise in this mode anyway.

On *restoration*, if kvm->arch.use_master_clock is false, an error is
returned for similar reasons and userspace shall fall back to using
KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
*both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
migration data (unless the intent is to refuse to resume on a host with bad
TSC).

(It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
non-masterclock mode, as that mode is necessarily imprecise anyway. The
explicit fallback allows userspace to deliberately fail migration to a host
with misbehaving TSC where master clock mode wouldn't be active.)
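
For illustration, a VMM might drive the new ioctls along these lines. This
is only a sketch: the helper names, fds and fallback policy are invented
here, not part of this series:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <asm/pvclock-abi.h>

/* Source host: try to capture the KVM clock as a function of the guest TSC. */
static bool save_clock_guest(int vcpu_fd, struct pvclock_vcpu_time_info *pvti)
{
	/* Fails (-EINVAL) when not in master clock mode. */
	return ioctl(vcpu_fd, KVM_GET_CLOCK_GUEST, pvti) == 0;
}

/*
 * Destination host: prefer the precise restore; fall back to the legacy
 * KVM_SET_CLOCK (a VM ioctl), or refuse to resume, if master clock mode
 * is not available here.
 */
static int restore_clock(int vm_fd, int vcpu_fd, bool have_pvti,
			 struct pvclock_vcpu_time_info *pvti,
			 struct kvm_clock_data *legacy)
{
	if (have_pvti && ioctl(vcpu_fd, KVM_SET_CLOCK_GUEST, pvti) == 0)
		return 0;

	return ioctl(vm_fd, KVM_SET_CLOCK, legacy);
}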

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Jack Allister <jalliste@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
CC: Dongli Zhang <dongli.zhang@oracle.com>
---
 Documentation/virt/kvm/api.rst |  37 ++++++++
 arch/x86/kvm/x86.c             | 156 +++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       |   3 +
 3 files changed, 196 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..758f6fc08fe5 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+4.143 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86_64
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks,
+in precisely the form advertised to the guest vCPU, which gives parameters
+for a direct conversion from a guest TSC value to nanoseconds.
+
+When the KVM clock is not in "master clock" mode, for example because the
+host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
+is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
+In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
+
+4.144 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86_64
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
+pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
+arithmetic relationship between guest TSC and KVM clock to be preserved by
+userspace across migration.
+
+When the KVM clock is not in "master clock" mode, and the KVM clock is actually
+defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
+may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
+choose to fail, denying migration to a host whose TSC is misbehaving.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23281c508c27..42abce7b4fc9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5868,6 +5868,154 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 	}
 }
 
+#ifdef CONFIG_X86_64
+static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+	struct pvclock_vcpu_time_info *hv_clock = &v->arch.hv_clock;
+
+	/*
+	 * If KVM_REQ_CLOCK_UPDATE is already pending, or if the hv_clock has
+	 * never been generated at all, call kvm_guest_time_update() to do so.
+	 * Might as well use the PVCLOCK_TSC_STABLE_BIT as the check for ever
+	 * having been written.
+	 */
+	if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) ||
+	    !(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT)) {
+		if (kvm_guest_time_update(v))
+			return -EINVAL;
+	}
+
+	/*
+	 * PVCLOCK_TSC_STABLE_BIT is set in use_master_clock mode where the
+	 * KVM clock is defined in terms of the guest TSC. Otherwise, it is
+	 * defined by the host CLOCK_MONOTONIC_RAW, and userspace should
+	 * use the legacy KVM_[GS]ET_CLOCK to migrate it.
+	 */
+	if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
+		return -EINVAL;
+
+	if (copy_to_user(argp, hv_clock, sizeof(*hv_clock)))
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * Reverse the calculation in the hv_clock definition.
+ *
+ * time_ns = ( (cycles << shift) * mul ) >> 32;
+ * (although shift can be negative, so that's bad C)
+ *
+ * So for a single second,
+ *  NSEC_PER_SEC = ( ( FREQ_HZ << shift) * mul ) >> 32
+ *  NSEC_PER_SEC << 32 = ( FREQ_HZ << shift ) * mul
+ *  ( NSEC_PER_SEC << 32 ) / mul = FREQ_HZ << shift
+ *  ( ( NSEC_PER_SEC << 32 ) / mul ) >> shift = FREQ_HZ
+ */
+static uint64_t hvclock_to_hz(uint32_t mul, int8_t shift)
+{
+	uint64_t tm = NSEC_PER_SEC << 32;
+
+	/* Maximise precision. Shift left until the top bit is set */
+	tm <<= 2;
+	shift += 2;
+
+	/* While 'mul' is even, increase the shift *after* the division */
+	while (!(mul & 1)) {
+		shift++;
+		mul >>= 1;
+	}
+
+	tm /= mul;
+
+	if (shift > 0)
+		return tm >> shift;
+	else
+		return tm << -shift;
+}
+
+static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+	struct pvclock_vcpu_time_info user_hv_clock;
+	struct kvm *kvm = v->kvm;
+	struct kvm_arch *ka = &kvm->arch;
+	uint64_t curr_tsc_hz, user_tsc_hz;
+	uint64_t user_clk_ns;
+	uint64_t guest_tsc;
+	int rc = 0;
+
+	if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
+		return -EFAULT;
+
+	if (!user_hv_clock.tsc_to_system_mul)
+		return -EINVAL;
+
+	user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
+				    user_hv_clock.tsc_shift);
+
+
+	kvm_hv_request_tsc_page_update(kvm);
+	kvm_start_pvclock_update(kvm);
+	pvclock_update_vm_gtod_copy(kvm);
+
+	/*
+	 * If not in use_master_clock mode, do not allow userspace to set
+	 * the clock in terms of the guest TSC. Userspace should either
+	 * fail the migration (to a host with suboptimal TSCs), or should
+	 * knowingly restore the KVM clock using KVM_SET_CLOCK instead.
+	 */
+	if (!ka->use_master_clock) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	curr_tsc_hz = get_cpu_tsc_khz() * 1000LL;
+	if (unlikely(curr_tsc_hz == 0)) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (kvm_caps.has_tsc_control)
+		curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
+					    v->arch.l1_tsc_scaling_ratio);
+
+	/*
+	 * The scaling factors in the hv_clock do not depend solely on the
+	 * TSC frequency *requested* by userspace. They actually use the
+	 * host TSC frequency that was measured/detected by the host kernel,
+	 * scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio.
+	 *
+	 * So a sanity check that they *precisely* match would have false
+	 * negatives. Allow for a discrepancy of 1 kHz either way.
+	 */
+	if (user_tsc_hz < curr_tsc_hz - 1000 ||
+	    user_tsc_hz > curr_tsc_hz + 1000) {
+		rc = -ERANGE;
+		goto out;
+	}
+
+	/*
+	 * The call to pvclock_update_vm_gtod_copy() has created a new time
+	 * reference point in ka->master_cycle_now and ka->master_kernel_ns.
+	 *
+	 * Calculate the guest TSC at that moment, and the corresponding KVM
+	 * clock value according to user_hv_clock. The value according to the
+	 * current hv_clock will of course be ka->master_kernel_ns since no
+	 * TSC cycles have elapsed.
+	 *
+	 * Adjust ka->kvmclock_offset to the delta, so that both definitions
+	 * of the clock give precisely the same reading at the reference time.
+	 */
+	guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
+	user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
+	ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
+
+out:
+	kvm_end_pvclock_update(kvm);
+	return rc;
+}
+#endif
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 			 unsigned int ioctl, unsigned long arg)
 {
@@ -6256,6 +6404,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		srcu_read_unlock(&vcpu->kvm->srcu, idx);
 		break;
 	}
+#ifdef CONFIG_X86_64
+	case KVM_SET_CLOCK_GUEST:
+		r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
+		break;
+	case KVM_GET_CLOCK_GUEST:
+		r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
+		break;
+#endif
 #ifdef CONFIG_KVM_HYPERV
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..0d306311e4d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
+
 #endif /* __LINUX_KVM_H */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (2 preceding siblings ...)
  2024-05-22  0:16 ` [RFC PATCH v3 03/21] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
@ 2024-05-22  0:16 ` David Woodhouse
  2024-05-24 13:14   ` Paul Durrant
  2024-08-13 18:07   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 05/21] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
                   ` (16 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:16 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: Jack Allister <jalliste@amazon.com>

KVM provides a new interface for performing a fixup/correction of the KVM
clock against the reference TSC. The KVM_[GS]ET_CLOCK_GUEST API requires a
pvclock_vcpu_time_info structure, so the caller must have access to its
definition.

Move the definition to the UAPI folder so that it is exported to usermode,
and change the type definitions to use the standard types for UAPI exports.

Signed-off-by: Jack Allister <jalliste@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/{ => uapi}/asm/pvclock-abi.h | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)
 rename arch/x86/include/{ => uapi}/asm/pvclock-abi.h (83%)

diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/uapi/asm/pvclock-abi.h
similarity index 83%
rename from arch/x86/include/asm/pvclock-abi.h
rename to arch/x86/include/uapi/asm/pvclock-abi.h
index 1436226efe3e..48cc53f62dd6 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/uapi/asm/pvclock-abi.h
@@ -1,4 +1,4 @@
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
 #ifndef _ASM_X86_PVCLOCK_ABI_H
 #define _ASM_X86_PVCLOCK_ABI_H
 #ifndef __ASSEMBLY__
@@ -24,20 +24,20 @@
  */
 
 struct pvclock_vcpu_time_info {
-	u32   version;
-	u32   pad0;
-	u64   tsc_timestamp;
-	u64   system_time;
-	u32   tsc_to_system_mul;
-	s8    tsc_shift;
-	u8    flags;
-	u8    pad[2];
+	__u32   version;
+	__u32   pad0;
+	__u64   tsc_timestamp;
+	__u64   system_time;
+	__u32   tsc_to_system_mul;
+	__s8    tsc_shift;
+	__u8    flags;
+	__u8    pad[2];
 } __attribute__((__packed__)); /* 32 bytes */
 
 struct pvclock_wall_clock {
-	u32   version;
-	u32   sec;
-	u32   nsec;
+	__u32   version;
+	__u32   sec;
+	__u32   nsec;
 } __attribute__((__packed__));
 
 #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 05/21] KVM: selftests: Add KVM/PV clock selftest to prove timer correction
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (3 preceding siblings ...)
  2024-05-22  0:16 ` [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-08-13 18:55   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 06/21] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: Jack Allister <jalliste@amazon.com>

A VM's KVM/PV clock has an inherent relationship to its guest TSC. When
either the host system live-updates or the VM is live-migrated, this pairing
of the two clock sources should stay the same.

In reality this is not the case without some correction taking place. Two
new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be used to
perform a correction on the PVTI (PV time information) structure held by
KVM, effectively fixing up the kvmclock_offset before the guest VM resumes
in either a live-update or live-migration scenario.

This test proves that without the necessary fixup there is a perceived
change in the guest TSC & KVM/PV clock relationship before and after a LU/
LM takes place.

The following steps are taken to verify that there is a delta in the
relationship and that it can be corrected:

1. PVTI is sampled by guest at boot (let's call this PVTI0).
2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
3. PVTI is sampled by guest after change (PVTI1).
4. Correction is requested by usermode to KVM using PVTI0.
5. PVTI is sampled by guest after correction (PVTI2).

The guest then records a single TSC reference point in time and uses it to
calculate 3 KVM clock values from the 3 PVTI structures recorded above. Let's
call each clock value CLK[0-2].

In a perfect world CLK[0-2] should all be the same value if the KVM clock
& TSC relationship is preserved across the LU/LM (or faked in this test),
however it is not.

A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
(and the inaccuracies associated with that). A delta of ~3500ns can be
observed if guest TSC scaling to half host TSC frequency is also enabled,
whereas without scaling the delta is ~180ns.

With the correction it should be possible to achieve a delta of ±1ns.

An option to enable guest TSC scaling is available by invoking the tester
with -s/--scale-tsc.

Example of the test output below:
* selftests: kvm: pvclock_test
* scaling tsc from 2999999KHz to 1499999KHz
* before=5038374946 uncorrected=5038371437 corrected=5038374945
* delta_uncorrected=3509 delta_corrected=1

Signed-off-by: Jack Allister <jalliste@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
CC: Dongli Zhang <dongli.zhang@oracle.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/pvclock_test.c       | 192 ++++++++++++++++++
 2 files changed, 193 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 871e2de3eb05..e33f56fedb0c 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
+TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
new file mode 100644
index 000000000000..376ffb730a53
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2024, Amazon.com, Inc. or its affiliates.
+ *
+ * Tests for pvclock API
+ * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
+ */
+#include <asm/pvclock.h>
+#include <asm/pvclock-abi.h>
+#include <sys/stat.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+enum {
+	STAGE_FIRST_BOOT,
+	STAGE_UNCORRECTED,
+	STAGE_CORRECTED
+};
+
+#define KVMCLOCK_GPA 0xc0000000ull
+#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
+
+static void trigger_pvti_update(vm_paddr_t pvti_pa)
+{
+	/*
+	 * We need a way to trigger KVM to update the fields
+	 * in the PV time info. The easiest way to do this is
+	 * to temporarily switch to the old KVM system time
+	 * method and then switch back to the new one.
+	 */
+	wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
+	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+}
+
+static void guest_code(vm_paddr_t pvti_pa)
+{
+	struct pvclock_vcpu_time_info *pvti_va =
+		(struct pvclock_vcpu_time_info *)pvti_pa;
+
+	struct pvclock_vcpu_time_info pvti_boot;
+	struct pvclock_vcpu_time_info pvti_uncorrected;
+	struct pvclock_vcpu_time_info pvti_corrected;
+	uint64_t cycles_boot;
+	uint64_t cycles_uncorrected;
+	uint64_t cycles_corrected;
+	uint64_t tsc_guest;
+
+	/*
+	 * Setup the KVMCLOCK in the guest & store the original
+	 * PV time structure that is used.
+	 */
+	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+	pvti_boot = *pvti_va;
+	GUEST_SYNC(STAGE_FIRST_BOOT);
+
+	/*
+	 * Trigger an update of the PVTI, if we calculate
+	 * the KVM clock using this structure we'll see
+	 * a delta from the TSC.
+	 */
+	trigger_pvti_update(pvti_pa);
+	pvti_uncorrected = *pvti_va;
+	GUEST_SYNC(STAGE_UNCORRECTED);
+
+	/*
+	 * The test should have triggered the correction by this
+	 * point in time. We have a copy of each of the PVTI structs
+	 * at each stage now.
+	 *
+	 * Let's sample the timestamp at a SINGLE point in time and
+	 * then calculate what the KVM clock would be using the PVTI
+	 * from each stage.
+	 *
+	 * Then return each of these values to the tester.
+	 */
+	pvti_corrected = *pvti_va;
+	tsc_guest = rdtsc();
+
+	cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
+	cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
+	cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
+
+	GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
+			cycles_corrected, 0);
+}
+
+static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+	struct pvclock_vcpu_time_info pvti_before;
+	uint64_t before, uncorrected, corrected;
+	int64_t delta_uncorrected, delta_corrected;
+	struct ucall uc;
+	uint64_t ucall_reason;
+
+	/* Loop through each stage of the test. */
+	while (true) {
+
+		/* Start/restart the running vCPU code. */
+		vcpu_run(vcpu);
+		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+		/* Retrieve and verify our stage. */
+		ucall_reason = get_ucall(vcpu, &uc);
+		TEST_ASSERT(ucall_reason == UCALL_SYNC,
+			    "Unhandled ucall reason=%lu",
+			    ucall_reason);
+
+		/* Run host specific code relating to stage. */
+		switch (uc.args[1]) {
+		case STAGE_FIRST_BOOT:
+			/* Store the KVM clock values before an update. */
+			vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti_before);
+
+			/* Sleep for a set amount of time to increase delta. */
+			sleep(5);
+			break;
+
+		case STAGE_UNCORRECTED:
+			/* Restore the KVM clock values. */
+			vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti_before);
+			break;
+
+		case STAGE_CORRECTED:
+			/* Query the clock information and verify delta. */
+			before = uc.args[2];
+			uncorrected = uc.args[3];
+			corrected = uc.args[4];
+
+			delta_uncorrected = before - uncorrected;
+			delta_corrected = before - corrected;
+
+			pr_info("before=%lu uncorrected=%lu corrected=%lu\n",
+				before, uncorrected, corrected);
+
+			pr_info("delta_uncorrected=%ld delta_corrected=%ld\n",
+				delta_uncorrected, delta_corrected);
+
+			TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1),
+				    "larger than expected delta detected = %ld", delta_corrected);
+			return;
+		}
+	}
+}
+
+static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+	unsigned int gpages;
+
+	gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    KVMCLOCK_GPA, 1, gpages, 0);
+	virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages);
+
+	vcpu_args_set(vcpu, 1, KVMCLOCK_GPA);
+}
+
+static void configure_scaled_tsc(struct kvm_vcpu *vcpu)
+{
+	uint64_t tsc_khz;
+
+	tsc_khz =  __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
+	pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2);
+	tsc_khz /= 2;
+	vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	bool scale_tsc;
+
+	scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) ||
+				 !strncmp(argv[1], "--scale-tsc", 12));
+
+	TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+	configure_pvclock(vm, vcpu);
+
+	if (scale_tsc)
+		configure_scaled_tsc(vcpu);
+
+	run_test(vm, vcpu);
+
+	return 0;
+}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 06/21] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (4 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 05/21] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-22  0:17 ` [RFC PATCH v3 07/21] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

KVM does make an attempt to cope with non-constant TSC, and has notifiers
to handle host TSC frequency changes. However, it *only* adjusts the KVM
clock, and doesn't adjust TSC frequency scaling when the host frequency
changes.

This is presumably because non-constant TSCs were fixed in hardware long
before TSC scaling was implemented, so there should never be real CPUs
which have TSC scaling but *not* CONSTANT_TSC.

Such a combination could potentially happen in some odd L1 nesting
environment, but it isn't worth trying to support it. Just make the
dependency explicit.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/svm/svm.c | 3 ++-
 arch/x86/kvm/vmx/vmx.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0f3b59da0d4a..4d3ec1c3231e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5202,7 +5202,8 @@ static __init int svm_hardware_setup(void)
 		kvm_enable_efer_bits(EFER_FFXSR);
 
 	if (tsc_scaling) {
-		if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+		if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR) ||
+		    !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
 			tsc_scaling = false;
 		} else {
 			pr_info("TSC scaling supported\n");
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6780313914f8..bee830adf744 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8428,7 +8428,7 @@ __init int vmx_hardware_setup(void)
 	if (!enable_apicv || !cpu_has_vmx_ipiv())
 		enable_ipiv = false;
 
-	if (cpu_has_vmx_tsc_scaling())
+	if (cpu_has_vmx_tsc_scaling() && boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
 		kvm_caps.has_tsc_control = true;
 
 	kvm_caps.max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 07/21] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (5 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 06/21] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-08-14  1:52   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 08/21] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
inadequate. It ignores TSC scaling, and ignores the fact that the host
TSC may differ from one host to the next (and in fact because of the way
the kernel calibrates it, it generally differs from one boot to the next
even on the same hardware).

Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
and attempt to document the process that userspace needs to follow to
preserve the TSC across migration.
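
As a sketch of the userspace side of that calculation (the helper name is
illustrative, and a compiler-provided 128-bit type stands in for whatever
fixed-point arithmetic the VMM prefers):

#include <stdint.h>
#include <asm/kvm.h>	/* struct kvm_vcpu_tsc_scale, added below */

/* guest_tsc = ((host_tsc * tsc_ratio) >> tsc_frac_bits) + offset */
static uint64_t host_to_guest_tsc(uint64_t host_tsc,
				  const struct kvm_vcpu_tsc_scale *scale,
				  int64_t offset)
{
	return (uint64_t)(((unsigned __int128)host_tsc * scale->tsc_ratio)
			  >> scale->tsc_frac_bits) + (uint64_t)offset;
}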

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
 arch/x86/include/uapi/asm/kvm.h         |   6 ++
 arch/x86/kvm/x86.c                      |  15 ++++
 3 files changed, 109 insertions(+), 27 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 31f14ec4a65b..2a054443118d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -216,52 +216,113 @@ Returns:
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
 TSC is then derived by the following equation:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (( host_tsc * tsc_ratio ) >> tsc_frac_bits ) + KVM_VCPU_TSC_OFFSET
+
+The values of tsc_ratio and tsc_frac_bits can be obtained using
+the KVM_VCPU_TSC_SCALE attribute.
 
 This attribute is useful to adjust the guest's TSC on live migration,
 so that the TSC counts the time during which the VM was paused. The
-following describes a possible algorithm to use for this purpose.
+following describes a possible algorithm to use for this purpose:
 
 From the source VMM process:
 
-1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (host_tsc_src),
    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
-   (host_src).
+   (time_src) at a given moment (Tsrc).
+
+2. For each vCPU[i]:
+
+   a. Read the KVM_VCPU_TSC_OFFSET attribute to record the guest TSC offset
+      (ofs_src[i]).
 
-2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
-   guest TSC offset (ofs_src[i]).
+   b. Read the KVM_VCPU_TSC_SCALE attribute to record the guest TSC scaling
+      ratio (ratio_src[i], frac_bits_src[i]).
 
-3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
-   guest's TSC (freq).
+   c. Use host_tsc_src and the scaling/offset factors to calculate this
+      vCPU's TSC at time Tsrc:
+      tsc_src[i] = (( host_tsc_src * ratio_src[i] ) >> frac_bits_src[i] ) + ofs_src[i]
+
+3. Invoke the KVM_GET_CLOCK_GUEST ioctl on the boot vCPU to return the KVM
+   clock as a function of the guest TSC (pvti_src).  (This ioctl will not
+   succeed if the host and guest TSCs are not consistent and well-behaved.)
 
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
-   kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
+4. Before creating the vCPUs, invoke the KVM_SET_TSC_KHZ ioctl on the VM, to
+   set the scaled frequency of the guest's TSC (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+   kvmclock (guest_src) and CLOCK_REALTIME (time_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
 
-   KVM will advance the VM's kvmclock to account for elapsed time since
-   recording the clock values.  Note that this will cause problems in
+   KVM will restore the VM's kvmclock, accounting for elapsed time since
+   the clock values were recorded.  Note that this will cause problems in
    the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized
    between the source and destination, and a reasonably short time passes
-   between the source pausing the VMs and the destination executing
-   steps 4-7.
+   between the source pausing the VMs and the destination resuming them.
+   Due to the KVM_[SG]ET_CLOCK API using CLOCK_REALTIME instead of
+   CLOCK_TAI, leap seconds during the migration may also introduce errors.
+
+6. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (host_tsc_dst) and
+   host CLOCK_REALTIME nanoseconds (time_dst) at a given moment (Tdst).
+
+7. Calculate the number of nanoseconds elapsed between Tsrc and Tdst:
+   ΔT = time_dst - time_src
+
+8. As each vCPU[i] is created:
+
+   a. Read the KVM_VCPU_TSC_SCALE attribute to record the guest TSC scaling
+      ratio (ratio_dst[i], frac_bits_dst[i]).
+
+   b. Calculate the intended guest TSC value at time Tdst:
+      tsc_dst[i] = tsc_src[i] + (ΔT * freq[i])
+
+   c. Use host_tsc_dst and the scaling factors to calculate this vCPU's
+      TSC at time Tdst without taking the offset into account:
+      raw_dst[i] = (( host_tsc_dst * ratio_dst[i] ) >> frac_bits_dst[i] )
+
+   d. Calculate ofs_dst[i] = tsc_dst[i] - raw_dst[i] and set the resulting
+      offset using the KVM_VCPU_TSC_OFFSET attribute.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
-   kvmclock nanoseconds (guest_dest).
+9. If pvti_src was provided, invoke the KVM_SET_CLOCK_GUEST ioctl on the boot
+   vCPU to restore the KVM clock as a precise function of the guest TSC. If
+   pvti_src was not provided by the source, or the ioctl fails on the
+   destination, the KVM clock is operating in its less precise mode where it
+   is defined in terms of the host CLOCK_MONOTONIC_RAW, so the value
+   previously set in step 5 is as accurate as it can be.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_SCALE
+
+:Parameters: fixed-point TSC scale factor (struct kvm_vcpu_tsc_scale)
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading the provided parameter
+		 address.
+	 -ENXIO  Attribute not supported
+	 -EINVAL Invalid request to write the attribute
+	 ======= ======================================
+
+This read-only attribute reports the guest's TSC scaling factor, in the form
+of a fixed-point number represented by the following structure:
+
+    struct kvm_vcpu_tsc_scale {
+	    __u64	tsc_ratio;
+	    __u64	tsc_frac_bits;
+    };
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
-   elapsed since recording state and (2) difference in TSCs between the
-   source and destination machine:
 
-   ofs_dst[i] = ofs_src[i] -
-     (guest_src - guest_dest) * freq +
-     (tsc_src - tsc_dest)
+The tsc_frac_bits field indicates the location of the fixed point, such that
+host TSC values are converted to guest TSC using the formula:
 
-   ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to
-   a time of 0 in kvmclock.  The above formula ensures that it is the
-   same on the destination as it was on the source).
+    guest_tsc = ( ( host_tsc * tsc_ratio ) >> tsc_frac_bits) + offset
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
-   respective value derived in the previous step.
+Userspace can use this to precisely calculate the guest TSC from the host
+TSC at any given moment. This is needed for accurate migration of guests,
+as described in the documentation for the KVM_VCPU_TSC_OFFSET attribute.
+In conjunction with the KVM_GET_CLOCK_GUEST ioctl, it also provides a way
+for userspace to quickly calculate the KVM clock for a guest, to use as a
+time reference for hypercalls or emulation of other timer devices.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 72ad5ace118d..fe7c98907818 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -864,6 +864,12 @@ struct kvm_hyperv_eventfd {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_SCALE  1 /* attribute for TSC scaling factor */
+
+struct kvm_vcpu_tsc_scale {
+	__u64 tsc_ratio;
+	__u64 tsc_frac_bits;
+};
 
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	_BITULL(0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42abce7b4fc9..00a7c1188dec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5715,6 +5715,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_SCALE:
 		r = 0;
 		break;
 	default:
@@ -5737,6 +5738,17 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_SCALE: {
+		struct kvm_vcpu_tsc_scale scale;
+
+		scale.tsc_ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		scale.tsc_frac_bits = kvm_caps.tsc_scaling_ratio_frac_bits;
+		r = -EFAULT;
+		if (copy_to_user(uaddr, &scale, sizeof(scale)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5777,6 +5789,9 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_SCALE:
+		r = -EINVAL; /* Read only */
+		break;
 	default:
 		r = -ENXIO;
 	}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 08/21] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (6 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 07/21] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-08-14  1:57   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

Commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") did so only for 64-bit hosts, by capturing the boot offset from
within the existing clocksource notifier update_pvclock_gtod().

That notifier was added in commit 16e8d74d2da9 ("KVM: x86: notifier for
clocksource changes") but only on x86_64, because its original purpose
was just to disable the "master clock" mode which is only supported on
x86_64.

Now that the notifier is used for more than disabling master clock mode
(well, OK, more than a decade later, but clocks are hard), enable it for
the 32-bit build too so that get_kvmclock_base_ns() can be unaffected by
NTP sync on 32-bit too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/x86.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00a7c1188dec..44b3d2a0da5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2245,7 +2245,6 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 	return kvm_set_msr_ignored_check(vcpu, index, *data, true);
 }
 
-#ifdef CONFIG_X86_64
 struct pvclock_clock {
 	int vclock_mode;
 	u64 cycle_last;
@@ -2303,13 +2302,6 @@ static s64 get_kvmclock_base_ns(void)
 	/* Count up from boot time, but with the frequency of the raw clock.  */
 	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
 }
-#else
-static s64 get_kvmclock_base_ns(void)
-{
-	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
-	return ktime_get_boottime_ns();
-}
-#endif
 
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs)
 {
@@ -9819,6 +9811,7 @@ static void pvclock_irq_work_fn(struct irq_work *w)
 }
 
 static DEFINE_IRQ_WORK(pvclock_irq_work, pvclock_irq_work_fn);
+#endif
 
 /*
  * Notification about pvclock gtod data update.
@@ -9826,26 +9819,26 @@ static DEFINE_IRQ_WORK(pvclock_irq_work, pvclock_irq_work_fn);
 static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 			       void *priv)
 {
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	struct timekeeper *tk = priv;
 
 	update_pvclock_gtod(tk);
 
+#ifdef CONFIG_X86_64
 	/*
 	 * Disable master clock if host does not trust, or does not use,
 	 * TSC based clocksource. Delegate queue_work() to irq_work as
 	 * this is invoked with tk_core.seq write held.
 	 */
-	if (!gtod_is_based_on_tsc(gtod->clock.vclock_mode) &&
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode) &&
 	    atomic_read(&kvm_guest_has_master_clock) != 0)
 		irq_work_queue(&pvclock_irq_work);
+#endif
 	return 0;
 }
 
 static struct notifier_block pvclock_gtod_notifier = {
 	.notifier_call = pvclock_gtod_notify,
 };
-#endif
 
 static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
 {
@@ -9984,9 +9977,10 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 
 	if (pi_inject_timer == -1)
 		pi_inject_timer = housekeeping_enabled(HK_TYPE_TIMER);
-#ifdef CONFIG_X86_64
+
 	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
 
+#ifdef CONFIG_X86_64
 	if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
 		set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
 #endif
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock()
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (7 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 08/21] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 13:20   ` Paul Durrant
  2024-08-14  2:58   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
                   ` (11 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

When in 'master clock mode' (i.e. when host and guest TSCs are behaving
sanely and in sync), the KVM clock is defined in terms of the guest TSC.

When TSC scaling is used, calculating the KVM clock directly from *host*
TSC cycles leads to a systemic drift from the values calculated by the
guest from its TSC.

Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
had a simple workaround for the specific case of Xen timers, as it had an
actual vCPU to hand and could use its scaling information. That commit
noted that it was broken for the general case of get_kvmclock_ns(), and
said "I'll come back to that".

Since __get_kvmclock() is invoked without a specific vCPU, it needs to
be able to find or generate the scaling values required to perform the
correct calculation.

Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
so it isn't as complex as it might have been.

In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
kvm->arch.last_tsc_scaling_ratio. That is only protected by the
tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
shift factors to convert to nanoseconds for the corresponding KVM clock,
just as kvm_guest_time_update() would.

In __get_kvmclock(), which runs within a seqcount retry loop, use those
values to convert host to guest TSC and then to nanoseconds. Only fall
back to using get_kvmclock_base_ns() when not in master clock mode.

There was previously a code path in __get_kvmclock() which looked like
it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
even on 32-bit hosts. In practice that could never happen as the
ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
hosts it would never be set when the system clock isn't TSC-based. So
that code path is now removed.

The kvm_get_wall_clock_epoch() function had the same problem; make it
just call get_kvmclock() and subtract kvmclock from wallclock, with
the same fallback as before.
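
For clarity, the master clock path in __get_kvmclock() now boils down to
the following (a sketch of the logic, with error handling elided):

	/* Host TSC delta since the master clock snapshot... */
	tsc_cycles = data->host_tsc - ka->master_cycle_now;

	/* ...converted to *guest* TSC cycles if scaling is in use... */
	if (kvm_caps.has_tsc_control)
		tsc_cycles = kvm_scale_tsc(tsc_cycles,
					   ka->master_tsc_scaling_ratio);

	/* ...and then to nanoseconds, using the same mul/shift that
	 * kvm_guest_time_update() gives to the guest's own pvclock. */
	data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
		      pvclock_scale_delta(tsc_cycles, ka->master_tsc_mul,
					  ka->master_tsc_shift);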

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |   4 +
 arch/x86/kvm/x86.c              | 151 ++++++++++++++++----------------
 2 files changed, 79 insertions(+), 76 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8440c4081727..b01c1d000fff 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1353,6 +1353,7 @@ struct kvm_arch {
 	u64 last_tsc_write;
 	u32 last_tsc_khz;
 	u64 last_tsc_offset;
+	u64 last_tsc_scaling_ratio;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
@@ -1364,6 +1365,9 @@ struct kvm_arch {
 
 	seqcount_raw_spinlock_t pvclock_sc;
 	bool use_master_clock;
+	s8 master_tsc_shift;
+	u32 master_tsc_mul;
+	u64 master_tsc_scaling_ratio;
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
 	struct delayed_work kvmclock_update_work;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 44b3d2a0da5b..89918ba266cd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2674,6 +2674,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm->arch.last_tsc_nsec = ns;
 	kvm->arch.last_tsc_write = tsc;
 	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
 	kvm->arch.last_tsc_offset = offset;
 
 	vcpu->arch.last_guest_tsc = tsc;
@@ -3009,6 +3010,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 {
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &kvm->arch;
+	uint64_t last_tsc_hz;
 	int vclock_mode;
 	bool host_tsc_clocksource, vcpus_matched;
 
@@ -3028,6 +3030,35 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 				&& !ka->backwards_tsc_observed
 				&& !ka->boot_vcpu_runs_old_kvmclock;
 
+	/*
+	 * When TSC scaling is in use (which can thankfully only happen
+	 * with X86_FEATURE_CONSTANT_TSC), the host must calculate the
+	 * KVM clock precisely as the guest would, by scaling through
+	 * the guest TSC frequency. Otherwise, differences in arithmetic
+	 * precision lead to systemic drift between the guest's and the
+	 * host's idea of the time.
+	 */
+	if (kvm_caps.has_tsc_control) {
+		/*
+		 * Copy from the field protected solely by ka->tsc_write_lock,
+		 * to the field protected by the ka->pvclock_sc seqlock.
+		 */
+		ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio ? :
+			kvm_caps.default_tsc_scaling_ratio;
+
+		/*
+		 * Calculate the scaling factors precisely the same way
+		 * that kvm_guest_time_update() does.
+		 */
+		last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000LL,
+					    ka->master_tsc_scaling_ratio);
+		kvm_get_time_scale(NSEC_PER_SEC, last_tsc_hz,
+				   &ka->master_tsc_shift, &ka->master_tsc_mul);
+	} else if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
+				   &ka->master_tsc_shift, &ka->master_tsc_mul);
+	}
+
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
 
@@ -3100,36 +3131,49 @@ static unsigned long get_cpu_tsc_khz(void)
 static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 {
 	struct kvm_arch *ka = &kvm->arch;
-	struct pvclock_vcpu_time_info hv_clock;
+
+#ifdef CONFIG_X86_64
+	uint64_t cur_tsc_khz = 0;
+	struct timespec64 ts;
 
 	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
 	get_cpu();
 
-	data->flags = 0;
 	if (ka->use_master_clock &&
-	    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
-#ifdef CONFIG_X86_64
-		struct timespec64 ts;
+	    (cur_tsc_khz = get_cpu_tsc_khz()) &&
+	    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
+		cur_tsc_khz = 0;
 
-		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
-			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
-			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
-		} else
-#endif
-		data->host_tsc = rdtsc();
-
-		data->flags |= KVM_CLOCK_TSC_STABLE;
-		hv_clock.tsc_timestamp = ka->master_cycle_now;
-		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
-		kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
-				   &hv_clock.tsc_shift,
-				   &hv_clock.tsc_to_system_mul);
-		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
-	} else {
-		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+	put_cpu();
+
+	if (cur_tsc_khz) {
+		uint64_t tsc_cycles;
+		uint32_t mul;
+		int8_t shift;
+
+		tsc_cycles = data->host_tsc - ka->master_cycle_now;
+
+		if (kvm_caps.has_tsc_control)
+			tsc_cycles = kvm_scale_tsc(tsc_cycles,
+						   ka->master_tsc_scaling_ratio);
+
+		if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+			mul = ka->master_tsc_mul;
+			shift = ka->master_tsc_shift;
+		} else {
+			kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
+					   &shift, &mul);
+		}
+		data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
+			pvclock_scale_delta(tsc_cycles, mul, shift);
+		data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
+		data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
+		return;
 	}
+#endif
 
-	put_cpu();
+	data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+	data->flags = 0;
 }
 
 static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
@@ -3330,68 +3374,23 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
  * that and kvmclock, but even that would be subject to change over
  * time.
  *
- * Attempt to calculate the epoch at a given moment using the *same*
- * TSC reading via kvm_get_walltime_and_clockread() to obtain both
- * wallclock and kvmclock times, and subtracting one from the other.
+ * Use get_kvmclock() to obtain a simultaneous reading of wallclock
+ * and kvmclock times from the *same* TSC reading, and subtract one
+ * from the other.
  *
  * Fall back to using their values at slightly different moments by
  * calling ktime_get_real_ns() and get_kvmclock_ns() separately.
  */
 uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 {
-#ifdef CONFIG_X86_64
-	struct pvclock_vcpu_time_info hv_clock;
-	struct kvm_arch *ka = &kvm->arch;
-	unsigned long seq, local_tsc_khz;
-	struct timespec64 ts;
-	uint64_t host_tsc;
-
-	do {
-		seq = read_seqcount_begin(&ka->pvclock_sc);
-
-		local_tsc_khz = 0;
-		if (!ka->use_master_clock)
-			break;
-
-		/*
-		 * The TSC read and the call to get_cpu_tsc_khz() must happen
-		 * on the same CPU.
-		 */
-		get_cpu();
-
-		local_tsc_khz = get_cpu_tsc_khz();
-
-		if (local_tsc_khz &&
-		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
-			local_tsc_khz = 0; /* Fall back to old method */
-
-		put_cpu();
-
-		/*
-		 * These values must be snapshotted within the seqcount loop.
-		 * After that, it's just mathematics which can happen on any
-		 * CPU at any time.
-		 */
-		hv_clock.tsc_timestamp = ka->master_cycle_now;
-		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+	struct kvm_clock_data data;
 
-	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
+	get_kvmclock(kvm, &data);
 
-	/*
-	 * If the conditions were right, and obtaining the wallclock+TSC was
-	 * successful, calculate the KVM clock at the corresponding time and
-	 * subtract one from the other to get the guest's epoch in nanoseconds
-	 * since 1970-01-01.
-	 */
-	if (local_tsc_khz) {
-		kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * NSEC_PER_USEC,
-				   &hv_clock.tsc_shift,
-				   &hv_clock.tsc_to_system_mul);
-		return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
-			__pvclock_read_cycles(&hv_clock, host_tsc);
-	}
-#endif
-	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
+	if (data.flags & KVM_CLOCK_REALTIME)
+		return data.realtime - data.clock;
+	else
+		return ktime_get_real_ns() - data.clock;
 }
 
 /*
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (8 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 13:26   ` Paul Durrant
  2024-08-14  4:57   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
                   ` (10 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

There was some confusion in kvm_update_guest_time() when software needs
to advance the guest TSC.

In master clock mode, there are two points of time which need to be taken
into account. First there is the master clock reference point, stored in
kvm->arch.master_kernel_ns (and associated host TSC ->master_cycle_now).
Secondly, there is the time *now*, at the point kvm_update_guest_time()
is being called.

With software TSC upscaling, the guest TSC is getting further and further
ahead of the host TSC as time elapses. So at time "now", the guest TSC
should be further ahead of the host than it was at master_kernel_ns.

The adjustment in kvm_update_guest_time() was not taking that into
account, and was only advancing the guest TSC by the appropriate amount
for master_kernel_ns, *not* the current time.

Fix it to calculate them both correctly.

Since the KVM clock reference point in master_kernel_ns might actually
be *earlier* than the reference point used for the guest TSC
(vcpu->arch.this_tsc_nsec), this might lead to a negative delta. Fix the
compute_guest_tsc() function to cope with negative numbers, which
then means there is no need to force a master clock update when the
guest TSC is written.
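
In other words, the catchup logic becomes (sketch):

	/* Guest TSC which *should* be seen at the pvclock reference point */
	tsc_timestamp = compute_guest_tsc(v, kernel_ns);

	/* Guest TSC which *should* be seen right now; in master clock mode
	 * this is later than the reference point above */
	now_guest_tsc = compute_guest_tsc(v, now_kernel_ns);

	/* Advance the guest TSC so that what it actually reads *now*
	 * matches what it should read now */
	adjustment = now_guest_tsc - kvm_read_l1_tsc(v, now_host_tsc);
	if (adjustment > 0)
		adjust_tsc_offset_guest(v, adjustment);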

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 73 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 56 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 89918ba266cd..e09dc44978ea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2491,10 +2491,19 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 
 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
-	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
-				      vcpu->arch.virtual_tsc_mult,
-				      vcpu->arch.virtual_tsc_shift);
-	tsc += vcpu->arch.this_tsc_write;
+	s64 delta = kernel_ns - vcpu->arch.this_tsc_nsec;
+	u64 tsc = vcpu->arch.this_tsc_write;
+
+	/* pvclock_scale_delta cannot cope with negative deltas */
+	if (delta >= 0)
+		tsc += pvclock_scale_delta(delta,
+					   vcpu->arch.virtual_tsc_mult,
+					   vcpu->arch.virtual_tsc_shift);
+	else
+		tsc -= pvclock_scale_delta(-delta,
+					   vcpu->arch.virtual_tsc_mult,
+					   vcpu->arch.virtual_tsc_shift);
+
 	return tsc;
 }
 
@@ -2505,7 +2514,7 @@ static inline bool gtod_is_based_on_tsc(int mode)
 }
 #endif
 
-static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
+static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &vcpu->kvm->arch;
@@ -2522,12 +2531,9 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 
 	/*
 	 * Request a masterclock update if the masterclock needs to be toggled
-	 * on/off, or when starting a new generation and the masterclock is
-	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
-	 * taken _after_ the new generation is created).
+	 * on/off.
 	 */
-	if ((ka->use_master_clock && new_generation) ||
-	    (ka->use_master_clock != use_master_clock))
+	if ((ka->use_master_clock != use_master_clock))
 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -2705,7 +2711,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
 	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
 
-	kvm_track_tsc_matching(vcpu, !matched);
+	kvm_track_tsc_matching(vcpu);
 }
 
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
@@ -3300,8 +3306,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		kernel_ns = get_kvmclock_base_ns();
 	}
 
-	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
-
 	/*
 	 * We may have to catch up the TSC to match elapsed wall clock
 	 * time for two reasons, even if kvmclock is used.
@@ -3313,11 +3317,46 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	 *	very slowly.
 	 */
 	if (vcpu->tsc_catchup) {
-		u64 tsc = compute_guest_tsc(v, kernel_ns);
-		if (tsc > tsc_timestamp) {
-			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
-			tsc_timestamp = tsc;
+		uint64_t now_host_tsc, now_guest_tsc;
+		int64_t adjustment;
+
+		/*
+		 * First, calculate what the guest TSC should be at the
+		 * time (kernel_ns) which will be placed in the hvclock.
+		 * This may be the *current* time, or it may be the time
+		 * of the master clock reference. This is 'tsc_timestamp'.
+		 */
+		tsc_timestamp = compute_guest_tsc(v, kernel_ns);
+
+		now_guest_tsc = tsc_timestamp;
+		now_host_tsc = host_tsc;
+
+#ifdef CONFIG_X86_64
+		/*
+		 * If the master clock was used, calculate what the guest
+		 * TSC should be *now* in order to advance to that.
+		 */
+		if (use_master_clock) {
+			int64_t now_kernel_ns;
+
+			if (!kvm_get_time_and_clockread(&now_kernel_ns,
+							&now_host_tsc)) {
+				now_kernel_ns = get_kvmclock_base_ns();
+				now_host_tsc = rdtsc();
+			}
+			now_guest_tsc = compute_guest_tsc(v, now_kernel_ns);
 		}
+#endif
+		/*
+		 * Calculate the delta between what the guest TSC *should* be,
+		 * and what it actually is according to kvm_read_l1_tsc().
+		 */
+		adjustment = now_guest_tsc - kvm_read_l1_tsc(v, now_host_tsc);
+
+		if (adjustment > 0)
+			adjust_tsc_offset_guest(v, adjustment);
+	} else {
+		tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
 	}
 
 	local_irq_restore(flags);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale()
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (9 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 13:53   ` Paul Durrant
  2024-08-15 15:46   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
                   ` (9 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

Commit 3ae13faac400 ("KVM: x86: pass kvm_get_time_scale arguments in hertz")
made this function take 64-bit values in Hz rather than 32-bit kHz, thus
making it entirely pointless to shadow its arguments into local 64-bit
variables. Just use scaled_hz and base_hz directly.

Also rename the 'tps32' variable to 'base32', having utterly failed to
think of any reason why it might have been called that in the first place.
This could probably have been eliminated too, but it helps to make the
code clearer and *might* just help a naïve 32-bit compiler realise that it
doesn't need to do full 64-bit shifts.

Having taken the time to reverse-engineer the function, add some comments
explaining it.
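
As a worked example: converting a 2.5GHz TSC to nanoseconds means
calling kvm_get_time_scale(NSEC_PER_SEC, 2500000000, &shift, &mul). The
first loop halves base_hz once (since 2500000000 > 2 * 1000000000),
giving shift = -1 and base32 = 1250000000. The second loop doesn't run
at all, because base32 is already greater than scaled_hz. So:

	*pshift      = -1
	*pmultiplier = (1000000000 << 32) / 1250000000 = 0xcccccccc

i.e. a fixed-point factor of 0.8 applied after shifting the tick count
right by one, for an overall factor of 0.4ns per cycle, which is indeed
1 / 2.5GHz.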

No functional change intended.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 60 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e09dc44978ea..ef3cd6113037 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2375,32 +2375,66 @@ static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
 	return dividend;
 }
 
+/*
+ * Calculate scaling factors to be applied with pvclock_scale_delta().
+ *
+ * The output of this function is a fixed-point factor which is used to
+ * scale a tick count at base_hz, to a tick count at scaled_hz, within
+ * the limitations of the Xen/KVM pvclock ABI.
+ *
+ * Mathematically, the factor is (*pmultiplier) >> (32 - *pshift).
+ *
+ * Working backwards, the div_frac() function divides (dividend << 32) by
+ * the given divisor, in other words giving dividend/divisor in the form
+ * of a 32-bit fixed-point fraction in the range 0 to 0x0.FFFFFFFF, which
+ * is (*pmultiplier >> 32).
+ *
+ * The rest of the function is shifting the scaled_hz and base_hz left or
+ * right as appropriate to ensure maximal precision within the constraints.
+ *
+ * The first constraint is that the result of the division *must* be less
+ * than 1, which means the dividend (derived from scaled_hz) must be greater
+ * than the divisor (derived from base_hz).
+ *
+ * The second constraint is that for optimal precision, the dividend (scaled)
+ * shouldn't be more than twice the divisor (base) — i.e. the top bit ought
+ * to be set in the resulting *pmultiplier.
+ */
 static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
 			       s8 *pshift, u32 *pmultiplier)
 {
-	uint64_t scaled64;
 	int32_t  shift = 0;
-	uint64_t tps64;
-	uint32_t tps32;
+	uint32_t base32;
 
-	tps64 = base_hz;
-	scaled64 = scaled_hz;
-	while (tps64 > scaled64*2 || tps64 & 0xffffffff00000000ULL) {
-		tps64 >>= 1;
+	/*
+	 * Start by shifting the base_hz right until it fits in 32 bits, and
+	 * is lower than double the target rate. This introduces a negative
+	 * shift value which would result in pvclock_scale_delta() shifting
+	 * the actual tick count right before performing the multiplication.
+	 */
+	while (base_hz > scaled_hz*2 || base_hz & 0xffffffff00000000ULL) {
+		base_hz >>= 1;
 		shift--;
 	}
 
-	tps32 = (uint32_t)tps64;
-	while (tps32 <= scaled64 || scaled64 & 0xffffffff00000000ULL) {
-		if (scaled64 & 0xffffffff00000000ULL || tps32 & 0x80000000)
-			scaled64 >>= 1;
+	/* Now the shifted base_hz fits in 32 bits, copy it to base32 */
+	base32 = (uint32_t)base_hz;
+
+	/*
+	 * Next, shift the scaled_hz right until it fits in 32 bits, and ensure
+	 * that the shifted base_hz is not larger (so that the result of the
+	 * final division also fits in 32 bits).
+	 */
+	while (base32 <= scaled_hz || scaled_hz & 0xffffffff00000000ULL) {
+		if (scaled_hz & 0xffffffff00000000ULL || base32 & 0x80000000)
+			scaled_hz >>= 1;
 		else
-			tps32 <<= 1;
+			base32 <<= 1;
 		shift++;
 	}
 
 	*pshift = shift;
-	*pmultiplier = div_frac(scaled64, tps32);
+	*pmultiplier = div_frac(scaled_hz, base32);
 }
 
 #ifdef CONFIG_X86_64
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset()
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (10 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 13:56   ` Paul Durrant
  2024-08-15 15:52   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
                   ` (8 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

Let the callers pass the host TSC value in as an explicit parameter.
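
For reference, the relationship being computed is simply (sketch):

	/* guest_tsc = scale(host_tsc) + offset, therefore: */
	offset = target_tsc - kvm_scale_tsc(host_tsc,
					    vcpu->arch.l1_tsc_scaling_ratio);

which is only meaningful when host_tsc and target_tsc refer to the same
moment in time.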

This leaves some fairly obviously stupid code, which uses this function
to compare the guest TSC at some *other* time with a newly-minted TSC
value from rdtsc(). Unless it's being used to measure *elapsed* time,
that isn't very sensible.

In this case, "obviously stupid" is an improvement over being non-obviously
so.

No functional change intended.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ef3cd6113037..ea59694d712a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2601,11 +2601,12 @@ u64 kvm_scale_tsc(u64 tsc, u64 ratio)
 	return _tsc;
 }
 
-static u64 kvm_compute_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
+static u64 kvm_compute_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 host_tsc,
+				     u64 target_tsc)
 {
 	u64 tsc;
 
-	tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio);
+	tsc = kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);
 
 	return target_tsc - tsc;
 }
@@ -2758,7 +2759,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
+	offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data);
 	ns = get_kvmclock_base_ns();
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
@@ -2809,7 +2810,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 		} else {
 			u64 delta = nsec_to_cycles(vcpu, elapsed);
 			data += delta;
-			offset = kvm_compute_l1_tsc_offset(vcpu, data);
+			offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data);
 		}
 		matched = true;
 	}
@@ -4024,7 +4025,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (msr_info->host_initiated) {
 			kvm_synchronize_tsc(vcpu, &data);
 		} else {
-			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
+			u64 adj = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data) -
+				vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
 			vcpu->arch.ia32_tsc_adjust_msr += adj;
 		}
@@ -5098,7 +5100,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 			mark_tsc_unstable("KVM discovered backwards TSC");
 
 		if (kvm_check_tsc_unstable()) {
-			u64 offset = kvm_compute_l1_tsc_offset(vcpu,
+			u64 offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(),
 						vcpu->arch.last_guest_tsc);
 			kvm_vcpu_write_tsc_offset(vcpu, offset);
 			vcpu->arch.tsc_catchup = 1;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc()
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (11 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:03   ` Paul Durrant
  2024-08-15 16:00   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields David Woodhouse
                   ` (7 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

When synchronizing to an existing TSC (either by explicitly writing zero,
or the legacy hack where the TSC is written within one second's worth of
the previously written TSC), the last_tsc_write and last_tsc_nsec values
were being misrecorded by __kvm_synchronize_tsc(). The *unsynchronized*
value of the TSC (perhaps even zero) was being recorded, along with the
current time at which kvm_synchronize_tsc() was called. This could cause
*subsequent* writes to fail to synchronize correctly.

Fix that by resetting {data, ns} to the previous values before passing
them to __kvm_synchronize_tsc() when synchronization is detected. Except
in the case where the TSC is unstable and *has* to be synthesised from
the host clock, in which case attempt to create a nsec/tsc pair which is
on the correct line.

Furthermore, there were *three* different TSC reads used for calculating
the "current" time, all slightly different from each other. Fix that by
using kvm_get_time_and_clockread() where possible and using the same
host_tsc value in all cases.
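
The invariant being restored is that the (ns, tsc) pair recorded by
__kvm_synchronize_tsc() always lies on the guest's existing TSC line:
either the previously recorded pair is reused verbatim, or, in the
unstable-TSC case, a new point on the same line is synthesised (sketch):

	elapsed = ns - kvm->arch.last_tsc_nsec;
	data    = kvm->arch.last_tsc_write + nsec_to_cycles(vcpu, elapsed);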

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ea59694d712a..6ec43f39bdb0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -201,6 +201,10 @@ module_param(eager_page_split, bool, 0644);
 static bool __read_mostly mitigate_smt_rsb;
 module_param(mitigate_smt_rsb, bool, 0444);
 
+#ifdef CONFIG_X86_64
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
+#endif
+
 /*
  * Restoring the host value for MSRs that are only consumed when running in
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
@@ -2753,14 +2757,22 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
 	u64 data = user_value ? *user_value : 0;
 	struct kvm *kvm = vcpu->kvm;
-	u64 offset, ns, elapsed;
+	u64 offset, host_tsc, ns, elapsed;
 	unsigned long flags;
 	bool matched = false;
 	bool synchronizing = false;
 
+#ifdef CONFIG_X86_64
+	if (!kvm_get_time_and_clockread(&ns, &host_tsc))
+#endif
+	{
+		ns = get_kvmclock_base_ns();
+		host_tsc = rdtsc();
+	}
+
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data);
-	ns = get_kvmclock_base_ns();
+
+	offset = kvm_compute_l1_tsc_offset(vcpu, host_tsc, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
@@ -2805,12 +2817,24 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
          */
 	if (synchronizing &&
 	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
+		/*
+		 * If synchronizing, the "last written" TSC value/time recorded
+		 * by __kvm_synchronize_tsc() should not change (i.e. should
+		 * be precisely the same as the existing generation)...
+		 */
+		data = kvm->arch.last_tsc_write;
+
 		if (!kvm_check_tsc_unstable()) {
 			offset = kvm->arch.cur_tsc_offset;
+			ns = kvm->arch.cur_tsc_nsec;
 		} else {
+			/*
+			 * ... unless the TSC is unstable and has to be
+			 * synthesised from the host clock in nanoseconds.
+			 */
 			u64 delta = nsec_to_cycles(vcpu, elapsed);
 			data += delta;
-			offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data);
+			offset = kvm_compute_l1_tsc_offset(vcpu, host_tsc, data);
 		}
 		matched = true;
 	}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (12 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:05   ` Paul Durrant
  2024-08-15 16:30   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
                   ` (6 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

These pointlessly duplicate the last_tsc_{nsec,offset,write} values.

The only place they were used was where the TSC is stable and a new vCPU
is being synchronized to the previous setting, in which case the 'last_'
values are definitely identical.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  3 ---
 arch/x86/kvm/x86.c              | 19 ++++++++-----------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b01c1d000fff..7d06f389a607 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1354,9 +1354,6 @@ struct kvm_arch {
 	u32 last_tsc_khz;
 	u64 last_tsc_offset;
 	u64 last_tsc_scaling_ratio;
-	u64 cur_tsc_nsec;
-	u64 cur_tsc_write;
-	u64 cur_tsc_offset;
 	u64 cur_tsc_generation;
 	int nr_vcpus_matched_tsc;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6ec43f39bdb0..ab5d55071253 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2713,11 +2713,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	lockdep_assert_held(&kvm->arch.tsc_write_lock);
 
 	/*
-	 * We also track th most recent recorded KHZ, write and time to
-	 * allow the matching interval to be extended at each write.
+	 * Track the last recorded kHz (and associated scaling ratio for
+	 * calculating the guest TSC), and offset.
 	 */
-	kvm->arch.last_tsc_nsec = ns;
-	kvm->arch.last_tsc_write = tsc;
 	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
 	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
 	kvm->arch.last_tsc_offset = offset;
@@ -2736,10 +2734,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 		 *
 		 * These values are tracked in kvm->arch.cur_xxx variables.
 		 */
+		kvm->arch.last_tsc_nsec = ns;
+		kvm->arch.last_tsc_write = tsc;
 		kvm->arch.cur_tsc_generation++;
-		kvm->arch.cur_tsc_nsec = ns;
-		kvm->arch.cur_tsc_write = tsc;
-		kvm->arch.cur_tsc_offset = offset;
 		kvm->arch.nr_vcpus_matched_tsc = 0;
 	} else if (vcpu->arch.this_tsc_generation != kvm->arch.cur_tsc_generation) {
 		kvm->arch.nr_vcpus_matched_tsc++;
@@ -2747,8 +2744,8 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 
 	/* Keep track of which generation this VCPU has synchronized to */
 	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
-	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
-	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
+	vcpu->arch.this_tsc_nsec = kvm->arch.last_tsc_nsec;
+	vcpu->arch.this_tsc_write = kvm->arch.last_tsc_write;
 
 	kvm_track_tsc_matching(vcpu);
 }
@@ -2825,8 +2822,8 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 		data = kvm->arch.last_tsc_write;
 
 		if (!kvm_check_tsc_unstable()) {
-			offset = kvm->arch.cur_tsc_offset;
-			ns = kvm->arch.cur_tsc_nsec;
+			offset = kvm->arch.last_tsc_offset;
+			ns = kvm->arch.last_tsc_nsec;
 		} else {
 			/*
 			 * ... unless the TSC is unstable and has to be
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (13 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:10   ` Paul Durrant
  2024-08-16  2:38   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
                   ` (5 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

There is no reason why the KVM clock cannot be in masterclock mode when
the TSCs are not in sync, as long as they are at the same *frequency*.

Running at a different frequency would lead to a systemic skew between
the clock(s) as observed by different vCPUs due to arithmetic precision
in the scaling. So that should indeed force the clock to be based on the
host's CLOCK_MONOTONIC_RAW instead of being in masterclock mode where it
is defined by the (or 'a') guest TSC.

But when the vCPUs merely have a different TSC *offset*, that's not a
problem. The offset is applied to that vCPU's kvmclock->tsc_timestamp
field, and it all comes out in the wash.
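
To see why a pure offset washes out, look at how the guest computes the
clock from its pvclock page (sketch):

	clock = system_time + scale(guest_tsc_now - tsc_timestamp);

Both guest_tsc_now and tsc_timestamp are in that vCPU's *guest* TSC
units, so a constant offset added to the vCPU's TSC shifts both terms
equally and cancels in the subtraction. A *frequency* difference changes
scale() itself and cannot cancel.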

So, remove ka->nr_vcpus_matched_tsc and replace it with a new field
ka->all_vcpus_matched_tsc which is not only changed to a boolean, but
also now tracks that the *frequency* matches, not the precise offset.

Using a *count* was always racy because a new vCPU could be in the
middle of being created *while* kvm_track_tsc_matching() was running and
comparing with kvm->online_vcpus. That variable is only atomic with
respect to itself.
In particular, kvm_arch_vcpu_create() runs before kvm->online_vcpus is
incremented for the new vCPU, and kvm_arch_vcpu_postcreate() runs later.

Repurpose kvm_track_tsc_matching() to be called from kvm_set_tsc_khz(),
and kill the cur_tsc_generation/last_tsc_generation fields which tracked
the precise TSC matching.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |   6 +-
 arch/x86/kvm/x86.c              | 130 +++++++++++++++++---------------
 2 files changed, 71 insertions(+), 65 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7d06f389a607..52dbb2d7690b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -906,7 +906,6 @@ struct kvm_vcpu_arch {
 	u64 tsc_offset_adjustment;
 	u64 this_tsc_nsec;
 	u64 this_tsc_write;
-	u64 this_tsc_generation;
 	bool tsc_catchup;
 	bool tsc_always_catchup;
 	s8 virtual_tsc_shift;
@@ -1354,11 +1353,10 @@ struct kvm_arch {
 	u32 last_tsc_khz;
 	u64 last_tsc_offset;
 	u64 last_tsc_scaling_ratio;
-	u64 cur_tsc_generation;
-	int nr_vcpus_matched_tsc;
+	bool all_vcpus_matched_tsc;
 
-	u32 default_tsc_khz;
 	bool user_set_tsc;
+	u32 default_tsc_khz;
 
 	seqcount_raw_spinlock_t pvclock_sc;
 	bool use_master_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab5d55071253..e21b8c075bf6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2493,40 +2493,6 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 	return 0;
 }
 
-static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
-{
-	u32 thresh_lo, thresh_hi;
-	int use_scaling = 0;
-
-	/* tsc_khz can be zero if TSC calibration fails */
-	if (user_tsc_khz == 0) {
-		/* set tsc_scaling_ratio to a safe value */
-		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_caps.default_tsc_scaling_ratio);
-		return -1;
-	}
-
-	/* Compute a scale to convert nanoseconds in TSC cycles */
-	kvm_get_time_scale(user_tsc_khz * 1000LL, NSEC_PER_SEC,
-			   &vcpu->arch.virtual_tsc_shift,
-			   &vcpu->arch.virtual_tsc_mult);
-	vcpu->arch.virtual_tsc_khz = user_tsc_khz;
-
-	/*
-	 * Compute the variation in TSC rate which is acceptable
-	 * within the range of tolerance and decide if the
-	 * rate being applied is within that bounds of the hardware
-	 * rate.  If so, no scaling or compensation need be done.
-	 */
-	thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
-	thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
-	if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
-		pr_debug("requested TSC rate %u falls outside tolerance [%u,%u]\n",
-			 user_tsc_khz, thresh_lo, thresh_hi);
-		use_scaling = 1;
-	}
-	return set_tsc_khz(vcpu, user_tsc_khz, use_scaling);
-}
-
 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
 	s64 delta = kernel_ns - vcpu->arch.this_tsc_nsec;
@@ -2557,14 +2523,34 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &vcpu->kvm->arch;
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	struct kvm_vcpu *v;
+	unsigned long i;
+	bool matched = true;
+
+	lockdep_assert_held(&vcpu->kvm->arch.tsc_write_lock);
+
+	/*
+	 * When called via kvm_arch_vcpu_create() for a new vCPU, the count
+	 * in kvm->online_vcpus will not yet have been incremented and the
+	 * kvm_for_each_vcpu() loop will not iterate over 'vcpu'.
+	 *
+	 * The right thing still happens in that case, because a match will
+	 * be found only if the new vCPU's TSC is at the same rate as *all*
+	 * the existing vCPUs' TSCs.
+	 */
+	kvm_for_each_vcpu(i, v, vcpu->kvm) {
+		if (v->arch.virtual_tsc_khz != vcpu->arch.virtual_tsc_khz) {
+			matched = false;
+			break;
+		}
+	}
+	ka->all_vcpus_matched_tsc = matched;
 
 	/*
 	 * To use the masterclock, the host clocksource must be based on TSC
-	 * and all vCPUs must have matching TSCs.  Note, the count for matching
-	 * vCPUs doesn't include the reference vCPU, hence "+1".
+	 * and all vCPUs must have matching TSC frequencies.
 	 */
-	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
-				 atomic_read(&vcpu->kvm->online_vcpus)) &&
+	bool use_master_clock = ka->all_vcpus_matched_tsc &&
 				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
 
 	/*
@@ -2574,12 +2560,51 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 	if ((ka->use_master_clock != use_master_clock))
 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
-	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
+	trace_kvm_track_tsc(vcpu->vcpu_id, ka->all_vcpus_matched_tsc,
 			    atomic_read(&vcpu->kvm->online_vcpus),
 		            ka->use_master_clock, gtod->clock.vclock_mode);
 #endif
 }
 
+static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+{
+	u32 thresh_lo, thresh_hi;
+	int use_scaling = 0;
+
+	/* tsc_khz can be zero if TSC calibration fails */
+	if (user_tsc_khz == 0) {
+		/* set tsc_scaling_ratio to a safe value */
+		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_caps.default_tsc_scaling_ratio);
+		return -1;
+	}
+
+	/* Compute a scale to convert nanoseconds in TSC cycles */
+	kvm_get_time_scale(user_tsc_khz * 1000LL, NSEC_PER_SEC,
+			   &vcpu->arch.virtual_tsc_shift,
+			   &vcpu->arch.virtual_tsc_mult);
+
+	raw_spin_lock_irq(&vcpu->kvm->arch.tsc_write_lock);
+	vcpu->arch.virtual_tsc_khz = user_tsc_khz;
+	kvm_track_tsc_matching(vcpu);
+	raw_spin_unlock_irq(&vcpu->kvm->arch.tsc_write_lock);
+
+	/*
+	 * Compute the variation in TSC rate which is acceptable
+	 * within the range of tolerance and decide if the
+	 * rate being applied is within that bounds of the hardware
+	 * rate.  If so, no scaling or compensation need be done.
+	 */
+	thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
+	thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
+	if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
+		pr_debug("requested TSC rate %u falls outside tolerance [%u,%u]\n",
+			 user_tsc_khz, thresh_lo, thresh_hi);
+		use_scaling = 1;
+	}
+	return set_tsc_khz(vcpu, user_tsc_khz, use_scaling);
+}
+
+
 /*
  * Multiply tsc by a fixed point number represented by ratio.
  *
@@ -2725,29 +2750,13 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_vcpu_write_tsc_offset(vcpu, offset);
 
 	if (!matched) {
-		/*
-		 * We split periods of matched TSC writes into generations.
-		 * For each generation, we track the original measured
-		 * nanosecond time, offset, and write, so if TSCs are in
-		 * sync, we can match exact offset, and if not, we can match
-		 * exact software computation in compute_guest_tsc()
-		 *
-		 * These values are tracked in kvm->arch.cur_xxx variables.
-		 */
 		kvm->arch.last_tsc_nsec = ns;
 		kvm->arch.last_tsc_write = tsc;
-		kvm->arch.cur_tsc_generation++;
-		kvm->arch.nr_vcpus_matched_tsc = 0;
-	} else if (vcpu->arch.this_tsc_generation != kvm->arch.cur_tsc_generation) {
-		kvm->arch.nr_vcpus_matched_tsc++;
 	}
 
-	/* Keep track of which generation this VCPU has synchronized to */
-	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
+	/* Keep track of the point this VCPU has synchronized to */
 	vcpu->arch.this_tsc_nsec = kvm->arch.last_tsc_nsec;
 	vcpu->arch.this_tsc_write = kvm->arch.last_tsc_write;
-
-	kvm_track_tsc_matching(vcpu);
 }
 
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
@@ -3074,11 +3083,9 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	struct kvm_arch *ka = &kvm->arch;
 	uint64_t last_tsc_hz;
 	int vclock_mode;
-	bool host_tsc_clocksource, vcpus_matched;
+	bool host_tsc_clocksource;
 
 	lockdep_assert_held(&kvm->arch.tsc_write_lock);
-	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
-			atomic_read(&kvm->online_vcpus));
 
 	/*
 	 * If the host uses TSC clock, then passthrough TSC as stable
@@ -3088,7 +3095,8 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
 
-	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
+	ka->use_master_clock = host_tsc_clocksource
+				&& ka->all_vcpus_matched_tsc
 				&& !ka->backwards_tsc_observed
 				&& !ka->boot_vcpu_runs_old_kvmclock;
 
@@ -3126,7 +3134,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
 	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
-					vcpus_matched);
+					ka->all_vcpus_matched_tsc);
 #endif
 }
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock()
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (14 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:13   ` Paul Durrant
  2024-08-15 17:12   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR David Woodhouse
                   ` (4 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

Both kvm_track_tsc_matching() and pvclock_update_vm_gtod_copy() make a
decision about whether the KVM clock should be in master clock mode.

They use *different* criteria for the decision though. This isn't really
a problem; it only has the potential to cause unnecessary invocations of
KVM_REQ_MASTERCLOCK_UPDATE if the masterclock was disabled due to TSC
going backwards, or the guest using the old MSR. But it isn't pretty.

Factor the decision out to a single function. And document the historical
reason why it's disabled for guests that use the old MSR_KVM_SYSTEM_TIME.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e21b8c075bf6..437412b36cae 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2518,6 +2518,27 @@ static inline bool gtod_is_based_on_tsc(int mode)
 }
 #endif
 
+static bool kvm_use_master_clock(struct kvm *kvm)
+{
+	struct kvm_arch *ka = &kvm->arch;
+
+	/*
+	 * The 'old kvmclock' check is a workaround (from 2015) for a
+	 * SUSE 2.6.16 kernel that didn't boot if the system_time in
+	 * its kvmclock was too far behind the current time. So the
+	 * mode of just setting the reference point and allowing time
+	 * to proceed linearly from there makes it fail to boot.
+	 * Despite that being kind of the *point* of the way the clock
+	 * is exposed to the guest. By coincidence, the offending
+	 * kernels used the old MSR_KVM_SYSTEM_TIME, which was moved
+	 * only because it resided in the wrong number range. So the
+	 * workaround is activated for *all* guests using the old MSR.
+	 */
+	return ka->all_vcpus_matched_tsc &&
+		!ka->backwards_tsc_observed &&
+		!ka->boot_vcpu_runs_old_kvmclock;
+}
+
 static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
@@ -2550,7 +2571,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 	 * To use the masterclock, the host clocksource must be based on TSC
 	 * and all vCPUs must have matching TSC frequencies.
 	 */
-	bool use_master_clock = ka->all_vcpus_matched_tsc &&
+	bool use_master_clock = kvm_use_master_clock(vcpu->kvm) &&
 				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
 
 	/*
@@ -3096,9 +3117,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					&ka->master_cycle_now);
 
 	ka->use_master_clock = host_tsc_clocksource
-				&& ka->all_vcpus_matched_tsc
-				&& !ka->backwards_tsc_observed
-				&& !ka->boot_vcpu_runs_old_kvmclock;
+				&& kvm_use_master_clock(kvm);
 
 	/*
 	 * When TSC scaling is in use (which can thankfully only happen
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (15 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:14   ` Paul Durrant
  2024-08-16  4:28   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load() David Woodhouse
                   ` (3 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

Commit 0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
introduced a KVM_REQ_GLOBAL_CLOCK_UPDATE when one vCPU set up its clock.

This was a workaround for the ever-drifting clocks which were based on the
host's CLOCK_MONOTONIC and thus subject to NTP skew. On booting or resuming
a guest, it just leads to running kvm_guest_time_update() twice for each
vCPU for no good reason.

Just use KVM_REQ_CLOCK_UPDATE on the vCPU itself, and only in the case
where the KVM clock is being set up, not turned off.
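
For reference, the guest side of this MSR looks like the following
(sketch, using the new-style MSR; the old MSR_KVM_SYSTEM_TIME behaves
identically):

	/* Enable: GPA of the pvclock_vcpu_time_info, with bit 0 set */
	wrmsrl(MSR_KVM_SYSTEM_TIME_NEW, slow_virt_to_phys(&hv_clock) | 1);

	/* Disable: clear the enable bit */
	wrmsrl(MSR_KVM_SYSTEM_TIME_NEW, 0);

so the KVM_REQ_CLOCK_UPDATE is needed only in the first case, where the
pvclock area actually gets populated.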

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 437412b36cae..32a873d5ed00 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2361,13 +2361,13 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
 	}
 
 	vcpu->arch.time = system_time;
-	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
 
 	/* we verify if the enable bit is set... */
-	if (system_time & 1)
+	if (system_time & 1) {
 		kvm_gpc_activate(&vcpu->arch.pv_time, system_time & ~1ULL,
 				 sizeof(struct pvclock_vcpu_time_info));
-	else
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+	} else
 		kvm_gpc_deactivate(&vcpu->arch.pv_time);
 
 	return;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load()
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (16 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:16   ` Paul Durrant
  2024-08-15 17:31   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode David Woodhouse
                   ` (2 subsequent siblings)
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

Commit d98d07ca7e034 ("KVM: x86: update pvclock area conditionally, on
cpu migration") turned an unconditional KVM_REQ_CLOCK_UPDATE into a
conditional one, made only if either the master clock isn't enabled *or* the vCPU
was not previously scheduled (vcpu->cpu == -1). The commit message doesn't
explain the latter condition, which is specifically for the master clock
case.

Commit 0061d53daf26f ("KVM: x86: limit difference between kvmclock
updates") later turned that into a KVM_REQ_GLOBAL_CLOCK_UPDATE to avoid
skew between vCPUs.

In master clock mode there is no need for any of that, regardless of
whether/where this vCPU was previously scheduled.

Do it only if (!kvm->arch.use_master_clock).

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 32a873d5ed00..dd53860ca284 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5161,7 +5161,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		 * On a host with synchronized TSC, there is no need to update
 		 * kvmclock on vcpu->cpu migration
 		 */
-		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
+		if (!vcpu->kvm->arch.use_master_clock)
 			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (17 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load() David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:18   ` Paul Durrant
  2024-08-16  4:33   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
  2024-05-22  0:17 ` [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative David Woodhouse
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

When the KVM clock is in master clock mode, updating the KVM clock is
pointless. Let the periodic work 'expire', and start it running again
from kvm_end_pvclock_update() if the master clock mode is ever turned
off again.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dd53860ca284..10b82f1b110d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -158,6 +158,8 @@ module_param(min_timer_period_us, uint, 0644);
 static bool __read_mostly kvmclock_periodic_sync = true;
 module_param(kvmclock_periodic_sync, bool, 0444);
 
+#define KVMCLOCK_SYNC_PERIOD (300 * HZ)
+
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, 0644);
@@ -3187,6 +3189,10 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
+	if (kvmclock_periodic_sync && !kvm->arch.use_master_clock)
+		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
+				      KVMCLOCK_SYNC_PERIOD);
+
 	/* guest entries allowed */
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
@@ -3555,8 +3561,6 @@ static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
 					KVMCLOCK_UPDATE_DELAY);
 }
 
-#define KVMCLOCK_SYNC_PERIOD (300 * HZ)
-
 static void kvmclock_sync_fn(struct work_struct *work)
 {
 	struct delayed_work *dwork = to_delayed_work(work);
@@ -3564,6 +3568,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
 					   kvmclock_sync_work);
 	struct kvm *kvm = container_of(ka, struct kvm, arch);
 
+	if (kvm->arch.use_master_clock)
+		return;
+
 	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
 	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
 					KVMCLOCK_SYNC_PERIOD);
@@ -12551,7 +12558,8 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 	mutex_unlock(&vcpu->mutex);
 
-	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
+	if (kvmclock_periodic_sync && !kvm->arch.use_master_clock &&
+	    vcpu->vcpu_idx == 0)
 		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
 						KVMCLOCK_SYNC_PERIOD);
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (18 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:21   ` Paul Durrant
  2024-08-16  4:39   ` Sean Christopherson
  2024-05-22  0:17 ` [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative David Woodhouse
  20 siblings, 2 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

When kvm_xen_update_runstate() is invoked to set a vCPU's runstate, the
time spent in the previous runstate is accounted. This is based on the
delta between the current KVM clock time, and the previous value stored
in vcpu->arch.xen.runstate_entry_time.

If the KVM clock goes backwards, that delta will be negative. Or, since
it's an unsigned 64-bit integer, very *large*. Linux guests deal with
that particularly badly, reporting 100% steal time for ever more (well,
for *centuries* at least, until the delta has been consumed).
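
A minimal standalone sketch of that wraparound, with made-up numbers
rather than KVM's actual state:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t entry_time = 1000000ULL;  /* runstate_entry_time */
		uint64_t now = 999000ULL;          /* KVM clock went backwards */
		uint64_t delta = now - entry_time; /* wraps to ~1.8e19 */

		/* Prints 18446744073709550616, i.e. ~584 years of nanoseconds */
		printf("delta = %llu ns\n", (unsigned long long)delta);
		return 0;
	}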

So when a negative delta is detected, just refrain from updating the
runstates until the KVM clock catches up with runstate_entry_time again.

The userspace APIs for setting the runstate times do not allow them to
be set past the current KVM clock, but userspace can still adjust the
KVM clock *after* setting the runstate times, which would cause this
situation to occur.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/xen.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 014048c22652..3d4111de4472 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -538,24 +538,34 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
 {
 	struct kvm_vcpu_xen *vx = &v->arch.xen;
 	u64 now = get_kvmclock_ns(v->kvm);
-	u64 delta_ns = now - vx->runstate_entry_time;
 	u64 run_delay = current->sched_info.run_delay;
+	s64 delta_ns = now - vx->runstate_entry_time;
+	s64 steal_ns = run_delay - vx->last_steal;
 
 	if (unlikely(!vx->runstate_entry_time))
 		vx->current_runstate = RUNSTATE_offline;
 
+	vx->last_steal = run_delay;
+
+	/*
+	 * If KVM clock time went backwards, stop updating until it
+	 * catches up (or the runstates are reset by userspace).
+	 */
+	if (delta_ns < 0)
+		return;
+
 	/*
 	 * Time waiting for the scheduler isn't "stolen" if the
 	 * vCPU wasn't running anyway.
 	 */
-	if (vx->current_runstate == RUNSTATE_running) {
-		u64 steal_ns = run_delay - vx->last_steal;
+	if (vx->current_runstate == RUNSTATE_running && steal_ns > 0) {
+		if (steal_ns > delta_ns)
+			steal_ns = delta_ns;
 
 		delta_ns -= steal_ns;
 
 		vx->runstate_times[RUNSTATE_runnable] += steal_ns;
 	}
-	vx->last_steal = run_delay;
 
 	vx->runstate_times[vx->current_runstate] += delta_ns;
 	vx->current_runstate = state;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative
  2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
                   ` (19 preceding siblings ...)
  2024-05-22  0:17 ` [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
@ 2024-05-22  0:17 ` David Woodhouse
  2024-05-24 14:25   ` Paul Durrant
                     ` (2 more replies)
  20 siblings, 3 replies; 58+ messages in thread
From: David Woodhouse @ 2024-05-22  0:17 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

From: David Woodhouse <dwmw@amazon.co.uk>

In steal_account_process_time(), a delta is calculated between the value
returned by paravirt_steal_clock(), and this_rq()->prev_steal_time which
is assumed to be the *previous* value returned by paravirt_steal_clock().

However, instead of just assigning the newly-read value directly into
->prev_steal_time for use in the next iteration, ->prev_steal_time is
*incremented* by the calculated delta.

This used to be roughly the same, modulo conversion to jiffies and back,
until commit 807e5b80687c0 ("sched/cputime: Add steal time support to
full dynticks CPU time accounting") started clamping that delta to a
maximum of the actual time elapsed. So now, if the value returned by
paravirt_steal_clock() jumps by a large amount, instead of a *single*
period of reporting 100% steal time, the system will report 100% steal
time for as long as it takes to "catch up" with the reported value.
Which is up to 584 years.
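
(As a quick sanity check of that figure, it is simply the u64 range
expressed in nanoseconds:

	2^64 ns = 18,446,744,073,709,551,616 ns
	        ≈ 1.84e10 seconds
	        ≈ 584.5 years

so a wild value from paravirt_steal_clock() can leave the clamped
accounting "catching up" for centuries.)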

But there is a benefit to advancing ->prev_steal_time only by the time
which was *accounted* as having been stolen. It means that any extra
time truncated by the clamping will be accounted in the next sample
period rather than lost. Given the stochastic nature of the sampling,
that is more accurate overall.

So, continue to advance ->prev_steal_time by the accounted value as
long as the delta isn't egregiously large (for which, use maxtime * 2).
If the delta is more than that, just set ->prev_steal_time directly to
the value returned by paravirt_steal_clock().

Fixes: 807e5b80687c0 ("sched/cputime: Add steal time support to full dynticks CPU time accounting")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 kernel/sched/cputime.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..3a8a8b38966d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -254,13 +254,21 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
-		u64 steal;
-
-		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
-		steal = min(steal, maxtime);
+		u64 steal, abs_steal;
+
+		abs_steal = paravirt_steal_clock(smp_processor_id());
+		steal = abs_steal - this_rq()->prev_steal_time;
+		if (unlikely(steal > maxtime)) {
+			/*
+			 * If the delta isn't egregious, it can be counted
+			 * in the next time period. Only advance by maxtime.
+			 */
+			if (steal < maxtime * 2)
+				abs_steal = this_rq()->prev_steal_time + maxtime;
+			steal = maxtime;
+		}
 		account_steal_time(steal);
-		this_rq()->prev_steal_time += steal;
+		this_rq()->prev_steal_time = abs_steal;
 
 		return steal;
 	}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms
  2024-05-22  0:16 ` [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
@ 2024-05-24 13:14   ` Paul Durrant
  2024-08-13 18:07   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 13:14 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:16, David Woodhouse wrote:
> From: Jack Allister <jalliste@amazon.com>
> 
> KVM provides a new interface for performing a fixup/correction of the KVM
> clock against the reference TSC. The KVM_[GS]ET_CLOCK_GUEST API requires a
> pvclock_vcpu_time_info, as such the caller must know about this definition.
> 
> Move the definition to the UAPI folder so that it is exported to usermode
> and also change the type definitions to use the standard for UAPI exports.
> 
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/include/{ => uapi}/asm/pvclock-abi.h | 24 +++++++++----------
>   1 file changed, 12 insertions(+), 12 deletions(-)
>   rename arch/x86/include/{ => uapi}/asm/pvclock-abi.h (83%)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock()
  2024-05-22  0:17 ` [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
@ 2024-05-24 13:20   ` Paul Durrant
  2024-08-14  2:58   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 13:20 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
> 
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
> 
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
> 
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
> 
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
> 
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
> 
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
> 
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.
> 
> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/include/asm/kvm_host.h |   4 +
>   arch/x86/kvm/x86.c              | 151 ++++++++++++++++----------------
>   2 files changed, 79 insertions(+), 76 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
  2024-05-22  0:17 ` [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
@ 2024-05-24 13:26   ` Paul Durrant
  2024-08-14  4:57   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 13:26 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> There was some confusion in kvm_update_guest_time() when software needs
> to advance the guest TSC.
> 
> In master clock mode, there are two points of time which need to be taken
> into account. First there is the master clock reference point, stored in
> kvm->arch.master_kernel_ns (and associated host TSC ->master_cycle_now).
> Secondly, there is the time *now*, at the point kvm_update_guest_time()
> is being called.
> 
> With software TSC upscaling, the guest TSC is getting further and further
> ahead of the host TSC as time elapses. So at time "now", the guest TSC
> should be further ahead of the host, than it was at master_kernel_ns.
> 
> The adjustment in kvm_update_guest_time() was not taking that into
> account, and was only advancing the guest TSC by the appropriate amount
> for master_kernel_ns, *not* the current time.
> 
> Fix it to calculate them both correctly.
> 
> Since the KVM clock reference point in master_kernel_ns might actually
> be *earlier* than the reference point used for the guest TSC
> (vcpu->last_tsc_nsec), this might lead to a negative delta. Fix the
> compute_guest_tsc() function to cope with negative numbers, which
> then means there is no need to force a master clock update when the
> guest TSC is written.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 73 +++++++++++++++++++++++++++++++++++-----------
>   1 file changed, 56 insertions(+), 17 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale()
  2024-05-22  0:17 ` [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
@ 2024-05-24 13:53   ` Paul Durrant
  2024-08-15 15:46   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 13:53 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit 3ae13faac400 ("KVM: x86: pass kvm_get_time_scale arguments in hertz")
> made this function take 64-bit values in Hz rather than 32-bit kHz. Thus
> making it entirely pointless to shadow its arguments into local 64-bit
> variables. Just use scaled_hz and base_hz directly.
> 
> Also rename the 'tps32' variable to 'base32', having utterly failed to
> think of any reason why it might have been called that in the first place.
> This could probably have been eliminated too, but it helps to make the
> code clearer and *might* just help a naïve 32-bit compiler realise that it
> doesn't need to do full 64-bit shifts.
> 
> Having taken the time to reverse-engineer the function, add some comments
> explaining it.
> 
> No functional change intended.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 60 ++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 47 insertions(+), 13 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset()
  2024-05-22  0:17 ` [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
@ 2024-05-24 13:56   ` Paul Durrant
  2024-08-15 15:52   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 13:56 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Let the callers pass the host TSC value in as an explicit parameter.
> 
> This leaves some fairly obviously stupid code, which uses this function
> to compare the guest TSC at some *other* time with the newly-minted TSC
> value from rdtsc(). Unless it's being used to measure *elapsed* time,
> that isn't very sensible.
> 
> In this case, "obviously stupid" is an improvement over being non-obviously
> so.
> 
> No functional change intended.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 14 ++++++++------
>   1 file changed, 8 insertions(+), 6 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc()
  2024-05-22  0:17 ` [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
@ 2024-05-24 14:03   ` Paul Durrant
  2024-08-15 16:00   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:03 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When synchronizing to an existing TSC (either by explicitly writing zero,
> or the legacy hack where the TSC is written within one second's worth of
> the previously written TSC), the last_tsc_write and last_tsc_nsec values
> were being misrecorded by __kvm_synchronize_tsc(). The *unsynchronized*
> value of the TSC (perhaps even zero) was being recorded, along with the
> current time at which kvm_synchronize_tsc() was called. This could cause
> *subsequent* writes to fail to synchronize correctly.
> 
> Fix that by resetting {data, ns} to the previous values before passing
> them to __kvm_synchronize_tsc() when synchronization is detected. Except
> in the case where the TSC is unstable and *has* to be synthesised from
> the host clock, in which case attempt to create a nsec/tsc pair which is
> on the correct line.
> 
> Furthermore, there were *three* different TSC reads used for calculating
> the "current" time, all slightly different from each other. Fix that by
> using kvm_get_time_and_clockread() where possible and using the same
> host_tsc value in all cases.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++----
>   1 file changed, 28 insertions(+), 4 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields
  2024-05-22  0:17 ` [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields David Woodhouse
@ 2024-05-24 14:05   ` Paul Durrant
  2024-08-15 16:30   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:05 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> These pointlessly duplicate the last_tsc_{nsec,offset,write} values.
> 
> The only place they were used was where the TSC is stable and a new vCPU
> is being synchronized to the previous setting, in which case the 'last_'
> value is definitely identical.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/include/asm/kvm_host.h |  3 ---
>   arch/x86/kvm/x86.c              | 19 ++++++++-----------
>   2 files changed, 8 insertions(+), 14 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
  2024-05-22  0:17 ` [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
@ 2024-05-24 14:10   ` Paul Durrant
  2024-08-16  2:38   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:10 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> There is no reason why the KVM clock cannot be in masterclock mode when
> the TSCs are not in sync, as long as they are at the same *frequency*.
> 
> Running at a different frequency would lead to a systemic skew between
> the clock(s) as observed by different vCPUs due to arithmetic precision
> in the scaling. So that should indeed force the clock to be based on the
> host's CLOCK_MONOTONIC_RAW instead of being in masterclock mode where it
> is defined by the (or 'a') guest TSC.
> 
> But when the vCPUs merely have a different TSC *offset*, that's not a
> problem. The offset is applied to that vCPU's kvmclock->tsc_timestamp
> field, and it all comes out in the wash.
> 
> So, remove ka->nr_vcpus_matched_tsc and replace it with a new field
> ka->all_vcpus_matched_tsc which is not only changed to a boolean, but
> also now tracks that the *frequency* matches, not the precise offset.
> 
> Using a *count* was always racy because a new vCPU could be being
> created *while* kvm_track_tsc_matching() was running and comparing with
> kvm->online_vcpus. That variable is only atomic with respect to itself.
> In particular, kvm_arch_vcpu_create() runs before kvm->online_vcpus is
> incremented for the new vCPU, and kvm_arch_vcpu_postcreate() runs later.
> 
> Repurpose kvm_track_tsc_matching() to be called from kvm_set_tsc_khz(),
> and kill the cur_tsc_generation/last_tsc_generation fields which tracked
> the precise TSC matching.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/include/asm/kvm_host.h |   6 +-
>   arch/x86/kvm/x86.c              | 130 +++++++++++++++++---------------
>   2 files changed, 71 insertions(+), 65 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock()
  2024-05-22  0:17 ` [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
@ 2024-05-24 14:13   ` Paul Durrant
  2024-08-15 17:12   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:13 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Both kvm_track_tsc_matching() and pvclock_update_vm_gtod_copy() make a
> decision about whether the KVM clock should be in master clock mode.
> 
> They use *different* criteria for the decision though. This isn't really
> a problem; it only has the potential to cause unnecessary invocations of
> KVM_REQ_MASTERCLOCK_UPDATE if the masterclock was disabled due to TSC
> going backwards, or the guest using the old MSR. But it isn't pretty.
> 
> Factor the decision out to a single function. And document the historical
> reason why it's disabled for guests that use the old MSR_KVM_SYSTEM_TIME.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++----
>   1 file changed, 23 insertions(+), 4 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR
  2024-05-22  0:17 ` [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR David Woodhouse
@ 2024-05-24 14:14   ` Paul Durrant
  2024-08-16  4:28   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:14 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit 0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
> introduced a KVM_REQ_GLOBAL_CLOCK_UPDATE when one vCPU set up its clock.
> 
> This was a workaround for the ever-drifting clocks which were based on the
> host's CLOCK_MONOTONIC and thus subject to NTP skew. On booting or resuming
> a guest, it just leads to running kvm_guest_time_update() twice for each
> vCPU for now good reason.
> 
> Just use KVM_REQ_CLOCK_UPDATE on the vCPU itself, and only in the case
> where the KVM clock is being set up, not turned off.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load()
  2024-05-22  0:17 ` [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load() David Woodhouse
@ 2024-05-24 14:16   ` Paul Durrant
  2024-08-15 17:31   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:16 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit d98d07ca7e034 ("KVM: x86: update pvclock area conditionally, on
> cpu migration") turned an unconditional KVM_REQ_CLOCK_UPDATE into a
> conditional one, if either the master clock isn't enabled *or* the vCPU
> was not previously scheduled (vcpu->cpu == -1). The commit message doesn't
> explain the latter condition, which is specifically for the master clock
> case.
> 
> Commit 0061d53daf26f ("KVM: x86: limit difference between kvmclock
> updates") later turned that into a KVM_REQ_GLOBAL_CLOCK_UPDATE to avoid
> skew between vCPUs.
> 
> In master clock mode there is no need for any of that, regardless of
> whether/where this vCPU was previously scheduled.
> 
> Do it only if (!kvm->arch.use_master_clock).
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode
  2024-05-22  0:17 ` [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode David Woodhouse
@ 2024-05-24 14:18   ` Paul Durrant
  2024-08-16  4:33   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:18 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When the KVM clock is in master clock mode, updating the KVM clock is
> pointless. Let the periodic work 'expire', and start it running again
> from kvm_end_pvclock_update() if the master clock mode is ever turned
> off again.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 14 +++++++++++---
>   1 file changed, 11 insertions(+), 3 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative
  2024-05-22  0:17 ` [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
@ 2024-05-24 14:21   ` Paul Durrant
  2024-08-16  4:39   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:21 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When kvm_xen_update_runstate() is invoked to set a vCPU's runstate, the
> time spent in the previous runstate is accounted. This is based on the
> delta between the current KVM clock time, and the previous value stored
> in vcpu->arch.xen.runstate_entry_time.
> 
> If the KVM clock goes backwards, that delta will be negative. Or, since
> it's an unsigned 64-bit integer, very *large*. Linux guests deal with
> that particularly badly, reporting 100% steal time for ever more (well,
> for *centuries* at least, until the delta has been consumed).
> 
> So when a negative delta is detected, just refrain from updating the
> runstates until the KVM clock catches up with runstate_entry_time again.
> 
> The userspace APIs for setting the runstate times do not allow them to
> be set past the current KVM clock, but userspace can still adjust the
> KVM clock *after* setting the runstate times, which would cause this
> situation to occur.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/xen.c | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative
  2024-05-22  0:17 ` [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative David Woodhouse
@ 2024-05-24 14:25   ` Paul Durrant
  2024-07-02 14:09   ` Shrikanth Hegde
  2024-08-16  4:35   ` Sean Christopherson
  2 siblings, 0 replies; 58+ messages in thread
From: Paul Durrant @ 2024-05-24 14:25 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On 22/05/2024 01:17, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> In steal_account_process_time(), a delta is calculated between the value
> returned by paravirt_steal_clock(), and this_rq()->prev_steal_time which
> is assumed to be the *previous* value returned by paravirt_steal_clock().
> 
> However, instead of just assigning the newly-read value directly into
> ->prev_steal_time for use in the next iteration, ->prev_steal_time is
> *incremented* by the calculated delta.
> 
> This used to be roughly the same, modulo conversion to jiffies and back,
> until commit 807e5b80687c0 ("sched/cputime: Add steal time support to
> full dynticks CPU time accounting") started clamping that delta to a
> maximum of the actual time elapsed. So now, if the value returned by
> paravirt_steal_clock() jumps by a large amount, instead of a *single*
> period of reporting 100% steal time, the system will report 100% steal
> time for as long as it takes to "catch up" with the reported value.
> Which is up to 584 years.
> 
> But there is a benefit to advancing ->prev_steal_time only by the time
> which was *accounted* as having been stolen. It means that any extra
> time truncated by the clamping will be accounted in the next sample
> period rather than lost. Given the stochastic nature of the sampling,
> that is more accurate overall.
> 
> So, continue to advance ->prev_steal_time by the accounted value as
> long as the delta isn't egregiously large (for which, use maxtime * 2).
> If the delta is more than that, just set ->prev_steal_time directly to
> the value returned by paravirt_steal_clock().
> 
> Fixes: 807e5b80687c0 ("sched/cputime: Add steal time support to full dynticks CPU time accounting")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   kernel/sched/cputime.c | 20 ++++++++++++++------
>   1 file changed, 14 insertions(+), 6 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative
  2024-05-22  0:17 ` [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative David Woodhouse
  2024-05-24 14:25   ` Paul Durrant
@ 2024-07-02 14:09   ` Shrikanth Hegde
  2024-08-16  4:35   ` Sean Christopherson
  2 siblings, 0 replies; 58+ messages in thread
From: Shrikanth Hegde @ 2024-07-02 14:09 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Shuah Khan, linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang, kvm



On 5/22/24 5:47 AM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> In steal_account_process_time(), a delta is calculated between the value
> returned by paravirt_steal_clock(), and this_rq()->prev_steal_time which
> is assumed to be the *previous* value returned by paravirt_steal_clock().
> 
> However, instead of just assigning the newly-read value directly into
> ->prev_steal_time for use in the next iteration, ->prev_steal_time is
> *incremented* by the calculated delta.


Does this happen because ULONG_MAX and u64 aren't of the same size? If so,
would using the u64 variant of MAX be a simpler fix?

> 
> This used to be roughly the same, modulo conversion to jiffies and back,
> until commit 807e5b80687c0 ("sched/cputime: Add steal time support to
> full dynticks CPU time accounting") started clamping that delta to a
> maximum of the actual time elapsed. So now, if the value returned by
> paravirt_steal_clock() jumps by a large amount, instead of a *single*
> period of reporting 100% steal time, the system will report 100% steal
> time for as long as it takes to "catch up" with the reported value.
> Which is up to 584 years.
> 
> But there is a benefit to advancing ->prev_steal_time only by the time
> which was *accounted* as having been stolen. It means that any extra
> time truncated by the clamping will be accounted in the next sample
> period rather than lost. Given the stochastic nature of the sampling,
> that is more accurate overall.
> 
> So, continue to advance ->prev_steal_time by the accounted value as
> long as the delta isn't egregiously large (for which, use maxtime * 2).
> If the delta is more than that, just set ->prev_steal_time directly to
> the value returned by paravirt_steal_clock().
> 
> Fixes: 807e5b80687c0 ("sched/cputime: Add steal time support to full dynticks CPU time accounting")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  kernel/sched/cputime.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index af7952f12e6c..3a8a8b38966d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -254,13 +254,21 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
>  {
>  #ifdef CONFIG_PARAVIRT
>  	if (static_key_false(&paravirt_steal_enabled)) {
> -		u64 steal;
> -
> -		steal = paravirt_steal_clock(smp_processor_id());
> -		steal -= this_rq()->prev_steal_time;
> -		steal = min(steal, maxtime);
> +		u64 steal, abs_steal;
> +
> +		abs_steal = paravirt_steal_clock(smp_processor_id());
> +		steal = abs_steal - this_rq()->prev_steal_time;
> +		if (unlikely(steal > maxtime)) {
> +			/*
> +			 * If the delta isn't egregious, it can be counted
> +			 * in the next time period. Only advance by maxtime.
> +			 */
> +			if (steal < maxtime * 2)
> +				abs_steal = this_rq()->prev_steal_time + maxtime;
> +			steal = maxtime;
> +		}
>  		account_steal_time(steal);
> -		this_rq()->prev_steal_time += steal;
> +		this_rq()->prev_steal_time = abs_steal;
>  
>  		return steal;
>  	}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 02/21] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2024-05-22  0:16 ` [RFC PATCH v3 02/21] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
@ 2024-08-13 17:50   ` Sean Christopherson
  0 siblings, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-13 17:50 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The kvm_guest_time_update() function scales the host TSC frequency to
> the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
> scaling ratio previously calculated for that vCPU. Then calcuates the
> scaling factors for the KVM clock itself based on that guest TSC
> frequency.
> 
> However, it uses kHz as the unit when scaling, and then multiplies by
> 1000 only at the end.
> 
> With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
> result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
> the KVM clock advertised to the guest is based on a frequency of
> 2,499,999,000 Hz.
> 
> By using Hz as the unit from the beginning, the KVM clock would be based
> on a more accurate frequency of 2,499,999,999 Hz in this example.
> 
> Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Reviewed-by: Paul Durrant <paul@xen.org>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/x86.c              | 17 +++++++++--------
>  arch/x86/kvm/xen.c              |  2 +-
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 01c69840647e..8440c4081727 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -887,7 +887,7 @@ struct kvm_vcpu_arch {
>  
>  	gpa_t time;
>  	struct pvclock_vcpu_time_info hv_clock;
> -	unsigned int hw_tsc_khz;
> +	unsigned int hw_tsc_hz;

Isn't there an overflow issue here?  The local variable is a 64-bit value, but
kvm_vcpu_arch.hw_tsc_hz is a 32-bit value.  And unless I'm having an even worse
review week than I thought, a guest TSC frequency > 4Ghz will get truncated.
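
A tiny sketch of that truncation, using a hypothetical 5GHz guest
frequency:

	uint64_t tgt_tsc_hz = 5000000000ULL;	/* 5 GHz, fits in a u64 */
	uint32_t hw_tsc_hz = tgt_tsc_hz;	/* truncated to 705032704 Hz */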

>  	struct gfn_to_pfn_cache pv_time;
>  	/* set guest stopped flag in pvclock flags field */
>  	bool pvclock_set_guest_stopped_request;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2d2619d3eee4..23281c508c27 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3215,7 +3215,8 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
>  
>  static int kvm_guest_time_update(struct kvm_vcpu *v)
>  {
> -	unsigned long flags, tgt_tsc_khz;
> +	unsigned long flags;
> +	uint64_t tgt_tsc_hz;

s/uint64_t/u64 for kernel code.  There are more than a few uses of uint64_t in
KVM, but u64 is far and away the dominant flavor.

> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 5a83a8154b79..014048c22652 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -2273,7 +2273,7 @@ void kvm_xen_update_tsc_info(struct kvm_vcpu *vcpu)
>  
>  	entry = kvm_find_cpuid_entry_index(vcpu, function, 2);
>  	if (entry)
> -		entry->eax = vcpu->arch.hw_tsc_khz;
> +		entry->eax = vcpu->arch.hw_tsc_hz / 1000;

And if hw_tsc_hz is a u64, this will need to use div_u64() to play nice with
32-bit kernels.
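
For example, something like this sketch (untested):

	#include <linux/math64.h>

	entry->eax = div_u64(vcpu->arch.hw_tsc_hz, 1000);

div_u64() takes a u64 dividend and a u32 divisor, so it avoids the
libgcc 64-bit division helpers that the kernel doesn't provide on
32-bit x86.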

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms
  2024-05-22  0:16 ` [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
  2024-05-24 13:14   ` Paul Durrant
@ 2024-08-13 18:07   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-13 18:07 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: Jack Allister <jalliste@amazon.com>
> 
> KVM provides a new interface for performing a fixup/correction of the KVM
> clock against the reference TSC. The KVM_[GS]ET_CLOCK_GUEST API requires a
> pvclock_vcpu_time_info, as such the caller must know about this definition.
> 
> Move the definition to the UAPI folder so that it is exported to usermode
> and also change the type definitions to use the standard for UAPI exports.

Shouldn't this be order before the introduction of KVM_[GS]ET_CLOCK_GUEST?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 05/21] KVM: selftests: Add KVM/PV clock selftest to prove timer correction
  2024-05-22  0:17 ` [RFC PATCH v3 05/21] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
@ 2024-08-13 18:55   ` Sean Christopherson
  0 siblings, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-13 18:55 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> The guest the records a singular TSC reference point in time and uses it to
            ^^^^
            then

> calculate 3 KVM clock values utilizing the 3 PVTI structs recorded earlier. Let's
> call each clock value CLK[0-2].
> 
> In a perfect world CLK[0-2] should all be the same value if the KVM clock
> & TSC relationship is preserved across the LU/LM (or faked in this test),
> however it is not.
> 
> A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
> (and the inaccuracies associated with that). A delta of ~3500ns can be
> observed if guest TSC scaling to half host TSC frequency is also enabled,
> whereas without scaling this is observed at ~180ns.

It'd be helpful to explain why TSC scaling results in a larger drift.  I'm by no
means a clock expert, but I've likely stared at this code more than most and it's
not obvious to me why scaling is problematic.  If I thought hard maybe I could
figure it out, but it's been an -ENOCOFFE sort of week so far, so I wouldn't bet
on it :-)

> +static void trigger_pvti_update(vm_paddr_t pvti_pa)
> +{
> +	/*
> +	 * We need a way to trigger KVM to update the fields

Please avoid "we", it's unnecessarily confusing as there are too many possible
subjects that "we" could apply to.  And please use the "full" 80 characters for
comments, there's no reason to wrap more aggressively.  E.g.

	/*
	 * Toggle between KVM's old and new system time methods to coerce KVM
	 * into updating the fields in the PV time info struct.
	 */

> +	 * in the PV time info. The easiest way to do this is
> +	 * to temporarily switch to the old KVM system time
> +	 * method and then switch back to the new one.
> +	 */
> +	wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
> +	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> +}
> +
> +static void guest_code(vm_paddr_t pvti_pa)
> +{
> +	struct pvclock_vcpu_time_info *pvti_va =
> +		(struct pvclock_vcpu_time_info *)pvti_pa;

Casting to "void *" will let this vit on a single line.  Though I don't see any
reason to take a vm_paddr_t, the infrastructure doesn't validate the arg types.

> +	struct pvclock_vcpu_time_info pvti_boot;
> +	struct pvclock_vcpu_time_info pvti_uncorrected;
> +	struct pvclock_vcpu_time_info pvti_corrected;
> +	uint64_t cycles_boot;
> +	uint64_t cycles_uncorrected;
> +	uint64_t cycles_corrected;
> +	uint64_t tsc_guest;
> +
> +	/*
> +	 * Setup the KVMCLOCK in the guest & store the original

s/&/and

And wrap less aggressively here too.

> +	 * PV time structure that is used.
> +	 */
> +	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> +	pvti_boot = *pvti_va;
> +	GUEST_SYNC(STAGE_FIRST_BOOT);
> +
> +	/*
> +	 * Trigger an update of the PVTI, if we calculate
> +	 * the KVM clock using this structure we'll see
> +	 * a delta from the TSC.

Too many pronouns.  Maybe this?

	/*
	 * Trigger an update of the PVTI and snapshot the time, which at this
	 * point is uncorrected, i.e. have a 

> +	 */
> +	trigger_pvti_update(pvti_pa);
> +	pvti_uncorrected = *pvti_va;
> +	GUEST_SYNC(STAGE_UNCORRECTED);
> +
> +	/*
> +	 * The test should have triggered the correction by this
> +	 * point in time. We have a copy of each of the PVTI structs
> +	 * at each stage now.
> +	 *
> +	 * Let's sample the timestamp at a SINGLE point in time and
> +	 * then calculate what the KVM clock would be using the PVTI
> +	 * from each stage.
> +	 *
> +	 * Then return each of these values to the tester.
> +	 */

	/*
	 * Snapshot the corrected time (the host does KVM_SET_CLOCK_GUEST when
	 * handling STAGE_UNCORRECTED).
	 */  

> +	pvti_corrected = *pvti_va;

	/*
	 * Sample the timestamp at a SINGLE point in time, and then calculate
	 * the effective KVM clock using the PVTI from each stage, and sync all
	 * values back to the host for verification.
	 */


On that last point though, why sync things back to the host?  The verification
can be done in the guest via __GUEST_ASSERT(), that way there are fewer magic
fields being passed around, e.g. no need for uc.args[2..4].

> +	tsc_guest = rdtsc();
> +
> +	cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
> +	cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
> +	cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
> +
> +	GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
> +			cycles_corrected, 0);
> +}

> +
> +static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> +{
> +	struct pvclock_vcpu_time_info pvti_before;
> +	uint64_t before, uncorrected, corrected;
> +	int64_t delta_uncorrected, delta_corrected;
> +	struct ucall uc;
> +	uint64_t ucall_reason;
> +
> +	/* Loop through each stage of the test. */
> +	while (true) {
> +
> +		/* Start/restart the running vCPU code. */
> +		vcpu_run(vcpu);
> +		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
> +
> +		/* Retrieve and verify our stage. */
> +		ucall_reason = get_ucall(vcpu, &uc);
> +		TEST_ASSERT(ucall_reason == UCALL_SYNC,
> +			    "Unhandled ucall reason=%lu",
> +			    ucall_reason);

Or just TEST_ASSERT_EQ().

> +		/* Run host specific code relating to stage. */
> +		switch (uc.args[1]) {
> +		case STAGE_FIRST_BOOT:
> +			/* Store the KVM clock values before an update. */
> +			vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti_before);
> +
> +			/* Sleep for a set amount of time to increase delta. */
> +			sleep(5);

This is probably worth plumbing in via command line, e.g. so that the test can
run with a shorter sleep() by default while also allowing users to stress things
by running with longer delays.  Ideally, the default sleep() would be as short
as possible while still detecting ~100% of bugs.

> +			break;
> +
> +		case STAGE_UNCORRECTED:
> +			/* Restore the KVM clock values. */
> +			vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti_before);
> +			break;
> +
> +		case STAGE_CORRECTED:
> +			/* Query the clock information and verify delta. */
> +			before = uc.args[2];
> +			uncorrected = uc.args[3];
> +			corrected = uc.args[4];
> +
> +			delta_uncorrected = before - uncorrected;
> +			delta_corrected = before - corrected;
> +
> +			pr_info("before=%lu uncorrected=%lu corrected=%lu\n",
> +				before, uncorrected, corrected);
> +
> +			pr_info("delta_uncorrected=%ld delta_corrected=%ld\n",
> +				delta_uncorrected, delta_corrected);
> +
> +			TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1),
> +				    "larger than expected delta detected = %ld", delta_corrected);
> +			return;
> +		}
> +	}
> +}
> +
> +static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> +{
> +	unsigned int gpages;

I'd prefer something like nr_pages.
> +
> +	gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
> +	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
> +				    KVMCLOCK_GPA, 1, gpages, 0);
> +	virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages);
> +
> +	vcpu_args_set(vcpu, 1, KVMCLOCK_GPA);

This is somewhat silly.  If you're going to hardcode the address, just use the
#define in both the host and the guest.  Then this helper doesn't need to take
a vCPU and could be more easily expanded to multiple vCPUs (if there's a good
reason to do so).
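
E.g. (sketch, untested; also renames gpages per the comment above):

	static void configure_pvclock(struct kvm_vm *vm)
	{
		unsigned int nr_pages;

		nr_pages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
		vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
					    KVMCLOCK_GPA, 1, nr_pages, 0);
		virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, nr_pages);
	}

and then guest_code() can simply do

	struct pvclock_vcpu_time_info *pvti_va =
		(struct pvclock_vcpu_time_info *)KVMCLOCK_GPA;

since the virt_map() above establishes an identity (GVA==GPA) mapping.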

> +}
> +
> +static void configure_scaled_tsc(struct kvm_vcpu *vcpu)
> +{
> +	uint64_t tsc_khz;
> +
> +	tsc_khz =  __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
> +	pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2);
> +	tsc_khz /= 2;

There's nothing special about scaling to 50%, correct?  So rather than hardcode
a single testcase, enumerate over a variety of frequencies, and specifically
cross the 32-bit boundary, e.g. 1GHz - 5GHz at 500MHz jumps or something, plus
the host's native unscaled value.
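
E.g. (sketch, untested; the exact range/step is illustrative, and the native
unscaled run would be one more iteration that simply skips KVM_SET_TSC_KHZ):

	uint64_t tsc_khz;

	TEST_REQUIRE(kvm_has_cap(KVM_CAP_TSC_CONTROL));

	for (tsc_khz = 1000 * 1000; tsc_khz <= 5000 * 1000; tsc_khz += 500 * 1000) {
		vm = vm_create_with_one_vcpu(&vcpu, guest_code);
		configure_pvclock(vm);
		pr_info("scaling tsc to %luKHz\n", tsc_khz);
		vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
		run_test(vm, vcpu);
		kvm_vm_free(vm);
	}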

> +	vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	struct kvm_vcpu *vcpu;
> +	struct kvm_vm *vm;
> +	bool scale_tsc;
> +
> +	scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) ||
> +				 !strncmp(argv[1], "--scale-tsc", 10));

I think it's worth adding proper argument parsing, e.g. to print a help.  The
boilerplate is annoying, but it'll pay off in the long run as I suspect we'll end
up with more params, e.g. to configure the sleep/delay, the min/max frequency,
the intervals between frequencies, etc.
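
E.g. (sketch; option letters, defaults and the usage text are all illustrative,
and a '-p' period option would subsume the sleep() knob suggested above):

	uint32_t sleep_secs = 1;
	int opt;

	while ((opt = getopt(argc, argv, "hsp:")) != -1) {
		switch (opt) {
		case 's':
			scale_tsc = true;
			break;
		case 'p':
			sleep_secs = atoi_positive("Sleep period", optarg);
			break;
		case 'h':
		default:
			pr_info("Usage: %s [-s] [-p sleep_secs]\n", argv[0]);
			exit(1);
		}
	}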

> +
> +	TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
> +
> +	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
> +
> +	configure_pvclock(vm, vcpu);
> +
> +	if (scale_tsc)
> +		configure_scaled_tsc(vcpu);
> +
> +	run_test(vm, vcpu);
> +
> +	return 0;
> +}
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 07/21] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
  2024-05-22  0:17 ` [RFC PATCH v3 07/21] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
@ 2024-08-14  1:52   ` Sean Christopherson
  0 siblings, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-14  1:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 72ad5ace118d..fe7c98907818 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -864,6 +864,12 @@ struct kvm_hyperv_eventfd {
>  /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
>  #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
>  #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
> +#define   KVM_VCPU_TSC_SCALE  1 /* attribute for TSC scaling factor */
> +
> +struct kvm_vcpu_tsc_scale {
> +	__u64 tsc_ratio;
> +	__u64 tsc_frac_bits;
> +};
>  
>  /* x86-specific KVM_EXIT_HYPERCALL flags. */
>  #define KVM_EXIT_HYPERCALL_LONG_MODE	_BITULL(0)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 42abce7b4fc9..00a7c1188dec 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5715,6 +5715,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
>  
>  	switch (attr->attr) {
>  	case KVM_VCPU_TSC_OFFSET:
> +	case KVM_VCPU_TSC_SCALE:
>  		r = 0;
>  		break;
>  	default:
> @@ -5737,6 +5738,17 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
>  			break;
>  		r = 0;
>  		break;
> +	case KVM_VCPU_TSC_SCALE: {
> +		struct kvm_vcpu_tsc_scale scale;
> +
> +		scale.tsc_ratio = vcpu->arch.l1_tsc_scaling_ratio;

I'm pretty sure vcpu->arch.l1_tsc_scaling_ratio is set to the correct value only
if the vCPU is using KVM's default frequency, or TSC scaling is supported in
hardware.

	/* TSC scaling supported? */
	if (!kvm_caps.has_tsc_control) {
		if (user_tsc_khz > tsc_khz) {
			vcpu->arch.tsc_catchup = 1;
			vcpu->arch.tsc_always_catchup = 1;
			return 0;
		} else {
			pr_warn_ratelimited("user requested TSC rate below hardware speed\n");
			return -1;
		}
	}

I assume the easiest solution is to enumerate support for KVM_VCPU_TSC_SCALE if
and only if kvm_caps.has_tsc_control is true.
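
I.e. (sketch):

	static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
					 struct kvm_device_attr *attr)
	{
		int r;

		switch (attr->attr) {
		case KVM_VCPU_TSC_OFFSET:
			r = 0;
			break;
		case KVM_VCPU_TSC_SCALE:
			r = kvm_caps.has_tsc_control ? 0 : -ENXIO;
			break;
		default:
			r = -ENXIO;
		}
		return r;
	}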

> +		scale.tsc_frac_bits = kvm_caps.tsc_scaling_ratio_frac_bits;
> +		r = -EFAULT;
> +		if (copy_to_user(uaddr, &scale, sizeof(scale)))
> +			break;
> +		r = 0;
> +		break;
> +	}
>  	default:
>  		r = -ENXIO;
>  	}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 08/21] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
  2024-05-22  0:17 ` [RFC PATCH v3 08/21] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
@ 2024-08-14  1:57   ` Sean Christopherson
  0 siblings, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-14  1:57 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
>  static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
>  {
> @@ -9984,9 +9977,10 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  
>  	if (pi_inject_timer == -1)
>  		pi_inject_timer = housekeeping_enabled(HK_TYPE_TIMER);
> -#ifdef CONFIG_X86_64
> +
>  	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);

The unregister path got missed.

#ifdef CONFIG_X86_64
	pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);

>  
> +#ifdef CONFIG_X86_64
>  	if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
>  		set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
>  #endif
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock()
  2024-05-22  0:17 ` [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
  2024-05-24 13:20   ` Paul Durrant
@ 2024-08-14  2:58   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-14  2:58 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
> 
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
> 
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
> 
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
> 
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
> 
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
> 
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
> 
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.

This should be a separate patch.  Actually, patches, plural.  More below.

> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---

...

> @@ -3100,36 +3131,49 @@ static unsigned long get_cpu_tsc_khz(void)
>  static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  {
>  	struct kvm_arch *ka = &kvm->arch;
> -	struct pvclock_vcpu_time_info hv_clock;
> +
> +#ifdef CONFIG_X86_64
> +	uint64_t cur_tsc_khz = 0;
> +	struct timespec64 ts;
>  
>  	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
>  	get_cpu();
>  
> -	data->flags = 0;
>  	if (ka->use_master_clock &&
> -	    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
> -#ifdef CONFIG_X86_64
> -		struct timespec64 ts;
> +	    (cur_tsc_khz = get_cpu_tsc_khz()) &&

That is mean.  And if you push it inside the if-statement, the {get,put}_cpu()
can be avoided when the master clock isn't being used, e.g.

	if (ka->use_master_clock) {
		/*
		 * The RDTSC needs to happen on the same CPU whose frequency is
		 * used to compute kvmclock's time.
		 */
		get_cpu();

		cur_tsc_khz = get_cpu_tsc_khz();
		if (cur_tsc_khz &&
		    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
			cur_tsc_khz = 0;

		put_cpu();
	}

However, the changelog essentially claims kvm_get_walltime_and_clockread() should
never fail when use_master_clock is enabled, which suggests a WARN is warranted.

    There was previously a code path in __get_kvmclock() which looked like
    it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
    even on 32-bit hosts. In practice that could never happen as the
    ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
    hosts it would never be set when the system clock isn't TSC-based. So
    that code path is now removed.

But, I think kvm_get_walltime_and_clockread() can fail when use_master_clock is
true, i.e. I don't think a WARN is viable as it could get false positives.

Ah, this is protected by pvclock_sc, so a stale use_master_clock should result
in a retry.  What if we WARN on that?

Hrm, that requires plumbing in the original sequence count.  Ah, but looking at
the patch as a whole, if we keep kvm_get_wall_clock_epoch()'s style, then it's
much easier.  And FWIW, I like the existing kvm_get_wall_clock_epoch() style a
lot more than the get_kvmclock() => __get_kvmclock() approach.

So, can we do this as prep patch #1?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9c14d0f5a684..98806a59e110 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3360,9 +3360,16 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 
                local_tsc_khz = get_cpu_tsc_khz();
 
+               /*
+                * The master clock depends on the pvclock being based on TSC,
+                * so the only way kvm_get_walltime_and_clockread() can fail is
+                * if the clocksource changed and use_master_clock is stale, in
+                * which case a seqcount retry should be pending.
+                */
                if (local_tsc_khz &&
-                   !kvm_get_walltime_and_clockread(&ts, &host_tsc))
-                       local_tsc_khz = 0; /* Fall back to old method */
+                   !kvm_get_walltime_and_clockread(&ts, &host_tsc) &&
+                   WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
+                           local_tsc_khz = 0; /* Fall back to old method */
 
                put_cpu();
 

And then as patch(es) 2..7 (give or take)

  (2) fold __get_kvmclock() into get_kvmclock()
  (3) add the same WARN on the seqcount in get_kvmclock() (but skimp on the comments)
  (4) use get_kvmclock_base_ns() as the fallback in get_kvmclock(), i.e. delete
      the raw rdtsc() and setting of KVM_CLOCK_TSC_STABLE w/o KVM_CLOCK_REALTIME
  (5) use get_cpu_tsc_khz() instead of open coding something similar
  (6) scale TSC when computing kvmclock (the core of this patch)
  (7) use get_kvmclock() in kvm_get_wall_clock_epoch() as they will be 100%
      equivalent at this point.

> +	    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
> +		cur_tsc_khz = 0;
>  
> -		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> -			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> -			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> -		} else
> -#endif
> -		data->host_tsc = rdtsc();
> -
> -		data->flags |= KVM_CLOCK_TSC_STABLE;
> -		hv_clock.tsc_timestamp = ka->master_cycle_now;
> -		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> -		kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
> -				   &hv_clock.tsc_shift,
> -				   &hv_clock.tsc_to_system_mul);
> -		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> -	} else {
> -		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +	put_cpu();
> +
> +	if (cur_tsc_khz) {
> +		uint64_t tsc_cycles;
> +		uint32_t mul;
> +		int8_t shift;
> +
> +		tsc_cycles = data->host_tsc - ka->master_cycle_now;
> +
> +		if (kvm_caps.has_tsc_control)
> +			tsc_cycles = kvm_scale_tsc(tsc_cycles,
> +						   ka->master_tsc_scaling_ratio);
> +
> +		if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +			mul = ka->master_tsc_mul;
> +			shift = ka->master_tsc_shift;
> +		} else {
> +			kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
> +					   &shift, &mul);
> +		}
> +		data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
> +			pvclock_scale_delta(tsc_cycles, mul, shift);
> +		data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> +		data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
> +		return;
>  	}
> +#endif
>  
> -	put_cpu();
> +	data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +	data->flags = 0;
>  }


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
  2024-05-22  0:17 ` [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
  2024-05-24 13:26   ` Paul Durrant
@ 2024-08-14  4:57   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-14  4:57 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> There was some confusion in kvm_update_guest_time() when software needs
> to advance the guest TSC.
> 
> In master clock mode, there are two points of time which need to be taken
> into account. First there is the master clock reference point, stored in
> kvm->arch.master_kernel_ns (and associated host TSC ->master_cycle_now).
> Secondly, there is the time *now*, at the point kvm_update_guest_time()
> is being called.
> 
> With software TSC upscaling, the guest TSC is getting further and further
> ahead of the host TSC as time elapses. So at time "now", the guest TSC
> should be further ahead of the host, than it was at master_kernel_ns.
> 
> The adjustment in kvm_update_guest_time() was not taking that into
> account, and was only advancing the guest TSC by the appropriate amount
> for master_kernel_ns, *not* the current time.
> 
> Fix it to calculate them both correctly.
> 
> Since the KVM clock reference point in master_kernel_ns might actually
> be *earlier* than the reference point used for the guest TSC
> (vcpu->last_tsc_nsec), this might lead to a negative delta. Fix the
> compute_guest_tsc() function to cope with negative numbers, which
> then means there is no need to force a master clock update when the
> guest TSC is written.

Please do this in a separate patch.  There's no need to squeeze it in here, and
this change is complex/subtle enough as it is.

> @@ -3300,8 +3306,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  		kernel_ns = get_kvmclock_base_ns();
>  	}
>  
> -	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
> -
>  	/*
>  	 * We may have to catch up the TSC to match elapsed wall clock
>  	 * time for two reasons, even if kvmclock is used.
> @@ -3313,11 +3317,46 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	 *	very slowly.
>  	 */
>  	if (vcpu->tsc_catchup) {
> -		u64 tsc = compute_guest_tsc(v, kernel_ns);

Random side topic, kernel_ns is a s64, shouldn't it be a u64?

> -		if (tsc > tsc_timestamp) {
> -			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
> -			tsc_timestamp = tsc;
> +		uint64_t now_host_tsc, now_guest_tsc;
> +		int64_t adjustment;
> +
> +		/*
> +		 * First, calculate what the guest TSC should be at the
> +		 * time (kernel_ns) which will be placed in the hvclock.
> +		 * This may be the *current* time, or it may be the time
> +		 * of the master clock reference. This is 'tsc_timestamp'.
> +		 */
> +		tsc_timestamp = compute_guest_tsc(v, kernel_ns);
> +
> +		now_guest_tsc = tsc_timestamp;
> +		now_host_tsc = host_tsc;
> +
> +#ifdef CONFIG_X86_64
> +		/*
> +		 * If the master clock was used, calculate what the guest
> +		 * TSC should be *now* in order to advance to that.
> +		 */
> +		if (use_master_clock) {
> +			int64_t now_kernel_ns;
> +
> +			if (!kvm_get_time_and_clockread(&now_kernel_ns,

Doesn't this need to be called under protection of the seqcount?

Ahh, but with that change, then get_cpu_tsc_khz() isn't guaranteed to be from
the same CPU.

Oof, disabling IRQs to protect against migration is complete overkill, and at
this point dumb luck as much as anything.  Saving IRQs was added by commit
18068523d3a0 ("KVM: paravirtualized clocksource: host part") before there
was any coordination with timekeeping.  And then after the coordination and
locking was added, commit c09664bb4418 ("KVM: x86: fix deadlock in clock-in-progress
request handling") moved the locking/coordination out of IRQ protection, and thus
made disabling IRQs completely pointless, except for protecting get_cpu_tsc_khz()
and now kvm_get_time_and_clockread().

Ha!  And if we slowly unwind that mess, this all ends up being _excruciatingly_
close to the same code as get_kvmclock().  Sadly, I don't think it's close enough
to be reusable, unless we want to play macro games.

> +							&now_host_tsc)) {
> +				now_kernel_ns = get_kvmclock_base_ns();
> +				now_host_tsc = rdtsc();
> +			}
> +			now_guest_tsc = compute_guest_tsc(v, now_kernel_ns);

I find the mixed state of kernel_ns and host_tsc to be terribly confusing.  It's
hard to see and remember that kernel_ns/host_tsc aren't "now" when use_master_clock
is true.

For TSC upscaling, I think we can have kernel_ns/host_tsc always be "now", we just
need to snapshot the master clock tsc+ns, and then shove those into kernel_ns and
host_tsc after doing the software upscaling.  That simplifies the TSC upscaling
code considerably, and IMO makes it more obvious how tsc_timestamp is computed,
and what its role is.

When all is said and done, I think we can get to this?

	/*
	 * If the host uses TSC clock, then passthrough TSC as stable
	 * to the guest.
	 */
	do {
		seq = read_seqcount_begin(&ka->pvclock_sc);

		use_master_clock = ka->use_master_clock;

		/*
		 * The TSC read and the call to get_cpu_tsc_khz() must happen
		 * on the same CPU.
		 */
		get_cpu();

		tgt_tsc_hz = get_cpu_tsc_khz();

		if (use_master_clock &&
		    !kvm_get_time_and_clockread(&kernel_ns, &host_tsc) &&
		    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
			use_master_clock = false;

		put_cpu();

		if (!use_master_clock)
			break;

		master_host_tsc = ka->master_cycle_now;
		master_kernel_ns = ka->master_kernel_ns;
	} while (read_seqcount_retry(&ka->pvclock_sc, seq));

	if (unlikely(!tgt_tsc_hz)) {
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
		return 1;
	}
	if (!use_master_clock) {
		host_tsc = rdtsc();
		kernel_ns = get_kvmclock_base_ns();
	}

	/*
	 * We may have to catch up the TSC to match elapsed wall clock
	 * time for two reasons, even if kvmclock is used.
	 *   1) CPU could have been running below the maximum TSC rate
	 *   2) Broken TSC compensation resets the base at each VCPU
	 *      entry to avoid unknown leaps of TSC even when running
	 *      again on the same CPU.  This may cause apparent elapsed
	 *      time to disappear, and the guest to stand still or run
	 *	very slowly.
	 */
	if (vcpu->tsc_catchup) {
		int64_t adjustment;

		/*
		 * Calculate the delta between what the guest TSC *should* be,
		 * and what it actually is according to kvm_read_l1_tsc().
		 */
		adjustment = compute_guest_tsc(v, kernel_ns) -
			     kvm_read_l1_tsc(v, host_tsc);
		if (adjustment > 0)
			adjust_tsc_offset_guest(v, adjustment);
	}

	/*
	 * Now that TSC upscaling is out of the way, the remaining calculations
	 * are all relative to the reference time that's placed in hv_clock.
	 * If the master clock is NOT in use, the reference time is "now".  If
	 * master clock is in use, the reference time comes from there.
	 */
	if (use_master_clock) {
		host_tsc = master_host_tsc;
		kernel_ns = master_kernel_ns;
	}
	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale()
  2024-05-22  0:17 ` [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
  2024-05-24 13:53   ` Paul Durrant
@ 2024-08-15 15:46   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-15 15:46 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit 3ae13faac400 ("KVM: x86: pass kvm_get_time_scale arguments in hertz")
> made this function take 64-bit values in Hz rather than 32-bit kHz. Thus
> making it entrely pointless to shadow its arguments into local 64-bit
> variables. Just use scaled_hz and base_hz directly.
> 
> Also rename the 'tps32' variable to 'base32', having utterly failed to
> think of any reason why it might have been called that in the first place.

Ticks Per Second?

> +	/*
> +	 * Start by shifting the base_hz right until it fits in 32 bits, and
> +	 * is lower than double the target rate. This introduces a negative
> +	 * shift value which would result in pvclock_scale_delta() shifting
> +	 * the actual tick count right before performing the multiplication.
> +	 */
> +	while (base_hz > scaled_hz*2 || base_hz & 0xffffffff00000000ULL) {
> +		base_hz >>= 1;
>  		shift--;
>  	}
>  
> -	tps32 = (uint32_t)tps64;
> -	while (tps32 <= scaled64 || scaled64 & 0xffffffff00000000ULL) {
> -		if (scaled64 & 0xffffffff00000000ULL || tps32 & 0x80000000)
> -			scaled64 >>= 1;
> +	/* Now the shifted base_hz fits in 32 bits, copy it to base32 */
> +	base32 = (uint32_t)base_hz;
> +
> +	/*
> +	 * Next, shift the scaled_hz right until it fits in 32 bits, and ensure
> +	 * that the shifted base_hz is not larger (so that the result of the
> +	 * final division also fits in 32 bits).
> +	 */
> +	while (base32 <= scaled_hz || scaled_hz & 0xffffffff00000000ULL) {
> +		if (scaled_hz & 0xffffffff00000000ULL || base32 & 0x80000000)
> +			scaled_hz >>= 1;
>  		else
> -			tps32 <<= 1;
> +			base32 <<= 1;
>  		shift++;
>  	}

Any chance you'd want to do this on top, so that it's easier to see that the
loops are waiting for the upper bits to go to zero?  And so that readers don't
have to count effs and zeros :-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d30b12986e17..786d5a855459 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2417,7 +2417,7 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
         * shift value which would result in pvclock_scale_delta() shifting
         * the actual tick count right before performing the multiplication.
         */
-       while (base_hz > scaled_hz*2 || base_hz & 0xffffffff00000000ULL) {
+       while (base_hz > scaled_hz*2 || base_hz >> 32) {
                base_hz >>= 1;
                shift--;
        }
@@ -2430,8 +2430,8 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
         * that the shifted base_hz is not larger (so that the result of the
         * final division also fits in 32 bits).
         */
-       while (base32 <= scaled_hz || scaled_hz & 0xffffffff00000000ULL) {
-               if (scaled_hz & 0xffffffff00000000ULL || base32 & 0x80000000)
+       while (base32 <= scaled_hz || scaled_hz >> 32) {
+               if (scaled_hz >> 32 || base32 & BIT(31))
                        scaled_hz >>= 1;
                else
                        base32 <<= 1;

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset()
  2024-05-22  0:17 ` [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
  2024-05-24 13:56   ` Paul Durrant
@ 2024-08-15 15:52   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-15 15:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Let the callers pass the host TSC value in as an explicit parameter.
> 
> This leaves some fairly obviously stupid code, which uses this function
> to compare the guest TSC at some *other* time, with the newly-minted TSC
> value from rdtsc(). Unless it's being used to measure *elapsed* time,
> that isn't very sensible.
> 
> In this case, "obviously stupid" is an improvement over being non-obviously
> so.
> 
> No functional change intended.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/x86.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ef3cd6113037..ea59694d712a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2601,11 +2601,12 @@ u64 kvm_scale_tsc(u64 tsc, u64 ratio)
>  	return _tsc;
>  }
>  
> -static u64 kvm_compute_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
> +static u64 kvm_compute_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 host_tsc,
> +				     u64 target_tsc)

Would it make sense to have a __kvm_compute_l1_tsc_offset() version that takes
in the host TSC, and then this?

static u64 kvm_compute_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
{
	return __kvm_compute_l1_tsc_offset(vcpu, rdtsc(), target_tsc);
}


Hmm, or maybe a better option would be:

static u64 kvm_compute_current_l1_tsc_offset(struct kvm_vcpu *vcpu,
					     u64 target_tsc)
{
	return kvm_compute_l1_tsc_offset(vcpu, rdtsc(), target_tsc);
}

Meh, after typing those out, I don't like either one.  Let's keep it how you
wrote it, I think there's quite a bit of added readability by forcing callers to
provide the host TSC.

>  {
>  	u64 tsc;
>  
> -	tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio);
> +	tsc = kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);
>  
>  	return target_tsc - tsc;

Opportunistically drop "tsc" too?  E.g.

	return target_tsc -
	       kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);

or

	return target_tsc - kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);

I find either of those much easier to read.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc()
  2024-05-22  0:17 ` [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
  2024-05-24 14:03   ` Paul Durrant
@ 2024-08-15 16:00   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-15 16:00 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When synchronizing to an existing TSC (either by explicitly writing zero,
> or the legacy hack where the TSC is written within one second's worth of
> the previously written TSC), the last_tsc_write and last_tsc_nsec values
> were being misrecorded by __kvm_synchronize_tsc(). The *unsynchronized*
> value of the TSC (perhaps even zero) was being recorded, along with the
> current time at which kvm_synchronize_tsc() was called. This could cause
> *subsequent* writes to fail to synchronize correctly.
> 
> Fix that by resetting {data, ns} to the previous values before passing
> them to __kvm_synchronize_tsc() when synchronization is detected. Except
> in the case where the TSC is unstable and *has* to be synthesised from
> the host clock, in which case attempt to create a nsec/tsc pair which is
> on the correct line.
> 
> Furthermore, there were *three* different TSC reads used for calculating
> the "current" time, all slightly different from each other. Fix that by
> using kvm_get_time_and_clockread() where possible and using the same
> host_tsc value in all cases.

Please split this into two patches, one to switch to a single RDTSC, and another
to fix the other stuff.

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++----
>  1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ea59694d712a..6ec43f39bdb0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -201,6 +201,10 @@ module_param(eager_page_split, bool, 0644);
>  static bool __read_mostly mitigate_smt_rsb;
>  module_param(mitigate_smt_rsb, bool, 0444);
>  
> +#ifdef CONFIG_X86_64
> +static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
> +#endif
> +
>  /*
>   * Restoring the host value for MSRs that are only consumed when running in
>   * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
> @@ -2753,14 +2757,22 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
>  {
>  	u64 data = user_value ? *user_value : 0;
>  	struct kvm *kvm = vcpu->kvm;
> -	u64 offset, ns, elapsed;
> +	u64 offset, host_tsc, ns, elapsed;
>  	unsigned long flags;
>  	bool matched = false;
>  	bool synchronizing = false;
>  
> +#ifdef CONFIG_X86_64
> +	if (!kvm_get_time_and_clockread(&ns, &host_tsc))
> +#endif

I'm pretty sure we can unconditionally declare kvm_get_time_and_clockread() above,
and then do

	if (!IS_ENABLED(CONFIG_X86_64) ||
	    !kvm_get_time_and_clockread(&ns, &host_tsc))

and let dead code elimination do its thing to avoid a linker error.

> +	{
> +		ns = get_kvmclock_base_ns();
> +		host_tsc = rdtsc();
> +	}
> +

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields
  2024-05-22  0:17 ` [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields David Woodhouse
  2024-05-24 14:05   ` Paul Durrant
@ 2024-08-15 16:30   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-15 16:30 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> These pointlessly duplicate the last_tsc_{nsec,offset,write} values.
> 
> The only place they were used was where the TSC is stable and a new vCPU
> is being synchronized to the previous setting, in which case the 'last_'
> value is definitely identical.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 ---
>  arch/x86/kvm/x86.c              | 19 ++++++++-----------
>  2 files changed, 8 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b01c1d000fff..7d06f389a607 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1354,9 +1354,6 @@ struct kvm_arch {
>  	u32 last_tsc_khz;
>  	u64 last_tsc_offset;
>  	u64 last_tsc_scaling_ratio;
> -	u64 cur_tsc_nsec;
> -	u64 cur_tsc_write;
> -	u64 cur_tsc_offset;
>  	u64 cur_tsc_generation;
>  	int nr_vcpus_matched_tsc;
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6ec43f39bdb0..ab5d55071253 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2713,11 +2713,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	lockdep_assert_held(&kvm->arch.tsc_write_lock);
>  
>  	/*
> -	 * We also track th most recent recorded KHZ, write and time to
> -	 * allow the matching interval to be extended at each write.
> +	 * Track the last recorded kHz (and associated scaling ratio for
> +	 * calculating the guest TSC), and offset.
>  	 */
> -	kvm->arch.last_tsc_nsec = ns;
> -	kvm->arch.last_tsc_write = tsc;
>  	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
>  	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
>  	kvm->arch.last_tsc_offset = offset;
> @@ -2736,10 +2734,9 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  		 *
>  		 * These values are tracked in kvm->arch.cur_xxx variables.

This comment is now stale, as most of the fields are now .last_xxx, not cur_xxx.

However...

>  		 */
> +		kvm->arch.last_tsc_nsec = ns;


There is a functional change here, and it's either incorrect or misleading (I
think the latter).  If the TSC is unstable, "ns" in kvm_synchronize_tsc() will
come from get_kvmclock_base_ns(), and only the TSC frequency is checked for a
match when synchronizing.

That results in .last_tsc_nsec not being updated, and so subsequent syncs will
compute a larger elapsed time (relative to the current generation's timestamp,
not the "last" timestamp).

Functionally, I think that's ok?  So long as all vCPUs sync against the same
baseline, it should work?  I think.

But if that's the case, then I would prefer to delete last_tsc_{nsec,write,offset},
not the cur_xxx versions.  For nsec and write it shows that they are valid/used
only in the context of the current generation.

And for the offset, updating it _outside_ of the loop makes it more obvious that
the offset can change (by design) within a generation if the TSC is unstable.

Ooh, and if I'm reading the code correctly, last_tsc_khz can be renamed to
cur_tsc_khz and moved into the !matched statement too, as it's guaranteed to be
vcpu->arch.virtual_tsc_khz if matched==true.

Ah, right, and last_tsc_scaling_ratio is just a derivation of virtual_tsc_khz,
so it too can be cur_xxx and put under !matched.

Am I missing something?  That seems too easy...
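
FWIW, the end result I'm envisioning for __kvm_synchronize_tsc() is something
like this (sketch, untested):

	if (!matched) {
		/*
		 * Start a new generation: nsec/write/khz/ratio define the
		 * current generation's clock "line" and are valid only
		 * within it.
		 */
		kvm->arch.cur_tsc_generation++;
		kvm->arch.cur_tsc_nsec = ns;
		kvm->arch.cur_tsc_write = tsc;
		kvm->arch.cur_tsc_khz = vcpu->arch.virtual_tsc_khz;
		kvm->arch.cur_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
	}

	/* The offset can change (by design) within a generation. */
	kvm->arch.cur_tsc_offset = offset;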

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock()
  2024-05-22  0:17 ` [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
  2024-05-24 14:13   ` Paul Durrant
@ 2024-08-15 17:12   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-15 17:12 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

The shortlog is rather misleading.  This is more than just a refactor, and I
would argue the refactor aspect is secondary, i.e. the main goal of this patch
is to apply the exceptions to kvm_track_tsc_matching().

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Both kvm_track_tsc_matching() and pvclock_update_vm_gtod_copy() make a
> decision about whether the KVM clock should be in master clock mode.
> 
> They use *different* criteria for the decision though. This isn't really
> a problem; it only has the potential to cause unnecessary invocations of
> KVM_REQ_MASTERCLOCK_UPDATE if the masterclock was disabled due to TSC
> going backwards, or the guest using the old MSR. But it isn't pretty.
> 
> Factor the decision out to a single function. And document the historical
> reason why it's disabled for guests that use the old MSR_KVM_SYSTEM_TIME.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e21b8c075bf6..437412b36cae 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2518,6 +2518,27 @@ static inline bool gtod_is_based_on_tsc(int mode)
>  }
>  #endif
>  
> +static bool kvm_use_master_clock(struct kvm *kvm)

Maybe kvm_can_use_master_clock() so that this isn't confused with the actual
ka->use_master_clock field.

> +{
> +	struct kvm_arch *ka = &kvm->arch;
> +
> +	/*
> +	 * The 'old kvmclock' check is a workaround (from 2015) for a
> +	 * SUSE 2.6.16 kernel that didn't boot if the system_time in
> +	 * its kvmclock was too far behind the current time. So the
> +	 * mode of just setting the reference point and allowing time
> +	 * to proceed linearly from there makes it fail to boot.
> +	 * Despite that being kind of the *point* of the way the clock
> +	 * is exposed to the guest. By coincidence, the offending
> +	 * kernels used the old MSR_KVM_SYSTEM_TIME, which was moved
> +	 * only because it resided in the wrong number range. So the
> +	 * workaround is activated for *all* guests using the old MSR.
> +	 */
> +	return ka->all_vcpus_matched_tsc &&
> +		!ka->backwards_tsc_observed &&
> +		!ka->boot_vcpu_runs_old_kvmclock;

Please align indentation:

	return ka->all_vcpus_matched_tsc &&
	       !ka->backwards_tsc_observed &&
	       !ka->boot_vcpu_runs_old_kvmclock;

> +}
> +
>  static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
>  {
>  #ifdef CONFIG_X86_64
> @@ -2550,7 +2571,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
>  	 * To use the masterclock, the host clocksource must be based on TSC
>  	 * and all vCPUs must have matching TSC frequencies.
>  	 */
> -	bool use_master_clock = ka->all_vcpus_matched_tsc &&
> +	bool use_master_clock = kvm_use_master_clock(vcpu->kvm) &&
>  				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
>  
>  	/*
> @@ -3096,9 +3117,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  					&ka->master_cycle_now);
>  
>  	ka->use_master_clock = host_tsc_clocksource
> -				&& ka->all_vcpus_matched_tsc
> -				&& !ka->backwards_tsc_observed
> -				&& !ka->boot_vcpu_runs_old_kvmclock;
> +				&& kvm_use_master_clock(kvm);

Perfect opportunity to put the "&&" on the preceding line.
>  
>  	/*
>  	 * When TSC scaling is in use (which can thankfully only happen
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load()
  2024-05-22  0:17 ` [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load() David Woodhouse
  2024-05-24 14:16   ` Paul Durrant
@ 2024-08-15 17:31   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-15 17:31 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit d98d07ca7e034 ("KVM: x86: update pvclock area conditionally, on
> cpu migration") turned an unconditional KVM_REQ_CLOCK_UPDATE into a
> conditional one, if either the master clock isn't enabled *or* the vCPU
> was not previously scheduled (vcpu->cpu == -1). The commit message doesn't
> explain the latter condition, which is specifically for the master clock
> case.

It handles the first load of the vCPU, which is distinctly different from CPU
migration.

> Commit 0061d53daf26f ("KVM: x86: limit difference between kvmclock
> updates") later turned that into a KVM_REQ_GLOBAL_CLOCK_UPDATE to avoid
> skew between vCPUs.
> 
> In master clock mode there is no need for any of that, regardless of
> whether/where this vCPU was previously scheduled.
> 
> Do it only if (!kvm->arch.use_master_clock).
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/x86.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 32a873d5ed00..dd53860ca284 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5161,7 +5161,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		 * On a host with synchronized TSC, there is no need to update
>  		 * kvmclock on vcpu->cpu migration
>  		 */
> -		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
> +		if (!vcpu->kvm->arch.use_master_clock)
>  			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);

I believe this needs to be:

		if (!vcpu->kvm->arch.use_master_clock)
			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
		else if (vcpu->cpu == -1)
			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

or alternatively do KVM_REQ_CLOCK_UPDATE during vCPU creation (hotplug?).

>  		if (vcpu->cpu != cpu)
>  			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
  2024-05-22  0:17 ` [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
  2024-05-24 14:10   ` Paul Durrant
@ 2024-08-16  2:38   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-16  2:38 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> There is no reason why the KVM clock cannot be in masterclock mode when
> the TSCs are not in sync, as long as they are at the same *frequency*.

Yes there is.  KVM sets PVCLOCK_TSC_STABLE_BIT when the master clock is enabled:

	if (use_master_clock)
		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;

which tells the guest its safe to skip the cross-(v)CPU forced monotonicty goo:

	if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
		(flags & PVCLOCK_TSC_STABLE_BIT))
		return ret;

	/*
	 * Assumption here is that last_value, a global accumulator, always goes
	 * forward. If we are less than that, we should not be much smaller.
	 * We assume there is an error margin we're inside, and then the correction
	 * does not sacrifice accuracy.
	 *
	 * For reads: global may have changed between test and return,
	 * but this means someone else updated poked the clock at a later time.
	 * We just need to make sure we are not seeing a backwards event.
	 *
	 * For updates: last_value = ret is not enough, since two vcpus could be
	 * updating at the same time, and one of them could be slightly behind,
	 * making the assumption that last_value always go forward fail to hold.
	 */
	last = raw_atomic64_read(&last_value);
	do {
		if (ret <= last)
			return last;
	} while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));

	return ret;

As called out by commit b48aa97e3820 ("KVM: x86: require matched TSC offsets for
master clock"), that was the motivation for requiring matched offsets.

As an aside, Marcelo's assertion that "its obvious" was anything but to me.  Took
me forever to piece together the PVCLOCK_TSC_STABLE_BIT angle.

    KVM: x86: require matched TSC offsets for master clock
    
    With master clock, a pvclock clock read calculates:
    
    ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]
    
    Where 'rdtsc' is the host TSC.
    
    system_timestamp and tsc_timestamp are unique, one tuple
    per VM: the "master clock".
    
    Given a host with synchronized TSCs, its obvious that
    guest TSC must be matched for the above to guarantee monotonicity.
    
    Allow master clock usage only if guest TSCs are synchronized.
    
    Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

> Running at a different frequency would lead to a systemic skew between
> the clock(s) as observed by different vCPUs due to arithmetic precision
> in the scaling. So that should indeed force the clock to be based on the
> host's CLOCK_MONOTONIC_RAW instead of being in masterclock mode where it
> is defined by the (or 'a') guest TSC.
> 
> But when the vCPUs merely have a different TSC *offset*, that's not a
> problem. The offset is applied to that vCPU's kvmclock->tsc_timestamp
> field, and it all comes out in the wash.
> 
> So, remove ka->nr_vcpus_matched_tsc and replace it with a new field
> ka->all_vcpus_matched_tsc which is not only changed to a boolean, but
> also now tracks that the *frequency* matches, not the precise offset.
> 
> Using a *count* was always racy because a new vCPU could be being
> created *while* kvm_track_tsc_matching() was running and comparing with
> kvm->online_vcpus. That variable is only atomic with respect to itself.
> In particular, kvm_arch_vcpu_create() runs before kvm->online_vcpus is
> incremented for the new vCPU, and kvm_arch_vcpu_postcreate() runs later.
> 
> Repurpose kvm_track_tsc_matching() to be called from kvm_set_tsc_khz(),
> and kill the cur_tsc_generation/last_tsc_generation fields which tracked
> the precise TSC matching.

If this goes anywhere, it needs to be at least three patches:

  1. Bulk code movement
  2. Chaning from a count to a bool/flag
  3. Eliminating the offset matching

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR
  2024-05-22  0:17 ` [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR David Woodhouse
  2024-05-24 14:14   ` Paul Durrant
@ 2024-08-16  4:28   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-16  4:28 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit 0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
> introduced a KVM_REQ_GLOBAL_CLOCK_UPDATE when one vCPU set up its clock.
> 
> This was a workaround for the ever-drifting clocks which were based on the
> host's CLOCK_MONOTONIC and thus subject to NTP skew. On booting or resuming
> a guest, it just leads to running kvm_guest_time_update() twice for each
> vCPU for now good reason.

s/now/no

> Just use KVM_REQ_CLOCK_UPDATE on the vCPU itself, and only in the case
> where the KVM clock is being set up, not turned off.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/x86.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 437412b36cae..32a873d5ed00 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2361,13 +2361,13 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
>  	}
>  
>  	vcpu->arch.time = system_time;
> -	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
>  
>  	/* we verify if the enable bit is set... */
> -	if (system_time & 1)
> +	if (system_time & 1) {
>  		kvm_gpc_activate(&vcpu->arch.pv_time, system_time & ~1ULL,
>  				 sizeof(struct pvclock_vcpu_time_info));
> -	else
> +		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +	} else

Needs curly braces now that the if-statement has 'em.

>  		kvm_gpc_deactivate(&vcpu->arch.pv_time);
>  
>  	return;
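
I.e.:

	if (system_time & 1) {
		kvm_gpc_activate(&vcpu->arch.pv_time, system_time & ~1ULL,
				 sizeof(struct pvclock_vcpu_time_info));
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
	} else {
		kvm_gpc_deactivate(&vcpu->arch.pv_time);
	}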
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode
  2024-05-22  0:17 ` [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode David Woodhouse
  2024-05-24 14:18   ` Paul Durrant
@ 2024-08-16  4:33   ` Sean Christopherson
  1 sibling, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-16  4:33 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When the KVM clock is in master clock mode, updating the KVM clock is
> pointless.

Please explain why it's pointless.  I almost feel bad asking you to elaborate
since you've written wonderful changelogs for so many other patches, but far too
many of the (historical) kvmclock commits assume the reader is some omniscient
being, and it's definitely contributed to things getting into such a messy state.

> Let the periodic work 'expire', and start it running again
> from kvm_end_pvclock_update() if the master clock mode is ever turned
> off again.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative
  2024-05-22  0:17 ` [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative David Woodhouse
  2024-05-24 14:25   ` Paul Durrant
  2024-07-02 14:09   ` Shrikanth Hegde
@ 2024-08-16  4:35   ` Sean Christopherson
  2 siblings, 0 replies; 58+ messages in thread
From: Sean Christopherson @ 2024-08-16  4:35 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

This should be posted To: something other than kvm@, in a separate series, else
it's bound to get lost/ignored.

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> In steal_account_process_time(), a delta is calculated between the value
> returned by paravirt_steal_clock(), and this_rq()->prev_steal_time which
> is assumed to be the *previous* value returned by paravirt_steal_clock().
> 
> However, instead of just assigning the newly-read value directly into
> ->prev_steal_time for use in the next iteration, ->prev_steal_time is
> *incremented* by the calculated delta.
> 
> This used to be roughly the same, modulo conversion to jiffies and back,
> until commit 807e5b80687c0 ("sched/cputime: Add steal time support to
> full dynticks CPU time accounting") started clamping that delta to a
> maximum of the actual time elapsed. So now, if the value returned by
> paravirt_steal_clock() jumps by a large amount, instead of a *single*
> period of reporting 100% steal time, the system will report 100% steal
> time for as long as it takes to "catch up" with the reported value.
> Which is up to 584 years.
> 
> But there is a benefit to advancing ->prev_steal_time only by the time
> which was *accounted* as having been stolen. It means that any extra
> time truncated by the clamping will be accounted in the next sample
> period rather than lost. Given the stochastic nature of the sampling,
> that is more accurate overall.
> 
> So, continue to advance ->prev_steal_time by the accounted value as
> long as the delta isn't egregiously large (for which, use maxtime * 2).
> If the delta is more than that, just set ->prev_steal_time directly to
> the value returned by paravirt_steal_clock().
> 
> Fixes: 807e5b80687c0 ("sched/cputime: Add steal time support to full dynticks CPU time accounting")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  kernel/sched/cputime.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index af7952f12e6c..3a8a8b38966d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -254,13 +254,21 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
>  {
>  #ifdef CONFIG_PARAVIRT
>  	if (static_key_false(&paravirt_steal_enabled)) {
> -		u64 steal;
> -
> -		steal = paravirt_steal_clock(smp_processor_id());
> -		steal -= this_rq()->prev_steal_time;
> -		steal = min(steal, maxtime);
> +		u64 steal, abs_steal;
> +
> +		abs_steal = paravirt_steal_clock(smp_processor_id());
> +		steal = abs_steal - this_rq()->prev_steal_time;
> +		if (unlikely(steal > maxtime)) {
> +			/*
> +			 * If the delta isn't egregious, it can be counted
> +			 * in the next time period. Only advance by maxtime.
> +			 */
> +			if (steal < maxtime * 2)
> +				abs_steal = this_rq()->prev_steal_time + maxtime;
> +			steal = maxtime;
> +		}
>  		account_steal_time(steal);
> -		this_rq()->prev_steal_time += steal;
> +		this_rq()->prev_steal_time = abs_steal;
>  
>  		return steal;
>  	}
> -- 
> 2.44.0
> 

* Re: [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative
  2024-05-22  0:17 ` [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
  2024-05-24 14:21   ` Paul Durrant
@ 2024-08-16  4:39   ` Sean Christopherson
  2024-08-20 10:22     ` David Woodhouse
  1 sibling, 1 reply; 58+ messages in thread
From: Sean Christopherson @ 2024-08-16  4:39 UTC (permalink / raw)
  To: David Woodhouse
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When kvm_xen_update_runstate() is invoked to set a vCPU's runstate, the
> time spent in the previous runstate is accounted. This is based on the
> delta between the current KVM clock time, and the previous value stored
> in vcpu->arch.xen.runstate_entry_time.
> 
> If the KVM clock goes backwards, that delta will be negative. Or, since
> it's an unsigned 64-bit integer, very *large*. Linux guests deal with
> that particularly badly, reporting 100% steal time for ever more (well,
> for *centuries* at least, until the delta has been consumed).
> 
> So when a negative delta is detected, just refrain from updating the
> runstates until the KVM clock catches up with runstate_entry_time again.
> 
> The userspace APIs for setting the runstate times do not allow them to
> be set past the current KVM clock, but userspace can still adjust the
> KVM clock *after* setting the runstate times, which would cause this
> situation to occur.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/xen.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 014048c22652..3d4111de4472 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -538,24 +538,34 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
>  {
>  	struct kvm_vcpu_xen *vx = &v->arch.xen;
>  	u64 now = get_kvmclock_ns(v->kvm);
> -	u64 delta_ns = now - vx->runstate_entry_time;
>  	u64 run_delay = current->sched_info.run_delay;
> +	s64 delta_ns = now - vx->runstate_entry_time;
> +	s64 steal_ns = run_delay - vx->last_steal;
>  
>  	if (unlikely(!vx->runstate_entry_time))
>  		vx->current_runstate = RUNSTATE_offline;
>  
> +	vx->last_steal = run_delay;
> +
> +	/*
> +	 * If KVM clock time went backwards, stop updating until it
> +	 * catches up (or the runstates are reset by userspace).
> +	 */

I take it this is a legitimate scenario where userspace sets KVM clock and then
the runstates, and KVM needs to lend a hand because userspace can't do those two
things atomically?

> +	if (delta_ns < 0)
> +		return;
> +
>  	/*
>  	 * Time waiting for the scheduler isn't "stolen" if the
>  	 * vCPU wasn't running anyway.
>  	 */
> -	if (vx->current_runstate == RUNSTATE_running) {
> -		u64 steal_ns = run_delay - vx->last_steal;
> +	if (vx->current_runstate == RUNSTATE_running && steal_ns > 0) {
> +		if (steal_ns > delta_ns)
> +			steal_ns = delta_ns;
>  
>  		delta_ns -= steal_ns;
>  
>  		vx->runstate_times[RUNSTATE_runnable] += steal_ns;
>  	}
> -	vx->last_steal = run_delay;
>  
>  	vx->runstate_times[vx->current_runstate] += delta_ns;
>  	vx->current_runstate = state;
> -- 
> 2.44.0
> 

* Re: [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative
  2024-08-16  4:39   ` Sean Christopherson
@ 2024-08-20 10:22     ` David Woodhouse
  2024-08-20 15:08       ` Steven Rostedt
  0 siblings, 1 reply; 58+ messages in thread
From: David Woodhouse @ 2024-08-20 10:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Paul Durrant,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Thu, 2024-08-15 at 21:39 -0700, Sean Christopherson wrote:
> > +       vx->last_steal = run_delay;
> > +
> > +       /*
> > +        * If KVM clock time went backwards, stop updating until it
> > +        * catches up (or the runstates are reset by userspace).
> > +        */
> 
> I take it this is a legitimate scenario where userspace sets KVM clock and then
> the runstates, and KVM needs to lend a hand because userspace can't do those two
> things atomically?

Indeed. Will update the comment to make that more obvious.

Thanks for the rest of the review on this series. I'll go through in
detail and update it, hopefully this week.


* Re: [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative
  2024-08-20 10:22     ` David Woodhouse
@ 2024-08-20 15:08       ` Steven Rostedt
  2024-08-20 15:42         ` David Woodhouse
  0 siblings, 1 reply; 58+ messages in thread
From: Steven Rostedt @ 2024-08-20 15:08 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Sean Christopherson, kvm, Paolo Bonzini, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Tue, 20 Aug 2024 11:22:31 +0100
David Woodhouse <dwmw2@infradead.org> wrote:

> On Thu, 2024-08-15 at 21:39 -0700, Sean Christopherson wrote:
> > > +       vx->last_steal = run_delay;
> > > +
> > > +       /*
> > > +        * If KVM clock time went backwards, stop updating until it
> > > +        * catches up (or the runstates are reset by userspace).
> > > +        */  
> > 
> > I take it this is a legitimate scenario where userspace sets KVM clock and then
> > the runstates, and KVM needs to lend a hand because userspace can't do those two
> > things atomically?  
> 
> Indeed. Will update the comment to make that more obvious.
> 
> Thanks for the rest of the review on this series. I'll go through in
> detail and update it, hopefully this week.

Hmm, is this related at all to this:

  https://lore.kernel.org/all/20240806111157.1336532-1-suleiman@google.com/

-- Steve

* Re: [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative
  2024-08-20 15:08       ` Steven Rostedt
@ 2024-08-20 15:42         ` David Woodhouse
  0 siblings, 0 replies; 58+ messages in thread
From: David Woodhouse @ 2024-08-20 15:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sean Christopherson, kvm, Paolo Bonzini, Jonathan Corbet,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
	linux-doc, linux-kernel, jalliste, sveith, zide.chen,
	Dongli Zhang, Chenyi Qiang

On Tue, 2024-08-20 at 11:08 -0400, Steven Rostedt wrote:
> On Tue, 20 Aug 2024 11:22:31 +0100
> David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > On Thu, 2024-08-15 at 21:39 -0700, Sean Christopherson wrote:
> > > > +       vx->last_steal = run_delay;
> > > > +
> > > > +       /*
> > > > +        * If KVM clock time went backwards, stop updating until it
> > > > +        * catches up (or the runstates are reset by userspace).
> > > > +        */
> > > 
> > > I take it this is a legitimate scenario where userspace sets KVM clock and
> > > then the runstates, and KVM needs to lend a hand because userspace can't do
> > > those two things atomically?
> > 
> > Indeed. Will update the comment to make that more obvious.
> > 
> > Thanks for the rest of the review on this series. I'll go through in
> > detail and update it, hopefully this week.
> 
> Hmm, is this related at all to this:
> 
>   https://lore.kernel.org/all/20240806111157.1336532-1-suleiman@google.com/

No, but patch 21 in this same series is.


end of thread

Thread overview: 58+ messages
2024-05-22  0:16 [RFC PATCH v3 00/21] Cleaning up the KVM clock mess David Woodhouse
2024-05-22  0:16 ` [RFC PATCH v3 01/21] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
2024-05-22  0:16 ` [RFC PATCH v3 02/21] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
2024-08-13 17:50   ` Sean Christopherson
2024-05-22  0:16 ` [RFC PATCH v3 03/21] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
2024-05-22  0:16 ` [RFC PATCH v3 04/21] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
2024-05-24 13:14   ` Paul Durrant
2024-08-13 18:07   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 05/21] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
2024-08-13 18:55   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 06/21] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
2024-05-22  0:17 ` [RFC PATCH v3 07/21] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
2024-08-14  1:52   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 08/21] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
2024-08-14  1:57   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
2024-05-24 13:20   ` Paul Durrant
2024-08-14  2:58   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
2024-05-24 13:26   ` Paul Durrant
2024-08-14  4:57   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 11/21] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
2024-05-24 13:53   ` Paul Durrant
2024-08-15 15:46   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 12/21] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
2024-05-24 13:56   ` Paul Durrant
2024-08-15 15:52   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 13/21] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
2024-05-24 14:03   ` Paul Durrant
2024-08-15 16:00   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 14/21] KVM: x86: Kill cur_tsc_{nsec,offset,write} fields David Woodhouse
2024-05-24 14:05   ` Paul Durrant
2024-08-15 16:30   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
2024-05-24 14:10   ` Paul Durrant
2024-08-16  2:38   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 16/21] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
2024-05-24 14:13   ` Paul Durrant
2024-08-15 17:12   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 17/21] KVM: x86: Avoid global clock update on setting KVM clock MSR David Woodhouse
2024-05-24 14:14   ` Paul Durrant
2024-08-16  4:28   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 18/21] KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load() David Woodhouse
2024-05-24 14:16   ` Paul Durrant
2024-08-15 17:31   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 19/21] KVM: x86: Avoid periodic KVM clock updates in master clock mode David Woodhouse
2024-05-24 14:18   ` Paul Durrant
2024-08-16  4:33   ` Sean Christopherson
2024-05-22  0:17 ` [RFC PATCH v3 20/21] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
2024-05-24 14:21   ` Paul Durrant
2024-08-16  4:39   ` Sean Christopherson
2024-08-20 10:22     ` David Woodhouse
2024-08-20 15:08       ` Steven Rostedt
2024-08-20 15:42         ` David Woodhouse
2024-05-22  0:17 ` [RFC PATCH v3 21/21] sched/cputime: Cope with steal time going backwards or negative David Woodhouse
2024-05-24 14:25   ` Paul Durrant
2024-07-02 14:09   ` Shrikanth Hegde
2024-08-16  4:35   ` Sean Christopherson
