[PATCH v4] 00/30] Cleaning up the KVM clock mess

Linux Documentation
 help / color / mirror / Atom feed

* [PATCH v4] 00/30] Cleaning up the KVM clock mess
@ 2026-05-09 22:46 David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 01/30] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
                   ` (32 more replies)
  0 siblings, 33 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

This is v4 of the series to clean up the KVM clock, addressing review
feedback from Sean Christopherson and Paul Durrant on v3, rebased to
the current kernel, and incorporating related work from Dongli Zhang.

The KVM clock has historically suffered from three problems:

 1. Imprecision: get_kvmclock_ns() computed the clock from the *host*
    TSC without applying guest TSC scaling, causing systemic drift from
    the values the guest computes from its own TSC.

 2. Unnecessary discontinuities: gratuitous KVM_REQ_MASTERCLOCK_UPDATE
    requests caused the master clock reference point to be re-snapshotted,
    yanking the guest's clock due to arithmetic precision differences.

 3. No precise migration API: the existing KVM_[GS]ET_CLOCK only allows
    setting the clock at a given UTC reference time, which is necessarily
    imprecise. There was no way to preserve the exact arithmetic
    relationship between guest TSC and KVM clock across live migration.

This series addresses all three, and adds new APIs for precise clock
migration and TSC frequency reporting.

Changes since v3:
 - Rebased to v7.1-rc2
 - Split patch 09 (__get_kvmclock fix) into 6 incremental patches per
   Sean's review
 - Split patch 10 (TSC upscaling) into 2 patches per Sean's review
 - Split patch 15 (offset TSCs) into frequency-match vs offset-match
 - Addressed Sean's review: hw_tsc_hz overflow (u64), KVM_VCPU_TSC_SCALE
   gated on has_tsc_control, pvclock_gtod_notifier unregister path,
   kvm_get_time_scale() readability, and many more
 - Incorporated Dongli Zhang's masterclock drift mitigation, reworked as
   a proper deduplication of redundant updates via request clearing under
   the tsc_write_lock
 - Added KVM_VCPU_TSC_EFFECTIVE_FREQ attribute for userspace to populate
   CPUID timing leaves without KVM modifying guest CPUID at runtime
 - Removed runtime Xen TSC CPUID modification (was updating wrong leaf)
 - Added guest-side patches to use CPUID 0x40000010 for TSC frequency
   under both KVM and Xen
 - Selftest covers clock correction at multiple TSC frequencies,
   PVCLOCK_TSC_STABLE_BIT behaviour, and multi-vCPU offset scenarios
 - Fixed RCU splat in KVM_GET_CLOCK_GUEST (needs srcu_read_lock)

The series can be broadly grouped as:

Patches 1-5: Core clock fixes and new KVM_[GS]ET_CLOCK_GUEST API
Patches 6-8: TSC scaling prerequisites
Patches 9-14: Fix get_kvmclock() precision (split per review)
Patches 15-16: Fix kvm_guest_time_update() for TSC upscaling
Patches 17-20: Code cleanup and simplification
Patches 21-22: Allow master clock with offset TSCs
Patches 23-24: Eliminate gratuitous clock updates
Patch 25: Xen runstate negative time fix
Patch 26: Deduplicate redundant masterclock updates
Patches 27-28: TSC frequency reporting for CPUID
Patches 29-30: Guest-side CPUID frequency consumption

David Woodhouse (27):
      KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
      KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
      KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
      KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
      KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
      KVM: x86: WARN if kvm_get_walltime_and_clockread() fails unexpectedly
      KVM: x86: Fold __get_kvmclock() into get_kvmclock()
      KVM: x86: Add WARN and restructure get_kvmclock()
      KVM: x86: Use get_kvmclock_base_ns() as fallback in get_kvmclock()
      KVM: x86: Fix KVM clock precision in get_kvmclock() with TSC scaling
      KVM: x86: Use get_kvmclock() in kvm_get_wall_clock_epoch()
      KVM: x86: Fix compute_guest_tsc() to handle negative time deltas
      KVM: x86: Restructure kvm_guest_time_update() for TSC upscaling
      KVM: x86: Simplify and comment kvm_get_time_scale()
      KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset()
      KVM: x86: Improve synchronization in kvm_synchronize_tsc()
      KVM: x86: Kill last_tsc_{nsec,write,offset} fields
      KVM: x86: Replace nr_vcpus_matched_tsc count with all_vcpus_matched_tsc bool
      KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
      KVM: x86: Factor out kvm_use_master_clock()
      KVM: x86: Avoid gratuitous global clock updates
      KVM: x86/xen: Prevent runstate times from becoming negative
      KVM: x86: Avoid redundant masterclock updates from multiple vCPUs
      KVM: x86: Add KVM_VCPU_TSC_EFFECTIVE_FREQ attribute
      KVM: x86: Remove runtime Xen TSC frequency CPUID update
      x86/kvm: Obtain TSC frequency from CPUID if present
      x86/xen: Obtain TSC frequency from CPUID if present

Jack Allister (3):
      UAPI: x86: Move pvclock-abi to UAPI for x86 platforms
      KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
      KVM: selftests: Add KVM/PV clock selftest to prove timer correction

 Documentation/virt/kvm/api.rst                 |  37 ++
 Documentation/virt/kvm/devices/vcpu.rst        |  69 ++-
 MAINTAINERS                                    |   4 +-
 arch/x86/include/asm/kvm_host.h                |  13 +-
 arch/x86/include/asm/kvm_para.h                |   1 +
 arch/x86/include/uapi/asm/kvm.h                |  12 +
 arch/x86/include/uapi/asm/kvm_para.h           |  11 +
 arch/x86/include/{ => uapi}/asm/pvclock-abi.h  |  27 +-
 arch/x86/kernel/kvm.c                          |  10 +
 arch/x86/kernel/kvmclock.c                     |   7 +-
 arch/x86/kvm/cpuid.c                           |  16 -
 arch/x86/kvm/svm/svm.c                         |   3 +-
 arch/x86/kvm/vmx/vmx.c                         |   2 +-
 arch/x86/kvm/x86.c                             | 735 ++++++++++++++++++-------
 arch/x86/kvm/xen.c                             |  21 +-
 arch/x86/kvm/xen.h                             |  13 -
 arch/x86/xen/time.c                            |  12 +
 include/uapi/linux/kvm.h                       |   3 +
 tools/testing/selftests/kvm/Makefile.kvm       |   1 +
 tools/testing/selftests/kvm/x86/pvclock_test.c | 415 ++++++++++++++
 20 files changed, 1157 insertions(+), 255 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86/pvclock_test.c
 rename arch/x86/include/{asm => uapi/asm}/pvclock-abi.h (82%)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 01/30] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 02/30] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
                   ` (31 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

The KVM clock is an interesting thing. It is defined as "nanoseconds
since the guest was created", but in practice it runs at two *different*
rates — or three different rates, if you count implementation bugs.

Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
of the host, with a delta of kvm->arch.kvmclock_offset.

But that version doesn't actually get used in the common case, where the
host has a reliable TSC and the guest TSCs are all running at the same
rate and in sync with each other, and kvm->arch.use_master_clock is set.

In that common case, definition B is used: There is a reference point in
time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
and a corresponding host TSC value kvm->arch.master_cycle_now. This
fixed point in time is converted to guest units (the time offset by
kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
value) and advertised to the guest in the pvclock structure. While in
this 'use_master_clock' mode, the fixed point in time never needs to be
changed, and the clock runs precisely in time with the guest TSC, at the
rate advertised in the pvclock structure.

The third definition C is implemented in kvm_get_wall_clock_epoch() and
__get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
but converting the *host* TSC cycles directly to a value in nanoseconds
instead of scaling via the guest TSC.

One might naïvely think that all three definitions are identical, since
CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
three are just the result of counting the host TSC at a known frequency,
or the scaled guest TSC at a known precise fraction of the host's
frequency. The problem is with arithmetic precision, and the way that
frequency scaling is done in a division-free way by multiplying by a
scale factor, then shifting right. In practice, all three ways of
calculating the KVM clock will suffer a systemic drift from each other.

Eventually, definition C should just be eliminated. Commit 451a707813ae
("KVM: x86/xen: improve accuracy of Xen timers") worked around it for
the specific case of Xen timers, which are defined in terms of the KVM
clock and suffered from a continually increasing error in timer expiry
times. That commit notes that get_kvmclock_ns() is non-trivial to fix
and says "I'll come back to that", which remains true.

Definitions A and B do need to coexist, the former to handle the case
where the host or guest TSC is suboptimally configured. But KVM should
be more careful about switching between them, and the discontinuity in
guest time which could result.

In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
time as the reference in master_kernel_ns and master_cycle_now, yanking
the guest's clock back to match definition A at that moment.

When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
should probably *adjust* kvm->arch.kvmclock_offset to account for the
drift, instead of yanking the clock back to definition A. But in the
meantime there are a bunch of places where it just doesn't need to be
invoked at all.

To start with: there is no need to do such an update when a Xen guest
populates the shared_info page. This seems to have been a hangover from
the very first implementation of shared_info which automatically
populated the vcpu_info structures at their default locations, but even
then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
expected to explicitly set the vcpu_info even in its default locations,
there's not even any need for that either.

Fixes: 629b5348841a ("KVM: x86/xen: update wallclock region")
Reviewed-by: Paul Durrant <paul@xen.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/xen.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 91fd3673c09a..82e34edbfdbd 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	wc->version = wc_version + 1;
 	read_unlock_irq(&gpc->lock);

-	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
-
 out:
 	srcu_read_unlock(&kvm->srcu, idx);
 	return ret;
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 02/30] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 01/30] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 03/30] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
                   ` (30 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

The kvm_guest_time_update() function scales the host TSC frequency to
the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
scaling ratio previously calculated for that vCPU. Then calculates the
scaling factors for the KVM clock itself based on that guest TSC
frequency.

However, it uses kHz as the unit when scaling, and then multiplies by
1000 only at the end.

With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
the KVM clock advertised to the guest is based on a frequency of
2,499,999,000 Hz.

By using Hz as the unit from the beginning, the KVM clock would be based
on a more accurate frequency of 2,499,999,999 Hz in this example.

Use u64 for the hw_tsc_hz field since an unsigned int would overflow for
TSC frequencies above 4GHz. Use div_u64() for the Xen CPUID leaf to
play nice with 32-bit kernels.

Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock")
Reviewed-by: Paul Durrant <paul@xen.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/cpuid.c            |  2 +-
 arch/x86/kvm/x86.c              | 17 +++++++++--------
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c470e40a00aa..37264212c7df 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -950,7 +950,7 @@ struct kvm_vcpu_arch {
 	gpa_t time;
 	s8  pvclock_tsc_shift;
 	u32 pvclock_tsc_mul;
-	unsigned int hw_tsc_khz;
+	u64 hw_tsc_hz;
 	struct gfn_to_pfn_cache pv_time;
 	/* set guest stopped flag in pvclock flags field */
 	bool pvclock_set_guest_stopped_request;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index e69156b54cff..621d950ec692 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -2131,7 +2131,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
 				*ecx = vcpu->arch.pvclock_tsc_mul;
 				*edx = vcpu->arch.pvclock_tsc_shift;
 			} else if (index == 2) {
-				*eax = vcpu->arch.hw_tsc_khz;
+				*eax = div_u64(vcpu->arch.hw_tsc_hz, 1000);
 			}
 		}
 	} else {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a1b63c63d1a..d9ef165df6a1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3314,7 +3314,8 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
 int kvm_guest_time_update(struct kvm_vcpu *v)
 {
 	struct pvclock_vcpu_time_info hv_clock = {};
-	unsigned long flags, tgt_tsc_khz;
+	unsigned long flags;
+	u64 tgt_tsc_hz;
 	unsigned seq;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	struct kvm_arch *ka = &v->kvm->arch;
@@ -3340,8 +3341,8 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tgt_tsc_khz = get_cpu_tsc_khz();
-	if (unlikely(tgt_tsc_khz == 0)) {
+	tgt_tsc_hz = (u64)get_cpu_tsc_khz() * 1000;
+	if (unlikely(tgt_tsc_hz == 0)) {
 		local_irq_restore(flags);
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 		return 1;
@@ -3376,16 +3377,16 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
 	/* With all the info we got, fill in the values */
 
 	if (kvm_caps.has_tsc_control) {
-		tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
+		tgt_tsc_hz = kvm_scale_tsc(tgt_tsc_hz,
 					    v->arch.l1_tsc_scaling_ratio);
-		tgt_tsc_khz = tgt_tsc_khz ? : 1;
+		tgt_tsc_hz = tgt_tsc_hz ? : 1;
 	}
 
-	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
-		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
+	if (unlikely(vcpu->hw_tsc_hz != tgt_tsc_hz)) {
+		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_hz,
 				   &vcpu->pvclock_tsc_shift,
 				   &vcpu->pvclock_tsc_mul);
-		vcpu->hw_tsc_khz = tgt_tsc_khz;
+		vcpu->hw_tsc_hz = tgt_tsc_hz;
 	}
 
 	hv_clock.tsc_shift = vcpu->pvclock_tsc_shift;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 03/30] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 01/30] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 02/30] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: Jack Allister <jalliste@amazon.com>

A subsequent commit will provide a new KVM interface for performing a
fixup/correction of the KVM clock against the reference TSC. The
KVM_[GS]ET_CLOCK_GUEST API requires a pvclock_vcpu_time_info, as such
the caller must know about this definition.

Move the definition to the UAPI folder so that it is exported to
usermode and also change the type definitions to use the standard for
UAPI exports.

Signed-off-by: Jack Allister <jalliste@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 MAINTAINERS                                   |  4 +--
 arch/x86/include/{ => uapi}/asm/pvclock-abi.h | 27 ++++++++++---------
 2 files changed, 17 insertions(+), 14 deletions(-)
 rename arch/x86/include/{ => uapi}/asm/pvclock-abi.h (82%)

diff --git a/MAINTAINERS b/MAINTAINERS
index e0b307b2108c..e49676955c0c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14406,7 +14406,7 @@ S:	Supported
 T:	git git://git.kernel.org/pub/scm/virt/kvm/kvm.git
 F:	arch/um/include/asm/kvm_para.h
 F:	arch/x86/include/asm/kvm_para.h
-F:	arch/x86/include/asm/pvclock-abi.h
+F:	arch/x86/include/uapi/asm/pvclock-abi.h
 F:	arch/x86/include/uapi/asm/kvm_para.h
 F:	arch/x86/kernel/kvm.c
 F:	arch/x86/kernel/kvmclock.c
@@ -29087,7 +29087,7 @@ R:	Boris Ostrovsky <boris.ostrovsky@oracle.com>
 L:	xen-devel@lists.xenproject.org (moderated for non-subscribers)
 S:	Supported
 F:	arch/x86/configs/xen.config
-F:	arch/x86/include/asm/pvclock-abi.h
+F:	arch/x86/include/uapi/asm/pvclock-abi.h
 F:	arch/x86/include/asm/xen/
 F:	arch/x86/platform/pvh/
 F:	arch/x86/xen/
diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/uapi/asm/pvclock-abi.h
similarity index 82%
rename from arch/x86/include/asm/pvclock-abi.h
rename to arch/x86/include/uapi/asm/pvclock-abi.h
index b9fece5fc96d..6d70cf640362 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/uapi/asm/pvclock-abi.h
@@ -1,6 +1,9 @@
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
 #ifndef _ASM_X86_PVCLOCK_ABI_H
 #define _ASM_X86_PVCLOCK_ABI_H
+
+#include <linux/types.h>
+
 #ifndef __ASSEMBLER__
 
 /*
@@ -24,20 +27,20 @@
  */
 
 struct pvclock_vcpu_time_info {
-	u32   version;
-	u32   pad0;
-	u64   tsc_timestamp;
-	u64   system_time;
-	u32   tsc_to_system_mul;
-	s8    tsc_shift;
-	u8    flags;
-	u8    pad[2];
+	__u32   version;
+	__u32   pad0;
+	__u64   tsc_timestamp;
+	__u64   system_time;
+	__u32   tsc_to_system_mul;
+	__s8    tsc_shift;
+	__u8    flags;
+	__u8    pad[2];
 } __attribute__((__packed__)); /* 32 bytes */
 
 struct pvclock_wall_clock {
-	u32   version;
-	u32   sec;
-	u32   nsec;
+	__u32   version;
+	__u32   sec;
+	__u32   nsec;
 } __attribute__((__packed__));
 
 #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (2 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 03/30] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 05/30] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: Jack Allister <jalliste@amazon.com>

In the common case (where kvm->arch.use_master_clock is true), the KVM
clock is defined as a simple arithmetic function of the guest TSC, based
on a reference point stored in kvm->arch.master_kernel_ns and
kvm->arch.master_cycle_now.

The existing KVM_[GS]ET_CLOCK functionality does not allow for this
relationship to be precisely saved and restored by userspace. All it can
currently do is set the KVM clock at a given UTC reference time, which
is necessarily imprecise.

So on live update, the guest TSC can remain cycle accurate at precisely
the same offset from the host TSC, but there is no way for userspace to
restore the KVM clock accurately.

Even on live migration to a new host, where the accuracy of the guest
time-keeping is fundamentally limited by the accuracy of wallclock
synchronization between the source and destination hosts, the clock jump
experienced by the guest's TSC and its KVM clock should at least be
*consistent*. Even when the guest TSC suffers a discontinuity, its KVM
clock should still remain the *same* arithmetic function of the guest
TSC, and not suffer an *additional* discontinuity.

To allow for accurate migration of the KVM clock, add per-vCPU ioctls
which save and restore the actual PV clock info in
pvclock_vcpu_time_info.

The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
point in time just as kvm_update_masterclock() does, and calculating the
corresponding guest TSC value. This guest TSC value is then passed
through the user-provided pvclock structure to generate the *intended*
KVM clock value at that point in time, and through the *actual* KVM
clock calculation. Then kvm->arch.kvmclock_offset is adjusted to
eliminate the difference.

Where kvm->arch.use_master_clock is false (because the host TSC is
unreliable, or the guest TSCs are configured strangely), the KVM clock
is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
returns an error. In this case, as documented, userspace shall use the
legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
case since the clocks are imprecise in this mode anyway.

On *restoration*, if kvm->arch.use_master_clock is false, an error is
returned for similar reasons and userspace shall fall back to using
KVM_SET_CLOCK. This does mean that, as documented, userspace needs to
use *both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results
with the migration data (unless the intent is to refuse to resume on a
host with bad TSC).

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Jack Allister <jalliste@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
---
 Documentation/virt/kvm/api.rst |  37 ++++++++
 arch/x86/kvm/x86.c             | 151 +++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       |   3 +
 3 files changed, 191 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 52bbbb553ce1..2268b4442df6 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6553,6 +6553,43 @@ KVM_S390_KEYOP_SSKE
   Sets the storage key for the guest address ``guest_addr`` to the key
   specified in ``key``, returning the previous value in ``key``.
 
+4.145 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86_64
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks,
+in precisely the form advertised to the guest vCPU, which gives parameters
+for a direct conversion from a guest TSC value to nanoseconds.
+
+When the KVM clock is not in "master clock" mode, for example because the
+host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
+is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
+In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
+
+4.146 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86_64
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
+pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
+arithmetic relationship between guest TSC and KVM clock to be preserved by
+userspace across migration.
+
+When the KVM clock is not in "master clock" mode, and the KVM clock is actually
+defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
+may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
+choose to fail, denying migration to a host whose TSC is misbehaving.
+
 .. _kvm_run:
 
 5. The kvm_run structure
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d9ef165df6a1..d1327d5fba3f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6205,6 +6205,149 @@ static int kvm_get_reg_list(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+#ifdef CONFIG_X86_64
+static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+	struct pvclock_vcpu_time_info hv_clock = {};
+	struct kvm_vcpu_arch *vcpu = &v->arch;
+	struct kvm_arch *ka = &v->kvm->arch;
+	unsigned int seq;
+
+	/*
+	 * If KVM_REQ_CLOCK_UPDATE is already pending, or if the pvclock
+	 * has never been generated at all, call kvm_guest_time_update().
+	 */
+	if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) || !vcpu->hw_tsc_hz) {
+		int idx = srcu_read_lock(&v->kvm->srcu);
+		int ret = kvm_guest_time_update(v);
+
+		srcu_read_unlock(&v->kvm->srcu, idx);
+		if (ret)
+			return -EINVAL;
+	}
+
+	/*
+	 * Reconstruct the pvclock from the master clock state, matching
+	 * exactly what kvm_guest_time_update() writes to the guest.
+	 */
+	do {
+		seq = read_seqcount_begin(&ka->pvclock_sc);
+
+		if (!ka->use_master_clock)
+			return -EINVAL;
+
+		hv_clock.tsc_timestamp = kvm_read_l1_tsc(v, ka->master_cycle_now);
+		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
+
+	hv_clock.tsc_shift = vcpu->pvclock_tsc_shift;
+	hv_clock.tsc_to_system_mul = vcpu->pvclock_tsc_mul;
+	hv_clock.flags = PVCLOCK_TSC_STABLE_BIT;
+
+	if (copy_to_user(argp, &hv_clock, sizeof(hv_clock)))
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * Reverse the calculation in the hv_clock definition.
+ *
+ * time_ns = ( (cycles << shift) * mul ) >> 32;
+ * (although shift can be negative, so that's bad C)
+ *
+ * So for a single second,
+ * NSEC_PER_SEC = ( ( FREQ_HZ << shift) * mul ) >> 32
+ * NSEC_PER_SEC << 32 = ( FREQ_HZ << shift ) * mul
+ * ( NSEC_PER_SEC << 32 ) / mul = FREQ_HZ << shift
+ * ( NSEC_PER_SEC << 32 ) / mul ) >> shift = FREQ_HZ
+ */
+static u64 hvclock_to_hz(u32 mul, s8 shift)
+{
+	u64 tm = NSEC_PER_SEC << 32;
+
+	/* Maximise precision. Shift right until the top bit is set */
+	tm <<= 2;
+	shift += 2;
+
+	/* While 'mul' is even, increase the shift *after* the division */
+	while (!(mul & 1)) {
+		shift++;
+		mul >>= 1;
+	}
+
+	tm /= mul;
+
+	if (shift > 0)
+		return tm >> shift;
+	else
+		return tm << -shift;
+}
+
+static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+	struct pvclock_vcpu_time_info user_hv_clock;
+	struct kvm *kvm = v->kvm;
+	struct kvm_arch *ka = &kvm->arch;
+	u64 curr_tsc_hz, user_tsc_hz;
+	u64 user_clk_ns;
+	u64 guest_tsc;
+	int rc = 0;
+
+	if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
+		return -EFAULT;
+
+	if (!user_hv_clock.tsc_to_system_mul)
+		return -EINVAL;
+
+	user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
+				    user_hv_clock.tsc_shift);
+
+	kvm_hv_request_tsc_page_update(kvm);
+	kvm_start_pvclock_update(kvm);
+	pvclock_update_vm_gtod_copy(kvm);
+
+	if (!ka->use_master_clock) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	curr_tsc_hz = (u64)get_cpu_tsc_khz() * 1000;
+	if (unlikely(curr_tsc_hz == 0)) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (kvm_caps.has_tsc_control)
+		curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
+					    v->arch.l1_tsc_scaling_ratio);
+
+	/*
+	 * Allow for a discrepancy of 1 kHz either way between the TSC
+	 * frequency used to generate the user's pvclock and the current
+	 * host's measured frequency, since they may not precisely match.
+	 */
+	if (user_tsc_hz < curr_tsc_hz - 1000 ||
+	    user_tsc_hz > curr_tsc_hz + 1000) {
+		rc = -ERANGE;
+		goto out;
+	}
+
+	/*
+	 * Calculate the guest TSC at the new reference point, and the
+	 * corresponding KVM clock value according to user_hv_clock.
+	 * Adjust kvmclock_offset so both definitions agree.
+	 */
+	guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
+	user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
+	ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
+
+out:
+	kvm_end_pvclock_update(kvm);
+	return rc;
+}
+#endif
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 			 unsigned int ioctl, unsigned long arg)
 {
@@ -6605,6 +6748,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		srcu_read_unlock(&vcpu->kvm->srcu, idx);
 		break;
 	}
+#ifdef CONFIG_X86_64
+	case KVM_SET_CLOCK_GUEST:
+		r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
+		break;
+	case KVM_GET_CLOCK_GUEST:
+		r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
+		break;
+#endif
 #ifdef CONFIG_KVM_HYPERV
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6c8afa2047bf..9b50191b859c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1669,4 +1669,7 @@ struct kvm_pre_fault_memory {
 	__u64 padding[5];
 };
 
+#define KVM_SET_CLOCK_GUEST	_IOW(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST	_IOR(KVMIO, 0xd7, struct pvclock_vcpu_time_info)
+
 #endif /* __LINUX_KVM_H */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 05/30] KVM: selftests: Add KVM/PV clock selftest to prove timer correction
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (3 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 06/30] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: Jack Allister <jalliste@amazon.com>

A VM's KVM/PV clock has an inherent relationship to its TSC. When either
the host system live-updates or the VM is live-migrated this pairing of
the two clock sources should stay the same. In reality this is not the
case without some correction taking place.

The KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST ioctls can be used to
perform a correction on the PVTI (PV time information) structure held by
KVM to effectively fix up the kvmclock_offset prior to the guest VM
resuming in either a live-update/migration scenario.

This test proves that without the necessary fixup there is a perceived
change in the guest TSC and KVM/PV clock relationship before and after a
simulated LU/LM takes place, and that the correction eliminates it.

The test:
  1. Snapshots the PVTI at boot (PVTI0).
  2. Induces a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
  3. Snapshots the PVTI after the change (PVTI1).
  4. Requests correction via KVM_SET_CLOCK_GUEST using PVTI0.
  5. Snapshots the PVTI after correction (PVTI2).

Then samples the TSC at a single point in time and calculates the KVM
clock using each PVTI snapshot. The corrected clock should match the
boot clock to within ±1ns.

The test enumerates multiple TSC frequencies from 1GHz to 5GHz at 500MHz
steps, crossing the 32-bit boundary, to exercise the scaling path at
various ratios. The sleep duration between snapshots is configurable via
the -s/--sleep command line option.

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Jack Allister <jalliste@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/x86/pvclock_test.c  | 415 ++++++++++++++++++
 2 files changed, 416 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86/pvclock_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 9118a5a51b89..fb935ae3bf38 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -105,6 +105,7 @@ TEST_GEN_PROGS_x86 += x86/pmu_counters_test
 TEST_GEN_PROGS_x86 += x86/pmu_event_filter_test
 TEST_GEN_PROGS_x86 += x86/private_mem_conversions_test
 TEST_GEN_PROGS_x86 += x86/private_mem_kvm_exits_test
+TEST_GEN_PROGS_x86 += x86/pvclock_test
 TEST_GEN_PROGS_x86 += x86/set_boot_cpu_id
 TEST_GEN_PROGS_x86 += x86/set_sregs_test
 TEST_GEN_PROGS_x86 += x86/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86/pvclock_test.c b/tools/testing/selftests/kvm/x86/pvclock_test.c
new file mode 100644
index 000000000000..1a3d52923c71
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/pvclock_test.c
@@ -0,0 +1,415 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © Amazon.com, Inc. or its affiliates.
+ *
+ * Tests for pvclock API
+ * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
+ */
+#include <getopt.h>
+#include <stdint.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+#include <asm/pvclock-abi.h>
+
+/*
+ * Reproduce the pvclock calculation the guest uses to convert TSC to
+ * nanoseconds. This must match the kernel's __pvclock_read_cycles().
+ */
+static inline uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul,
+					   int8_t shift)
+{
+	if (shift < 0)
+		delta >>= -shift;
+	else
+		delta <<= shift;
+	return ((__uint128_t)delta * mul) >> 32;
+}
+
+static inline uint64_t pvclock_read_cycles(struct pvclock_vcpu_time_info *src,
+					   uint64_t tsc)
+{
+	uint64_t delta = tsc - src->tsc_timestamp;
+
+	return src->system_time + pvclock_scale_delta(delta,
+						      src->tsc_to_system_mul,
+						      src->tsc_shift);
+}
+
+enum {
+	STAGE_FIRST_BOOT,
+	STAGE_UNCORRECTED,
+	STAGE_CORRECTED
+};
+
+#define KVMCLOCK_GPA	0xc0000000ull
+#define KVMCLOCK_SIZE	sizeof(struct pvclock_vcpu_time_info)
+
+static void trigger_pvti_update(void)
+{
+	/*
+	 * Toggle between KVM's old and new system time methods to coerce KVM
+	 * into updating the fields in the PV time info struct.
+	 */
+	wrmsr(MSR_KVM_SYSTEM_TIME, KVMCLOCK_GPA | KVM_MSR_ENABLED);
+	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, KVMCLOCK_GPA | KVM_MSR_ENABLED);
+}
+
+static void guest_code(void)
+{
+	struct pvclock_vcpu_time_info *pvti =
+		(void *)(unsigned long)KVMCLOCK_GPA;
+	struct pvclock_vcpu_time_info pvti_boot;
+	struct pvclock_vcpu_time_info pvti_uncorrected;
+	struct pvclock_vcpu_time_info pvti_corrected;
+	uint64_t tsc_guest;
+	uint64_t clk_boot, clk_uncorrected, clk_corrected;
+	int64_t delta_corrected;
+
+	/* Set up kvmclock and snapshot the initial pvclock parameters. */
+	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, KVMCLOCK_GPA | KVM_MSR_ENABLED);
+	pvti_boot = *pvti;
+	GUEST_SYNC(STAGE_FIRST_BOOT);
+
+	/*
+	 * Trigger an update of the PVTI. Calculating the KVM clock using this
+	 * updated structure will show a delta from the original.
+	 */
+	trigger_pvti_update();
+	pvti_uncorrected = *pvti;
+	GUEST_SYNC(STAGE_UNCORRECTED);
+
+	/*
+	 * Snapshot the corrected time (the host does KVM_SET_CLOCK_GUEST when
+	 * handling STAGE_UNCORRECTED).
+	 */
+	pvti_corrected = *pvti;
+
+	/*
+	 * Sample the TSC at a single point in time, then calculate the
+	 * effective KVM clock using the PVTI from each stage. Verify that the
+	 * corrected clock matches the boot clock to within ±1ns.
+	 */
+	tsc_guest = rdtsc();
+
+	clk_boot = pvclock_read_cycles(&pvti_boot, tsc_guest);
+	clk_uncorrected = pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
+	clk_corrected = pvclock_read_cycles(&pvti_corrected, tsc_guest);
+
+	delta_corrected = clk_boot - clk_corrected;
+
+	__GUEST_ASSERT(delta_corrected >= -2 && delta_corrected <= 2,
+		       "corrected delta %ld out of range (boot=%lu uncorrected=%lu corrected=%lu)",
+		       delta_corrected, clk_boot, clk_uncorrected, clk_corrected);
+
+	GUEST_SYNC(STAGE_CORRECTED);
+}
+
+static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu,
+		     unsigned int sleep_sec)
+{
+	struct pvclock_vcpu_time_info pvti_before;
+	struct ucall uc;
+
+	for (;;) {
+		vcpu_run(vcpu);
+		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		case UCALL_SYNC:
+			break;
+		default:
+			TEST_FAIL("Unexpected ucall");
+		}
+
+		switch (uc.args[1]) {
+		case STAGE_FIRST_BOOT:
+			/* Save the pvclock parameters before the update. */
+			vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti_before);
+
+			/* Sleep to let the clocks diverge. */
+			sleep(sleep_sec);
+			break;
+
+		case STAGE_UNCORRECTED:
+			/* Restore the original pvclock parameters. */
+			vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti_before);
+			break;
+
+		case STAGE_CORRECTED:
+			/* Guest verified the delta in-guest. */
+			return;
+
+		default:
+			TEST_FAIL("Unknown stage %lu", uc.args[1]);
+		}
+	}
+}
+
+static void configure_pvclock(struct kvm_vm *vm)
+{
+	unsigned int nr_pages;
+
+	nr_pages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    KVMCLOCK_GPA, 1, nr_pages, 0);
+	virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, nr_pages);
+}
+
+static void run_at_frequency(uint64_t tsc_khz, unsigned int sleep_sec)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	pr_info("Testing at TSC frequency %lu kHz\n", tsc_khz);
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+	configure_pvclock(vm);
+	vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
+	run_test(vm, vcpu, sleep_sec);
+	kvm_vm_release(vm);
+}
+
+static void test_tsc_stable_bit(void);
+static void test_clock_guest_with_offsets(void);
+
+static void usage(const char *name)
+{
+	printf("Usage: %s [options]\n"
+	       "  -s, --sleep SEC     sleep duration between snapshots (default: 2)\n"
+	       "  -h, --help          show this help\n", name);
+}
+
+int main(int argc, char *argv[])
+{
+	static const struct option long_opts[] = {
+		{ "sleep", required_argument, NULL, 's' },
+		{ "help",  no_argument,       NULL, 'h' },
+		{ NULL,    0,                  NULL,  0  },
+	};
+	unsigned int sleep_sec = 2;
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	uint64_t host_khz;
+	uint64_t freq;
+	int opt;
+
+	while ((opt = getopt_long(argc, argv, "s:h", long_opts, NULL)) != -1) {
+		switch (opt) {
+		case 's':
+			sleep_sec = atoi(optarg);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+			return opt == 'h' ? 0 : 1;
+		}
+	}
+
+	TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+	configure_pvclock(vm);
+
+	/* First run at native frequency (no scaling). */
+	run_test(vm, vcpu, sleep_sec);
+
+	/*
+	 * Then enumerate a range of TSC frequencies crossing the 32-bit
+	 * boundary, to exercise the scaling path at various ratios.
+	 */
+	host_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
+	kvm_vm_release(vm);
+
+	for (freq = 1000000; freq <= 5000000; freq += 500000) {
+		if (freq == host_khz)
+			continue;
+		run_at_frequency(freq, sleep_sec);
+	}
+
+	test_tsc_stable_bit();
+	test_clock_guest_with_offsets();
+
+	return 0;
+}
+
+static void guest_code_stable_bit(void)
+{
+	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, KVMCLOCK_GPA | KVM_MSR_ENABLED);
+	GUEST_SYNC(0);
+	GUEST_SYNC(0);
+	GUEST_SYNC(0);
+}
+
+static void set_tsc_offset(struct kvm_vcpu *vcpu, uint64_t offset)
+{
+	struct kvm_device_attr attr = {
+		.group = KVM_VCPU_TSC_CTRL,
+		.attr = KVM_VCPU_TSC_OFFSET,
+		.addr = (__u64)(uintptr_t)&offset,
+	};
+	vcpu_ioctl(vcpu, KVM_SET_DEVICE_ATTR, &attr);
+}
+
+static void run_vcpu_once(struct kvm_vcpu *vcpu)
+{
+	struct ucall uc;
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+		break;
+	case UCALL_SYNC:
+		break;
+	default:
+		TEST_FAIL("Unexpected ucall");
+	}
+}
+
+static void test_tsc_stable_bit(void)
+{
+	struct pvclock_vcpu_time_info pvti;
+	struct kvm_vcpu *vcpus[2];
+	struct kvm_vm *vm;
+	int ret;
+
+	pr_info("Testing PVCLOCK_TSC_STABLE_BIT with matched/unmatched TSCs\n");
+
+	vm = vm_create_with_vcpus(2, guest_code_stable_bit, vcpus);
+	configure_pvclock(vm);
+
+	/*
+	 * Case 1: All TSCs matched (same frequency and offset).
+	 * Master clock should be active, PVCLOCK_TSC_STABLE_BIT set.
+	 */
+	run_vcpu_once(vcpus[0]);
+
+	ret = __vcpu_ioctl(vcpus[0], KVM_GET_CLOCK_GUEST, &pvti);
+	TEST_ASSERT(!ret, "GET_CLOCK_GUEST should succeed with matched TSCs");
+	TEST_ASSERT(pvti.flags & PVCLOCK_TSC_STABLE_BIT,
+		    "PVCLOCK_TSC_STABLE_BIT should be set with matched TSCs");
+
+	/*
+	 * Case 2: Different TSC offset, same frequency.
+	 * Master clock should still be active (frequency matches), but
+	 * PVCLOCK_TSC_STABLE_BIT should be cleared (offsets differ).
+	 */
+	set_tsc_offset(vcpus[1], 12345678);
+	run_vcpu_once(vcpus[1]);
+	run_vcpu_once(vcpus[0]);
+
+	ret = __vcpu_ioctl(vcpus[0], KVM_GET_CLOCK_GUEST, &pvti);
+	if (ret) {
+		/* Master clock disabled by offset mismatch — old kernel */
+		pr_info("  Skipping offset tests (master clock requires matched offsets)\n");
+		goto out_stable;
+	}
+	TEST_ASSERT(!(pvti.flags & PVCLOCK_TSC_STABLE_BIT),
+		    "PVCLOCK_TSC_STABLE_BIT should be clear with offset-mismatched TSCs");
+
+	/*
+	 * Case 3: Different TSC frequency.
+	 * Master clock should be disabled entirely.
+	 */
+	vcpu_ioctl(vcpus[1], KVM_SET_TSC_KHZ,
+		   (void *)(unsigned long)(__vcpu_ioctl(vcpus[1], KVM_GET_TSC_KHZ, NULL) / 2));
+	/* Write TSC to trigger kvm_synchronize_tsc / kvm_track_tsc_matching */
+	vcpu_set_msr(vcpus[1], MSR_IA32_TSC, 0);
+	run_vcpu_once(vcpus[1]);
+
+	ret = __vcpu_ioctl(vcpus[0], KVM_GET_CLOCK_GUEST, &pvti);
+	TEST_ASSERT(ret && errno == EINVAL,
+		    "GET_CLOCK_GUEST should fail with frequency-mismatched TSCs, got %d (errno %d)",
+		    ret, errno);
+
+out_stable:
+	kvm_vm_release(vm);
+}
+
+static void test_clock_guest_with_offsets(void)
+{
+	struct pvclock_vcpu_time_info pvti0, pvti1, pvti1_after;
+	struct kvm_vcpu *vcpus[2];
+	struct kvm_vm *vm;
+	int64_t delta;
+	int ret;
+
+	pr_info("Testing KVM_[GS]ET_CLOCK_GUEST with different TSC offsets\n");
+
+	vm = vm_create_with_vcpus(2, guest_code_stable_bit, vcpus);
+	configure_pvclock(vm);
+
+	/* Set different TSC offsets on the two vCPUs */
+	set_tsc_offset(vcpus[0], 0);
+	set_tsc_offset(vcpus[1], 1000000000ull);
+
+	/* Run both to establish kvmclock */
+	run_vcpu_once(vcpus[0]);
+	run_vcpu_once(vcpus[1]);
+
+	/* GET_CLOCK_GUEST on both — should succeed (master clock active) */
+	ret = __vcpu_ioctl(vcpus[0], KVM_GET_CLOCK_GUEST, &pvti0);
+	if (ret) {
+		pr_info("  Skipping (master clock requires matched offsets on this kernel)\n");
+		kvm_vm_release(vm);
+		return;
+	}
+	ret = __vcpu_ioctl(vcpus[1], KVM_GET_CLOCK_GUEST, &pvti1);
+	TEST_ASSERT(!ret, "GET_CLOCK_GUEST on vcpu1 failed");
+
+	/* The tsc_timestamps should differ (different offsets) */
+	TEST_ASSERT(pvti0.tsc_timestamp != pvti1.tsc_timestamp,
+		    "tsc_timestamps should differ with different offsets");
+
+	/* Sleep to let time elapse, then restore vcpu0's clock */
+	sleep(1);
+	vcpu_ioctl(vcpus[0], KVM_SET_CLOCK_GUEST, &pvti0);
+
+	/* Run vcpu0 to process the clock update */
+	run_vcpu_once(vcpus[0]);
+
+	/* GET_CLOCK_GUEST on vcpu1 — should reflect the correction */
+	ret = __vcpu_ioctl(vcpus[1], KVM_GET_CLOCK_GUEST, &pvti1_after);
+	TEST_ASSERT(!ret, "GET_CLOCK_GUEST on vcpu1 after SET failed");
+
+	/*
+	 * After SET on vcpu0, verify the correction worked by getting
+	 * the clock on vcpu0 again. The mul/shift should be the same,
+	 * and computing kvmclock at the same TSC should give the same
+	 * result as the original (within ±2ns).
+	 */
+	{
+		struct pvclock_vcpu_time_info pvti0_after;
+		uint64_t tsc_now, clk_from_old, clk_from_new;
+
+		ret = __vcpu_ioctl(vcpus[0], KVM_GET_CLOCK_GUEST, &pvti0_after);
+		TEST_ASSERT(!ret, "GET_CLOCK_GUEST on vcpu0 after SET failed");
+
+		tsc_now = pvti0_after.tsc_timestamp;
+		clk_from_old = pvclock_read_cycles(&pvti0, tsc_now);
+		clk_from_new = pvclock_read_cycles(&pvti0_after, tsc_now);
+
+		delta = (int64_t)clk_from_new - (int64_t)clk_from_old;
+		TEST_ASSERT(delta >= -2 && delta <= 2,
+			    "clock correction delta should be <=2ns, got %ld ns",
+			    delta);
+	}
+
+	/*
+	 * Also verify that vcpu1's clock is still accessible (master
+	 * clock still active with different offsets).
+	 */
+	ret = __vcpu_ioctl(vcpus[1], KVM_GET_CLOCK_GUEST, &pvti1_after);
+	TEST_ASSERT(!ret, "GET_CLOCK_GUEST on vcpu1 after SET failed");
+
+	kvm_vm_release(vm);
+}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 06/30] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (4 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 05/30] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 07/30] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

KVM does make an attempt to cope with non-constant TSC, and has
notifiers to handle host TSC frequency changes. However, it *only*
adjusts the KVM clock, and doesn't adjust TSC frequency scaling when
the host changes.

This is presumably because non-constant TSCs were fixed in hardware
long before TSC scaling was implemented, so there should never be real
CPUs which have TSC scaling but *not* CONSTANT_TSC.

Such a combination could potentially happen in some odd L1 nesting
environment, but it isn't worth trying to support it. Just make the
dependency explicit.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/svm/svm.c | 3 ++-
 arch/x86/kvm/vmx/vmx.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index e7fdd7a9c280..7817752533fe 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5546,7 +5546,8 @@ static __init int svm_hardware_setup(void)
 				     XFEATURE_MASK_BNDCSR);
 
 	if (tsc_scaling) {
-		if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+		if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR) ||
+		    !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
 			tsc_scaling = false;
 		} else {
 			pr_info("TSC scaling supported\n");
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5c2c33a5f7dc..4f6035d72bbe 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8696,7 +8696,7 @@ __init int vmx_hardware_setup(void)
 	if (!enable_apicv || !cpu_has_vmx_ipiv())
 		enable_ipiv = false;
 
-	if (cpu_has_vmx_tsc_scaling())
+	if (cpu_has_vmx_tsc_scaling() && boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
 		kvm_caps.has_tsc_control = true;
 
 	kvm_caps.max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 07/30] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (5 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 06/30] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 08/30] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
inadequate. It ignores TSC scaling, and ignores the fact that the host
TSC may differ from one host to the next (and in fact because of the way
the kernel calibrates it, it generally differs from one boot to the next
even on the same hardware).

Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
and attempt to document the process that userspace needs to follow to
preserve the TSC across migration.

Only enumerate KVM_VCPU_TSC_SCALE when kvm_caps.has_tsc_control is true,
since the scaling ratio is only meaningful when hardware TSC scaling is
supported.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 Documentation/virt/kvm/devices/vcpu.rst | 36 ++++++++++++++++++++++++-
 arch/x86/include/uapi/asm/kvm.h         |  6 +++++
 arch/x86/kvm/x86.c                      | 22 +++++++++++++++
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 5e3805820010..56562b932280 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -243,7 +243,10 @@ Returns:
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
 TSC is then derived by the following equation:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = ((host_tsc * tsc_scale_ratio) >> tsc_scale_bits) + KVM_VCPU_TSC_OFFSET
+
+The values of tsc_scale_ratio and tsc_scale_bits can be obtained using
+the KVM_VCPU_TSC_SCALE attribute.
 
 This attribute is useful to adjust the guest's TSC on live migration,
 so that the TSC counts the time during which the VM was paused. The
@@ -292,3 +295,34 @@ From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_SCALE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Parameters: struct kvm_vcpu_tsc_scale
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading the provided parameter
+		 address.
+	 -ENXIO  Attribute not supported (no TSC scaling)
+	 -EINVAL Invalid request to write the attribute
+	 ======= ======================================
+
+This read-only attribute reports the guest's TSC scaling factor, in the form
+of a fixed-point number represented by the following structure::
+
+  struct kvm_vcpu_tsc_scale {
+	__u64 tsc_ratio;
+	__u64 tsc_frac_bits;
+  };
+
+The tsc_frac_bits field indicates the location of the fixed point, such that
+host TSC values are converted to guest TSC using the formula:
+
+  guest_tsc = ((host_tsc * tsc_ratio) >> tsc_frac_bits) + offset
+
+Userspace can use this to precisely calculate the guest TSC from the host
+TSC at any given moment. This is needed for accurate migration of guests,
+as described in the documentation for the KVM_VCPU_TSC_OFFSET attribute.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5f2b30d0405c..384be9a53395 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -961,6 +961,12 @@ struct kvm_hyperv_eventfd {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_SCALE  1 /* attribute for TSC scaling factor */
+
+struct kvm_vcpu_tsc_scale {
+	__u64 tsc_ratio;
+	__u64 tsc_frac_bits;
+};
 
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	_BITULL(0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d1327d5fba3f..2179ea2da8e0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5930,6 +5930,9 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 	case KVM_VCPU_TSC_OFFSET:
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_SCALE:
+		r = kvm_caps.has_tsc_control ? 0 : -ENXIO;
+		break;
 	default:
 		r = -ENXIO;
 	}
@@ -5950,6 +5953,22 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_SCALE: {
+		struct kvm_vcpu_tsc_scale scale;
+
+		if (!kvm_caps.has_tsc_control) {
+			r = -ENXIO;
+			break;
+		}
+
+		scale.tsc_ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		scale.tsc_frac_bits = kvm_caps.tsc_scaling_ratio_frac_bits;
+		r = -EFAULT;
+		if (copy_to_user(uaddr, &scale, sizeof(scale)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5989,6 +6008,9 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_SCALE:
+		r = -EINVAL; /* Read only */
+		break;
 	default:
 		r = -ENXIO;
 	}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 08/30] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (6 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 07/30] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 09/30] KVM: x86: WARN if kvm_get_walltime_and_clockread() fails unexpectedly David Woodhouse
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") did so only for 64-bit hosts, by capturing the boot offset from
within the existing clocksource notifier update_pvclock_gtod().

That notifier was added in commit 16e8d74d2da9 ("KVM: x86: notifier for
clocksource changes") but only on x86_64, because its original purpose
was just to disable the "master clock" mode which is only supported on
x86_64.

Now that the notifier is used for more than disabling master clock mode,
enable it for the 32-bit build too so that get_kvmclock_base_ns() can be
unaffected by NTP sync on 32-bit too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/x86.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2179ea2da8e0..1e1834533e98 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2342,7 +2342,6 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 	return kvm_set_msr_ignored_check(vcpu, index, *data, true);
 }
 
-#ifdef CONFIG_X86_64
 struct pvclock_clock {
 	int vclock_mode;
 	u64 cycle_last;
@@ -2400,13 +2399,6 @@ static s64 get_kvmclock_base_ns(void)
 	/* Count up from boot time, but with the frequency of the raw clock.  */
 	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
 }
-#else
-static s64 get_kvmclock_base_ns(void)
-{
-	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
-	return ktime_get_boottime_ns();
-}
-#endif
 
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs)
 {
@@ -10160,6 +10152,7 @@ static void pvclock_irq_work_fn(struct irq_work *w)
 }
 
 static DEFINE_IRQ_WORK(pvclock_irq_work, pvclock_irq_work_fn);
+#endif
 
 /*
  * Notification about pvclock gtod data update.
@@ -10167,26 +10160,26 @@ static DEFINE_IRQ_WORK(pvclock_irq_work, pvclock_irq_work_fn);
 static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 			       void *priv)
 {
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	struct timekeeper *tk = priv;
 
 	update_pvclock_gtod(tk);
 
+#ifdef CONFIG_X86_64
 	/*
 	 * Disable master clock if host does not trust, or does not use,
 	 * TSC based clocksource. Delegate queue_work() to irq_work as
 	 * this is invoked with tk_core.seq write held.
 	 */
-	if (!gtod_is_based_on_tsc(gtod->clock.vclock_mode) &&
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode) &&
 	    atomic_read(&kvm_guest_has_master_clock) != 0)
 		irq_work_queue(&pvclock_irq_work);
+#endif
 	return 0;
 }
 
 static struct notifier_block pvclock_gtod_notifier = {
 	.notifier_call = pvclock_gtod_notify,
 };
-#endif
 
 void kvm_setup_xss_caps(void)
 {
@@ -10375,9 +10368,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 
 	if (pi_inject_timer == -1)
 		pi_inject_timer = housekeeping_enabled(HK_TYPE_TIMER);
-#ifdef CONFIG_X86_64
 	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
 
+#ifdef CONFIG_X86_64
 	if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
 		set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
 #endif
@@ -10434,8 +10427,8 @@ void kvm_x86_vendor_exit(void)
 					    CPUFREQ_TRANSITION_NOTIFIER);
 		cpuhp_remove_state_nocalls(CPUHP_AP_X86_KVM_CLK_ONLINE);
 	}
-#ifdef CONFIG_X86_64
 	pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
+#ifdef CONFIG_X86_64
 	irq_work_sync(&pvclock_irq_work);
 	cancel_work_sync(&pvclock_gtod_work);
 #endif
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 09/30] KVM: x86: WARN if kvm_get_walltime_and_clockread() fails unexpectedly
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (7 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 08/30] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 10/30] KVM: x86: Fold __get_kvmclock() into get_kvmclock() David Woodhouse
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

The master clock depends on the pvclock being based on TSC, so the only
way kvm_get_walltime_and_clockread() can fail when use_master_clock is
true is if the clocksource changed and use_master_clock is stale, in
which case a seqcount retry should be pending.

Add a WARN_ON_ONCE if the seqcount has not been invalidated, to catch
any case where the failure is genuinely unexpected.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e1834533e98..ccdfd3fa3402 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3474,7 +3474,8 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 		local_tsc_khz = get_cpu_tsc_khz();
 
 		if (local_tsc_khz &&
-		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
+		    !kvm_get_walltime_and_clockread(&ts, &host_tsc) &&
+		    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
 			local_tsc_khz = 0; /* Fall back to old method */
 
 		put_cpu();
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 10/30] KVM: x86: Fold __get_kvmclock() into get_kvmclock()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (8 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 09/30] KVM: x86: WARN if kvm_get_walltime_and_clockread() fails unexpectedly David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 11/30] KVM: x86: Add WARN and restructure get_kvmclock() David Woodhouse
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

There is no need for the separate __get_kvmclock() helper; just inline
its body into get_kvmclock() within the seqcount retry loop.

No functional change.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 63 +++++++++++++++++++++-------------------------
 1 file changed, 28 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ccdfd3fa3402..6f660c3210ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3200,50 +3200,43 @@ static unsigned long get_cpu_tsc_khz(void)
 		return __this_cpu_read(cpu_tsc_khz);
 }
 
-/* Called within read_seqcount_begin/retry for kvm->pvclock_sc.  */
-static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
+static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 {
 	struct kvm_arch *ka = &kvm->arch;
 	struct pvclock_vcpu_time_info hv_clock;
+	unsigned int seq;
 
-	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
-	get_cpu();
+	do {
+		seq = read_seqcount_begin(&ka->pvclock_sc);
 
-	data->flags = 0;
-	if (ka->use_master_clock &&
-	    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
+		/* both __this_cpu_read() and rdtsc() should be on the same cpu */
+		get_cpu();
+
+		data->flags = 0;
+		if (ka->use_master_clock &&
+		    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
 #ifdef CONFIG_X86_64
-		struct timespec64 ts;
+			struct timespec64 ts;
 
-		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
-			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
-			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
-		} else
+			if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
+				data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
+				data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
+			} else
 #endif
-		data->host_tsc = rdtsc();
-
-		data->flags |= KVM_CLOCK_TSC_STABLE;
-		hv_clock.tsc_timestamp = ka->master_cycle_now;
-		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
-		kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
-				   &hv_clock.tsc_shift,
-				   &hv_clock.tsc_to_system_mul);
-		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
-	} else {
-		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
-	}
-
-	put_cpu();
-}
-
-static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
-{
-	struct kvm_arch *ka = &kvm->arch;
-	unsigned seq;
+			data->host_tsc = rdtsc();
+
+			data->flags |= KVM_CLOCK_TSC_STABLE;
+			hv_clock.tsc_timestamp = ka->master_cycle_now;
+			hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+			kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
+					   &hv_clock.tsc_shift,
+					   &hv_clock.tsc_to_system_mul);
+			data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
+		} else {
+			data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+		}
 
-	do {
-		seq = read_seqcount_begin(&ka->pvclock_sc);
-		__get_kvmclock(kvm, data);
+		put_cpu();
 	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 11/30] KVM: x86: Add WARN and restructure get_kvmclock()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (9 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 10/30] KVM: x86: Fold __get_kvmclock() into get_kvmclock() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 12/30] KVM: x86: Use get_kvmclock_base_ns() as fallback in get_kvmclock() David Woodhouse
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Add the same WARN_ON_ONCE for unexpected kvm_get_walltime_and_clockread()
failure as in kvm_get_wall_clock_epoch().

Move get/put_cpu inside the use_master_clock branch since they are only
needed there (for RDTSC and get_cpu_tsc_khz() to be on the same CPU).

Also simplify the use_master_clock condition: the open-coded
CONSTANT_TSC || cpu_tsc_khz check is unnecessary since use_master_clock
can only be true when the TSC is usable.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6f660c3210ee..9b395c00ccf2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3209,21 +3209,27 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 	do {
 		seq = read_seqcount_begin(&ka->pvclock_sc);
 
-		/* both __this_cpu_read() and rdtsc() should be on the same cpu */
-		get_cpu();
-
 		data->flags = 0;
-		if (ka->use_master_clock &&
-		    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
+		if (ka->use_master_clock) {
 #ifdef CONFIG_X86_64
 			struct timespec64 ts;
+#endif
+			/*
+			 * The RDTSC and get_cpu_tsc_khz() must happen on
+			 * the same CPU.
+			 */
+			get_cpu();
 
+#ifdef CONFIG_X86_64
 			if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
 				data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
 				data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
-			} else
-#endif
+			} else if (WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq))) {
+				data->host_tsc = rdtsc();
+			}
+#else
 			data->host_tsc = rdtsc();
+#endif
 
 			data->flags |= KVM_CLOCK_TSC_STABLE;
 			hv_clock.tsc_timestamp = ka->master_cycle_now;
@@ -3232,11 +3238,11 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 					   &hv_clock.tsc_shift,
 					   &hv_clock.tsc_to_system_mul);
 			data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
+
+			put_cpu();
 		} else {
 			data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
 		}
-
-		put_cpu();
 	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 12/30] KVM: x86: Use get_kvmclock_base_ns() as fallback in get_kvmclock()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (10 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 11/30] KVM: x86: Add WARN and restructure get_kvmclock() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 13/30] KVM: x86: Fix KVM clock precision in get_kvmclock() with TSC scaling David Woodhouse
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

If kvm_get_walltime_and_clockread() fails unexpectedly (the WARN case),
restart the seqcount loop rather than falling back to a raw rdtsc()
which would set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME.

That code path could never actually be reached in practice: on 64-bit
hosts, use_master_clock can only be true when the clocksource is
TSC-based, and on 32-bit hosts, use_master_clock is never set. But
the fallback to raw rdtsc() was misleading and the resulting flags
combination was inconsistent.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b395c00ccf2..f2653eaccdf8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3225,7 +3225,8 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 				data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
 				data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
 			} else if (WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq))) {
-				data->host_tsc = rdtsc();
+				put_cpu();
+				continue;
 			}
 #else
 			data->host_tsc = rdtsc();
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 13/30] KVM: x86: Fix KVM clock precision in get_kvmclock() with TSC scaling
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (11 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 12/30] KVM: x86: Use get_kvmclock_base_ns() as fallback in get_kvmclock() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 14/30] KVM: x86: Use get_kvmclock() in kvm_get_wall_clock_epoch() David Woodhouse
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

When in master clock mode, the KVM clock is defined in terms of the
guest TSC. But get_kvmclock() was computing it from the host TSC
without applying TSC scaling, leading to a systemic drift from the
values the guest computes from its own TSC.

Store the VM's TSC scaling ratio in kvm_arch and precompute the
guest-TSC-based mul/shift in pvclock_update_vm_gtod_copy(). Use these
in get_kvmclock() to scale the host TSC delta to guest TSC before
converting to nanoseconds.

This avoids "definition C" of the KVM clock described in the
earlier commit "KVM: x86/xen: Do not corrupt KVM clock in
kvm_xen_shared_info_init()".

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  4 +++
 arch/x86/kvm/x86.c              | 50 +++++++++++++++++++++++++++++----
 2 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37264212c7df..5348fd5ea3f3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1490,6 +1490,7 @@ struct kvm_arch {
 	u64 last_tsc_write;
 	u32 last_tsc_khz;
 	u64 last_tsc_offset;
+	u64 last_tsc_scaling_ratio;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
@@ -1504,6 +1505,9 @@ struct kvm_arch {
 	bool use_master_clock;
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
+	u64 master_tsc_scaling_ratio;
+	s8  master_tsc_shift;
+	u32 master_tsc_mul;
 
 #ifdef CONFIG_KVM_HYPERV
 	struct kvm_hv hyperv;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f2653eaccdf8..09b00906b1de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2781,6 +2781,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm->arch.last_tsc_write = tsc;
 	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
 	kvm->arch.last_tsc_offset = offset;
+	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
 
 	vcpu->arch.last_guest_tsc = tsc;
 
@@ -3109,6 +3110,8 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
  *
  */
 
+static unsigned long get_cpu_tsc_khz(void);
+
 static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 {
 #ifdef CONFIG_X86_64
@@ -3132,9 +3135,28 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 				&& !ka->backwards_tsc_observed
 				&& !ka->boot_vcpu_runs_old_kvmclock;
 
-	if (ka->use_master_clock)
+	if (ka->use_master_clock) {
+		u64 tsc_hz;
+
 		atomic_set(&kvm_guest_has_master_clock, 1);
 
+		/*
+		 * Copy the scaling ratio and precompute the mul/shift for
+		 * converting guest TSC to nanoseconds. These are used by
+		 * get_kvmclock() to compute kvmclock from the host TSC
+		 * without needing a vCPU reference.
+		 */
+		ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
+		tsc_hz = (u64)get_cpu_tsc_khz() * 1000;
+		if (tsc_hz && kvm_caps.has_tsc_control)
+			tsc_hz = kvm_scale_tsc(tsc_hz,
+					       ka->master_tsc_scaling_ratio);
+		if (tsc_hz)
+			kvm_get_time_scale(NSEC_PER_SEC, tsc_hz,
+					   &ka->master_tsc_shift,
+					   &ka->master_tsc_mul);
+	}
+
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
 	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
 					vcpus_matched);
@@ -3235,10 +3257,28 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 			data->flags |= KVM_CLOCK_TSC_STABLE;
 			hv_clock.tsc_timestamp = ka->master_cycle_now;
 			hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
-			kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
-					   &hv_clock.tsc_shift,
-					   &hv_clock.tsc_to_system_mul);
-			data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
+
+			/*
+			 * Use the precomputed guest-TSC-based mul/shift
+			 * so that the kvmclock value matches what the
+			 * guest computes from its own TSC.
+			 */
+			hv_clock.tsc_shift = ka->master_tsc_shift;
+			hv_clock.tsc_to_system_mul = ka->master_tsc_mul;
+
+			if (kvm_caps.has_tsc_control) {
+				u64 tsc_delta = data->host_tsc - ka->master_cycle_now;
+
+				tsc_delta = kvm_scale_tsc(tsc_delta,
+							  ka->master_tsc_scaling_ratio);
+				data->clock = hv_clock.system_time +
+					pvclock_scale_delta(tsc_delta,
+							    hv_clock.tsc_to_system_mul,
+							    hv_clock.tsc_shift);
+			} else {
+				data->clock = __pvclock_read_cycles(&hv_clock,
+								    data->host_tsc);
+			}
 
 			put_cpu();
 		} else {
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 14/30] KVM: x86: Use get_kvmclock() in kvm_get_wall_clock_epoch()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (12 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 13/30] KVM: x86: Fix KVM clock precision in get_kvmclock() with TSC scaling David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 15/30] KVM: x86: Fix compute_guest_tsc() to handle negative time deltas David Woodhouse
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Now that get_kvmclock() correctly handles TSC scaling and captures both
wallclock and kvmclock from the same TSC reading,
kvm_get_wall_clock_epoch()
can simply call it instead of duplicating the pvclock computation.

This eliminates the last instance of the "definition C" kvmclock
calculation that computed nanoseconds directly from the host TSC
without accounting for guest TSC scaling.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 58 +++++++---------------------------------------
 1 file changed, 8 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 09b00906b1de..2bbc2c7ac449 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3491,60 +3491,18 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
  */
 uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 {
-#ifdef CONFIG_X86_64
-	struct pvclock_vcpu_time_info hv_clock;
-	struct kvm_arch *ka = &kvm->arch;
-	unsigned long seq, local_tsc_khz;
-	struct timespec64 ts;
-	uint64_t host_tsc;
-
-	do {
-		seq = read_seqcount_begin(&ka->pvclock_sc);
-
-		local_tsc_khz = 0;
-		if (!ka->use_master_clock)
-			break;
-
-		/*
-		 * The TSC read and the call to get_cpu_tsc_khz() must happen
-		 * on the same CPU.
-		 */
-		get_cpu();
-
-		local_tsc_khz = get_cpu_tsc_khz();
-
-		if (local_tsc_khz &&
-		    !kvm_get_walltime_and_clockread(&ts, &host_tsc) &&
-		    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
-			local_tsc_khz = 0; /* Fall back to old method */
-
-		put_cpu();
-
-		/*
-		 * These values must be snapshotted within the seqcount loop.
-		 * After that, it's just mathematics which can happen on any
-		 * CPU at any time.
-		 */
-		hv_clock.tsc_timestamp = ka->master_cycle_now;
-		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+	struct kvm_clock_data data;
 
-	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
+	get_kvmclock(kvm, &data);
 
 	/*
-	 * If the conditions were right, and obtaining the wallclock+TSC was
-	 * successful, calculate the KVM clock at the corresponding time and
-	 * subtract one from the other to get the guest's epoch in nanoseconds
-	 * since 1970-01-01.
+	 * If get_kvmclock() captured both wallclock and kvmclock from the
+	 * same TSC reading, use them for a precise epoch calculation.
 	 */
-	if (local_tsc_khz) {
-		kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * NSEC_PER_USEC,
-				   &hv_clock.tsc_shift,
-				   &hv_clock.tsc_to_system_mul);
-		return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
-			__pvclock_read_cycles(&hv_clock, host_tsc);
-	}
-#endif
-	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
+	if (data.flags & KVM_CLOCK_REALTIME)
+		return data.realtime - data.clock;
+
+	return ktime_get_real_ns() - data.clock;
 }
 
 /*
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 15/30] KVM: x86: Fix compute_guest_tsc() to handle negative time deltas
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (13 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 14/30] KVM: x86: Use get_kvmclock() in kvm_get_wall_clock_epoch() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 16/30] KVM: x86: Restructure kvm_guest_time_update() for TSC upscaling David Woodhouse
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

The compute_guest_tsc() function computes the guest TSC at a given
kernel_ns timestamp. When the master clock reference point
(master_kernel_ns) is earlier than vcpu->arch.this_tsc_nsec, the delta
is negative. Since pvclock_scale_delta() takes a u64, the negative
value wraps to a huge positive number, producing a wildly wrong result.

Change the return type to s64 and handle negative deltas explicitly by
negating the delta, scaling it, and subtracting from this_tsc_write.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2bbc2c7ac449..e281c49561fa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2586,13 +2586,23 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 	return set_tsc_khz(vcpu, user_tsc_khz, use_scaling);
 }
 
-static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
+static s64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
-	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
-				      vcpu->arch.virtual_tsc_mult,
-				      vcpu->arch.virtual_tsc_shift);
-	tsc += vcpu->arch.this_tsc_write;
-	return tsc;
+	s64 delta_ns = kernel_ns - vcpu->arch.this_tsc_nsec;
+	u64 tsc;
+
+	/* Handle negative deltas gracefully (master clock ref may be earlier) */
+	if (delta_ns < 0) {
+		tsc = pvclock_scale_delta(-delta_ns,
+					  vcpu->arch.virtual_tsc_mult,
+					  vcpu->arch.virtual_tsc_shift);
+		return vcpu->arch.this_tsc_write - tsc;
+	}
+
+	tsc = pvclock_scale_delta(delta_ns,
+				  vcpu->arch.virtual_tsc_mult,
+				  vcpu->arch.virtual_tsc_shift);
+	return vcpu->arch.this_tsc_write + tsc;
 }
 
 #ifdef CONFIG_X86_64
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 16/30] KVM: x86: Restructure kvm_guest_time_update() for TSC upscaling
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (14 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 15/30] KVM: x86: Fix compute_guest_tsc() to handle negative time deltas David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 17/30] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Restructure kvm_guest_time_update() so that kernel_ns/host_tsc are
always "now" when doing TSC catchup, then swap in the master clock
reference values afterward for the hv_clock.

This makes the TSC upscaling code considerably simpler: the catchup
adjustment is computed as the delta between what the guest TSC *should*
be at "now" and what it actually is, rather than mixing "now" and
"master clock reference" timestamps.

The seqcount loop now also contains the kvm_get_time_and_clockread()
call (matching get_kvmclock's pattern), with the same WARN for
unexpected failure.

Based on a suggestion by Sean Christopherson.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 67 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e281c49561fa..8e4993ef4f6b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3363,39 +3363,51 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
 	struct kvm_arch *ka = &v->kvm->arch;
 	s64 kernel_ns;
 	u64 tsc_timestamp, host_tsc;
+	u64 master_host_tsc = 0;
+	s64 master_kernel_ns = 0;
 	bool use_master_clock;
 
-	kernel_ns = 0;
-	host_tsc = 0;
-
 	/*
 	 * If the host uses TSC clock, then passthrough TSC as stable
 	 * to the guest.
 	 */
 	do {
 		seq = read_seqcount_begin(&ka->pvclock_sc);
+
 		use_master_clock = ka->use_master_clock;
-		if (use_master_clock) {
-			host_tsc = ka->master_cycle_now;
-			kernel_ns = ka->master_kernel_ns;
-		}
+
+		/*
+		 * The TSC read and the call to get_cpu_tsc_khz() must happen
+		 * on the same CPU.
+		 */
+		get_cpu();
+
+		tgt_tsc_hz = (u64)get_cpu_tsc_khz() * 1000;
+
+		if (use_master_clock &&
+		    !kvm_get_time_and_clockread(&kernel_ns, &host_tsc) &&
+		    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
+			use_master_clock = false;
+
+		put_cpu();
+
+		if (!use_master_clock)
+			break;
+
+		master_host_tsc = ka->master_cycle_now;
+		master_kernel_ns = ka->master_kernel_ns;
 	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
 
-	/* Keep irq disabled to prevent changes to the clock */
-	local_irq_save(flags);
-	tgt_tsc_hz = (u64)get_cpu_tsc_khz() * 1000;
 	if (unlikely(tgt_tsc_hz == 0)) {
-		local_irq_restore(flags);
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 		return 1;
 	}
+
 	if (!use_master_clock) {
 		host_tsc = rdtsc();
 		kernel_ns = get_kvmclock_base_ns();
 	}
 
-	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
-
 	/*
 	 * We may have to catch up the TSC to match elapsed wall clock
 	 * time for two reasons, even if kvmclock is used.
@@ -3404,17 +3416,32 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
 	 *      entry to avoid unknown leaps of TSC even when running
 	 *      again on the same CPU.  This may cause apparent elapsed
 	 *      time to disappear, and the guest to stand still or run
-	 *	very slowly.
+	 *      very slowly.
 	 */
 	if (vcpu->tsc_catchup) {
-		u64 tsc = compute_guest_tsc(v, kernel_ns);
-		if (tsc > tsc_timestamp) {
-			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
-			tsc_timestamp = tsc;
-		}
+		s64 adjustment;
+
+		/*
+		 * Calculate the delta between what the guest TSC *should* be
+		 * and what it actually is according to kvm_read_l1_tsc().
+		 */
+		adjustment = compute_guest_tsc(v, kernel_ns) -
+			     kvm_read_l1_tsc(v, host_tsc);
+		if (adjustment > 0)
+			adjust_tsc_offset_guest(v, adjustment);
 	}
 
-	local_irq_restore(flags);
+	/*
+	 * Now that TSC upscaling is out of the way, the remaining calculations
+	 * are all relative to the reference time that's placed in hv_clock.
+	 * If the master clock is NOT in use, the reference time is "now".  If
+	 * master clock is in use, the reference time comes from there.
+	 */
+	if (use_master_clock) {
+		host_tsc = master_host_tsc;
+		kernel_ns = master_kernel_ns;
+	}
+	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
 
 	/* With all the info we got, fill in the values */
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 17/30] KVM: x86: Simplify and comment kvm_get_time_scale()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (15 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 16/30] KVM: x86: Restructure kvm_guest_time_update() for TSC upscaling David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 18/30] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

The kvm_get_time_scale() function was entirely opaque. Add comments
explaining what it does: compute a fixed-point multiplier and shift for
converting TSC ticks to nanoseconds via pvclock_scale_delta().

Rename the local variables from the cryptic tps64/tps32/scaled64 to
base_hz_u64/base32/scaled_hz_u64 to make the code self-documenting.
The "tps32" name stood for "Ticks Per Second" but was misleading since
it held the shifted base frequency, not a tick count.

No functional change.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/x86.c | 55 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 40 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e4993ef4f6b..980fc22ee05b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2472,32 +2472,57 @@ static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
 	return dividend;
 }
 
-static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
+static void kvm_get_time_scale(u64 scaled_hz, u64 base_hz,
 			       s8 *pshift, u32 *pmultiplier)
 {
-	uint64_t scaled64;
-	int32_t  shift = 0;
-	uint64_t tps64;
-	uint32_t tps32;
+	u64 scaled_hz_u64 = scaled_hz;
+	s32 shift = 0;
+	u64 base_hz_u64;
+	u32 base32;
 
-	tps64 = base_hz;
-	scaled64 = scaled_hz;
-	while (tps64 > scaled64*2 || tps64 & 0xffffffff00000000ULL) {
-		tps64 >>= 1;
+	/*
+	 * This function calculates a fixed-point multiplier and shift such
+	 * that:
+	 *   time_ns = (tsc_cycles << shift) * multiplier >> 32
+	 *
+	 * Where tsc_cycles tick at base_hz, and time_ns should count at
+	 * scaled_hz (typically NSEC_PER_SEC for a TSC→nanoseconds conversion).
+	 *
+	 * The multiplier is: (scaled_hz << 32) / base_hz, adjusted by shift
+	 * to keep everything in range.
+	 */
+
+	base_hz_u64 = base_hz;
+
+	/*
+	 * Start by shifting base_hz right until it fits in 32 bits, and
+	 * is lower than double the target rate. This introduces a negative
+	 * shift value which would result in pvclock_scale_delta() shifting
+	 * the actual tick count right before performing the multiplication.
+	 */
+	while (base_hz_u64 > scaled_hz_u64 * 2 || base_hz_u64 >> 32) {
+		base_hz_u64 >>= 1;
 		shift--;
 	}
 
-	tps32 = (uint32_t)tps64;
-	while (tps32 <= scaled64 || scaled64 & 0xffffffff00000000ULL) {
-		if (scaled64 & 0xffffffff00000000ULL || tps32 & 0x80000000)
-			scaled64 >>= 1;
+	/* Now the shifted base_hz fits in 32 bits. */
+	base32 = (u32)base_hz_u64;
+
+	/*
+	 * Next, shift scaled_hz right until it fits in 32 bits, and ensure
+	 * that the shifted base_hz is not larger (so that the result of the
+	 * final division also fits in 32 bits).
+	 */
+	while (base32 <= scaled_hz_u64 || scaled_hz_u64 >> 32) {
+		if (scaled_hz_u64 >> 32 || base32 & BIT(31))
+			scaled_hz_u64 >>= 1;
 		else
-			tps32 <<= 1;
+			base32 <<= 1;
 		shift++;
 	}
 
 	*pshift = shift;
-	*pmultiplier = div_frac(scaled64, tps32);
+	*pmultiplier = div_frac(scaled_hz_u64, base32);
 }
 
 #ifdef CONFIG_X86_64
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 18/30] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (16 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 17/30] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 19/30] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Let the callers pass the host TSC value in as an explicit parameter.
This leaves some fairly obviously stupid code, which is using this
function to compare the guest TSC at some *other* time, with the
newly-minted TSC value from rdtsc(). Unless it's being used to measure
*elapsed* time, that isn't very sensible.

In this case, "obviously stupid" is an improvement over being
non-obviously so.

No functional change intended.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/x86.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 980fc22ee05b..c9cbebd6a92a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2693,11 +2693,12 @@ u64 kvm_scale_tsc(u64 tsc, u64 ratio)
 	return _tsc;
 }
 
-static u64 kvm_compute_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
+static u64 kvm_compute_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 host_tsc,
+				     u64 target_tsc)
 {
 	u64 tsc;
 
-	tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio);
+	tsc = kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);
 
 	return target_tsc - tsc;
 }
@@ -2859,7 +2860,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
+	offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data);
 	ns = get_kvmclock_base_ns();
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
@@ -2908,7 +2909,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 		} else {
 			u64 delta = nsec_to_cycles(vcpu, elapsed);
 			data += delta;
-			offset = kvm_compute_l1_tsc_offset(vcpu, data);
+			offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data);
 		}
 		matched = true;
 	}
@@ -4143,7 +4144,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (msr_info->host_initiated) {
 			kvm_synchronize_tsc(vcpu, &data);
 		} else if (!vcpu->arch.guest_tsc_protected) {
-			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
+			u64 adj = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data) -
+				  vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
 			vcpu->arch.ia32_tsc_adjust_msr += adj;
 		}
@@ -5267,7 +5269,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 			mark_tsc_unstable("KVM discovered backwards TSC");
 
 		if (kvm_check_tsc_unstable()) {
-			u64 offset = kvm_compute_l1_tsc_offset(vcpu,
+			u64 offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(),
 						vcpu->arch.last_guest_tsc);
 			kvm_vcpu_write_tsc_offset(vcpu, offset);
 			if (!vcpu->arch.guest_tsc_protected)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 19/30] KVM: x86: Improve synchronization in kvm_synchronize_tsc()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (17 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 18/30] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 20/30] KVM: x86: Kill last_tsc_{nsec,write,offset} fields David Woodhouse
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

When synchronizing to an existing TSC (either by explicitly writing
zero, or the legacy hack where the TSC is written within one second's
worth of the previously written TSC), the last_tsc_write and
last_tsc_nsec values were being misrecorded by __kvm_synchronize_tsc().
The *unsynchronized* value of the TSC (perhaps even zero) was being
recorded, along with the current time at which kvm_synchronize_tsc()
was called. This could cause *subsequent* writes to fail to synchronize
correctly.

Fix that by resetting {data, ns} to the previous values before passing
them to __kvm_synchronize_tsc() when synchronization is detected.
Except in the case where the TSC is unstable and *has* to be synthesised
from the host clock, in which case attempt to create a nsec/tsc pair
which is on the correct line.

Furthermore, there were *three* different TSC reads used for calculating
the "current" time, all slightly different from each other. Fix that by
using kvm_get_time_and_clockread() where possible and using the same
host_tsc value in all cases.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c9cbebd6a92a..097df58749c3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -203,6 +203,9 @@ module_param(mitigate_smt_rsb, bool, 0444);
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
  * returns to userspace, i.e. the kernel can run with the guest's value.
  */
+#ifdef CONFIG_X86_64
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
+#endif
 #define KVM_MAX_NR_USER_RETURN_MSRS 16
 
 struct kvm_user_return_msrs {
@@ -2854,14 +2857,22 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 {
 	u64 data = user_value ? *user_value : 0;
 	struct kvm *kvm = vcpu->kvm;
-	u64 offset, ns, elapsed;
+	u64 offset, host_tsc, ns, elapsed;
 	unsigned long flags;
 	bool matched = false;
 	bool synchronizing = false;
 
+#ifdef CONFIG_X86_64
+	if (!kvm_get_time_and_clockread(&ns, &host_tsc))
+#endif
+	{
+		ns = get_kvmclock_base_ns();
+		host_tsc = rdtsc();
+	}
+
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data);
-	ns = get_kvmclock_base_ns();
+
+	offset = kvm_compute_l1_tsc_offset(vcpu, host_tsc, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
@@ -2904,12 +2915,25 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
          */
 	if (synchronizing &&
 	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
+		/*
+		 * If synchronizing, the "last written" TSC value/time
+		 * recorded by __kvm_synchronize_tsc() should not change
+		 * (i.e. should be precisely the same as the existing
+		 * generation).
+		 */
+		data = kvm->arch.last_tsc_write;
+
 		if (!kvm_check_tsc_unstable()) {
 			offset = kvm->arch.cur_tsc_offset;
+			ns = kvm->arch.cur_tsc_nsec;
 		} else {
+			/*
+			 * ...unless the TSC is unstable and has to be
+			 * synthesised from the host clock in nanoseconds.
+			 */
 			u64 delta = nsec_to_cycles(vcpu, elapsed);
 			data += delta;
-			offset = kvm_compute_l1_tsc_offset(vcpu, rdtsc(), data);
+			offset = kvm_compute_l1_tsc_offset(vcpu, host_tsc, data);
 		}
 		matched = true;
 	}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 20/30] KVM: x86: Kill last_tsc_{nsec,write,offset} fields
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (18 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 19/30] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 21/30] KVM: x86: Replace nr_vcpus_matched_tsc count with all_vcpus_matched_tsc bool David Woodhouse
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

These pointlessly duplicate the cur_tsc_{nsec,write,offset} values.
The only place they were used was where the TSC is stable and a new
vCPU is being synchronized to the previous setting, in which case the
cur_tsc_* value is definitely identical.

Rename last_tsc_khz and last_tsc_scaling_ratio to cur_tsc_khz and
cur_tsc_scaling_ratio respectively, since they are properties of the
current TSC generation.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/include/asm/kvm_host.h |  7 ++-----
 arch/x86/kvm/x86.c              | 32 ++++++++++++++------------------
 2 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5348fd5ea3f3..59298a8f78eb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1486,11 +1486,8 @@ struct kvm_arch {
 	 * preemption-disabled region, so it must be a raw spinlock.
 	 */
 	raw_spinlock_t tsc_write_lock;
-	u64 last_tsc_nsec;
-	u64 last_tsc_write;
-	u32 last_tsc_khz;
-	u64 last_tsc_offset;
-	u64 last_tsc_scaling_ratio;
+	u32 cur_tsc_khz;
+	u64 cur_tsc_scaling_ratio;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 097df58749c3..3c68d2a4c8d0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2813,14 +2813,12 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 		vcpu->kvm->arch.user_set_tsc = true;
 
 	/*
-	 * We also track th most recent recorded KHZ, write and time to
-	 * allow the matching interval to be extended at each write.
+	 * Track the TSC frequency, scaling ratio, and offset for the current
+	 * generation. These are used to detect matching TSC writes and to
+	 * compute the guest TSC from the host clock.
 	 */
-	kvm->arch.last_tsc_nsec = ns;
-	kvm->arch.last_tsc_write = tsc;
-	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
-	kvm->arch.last_tsc_offset = offset;
-	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
+	kvm->arch.cur_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	kvm->arch.cur_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
 
 	vcpu->arch.last_guest_tsc = tsc;
 
@@ -2833,8 +2831,6 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 		 * nanosecond time, offset, and write, so if TSCs are in
 		 * sync, we can match exact offset, and if not, we can match
 		 * exact software computation in compute_guest_tsc()
-		 *
-		 * These values are tracked in kvm->arch.cur_xxx variables.
 		 */
 		kvm->arch.cur_tsc_generation++;
 		kvm->arch.cur_tsc_nsec = ns;
@@ -2873,7 +2869,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 
 	offset = kvm_compute_l1_tsc_offset(vcpu, host_tsc, data);
-	elapsed = ns - kvm->arch.last_tsc_nsec;
+	elapsed = ns - kvm->arch.cur_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
 		if (data == 0) {
@@ -2883,7 +2879,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 			 */
 			synchronizing = true;
 		} else if (kvm->arch.user_set_tsc) {
-			u64 tsc_exp = kvm->arch.last_tsc_write +
+			u64 tsc_exp = kvm->arch.cur_tsc_write +
 						nsec_to_cycles(vcpu, elapsed);
 			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
 			/*
@@ -2914,14 +2910,14 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
 	 * it's better to try to match offsets from the beginning.
          */
 	if (synchronizing &&
-	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
+	    vcpu->arch.virtual_tsc_khz == kvm->arch.cur_tsc_khz) {
 		/*
 		 * If synchronizing, the "last written" TSC value/time
 		 * recorded by __kvm_synchronize_tsc() should not change
 		 * (i.e. should be precisely the same as the existing
 		 * generation).
 		 */
-		data = kvm->arch.last_tsc_write;
+		data = kvm->arch.cur_tsc_write;
 
 		if (!kvm_check_tsc_unstable()) {
 			offset = kvm->arch.cur_tsc_offset;
@@ -3206,7 +3202,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 		 * get_kvmclock() to compute kvmclock from the host TSC
 		 * without needing a vCPU reference.
 		 */
-		ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
+		ka->master_tsc_scaling_ratio = ka->cur_tsc_scaling_ratio;
 		tsc_hz = (u64)get_cpu_tsc_khz() * 1000;
 		if (tsc_hz && kvm_caps.has_tsc_control)
 			tsc_hz = kvm_scale_tsc(tsc_hz,
@@ -6075,8 +6071,8 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
 
 		matched = (vcpu->arch.virtual_tsc_khz &&
-			   kvm->arch.last_tsc_khz == vcpu->arch.virtual_tsc_khz &&
-			   kvm->arch.last_tsc_offset == offset);
+			   kvm->arch.cur_tsc_khz == vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.cur_tsc_offset == offset);
 
 		tsc = kvm_scale_tsc(rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
 		ns = get_kvmclock_base_ns();
@@ -13520,8 +13516,8 @@ int kvm_arch_enable_virtualization_cpu(void)
 			 * you may have some problem.  Solving this issue is
 			 * left as an exercise to the reader.
 			 */
-			kvm->arch.last_tsc_nsec = 0;
-			kvm->arch.last_tsc_write = 0;
+			kvm->arch.cur_tsc_nsec = 0;
+			kvm->arch.cur_tsc_write = 0;
 		}
 
 	}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 21/30] KVM: x86: Replace nr_vcpus_matched_tsc count with all_vcpus_matched_tsc bool
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (19 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 20/30] KVM: x86: Kill last_tsc_{nsec,write,offset} fields David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 22/30] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Using a count and comparing with kvm->online_vcpus was always racy
because a new vCPU could be created while kvm_track_tsc_matching() was
running and comparing with kvm->online_vcpus. That variable is only
atomic with respect to itself; kvm_arch_vcpu_create() runs before
kvm->online_vcpus is incremented for the new vCPU.

Replace the count with a boolean that is set in kvm_track_tsc_matching()
after comparing the count, and cleared when a new TSC generation starts.
The boolean is consumed by pvclock_update_vm_gtod_copy() under the
tsc_write_lock, which serializes against __kvm_synchronize_tsc().

Keep the count for now as it's still used in the trace event.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 13 +++++++------
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 59298a8f78eb..eb81f90284ba 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1492,6 +1492,7 @@ struct kvm_arch {
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
 	u64 cur_tsc_generation;
+	bool all_vcpus_matched_tsc;
 	int nr_vcpus_matched_tsc;
 
 	u32 default_tsc_khz;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3c68d2a4c8d0..b74fd8b088ad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2648,11 +2648,12 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 
 	/*
 	 * To use the masterclock, the host clocksource must be based on TSC
-	 * and all vCPUs must have matching TSCs.  Note, the count for matching
-	 * vCPUs doesn't include the reference vCPU, hence "+1".
+	 * and all vCPUs must have matching TSCs.
 	 */
-	bool use_master_clock = (ka->nr_vcpus_matched_tsc + 1 ==
-				 atomic_read(&vcpu->kvm->online_vcpus)) &&
+	ka->all_vcpus_matched_tsc = (ka->nr_vcpus_matched_tsc + 1 ==
+				     atomic_read(&vcpu->kvm->online_vcpus));
+
+	bool use_master_clock = ka->all_vcpus_matched_tsc &&
 				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
 
 	/*
@@ -2837,6 +2838,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 		kvm->arch.cur_tsc_write = tsc;
 		kvm->arch.cur_tsc_offset = offset;
 		kvm->arch.nr_vcpus_matched_tsc = 0;
+		kvm->arch.all_vcpus_matched_tsc = false;
 	} else if (vcpu->arch.this_tsc_generation != kvm->arch.cur_tsc_generation) {
 		kvm->arch.nr_vcpus_matched_tsc++;
 	}
@@ -3176,8 +3178,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	bool host_tsc_clocksource, vcpus_matched;
 
 	lockdep_assert_held(&kvm->arch.tsc_write_lock);
-	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
-			atomic_read(&kvm->online_vcpus));
+	vcpus_matched = ka->all_vcpus_matched_tsc;
 
 	/*
 	 * If the host uses TSC clock, then passthrough TSC as stable
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 22/30] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (20 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 21/30] KVM: x86: Replace nr_vcpus_matched_tsc count with all_vcpus_matched_tsc bool David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 23/30] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

There is no reason why the KVM clock cannot be in masterclock mode when
the TSCs are not in sync, as long as they are at the same *frequency*.

Running at a different frequency would lead to a systemic skew between
the clock(s) as observed by different vCPUs due to arithmetic precision
in the scaling. So that should indeed force the clock to be based on the
host's CLOCK_MONOTONIC_RAW instead of being in masterclock mode where it
is defined by the guest TSC.

But when the vCPUs merely have a different TSC *offset*, that's not a
problem. The offset is applied to that vCPU's kvmclock->tsc_timestamp
field, and it all comes out in the wash.

Track frequency matching separately from full TSC matching. Use
frequency match for master clock eligibility, and full TSC match
(including offset) only for PVCLOCK_TSC_STABLE_BIT, which tells the
guest it is safe to skip cross-vCPU monotonicity enforcement.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 27 +++++++++++++++++++++------
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index eb81f90284ba..c770c63087cb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1493,6 +1493,7 @@ struct kvm_arch {
 	u64 cur_tsc_offset;
 	u64 cur_tsc_generation;
 	bool all_vcpus_matched_tsc;
+	bool all_vcpus_matched_freq;
 	int nr_vcpus_matched_tsc;
 
 	u32 default_tsc_khz;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b74fd8b088ad..d36d03b8268e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2647,13 +2647,22 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 
 	/*
-	 * To use the masterclock, the host clocksource must be based on TSC
-	 * and all vCPUs must have matching TSCs.
+	 * Track whether all vCPUs have matching TSC offsets (for
+	 * PVCLOCK_TSC_STABLE_BIT) and matching frequencies (for
+	 * master clock eligibility).
 	 */
 	ka->all_vcpus_matched_tsc = (ka->nr_vcpus_matched_tsc + 1 ==
 				     atomic_read(&vcpu->kvm->online_vcpus));
+	if (ka->all_vcpus_matched_tsc)
+		ka->all_vcpus_matched_freq = true;
 
-	bool use_master_clock = ka->all_vcpus_matched_tsc &&
+	/*
+	 * To use the masterclock, the host clocksource must be based on TSC
+	 * and all vCPUs must have matching TSC *frequency*. Different offsets
+	 * are fine — each vCPU's pvclock has its own tsc_timestamp that
+	 * accounts for its offset.
+	 */
+	bool use_master_clock = ka->all_vcpus_matched_freq &&
 				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
 
 	/*
@@ -2817,7 +2826,13 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	 * Track the TSC frequency, scaling ratio, and offset for the current
 	 * generation. These are used to detect matching TSC writes and to
 	 * compute the guest TSC from the host clock.
+	 *
+	 * If the frequency changed, master clock mode can no longer be used
+	 * since the kvmclock scaling factors differ between vCPUs.
 	 */
+	if (vcpu->arch.virtual_tsc_khz != kvm->arch.cur_tsc_khz)
+		kvm->arch.all_vcpus_matched_freq = false;
+
 	kvm->arch.cur_tsc_khz = vcpu->arch.virtual_tsc_khz;
 	kvm->arch.cur_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
 
@@ -3178,7 +3193,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	bool host_tsc_clocksource, vcpus_matched;
 
 	lockdep_assert_held(&kvm->arch.tsc_write_lock);
-	vcpus_matched = ka->all_vcpus_matched_tsc;
+	vcpus_matched = ka->all_vcpus_matched_freq;
 
 	/*
 	 * If the host uses TSC clock, then passthrough TSC as stable
@@ -3513,7 +3528,7 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
 
 	/* If the host uses TSC clocksource, then it is stable */
 	hv_clock.flags = 0;
-	if (use_master_clock)
+	if (use_master_clock && ka->all_vcpus_matched_tsc)
 		hv_clock.flags |= PVCLOCK_TSC_STABLE_BIT;
 
 	if (vcpu->pv_time.active) {
@@ -6340,7 +6355,7 @@ static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
 
 	hv_clock.tsc_shift = vcpu->pvclock_tsc_shift;
 	hv_clock.tsc_to_system_mul = vcpu->pvclock_tsc_mul;
-	hv_clock.flags = PVCLOCK_TSC_STABLE_BIT;
+	hv_clock.flags = ka->all_vcpus_matched_tsc ? PVCLOCK_TSC_STABLE_BIT : 0;
 
 	if (copy_to_user(argp, &hv_clock, sizeof(hv_clock)))
 		return -EFAULT;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 23/30] KVM: x86: Factor out kvm_use_master_clock()
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (21 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 22/30] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 24/30] KVM: x86: Avoid gratuitous global clock updates David Woodhouse
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Both kvm_track_tsc_matching() and pvclock_update_vm_gtod_copy() make a
decision about whether the KVM clock should be in master clock mode.
They used *different* criteria for the decision though. This isn't
really a problem; it only has the potential to cause unnecessary
invocations of KVM_REQ_MASTERCLOCK_UPDATE if the masterclock was
disabled due to TSC going backwards, or the guest using the old MSR.
But it isn't pretty.

Factor the decision out to a single function. And document the
historical reason why it's disabled for guests that use the old
MSR_KVM_SYSTEM_TIME.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/x86.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d36d03b8268e..0656d901fe79 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2640,6 +2640,27 @@ static inline bool gtod_is_based_on_tsc(int mode)
 }
 #endif
 
+static bool kvm_use_master_clock(struct kvm *kvm)
+{
+	struct kvm_arch *ka = &kvm->arch;
+
+	/*
+	 * The 'old kvmclock' check is a workaround (from 2015) for a
+	 * SUSE 2.6.16 kernel that didn't boot if the system_time in
+	 * its kvmclock was too far behind the current time. So the
+	 * mode of just setting the reference point and allowing time
+	 * to proceed linearly from there makes it fail to boot.
+	 * Despite that being kind of the *point* of the way the clock
+	 * is exposed to the guest. By coincidence, the offending
+	 * kernels used the old MSR_KVM_SYSTEM_TIME, which was moved
+	 * only because it resided in the wrong number range. So the
+	 * workaround is activated for *all* guests using the old MSR.
+	 */
+	return ka->all_vcpus_matched_freq &&
+		!ka->backwards_tsc_observed &&
+		!ka->boot_vcpu_runs_old_kvmclock;
+}
+
 static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 {
 #ifdef CONFIG_X86_64
@@ -2662,7 +2683,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 	 * are fine — each vCPU's pvclock has its own tsc_timestamp that
 	 * accounts for its offset.
 	 */
-	bool use_master_clock = ka->all_vcpus_matched_freq &&
+	bool use_master_clock = kvm_use_master_clock(vcpu->kvm) &&
 				gtod_is_based_on_tsc(gtod->clock.vclock_mode);
 
 	/*
@@ -3190,10 +3211,9 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
-	bool host_tsc_clocksource, vcpus_matched;
+	bool host_tsc_clocksource;
 
 	lockdep_assert_held(&kvm->arch.tsc_write_lock);
-	vcpus_matched = ka->all_vcpus_matched_freq;
 
 	/*
 	 * If the host uses TSC clock, then passthrough TSC as stable
@@ -3203,9 +3223,8 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
 
-	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-				&& !ka->backwards_tsc_observed
-				&& !ka->boot_vcpu_runs_old_kvmclock;
+	ka->use_master_clock = host_tsc_clocksource &&
+				kvm_use_master_clock(kvm);
 
 	if (ka->use_master_clock) {
 		u64 tsc_hz;
@@ -3231,7 +3250,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
 	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
-					vcpus_matched);
+					ka->all_vcpus_matched_freq);
 #endif
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 24/30] KVM: x86: Avoid gratuitous global clock updates
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (22 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 23/30] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 25/30] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Eliminate three sources of unnecessary KVM_REQ_GLOBAL_CLOCK_UPDATE:

1. kvm_write_system_time(): The global clock update was a workaround for
   ever-drifting clocks based on the host's CLOCK_MONOTONIC subject to
   NTP skew. On booting or resuming a guest, it just leads to running
   kvm_guest_time_update() twice for each vCPU for no good reason. Use
   KVM_REQ_CLOCK_UPDATE on the vCPU itself, and only when the clock is
   being enabled, not disabled.

2. kvm_arch_vcpu_load(): Use KVM_REQ_CLOCK_UPDATE instead of
   KVM_REQ_GLOBAL_CLOCK_UPDATE. There is no need to update all vCPUs'
   clocks when one vCPU is loaded.

3. kvm_gen_kvmclock_update(): Skip the periodic global update entirely
   when in master clock mode, since the clock is defined precisely by
   the guest TSC and doesn't drift.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/x86.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0656d901fe79..7d9ec0638d28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2457,13 +2457,13 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
 	}
 
 	vcpu->arch.time = system_time;
-	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
 
 	/* we verify if the enable bit is set... */
-	if (system_time & 1)
+	if (system_time & 1) {
 		kvm_gpc_activate(&vcpu->arch.pv_time, system_time & ~1ULL,
 				 sizeof(struct pvclock_vcpu_time_info));
-	else
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+	} else
 		kvm_gpc_deactivate(&vcpu->arch.pv_time);
 
 	return;
@@ -3638,6 +3638,10 @@ static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
 	struct kvm_vcpu *vcpu;
 	struct kvm *kvm = v->kvm;
 
+	/* In master clock mode, the clock doesn't need periodic updates. */
+	if (kvm->arch.use_master_clock)
+		return;
+
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		kvm_vcpu_kick(vcpu);
@@ -5339,7 +5343,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		 * kvmclock on vcpu->cpu migration
 		 */
 		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
-			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
 		vcpu->cpu = cpu;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 25/30] KVM: x86/xen: Prevent runstate times from becoming negative
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (23 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 24/30] KVM: x86: Avoid gratuitous global clock updates David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 26/30] KVM: x86: Avoid redundant masterclock updates from multiple vCPUs David Woodhouse
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

When kvm_xen_update_runstate() is invoked to set a vCPU's runstate, the
time spent in the previous runstate is accounted. This is based on the
delta between the current KVM clock time, and the previous value stored
in vcpu->arch.xen.runstate_entry_time.

If the KVM clock goes backwards, that delta will be negative. Or, since
it's an unsigned 64-bit integer, very *large*. Linux guests deal with
that particularly badly, reporting 100% steal time for ever more (well,
for *centuries* at least, until the delta has been consumed).

So when a negative delta is detected, just refrain from updating the
runstates until the KVM clock catches up with runstate_entry_time again.

Also clamp steal_ns to delta_ns to prevent steal time from exceeding
the total elapsed time, and handle negative steal_ns (which can happen
if run_delay goes backwards across a scheduler update).

The userspace APIs for setting the runstate times do not allow them to
be set past the current KVM clock, but userspace can still adjust the
KVM clock *after* setting the runstate times, which would cause this
situation to occur.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/xen.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 82e34edbfdbd..fef52b8ea26a 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -586,24 +586,33 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
 {
 	struct kvm_vcpu_xen *vx = &v->arch.xen;
 	u64 now = get_kvmclock_ns(v->kvm);
-	u64 delta_ns = now - vx->runstate_entry_time;
 	u64 run_delay = current->sched_info.run_delay;
+	s64 delta_ns = now - vx->runstate_entry_time;
+	s64 steal_ns = run_delay - vx->last_steal;
 
 	if (unlikely(!vx->runstate_entry_time))
 		vx->current_runstate = RUNSTATE_offline;
 
+	vx->last_steal = run_delay;
+
+	/*
+	 * If KVM clock time went backwards, stop updating until it
+	 * catches up (or the runstates are reset by userspace).
+	 */
+	if (delta_ns < 0)
+		return;
+
 	/*
 	 * Time waiting for the scheduler isn't "stolen" if the
 	 * vCPU wasn't running anyway.
 	 */
-	if (vx->current_runstate == RUNSTATE_running) {
-		u64 steal_ns = run_delay - vx->last_steal;
+	if (vx->current_runstate == RUNSTATE_running && steal_ns > 0) {
+		if (steal_ns > delta_ns)
+			steal_ns = delta_ns;
 
 		delta_ns -= steal_ns;
-
 		vx->runstate_times[RUNSTATE_runnable] += steal_ns;
 	}
-	vx->last_steal = run_delay;
 
 	vx->runstate_times[vx->current_runstate] += delta_ns;
 	vx->current_runstate = state;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 26/30] KVM: x86: Avoid redundant masterclock updates from multiple vCPUs
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (24 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 25/30] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 27/30] KVM: x86: Add KVM_VCPU_TSC_EFFECTIVE_FREQ attribute David Woodhouse
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

When a masterclock update is triggered (e.g. by the clocksource change
notifier), KVM_REQ_MASTERCLOCK_UPDATE is set on all vCPUs. Without this
fix, each vCPU independently processes the request and redundantly
re-executes the entire pvclock_update_vm_gtod_copy() sequence, serialized
only by tsc_write_lock. Each redundant re-snapshot of the master clock
reference point introduces potential clock drift.

Fix this by having __kvm_start_pvclock_update() check, after acquiring
the lock, whether the requesting vCPU's KVM_REQ_MASTERCLOCK_UPDATE is
still set. If another vCPU already did the update and cleared it, bail
out. Otherwise, clear the request on all other vCPUs before proceeding.

The caller in vcpu_enter_guest() now uses kvm_test_request() (non-clearing)
since the clearing is done inside __kvm_start_pvclock_update() under the
lock.

Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 56 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7d9ec0638d28..77dfd4455a4e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3259,10 +3259,39 @@ static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
 }
 
-static void __kvm_start_pvclock_update(struct kvm *kvm)
+static void kvm_clear_mclock_inprogress_request(struct kvm *kvm)
 {
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+}
+
+static bool __kvm_start_pvclock_update(struct kvm *kvm, struct kvm_vcpu *requesting_vcpu)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
 	raw_spin_lock_irq(&kvm->arch.tsc_write_lock);
+
+	/*
+	 * If another vCPU already did the update while we were waiting
+	 * for the lock, our request will have been cleared. Bail out.
+	 */
+	if (requesting_vcpu &&
+	    !kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE, requesting_vcpu)) {
+		kvm_clear_mclock_inprogress_request(kvm);
+		raw_spin_unlock_irq(&kvm->arch.tsc_write_lock);
+		return false;
+	}
+
+	/* The update is VM-wide; prevent other vCPUs from redoing it. */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_clear_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
 	write_seqcount_begin(&kvm->arch.pvclock_sc);
+	return true;
 }
 
 static void kvm_start_pvclock_update(struct kvm *kvm)
@@ -3270,7 +3299,7 @@ static void kvm_start_pvclock_update(struct kvm *kvm)
 	kvm_make_mclock_inprogress_request(kvm);
 
 	/* no guest entries from this point */
-	__kvm_start_pvclock_update(kvm);
+	__kvm_start_pvclock_update(kvm, NULL);
 }
 
 static void kvm_end_pvclock_update(struct kvm *kvm)
@@ -3279,22 +3308,25 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
 
-	write_seqcount_end(&ka->pvclock_sc);
-	raw_spin_unlock_irq(&ka->tsc_write_lock);
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 	/* guest entries allowed */
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+	kvm_clear_mclock_inprogress_request(kvm);
+
+	write_seqcount_end(&ka->pvclock_sc);
+	raw_spin_unlock_irq(&ka->tsc_write_lock);
 }
 
-static void kvm_update_masterclock(struct kvm *kvm)
+static void kvm_update_masterclock(struct kvm *kvm, struct kvm_vcpu *vcpu)
 {
 	kvm_hv_request_tsc_page_update(kvm);
-	kvm_start_pvclock_update(kvm);
-	pvclock_update_vm_gtod_copy(kvm);
-	kvm_end_pvclock_update(kvm);
+	kvm_make_mclock_inprogress_request(kvm);
+
+	if (__kvm_start_pvclock_update(kvm, vcpu)) {
+		pvclock_update_vm_gtod_copy(kvm);
+		kvm_end_pvclock_update(kvm);
+	}
 }
 
 /*
@@ -11485,8 +11517,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_mmu_free_obsolete_roots(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
-		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
-			kvm_update_masterclock(vcpu->kvm);
+		if (kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
+			kvm_update_masterclock(vcpu->kvm, vcpu);
 		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
 			kvm_gen_kvmclock_update(vcpu);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 27/30] KVM: x86: Add KVM_VCPU_TSC_EFFECTIVE_FREQ attribute
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (25 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 26/30] KVM: x86: Avoid redundant masterclock updates from multiple vCPUs David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 28/30] KVM: x86: Remove runtime Xen TSC frequency CPUID update David Woodhouse
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Add a read-only per-vCPU attribute that reports the effective TSC and
APIC bus frequencies as seen by the guest, after hardware TSC scaling
is applied.

This allows userspace to populate CPUID leaf 0x40000010 (the "generic"
timing information leaf used by FreeBSD, XNU, and VMware) with correct
values, without KVM needing to modify guest CPUID at runtime.

The effective TSC frequency differs from what userspace requested via
KVM_SET_TSC_KHZ due to the granularity of hardware scaling and the
host kernel's measurement of its own TSC frequency.

The relationship between the attributes in KVM_VCPU_TSC_CTRL:

  KVM_VCPU_TSC_OFFSET: the offset added to the scaled host TSC
  KVM_VCPU_TSC_SCALE: the raw hardware scaling ratio and frac_bits,
    for VMClock and precise arithmetic
  KVM_VCPU_TSC_EFFECTIVE_FREQ: the resulting frequencies in kHz,
    for populating CPUID leaves

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 Documentation/virt/kvm/devices/vcpu.rst | 33 +++++++++++++++++++++++++
 arch/x86/include/uapi/asm/kvm.h         |  6 +++++
 arch/x86/kvm/x86.c                      | 23 +++++++++++++++++
 3 files changed, 62 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 56562b932280..75d1c2bbb8bc 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -326,3 +326,36 @@ host TSC values are converted to guest TSC using the formula:
 Userspace can use this to precisely calculate the guest TSC from the host
 TSC at any given moment. This is needed for accurate migration of guests,
 as described in the documentation for the KVM_VCPU_TSC_OFFSET attribute.
+
+4.3 ATTRIBUTE: KVM_VCPU_TSC_EFFECTIVE_FREQ
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Parameters: struct kvm_vcpu_tsc_effective_freq
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading the provided parameter
+		 address.
+	 -ENXIO  Attribute not supported (no constant TSC)
+	 ======= ======================================
+
+This read-only attribute reports the effective TSC and APIC bus timer
+frequencies as observed by the guest, after hardware TSC scaling is
+applied::
+
+  struct kvm_vcpu_tsc_effective_freq {
+	__u32 tsc_khz;
+	__u32 bus_khz;
+  };
+
+The tsc_khz field is the guest's effective TSC frequency in kHz. This
+may differ slightly from what userspace requested via KVM_SET_TSC_KHZ
+due to the granularity of hardware scaling and the host kernel's
+measurement of its own TSC frequency.
+
+The bus_khz field is the APIC bus timer frequency in kHz.
+
+Userspace can use these values to populate CPUID timing leaves for the
+guest, such as the generic timing leaf at 0x40000010 (EAX=tsc_khz,
+EBX=bus_khz) or hypervisor-specific equivalents.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 384be9a53395..196899296f84 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -962,12 +962,18 @@ struct kvm_hyperv_eventfd {
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
 #define   KVM_VCPU_TSC_SCALE  1 /* attribute for TSC scaling factor */
+#define   KVM_VCPU_TSC_EFFECTIVE_FREQ 2 /* attribute for effective frequencies */
 
 struct kvm_vcpu_tsc_scale {
 	__u64 tsc_ratio;
 	__u64 tsc_frac_bits;
 };
 
+struct kvm_vcpu_tsc_effective_freq {
+	__u32 tsc_khz;
+	__u32 bus_khz;
+};
+
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	_BITULL(0)
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 77dfd4455a4e..c15303963686 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6079,6 +6079,9 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 	case KVM_VCPU_TSC_SCALE:
 		r = kvm_caps.has_tsc_control ? 0 : -ENXIO;
 		break;
+	case KVM_VCPU_TSC_EFFECTIVE_FREQ:
+		r = boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 0 : -ENXIO;
+		break;
 	default:
 		r = -ENXIO;
 	}
@@ -6115,6 +6118,25 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_EFFECTIVE_FREQ: {
+		struct kvm_vcpu_tsc_effective_freq freq;
+
+		if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+			r = -ENXIO;
+			break;
+		}
+
+		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu))
+			kvm_guest_time_update(vcpu);
+
+		freq.tsc_khz = div_u64(vcpu->arch.hw_tsc_hz, 1000);
+		freq.bus_khz = 1000000 / vcpu->kvm->arch.apic_bus_cycle_ns;
+		r = -EFAULT;
+		if (copy_to_user(uaddr, &freq, sizeof(freq)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -6155,6 +6177,7 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		break;
 	}
 	case KVM_VCPU_TSC_SCALE:
+	case KVM_VCPU_TSC_EFFECTIVE_FREQ:
 		r = -EINVAL; /* Read only */
 		break;
 	default:
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 28/30] KVM: x86: Remove runtime Xen TSC frequency CPUID update
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (26 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 27/30] KVM: x86: Add KVM_VCPU_TSC_EFFECTIVE_FREQ attribute David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 29/30] x86/kvm: Obtain TSC frequency from CPUID if present David Woodhouse
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

Remove the code in kvm_cpuid() that dynamically updates the Xen TSC
info CPUID leaf at runtime. This code was updating the wrong sub-leaf
anyway (0x40000x03/2 EAX is the *host* TSC frequency per the Xen ABI,
not the guest frequency which belongs in 0x40000x03/0 ECX).

Userspace now has all the information it needs to populate the Xen TSC
info leaves (and the generic 0x40000010 timing leaf) at vCPU setup time:

  - KVM_GET_CLOCK_GUEST returns the pvclock_vcpu_time_info structure
    containing tsc_to_system_mul and tsc_shift (Xen leaf index 1)
  - KVM_VCPU_TSC_EFFECTIVE_FREQ returns the effective TSC and bus
    frequencies in kHz (Xen leaf index 2, and 0x40000010)
  - KVM_VCPU_TSC_SCALE returns the raw hardware scaling ratio for
    precise arithmetic (VMClock)

This eliminates the last instance of KVM modifying guest CPUID entries
at runtime for timing information.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/cpuid.c | 16 ----------------
 arch/x86/kvm/xen.h   | 13 -------------
 2 files changed, 29 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 621d950ec692..826637a0b72d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -2117,22 +2117,6 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
 		} else if (function == 0x80000007) {
 			if (kvm_hv_invtsc_suppressed(vcpu))
 				*edx &= ~feature_bit(CONSTANT_TSC);
-		} else if (IS_ENABLED(CONFIG_KVM_XEN) &&
-			   kvm_xen_is_tsc_leaf(vcpu, function)) {
-			/*
-			 * Update guest TSC frequency information if necessary.
-			 * Ignore failures, there is no sane value that can be
-			 * provided if KVM can't get the TSC frequency.
-			 */
-			if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu))
-				kvm_guest_time_update(vcpu);
-
-			if (index == 1) {
-				*ecx = vcpu->arch.pvclock_tsc_mul;
-				*edx = vcpu->arch.pvclock_tsc_shift;
-			} else if (index == 2) {
-				*eax = div_u64(vcpu->arch.hw_tsc_hz, 1000);
-			}
 		}
 	} else {
 		*eax = *ebx = *ecx = *edx = 0;
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 59e6128a7bd3..f372855857a8 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -50,14 +50,6 @@ static inline void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu)
 		kvm_xen_inject_vcpu_vector(vcpu);
 }
 
-static inline bool kvm_xen_is_tsc_leaf(struct kvm_vcpu *vcpu, u32 function)
-{
-	return static_branch_unlikely(&kvm_xen_enabled.key) &&
-	       vcpu->arch.xen.cpuid.base &&
-	       function <= vcpu->arch.xen.cpuid.limit &&
-	       function == (vcpu->arch.xen.cpuid.base | XEN_CPUID_LEAF(3));
-}
-
 static inline bool kvm_xen_msr_enabled(struct kvm *kvm)
 {
 	return static_branch_unlikely(&kvm_xen_enabled.key) &&
@@ -177,11 +169,6 @@ static inline bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu)
 {
 	return false;
 }
-
-static inline bool kvm_xen_is_tsc_leaf(struct kvm_vcpu *vcpu, u32 function)
-{
-	return false;
-}
 #endif
 
 int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 29/30] x86/kvm: Obtain TSC frequency from CPUID if present
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (27 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 28/30] KVM: x86: Remove runtime Xen TSC frequency CPUID update David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-09 22:46 ` [PATCH v4 30/30] x86/xen: " David Woodhouse
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

In https://lore.kernel.org/all/1222881242.9381.17.camel@alok-dev1/
a proposal was made for generic CPUID conventions across hypervisors.
It was mostly shot down in flames, but the leaf at 0x40000010
containing timing information didn't die.

It's used by XNU and FreeBSD guests under all hypervisors to determine
the TSC frequency, and also exposed by the EC2 Nitro hypervisor and
VMware. Use it under KVM to obtain the TSC frequency more accurately,
instead of reverse-calculating the frequency from the mul/shift values
in the KVM clock.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_para.h      |  1 +
 arch/x86/include/uapi/asm/kvm_para.h | 11 +++++++++++
 arch/x86/kernel/kvm.c                | 10 ++++++++++
 arch/x86/kernel/kvmclock.c           |  7 ++++++-
 4 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 4a47c16e2df8..03fa1228fcf2 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -121,6 +121,7 @@ static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
 void kvmclock_init(void);
 void kvmclock_disable(void);
 bool kvm_para_available(void);
+unsigned int kvm_para_tsc_khz(void);
 unsigned int kvm_arch_para_features(void);
 unsigned int kvm_arch_para_hints(void);
 void kvm_async_pf_task_wait_schedule(u32 token);
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..dc0d036fe678 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -44,6 +44,17 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT	24
 
+/*
+ * In https://lore.kernel.org/all/1222881242.9381.17.camel@alok-dev1/
+ * VMware proposed a timing information leaf providing the TSC and
+ * local APIC timer frequencies:
+ *
+ *  # EAX: (Virtual) TSC frequency in kHz.
+ *  # EBX: (Virtual) Bus (local apic timer) frequency in kHz.
+ *  # ECX, EDX: RESERVED (reserved fields are set to zero).
+ */
+#define KVM_CPUID_TIMING_INFO	0x40000010
+
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 29226d112029..60375165b66c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -910,6 +910,16 @@ bool kvm_para_available(void)
 }
 EXPORT_SYMBOL_GPL(kvm_para_available);
 
+unsigned int kvm_para_tsc_khz(void)
+{
+	u32 base = kvm_cpuid_base();
+
+	if (base && cpuid_eax(base) >= (base | KVM_CPUID_TIMING_INFO))
+		return cpuid_eax(base | KVM_CPUID_TIMING_INFO);
+
+	return 0;
+}
+
 unsigned int kvm_arch_para_features(void)
 {
 	return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index b5991d53fc0e..74aca22dc726 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -118,7 +118,12 @@ static inline void kvm_sched_clock_init(bool stable)
 static unsigned long kvm_get_tsc_khz(void)
 {
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
-	return pvclock_tsc_khz(this_cpu_pvti());
+
+	/*
+	 * If KVM advertises the frequency directly in CPUID, use that
+	 * instead of reverse-calculating it from the KVM clock data.
+	 */
+	return kvm_para_tsc_khz() ? : pvclock_tsc_khz(this_cpu_pvti());
 }
 
 static void __init kvm_get_preset_lpj(void)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 30/30] x86/xen: Obtain TSC frequency from CPUID if present
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (28 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 29/30] x86/kvm: Obtain TSC frequency from CPUID if present David Woodhouse
@ 2026-05-09 22:46 ` David Woodhouse
  2026-05-10 20:56 ` [PATCH v4 33/30] KVM: selftests: Add Xen runstate migration test David Woodhouse
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-09 22:46 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Joey Gouly, Jack Allister, Dongli Zhang, joe.jin,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

From: David Woodhouse <dwmw@amazon.co.uk>

The Xen CPUID leaf 3, sub-leaf 0, ECX provides the guest TSC frequency
in kHz directly. Use it when available instead of reverse-calculating
the frequency from the pvclock tsc_to_system_mul and tsc_shift values,
which loses precision.

This mirrors the equivalent change for KVM guests using the generic
0x40000010 timing leaf.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c  |  3 +--
 arch/x86/xen/time.c | 12 ++++++++++++
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c15303963686..ac982652e5e0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3469,7 +3469,6 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
 int kvm_guest_time_update(struct kvm_vcpu *v)
 {
 	struct pvclock_vcpu_time_info hv_clock = {};
-	unsigned long flags;
 	u64 tgt_tsc_hz;
 	unsigned seq;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
@@ -10162,7 +10161,7 @@ static void kvm_hyperv_tsc_notifier(void)
 	kvm_caps.max_guest_tsc_khz = tsc_khz;
 
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		__kvm_start_pvclock_update(kvm);
+		__kvm_start_pvclock_update(kvm, NULL);
 		pvclock_update_vm_gtod_copy(kvm);
 		kvm_end_pvclock_update(kvm);
 	}
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 6f9f665bb7ae..862b8e9e8405 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -41,8 +41,20 @@ static unsigned long xen_tsc_khz(void)
 {
 	struct pvclock_vcpu_time_info *info =
 		&HYPERVISOR_shared_info->vcpu_info[0].time;
+	u32 eax, ebx, ecx, edx;
+	u32 base = xen_cpuid_base();
 
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+	/*
+	 * If Xen provides the guest TSC frequency directly in CPUID
+	 * (leaf 3, sub-leaf 0, ECX), use that instead of reverse-
+	 * calculating from the pvclock mul/shift.
+	 */
+	cpuid_count(base + 3, 0, &eax, &ebx, &ecx, &edx);
+	if (ecx)
+		return ecx;
+
 	return pvclock_tsc_khz(info);
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 33/30] KVM: selftests: Add Xen runstate migration test
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (29 preceding siblings ...)
  2026-05-09 22:46 ` [PATCH v4 30/30] x86/xen: " David Woodhouse
@ 2026-05-10 20:56 ` David Woodhouse
  2026-05-10 20:58 ` [PATCH v4 31/30] KVM: selftests: Add Xen/generic CPUID timing leaf test David Woodhouse
  2026-05-10 21:05 ` [PATCH v4 32/30] KVM: x86: Re-synchronize TSC after KVM_SET_TSC_KHZ David Woodhouse
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-10 20:56 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Dongli Zhang, Jack Allister, Joe Jin, Joey Gouly,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 7851 bytes --]

From: David Woodhouse <dwmw@amazon.co.uk>

Test that Xen runstate (steal time) is correctly accounted across a
simulated live migration using KVM_XEN_VCPU_ATTR and KVM_[GS]ET_CLOCK_GUEST.

The test simulates what a real VMM does during migration:
1. Creates a VM with Xen HVM config and runstate tracking
2. Runs the guest to accumulate some kvmclock time
3. Saves clock (KVM_GET_CLOCK_GUEST), TSC offset, and runstate
4. Marks the saved state as RUNSTATE_runnable (vCPU not running)
5. Destroys the source VM
6. Sleeps 10ms (simulating migration network transfer time)
7. Creates a new VM and restores all state precisely as saved
8. Runs the guest and verifies the migration gap appears as steal

The kernel accounts the gap because: on vcpu_load, it transitions from
RUNSTATE_runnable to RUNSTATE_running, computing delta = kvmclock_now -
state_entry_time. Since kvmclock has advanced past the saved entry time
(real time elapsed during migration), the delta is added to time_runnable.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 .../selftests/kvm/x86/xen_migration_test.c    | 194 ++++++++++++++++++
 1 file changed, 194 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86/xen_migration_test.c

diff --git a/tools/testing/selftests/kvm/x86/xen_migration_test.c b/tools/testing/selftests/kvm/x86/xen_migration_test.c
new file mode 100644
index 000000000000..37e8ace00611
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/xen_migration_test.c
@@ -0,0 +1,194 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test Xen runstate (steal time) preservation across simulated migration.
+ *
+ * Verifies that the kernel correctly accounts the migration gap as
+ * steal time (runnable) when runstate data is saved and restored
+ * precisely, but real time elapses during the migration.
+ *
+ * The key insight: userspace saves the runstate with state=RUNSTATE_runnable
+ * (the vCPU is not running during migration). On restore, the kernel sees
+ * that kvmclock has advanced past state_entry_time, and accounts the
+ * difference as time spent in the runnable state.
+ */
+#include <inttypes.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+#include <asm/pvclock-abi.h>
+
+#define SHINFO_GPA	0xc0000000ULL
+#define RUNSTATE_GPA	(SHINFO_GPA + 0x1000)
+
+#define RUNSTATE_running  0
+#define RUNSTATE_runnable 1
+#define RUNSTATE_blocked  2
+#define RUNSTATE_offline  3
+
+struct vcpu_runstate_info {
+	uint32_t state;
+	uint64_t state_entry_time;
+	uint64_t time[4];
+} __attribute__((packed));
+
+static void guest_code(void)
+{
+	volatile struct vcpu_runstate_info *rs =
+		(void *)(unsigned long)RUNSTATE_GPA;
+
+	/* Report runstate times — no need to enable kvmclock MSR,
+	 * the kernel writes runstate using its internal kvmclock. */
+	GUEST_SYNC_ARGS(0, rs->time[RUNSTATE_runnable],
+			rs->time[RUNSTATE_running], 0, 0);
+}
+
+static struct kvm_vm *create_xen_vm(struct kvm_vcpu **vcpu)
+{
+	struct kvm_vm *vm;
+	int xen_caps;
+
+	vm = vm_create_with_one_vcpu(vcpu, guest_code);
+
+	xen_caps = kvm_check_cap(KVM_CAP_XEN_HVM);
+	TEST_REQUIRE(xen_caps & KVM_XEN_HVM_CONFIG_SHARED_INFO);
+	TEST_REQUIRE(xen_caps & KVM_XEN_HVM_CONFIG_RUNSTATE);
+
+	/* Map pages */
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    SHINFO_GPA, 1, 2, 0);
+	virt_map(vm, SHINFO_GPA, SHINFO_GPA, 2);
+
+	/* Enable Xen HVM with MSR interception (enables runstate tracking) */
+	struct kvm_xen_hvm_config cfg = {
+		.flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
+		.msr = 0x40000000,
+	};
+	vm_ioctl(vm, KVM_XEN_HVM_CONFIG, &cfg);
+
+	/* Set shared_info */
+	struct kvm_xen_hvm_attr ha = {
+		.type = KVM_XEN_ATTR_TYPE_SHARED_INFO,
+		.u.shared_info.gfn = SHINFO_GPA >> 12,
+	};
+	vm_ioctl(vm, KVM_XEN_HVM_SET_ATTR, &ha);
+
+	/* Set runstate address */
+	struct kvm_xen_vcpu_attr rs_addr = {
+		.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR,
+		.u.gpa = RUNSTATE_GPA,
+	};
+	vcpu_ioctl(*vcpu, KVM_XEN_VCPU_SET_ATTR, &rs_addr);
+
+	return vm;
+}
+
+int main(void)
+{
+	struct pvclock_vcpu_time_info pvti;
+	struct kvm_xen_vcpu_attr runstate_save;
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	struct ucall uc;
+	uint64_t tsc_offset;
+	int ret;
+
+	/* === SOURCE SIDE === */
+	pr_info("=== Source: create VM and run guest ===\n");
+	vm = create_xen_vm(&vcpu);
+
+	/* Run guest once to accumulate some runstate time */
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC);
+
+	pr_info("  Guest sees: runnable=%" PRIu64 " running=%" PRIu64 "\n",
+		uc.args[2], uc.args[3]);
+
+	/* Save clock state */
+	ret = __vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti);
+	TEST_ASSERT(!ret, "KVM_GET_CLOCK_GUEST failed");
+
+	/* Save TSC offset */
+	tsc_offset = vcpu_get_msr(vcpu, MSR_IA32_TSC_ADJUST);
+
+	/* Save runstate — the vCPU is now "runnable" (not running) */
+	runstate_save.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA;
+	vcpu_ioctl(vcpu, KVM_XEN_VCPU_GET_ATTR, &runstate_save);
+
+	/*
+	 * Transition to runnable state before saving — the vCPU is
+	 * not running during migration.
+	 */
+	runstate_save.u.runstate.state = RUNSTATE_runnable;
+
+	pr_info("  Saved runstate: running=%" PRIu64 " runnable=%" PRIu64
+		" entry=%" PRIu64 "\n",
+		(uint64_t)runstate_save.u.runstate.time_running,
+		(uint64_t)runstate_save.u.runstate.time_runnable,
+		(uint64_t)runstate_save.u.runstate.state_entry_time);
+
+	uint64_t saved_runnable = runstate_save.u.runstate.time_runnable;
+
+	kvm_vm_release(vm);
+
+	/* === MIGRATION GAP === */
+	pr_info("=== Simulating migration (sleeping 10ms) ===\n");
+	usleep(10000);
+
+	/* === DESTINATION SIDE === */
+	pr_info("=== Destination: create new VM and restore ===\n");
+	vm = create_xen_vm(&vcpu);
+
+	/* Restore TSC offset */
+	vcpu_set_msr(vcpu, MSR_IA32_TSC_ADJUST, tsc_offset);
+
+	/* Restore clock — kvmclock will now be ~10ms ahead of the snapshot */
+	vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti);
+
+	/* Restore runstate exactly as saved (state=runnable) */
+	runstate_save.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA;
+	ret = __vcpu_ioctl(vcpu, KVM_XEN_VCPU_SET_ATTR, &runstate_save);
+	TEST_ASSERT(!ret, "Restore runstate failed: errno %d", errno);
+
+	/*
+	 * Run the guest. When the vCPU enters vcpu_run, the kernel
+	 * transitions from RUNSTATE_runnable to RUNSTATE_running.
+	 * It computes: delta = kvmclock_now - state_entry_time
+	 * This delta (which includes the migration gap) is added to
+	 * time_runnable (steal time).
+	 */
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC);
+
+	uint64_t guest_runnable = uc.args[2];
+	uint64_t guest_running = uc.args[3];
+
+	pr_info("  Guest sees: runnable=%" PRIu64 " running=%" PRIu64 "\n",
+		guest_runnable, guest_running);
+
+	uint64_t steal_increase = guest_runnable - saved_runnable;
+	pr_info("  Steal time increase: %" PRIu64 " ns (migration gap)\n",
+		steal_increase);
+
+	/*
+	 * The steal time increase should be at least 10ms (the sleep)
+	 * but not more than 5s (allowing for VM creation overhead).
+	 * The actual gap is from the source's state_entry_time to the
+	 * destination's kvmclock "now" at vcpu_load time.
+	 */
+	TEST_ASSERT(steal_increase >= 10000000ULL &&
+		    steal_increase < 5000000000ULL,
+		    "Steal time increase %" PRIu64 " ns not in expected range "
+		    "[10ms, 5s]", steal_increase);
+
+	kvm_vm_release(vm);
+	pr_info("PASS: Migration gap correctly accounted as steal time\n");
+	return 0;
+}
-- 
2.43.0



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 31/30] KVM: selftests: Add Xen/generic CPUID timing leaf test
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (30 preceding siblings ...)
  2026-05-10 20:56 ` [PATCH v4 33/30] KVM: selftests: Add Xen runstate migration test David Woodhouse
@ 2026-05-10 20:58 ` David Woodhouse
  2026-05-10 21:05 ` [PATCH v4 32/30] KVM: x86: Re-synchronize TSC after KVM_SET_TSC_KHZ David Woodhouse
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-10 20:58 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Dongli Zhang, Jack Allister, Joe Jin, Joey Gouly,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 8759 bytes --]

From: David Woodhouse <dwmw@amazon.co.uk>

Verify that userspace can correctly populate Xen and generic CPUID
timing leaves using the KVM_VCPU_TSC_EFFECTIVE_FREQ and
KVM_VCPU_TSC_SCALE attributes.

This validates that the removal of KVM's runtime Xen CPUID modification
doesn't break guests: userspace queries the effective TSC and bus
frequencies, computes the pvclock mul/shift, populates the CPUID leaves,
and the guest verifies the values match.

The test exercises:
 - KVM_VCPU_TSC_EFFECTIVE_FREQ at native and scaled frequencies
 - KVM_VCPU_TSC_SCALE ratio verification against effective frequency
 - Generic timing leaf 0x40000010 (EAX=tsc_khz, EBX=bus_khz)
 - Xen leaf 3 sub-leaf 0 (ECX=guest TSC kHz)
 - Xen leaf 3 sub-leaf 1 (ECX=mul, EDX=shift)

Gracefully skips TSC scaling tests on hardware without support.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../selftests/kvm/x86/xen_cpuid_timing_test.c | 232 ++++++++++++++++++
 2 files changed, 233 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86/xen_cpuid_timing_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index fb935ae3bf38..50f02116249f 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -139,6 +139,7 @@ TEST_GEN_PROGS_x86 += x86/xss_msr_test
 TEST_GEN_PROGS_x86 += x86/debug_regs
 TEST_GEN_PROGS_x86 += x86/tsc_msrs_test
 TEST_GEN_PROGS_x86 += x86/vmx_pmu_caps_test
+TEST_GEN_PROGS_x86 += x86/xen_cpuid_timing_test
 TEST_GEN_PROGS_x86 += x86/xen_shinfo_test
 TEST_GEN_PROGS_x86 += x86/xen_vmcall_test
 TEST_GEN_PROGS_x86 += x86/sev_init2_tests
diff --git a/tools/testing/selftests/kvm/x86/xen_cpuid_timing_test.c b/tools/testing/selftests/kvm/x86/xen_cpuid_timing_test.c
new file mode 100644
index 000000000000..f574343ed449
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/xen_cpuid_timing_test.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test that userspace can correctly populate Xen and generic CPUID
+ * timing leaves using KVM_VCPU_TSC_EFFECTIVE_FREQ.
+ *
+ * This validates that the removal of KVM's runtime Xen CPUID modification
+ * doesn't break guests, because userspace has all the information needed.
+ */
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+#include <asm/pvclock-abi.h>
+
+#define XEN_CPUID_BASE		0x40000100
+#define XEN_CPUID_LEAF(n)	(XEN_CPUID_BASE + (n))
+#define GENERIC_TIMING_LEAF	0x40000010
+
+/* Values set by host, verified by guest */
+static uint32_t expected_tsc_khz;
+static uint32_t expected_bus_khz;
+static uint32_t expected_tsc_mul;
+static int8_t   expected_tsc_shift;
+static uint64_t host_khz;
+
+static void guest_code(void)
+{
+	uint32_t eax, ebx, ecx, edx;
+
+	/* Check generic timing leaf 0x40000010 */
+	__cpuid(GENERIC_TIMING_LEAF, 0, &eax, &ebx, &ecx, &edx);
+	GUEST_ASSERT_EQ(eax, expected_tsc_khz);
+	GUEST_ASSERT_EQ(ebx, expected_bus_khz);
+
+	/* Check Xen leaf 3, sub-leaf 0: ECX = guest TSC frequency */
+	__cpuid(XEN_CPUID_LEAF(3), 0, &eax, &ebx, &ecx, &edx);
+	GUEST_ASSERT_EQ(ecx, expected_tsc_khz);
+
+	/* Check Xen leaf 3, sub-leaf 1: ECX = mul, EDX = shift */
+	__cpuid(XEN_CPUID_LEAF(3), 1, &eax, &ebx, &ecx, &edx);
+	GUEST_ASSERT_EQ(ecx, expected_tsc_mul);
+	GUEST_ASSERT_EQ((int8_t)edx, expected_tsc_shift);
+
+	GUEST_SYNC(0);
+}
+
+static void add_cpuid_entry(struct kvm_vcpu *vcpu, uint32_t function,
+			    uint32_t index, uint32_t eax, uint32_t ebx,
+			    uint32_t ecx, uint32_t edx)
+{
+	struct kvm_cpuid2 *cpuid = vcpu->cpuid;
+	struct kvm_cpuid_entry2 *entry;
+	int n = cpuid->nent;
+
+	vcpu->cpuid = realloc(vcpu->cpuid,
+			      sizeof(*cpuid) + (n + 1) * sizeof(*entry));
+	cpuid = vcpu->cpuid;
+	cpuid->nent = n + 1;
+
+	entry = &cpuid->entries[n];
+	memset(entry, 0, sizeof(*entry));
+	entry->function = function;
+	entry->index = index;
+	entry->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+	entry->eax = eax;
+	entry->ebx = ebx;
+	entry->ecx = ecx;
+	entry->edx = edx;
+}
+
+/*
+ * Compute pvclock mul/shift from frequency, matching kvm_get_time_scale().
+ */
+static void compute_tsc_mul_shift(uint64_t tsc_hz, uint32_t *mul, int8_t *shift)
+{
+	uint64_t scaled = 1000000000ULL;
+	uint64_t base = tsc_hz;
+	int32_t s = 0;
+	uint32_t base32;
+
+	while (base > scaled * 2 || base >> 32) {
+		base >>= 1;
+		s--;
+	}
+	base32 = (uint32_t)base;
+	while (base32 <= scaled || scaled >> 32) {
+		if (scaled >> 32 || base32 & (1U << 31))
+			scaled >>= 1;
+		else
+			base32 <<= 1;
+		s++;
+	}
+	*mul = (uint32_t)((scaled << 32) / base32);
+	*shift = (int8_t)s;
+}
+
+static void run_test(uint64_t tsc_khz)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	struct ucall uc;
+	struct { uint32_t tsc_khz; uint32_t bus_khz; } freq;
+	struct kvm_device_attr freq_attr = {
+		.group = KVM_VCPU_TSC_CTRL,
+		.attr = 2, /* KVM_VCPU_TSC_EFFECTIVE_FREQ */
+		.addr = (uint64_t)(uintptr_t)&freq,
+	};
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+	if (tsc_khz) {
+		pr_info("Testing at TSC frequency %lu kHz\n", tsc_khz);
+		vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)(unsigned long)tsc_khz);
+	} else {
+		pr_info("Testing at native TSC frequency\n");
+	}
+
+	vcpu_ioctl(vcpu, KVM_GET_DEVICE_ATTR, &freq_attr);
+
+	/* If scaling wasn't applied, skip this frequency */
+	if (tsc_khz && freq.tsc_khz == host_khz) {
+		pr_info("  TSC scaling not available, skipping\n");
+		kvm_vm_release(vm);
+		return;
+	}
+
+	pr_info("  Effective TSC: %u kHz, Bus: %u kHz\n", freq.tsc_khz, freq.bus_khz);
+
+	/* Also exercise KVM_VCPU_TSC_SCALE if available */
+	{
+		struct { uint64_t ratio; uint64_t frac_bits; } scale;
+		struct kvm_device_attr scale_attr = {
+			.group = KVM_VCPU_TSC_CTRL,
+			.attr = 1, /* KVM_VCPU_TSC_SCALE */
+			.addr = (uint64_t)(uintptr_t)&scale,
+		};
+
+		if (!__vcpu_ioctl(vcpu, KVM_HAS_DEVICE_ATTR, &scale_attr)) {
+			vcpu_ioctl(vcpu, KVM_GET_DEVICE_ATTR, &scale_attr);
+			pr_info("  TSC scale: ratio=%lu frac_bits=%lu\n",
+				scale.ratio, scale.frac_bits);
+
+			/*
+			 * Verify: applying the ratio to the host TSC frequency
+			 * should give approximately the effective frequency.
+			 */
+			if (tsc_khz) {
+				uint64_t computed = ((__uint128_t)host_khz * scale.ratio) >> scale.frac_bits;
+				int64_t diff = (int64_t)computed - (int64_t)freq.tsc_khz;
+
+				TEST_ASSERT(diff >= -1 && diff <= 1,
+					    "TSC_SCALE ratio mismatch: computed %lu vs effective %u (diff %ld)",
+					    computed, freq.tsc_khz, diff);
+			}
+		}
+	}
+
+	compute_tsc_mul_shift((uint64_t)freq.tsc_khz * 1000,
+			      &expected_tsc_mul, &expected_tsc_shift);
+
+	expected_tsc_khz = freq.tsc_khz;
+	expected_bus_khz = freq.bus_khz;
+
+	sync_global_to_guest(vm, expected_tsc_khz);
+	sync_global_to_guest(vm, expected_bus_khz);
+	sync_global_to_guest(vm, expected_tsc_mul);
+	sync_global_to_guest(vm, expected_tsc_shift);
+
+	/* Populate CPUID leaves as a VMM would */
+	add_cpuid_entry(vcpu, GENERIC_TIMING_LEAF, 0,
+			freq.tsc_khz, freq.bus_khz, 0, 0);
+	add_cpuid_entry(vcpu, XEN_CPUID_LEAF(3), 0,
+			0, 0, freq.tsc_khz, 0);
+	add_cpuid_entry(vcpu, XEN_CPUID_LEAF(3), 1,
+			0, 0, expected_tsc_mul,
+			(uint32_t)(uint8_t)expected_tsc_shift);
+
+	vcpu_set_cpuid(vcpu);
+
+	pr_info("  pvclock mul=%u shift=%d\n", expected_tsc_mul, expected_tsc_shift);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+		break;
+	case UCALL_SYNC:
+		break;
+	default:
+		TEST_FAIL("Unexpected ucall");
+	}
+
+	kvm_vm_release(vm);
+}
+
+int main(void)
+{
+	uint64_t freq;
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	struct kvm_device_attr attr = {
+		.group = KVM_VCPU_TSC_CTRL,
+		.attr = 2,
+	};
+
+	TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
+
+	/* Check KVM_VCPU_TSC_EFFECTIVE_FREQ is supported */
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+	TEST_REQUIRE(!__vcpu_ioctl(vcpu, KVM_HAS_DEVICE_ATTR, &attr));
+	host_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
+	kvm_vm_release(vm);
+
+	/* Native frequency */
+	run_test(0);
+
+	/* Scaled frequencies — skip if TSC scaling not available */
+	for (freq = 1000000; freq <= 4000000; freq += 1000000) {
+		if (freq == host_khz)
+			continue;
+		run_test(freq);
+	}
+
+	pr_info("PASS: All CPUID timing leaf tests passed\n");
+	return 0;
+}
-- 
2.43.0



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 32/30] KVM: x86: Re-synchronize TSC after KVM_SET_TSC_KHZ
  2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
                   ` (31 preceding siblings ...)
  2026-05-10 20:58 ` [PATCH v4 31/30] KVM: selftests: Add Xen/generic CPUID timing leaf test David Woodhouse
@ 2026-05-10 21:05 ` David Woodhouse
  32 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2026-05-10 21:05 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky,
	David Woodhouse, Paul Durrant, Jonathan Cameron, Sascha Bischoff,
	Marc Zyngier, Dongli Zhang, Jack Allister, Joe Jin, Joey Gouly,
	kvm, linux-doc, linux-kernel, xen-devel, linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 2047 bytes --]

From: David Woodhouse <dwmw@amazon.co.uk>

KVM_SET_TSC_KHZ changes the vCPU's TSC scaling ratio but does not
update the VM-wide cur_tsc_scaling_ratio used by get_kvmclock().
This causes get_kvmclock() to use a stale (default 1:1) ratio when
computing the KVM clock, leading to drift between the host-side
kvmclock and what the guest observes.

Fix this by calling kvm_synchronize_tsc() after changing the TSC
frequency. This:
 - Updates cur_tsc_scaling_ratio (consumed by pvclock_update_vm_gtod_copy)
 - Ensures the TSC value is continuous across the frequency change
 - Triggers kvm_track_tsc_matching() for proper masterclock handling
 - Allows subsequent vCPUs to synchronize via the 1-second slop hack

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac982652e5e0..833a4f119e22 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -206,6 +206,7 @@ module_param(mitigate_smt_rsb, bool, 0444);
 #ifdef CONFIG_X86_64
 static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
 #endif
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value);
 #define KVM_MAX_NR_USER_RETURN_MSRS 16
 
 struct kvm_user_return_msrs {
@@ -2611,7 +2612,20 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 			 user_tsc_khz, thresh_lo, thresh_hi);
 		use_scaling = 1;
 	}
-	return set_tsc_khz(vcpu, user_tsc_khz, use_scaling);
+	if (set_tsc_khz(vcpu, user_tsc_khz, use_scaling))
+		return -1;
+
+	/*
+	 * Re-synchronize the TSC after changing frequency. This ensures
+	 * cur_tsc_scaling_ratio is updated (used by get_kvmclock) and
+	 * the TSC value is continuous across the frequency change.
+	 */
+	{
+		u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+
+		kvm_synchronize_tsc(vcpu, &tsc);
+	}
+	return 0;
 }
 
 static s64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
-- 
2.43.0



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2026-05-10 21:05 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-09 22:46 [PATCH v4] 00/30] Cleaning up the KVM clock mess David Woodhouse
2026-05-09 22:46 ` [PATCH v4 01/30] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 02/30] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
2026-05-09 22:46 ` [PATCH v4 03/30] UAPI: x86: Move pvclock-abi to UAPI for x86 platforms David Woodhouse
2026-05-09 22:46 ` [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
2026-05-09 22:46 ` [PATCH v4 05/30] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
2026-05-09 22:46 ` [PATCH v4 06/30] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
2026-05-09 22:46 ` [PATCH v4 07/30] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
2026-05-09 22:46 ` [PATCH v4 08/30] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
2026-05-09 22:46 ` [PATCH v4 09/30] KVM: x86: WARN if kvm_get_walltime_and_clockread() fails unexpectedly David Woodhouse
2026-05-09 22:46 ` [PATCH v4 10/30] KVM: x86: Fold __get_kvmclock() into get_kvmclock() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 11/30] KVM: x86: Add WARN and restructure get_kvmclock() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 12/30] KVM: x86: Use get_kvmclock_base_ns() as fallback in get_kvmclock() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 13/30] KVM: x86: Fix KVM clock precision in get_kvmclock() with TSC scaling David Woodhouse
2026-05-09 22:46 ` [PATCH v4 14/30] KVM: x86: Use get_kvmclock() in kvm_get_wall_clock_epoch() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 15/30] KVM: x86: Fix compute_guest_tsc() to handle negative time deltas David Woodhouse
2026-05-09 22:46 ` [PATCH v4 16/30] KVM: x86: Restructure kvm_guest_time_update() for TSC upscaling David Woodhouse
2026-05-09 22:46 ` [PATCH v4 17/30] KVM: x86: Simplify and comment kvm_get_time_scale() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 18/30] KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 19/30] KVM: x86: Improve synchronization in kvm_synchronize_tsc() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 20/30] KVM: x86: Kill last_tsc_{nsec,write,offset} fields David Woodhouse
2026-05-09 22:46 ` [PATCH v4 21/30] KVM: x86: Replace nr_vcpus_matched_tsc count with all_vcpus_matched_tsc bool David Woodhouse
2026-05-09 22:46 ` [PATCH v4 22/30] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other David Woodhouse
2026-05-09 22:46 ` [PATCH v4 23/30] KVM: x86: Factor out kvm_use_master_clock() David Woodhouse
2026-05-09 22:46 ` [PATCH v4 24/30] KVM: x86: Avoid gratuitous global clock updates David Woodhouse
2026-05-09 22:46 ` [PATCH v4 25/30] KVM: x86/xen: Prevent runstate times from becoming negative David Woodhouse
2026-05-09 22:46 ` [PATCH v4 26/30] KVM: x86: Avoid redundant masterclock updates from multiple vCPUs David Woodhouse
2026-05-09 22:46 ` [PATCH v4 27/30] KVM: x86: Add KVM_VCPU_TSC_EFFECTIVE_FREQ attribute David Woodhouse
2026-05-09 22:46 ` [PATCH v4 28/30] KVM: x86: Remove runtime Xen TSC frequency CPUID update David Woodhouse
2026-05-09 22:46 ` [PATCH v4 29/30] x86/kvm: Obtain TSC frequency from CPUID if present David Woodhouse
2026-05-09 22:46 ` [PATCH v4 30/30] x86/xen: " David Woodhouse
2026-05-10 20:56 ` [PATCH v4 33/30] KVM: selftests: Add Xen runstate migration test David Woodhouse
2026-05-10 20:58 ` [PATCH v4 31/30] KVM: selftests: Add Xen/generic CPUID timing leaf test David Woodhouse
2026-05-10 21:05 ` [PATCH v4 32/30] KVM: x86: Re-synchronize TSC after KVM_SET_TSC_KHZ David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox