Kernel KVM virtualization development
* [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update
@ 2026-01-15 20:22 Dongli Zhang
  2026-01-15 20:22 ` [PATCH 1/3] KVM: x86: Fix compute_guest_tsc() to cope with negative delta Dongli Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Dongli Zhang @ 2026-01-15 20:22 UTC (permalink / raw)
  To: kvm
  Cc: seanjc, pbonzini, dwmw2, dwmw, paul, tglx, mingo, bp, dave.hansen,
	x86, hpa, linux-kernel, joe.jin, dongli.zhang

As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
force masterclock update on vCPU hotplug"), each unnecessary
KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.

Although that commit addressed the kvm-clock drift issue during vCPU
hotplug, there are still unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests
during live migration on the target host.

The patchset below was authored by David Woodhouse. Two of the patches aim
to avoid unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests.

[RFC PATCH v3 00/21] Cleaning up the KVM clock mess
https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org/

[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
[RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other

The current patchset has three patches.

PATCH 1 is a partial copy of "[RFC PATCH v3 10/21] KVM: x86: Fix software
TSC upscaling in kvm_update_guest_time()". As Sean suggested: "Please do
this in a separate patch. There's no need to squeeze it in here, and this
change is complex/subtle enough as it is." David's authorship is
preserved.

PATCH 2 clears pending KVM_REQ_MASTERCLOCK_UPDATE requests at the end of
KVM_SET_CLOCK if the masterclock is already active.

PATCH 3 avoids unnecessary updates of ka->master_kernel_ns and
ka->master_cycle_now in pvclock_update_vm_gtod_copy() if the masterclock
is already active and will remain active.


David Woodhouse (1):
  KVM: x86: Fix compute_guest_tsc() to cope with negative delta

Dongli Zhang (2):
  KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
  KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()

 arch/x86/kvm/x86.c | 73 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 52 insertions(+), 21 deletions(-)

base-commit: 944aacb68baf7624ab8d277d0ebf07f025ca137c

Thank you very much!

Dongli Zhang


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] KVM: x86: Fix compute_guest_tsc() to cope with negative delta
  2026-01-15 20:22 [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
@ 2026-01-15 20:22 ` Dongli Zhang
  2026-01-15 20:22 ` [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK Dongli Zhang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Dongli Zhang @ 2026-01-15 20:22 UTC (permalink / raw)
  To: kvm
  Cc: seanjc, pbonzini, dwmw2, dwmw, paul, tglx, mingo, bp, dave.hansen,
	x86, hpa, linux-kernel, joe.jin, dongli.zhang

From: David Woodhouse <dwmw@amazon.co.uk>

The upcoming patches will minimize how often the masterclock data is
updated. Unfortunately, this may cause issues in compute_guest_tsc().

The KVM clock reference point in master_kernel_ns may actually be *earlier*
than the reference point used for the guest TSC (vcpu->last_tsc_nsec),
which can result in a negative delta being passed to
pvclock_scale_delta(). Fix compute_guest_tsc() to cope with negative
deltas.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
[Dongli: copy relevant code from above link and modify changelog]
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
This is a partial copy of "[RFC PATCH v3 10/21] KVM: x86: Fix software
TSC upscaling in kvm_update_guest_time()", as Sean suggested, "Please do
this in a separate patch. There's no need to squeeze it in here, and this
change is complex/subtle enough as it is.", and David's authorship is
preserved.

 arch/x86/kvm/x86.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 63afdb6bb078..5e7418cfd0af 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2581,10 +2581,19 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 
 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
-	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
-				      vcpu->arch.virtual_tsc_mult,
-				      vcpu->arch.virtual_tsc_shift);
-	tsc += vcpu->arch.this_tsc_write;
+	s64 delta = kernel_ns - vcpu->arch.this_tsc_nsec;
+	u64 tsc = vcpu->arch.this_tsc_write;
+
+	/* pvclock_scale_delta cannot cope with negative deltas */
+	if (delta >= 0)
+		tsc += pvclock_scale_delta(delta,
+					   vcpu->arch.virtual_tsc_mult,
+					   vcpu->arch.virtual_tsc_shift);
+	else
+		tsc -= pvclock_scale_delta(-delta,
+					   vcpu->arch.virtual_tsc_mult,
+					   vcpu->arch.virtual_tsc_shift);
+
 	return tsc;
 }
 
@@ -2595,7 +2604,7 @@ static inline bool gtod_is_based_on_tsc(int mode)
 }
 #endif
 
-static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
+static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &vcpu->kvm->arch;
@@ -2612,12 +2621,9 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 
 	/*
 	 * Request a masterclock update if the masterclock needs to be toggled
-	 * on/off, or when starting a new generation and the masterclock is
-	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
-	 * taken _after_ the new generation is created).
+	 * on/off.
 	 */
-	if ((ka->use_master_clock && new_generation) ||
-	    (ka->use_master_clock != use_master_clock))
+	if (ka->use_master_clock != use_master_clock)
 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -2803,7 +2809,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
 	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
 
-	kvm_track_tsc_matching(vcpu, !matched);
+	kvm_track_tsc_matching(vcpu);
 }
 
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
-- 
2.39.3



* [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
  2026-01-15 20:22 [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
  2026-01-15 20:22 ` [PATCH 1/3] KVM: x86: Fix compute_guest_tsc() to cope with negative delta Dongli Zhang
@ 2026-01-15 20:22 ` Dongli Zhang
  2026-05-09 20:04   ` David Woodhouse
  2026-01-15 20:22 ` [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy() Dongli Zhang
  2026-01-15 20:37 ` [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
  3 siblings, 1 reply; 16+ messages in thread
From: Dongli Zhang @ 2026-01-15 20:22 UTC (permalink / raw)
  To: kvm
  Cc: seanjc, pbonzini, dwmw2, dwmw, paul, tglx, mingo, bp, dave.hansen,
	x86, hpa, linux-kernel, joe.jin, dongli.zhang

The KVM_SET_CLOCK command calls pvclock_update_vm_gtod_copy() to update the
masterclock data.

Many vCPUs may already have KVM_REQ_MASTERCLOCK_UPDATE pending before
KVM_SET_CLOCK is invoked. If pvclock_update_vm_gtod_copy() decides to use
the masterclock, there is no need to update the masterclock multiple times
afterward. As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
force masterclock update on vCPU hotplug"), each unnecessary
KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.

Therefore, clear KVM_REQ_MASTERCLOCK_UPDATE for each vCPU at the end of
KVM_SET_CLOCK when the master clock is active. The 'tsc_write_lock' ensures
that only requests issued before KVM_SET_CLOCK are cleared.

Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 arch/x86/kvm/x86.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5e7418cfd0af..0599949a7803 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7173,6 +7173,8 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_arch *ka = &kvm->arch;
 	struct kvm_clock_data data;
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
 	u64 now_raw_ns;
 
 	if (copy_from_user(&data, argp, sizeof(data)))
@@ -7211,6 +7213,12 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	else
 		now_raw_ns = get_kvmclock_base_ns();
 	ka->kvmclock_offset = data.clock - now_raw_ns;
+
+	if (kvm->arch.use_master_clock) {
+		kvm_for_each_vcpu(i, vcpu, kvm)
+			kvm_clear_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+	}
+
 	kvm_end_pvclock_update(kvm);
 	return 0;
 }
-- 
2.39.3



* [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
  2026-01-15 20:22 [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
  2026-01-15 20:22 ` [PATCH 1/3] KVM: x86: Fix compute_guest_tsc() to cope with negative delta Dongli Zhang
  2026-01-15 20:22 ` [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK Dongli Zhang
@ 2026-01-15 20:22 ` Dongli Zhang
  2026-05-09 12:22   ` David Woodhouse
  2026-01-15 20:37 ` [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
  3 siblings, 1 reply; 16+ messages in thread
From: Dongli Zhang @ 2026-01-15 20:22 UTC (permalink / raw)
  To: kvm
  Cc: seanjc, pbonzini, dwmw2, dwmw, paul, tglx, mingo, bp, dave.hansen,
	x86, hpa, linux-kernel, joe.jin, dongli.zhang

pvclock_update_vm_gtod_copy() unconditionally updates ka->master_kernel_ns
and ka->master_cycle_now whenever a KVM_REQ_MASTERCLOCK_UPDATE is
processed. Unfortunately, each masterclock update increases the risk of
kvm-clock drift.

Introduce a 'forced' argument to indicate the call site. When
pvclock_update_vm_gtod_copy() is not called from
vcpu_enter_guest()-->kvm_update_masterclock(), keep the existing workflow
and update the masterclock data unconditionally.

Otherwise, avoid updating the masterclock data if the masterclock is
already active and will remain active. In that case, updating the
masterclock data is not beneficial and can instead lead to kvm-clock
drift.

As a result, this patch minimizes unnecessary masterclock data updates and
thus the chance of kvm-clock drift.

Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 arch/x86/kvm/x86.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0599949a7803..d2ce696abf55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3108,12 +3108,15 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
  *
  */
 
-static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
+static void pvclock_update_vm_gtod_copy(struct kvm *kvm, bool forced)
 {
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
 	bool host_tsc_clocksource, vcpus_matched;
+	bool use_master_clock;
+	u64 master_kernel_ns;
+	u64 master_cycle_now;
 
 	lockdep_assert_held(&kvm->arch.tsc_write_lock);
 	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
@@ -3124,12 +3127,26 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	 * to the guest.
 	 */
 	host_tsc_clocksource = kvm_get_time_and_clockread(
-					&ka->master_kernel_ns,
-					&ka->master_cycle_now);
+					&master_kernel_ns,
+					&master_cycle_now);
+
+	use_master_clock = host_tsc_clocksource && vcpus_matched
+			    && !ka->backwards_tsc_observed
+			    && !ka->boot_vcpu_runs_old_kvmclock;
+
+	/*
+	 * Always update masterclock data unconditionally if not for
+	 * KVM_REQ_MASTERCLOCK_UPDATE request.
+	 *
+	 * Otherwise, do not update masterclock data if it is already
+	 * active and will remain active.
+	 */
+	if (forced || !(use_master_clock && ka->use_master_clock)) {
+		ka->master_kernel_ns = master_kernel_ns;
+		ka->master_cycle_now = master_cycle_now;
+	}
 
-	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
-				&& !ka->backwards_tsc_observed
-				&& !ka->boot_vcpu_runs_old_kvmclock;
+	ka->use_master_clock = use_master_clock;
 
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
@@ -3179,7 +3196,7 @@ static void kvm_update_masterclock(struct kvm *kvm)
 {
 	kvm_hv_request_tsc_page_update(kvm);
 	kvm_start_pvclock_update(kvm);
-	pvclock_update_vm_gtod_copy(kvm);
+	pvclock_update_vm_gtod_copy(kvm, false);
 	kvm_end_pvclock_update(kvm);
 }
 
@@ -7189,7 +7206,7 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 
 	kvm_hv_request_tsc_page_update(kvm);
 	kvm_start_pvclock_update(kvm);
-	pvclock_update_vm_gtod_copy(kvm);
+	pvclock_update_vm_gtod_copy(kvm, true);
 
 	/*
 	 * This pairs with kvm_guest_time_update(): when masterclock is
@@ -9773,7 +9790,7 @@ static void kvm_hyperv_tsc_notifier(void)
 
 	list_for_each_entry(kvm, &vm_list, vm_list) {
 		__kvm_start_pvclock_update(kvm);
-		pvclock_update_vm_gtod_copy(kvm);
+		pvclock_update_vm_gtod_copy(kvm, true);
 		kvm_end_pvclock_update(kvm);
 	}
 
@@ -13206,7 +13223,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	pvclock_update_vm_gtod_copy(kvm);
+	pvclock_update_vm_gtod_copy(kvm, true);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
 	kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
-- 
2.39.3



* Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update
  2026-01-15 20:22 [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
                   ` (2 preceding siblings ...)
  2026-01-15 20:22 ` [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy() Dongli Zhang
@ 2026-01-15 20:37 ` Dongli Zhang
  2026-01-15 21:13   ` David Woodhouse
  3 siblings, 1 reply; 16+ messages in thread
From: Dongli Zhang @ 2026-01-15 20:37 UTC (permalink / raw)
  To: dwmw2, kvm
  Cc: seanjc, pbonzini, dwmw, paul, tglx, mingo, bp, dave.hansen, x86,
	hpa, linux-kernel, joe.jin

Hi David,

On 1/15/26 12:22 PM, Dongli Zhang wrote:
> As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
> force masterclock update on vCPU hotplug"), each unnecessary
> KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
> 
> Although that commit addressed the kvm-clock drift issue during vCPU
> hotplug, there are still unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests
> during live migration on the target host.
> 
> The patchset below was authored by David Woodhouse. Two of the patches aim
> to avoid unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests.
> 
> [RFC PATCH v3 00/21] Cleaning up the KVM clock mess
> https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org/
> 
> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
> [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
> 
> The current patchset has three patches.
> 
> PATCH 1 is a partial copy of "[RFC PATCH v3 10/21] KVM: x86: Fix software
> TSC upscaling in kvm_update_guest_time()", as Sean suggested, "Please do
> this in a separate patch. There's no need to squeeze it in here, and this
> change is complex/subtle enough as it is.", and David's authorship is
> preserved.
> 

Please let me know if this is inappropriate and whether I should have
confirmed with you before reusing your code from the patch below, with your
authorship preserved.

[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/

The objective is to trigger a discussion on whether there is any quick,
short-term solution to mitigate the kvm-clock drift issue. We can also
resurrect your patchset.

I have some other work in QEMU userspace.

[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/

The combination of changes in QEMU and this KVM patchset can make the
kvm-clock drift during live migration negligible.

Thank you very much!

Dongli Zhang


* Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update
  2026-01-15 20:37 ` [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
@ 2026-01-15 21:13   ` David Woodhouse
  2026-01-16  9:31     ` Dongli Zhang
  0 siblings, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2026-01-15 21:13 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin


On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
> 
> Please let me know if this is inappropriate and whether I should have
> confirmed with you before reusing your code from the patch below, with your
> authorship preserved.
> 
> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
> 
> The objective is to trigger a discussion on whether there is any quick,
> short-term solution to mitigate the kvm-clock drift issue. We can also
> resurrect your patchset.
> 
> I have some other work in QEMU userspace.
> 
> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
> 
> The combination of changes in QEMU and this KVM patchset can make kvm-clock
> drift during live migration very very trivial.
> 
> Thank you very much!

Not at all inappropriate; thank you so much for updating it. I've been
meaning to do so but it's never made it back to the top of my list.

I don't believe that the existing KVM_SET_CLOCK is viable though. The
aim is that you should be able to create a new KVM on the same host and
set the kvmclock, and the contents of the pvclock that the new guest
sees should be *identical*. Not just 'close'.

I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
feasible, so I'd very much prefer that any resurrection of this series
should include that, even if some of the other patches are dropped for
now.

Thanks again.



* Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update
  2026-01-15 21:13   ` David Woodhouse
@ 2026-01-16  9:31     ` Dongli Zhang
  2026-01-22  5:01       ` Dongli Zhang
  2026-01-24  1:31       ` David Woodhouse
  0 siblings, 2 replies; 16+ messages in thread
From: Dongli Zhang @ 2026-01-16  9:31 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

Hi David,

On 1/15/26 1:13 PM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
>>
>> Please let me know if this is inappropriate and whether I should have
>> confirmed with you before reusing your code from the patch below, with your
>> authorship preserved.
>>
>> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
>> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
>>
>> The objective is to trigger a discussion on whether there is any quick,
>> short-term solution to mitigate the kvm-clock drift issue. We can also
>> resurrect your patchset.
>>
>> I have some other work in QEMU userspace.
>>
>> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
>> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>>
>> The combination of changes in QEMU and this KVM patchset can make kvm-clock
>> drift during live migration very very trivial.
>>
>> Thank you very much!
> 
> Not at all inappropriate; thank you so much for updating it. I've been
> meaning to do so but it's never made it back to the top of my list.
> 
> I don't believe that the existing KVM_SET_CLOCK is viable though. The
> aim is that you should be able to create a new KVM on the same host and
> set the kvmclock, and the contents of the pvclock that the new guest
> sees should be *identical*. Not just 'close'.
> 
> I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
> feasible, so I'd very much prefer that any resurrection of this series
> should include that, even if some of the other patches are dropped for
> now.
> 
> Thanks again.

Thank you very much for the feedback.

The issue addressed by this patchset cannot be resolved by
KVM_[GS]ET_CLOCK_GUEST alone.

The problem I am trying to solve is avoiding unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests. Even when using KVM_[GS]ET_CLOCK_GUEST, if
vCPUs already have pending KVM_REQ_MASTERCLOCK_UPDATE requests, unpausing the
vCPUs from the host userspace VMM (i.e., QEMU) can still trigger multiple master
clock updates - typically proportional to the number of vCPUs.

As we know, each KVM_REQ_MASTERCLOCK_UPDATE can cause unexpected kvm-clock
forward/backward drift.

Therefore, rather than KVM_[GS]ET_CLOCK_GUEST, this patchset is more
closely related to two other patches of yours, which define a new policy to
minimize KVM_REQ_MASTERCLOCK_UPDATE requests.

[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
[RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset
from each other


Consider the combination of QEMU and KVM. The following details explain
the problems I am trying to address.

(Assuming TSC scaling is *inactive*)


## Problem 1. Accounting live migration downtime in kvm-clock and the guest TSC.

So far, QEMU/KVM live migration does not account for the entire elapsed blackout
For example, if a guest is live-migrated to a file, left idle for one hour, and
then restored from that file to the target host, the one-hour blackout period
will not be reflected in the kvm-clock or guest TSC.

This can be resolved by leveraging KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU. I have sent a QEMU patch (and just received your feedback on that thread).

[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/


## Problem 2. The kvm-clock drifts due to changes in the PVTI data.

Unlike the previous vCPU hotplug-related kvm-clock drift issue, during live
migration the amount of drift is not determined by the time elapsed between two
masterclock updates. Instead, it occurs because guest_clock and guest_tsc are
not stopped or resumed at the same point in time.

For example, MSR_IA32_TSC and KVM_GET_CLOCK are used to save guest_tsc and
guest_clock on the source host. This is effectively equivalent to stopping their
counters. However, they are not stopped simultaneously: guest_tsc stops at time
point P1, while guest_clock stops at time point P2.

- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
... ...
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_GET_CLOCK                               ===> P2

On the target host, QEMU restores the saved values using MSR_IA32_TSC and
KVM_SET_CLOCK. As a result, guest_tsc resumes counting at time point P3, while
guest_clock resumes counting at time point P4.

- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
... ...
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_SET_CLOCK                               ====> P4


Therefore, below are the equations I use to calculate the expected kvm-clock drift.

T1_ns  = P2 - P1 (nanoseconds)
T2_tsc = P4 - P3 (cycles)
T2_ns  = pvclock_scale_delta(T2_tsc,
                             hv_clock_src.tsc_to_system_mul,
                             hv_clock_src.tsc_shift)

if (T2_ns > T1_ns)
    backward drift: T2_ns - T1_ns
else if (T1_ns > T2_ns)
    forward drift: T1_ns - T2_ns


To fix this issue, ideally both guest_tsc and guest_clock should be stopped and
resumed at exactly the same time.

As you mentioned in the QEMU patch, "the kvmclock should be a fixed relationship
from the guest's TSC which doesn't change for the whole lifetime of the guest."

Fortunately, to take advantage of KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU can achieve the same goal.

[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/


## Problem 3. Unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests are triggered
for the vCPUs.

During kvm_synchronize_tsc() or kvm_arch_tsc_set_attr(KVM_VCPU_TSC_OFFSET),
KVM_REQ_MASTERCLOCK_UPDATE requests may be set either before or after KVM_SET_CLOCK.

As a result, once all vCPUs are unpaused, these unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests can lead to kvm-clock drift.

In fact, PATCH 1 and PATCH 3 of this patchset alone are sufficient to
mitigate the issue.


With the above changes in both QEMU and KVM, a same-host live migration of a
4-vCPU VM with approximately 10 seconds of downtime (introduced on purpose)
results in only about 4 nanoseconds of backward drift in my test environment.
We may even be able to make further improvements in QEMU to eliminate the
remaining 4 nanoseconds.

old_clock->tsc_timestamp = 32041800585
old_clock->system_time = 3639151
old_clock->tsc_to_system_mul = 3186238974
old_clock->tsc_shift = -1

new_clock->tsc_timestamp = 213016088950
new_clock->system_time = 67131895453
new_clock->tsc_to_system_mul = 3186238974
new_clock->tsc_shift =  -1

If I do not introduce the ~10 seconds of downtime on purpose during live
migration, the drift is always 0 nanoseconds.

I introduce downtime on purpose by stopping the target QEMU before live
migration. The target QEMU will not resume until the 'cont' command is issued in
the QEMU monitor.


Regarding the goal, I would appreciate any quick solution (even a
short-term one) or half-measure that supports:

- Account for live migration downtime.
- Minimize kvm-clock drift (especially backward).

Thank you very much!

Dongli Zhang



* Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update
  2026-01-16  9:31     ` Dongli Zhang
@ 2026-01-22  5:01       ` Dongli Zhang
  2026-01-24  1:31       ` David Woodhouse
  1 sibling, 0 replies; 16+ messages in thread
From: Dongli Zhang @ 2026-01-22  5:01 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin



On 1/16/26 1:31 AM, Dongli Zhang wrote:
> Hi David,
> 
> On 1/15/26 1:13 PM, David Woodhouse wrote:
>> On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
>>>
>>> Please let me know if this is inappropriate and whether I should have
>>> confirmed with you before reusing your code from the patch below, with your
>>> authorship preserved.
>>>
>>> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
>>> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
>>>
>>> The objective is to trigger a discussion on whether there is any quick,
>>> short-term solution to mitigate the kvm-clock drift issue. We can also
>>> resurrect your patchset.
>>>
>>> I have some other work in QEMU userspace.
>>>
>>> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
>>> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>>>
>>> The combination of changes in QEMU and this KVM patchset can make kvm-clock
>>> drift during live migration very very trivial.
>>>
>>> Thank you very much!
>>
>> Not at all inappropriate; thank you so much for updating it. I've been
>> meaning to do so but it's never made it back to the top of my list.
>>
>> I don't believe that the existing KVM_SET_CLOCK is viable though. The
>> aim is that you should be able to create a new KVM on the same host and
>> set the kvmclock, and the contents of the pvclock that the new guest
>> sees should be *identical*. Not just 'close'.
>>
>> I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
>> feasible, so I'd very much prefer that any resurrection of this series
>> should include that, even if some of the other patches are dropped for
>> now.
>>
>> Thanks again.
> 
> Thank you very much for the feedback.
> 
> The issue addressed by this patchset cannot be resolved only by
> KVM_[GS]ET_CLOCK_GUEST.
> 
> The problem I am trying to solve is avoiding unnecessary
> KVM_REQ_MASTERCLOCK_UPDATE requests. Even when using KVM_[GS]ET_CLOCK_GUEST, if
> vCPUs already have pending KVM_REQ_MASTERCLOCK_UPDATE requests, unpausing the
> vCPUs from the host userspace VMM (i.e., QEMU) can still trigger multiple master
> clock updates - typically proportional to the number of vCPUs.
> 
> As we known, each KVM_REQ_MASTERCLOCK_UPDATE can cause unexpected kvm-clock
> forward/backward drift.
> 
> Therefore, rather than KVM_[GS]ET_CLOCK_GUEST, this patchset is more relevant to
> the other two of your patches, defining a new policy to minimize
> KVM_REQ_MASTERCLOCK_UPDATE.
> 
> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
> [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset
> from each other
> 
> 
> Suppose the combination of QEMU and KVM. The following details explain the
> problem I am trying to address.
> 
> (Assuming TSC scaling is *inactive*)
> 
> 
> ## Problem 1. Account the live migration downtimes into kvm-clock and guest_tsc.
> 
> So far, QEMU/KVM live migration does not account all elapsed blackout downtimes.
> For example, if a guest is live-migrated to a file, left idle for one hour, and
> then restored from that file to the target host, the one-hour blackout period
> will not be reflected in the kvm-clock or guest TSC.
> 
> This can be resolved by leveraging KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
> QEMU. I have sent a QEMU patch (and just received your feedback on that thread).
> 
> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
> 
> 
> ## Problem 2. The kvm-clock drifts due to changes in the PVTI data.
> 
> Unlike the previous vCPU hotplug-related kvm-clock drift issue, during live
> migration the amount of drift is not determined by the time elapsed between two
> masterclock updates. Instead, it occurs because guest_clock and guest_tsc are
> not stopped or resumed at the same point in time.
> 
> For example, MSR_IA32_TSC and KVM_GET_CLOCK are used to save guest_tsc and
> guest_clock on the source host. This is effectively equivalent to stopping their
> counters. However, they are not stopped simultaneously: guest_tsc stops at time
> point P1, while guest_clock stops at time point P2.
> 
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
> ... ...
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
> - KVM_GET_CLOCK                               ===> P2
> 
> On the target host, QEMU restores the saved values using MSR_IA32_TSC and
> KVM_SET_CLOCK. As a result, guest_tsc resumes counting at time point P3, while
> guest_clock resumes counting at time point P4.
> 
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P3
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
> ... ...
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
> - KVM_SET_CLOCK                               ===> P4
> 
> 
> Therefore, below are the equations I use to calculate the expected kvm-clock drift.
> 
> T1_ns  = P2 - P1 (nanoseconds)
> T2_tsc = P4 - P3 (cycles)
> T2_ns  = pvclock_scale_delta(T2_tsc,
>                              hv_clock_src.tsc_to_system_mul,
>                              hv_clock_src.tsc_shift)
> 
> if (T2_ns > T1_ns)
>     backward drift: T2_ns - T1_ns
> else if (T1_ns > T2_ns)
>     forward drift: T1_ns - T2_ns

Here are more details explaining how the live-migration kvm-clock drift is predicted.

As we know, when the masterclock is active, the PVTI is a snapshot created at a
specific point in time as a base to help calculate the kvm-clock.

struct pvclock_vcpu_time_info {
    u32   version;
    u32   pad0;
    u64   tsc_timestamp;
    u64   system_time;
    u32   tsc_to_system_mul;
    s8    tsc_shift;
    u8    flags;
    u8    pad[2];
} __attribute__((__packed__)); /* 32 bytes */

PVTI->tsc_timestamp is the guest TSC when the PVTI is snapshotted.

PVTI->system_time is the kvm-clock when the PVTI is snapshotted.

Ideally, the data in the PVTI remains unchanged.

However, let's assume the PVTI data changes *every nanosecond*.

As you mentioned, "the kvmclock should be a fixed relationship from the guest's
TSC which doesn't change for the whole lifetime of the guest."

We expect both PVTI->tsc_timestamp and PVTI->system_time to increment at the
same speed.

For instance, after GT_0 guest TSC cycles ...

PVTI->tsc_timestamp += GT_0
PVTI->system_time   += pvclock_scale_delta(GT_0)

... after another GT_1 guest TSC cycles ...

PVTI->tsc_timestamp += GT_1
PVTI->system_time   += pvclock_scale_delta(GT_1)

... ...
... ...

... after another GT_N guest TSC cycles ...

PVTI->tsc_timestamp += GT_N
PVTI->system_time   += pvclock_scale_delta(GT_N)


However, in QEMU, the guest TSC and kvm-clock are not stopped or resumed at the
same time.

P2 − P1 is the number of nanoseconds by which PVTI->system_time increments after
PVTI->tsc_timestamp stops incrementing.

P4 − P3 is the number of guest TSC cycles (assuming TSC scaling is inactive) by
which PVTI->tsc_timestamp increments while PVTI->system_time remains stopped
until P4.

Therefore, if (P4 − P3) > (P2 − P1), it indicates that PVTI->tsc_timestamp moves
forward more than PVTI->system_time, which will cause the kvm-clock to go backward.

Thank you very much!

Dongli Zhang


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update
  2026-01-16  9:31     ` Dongli Zhang
  2026-01-22  5:01       ` Dongli Zhang
@ 2026-01-24  1:31       ` David Woodhouse
  1 sibling, 0 replies; 16+ messages in thread
From: David Woodhouse @ 2026-01-24  1:31 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 752 bytes --]

On Fri, 2026-01-16 at 01:31 -0800, Dongli Zhang wrote:
> 
> With above changes in both QEMU and KVM, a same-host live migration of a 4-vCPU
> VM with approximately 10 seconds of downtime (introduced on purpose) results in
> only about 4 nanoseconds of backward drift in my test environment. We may even
> be able to make further improvements in QEMU to eliminate the remaining 4 nanoseconds.

On the same host, even with TSC scaling, there is no excuse for *any*
errors on live migration.

The *offset* of the host → guest TSC should remain precisely the same.
And the calculation of KVM clock from guest TSC should remain precisely the same.

Absolutely *zero* error.

Don't bother with "improvements" which still don't get that right.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
  2026-01-15 20:22 ` [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy() Dongli Zhang
@ 2026-05-09 12:22   ` David Woodhouse
  2026-05-12  0:16     ` Dongli Zhang
  0 siblings, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2026-05-09 12:22 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 1695 bytes --]

On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> The pvclock_update_vm_gtod_copy() function always unconditionally updates
> ka->master_kernel_ns and ka->master_cycle_now whenever a
> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
> increases the risk of kvm-clock drift.
> 
> If pvclock_update_vm_gtod_copy() is not called from
> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
> workflow. The argument 'forced' is introduced to tell where it is from.
> 
> Otherwise, we avoid updating the masterclock if it is already
> active and will remain active. In such cases, updating the masterclock
> data is not beneficial and can instead lead to kvm-clock drift.
> 
> As a result, this patch minimizes the chance of unnecessary masterclock
> data updates to avoid kvm-clock drift.
> 
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>

Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
set the 'force' argument is the one in kvm_update_masterclock(), so we
are asserting that kvm_update_masterclock() never needs to *change* the
masterclock origin point, if it was already set?

The gtod notifier callback in pvclock_gtod_update_fn() also ends up
setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
timekeeping update (which could potentially be from a clocksource
change). It is also hypothetically possible that the clocksource changes
from TSC → HPET → TSC, switching back to TSC again before the
masterclock update ever gets to run. Or maybe a suspend/resume?

Are you *sure* that the optimisation is always valid...?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
  2026-01-15 20:22 ` [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK Dongli Zhang
@ 2026-05-09 20:04   ` David Woodhouse
  2026-05-12  0:21     ` Dongli Zhang
  0 siblings, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2026-05-09 20:04 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 7178 bytes --]

On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> The KVM_SET_CLOCK command calls pvclock_update_vm_gtod_copy() to update the
> masterclock data.
> 
> Many vCPUs may already have KVM_REQ_MASTERCLOCK_UPDATE pending before
> KVM_SET_CLOCK is invoked. If pvclock_update_vm_gtod_copy() decides to use
> the masterclock, there is no need to update the masterclock multiple times
> afterward. As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
> force masterclock update on vCPU hotplug"), each unnecessary
> KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
> 
> Therefore, clear KVM_REQ_MASTERCLOCK_UPDATE for each vCPU at the end of
> KVM_SET_CLOCK when the master clock is active. The 'tsc_write_lock' ensures
> that only requests issued before KVM_SET_CLOCK are cleared.

Hm, I think we should do this in kvm_make_mclock_inprogress_request()
instead.

Currently, things like pvclock_gtod_update_fn() and one or two others
set KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Not just one, because we
don't want *any* vCPU going back into guest mode before doing the work.

So they could *all* end up calling into kvm_update_masterclock() at the
same time. I have a vague recollection there was once a mutex there,
but now there isn't.

So all vCPUs can each, in parallel, set KVM_REQ_MCLOCK_INPROGRESS on
all other vCPUs. That's an unhandled request which just makes a vCPU
spin (preventing it from entering the guest until the clock update
completes).

Then each vCPU tries to take the kvm->arch.tsc_write_lock in
kvm_start_pvclock_update(), and from this point they'll be
serialized... but every vCPU will go through the whole of
pvclock_update_vm_gtod_copy() for itself, updating the masterclock one
more time for each vCPU in the system?

I came here to say that kvm_make_mclock_inprogress_request() should
probably clear KVM_REQ_MASTERCLOCK_UPDATE on all other vCPUs, since
it's about to do the work, and the other vCPUs will no longer need to.

But... it's more broken than that now. 

I think maybe vcpu_enter_guest() should call kvm_update_masterclock()
if *kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE)* (i.e. not *clear* the
bit by using kvm_check_request()).

And then after getting ->arch.tsc_write_lock,
kvm_start_pvclock_update() should check if KVM_REQ_MASTERCLOCK_UPDATE
is *still* set, and only proceed with the update if it is?

I'll play with it some more... something like this perhaps?

From 91e37d74ee267c722bc2c8f26754fcdfbc040e29 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Sat, 9 May 2026 16:54:01 +0100
Subject: [PATCH] KVM: x86: Avoid redundant masterclock updates from multiple
 vCPUs

When a masterclock update is triggered (e.g. by the clocksource change
notifier), KVM_REQ_MASTERCLOCK_UPDATE is set on all vCPUs. Without this
fix, each vCPU independently processes the request and redundantly
re-executes the entire pvclock_update_vm_gtod_copy() sequence, serialized
only by tsc_write_lock. Each redundant re-snapshot of the master clock
reference point introduces potential clock drift.

Fix this by having __kvm_start_pvclock_update() check, after acquiring
the lock, whether the requesting vCPU's KVM_REQ_MASTERCLOCK_UPDATE is
still set. If another vCPU already did the update and cleared it, bail
out. Otherwise, clear the request on all other vCPUs before proceeding.

The caller in vcpu_enter_guest() now uses kvm_test_request() (non-clearing)
since the clearing is done inside __kvm_start_pvclock_update() under the
lock.

Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 56 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed0f7dec08bd..b249c4a6048c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3259,10 +3259,39 @@ static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
 }
 
-static void __kvm_start_pvclock_update(struct kvm *kvm)
+static void kvm_clear_mclock_inprogress_request(struct kvm *kvm)
 {
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+}
+
+static bool __kvm_start_pvclock_update(struct kvm *kvm, struct kvm_vcpu *requesting_vcpu)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
 	raw_spin_lock_irq(&kvm->arch.tsc_write_lock);
+
+	/*
+	 * If another vCPU already did the update while we were waiting
+	 * for the lock, our request will have been cleared. Bail out.
+	 */
+	if (requesting_vcpu &&
+	    !kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE, requesting_vcpu)) {
+		kvm_clear_mclock_inprogress_request(kvm);
+		raw_spin_unlock_irq(&kvm->arch.tsc_write_lock);
+		return false;
+	}
+
+	/* The update is VM-wide; prevent other vCPUs from redoing it. */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_clear_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
 	write_seqcount_begin(&kvm->arch.pvclock_sc);
+	return true;
 }
 
 static void kvm_start_pvclock_update(struct kvm *kvm)
@@ -3270,7 +3299,7 @@ static void kvm_start_pvclock_update(struct kvm *kvm)
 	kvm_make_mclock_inprogress_request(kvm);
 
 	/* no guest entries from this point */
-	__kvm_start_pvclock_update(kvm);
+	__kvm_start_pvclock_update(kvm, NULL);
 }
 
 static void kvm_end_pvclock_update(struct kvm *kvm)
@@ -3279,22 +3308,25 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
 
-	write_seqcount_end(&ka->pvclock_sc);
-	raw_spin_unlock_irq(&ka->tsc_write_lock);
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 	/* guest entries allowed */
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+	kvm_clear_mclock_inprogress_request(kvm);
+
+	write_seqcount_end(&ka->pvclock_sc);
+	raw_spin_unlock_irq(&ka->tsc_write_lock);
 }
 
-static void kvm_update_masterclock(struct kvm *kvm)
+static void kvm_update_masterclock(struct kvm *kvm, struct kvm_vcpu *vcpu)
 {
 	kvm_hv_request_tsc_page_update(kvm);
-	kvm_start_pvclock_update(kvm);
-	pvclock_update_vm_gtod_copy(kvm);
-	kvm_end_pvclock_update(kvm);
+	kvm_make_mclock_inprogress_request(kvm);
+
+	if (__kvm_start_pvclock_update(kvm, vcpu)) {
+		pvclock_update_vm_gtod_copy(kvm);
+		kvm_end_pvclock_update(kvm);
+	}
 }
 
 /*
@@ -11481,8 +11513,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_mmu_free_obsolete_roots(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
-		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
-			kvm_update_masterclock(vcpu->kvm);
+		if (kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
+			kvm_update_masterclock(vcpu->kvm, vcpu);
 		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
 			kvm_gen_kvmclock_update(vcpu);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
-- 
2.43.0




^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
  2026-05-09 12:22   ` David Woodhouse
@ 2026-05-12  0:16     ` Dongli Zhang
  2026-05-12  5:21       ` David Woodhouse
  0 siblings, 1 reply; 16+ messages in thread
From: Dongli Zhang @ 2026-05-12  0:16 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin



On 5/9/26 5:22 AM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>> The pvclock_update_vm_gtod_copy() function always unconditionally updates
>> ka->master_kernel_ns and ka->master_cycle_now whenever a
>> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
>> increases the risk of kvm-clock drift.
>>
>> If pvclock_update_vm_gtod_copy() is not called from
>> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
>> workflow. The argument 'forced' is introduced to tell where it is from.
>>
>> Otherwise, we avoid updating the masterclock if it is already
>> active and will remain active. In such cases, updating the masterclock
>> data is not beneficial and can instead lead to kvm-clock drift.
>>
>> As a result, this patch minimizes the chance of unnecessary masterclock
>> data updates to avoid kvm-clock drift.
>>
>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> 
> Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
> set the 'force' argument is the one in kvm_update_masterclock(), so we
> are asserting that kvm_update_masterclock() never needs to *change* the
> masterclock origin point, if it was already set?
> 
> The gtod notifier callback in pvclock_gtod_update_fn() also ends up
> setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
> timekeeping update (which could potentially be from a clocksource
> change). It is also hypothetically possible that the clocksource changes
> from TSC → HPET → TSC, switching back to TSC again before the
> masterclock update ever gets to run. Or maybe a suspend/resume?
> 
> Are you *sure* that the optimisation is always valid...?

Thank you very much!

I didn't validate the scenario you mentioned. I missed it because I assumed
that most production systems nowadays use a STABLE/CONSTANT/NONSTOP TSC as the
host clocksource, although I sometimes forget to add "clocksource=tsc
tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guests).

I didn't follow up on this patch because I noticed another issue. I found that
the tsc_timestamp in the PVTI can become a very large number if we simply reboot
the guest VM. This happens because the patch stops updating the masterclock data
when the masterclock is already active and remains active.

For example:

current guest TSC: 122763682
PVTI->tsc_timestamp = 18446744073656540006
PVTI->system_time = 196515164269

Although I could not reproduce any bug, that still made me feel uncomfortable.

I think the patch you posted for PATCH 2/3 can fix the issue mentioned in this
patch.

https://lore.kernel.org/kvm/f37f0ae0ae1dfce3ad3c6fce653f5df34adecc0a.camel@infradead.org

I would also test with your patchset, if the unnecessary masterclock update is
avoided or minimized to one time.

https://lore.kernel.org/kvm/20260509224824.3264567-27-dwmw2@infradead.org/

Thank you very much!

Dongli Zhang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
  2026-05-09 20:04   ` David Woodhouse
@ 2026-05-12  0:21     ` Dongli Zhang
  2026-05-12  7:19       ` David Woodhouse
  0 siblings, 1 reply; 16+ messages in thread
From: Dongli Zhang @ 2026-05-12  0:21 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin



On 5/9/26 1:04 PM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>> The KVM_SET_CLOCK command calls pvclock_update_vm_gtod_copy() to update the
>> masterclock data.
>>
>> Many vCPUs may already have KVM_REQ_MASTERCLOCK_UPDATE pending before
>> KVM_SET_CLOCK is invoked. If pvclock_update_vm_gtod_copy() decides to use
>> the masterclock, there is no need to update the masterclock multiple times
>> afterward. As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
>> force masterclock update on vCPU hotplug"), each unnecessary
>> KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
>>
>> Therefore, clear KVM_REQ_MASTERCLOCK_UPDATE for each vCPU at the end of
>> KVM_SET_CLOCK when the master clock is active. The 'tsc_write_lock' ensures
>> that only requests issued before KVM_SET_CLOCK are cleared.
> 
> Hm, I think we should do this in kvm_make_mclock_inprogress_request()
> instead.
> 
> Currently, things like pvclock_gtod_update_fn() and one or two others
> set KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Not just one, because we
> don't want *any* vCPU going back into guest mode before doing the work.
> 
> So they could *all* end up calling into kvm_update_masterclock() at the
> same time. I have a vague recollection there was once a mutex there,
> but now there isn't.
> 
> So all vCPUs can each, in parallel, set KVM_REQ_MCLOCK_INPROGRESS on
> all other vCPUs. That's an unhandled request which just makes a vCPU
> spin (preventing it from entering the guest until the clock update
> completes).
> 
> Then each vCPU tries to take the kvm->arch.tsc_write_lock in
> kvm_start_pvclock_update(), and from this point they'll be
> serialized... but every vCPU will go through the whole of
> pvclock_update_vm_gtod_copy() for itself, updating the masterclock one
> more time for each vCPU in the system?
> 
> I came here to say that kvm_make_mclock_inprogress_request() should
> probably clear KVM_REQ_MASTERCLOCK_UPDATE on all other vCPUs, since
> it's about to do the work, and the other vCPUs will no longer need to.
> 
> But... it's more broken than that now. 
> 
> I think maybe vcpu_enter_guest() should call kvm_update_masterclock()
> if *kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE)* (i.e. not *clear* the
> bit by using kvm_check_request()).
> 
> And then after getting ->arch.tsc_write_lock,
> kvm_start_pvclock_update() should check if KVM_REQ_MASTERCLOCK_UPDATE
> is *still* set, and only proceed with the update if it is?
> 
> I'll play with it some more... something like this perhaps?
> 
> From 91e37d74ee267c722bc2c8f26754fcdfbc040e29 Mon Sep 17 00:00:00 2001
> From: David Woodhouse <dwmw@amazon.co.uk>
> Date: Sat, 9 May 2026 16:54:01 +0100
> Subject: [PATCH] KVM: x86: Avoid redundant masterclock updates from multiple
>  vCPUs
> 
> When a masterclock update is triggered (e.g. by the clocksource change
> notifier), KVM_REQ_MASTERCLOCK_UPDATE is set on all vCPUs. Without this
> fix, each vCPU independently processes the request and redundantly
> re-executes the entire pvclock_update_vm_gtod_copy() sequence, serialized
> only by tsc_write_lock. Each redundant re-snapshot of the master clock
> reference point introduces potential clock drift.
> 
> Fix this by having __kvm_start_pvclock_update() check, after acquiring
> the lock, whether the requesting vCPU's KVM_REQ_MASTERCLOCK_UPDATE is
> still set. If another vCPU already did the update and cleared it, bail
> out. Otherwise, clear the request on all other vCPUs before proceeding.
> 
> The caller in vcpu_enter_guest() now uses kvm_test_request() (non-clearing)
> since the clearing is done inside __kvm_start_pvclock_update() under the
> lock.
> 
> Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


Thank you very much! Feel free to add me as Suggested-by or Reported-by.

I will also help test your large patch series.

Regarding the issue, I would like to confirm whether there can still be any
KVM_REQ_MASTERCLOCK_UPDATE after KVM_SET_CLOCK_GUEST.

Per my understanding, I would expect KVM_SET_CLOCK_GUEST to be the last point
where the masterclock can be updated (during live migration or live update).

After KVM_SET_CLOCK_GUEST, any further masterclock change may cause kvm-clock drift.

Dongli Zhang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
  2026-05-12  0:16     ` Dongli Zhang
@ 2026-05-12  5:21       ` David Woodhouse
  2026-05-12 23:23         ` Dongli Zhang
  0 siblings, 1 reply; 16+ messages in thread
From: David Woodhouse @ 2026-05-12  5:21 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 3362 bytes --]

On Mon, 2026-05-11 at 17:16 -0700, Dongli Zhang wrote:
> 
> 
> On 5/9/26 5:22 AM, David Woodhouse wrote:
> > On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> > > The pvclock_update_vm_gtod_copy() function always unconditionally updates
> > > ka->master_kernel_ns and ka->master_cycle_now whenever a
> > > KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
> > > increases the risk of kvm-clock drift.
> > > 
> > > If pvclock_update_vm_gtod_copy() is not called from
> > > vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
> > > workflow. The argument 'forced' is introduced to tell where it is from.
> > > 
> > > Otherwise, we avoid updating the masterclock if it is already
> > > active and will remain active. In such cases, updating the masterclock
> > > data is not beneficial and can instead lead to kvm-clock drift.
> > > 
> > > As a result, this patch minimizes the chance of unnecessary masterclock
> > > data updates to avoid kvm-clock drift.
> > > 
> > > Cc: David Woodhouse <dwmw@amazon.co.uk>
> > > Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> > 
> > Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
> > set the 'force' argument is the one in kvm_update_masterclock(), so we
> > are asserting that kvm_update_masterclock() never needs to *change* the
> > masterclock origin point, if it was already set?
> > 
> > The gtod notifier callback in pvclock_gtod_update_fn() also ends up
> > setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
> > timekeeping update (which could potentially be from a clocksource
> > change). It is also hypothetically possible that the clocksource changes
> > from TSC → HPET → TSC, switching back to TSC again before the
> > masterclock update ever gets to run. Or maybe a suspend/resume?
> > 
> > Are you *sure* that the optimisation is always valid...?
> 
> Thank you very much!
> 
> I didn't validate the scenario you mentioned. I missed that scenario because I
> assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
> the host clocksource, although I sometimes forgot to add "clocksource=tsc
> tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).

I'd love to assume that, but we do still have to cater for systems
without it (or where it goes wrong). And where userspace sets up vCPUs
with different frequencies... although I'd quite like to ban that too
:)

> I didn't follow up on this patch because I noticed another issue. I found that
> the tsc_timestamp in the PVTI can become a very large number if we simply reboot
> the guest VM. This happens because the patch stops updating the masterclock data
> when the masterclock is already active and remains active.
> 
> For example:
> 
> current guest TSC: 122763682
> PVTI->tsc_timestamp = 18446744073656540006
> PVTI->system_time=196515164269
> 
> Although I could not reproduce any bug, that still made me feel uncomfortable.

That's just negative (-53011610). Theoretically it's OK; it's just a
consequence of using a reference point in the past. Probably just
*asking* for guest bugs though, so best avoided.

I'd like to understand how it got like that though. Did your 'reboot'
reset the guest TSC to zero but *not* its kvmclock?



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
  2026-05-12  0:21     ` Dongli Zhang
@ 2026-05-12  7:19       ` David Woodhouse
  0 siblings, 0 replies; 16+ messages in thread
From: David Woodhouse @ 2026-05-12  7:19 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 5887 bytes --]

On Mon, 2026-05-11 at 17:21 -0700, Dongli Zhang wrote:
> 
> 
> On 5/9/26 1:04 PM, David Woodhouse wrote:
> > On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> > > The KVM_SET_CLOCK command calls pvclock_update_vm_gtod_copy() to update the
> > > masterclock data.
> > > 
> > > Many vCPUs may already have KVM_REQ_MASTERCLOCK_UPDATE pending before
> > > KVM_SET_CLOCK is invoked. If pvclock_update_vm_gtod_copy() decides to use
> > > the masterclock, there is no need to update the masterclock multiple times
> > > afterward. As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
> > > force masterclock update on vCPU hotplug"), each unnecessary
> > > KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
> > > 
> > > Therefore, clear KVM_REQ_MASTERCLOCK_UPDATE for each vCPU at the end of
> > > KVM_SET_CLOCK when the master clock is active. The 'tsc_write_lock' ensures
> > > that only requests issued before KVM_SET_CLOCK are cleared.
> > 
> > Hm, I think we should do this in kvm_make_mclock_inprogress_request()
> > instead.
> > 
> > Currently, things like pvclock_gtod_update_fn() and one or two others
> > set KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Not just one, because we
> > don't want *any* vCPU going back into guest mode before doing the work.
> > 
> > So they could *all* end up calling into kvm_update_masterclock() at the
> > same time. I have a vague recollection there was once a mutex there,
> > but now there isn't.
> > 
> > So all vCPUs can each, in parallel, set KVM_REQ_MCLOCK_INPROGRESS on
> > all other vCPUs. That's an unhandled request which just makes a vCPU
> > spin (preventing it from entering the guest until the clock update
> > completes).
> > 
> > Then each vCPU tries to take the kvm->arch.tsc_write_lock in
> > kvm_start_pvclock_update(), and from this point they'll be
> > serialized... but every vCPU will go through the whole of
> > pvclock_update_vm_gtod_copy() for itself, updating the masterclock one
> > more time for each vCPU in the system?
> > 
> > I came here to say that kvm_make_mclock_inprogress_request() should
> > probably clear KVM_REQ_MASTERCLOCK_UPDATE on all other vCPUs, since
> > it's about to do the work, and the other vCPUs will no longer need to.
> > 
> > But... it's more broken than that now. 
> > 
> > I think maybe vcpu_enter_guest() should call kvm_update_masterclock()
> > if *kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE)* (i.e. not *clear* the
> > bit by using kvm_check_request()).
> > 
> > And then after getting ->arch.tsc_write_lock,
> > kvm_start_pvclock_update() should check if KVM_REQ_MASTERCLOCK_UPDATE
> > is *still* set, and only proceed with the update if it is?
> > 
> > I'll play with it some more... something like this perhaps?
> > 
> > From 91e37d74ee267c722bc2c8f26754fcdfbc040e29 Mon Sep 17 00:00:00 2001
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > Date: Sat, 9 May 2026 16:54:01 +0100
> > Subject: [PATCH] KVM: x86: Avoid redundant masterclock updates from multiple
> >  vCPUs
> > 
> > When a masterclock update is triggered (e.g. by the clocksource change
> > notifier), KVM_REQ_MASTERCLOCK_UPDATE is set on all vCPUs. Without this
> > fix, each vCPU independently processes the request and redundantly
> > re-executes the entire pvclock_update_vm_gtod_copy() sequence, serialized
> > only by tsc_write_lock. Each redundant re-snapshot of the master clock
> > reference point introduces potential clock drift.
> > 
> > Fix this by having __kvm_start_pvclock_update() check, after acquiring
> > the lock, whether the requesting vCPU's KVM_REQ_MASTERCLOCK_UPDATE is
> > still set. If another vCPU already did the update and cleared it, bail
> > out. Otherwise, clear the request on all other vCPUs before proceeding.
> > 
> > The caller in vcpu_enter_guest() now uses kvm_test_request() (non-clearing)
> > since the clearing is done inside __kvm_start_pvclock_update() under the
> > lock.
> > 
> > Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> 
> Thank you very much! Feel free to add me as Suggested-by or Reported-by.

Already did!

> I will also help test your large patch series.

Thanks.

> Regarding the issue, I would like to confirm whether there can still be any
> KVM_REQ_MASTERCLOCK_UPDATE after KVM_SET_CLOCK_GUEST.
> 
> Per my understanding, I would expect KVM_SET_CLOCK_GUEST to be the last point
> where the masterclock can be updated (during live migration or live update).
> 
> After KVM_SET_CLOCK_GUEST, any further masterclock change may cause kvm-clock drift.

I believe that for a guest with ->use_master_clock, there should be no
further masterclock change once it's running. We trigger
KVM_REQ_MASTERCLOCK_UPDATE only in exceptional cases:

 • kvm_write_system_time() — vCPU 0 writes the kvmclock MSR, switching
   between the old and new MSR (boot_vcpu_runs_old_kvmclock changes)

 • kvm_track_tsc_matching() — TSC matching state changes: either a new
   TSC generation starts, or master clock needs to be toggled on/off

 • pvclock_gtod_notify() via irq_work — host clocksource stops being
   TSC-based (e.g., TSC marked unstable, clocksource switched to HPET)

 • kvm_arch_hardware_enable() — host TSC went backwards during CPU
   online (suspend/resume, CPU hotplug) setting backwards_tsc_observed

IIRC the timekeeping code calls pvclock_gtod_notify *all* the time,
even when the clock rate is stable and NTP isn't applying any skew,
because a cycle rate of e.g. 333333⅓ cycles per tick will result in it
counting 333333, 333333, then 333334 cycles in consecutive ticks, and
it calls all the notifiers whenever it flips between 333333 and 333334.
But KVM only updates the master clock if the host clocksource becomes
non-TSC-based, so it doesn't hurt.



* Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
  2026-05-12  5:21       ` David Woodhouse
@ 2026-05-12 23:23         ` Dongli Zhang
  0 siblings, 0 replies; 16+ messages in thread
From: Dongli Zhang @ 2026-05-12 23:23 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin



On 5/11/26 10:21 PM, David Woodhouse wrote:
> On Mon, 2026-05-11 at 17:16 -0700, Dongli Zhang wrote:
>>
>>
>> On 5/9/26 5:22 AM, David Woodhouse wrote:
>>> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>>>> The pvclock_update_vm_gtod_copy() function unconditionally updates
>>>> ka->master_kernel_ns and ka->master_cycle_now whenever a
>>>> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
>>>> increases the risk of kvm-clock drift.
>>>>
>>>> If pvclock_update_vm_gtod_copy() is not called from
>>>> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
>>>> workflow. The argument 'forced' is introduced to tell where the call
>>>> comes from.
>>>>
>>>> Otherwise, we avoid updating the masterclock if it is already
>>>> active and will remain active. In such cases, updating the masterclock
>>>> data is not beneficial and can instead lead to kvm-clock drift.
>>>>
>>>> As a result, this patch minimizes the chance of unnecessary masterclock
>>>> data updates to avoid kvm-clock drift.
>>>>
>>>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>
>>> Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
>>> set the 'force' argument is the one in kvm_update_masterclock(), so we
>>> are asserting that kvm_update_masterclock() never needs to *change* the
>>> masterclock origin point, if it was already set?
>>>
>>> The gtod notifier callback in pvclock_gtod_update_fn() also ends up
>>> setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
>>> timekeeping update (which could potentially be from a clocksource
>>> change). It is also hypothetically possible that the clocksource changes
>>> from TSC → HPET → TSC, switching back to TSC again before the
>>> masterclock update ever gets to run. Or maybe a suspend/resume?
>>>
>>> Are you *sure* that the optimisation is always valid...?
>>
>> Thank you very much!
>>
>> I didn't validate the scenario you mentioned. I missed that scenario because I
>> assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
>> the host clocksource, although I sometimes forgot to add "clocksource=tsc
>> tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).
> 
> I'd love to assume that, but we do still have to cater for systems
> without it (or where it goes wrong). And where userspace sets up vCPUs
> with different frequencies... although I'd quite like to ban that too
> :)
> 
>> I didn't follow up on this patch because I noticed another issue. I found that
>> the tsc_timestamp in the PVTI can become a very large number if we simply reboot
>> the guest VM. This happens because the patch stops updating the masterclock data
>> when the masterclock is already active and remains active.
>>
>> For example:
>>
>> current guest TSC: 122763682
>> PVTI->tsc_timestamp = 18446744073656540006
>> PVTI->system_time=196515164269
>>
>> Although I could not reproduce any bug, that still made me feel uncomfortable.
> 
> That's just negative (-53011610). Theoretically it's OK; it's just a
> consequence of using a reference point in the past. Probably just
> *asking* for guest bugs though, so best avoided.
> 
> I'd like to understand how it got like that though. Did your 'reboot'
> reset the guest TSC to zero but *not* its kvmclock?
> 

Taking mainline QEMU as an example, I simply trigger a reboot from the guest VM.

A normal reboot from inside the guest VM resets only MSR_IA32_TSC;
userspace does not issue KVM_SET_CLOCK. Before the reboot, the
masterclock is enabled, and it remains enabled across
kvm_synchronize_tsc() for each vCPU. Therefore, with this patchset
applied, pvclock_update_vm_gtod_copy() does not update the masterclock
values.

As a result, tsc_timestamp eventually becomes negative.

For example, I applied the debug patch below, together with this
patchset, on top of the latest Linux kernel (acting as KVM).

https://lore.kernel.org/kvm/20260115202256.119820-1-dongli.zhang@oracle.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e88310f5979..d9f93cdb46ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3158,8 +3158,13 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm, bool forced)
        if (forced || !(use_master_clock && ka->use_master_clock)) {
                ka->master_kernel_ns = master_kernel_ns;
                ka->master_cycle_now = master_cycle_now;
+               pr_alert("debug: pvclock_update_vm_gtod_copy() update (%llu, %llu)\n",
+                        ka->master_kernel_ns, ka->master_cycle_now);
        }

+       pr_alert("debug: pvclock_update_vm_gtod_copy() old=%d, new=%d\n",
+                ka->use_master_clock, use_master_clock);
+
        ka->use_master_clock = use_master_clock;

        if (ka->use_master_clock)
@@ -3416,6 +3421,9 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
        hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
        vcpu->last_guest_tsc = tsc_timestamp;

+       pr_alert("debug: kvm_guest_time_update() vcpu=%u tsc=%llu time=%llu\n",
+                v->vcpu_id, hv_clock.tsc_timestamp, hv_clock.system_time);
+
        /* If the host uses TSC clocksource, then it is stable */
        hv_clock.flags = 0;
        if (use_master_clock)
@@ -4108,6 +4116,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
                vcpu->arch.msr_ia32_power_ctl = data;
                break;
        case MSR_IA32_TSC:
+               pr_alert("debug: set MSR_IA32_TSC vcpu=%u, host=%d, data=%llu\n",
+                        vcpu->vcpu_id, msr_info->host_initiated, data);
                if (msr_info->host_initiated) {
                        kvm_synchronize_tsc(vcpu, &data);
                } else if (!vcpu->arch.guest_tsc_protected) {


Below are tsc_timestamp and system_time before reboot.

[  154.159733] kvm: debug: kvm_guest_time_update() vcpu=3 tsc=97471036 time=4759583
[  154.160101] kvm: debug: kvm_guest_time_update() vcpu=2 tsc=97471036 time=4759583
[  154.160658] kvm: debug: kvm_guest_time_update() vcpu=1 tsc=97471036 time=4759583
[  154.161292] kvm: debug: kvm_guest_time_update() vcpu=0 tsc=97471036 time=4759583

After reboot, there will be TSC synchronization.

[  154.217551] kvm: debug: set MSR_IA32_TSC vcpu=0, host=1, data=1
[  154.217980] kvm: debug: set MSR_IA32_TSC vcpu=1, host=1, data=1
[  154.218384] kvm: debug: set MSR_IA32_TSC vcpu=2, host=1, data=1
[  154.218792] kvm: debug: set MSR_IA32_TSC vcpu=3, host=1, data=1

The masterclock remains enabled, so with this patchset applied,
pvclock_update_vm_gtod_copy() again skips updating the masterclock
values.

tsc_timestamp becomes negative (when viewed as a signed value), as the
TSC is reset to start from data=1.

[  154.219283] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.219671] kvm: debug: kvm_guest_time_update() vcpu=0 tsc=18446743945549702059 time=4759583
[  154.504499] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.504898] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.504898] kvm: debug: kvm_guest_time_update() vcpu=0 tsc=18446743945549702059 time=4759583
[  154.504898] kvm: debug: kvm_guest_time_update() vcpu=3 tsc=18446743945549702059 time=4759583
[  154.505240] kvm: debug: kvm_guest_time_update() vcpu=0 tsc=18446743945549702059 time=4759583
[  154.505240] kvm: debug: kvm_guest_time_update() vcpu=1 tsc=18446743945549702059 time=4759583
[  154.505241] kvm: debug: kvm_guest_time_update() vcpu=2 tsc=18446743945549702059 time=4759583


Usually, QEMU users (e.g., libvirt) reboot the guest VM from outside, that is:

1. Send shutdown to guest VM.
2. (qemu) system_reset
3. (qemu) cont

The cont command triggers KVM_SET_CLOCK, so we do not get a negative tsc_timestamp.

That is why I did not follow up on this patch set further.

Thank you very much!

Dongli Zhang




Thread overview: 16+ messages
2026-01-15 20:22 [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
2026-01-15 20:22 ` [PATCH 1/3] KVM: x86: Fix compute_guest_tsc() to cope with negative delta Dongli Zhang
2026-01-15 20:22 ` [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK Dongli Zhang
2026-05-09 20:04   ` David Woodhouse
2026-05-12  0:21     ` Dongli Zhang
2026-05-12  7:19       ` David Woodhouse
2026-01-15 20:22 ` [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy() Dongli Zhang
2026-05-09 12:22   ` David Woodhouse
2026-05-12  0:16     ` Dongli Zhang
2026-05-12  5:21       ` David Woodhouse
2026-05-12 23:23         ` Dongli Zhang
2026-01-15 20:37 ` [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
2026-01-15 21:13   ` David Woodhouse
2026-01-16  9:31     ` Dongli Zhang
2026-01-22  5:01       ` Dongli Zhang
2026-01-24  1:31       ` David Woodhouse
