Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
       [not found] ` <20260115202256.119820-4-dongli.zhang@oracle.com>
@ 2026-05-09 12:22   ` David Woodhouse
  2026-05-12  0:16     ` Dongli Zhang
  0 siblings, 1 reply; 7+ messages in thread
From: David Woodhouse @ 2026-05-09 12:22 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 1695 bytes --]

On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> The pvclock_update_vm_gtod_copy() function always unconditionally updates
> ka->master_kernel_ns and ka->master_cycle_now whenever a
> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
> increases the risk of kvm-clock drift.
> 
> If pvclock_update_vm_gtod_copy() is not called from
> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
> workflow. The argument 'forced' is introduced to tell where it is from.
> 
> Otherwise, we avoid updating the masterclock if it is already
> active and will remain active. In such cases, updating the masterclock
> data is not beneficial and can instead lead to kvm-clock drift.
> 
> As a result, this patch minimizes the chance of unnecessary masterclock
> data updates to avoid kvm-clock drift.
> 
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>

Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
set the 'force' argument is the one in kvm_update_masterclock(), so we
are asserting that kvm_update_masterclock() never needs to *change* the
masterclock origin point, if it was already set?

The gtod notifier callback in pvclock_gtod_update_fn() also ends up
setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
timekeeping update (which could potentially be from a clocksource
change). It also hypothetically possible that the clocksource changes
from TSC → HPET → TSC, switching back to TSC again before the
masterclock update ever gets to run. Or maybe a suspend/resume?

Are you *sure* that the optimisation is always valid...?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
  2026-05-09 12:22   ` [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy() David Woodhouse
@ 2026-05-12  0:16     ` Dongli Zhang
  2026-05-12  5:21       ` David Woodhouse
  0 siblings, 1 reply; 7+ messages in thread
From: Dongli Zhang @ 2026-05-12  0:16 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin



On 5/9/26 5:22 AM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>> The pvclock_update_vm_gtod_copy() function always unconditionally updates
>> ka->master_kernel_ns and ka->master_cycle_now whenever a
>> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
>> increases the risk of kvm-clock drift.
>>
>> If pvclock_update_vm_gtod_copy() is not called from
>> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
>> workflow. The argument 'forced' is introduced to tell where it is from.
>>
>> Otherwise, we avoid updating the masterclock if it is already
>> active and will remain active. In such cases, updating the masterclock
>> data is not beneficial and can instead lead to kvm-clock drift.
>>
>> As a result, this patch minimizes the chance of unnecessary masterclock
>> data updates to avoid kvm-clock drift.
>>
>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> 
> Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
> set the 'force' argument is the one in kvm_update_masterclock(), so we
> are asserting that kvm_update_masterclock() never needs to *change* the
> masterclock origin point, if it was already set?
> 
> The gtod notifier callback in pvclock_gtod_update_fn() also ends up
> setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
> timekeeping update (which could potentially be from a clocksource
> change). It also hypothetically possible that the clocksource changes
> from TSC → HPET → TSC, switching back to TSC again before the
> masterclock update ever gets to run. Or maybe a suspend/resume?
> 
> Are you *sure* that the optimisation is always valid...?

Thank you very much!

I didn't validate the scenario you mentioned. I missed that scenario because I
assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
the host clocksource, although I sometimes forgot to add "clocksource=tsc
tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).

I didn't follow up on this patch because I noticed another issue. I found that
the tsc_timestamp in the PVTI can become a very large number if we simply reboot
the guest VM. This happens because the patch stops updating the masterclock data
when the masterclock is already active and remains active.

For example:

current guest TSC: 122763682
PVTI->tsc_timestamp = 18446744073656540006
PVTI->system_time=196515164269

Although I could not reproduce any bug, that still made me feel uncomfortable.

I think the patch you posted for PATCH 2/3 can fix the issue mentioned in this
patch.

https://lore.kernel.org/kvm/f37f0ae0ae1dfce3ad3c6fce653f5df34adecc0a.camel@infradead.org

I would also test with your patchset, if the unnecessary masterclock update is
avoided or minimized to one time.

https://lore.kernel.org/kvm/20260509224824.3264567-27-dwmw2@infradead.org/

Thank you very much!

Dongli Zhang

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
  2026-05-12  0:16     ` Dongli Zhang
@ 2026-05-12  5:21       ` David Woodhouse
  2026-05-12 23:23         ` Dongli Zhang
  0 siblings, 1 reply; 7+ messages in thread
From: David Woodhouse @ 2026-05-12  5:21 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 3362 bytes --]

On Mon, 2026-05-11 at 17:16 -0700, Dongli Zhang wrote:
> 
> 
> On 5/9/26 5:22 AM, David Woodhouse wrote:
> > On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> > > The pvclock_update_vm_gtod_copy() function always unconditionally updates
> > > ka->master_kernel_ns and ka->master_cycle_now whenever a
> > > KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
> > > increases the risk of kvm-clock drift.
> > > 
> > > If pvclock_update_vm_gtod_copy() is not called from
> > > vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
> > > workflow. The argument 'forced' is introduced to tell where it is from.
> > > 
> > > Otherwise, we avoid updating the masterclock if it is already
> > > active and will remain active. In such cases, updating the masterclock
> > > data is not beneficial and can instead lead to kvm-clock drift.
> > > 
> > > As a result, this patch minimizes the chance of unnecessary masterclock
> > > data updates to avoid kvm-clock drift.
> > > 
> > > Cc: David Woodhouse <dwmw@amazon.co.uk>
> > > Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> > 
> > Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
> > set the 'force' argument is the one in kvm_update_masterclock(), so we
> > are asserting that kvm_update_masterclock() never needs to *change* the
> > masterclock origin point, if it was already set?
> > 
> > The gtod notifier callback in pvclock_gtod_update_fn() also ends up
> > setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
> > timekeeping update (which could potentially be from a clocksource
> > change). It also hypothetically possible that the clocksource changes
> > from TSC → HPET → TSC, switching back to TSC again before the
> > masterclock update ever gets to run. Or maybe a suspend/resume?
> > 
> > Are you *sure* that the optimisation is always valid...?
> 
> Thank you very much!
> 
> I didn't validate the scenario you mentioned. I missed that scenario because I
> assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
> the host clocksource, although I sometimes forgot to add "clocksource=tsc
> tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).

I'd love to assume that, but we do still have to cater for systems
without it (or where it goes wrong). And where userspace sets up vCPUs
with different frequencies... although I'd quite like to ban that too
:)

> I didn't follow up on this patch because I noticed another issue. I found that
> the tsc_timestamp in the PVTI can become a very large number if we simply reboot
> the guest VM. This happens because the patch stops updating the masterclock data
> when the masterclock is already active and remains active.
> 
> For example:
> 
> current guest TSC: 122763682
> PVTI->tsc_timestamp = 18446744073656540006
> PVTI->system_time=196515164269
> 
> Although I could not reproduce any bug, that still made me feel uncomfortable.

That's just negative (-53011610). Theoretically it's OK; it's just a
consequence of using a reference point in the past. Probably just
*asking* for guest bugs though, so best avoided.

I'd like to understand how it got like that though. Did your 'reboot'
reset the guest TSC to zero but *not* its kvmclock?


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
  2026-05-12  5:21       ` David Woodhouse
@ 2026-05-12 23:23         ` Dongli Zhang
  0 siblings, 0 replies; 7+ messages in thread
From: Dongli Zhang @ 2026-05-12 23:23 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin



On 5/11/26 10:21 PM, David Woodhouse wrote:
> On Mon, 2026-05-11 at 17:16 -0700, Dongli Zhang wrote:
>>
>>
>> On 5/9/26 5:22 AM, David Woodhouse wrote:
>>> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>>>> The pvclock_update_vm_gtod_copy() function always unconditionally updates
>>>> ka->master_kernel_ns and ka->master_cycle_now whenever a
>>>> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
>>>> increases the risk of kvm-clock drift.
>>>>
>>>> If pvclock_update_vm_gtod_copy() is not called from
>>>> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
>>>> workflow. The argument 'forced' is introduced to tell where it is from.
>>>>
>>>> Otherwise, we avoid updating the masterclock if it is already
>>>> active and will remain active. In such cases, updating the masterclock
>>>> data is not beneficial and can instead lead to kvm-clock drift.
>>>>
>>>> As a result, this patch minimizes the chance of unnecessary masterclock
>>>> data updates to avoid kvm-clock drift.
>>>>
>>>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>
>>> Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
>>> set the 'force' argument is the one in kvm_update_masterclock(), so we
>>> are asserting that kvm_update_masterclock() never needs to *change* the
>>> masterclock origin point, if it was already set?
>>>
>>> The gtod notifier callback in pvclock_gtod_update_fn() also ends up
>>> setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
>>> timekeeping update (which could potentially be from a clocksource
>>> change). It also hypothetically possible that the clocksource changes
>>> from TSC → HPET → TSC, switching back to TSC again before the
>>> masterclock update ever gets to run. Or maybe a suspend/resume?
>>>
>>> Are you *sure* that the optimisation is always valid...?
>>
>> Thank you very much!
>>
>> I didn't validate the scenario you mentioned. I missed that scenario because I
>> assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
>> the host clocksource, although I sometimes forgot to add "clocksource=tsc
>> tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).
> 
> I'd love to assume that, but we do still have to cater for systems
> without it (or where it goes wrong). And where userspace sets up vCPUs
> with different frequencies... although I'd quite like to ban that too
> :)
> 
>> I didn't follow up on this patch because I noticed another issue. I found that
>> the tsc_timestamp in the PVTI can become a very large number if we simply reboot
>> the guest VM. This happens because the patch stops updating the masterclock data
>> when the masterclock is already active and remains active.
>>
>> For example:
>>
>> current guest TSC: 122763682
>> PVTI->tsc_timestamp = 18446744073656540006
>> PVTI->system_time=196515164269
>>
>> Although I could not reproduce any bug, that still made me feel uncomfortable.
> 
> That's just negative (-53011610). Theoretically it's OK; it's just a
> consequence of using a reference point in the past. Probably just
> *asking* for guest bugs though, so best avoided.
> 
> I'd like to understand how it got like that though. Did your 'reboot'
> reset the guest TSC to zero but *not* its kvmclock?
> 

Taking mainline QEMU as an example, I simply trigger a reboot from the guest VM.

A normal reboot from the guest VM resets only MSR_IA32_TSC, but not
KVM_SET_CLOCK. Before the reboot, the masterclock is enabled. During
kvm_synchronize_tsc() for each vCPU, the masterclock remains enabled as well.
Therefore, pvclock_update_vm_gtod_copy() does not update the masterclock values
because of this patchset.

As a result, tsc_timestamp eventually becomes negative.

For example, I apply the patch below and this patchset on top of the latest
Linux kernel (as KVM).

https://lore.kernel.org/kvm/20260115202256.119820-1-dongli.zhang@oracle.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e88310f5979..d9f93cdb46ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3158,8 +3158,13 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm,
bool forced)
        if (forced || !(use_master_clock && ka->use_master_clock)) {
                ka->master_kernel_ns = master_kernel_ns;
                ka->master_cycle_now = master_cycle_now;
+               pr_alert("debug: pvclock_update_vm_gtod_copy() update (%llu,
%llu)\n",
+                        ka->master_kernel_ns, ka->master_cycle_now);
        }

+       pr_alert("debug: pvclock_update_vm_gtod_copy() old=%d, new=%d\n",
+                ka->use_master_clock, use_master_clock);
+
        ka->use_master_clock = use_master_clock;

        if (ka->use_master_clock)
@@ -3416,6 +3421,9 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
        hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
        vcpu->last_guest_tsc = tsc_timestamp;

+       pr_alert("debug: kvm_guest_time_update() vcpu=%u tsc=%llu time=%llu\n",
+                v->vcpu_id, hv_clock.tsc_timestamp, hv_clock.system_time);
+
        /* If the host uses TSC clocksource, then it is stable */
        hv_clock.flags = 0;
        if (use_master_clock)
@@ -4108,6 +4116,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
msr_data *msr_info)
                vcpu->arch.msr_ia32_power_ctl = data;
                break;
        case MSR_IA32_TSC:
+               pr_alert("debug: set MSR_IA32_TSC vcpu=%u, host=%d, data=%llu\n",
+                        vcpu->vcpu_id, msr_info->host_initiated, data);
                if (msr_info->host_initiated) {
                        kvm_synchronize_tsc(vcpu, &data);
                } else if (!vcpu->arch.guest_tsc_protected) {


Below are tsc_timestamp and system_time before reboot.

[  154.159733] kvm: debug: kvm_guest_time_update() vcpu=3 tsc=97471036 time=4759583
[  154.160101] kvm: debug: kvm_guest_time_update() vcpu=2 tsc=97471036 time=4759583
[  154.160658] kvm: debug: kvm_guest_time_update() vcpu=1 tsc=97471036 time=4759583
[  154.161292] kvm: debug: kvm_guest_time_update() vcpu=0 tsc=97471036 time=4759583

After reboot, there will be TSC synchronization.

[  154.217551] kvm: debug: set MSR_IA32_TSC vcpu=0, host=1, data=1
[  154.217980] kvm: debug: set MSR_IA32_TSC vcpu=1, host=1, data=1
[  154.218384] kvm: debug: set MSR_IA32_TSC vcpu=2, host=1, data=1
[  154.218792] kvm: debug: set MSR_IA32_TSC vcpu=3, host=1, data=1

The masterclock remains enabled. Therefore, pvclock_update_vm_gtod_copy() does
not update the masterclock values because of this patchset.

tsc_timestamp becomes negative, as TSC is reset to start from data=1.

[  154.219283] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.219671] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[  154.504499] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.504898] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.504898] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[  154.504898] kvm: debug: kvm_guest_time_update() vcpu=3
tsc=18446743945549702059 time=4759583
[  154.505240] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[  154.505240] kvm: debug: kvm_guest_time_update() vcpu=1
tsc=18446743945549702059 time=4759583
[  154.505241] kvm: debug: kvm_guest_time_update() vcpu=2
tsc=18446743945549702059 time=4759583


Usually, QEMU users (i.e., libvirt) reboot the guest VM from outside, that is:

1. Send shutdown to guest VM.
2. (qemu) system_reset
3. (qemu) cont

The cont command triggers KVM_SET_CLOCK, so we do not get a negative tsc_timestamp.

That is why I did not follow up on this patch set further.

Thank you very much!

Dongli Zhang


^ permalink raw reply related	[flat|nested] 7+ messages in thread

[parent not found: <20260115202256.119820-3-dongli.zhang@oracle.com>]

* Re: [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
       [not found] ` <20260115202256.119820-3-dongli.zhang@oracle.com>
@ 2026-05-09 20:04   ` David Woodhouse
  2026-05-12  0:21     ` Dongli Zhang
  0 siblings, 1 reply; 7+ messages in thread
From: David Woodhouse @ 2026-05-09 20:04 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 7178 bytes --]

On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> The KVM_SET_CLOCK command calls pvclock_update_vm_gtod_copy() to update the
> masterclock data.
> 
> Many vCPUs may already have KVM_REQ_MASTERCLOCK_UPDATE pending before
> KVM_SET_CLOCK is invoked. If pvclock_update_vm_gtod_copy() decides to use
> the masterclock, there is no need to update the masterclock multiple times
> afterward. As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
> force masterclock update on vCPU hotplug"), each unnecessary
> KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
> 
> Therefore, clear KVM_REQ_MASTERCLOCK_UPDATE for each vCPU at the end of
> KVM_SET_CLOCK when the master clock is active. The 'tsc_write_lock' ensures
> that only requests issued before KVM_SET_CLOCK are cleared.

Hm, I think we should do this in kvm_make_mclock_inprogress_request()
instead.

Currently, things like pvclock_gtod_update_fn() and one or two others
set KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Not just one, because we
don't want *any* vCPU going back into guest mode before doing the work.

So they could *all* end up calling into kvm_masterclock_update() at the
same time. I have a vague recollection there was once a mutex there,
but now there isn't.

So all vCPUs can each, in parallel, set KVM_REQ_MCLOCK_INPROGRESS on
all other vCPUs. That's an unhandled request which just makes a vCPU
spin (preventing it from entering the guest until the clock update
completes).

Then each vCPU tries to take the kvm->arch.tsc_write_lock in
kvm_start_pvclock_update(), and from this point they'll be
serialized... but every vCPU will go through the whole of
pvclock_update_vm_gtod_copy() for itself, updating the masterclock one
more time for each vCPU in the system?

I came here to say that kvm_make_mclock_inprogress_request() should
probably clear KVM_REQ_MASTERCLOCK_UPDATE on all other vCPUs, since
it's about to do the work, and the other vCPUs will no longer need to.

But... it's more broken than that now. 

I think maybe vcpu_enter_guest() should call kvm_update_masterclock()
if *kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE)* (i.e. not *clear* the
bit by using kvm_check_request()).

And then after getting ->arch.tsc_write_lock,
kvm_start_pvclock_update() should check if KVM_REQ_MASTERCLOCK_UPDATE
is *still* set, and only proceed with the update if it is?

I'll play with it some more... something like this perhaps?

From 91e37d74ee267c722bc2c8f26754fcdfbc040e29 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Sat, 9 May 2026 16:54:01 +0100
Subject: [PATCH] KVM: x86: Avoid redundant masterclock updates from multiple
 vCPUs

When a masterclock update is triggered (e.g. by the clocksource change
notifier), KVM_REQ_MASTERCLOCK_UPDATE is set on all vCPUs. Without this
fix, each vCPU independently processes the request and redundantly
re-executes the entire pvclock_update_vm_gtod_copy() sequence, serialized
only by tsc_write_lock. Each redundant re-snapshot of the master clock
reference point introduces potential clock drift.

Fix this by having __kvm_start_pvclock_update() check, after acquiring
the lock, whether the requesting vCPU's KVM_REQ_MASTERCLOCK_UPDATE is
still set. If another vCPU already did the update and cleared it, bail
out. Otherwise, clear the request on all other vCPUs before proceeding.

The caller in vcpu_enter_guest() now uses kvm_test_request() (non-clearing)
since the clearing is done inside __kvm_start_pvclock_update() under the
lock.

Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 56 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed0f7dec08bd..b249c4a6048c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3259,10 +3259,39 @@ static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
 }

-static void __kvm_start_pvclock_update(struct kvm *kvm)
+static void kvm_clear_mclock_inprogress_request(struct kvm *kvm)
 {
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+}
+
+static bool __kvm_start_pvclock_update(struct kvm *kvm, struct kvm_vcpu *requesting_vcpu)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
 	raw_spin_lock_irq(&kvm->arch.tsc_write_lock);
+
+	/*
+	 * If another vCPU already did the update while we were waiting
+	 * for the lock, our request will have been cleared. Bail out.
+	 */
+	if (requesting_vcpu &&
+	    !kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE, requesting_vcpu)) {
+		kvm_clear_mclock_inprogress_request(kvm);
+		raw_spin_unlock_irq(&kvm->arch.tsc_write_lock);
+		return false;
+	}
+
+	/* The update is VM-wide; prevent other vCPUs from redoing it. */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_clear_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
 	write_seqcount_begin(&kvm->arch.pvclock_sc);
+	return true;
 }

 static void kvm_start_pvclock_update(struct kvm *kvm)
@@ -3270,7 +3299,7 @@ static void kvm_start_pvclock_update(struct kvm *kvm)
 	kvm_make_mclock_inprogress_request(kvm);

 	/* no guest entries from this point */
-	__kvm_start_pvclock_update(kvm);
+	__kvm_start_pvclock_update(kvm, NULL);
 }

 static void kvm_end_pvclock_update(struct kvm *kvm)
@@ -3279,22 +3308,25 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
 	struct kvm_vcpu *vcpu;
 	unsigned long i;

-	write_seqcount_end(&ka->pvclock_sc);
-	raw_spin_unlock_irq(&ka->tsc_write_lock);
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

 	/* guest entries allowed */
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+	kvm_clear_mclock_inprogress_request(kvm);
+
+	write_seqcount_end(&ka->pvclock_sc);
+	raw_spin_unlock_irq(&ka->tsc_write_lock);
 }

-static void kvm_update_masterclock(struct kvm *kvm)
+static void kvm_update_masterclock(struct kvm *kvm, struct kvm_vcpu *vcpu)
 {
 	kvm_hv_request_tsc_page_update(kvm);
-	kvm_start_pvclock_update(kvm);
-	pvclock_update_vm_gtod_copy(kvm);
-	kvm_end_pvclock_update(kvm);
+	kvm_make_mclock_inprogress_request(kvm);
+
+	if (__kvm_start_pvclock_update(kvm, vcpu)) {
+		pvclock_update_vm_gtod_copy(kvm);
+		kvm_end_pvclock_update(kvm);
+	}
 }

 /*
@@ -11481,8 +11513,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_mmu_free_obsolete_roots(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
-		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
-			kvm_update_masterclock(vcpu->kvm);
+		if (kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
+			kvm_update_masterclock(vcpu->kvm, vcpu);
 		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
 			kvm_gen_kvmclock_update(vcpu);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
-- 
2.43.0

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
  2026-05-09 20:04   ` [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK David Woodhouse
@ 2026-05-12  0:21     ` Dongli Zhang
  2026-05-12  7:19       ` David Woodhouse
  0 siblings, 1 reply; 7+ messages in thread
From: Dongli Zhang @ 2026-05-12  0:21 UTC (permalink / raw)
  To: David Woodhouse, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin



On 5/9/26 1:04 PM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>> The KVM_SET_CLOCK command calls pvclock_update_vm_gtod_copy() to update the
>> masterclock data.
>>
>> Many vCPUs may already have KVM_REQ_MASTERCLOCK_UPDATE pending before
>> KVM_SET_CLOCK is invoked. If pvclock_update_vm_gtod_copy() decides to use
>> the masterclock, there is no need to update the masterclock multiple times
>> afterward. As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
>> force masterclock update on vCPU hotplug"), each unnecessary
>> KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
>>
>> Therefore, clear KVM_REQ_MASTERCLOCK_UPDATE for each vCPU at the end of
>> KVM_SET_CLOCK when the master clock is active. The 'tsc_write_lock' ensures
>> that only requests issued before KVM_SET_CLOCK are cleared.
> 
> Hm, I think we should do this in kvm_make_mclock_inprogress_request()
> instead.
> 
> Currently, things like pvclock_gtod_update_fn() and one or two others
> set KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Not just one, because we
> don't want *any* vCPU going back into guest mode before doing the work.
> 
> So they could *all* end up calling into kvm_masterclock_update() at the
> same time. I have a vague recollection there was once a mutex there,
> but now there isn't.
> 
> So all vCPUs can each, in parallel, set KVM_REQ_MCLOCK_INPROGRESS on
> all other vCPUs. That's an unhandled request which just makes a vCPU
> spin (preventing it from entering the guest until the clock update
> completes).
> 
> Then each vCPU tries to take the kvm->arch.tsc_write_lock in
> kvm_start_pvclock_update(), and from this point they'll be
> serialized... but every vCPU will go through the whole of
> pvclock_update_vm_gtod_copy() for itself, updating the masterclock one
> more time for each vCPU in the system?
> 
> I came here to say that kvm_make_mclock_inprogress_request() should
> probably clear KVM_REQ_MASTERCLOCK_UPDATE on all other vCPUs, since
> it's about to do the work, and the other vCPUs will no longer need to.
> 
> But... it's more broken than that now. 
> 
> I think maybe vcpu_enter_guest() should call kvm_update_masterclock()
> if *kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE)* (i.e. not *clear* the
> bit by using kvm_check_request()).
> 
> And then after getting ->arch.tsc_write_lock,
> kvm_start_pvclock_update() should check if KVM_REQ_MASTERCLOCK_UPDATE
> is *still* set, and only proceed with the update if it is?
> 
> I'll play with it some more... something like this perhaps?
> 
> From 91e37d74ee267c722bc2c8f26754fcdfbc040e29 Mon Sep 17 00:00:00 2001
> From: David Woodhouse <dwmw@amazon.co.uk>
> Date: Sat, 9 May 2026 16:54:01 +0100
> Subject: [PATCH] KVM: x86: Avoid redundant masterclock updates from multiple
>  vCPUs
> 
> When a masterclock update is triggered (e.g. by the clocksource change
> notifier), KVM_REQ_MASTERCLOCK_UPDATE is set on all vCPUs. Without this
> fix, each vCPU independently processes the request and redundantly
> re-executes the entire pvclock_update_vm_gtod_copy() sequence, serialized
> only by tsc_write_lock. Each redundant re-snapshot of the master clock
> reference point introduces potential clock drift.
> 
> Fix this by having __kvm_start_pvclock_update() check, after acquiring
> the lock, whether the requesting vCPU's KVM_REQ_MASTERCLOCK_UPDATE is
> still set. If another vCPU already did the update and cleared it, bail
> out. Otherwise, clear the request on all other vCPUs before proceeding.
> 
> The caller in vcpu_enter_guest() now uses kvm_test_request() (non-clearing)
> since the clearing is done inside __kvm_start_pvclock_update() under the
> lock.
> 
> Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


Thank you very much! Feel free to add me as Suggested-by or Reported-by.

I will also help test your large patch series.

Regarding the issue, I would like to confirm whether there can still be any
KVM_REQ_MASTERCLOCK_UPDATE after KVM_SET_CLOCK_GUEST.

Per my understanding, I would expect KVM_SET_CLOCK_GUEST to be the last point
where the masterclock can be updated (during live migration or live update).

After KVM_SET_CLOCK_GUEST, any further masterclock change may cause kvm-clock drift.

Dongli Zhang

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
  2026-05-12  0:21     ` Dongli Zhang
@ 2026-05-12  7:19       ` David Woodhouse
  0 siblings, 0 replies; 7+ messages in thread
From: David Woodhouse @ 2026-05-12  7:19 UTC (permalink / raw)
  To: Dongli Zhang, kvm
  Cc: seanjc, pbonzini, paul, tglx, mingo, bp, dave.hansen, x86, hpa,
	linux-kernel, joe.jin

[-- Attachment #1: Type: text/plain, Size: 5887 bytes --]

On Mon, 2026-05-11 at 17:21 -0700, Dongli Zhang wrote:
> 
> 
> On 5/9/26 1:04 PM, David Woodhouse wrote:
> > On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> > > The KVM_SET_CLOCK command calls pvclock_update_vm_gtod_copy() to update the
> > > masterclock data.
> > > 
> > > Many vCPUs may already have KVM_REQ_MASTERCLOCK_UPDATE pending before
> > > KVM_SET_CLOCK is invoked. If pvclock_update_vm_gtod_copy() decides to use
> > > the masterclock, there is no need to update the masterclock multiple times
> > > afterward. As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
> > > force masterclock update on vCPU hotplug"), each unnecessary
> > > KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
> > > 
> > > Therefore, clear KVM_REQ_MASTERCLOCK_UPDATE for each vCPU at the end of
> > > KVM_SET_CLOCK when the master clock is active. The 'tsc_write_lock' ensures
> > > that only requests issued before KVM_SET_CLOCK are cleared.
> > 
> > Hm, I think we should do this in kvm_make_mclock_inprogress_request()
> > instead.
> > 
> > Currently, things like pvclock_gtod_update_fn() and one or two others
> > set KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Not just one, because we
> > don't want *any* vCPU going back into guest mode before doing the work.
> > 
> > So they could *all* end up calling into kvm_masterclock_update() at the
> > same time. I have a vague recollection there was once a mutex there,
> > but now there isn't.
> > 
> > So all vCPUs can each, in parallel, set KVM_REQ_MCLOCK_INPROGRESS on
> > all other vCPUs. That's an unhandled request which just makes a vCPU
> > spin (preventing it from entering the guest until the clock update
> > completes).
> > 
> > Then each vCPU tries to take the kvm->arch.tsc_write_lock in
> > kvm_start_pvclock_update(), and from this point they'll be
> > serialized... but every vCPU will go through the whole of
> > pvclock_update_vm_gtod_copy() for itself, updating the masterclock one
> > more time for each vCPU in the system?
> > 
> > I came here to say that kvm_make_mclock_inprogress_request() should
> > probably clear KVM_REQ_MASTERCLOCK_UPDATE on all other vCPUs, since
> > it's about to do the work, and the other vCPUs will no longer need to.
> > 
> > But... it's more broken than that now. 
> > 
> > I think maybe vcpu_enter_guest() should call kvm_update_masterclock()
> > if *kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE)* (i.e. not *clear* the
> > bit by using kvm_check_request()).
> > 
> > And then after getting ->arch.tsc_write_lock,
> > kvm_start_pvclock_update() should check if KVM_REQ_MASTERCLOCK_UPDATE
> > is *still* set, and only proceed with the update if it is?
> > 
> > I'll play with it some more... something like this perhaps?
> > 
> > From 91e37d74ee267c722bc2c8f26754fcdfbc040e29 Mon Sep 17 00:00:00 2001
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > Date: Sat, 9 May 2026 16:54:01 +0100
> > Subject: [PATCH] KVM: x86: Avoid redundant masterclock updates from multiple
> >  vCPUs
> > 
> > When a masterclock update is triggered (e.g. by the clocksource change
> > notifier), KVM_REQ_MASTERCLOCK_UPDATE is set on all vCPUs. Without this
> > fix, each vCPU independently processes the request and redundantly
> > re-executes the entire pvclock_update_vm_gtod_copy() sequence, serialized
> > only by tsc_write_lock. Each redundant re-snapshot of the master clock
> > reference point introduces potential clock drift.
> > 
> > Fix this by having __kvm_start_pvclock_update() check, after acquiring
> > the lock, whether the requesting vCPU's KVM_REQ_MASTERCLOCK_UPDATE is
> > still set. If another vCPU already did the update and cleared it, bail
> > out. Otherwise, clear the request on all other vCPUs before proceeding.
> > 
> > The caller in vcpu_enter_guest() now uses kvm_test_request() (non-clearing)
> > since the clearing is done inside __kvm_start_pvclock_update() under the
> > lock.
> > 
> > Suggested-by: Dongli Zhang <dongli.zhang@oracle.com>
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> 
> Thank you very much! Feel free to add me as Suggested-by or Reported-by.

Already did!

> I will also help test your large patch series.

Thanks.

> Regarding the issue, I would like to confirm whether there can still be any
> KVM_REQ_MASTERCLOCK_UPDATE after KVM_SET_CLOCK_GUEST.
> 
> Per my understanding, I would expect KVM_SET_CLOCK_GUEST to be the last point
> where the masterclock can be updated (during live migration or live update).
> 
> After KVM_SET_CLOCK_GUEST, any further masterclock change may cause kvm-clock drift.

I believe that for a guest with ->use_master_clock, there should be no
further masterclock change once it's running. We trigger
KVM_REQ_MASTER_CLOCK update only in exceptional cases: 

 • kvm_set_guest_paused() — vCPU 0 writes kvmclock MSR switching
   between old/new MSR (boot_vcpu_runs_old_kvmclock changes)

 • kvm_track_tsc_matching() — TSC matching state changes: either a new
   TSC generation starts, or master clock needs to be toggled on/off

 • pvclock_gtod_notify() via irq_work — host clocksource stops being
   TSC-based (e.g., TSC marked unstable, clocksource switched to HPET)

 • kvm_arch_hardware_enable() — host TSC went backwards during CPU
   online (suspend/resume, CPU hotplug) setting backwards_tsc_observed

IIRC the timekeeping code does call pvclock_gtod_notify *all* the time
even when the clock rate is stable and NTP isn't applying any skew,
because a cycle rate of e.g. 333333⅓ cycles per tick will result in it
counting 333333, 333333, then 333334 cycles in consecutive ticks. And
calling all the notifiers whenever it flips between 333333 and 333334.
But KVM only updates the master clock if it becomes non-TSC-based, so
it doesn't hurt.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-05-12 23:24 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260115202256.119820-1-dongli.zhang@oracle.com>
     [not found] ` <20260115202256.119820-4-dongli.zhang@oracle.com>
2026-05-09 12:22   ` [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy() David Woodhouse
2026-05-12  0:16     ` Dongli Zhang
2026-05-12  5:21       ` David Woodhouse
2026-05-12 23:23         ` Dongli Zhang
     [not found] ` <20260115202256.119820-3-dongli.zhang@oracle.com>
2026-05-09 20:04   ` [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK David Woodhouse
2026-05-12  0:21     ` Dongli Zhang
2026-05-12  7:19       ` David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox