Kernel KVM virtualization development
 help / color / mirror / Atom feed
From: Dongli Zhang <dongli.zhang@oracle.com>
To: David Woodhouse <dwmw2@infradead.org>, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, paul@xen.org,
	tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	linux-kernel@vger.kernel.org, joe.jin@oracle.com
Subject: Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
Date: Tue, 12 May 2026 16:23:36 -0700	[thread overview]
Message-ID: <1a7da49a-9737-4e00-b1ef-ab900433657f@oracle.com> (raw)
In-Reply-To: <94bbb27100ac85508f00234e4ffdb619de855d7c.camel@infradead.org>



On 5/11/26 10:21 PM, David Woodhouse wrote:
> On Mon, 2026-05-11 at 17:16 -0700, Dongli Zhang wrote:
>>
>>
>> On 5/9/26 5:22 AM, David Woodhouse wrote:
>>> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>>>> The pvclock_update_vm_gtod_copy() function always unconditionally updates
>>>> ka->master_kernel_ns and ka->master_cycle_now whenever a
>>>> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
>>>> increases the risk of kvm-clock drift.
>>>>
>>>> If pvclock_update_vm_gtod_copy() is not called from
>>>> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
>>>> workflow. The argument 'forced' is introduced to tell where it is from.
>>>>
>>>> Otherwise, we avoid updating the masterclock if it is already
>>>> active and will remain active. In such cases, updating the masterclock
>>>> data is not beneficial and can instead lead to kvm-clock drift.
>>>>
>>>> As a result, this patch minimizes the chance of unnecessary masterclock
>>>> data updates to avoid kvm-clock drift.
>>>>
>>>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>
>>> Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
>>> set the 'force' argument is the one in kvm_update_masterclock(), so we
>>> are asserting that kvm_update_masterclock() never needs to *change* the
>>> masterclock origin point, if it was already set?
>>>
>>> The gtod notifier callback in pvclock_gtod_update_fn() also ends up
>>> setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
>>> timekeeping update (which could potentially be from a clocksource
>>> change). It also hypothetically possible that the clocksource changes
>>> from TSC → HPET → TSC, switching back to TSC again before the
>>> masterclock update ever gets to run. Or maybe a suspend/resume?
>>>
>>> Are you *sure* that the optimisation is always valid...?
>>
>> Thank you very much!
>>
>> I didn't validate the scenario you mentioned. I missed that scenario because I
>> assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
>> the host clocksource, although I sometimes forgot to add "clocksource=tsc
>> tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).
> 
> I'd love to assume that, but we do still have to cater for systems
> without it (or where it goes wrong). And where userspace sets up vCPUs
> with different frequencies... although I'd quite like to ban that too
> :)
> 
>> I didn't follow up on this patch because I noticed another issue. I found that
>> the tsc_timestamp in the PVTI can become a very large number if we simply reboot
>> the guest VM. This happens because the patch stops updating the masterclock data
>> when the masterclock is already active and remains active.
>>
>> For example:
>>
>> current guest TSC: 122763682
>> PVTI->tsc_timestamp = 18446744073656540006
>> PVTI->system_time=196515164269
>>
>> Although I could not reproduce any bug, that still made me feel uncomfortable.
> 
> That's just negative (-53011610). Theoretically it's OK; it's just a
> consequence of using a reference point in the past. Probably just
> *asking* for guest bugs though, so best avoided.
> 
> I'd like to understand how it got like that though. Did your 'reboot'
> reset the guest TSC to zero but *not* its kvmclock?
> 

Taking mainline QEMU as an example, I simply trigger a reboot from the guest VM.

A normal reboot from the guest VM resets only MSR_IA32_TSC, but not
KVM_SET_CLOCK. Before the reboot, the masterclock is enabled. During
kvm_synchronize_tsc() for each vCPU, the masterclock remains enabled as well.
Therefore, pvclock_update_vm_gtod_copy() does not update the masterclock values
because of this patchset.

As a result, tsc_timestamp eventually becomes negative.

For example, I apply the patch below and this patchset on top of the latest
Linux kernel (as KVM).

https://lore.kernel.org/kvm/20260115202256.119820-1-dongli.zhang@oracle.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e88310f5979..d9f93cdb46ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3158,8 +3158,13 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm,
bool forced)
        if (forced || !(use_master_clock && ka->use_master_clock)) {
                ka->master_kernel_ns = master_kernel_ns;
                ka->master_cycle_now = master_cycle_now;
+               pr_alert("debug: pvclock_update_vm_gtod_copy() update (%llu,
%llu)\n",
+                        ka->master_kernel_ns, ka->master_cycle_now);
        }

+       pr_alert("debug: pvclock_update_vm_gtod_copy() old=%d, new=%d\n",
+                ka->use_master_clock, use_master_clock);
+
        ka->use_master_clock = use_master_clock;

        if (ka->use_master_clock)
@@ -3416,6 +3421,9 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
        hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
        vcpu->last_guest_tsc = tsc_timestamp;

+       pr_alert("debug: kvm_guest_time_update() vcpu=%u tsc=%llu time=%llu\n",
+                v->vcpu_id, hv_clock.tsc_timestamp, hv_clock.system_time);
+
        /* If the host uses TSC clocksource, then it is stable */
        hv_clock.flags = 0;
        if (use_master_clock)
@@ -4108,6 +4116,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
msr_data *msr_info)
                vcpu->arch.msr_ia32_power_ctl = data;
                break;
        case MSR_IA32_TSC:
+               pr_alert("debug: set MSR_IA32_TSC vcpu=%u, host=%d, data=%llu\n",
+                        vcpu->vcpu_id, msr_info->host_initiated, data);
                if (msr_info->host_initiated) {
                        kvm_synchronize_tsc(vcpu, &data);
                } else if (!vcpu->arch.guest_tsc_protected) {


Below are tsc_timestamp and system_time before reboot.

[  154.159733] kvm: debug: kvm_guest_time_update() vcpu=3 tsc=97471036 time=4759583
[  154.160101] kvm: debug: kvm_guest_time_update() vcpu=2 tsc=97471036 time=4759583
[  154.160658] kvm: debug: kvm_guest_time_update() vcpu=1 tsc=97471036 time=4759583
[  154.161292] kvm: debug: kvm_guest_time_update() vcpu=0 tsc=97471036 time=4759583

After reboot, there will be TSC synchronization.

[  154.217551] kvm: debug: set MSR_IA32_TSC vcpu=0, host=1, data=1
[  154.217980] kvm: debug: set MSR_IA32_TSC vcpu=1, host=1, data=1
[  154.218384] kvm: debug: set MSR_IA32_TSC vcpu=2, host=1, data=1
[  154.218792] kvm: debug: set MSR_IA32_TSC vcpu=3, host=1, data=1

The masterclock remains enabled. Therefore, pvclock_update_vm_gtod_copy() does
not update the masterclock values because of this patchset.

tsc_timestamp becomes negative, as TSC is reset to start from data=1.

[  154.219283] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.219671] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[  154.504499] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.504898] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[  154.504898] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[  154.504898] kvm: debug: kvm_guest_time_update() vcpu=3
tsc=18446743945549702059 time=4759583
[  154.505240] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[  154.505240] kvm: debug: kvm_guest_time_update() vcpu=1
tsc=18446743945549702059 time=4759583
[  154.505241] kvm: debug: kvm_guest_time_update() vcpu=2
tsc=18446743945549702059 time=4759583


Usually, QEMU users (i.e., libvirt) reboot the guest VM from outside, that is:

1. Send shutdown to guest VM.
2. (qemu) system_reset
3. (qemu) cont

The cont command triggers KVM_SET_CLOCK, so we do not get a negative tsc_timestamp.

That is why I did not follow up on this patch set further.

Thank you very much!

Dongli Zhang


  reply	other threads:[~2026-05-12 23:24 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-15 20:22 [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
2026-01-15 20:22 ` [PATCH 1/3] KVM: x86: Fix compute_guest_tsc() to cope with negative delta Dongli Zhang
2026-01-15 20:22 ` [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK Dongli Zhang
2026-05-09 20:04   ` David Woodhouse
2026-05-12  0:21     ` Dongli Zhang
2026-05-12  7:19       ` David Woodhouse
2026-01-15 20:22 ` [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy() Dongli Zhang
2026-05-09 12:22   ` David Woodhouse
2026-05-12  0:16     ` Dongli Zhang
2026-05-12  5:21       ` David Woodhouse
2026-05-12 23:23         ` Dongli Zhang [this message]
2026-01-15 20:37 ` [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
2026-01-15 21:13   ` David Woodhouse
2026-01-16  9:31     ` Dongli Zhang
2026-01-22  5:01       ` Dongli Zhang
2026-01-24  1:31       ` David Woodhouse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1a7da49a-9737-4e00-b1ef-ab900433657f@oracle.com \
    --to=dongli.zhang@oracle.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=dwmw2@infradead.org \
    --cc=hpa@zytor.com \
    --cc=joe.jin@oracle.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=paul@xen.org \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    --cc=tglx@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox