From: Dongli Zhang <dongli.zhang@oracle.com>
To: David Woodhouse <dwmw2@infradead.org>, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, paul@xen.org,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
linux-kernel@vger.kernel.org, joe.jin@oracle.com
Subject: Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
Date: Tue, 12 May 2026 16:23:36 -0700 [thread overview]
Message-ID: <1a7da49a-9737-4e00-b1ef-ab900433657f@oracle.com> (raw)
In-Reply-To: <94bbb27100ac85508f00234e4ffdb619de855d7c.camel@infradead.org>
On 5/11/26 10:21 PM, David Woodhouse wrote:
> On Mon, 2026-05-11 at 17:16 -0700, Dongli Zhang wrote:
>>
>>
>> On 5/9/26 5:22 AM, David Woodhouse wrote:
>>> On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
>>>> The pvclock_update_vm_gtod_copy() function always unconditionally updates
>>>> ka->master_kernel_ns and ka->master_cycle_now whenever a
>>>> KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
>>>> increases the risk of kvm-clock drift.
>>>>
>>>> If pvclock_update_vm_gtod_copy() is not called from
>>>> vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
>>>> workflow. The argument 'forced' is introduced to tell where it is from.
>>>>
>>>> Otherwise, we avoid updating the masterclock if it is already
>>>> active and will remain active. In such cases, updating the masterclock
>>>> data is not beneficial and can instead lead to kvm-clock drift.
>>>>
>>>> As a result, this patch minimizes the chance of unnecessary masterclock
>>>> data updates to avoid kvm-clock drift.
>>>>
>>>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>>>
>>> Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
>>> set the 'force' argument is the one in kvm_update_masterclock(), so we
>>> are asserting that kvm_update_masterclock() never needs to *change* the
>>> masterclock origin point, if it was already set?
>>>
>>> The gtod notifier callback in pvclock_gtod_update_fn() also ends up
>>> setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
>>> timekeeping update (which could potentially be from a clocksource
>>> change). It also hypothetically possible that the clocksource changes
>>> from TSC → HPET → TSC, switching back to TSC again before the
>>> masterclock update ever gets to run. Or maybe a suspend/resume?
>>>
>>> Are you *sure* that the optimisation is always valid...?
>>
>> Thank you very much!
>>
>> I didn't validate the scenario you mentioned. I missed that scenario because I
>> assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
>> the host clocksource, although I sometimes forgot to add "clocksource=tsc
>> tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).
>
> I'd love to assume that, but we do still have to cater for systems
> without it (or where it goes wrong). And where userspace sets up vCPUs
> with different frequencies... although I'd quite like to ban that too
> :)
>
>> I didn't follow up on this patch because I noticed another issue. I found that
>> the tsc_timestamp in the PVTI can become a very large number if we simply reboot
>> the guest VM. This happens because the patch stops updating the masterclock data
>> when the masterclock is already active and remains active.
>>
>> For example:
>>
>> current guest TSC: 122763682
>> PVTI->tsc_timestamp = 18446744073656540006
>> PVTI->system_time=196515164269
>>
>> Although I could not reproduce any bug, that still made me feel uncomfortable.
>
> That's just negative (-53011610). Theoretically it's OK; it's just a
> consequence of using a reference point in the past. Probably just
> *asking* for guest bugs though, so best avoided.
>
> I'd like to understand how it got like that though. Did your 'reboot'
> reset the guest TSC to zero but *not* its kvmclock?
>
Taking mainline QEMU as an example, I simply trigger a reboot from the guest VM.
A normal reboot from the guest VM resets only MSR_IA32_TSC, but not
KVM_SET_CLOCK. Before the reboot, the masterclock is enabled. During
kvm_synchronize_tsc() for each vCPU, the masterclock remains enabled as well.
Therefore, pvclock_update_vm_gtod_copy() does not update the masterclock values
because of this patchset.
As a result, tsc_timestamp eventually becomes negative.
For example, I apply the patch below and this patchset on top of the latest
Linux kernel (as KVM).
https://lore.kernel.org/kvm/20260115202256.119820-1-dongli.zhang@oracle.com
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e88310f5979..d9f93cdb46ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3158,8 +3158,13 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm,
bool forced)
if (forced || !(use_master_clock && ka->use_master_clock)) {
ka->master_kernel_ns = master_kernel_ns;
ka->master_cycle_now = master_cycle_now;
+ pr_alert("debug: pvclock_update_vm_gtod_copy() update (%llu,
%llu)\n",
+ ka->master_kernel_ns, ka->master_cycle_now);
}
+ pr_alert("debug: pvclock_update_vm_gtod_copy() old=%d, new=%d\n",
+ ka->use_master_clock, use_master_clock);
+
ka->use_master_clock = use_master_clock;
if (ka->use_master_clock)
@@ -3416,6 +3421,9 @@ int kvm_guest_time_update(struct kvm_vcpu *v)
hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
vcpu->last_guest_tsc = tsc_timestamp;
+ pr_alert("debug: kvm_guest_time_update() vcpu=%u tsc=%llu time=%llu\n",
+ v->vcpu_id, hv_clock.tsc_timestamp, hv_clock.system_time);
+
/* If the host uses TSC clocksource, then it is stable */
hv_clock.flags = 0;
if (use_master_clock)
@@ -4108,6 +4116,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
msr_data *msr_info)
vcpu->arch.msr_ia32_power_ctl = data;
break;
case MSR_IA32_TSC:
+ pr_alert("debug: set MSR_IA32_TSC vcpu=%u, host=%d, data=%llu\n",
+ vcpu->vcpu_id, msr_info->host_initiated, data);
if (msr_info->host_initiated) {
kvm_synchronize_tsc(vcpu, &data);
} else if (!vcpu->arch.guest_tsc_protected) {
Below are tsc_timestamp and system_time before reboot.
[ 154.159733] kvm: debug: kvm_guest_time_update() vcpu=3 tsc=97471036 time=4759583
[ 154.160101] kvm: debug: kvm_guest_time_update() vcpu=2 tsc=97471036 time=4759583
[ 154.160658] kvm: debug: kvm_guest_time_update() vcpu=1 tsc=97471036 time=4759583
[ 154.161292] kvm: debug: kvm_guest_time_update() vcpu=0 tsc=97471036 time=4759583
After reboot, there will be TSC synchronization.
[ 154.217551] kvm: debug: set MSR_IA32_TSC vcpu=0, host=1, data=1
[ 154.217980] kvm: debug: set MSR_IA32_TSC vcpu=1, host=1, data=1
[ 154.218384] kvm: debug: set MSR_IA32_TSC vcpu=2, host=1, data=1
[ 154.218792] kvm: debug: set MSR_IA32_TSC vcpu=3, host=1, data=1
The masterclock remains enabled. Therefore, pvclock_update_vm_gtod_copy() does
not update the masterclock values because of this patchset.
tsc_timestamp becomes negative, as TSC is reset to start from data=1.
[ 154.219283] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[ 154.219671] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[ 154.504499] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[ 154.504898] kvm: debug: pvclock_update_vm_gtod_copy() old=1, new=1
[ 154.504898] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[ 154.504898] kvm: debug: kvm_guest_time_update() vcpu=3
tsc=18446743945549702059 time=4759583
[ 154.505240] kvm: debug: kvm_guest_time_update() vcpu=0
tsc=18446743945549702059 time=4759583
[ 154.505240] kvm: debug: kvm_guest_time_update() vcpu=1
tsc=18446743945549702059 time=4759583
[ 154.505241] kvm: debug: kvm_guest_time_update() vcpu=2
tsc=18446743945549702059 time=4759583
Usually, QEMU users (i.e., libvirt) reboot the guest VM from outside, that is:
1. Send shutdown to guest VM.
2. (qemu) system_reset
3. (qemu) cont
The cont command triggers KVM_SET_CLOCK, so we do not get a negative tsc_timestamp.
That is why I did not follow up on this patch set further.
Thank you very much!
Dongli Zhang
next prev parent reply other threads:[~2026-05-12 23:24 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-15 20:22 [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
2026-01-15 20:22 ` [PATCH 1/3] KVM: x86: Fix compute_guest_tsc() to cope with negative delta Dongli Zhang
2026-01-15 20:22 ` [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK Dongli Zhang
2026-05-09 20:04 ` David Woodhouse
2026-05-12 0:21 ` Dongli Zhang
2026-05-12 7:19 ` David Woodhouse
2026-01-15 20:22 ` [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy() Dongli Zhang
2026-05-09 12:22 ` David Woodhouse
2026-05-12 0:16 ` Dongli Zhang
2026-05-12 5:21 ` David Woodhouse
2026-05-12 23:23 ` Dongli Zhang [this message]
2026-01-15 20:37 ` [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update Dongli Zhang
2026-01-15 21:13 ` David Woodhouse
2026-01-16 9:31 ` Dongli Zhang
2026-01-22 5:01 ` Dongli Zhang
2026-01-24 1:31 ` David Woodhouse
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1a7da49a-9737-4e00-b1ef-ab900433657f@oracle.com \
--to=dongli.zhang@oracle.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=dwmw2@infradead.org \
--cc=hpa@zytor.com \
--cc=joe.jin@oracle.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=paul@xen.org \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
--cc=tglx@kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox