Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update

From: Dongli Zhang <dongli.zhang@oracle.com>
To: David Woodhouse <dwmw2@infradead.org>, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, paul@xen.org,
	tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	linux-kernel@vger.kernel.org, joe.jin@oracle.com
Subject: Re: [PATCH 0/3] KVM: x86: Mitigate kvm-clock drift caused by masterclock update
Date: Fri, 16 Jan 2026 01:31:06 -0800
Message-ID: <4ed7a646-707a-40c9-93f9-2289fedf5709@oracle.com>
In-Reply-To: <285e30927dad5736df58aad5d957448e93b2d047.camel@infradead.org>

Hi David,

On 1/15/26 1:13 PM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
>>
>> Please let me know if this is inappropriate and whether I should have
>> confirmed with you before reusing your code from the patch below, with your
>> authorship preserved.
>>
>> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
>> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
>>
>> The objective is to trigger a discussion on whether there is any quick,
>> short-term solution to mitigate the kvm-clock drift issue. We can also
>> resurrect your patchset.
>>
>> I have some other work in QEMU userspace.
>>
>> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
>> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>>
>> The combination of changes in QEMU and this KVM patchset can make kvm-clock
>> drift during live migration very very trivial.
>>
>> Thank you very much!
> 
> Not at all inappropriate; thank you so much for updating it. I've been
> meaning to do so but it's never made it back to the top of my list.
> 
> I don't believe that the existing KVM_SET_CLOCK is viable though. The
> aim is that you should be able to create a new KVM on the same host and
> set the kvmclock, and the contents of the pvclock that the new guest
> sees should be *identical*. Not just 'close'.
> 
> I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
> feasible, so I'd very much prefer that any resurrection of this series
> should include that, even if some of the other patches are dropped for
> now.
> 
> Thanks again.

Thank you very much for the feedback.

The issue addressed by this patchset cannot be resolved by
KVM_[GS]ET_CLOCK_GUEST alone.

The problem I am trying to solve is avoiding unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests. Even when using KVM_[GS]ET_CLOCK_GUEST, if
vCPUs already have pending KVM_REQ_MASTERCLOCK_UPDATE requests, unpausing the
vCPUs from the host userspace VMM (i.e., QEMU) can still trigger multiple master
clock updates - typically proportional to the number of vCPUs.

As we know, each KVM_REQ_MASTERCLOCK_UPDATE can cause unexpected forward or
backward kvm-clock drift.

Therefore, rather than KVM_[GS]ET_CLOCK_GUEST, this patchset is more relevant to
the other two of your patches, defining a new policy to minimize
KVM_REQ_MASTERCLOCK_UPDATE.

[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
[RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other

Consider the combination of QEMU and KVM. The details below explain the
problems I am trying to address.

(Assuming TSC scaling is *inactive*)

## Problem 1. Account for live migration downtime in kvm-clock and guest_tsc.

So far, QEMU/KVM live migration does not account for all of the elapsed
blackout downtime. For example, if a guest is live-migrated to a file, left
idle for one hour, and then restored from that file to the target host, the
one-hour blackout period will not be reflected in the kvm-clock or guest TSC.

This can be resolved by leveraging KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU. I have sent a QEMU patch (and just received your feedback on that thread).

[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
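To make the one-hour example concrete, here is a minimal arithmetic sketch of
what "accounting for the blackout" means. It does not call any KVM ioctl and
is not the code from the QEMU patch; struct saved_time, blackout_ns() and
ns_to_cycles() are hypothetical names used only for illustration, and the
guest TSC frequency is assumed to be known in kHz.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot taken on the source host when the guest is stopped. */
struct saved_time {
	uint64_t clock_ns;    /* kvm-clock value at save time */
	uint64_t guest_tsc;   /* guest TSC value at save time */
	uint64_t realtime_ns; /* host wall-clock time at save time */
};

/* Elapsed wall-clock time between save and restore. */
static uint64_t blackout_ns(const struct saved_time *s, uint64_t realtime_now_ns)
{
	/* If the wall clock appears to have stepped back, add nothing. */
	if (realtime_now_ns <= s->realtime_ns)
		return 0;
	return realtime_now_ns - s->realtime_ns;
}

/* Convert nanoseconds to TSC cycles: cycles = ns * tsc_khz / 10^6. */
static uint64_t ns_to_cycles(uint64_t ns, uint32_t tsc_khz)
{
	assert(tsc_khz > 0); /* a zero frequency would silently yield 0 cycles */
	/* Split the division so that ns * tsc_khz does not overflow early. */
	return (ns / 1000000) * tsc_khz + (ns % 1000000) * tsc_khz / 1000000;
}
```

With a one-hour gap and an assumed 2 GHz guest TSC (tsc_khz = 2000000), the
restored kvm-clock would be clock_ns + 3600000000000 and the restored guest
TSC would be guest_tsc + 7200000000000, so both counters advance by the same
amount of real time.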

## Problem 2. The kvm-clock drifts due to changes in the PVTI data.

Unlike the previous vCPU hotplug-related kvm-clock drift issue, during live
migration the amount of drift is not determined by the time elapsed between two
masterclock updates. Instead, it occurs because guest_clock and guest_tsc are
not stopped or resumed at the same point in time.

For example, MSR_IA32_TSC and KVM_GET_CLOCK are used to save guest_tsc and
guest_clock on the source host. This is effectively equivalent to stopping their
counters. However, they are not stopped simultaneously: guest_tsc stops at time
point P1, while guest_clock stops at time point P2.

- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
... ...
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_GET_CLOCK                               ===> P2

On the target host, QEMU restores the saved values using MSR_IA32_TSC and
KVM_SET_CLOCK. As a result, guest_tsc resumes counting at time point P3, while
guest_clock resumes counting at time point P4.

- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
... ...
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_SET_CLOCK                               ===> P4

Therefore, below are the equations I use to calculate the expected kvm-clock drift.

T1_ns  = P2 - P1 (nanoseconds)
T2_tsc = P4 - P3 (cycles)
T2_ns  = pvclock_scale_delta(T2_tsc,
                             hv_clock_src.tsc_to_system_mul,
                             hv_clock_src.tsc_shift)

if (T2_ns > T1_ns)
    backward drift: T2_ns - T1_ns
else if (T1_ns > T2_ns)
    forward drift: T1_ns - T2_ns
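
The equations above can be written as a small userspace sketch. scale_delta()
restates the arithmetic of the kernel's pvclock_scale_delta() (apply the
shift, then multiply by the 32-bit fraction and keep the upper bits);
expected_drift_ns() is a hypothetical helper name, and its sign convention
(negative means backward, positive means forward) is an assumption chosen to
match the if/else above.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a TSC delta to nanoseconds: (delta shifted by 'shift') * mul_frac / 2^32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	assert(shift > -64 && shift < 64); /* larger shifts are undefined in C */

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/* 64x32-bit multiply, keeping bits 32..95, without a 128-bit type. */
	uint64_t lo = (uint32_t)delta;
	uint64_t hi = delta >> 32;
	return ((lo * mul_frac) >> 32) + hi * mul_frac;
}

/*
 * t1_ns:  P2 - P1 on the source host, in nanoseconds.
 * t2_tsc: P4 - P3 on the target host, in TSC cycles.
 * Returns T1_ns - T2_ns: negative = backward drift, positive = forward drift.
 */
static int64_t expected_drift_ns(uint64_t t1_ns, uint64_t t2_tsc,
				 uint32_t mul_frac, int shift)
{
	uint64_t t2_ns = scale_delta(t2_tsc, mul_frac, shift);

	return (int64_t)t1_ns - (int64_t)t2_ns;
}
```

For instance, with a made-up multiplier of 0x80000000 (0.5 ns per cycle) and
shift 0, T1_ns = 1000 and T2_tsc = 4000 give T2_ns = 2000, so the helper
returns -1000, i.e. 1000 ns of backward drift.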

To fix this issue, ideally both guest_tsc and guest_clock should be stopped and
resumed at exactly the same time.

As you mentioned in the QEMU patch, "the kvmclock should be a fixed relationship
from the guest's TSC which doesn't change for the whole lifetime of the guest."

Fortunately, taking advantage of KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU can achieve the same goal.

[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/

## Problem 3. Unfortunately, unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests are
being triggered for the vCPUs.

During kvm_synchronize_tsc() or kvm_arch_tsc_set_attr(KVM_VCPU_TSC_OFFSET),
KVM_REQ_MASTERCLOCK_UPDATE requests may be set either before or after KVM_SET_CLOCK.

As a result, once all vCPUs are unpaused, these unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests can lead to kvm-clock drift.

In fact, PATCH 1 and PATCH 3 from this patch set are sufficient on their own
to mitigate the issue.

With the above changes in both QEMU and KVM, a same-host live migration of a
4-vCPU VM with approximately 10 seconds of downtime (introduced on purpose)
results in only about 4 nanoseconds of backward drift in my test environment.
We may even be able to improve QEMU further to eliminate the remaining
4 nanoseconds.

old_clock->tsc_timestamp = 32041800585
old_clock->system_time = 3639151
old_clock->tsc_to_system_mul = 3186238974
old_clock->tsc_shift = -1

new_clock->tsc_timestamp = 213016088950
new_clock->system_time = 67131895453
new_clock->tsc_to_system_mul = 3186238974
new_clock->tsc_shift =  -1
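
One way to cross-check the two snapshots above is to extrapolate old_clock to
new_clock's tsc_timestamp and compare the result with new_clock's system_time.
The sketch below does only that arithmetic; struct pvti_snapshot,
pvti_read_ns() and snapshot_gap_ns() are hypothetical names, and this is not
necessarily how the figure quoted above was measured.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four fields printed above. */
struct pvti_snapshot {
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

/* Same arithmetic as the kernel's pvclock_scale_delta(). */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	assert(shift > -64 && shift < 64);

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	uint64_t lo = (uint32_t)delta;
	uint64_t hi = delta >> 32;
	return ((lo * mul_frac) >> 32) + hi * mul_frac;
}

/* Clock value that 'clk' yields at guest TSC 'tsc' (tsc >= tsc_timestamp). */
static uint64_t pvti_read_ns(const struct pvti_snapshot *clk, uint64_t tsc)
{
	assert(tsc >= clk->tsc_timestamp);
	return clk->system_time +
	       scale_delta(tsc - clk->tsc_timestamp,
			   clk->tsc_to_system_mul, clk->tsc_shift);
}

/* Absolute gap between new_clock and old_clock extrapolated to the same TSC. */
static uint64_t snapshot_gap_ns(const struct pvti_snapshot *old_clock,
				const struct pvti_snapshot *new_clock)
{
	uint64_t expected = pvti_read_ns(old_clock, new_clock->tsc_timestamp);

	return expected > new_clock->system_time ?
	       expected - new_clock->system_time :
	       new_clock->system_time - expected;
}
```

Fed with the values above, old_clock extrapolates to 67131895449 ns at
new_clock's tsc_timestamp, against a new_clock system_time of 67131895453, so
the gap is 4 ns in magnitude. The sketch makes no claim about how that gap
maps onto "forward" or "backward"; that depends on the convention used in the
original measurement.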

If I do not introduce the ~10 seconds of downtime on purpose during live
migration, the drift is always 0 nanoseconds.

I introduce downtime on purpose by stopping the target QEMU before live
migration. The target QEMU will not resume until the 'cont' command is issued in
the QEMU monitor.

Regarding the goal, I would appreciate any quick solution (even a short-term
one) or half-measure that supports:

- Account for live migration downtime.
- Minimize kvm-clock drift (especially backward).

Thank you very much!

Dongli Zhang
