From: Vitaly Kuznetsov <vkuznets@redhat.com>
To: Sean Christopherson <seanjc@google.com>,
Thomas Lefebvre <thomas.lefebvre3@gmail.com>
Cc: pbonzini@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
Subject: Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
Date: Tue, 07 Apr 2026 10:23:52 +0200
Message-ID: <87se97mb53.fsf@redhat.com>
In-Reply-To: <adO_EYdKtl_TXooI@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Sun, Apr 05, 2026, Thomas Lefebvre wrote:
>> Hi,
>>
>> I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
>> running KVM inside a Hyper-V VM (nested virtualization). I tracked
>> it down to an unsigned wraparound in __get_kvmclock() and have
>> bpftrace data showing the exact failure.
>>
>> Setup:
>> - Intel i7-11800H laptop running Windows with Hyper-V
>> - L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
>> - Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
>> - KVM running inside L1, hosting L2 guests
>>
>> Root cause:
>>
>> __get_kvmclock() does:
>>
>> hv_clock.tsc_timestamp = ka->master_cycle_now;
>> hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>> ...
>> data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>>
>> and __pvclock_read_cycles() does:
>>
>> delta = tsc - src->tsc_timestamp; /* unsigned */
>>
>> master_cycle_now is a raw RDTSC captured by
>> pvclock_update_vm_gtod_copy(). host_tsc is a raw RDTSC read by
>> __get_kvmclock() on the current CPU. Both go through the vgettsc()
>> HVCLOCK path which calls hv_read_tsc_page_tsc() -- this computes a
>> cross-CPU-consistent reference counter via scale/offset, but stores
>> the *raw* RDTSC in tsc_timestamp as a side effect.
>>
>> Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
>> The hypervisor corrects them only through the TSC page scale/offset.
>> If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
>> later runs on CPU 1 where the raw TSC is lower, the unsigned
>> subtraction wraps.
>>
>> I wrote a bpftrace tracer (included below) to instrument both
>> functions and captured two corruption events:
>>
>> Event 1:
>>
>> [GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
>> mcn=598992030530137 mkn=259977082393200
>>
>> [GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
>> clock=8006399342167092479 host_tsc=598991848289183
>> master_cycle_now=598992030530137
>> system_time(mkn+off)=5175860260
>> TSC DEFICIT: 182240954 cycles
>>
>> master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
>> CPU 1's raw RDTSC was 182M cycles lower.
>>
>> 598991848289183 - 598992030530137 = 18446744073527310662 (u64)
>>
>> Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
>> Correct system_time: 5,175,860,260 ns (~5.2 seconds)
>>
>> Event 2:
>>
>> [GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
>> mcn=599040238416510
>>
>> [GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
>> clock=8006399342464295526 host_tsc=599040211994220
>> master_cycle_now=599040238416510
>> TSC DEFICIT: 26422290 cycles
>>
>> Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.
>>
>> kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
>> vs stale master_cycle_now passed to __pvclock_read_cycles().
>>
>> The simplest fix I can think of is guarding the __pvclock_read_cycles
>> call in __get_kvmclock():
>>
>>     if (data->host_tsc >= hv_clock.tsc_timestamp)
>>         data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>>     else
>>         data->clock = hv_clock.system_time;
>
> That might kinda sorta work for one KVM-as-the-host path, but it's not a proper
> fix. The actual guest-side (L2) reads in __pvclock_clocksource_read() will also
> be broken, because PVCLOCK_TSC_STABLE_BIT will be set.
>
> I don't see how this scenario can possibly work, KVM is effectively mixing two
> time domains. The stable timestamp from the TSC page is (obviously) *derived*
> from the raw, *unstable* TSC, but they are two distinct domains.
>
> What really confuses me is why we thought this would work for Hyper-V but not for
> kvmclock (i.e. KVM-on-KVM). Hyper-V's TSC page and kvmclock are the exact same
> concept, but vgettsc() only special cases VDSO_CLOCKMODE_HVCLOCK, not
> VDSO_CLOCKMODE_PVCLOCK.
>
> Shouldn't we just revert b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests
> when running nested on Hyper-V")?
>
> Vitaly, what am I missing?
>
It's probably me who's missing something :-) but my understanding is
that we can't use the TSC page clocksource with unsynchronized TSCs in
L1 at all: the TSC page (unlike kvmclock) is always partition-wide, and
thus can't lead to a sane result if raw TSC readings diverge. The
idea of b0c39dc68e3b was that in Hyper-V guests *with stable,
synchronized TSC* we may still be using the Hyper-V TSC page
clocksource, and thus we can pass it to L2.
--
Vitaly