From: Sean Christopherson <seanjc@google.com>
To: Thomas Lefebvre <thomas.lefebvre3@gmail.com>
Cc: pbonzini@redhat.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
vkuznets@redhat.com
Subject: Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
Date: Mon, 6 Apr 2026 07:11:29 -0700
Message-ID: <adO_EYdKtl_TXooI@google.com>
In-Reply-To: <CAKdXbaV1PTwetd4zs6+6Rp7h0dwHU1ygMoof5eAcfL6XYZF1xA@mail.gmail.com>

On Sun, Apr 05, 2026, Thomas Lefebvre wrote:
> Hi,
>
> I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
> running KVM inside a Hyper-V VM (nested virtualization). I tracked
> it down to an unsigned wraparound in __get_kvmclock() and have
> bpftrace data showing the exact failure.
>
> Setup:
> - Intel i7-11800H laptop running Windows with Hyper-V
> - L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
> - Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
> - KVM running inside L1, hosting L2 guests
>
> Root cause:
>
> __get_kvmclock() does:
>
>     hv_clock.tsc_timestamp = ka->master_cycle_now;
>     hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>     ...
>     data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>
> and __pvclock_read_cycles() does:
>
>     delta = tsc - src->tsc_timestamp; /* unsigned */
>
> master_cycle_now is a raw RDTSC captured by
> pvclock_update_vm_gtod_copy(). host_tsc is a raw RDTSC read by
> __get_kvmclock() on the current CPU. Both go through the vgettsc()
> HVCLOCK path which calls hv_read_tsc_page_tsc() -- this computes a
> cross-CPU-consistent reference counter via scale/offset, but stores
> the *raw* RDTSC in tsc_timestamp as a side effect.
>
> Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> The hypervisor corrects them only through the TSC page scale/offset.
> If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> later runs on CPU 1 where the raw TSC is lower, the unsigned
> subtraction wraps.
>
> I wrote a bpftrace tracer (included below) to instrument both
> functions and captured two corruption events:
>
> Event 1:
>
> [GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
> mcn=598992030530137 mkn=259977082393200
>
> [GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
> clock=8006399342167092479 host_tsc=598991848289183
> master_cycle_now=598992030530137
> system_time(mkn+off)=5175860260
> TSC DEFICIT: 182240954 cycles
>
> master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
> CPU 1's raw RDTSC was 182M cycles lower.
>
> 598991848289183 - 598992030530137 = 18446744073527310662 (u64)
>
> Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
> Correct system_time: 5,175,860,260 ns (~5.2 seconds)
>
> Event 2:
>
> [GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
> mcn=599040238416510
>
> [GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
> clock=8006399342464295526 host_tsc=599040211994220
> master_cycle_now=599040238416510
> TSC DEFICIT: 26422290 cycles
>
> Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.
>
> kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
> vs stale master_cycle_now passed to __pvclock_read_cycles().
>
> The simplest fix I can think of is guarding the __pvclock_read_cycles
> call in __get_kvmclock():
>
>     if (data->host_tsc >= hv_clock.tsc_timestamp)
>         data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>     else
>         data->clock = hv_clock.system_time;

That might kinda sorta work for one KVM-as-the-host path, but it's not a proper
fix. The actual guest-side (L2) reads in __pvclock_clocksource_read() will also
be broken, because PVCLOCK_TSC_STABLE_BIT will be set.

I don't see how this scenario can possibly work: KVM is effectively mixing two
time domains. The stable timestamp from the TSC page is (obviously) *derived*
from the raw, *unstable* TSC, but they are two distinct domains.

What really confuses me is why we thought this would work for Hyper-V but not for
kvmclock (i.e. KVM-on-KVM). Hyper-V's TSC page and kvmclock are the exact same
concept, but vgettsc() only special-cases VDSO_CLOCKMODE_HVCLOCK, not
VDSO_CLOCKMODE_PVCLOCK.

Shouldn't we just revert b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests
when running nested on Hyper-V")?

Vitaly, what am I missing?

> system_time (= master_kernel_ns + kvmclock_offset) was computed from
> the TSC page's corrected reference counter and is accurate regardless
> of CPU. The fallback loses sub-us interpolation but avoids a 253-year
> jump. On systems with consistent cross-CPU TSC, the branch is never
> taken.
>
> One thing I wasn't sure about: when the fallback triggers,
> KVM_CLOCK_TSC_STABLE is still set in data->flags. I left it alone
> since the returned value is still correct (just less precise), but
> I could see an argument for clearing it.
>
> Disabling master clock entirely for HVCLOCK would also work but
> seemed heavy -- it sacrifices PVCLOCK_TSC_STABLE_BIT, forces the
> guest pvclock read into the atomic64_cmpxchg monotonicity guard,
> and triggers KVM_REQ_GLOBAL_CLOCK_UPDATE on vCPU migration.
>
> Reproducer bpftrace script (run while exercising KVM on a Hyper-V
> host):
>
> #!/usr/bin/env bpftrace
> /*
>  * Detect host_tsc < master_cycle_now in __get_kvmclock.
>  *
>  * struct kvm_clock_data layout (for raw offset reads):
>  *   offset 0:  u64 clock
>  *   offset 24: u64 host_tsc
>  */
>
> kprobe:__get_kvmclock
> {
>     $kvm = (struct kvm *)arg0;
>     @get_data[tid] = (uint64)arg1;
>     @get_use_master[tid] = (uint64)$kvm->arch.use_master_clock;
>     @get_mcn[tid] = (uint64)$kvm->arch.master_cycle_now;
>     @get_cpu[tid] = cpu;
> }
>
> kretprobe:__get_kvmclock
> {
>     $data_ptr = @get_data[tid];
>     if ($data_ptr != 0) {
>         $clock = *(uint64 *)($data_ptr);
>         $host_tsc = *(uint64 *)($data_ptr + 24);
>         $use_master = @get_use_master[tid];
>         $mcn = @get_mcn[tid];
>
>         if ($use_master && $host_tsc != 0 && $host_tsc < $mcn) {
>             printf("BUG: pid=%d cpu=%d->%d host_tsc=%lu mcn=%lu "
>                    "deficit=%lu clock=%lu\n",
>                    pid, @get_cpu[tid], cpu, $host_tsc,
>                    $mcn, $mcn - $host_tsc, $clock);
>         }
>     }
>     delete(@get_data[tid]);
>     delete(@get_use_master[tid]);
>     delete(@get_mcn[tid]);
>     delete(@get_cpu[tid]);
> }
>
> kprobe:pvclock_update_vm_gtod_copy
> {
>     @gtod_kvm[tid] = (uint64)arg0;
>     @gtod_cpu[tid] = cpu;
> }
>
> kretprobe:pvclock_update_vm_gtod_copy
> {
>     $kvm = (struct kvm *)@gtod_kvm[tid];
>     if ($kvm != 0) {
>         printf("GTOD: pid=%d cpu=%d->%d mcn=%lu use_master=%d\n",
>                pid, @gtod_cpu[tid], cpu,
>                $kvm->arch.master_cycle_now,
>                $kvm->arch.use_master_clock);
>     }
>     delete(@gtod_kvm[tid]);
>     delete(@gtod_cpu[tid]);
> }
>
> Thanks,
> Thomas