public inbox for kvm@vger.kernel.org
* [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
@ 2026-04-05 22:10 Thomas Lefebvre
  2026-04-06 14:11 ` Sean Christopherson
  2026-04-07  8:17 ` Vitaly Kuznetsov
  0 siblings, 2 replies; 4+ messages in thread
From: Thomas Lefebvre @ 2026-04-05 22:10 UTC
  To: seanjc, pbonzini; +Cc: kvm, linux-kernel, linux-hyperv, vkuznets

Hi,

I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
running KVM inside a Hyper-V VM (nested virtualization).  I tracked
it down to an unsigned wraparound in __get_kvmclock() and have
bpftrace data showing the exact failure.

Setup:
  - Intel i7-11800H laptop running Windows with Hyper-V
  - L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
  - Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
  - KVM running inside L1, hosting L2 guests

Root cause:

__get_kvmclock() does:

    hv_clock.tsc_timestamp = ka->master_cycle_now;
    hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
    ...
    data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);

and __pvclock_read_cycles() does:

    delta = tsc - src->tsc_timestamp;    /* unsigned */

master_cycle_now is a raw RDTSC captured by
pvclock_update_vm_gtod_copy().  host_tsc is a raw RDTSC read by
__get_kvmclock() on the current CPU.  Both go through the vgettsc()
HVCLOCK path which calls hv_read_tsc_page_tsc() -- this computes a
cross-CPU-consistent reference counter via scale/offset, but stores
the *raw* RDTSC in tsc_timestamp as a side effect.

Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
The hypervisor corrects them only through the TSC page scale/offset.
If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
later runs on CPU 1 where the raw TSC is lower, the unsigned
subtraction wraps.
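
To make the wraparound concrete, here is a minimal userspace sketch (not
kernel code).  tsc_delta() mirrors the unsigned subtraction quoted above;
tsc_deficit() is a hypothetical helper of mine that reports how far behind
the current read is.  The inputs in the usage note are the Event 1 values
from the trace below.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as "delta = tsc - src->tsc_timestamp" on u64 operands. */
static inline uint64_t tsc_delta(uint64_t tsc, uint64_t tsc_timestamp)
{
    /* Unsigned subtraction wraps modulo 2^64 when tsc < tsc_timestamp. */
    return tsc - tsc_timestamp;
}

/* Hypothetical helper: cycles by which the current read lags the snapshot. */
static inline uint64_t tsc_deficit(uint64_t tsc, uint64_t tsc_timestamp)
{
    assert(tsc < tsc_timestamp); /* only meaningful when the read is behind */
    return tsc_timestamp - tsc;
}
```

With host_tsc=598991848289183 and master_cycle_now=598992030530137,
tsc_delta() returns 18446744073527310662 and tsc_deficit() returns
182240954, i.e. the delta is 2^64 minus the deficit.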

I wrote a bpftrace tracer (included below) to instrument both
functions and captured two corruption events:

  Event 1:

    [GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
                mcn=598992030530137 mkn=259977082393200

    [GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
      clock=8006399342167092479 host_tsc=598991848289183
      master_cycle_now=598992030530137
      system_time(mkn+off)=5175860260
      TSC DEFICIT: 182240954 cycles

    master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
    CPU 1's raw RDTSC was 182M cycles lower.

      598991848289183 - 598992030530137 = 18446744073527310662 (u64)

    Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
    Correct system_time: 5,175,860,260 ns (~5.2 seconds)

  Event 2:

    [GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
                mcn=599040238416510

    [GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
      clock=8006399342464295526 host_tsc=599040211994220
      master_cycle_now=599040238416510
      TSC DEFICIT: 26422290 cycles

    Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.

kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
vs stale master_cycle_now passed to __pvclock_read_cycles().

The simplest fix I can think of is guarding the __pvclock_read_cycles
call in __get_kvmclock():

    if (data->host_tsc >= hv_clock.tsc_timestamp)
        data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
    else
        data->clock = hv_clock.system_time;

system_time (= master_kernel_ns + kvmclock_offset) was computed from
the TSC page's corrected reference counter and is accurate regardless
of CPU.  The fallback loses sub-us interpolation but avoids a 253-year
jump.  On systems with consistent cross-CPU TSC, the branch is never
taken.
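
For illustration only, the guard can be modelled in userspace as below.
This is a sketch under assumptions, not the kernel implementation: struct
clock_snapshot and the function names are hypothetical stand-ins, the
scaling follows the usual pvclock form (shift the delta, then multiply by
a 32.32 fixed-point factor), and unsigned __int128 is a GCC/Clang
extension.  It only models the host-side read in __get_kvmclock().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the pvclock fields used by the read path. */
struct clock_snapshot {
    uint64_t tsc_timestamp;     /* raw TSC captured with the snapshot */
    uint64_t system_time;       /* nanoseconds at the snapshot */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point cycles-to-ns factor */
    int8_t   tsc_shift;         /* shift applied to the delta first */
};

/* Scale a cycle delta to nanoseconds: shift, then multiply and drop 32 bits. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
    assert(shift > -64 && shift < 64); /* larger shifts are undefined in C */
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

/* Unguarded read: wraps when tsc is below the snapshot's tsc_timestamp. */
static uint64_t read_cycles_unguarded(const struct clock_snapshot *s,
                                      uint64_t tsc)
{
    uint64_t delta = tsc - s->tsc_timestamp;

    return s->system_time +
           scale_delta(delta, s->tsc_to_system_mul, s->tsc_shift);
}

/* Guarded read: fall back to the snapshot time instead of wrapping. */
static uint64_t read_cycles_guarded(const struct clock_snapshot *s,
                                    uint64_t tsc)
{
    if (tsc >= s->tsc_timestamp)
        return read_cycles_unguarded(s, tsc);
    return s->system_time;
}
```

With a made-up multiplier of 0x80000000 (0.5 ns per cycle) and shift 0, a
snapshot at tsc_timestamp=598992030530137 / system_time=5175860260 read
with host_tsc=598991848289183 returns 5175860260 from the guarded variant,
while the unguarded variant returns a value far in the future.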

One thing I wasn't sure about: when the fallback triggers,
KVM_CLOCK_TSC_STABLE is still set in data->flags.  I left it alone
since the returned value is still correct (just less precise), but
I could see an argument for clearing it.

Disabling master clock entirely for HVCLOCK would also work but
seemed heavy -- it sacrifices PVCLOCK_TSC_STABLE_BIT, forces the
guest pvclock read into the atomic64_cmpxchg monotonicity guard,
and triggers KVM_REQ_GLOBAL_CLOCK_UPDATE on vCPU migration.

Reproducer bpftrace script (run while exercising KVM on a Hyper-V
host):

  #!/usr/bin/env bpftrace
  /*
   * Detect host_tsc < master_cycle_now in __get_kvmclock.
   *
   * struct kvm_clock_data layout (for raw offset reads):
   *   offset 0:  u64 clock
   *   offset 24: u64 host_tsc
   */

  kprobe:__get_kvmclock
  {
      $kvm = (struct kvm *)arg0;
      @get_data[tid] = (uint64)arg1;
      @get_use_master[tid] = (uint64)$kvm->arch.use_master_clock;
      @get_mcn[tid] = (uint64)$kvm->arch.master_cycle_now;
      @get_cpu[tid] = cpu;
  }

  kretprobe:__get_kvmclock
  {
      $data_ptr = @get_data[tid];
      if ($data_ptr != 0) {
          $clock = *(uint64 *)($data_ptr);
          $host_tsc = *(uint64 *)($data_ptr + 24);
          $use_master = @get_use_master[tid];
          $mcn = @get_mcn[tid];

          if ($use_master && $host_tsc != 0 && $host_tsc < $mcn) {
              printf("BUG: pid=%d cpu=%d->%d host_tsc=%lu mcn=%lu "
                     "deficit=%lu clock=%lu\n",
                     pid, @get_cpu[tid], cpu, $host_tsc,
                     $mcn, $mcn - $host_tsc, $clock);
          }
      }
      delete(@get_data[tid]);
      delete(@get_use_master[tid]);
      delete(@get_mcn[tid]);
      delete(@get_cpu[tid]);
  }

  kprobe:pvclock_update_vm_gtod_copy {
      @gtod_kvm[tid] = (uint64)arg0;
      @gtod_cpu[tid] = cpu;
  }
  kretprobe:pvclock_update_vm_gtod_copy
  {
      $kvm = (struct kvm *)@gtod_kvm[tid];
      if ($kvm != 0) {
          printf("GTOD: pid=%d cpu=%d->%d mcn=%lu use_master=%d\n",
                 pid, @gtod_cpu[tid], cpu,
                 $kvm->arch.master_cycle_now,
                 $kvm->arch.use_master_clock);
      }
      delete(@gtod_kvm[tid]);
      delete(@gtod_cpu[tid]);
  }

Thanks,
Thomas


* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
  2026-04-05 22:10 [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency Thomas Lefebvre
@ 2026-04-06 14:11 ` Sean Christopherson
  2026-04-07  8:23   ` Vitaly Kuznetsov
  2026-04-07  8:17 ` Vitaly Kuznetsov
  1 sibling, 1 reply; 4+ messages in thread
From: Sean Christopherson @ 2026-04-06 14:11 UTC
  To: Thomas Lefebvre; +Cc: pbonzini, kvm, linux-kernel, linux-hyperv, vkuznets

On Sun, Apr 05, 2026, Thomas Lefebvre wrote:
> Hi,
> 
> I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
> running KVM inside a Hyper-V VM (nested virtualization).  I tracked
> it down to an unsigned wraparound in __get_kvmclock() and have
> bpftrace data showing the exact failure.
> 
> Setup:
>   - Intel i7-11800H laptop running Windows with Hyper-V
>   - L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
>   - Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
>   - KVM running inside L1, hosting L2 guests
> 
> Root cause:
> 
> __get_kvmclock() does:
> 
>     hv_clock.tsc_timestamp = ka->master_cycle_now;
>     hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>     ...
>     data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> 
> and __pvclock_read_cycles() does:
> 
>     delta = tsc - src->tsc_timestamp;    /* unsigned */
> 
> master_cycle_now is a raw RDTSC captured by
> pvclock_update_vm_gtod_copy().  host_tsc is a raw RDTSC read by
> __get_kvmclock() on the current CPU.  Both go through the vgettsc()
> HVCLOCK path which calls hv_read_tsc_page_tsc() -- this computes a
> cross-CPU-consistent reference counter via scale/offset, but stores
> the *raw* RDTSC in tsc_timestamp as a side effect.
> 
> Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> The hypervisor corrects them only through the TSC page scale/offset.
> If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> later runs on CPU 1 where the raw TSC is lower, the unsigned
> subtraction wraps.
> 
> I wrote a bpftrace tracer (included below) to instrument both
> functions and captured two corruption events:
> 
>   Event 1:
> 
>     [GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
>                 mcn=598992030530137 mkn=259977082393200
> 
>     [GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
>       clock=8006399342167092479 host_tsc=598991848289183
>       master_cycle_now=598992030530137
>       system_time(mkn+off)=5175860260
>       TSC DEFICIT: 182240954 cycles
> 
>     master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
>     CPU 1's raw RDTSC was 182M cycles lower.
> 
>       598991848289183 - 598992030530137 = 18446744073527310662 (u64)
> 
>     Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
>     Correct system_time: 5,175,860,260 ns (~5.2 seconds)
> 
>   Event 2:
> 
>     [GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
>                 mcn=599040238416510
> 
>     [GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
>       clock=8006399342464295526 host_tsc=599040211994220
>       master_cycle_now=599040238416510
>       TSC DEFICIT: 26422290 cycles
> 
>     Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.
> 
> kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
> vs stale master_cycle_now passed to __pvclock_read_cycles().
> 
> The simplest fix I can think of is guarding the __pvclock_read_cycles
> call in __get_kvmclock():
> 
>     if (data->host_tsc >= hv_clock.tsc_timestamp)
>         data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>     else
>         data->clock = hv_clock.system_time;

That might kinda sorta work for one KVM-as-the-host path, but it's not a proper
fix.  The actual guest-side (L2) reads in __pvclock_clocksource_read() will also
be broken, because PVCLOCK_TSC_STABLE_BIT will be set.

I don't see how this scenario can possibly work, KVM is effectively mixing two
time domains.  The stable timestamp from the TSC page is (obviously) *derived*
from the raw, *unstable* TSC, but they are two distinct domains.

What really confuses me is why we thought this would work for Hyper-V but not for
kvmclock (i.e. KVM-on-KVM).  Hyper-V's TSC page and kvmclock are the exact same
concept, but vgettsc() only special cases VDSO_CLOCKMODE_HVCLOCK, not
VDSO_CLOCKMODE_PVCLOCK.

Shouldn't we just revert b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests
when running nested on Hyper-V")?

Vitaly, what am I missing?

> system_time (= master_kernel_ns + kvmclock_offset) was computed from
> the TSC page's corrected reference counter and is accurate regardless
> of CPU.  The fallback loses sub-us interpolation but avoids a 253-year
> jump.  On systems with consistent cross-CPU TSC, the branch is never
> taken.
> 
> One thing I wasn't sure about: when the fallback triggers,
> KVM_CLOCK_TSC_STABLE is still set in data->flags.  I left it alone
> since the returned value is still correct (just less precise), but
> I could see an argument for clearing it.
> 
> Disabling master clock entirely for HVCLOCK would also work but
> seemed heavy -- it sacrifices PVCLOCK_TSC_STABLE_BIT, forces the
> guest pvclock read into the atomic64_cmpxchg monotonicity guard,
> and triggers KVM_REQ_GLOBAL_CLOCK_UPDATE on vCPU migration.
> 
> Reproducer bpftrace script (run while exercising KVM on a Hyper-V
> host):
> 
>   #!/usr/bin/env bpftrace
>   /*
>    * Detect host_tsc < master_cycle_now in __get_kvmclock.
>    *
>    * struct kvm_clock_data layout (for raw offset reads):
>    *   offset 0:  u64 clock
>    *   offset 24: u64 host_tsc
>    */
> 
>   kprobe:__get_kvmclock
>   {
>       $kvm = (struct kvm *)arg0;
>       @get_data[tid] = (uint64)arg1;
>       @get_use_master[tid] = (uint64)$kvm->arch.use_master_clock;
>       @get_mcn[tid] = (uint64)$kvm->arch.master_cycle_now;
>       @get_cpu[tid] = cpu;
>   }
> 
>   kretprobe:__get_kvmclock
>   {
>       $data_ptr = @get_data[tid];
>       if ($data_ptr != 0) {
>           $clock = *(uint64 *)($data_ptr);
>           $host_tsc = *(uint64 *)($data_ptr + 24);
>           $use_master = @get_use_master[tid];
>           $mcn = @get_mcn[tid];
> 
>           if ($use_master && $host_tsc != 0 && $host_tsc < $mcn) {
>               printf("BUG: pid=%d cpu=%d->%d host_tsc=%lu mcn=%lu "
>                      "deficit=%lu clock=%lu\n",
>                      pid, @get_cpu[tid], cpu, $host_tsc,
>                      $mcn, $mcn - $host_tsc, $clock);
>           }
>       }
>       delete(@get_data[tid]);
>       delete(@get_use_master[tid]);
>       delete(@get_mcn[tid]);
>       delete(@get_cpu[tid]);
>   }
> 
>   kprobe:pvclock_update_vm_gtod_copy {
>       @gtod_kvm[tid] = (uint64)arg0;
>       @gtod_cpu[tid] = cpu;
>   }
>   kretprobe:pvclock_update_vm_gtod_copy
>   {
>       $kvm = (struct kvm *)@gtod_kvm[tid];
>       if ($kvm != 0) {
>           printf("GTOD: pid=%d cpu=%d->%d mcn=%lu use_master=%d\n",
>                  pid, @gtod_cpu[tid], cpu,
>                  $kvm->arch.master_cycle_now,
>                  $kvm->arch.use_master_clock);
>       }
>       delete(@gtod_kvm[tid]);
>       delete(@gtod_cpu[tid]);
>   }
> 
> Thanks,
> Thomas


* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
  2026-04-05 22:10 [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency Thomas Lefebvre
  2026-04-06 14:11 ` Sean Christopherson
@ 2026-04-07  8:17 ` Vitaly Kuznetsov
  1 sibling, 0 replies; 4+ messages in thread
From: Vitaly Kuznetsov @ 2026-04-07  8:17 UTC
  To: Thomas Lefebvre, seanjc, pbonzini; +Cc: kvm, linux-kernel, linux-hyperv

Thomas Lefebvre <thomas.lefebvre3@gmail.com> writes:

...

>
> Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
> The hypervisor corrects them only through the TSC page scale/offset.
> If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
> later runs on CPU 1 where the raw TSC is lower, the unsigned
> subtraction wraps.
>

According to the TLFS, the reference TSC page is partition-wide:

"The hypervisor provides a partition-wide virtual reference TSC page
which is overlaid on the partition’s GPA space. A partition’s reference
time stamp counter page is accessed through the Reference TSC MSR."

so if, as you say, the raw RDTSC value is inconsistent across vCPUs, I can
hardly see how we can use this time source at all, even without
KVM. The scale and offset are the same for all vCPUs.
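
To illustrate, the TLFS computes the reference time as
((tsc * scale) >> 64) + offset, in 100 ns units.  Below is a minimal
userspace sketch of that formula; the function names are hypothetical,
the scale value in the usage note is made up, and unsigned __int128 is a
GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Reference time in 100 ns units from a raw TSC and the page parameters. */
static uint64_t ref_time_100ns(uint64_t raw_tsc, uint64_t scale,
                               int64_t offset)
{
    uint64_t scaled = (uint64_t)(((unsigned __int128)raw_tsc * scale) >> 64);

    return scaled + (uint64_t)offset;
}

/* Hypothetical helper: reference-time skew between two raw TSC reads that
 * share one partition-wide scale/offset pair. */
static int64_t ref_time_skew(uint64_t raw_a, uint64_t raw_b, uint64_t scale,
                             int64_t offset)
{
    return (int64_t)(ref_time_100ns(raw_a, scale, offset) -
                     ref_time_100ns(raw_b, scale, offset));
}

/* Small self-check with a made-up scale of 2^63 (a factor of 0.5). */
static void self_check(void)
{
    assert(ref_time_100ns(2000, 1ULL << 63, 0) == 1000);
}
```

With that made-up scale, the two raw values from Event 1
(598992030530137 and 598991848289183) give a skew of 91120477 units:
any divergence in the raw TSC carries straight through to the reference
time, because nothing in the formula is per-vCPU.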

I think the fix here is to avoid setting up the Hyper-V TSC page clocksource
in L1. Unfortunately, with unsynchronized TSCs this leaves us with only one
choice for a sane clocksource: raw HV_X64_MSR_TIME_REF_COUNT MSR
reads.

-- 
Vitaly



* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
  2026-04-06 14:11 ` Sean Christopherson
@ 2026-04-07  8:23   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 4+ messages in thread
From: Vitaly Kuznetsov @ 2026-04-07  8:23 UTC
  To: Sean Christopherson, Thomas Lefebvre
  Cc: pbonzini, kvm, linux-kernel, linux-hyperv

Sean Christopherson <seanjc@google.com> writes:

> On Sun, Apr 05, 2026, Thomas Lefebvre wrote:
>> Hi,
>> 
>> I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
>> running KVM inside a Hyper-V VM (nested virtualization).  I tracked
>> it down to an unsigned wraparound in __get_kvmclock() and have
>> bpftrace data showing the exact failure.
>> 
>> Setup:
>>   - Intel i7-11800H laptop running Windows with Hyper-V
>>   - L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
>>   - Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
>>   - KVM running inside L1, hosting L2 guests
>> 
>> Root cause:
>> 
>> __get_kvmclock() does:
>> 
>>     hv_clock.tsc_timestamp = ka->master_cycle_now;
>>     hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>>     ...
>>     data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>> 
>> and __pvclock_read_cycles() does:
>> 
>>     delta = tsc - src->tsc_timestamp;    /* unsigned */
>> 
>> master_cycle_now is a raw RDTSC captured by
>> pvclock_update_vm_gtod_copy().  host_tsc is a raw RDTSC read by
>> __get_kvmclock() on the current CPU.  Both go through the vgettsc()
>> HVCLOCK path which calls hv_read_tsc_page_tsc() -- this computes a
>> cross-CPU-consistent reference counter via scale/offset, but stores
>> the *raw* RDTSC in tsc_timestamp as a side effect.
>> 
>> Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
>> The hypervisor corrects them only through the TSC page scale/offset.
>> If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
>> later runs on CPU 1 where the raw TSC is lower, the unsigned
>> subtraction wraps.
>> 
>> I wrote a bpftrace tracer (included below) to instrument both
>> functions and captured two corruption events:
>> 
>>   Event 1:
>> 
>>     [GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
>>                 mcn=598992030530137 mkn=259977082393200
>> 
>>     [GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
>>       clock=8006399342167092479 host_tsc=598991848289183
>>       master_cycle_now=598992030530137
>>       system_time(mkn+off)=5175860260
>>       TSC DEFICIT: 182240954 cycles
>> 
>>     master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
>>     CPU 1's raw RDTSC was 182M cycles lower.
>> 
>>       598991848289183 - 598992030530137 = 18446744073527310662 (u64)
>> 
>>     Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
>>     Correct system_time: 5,175,860,260 ns (~5.2 seconds)
>> 
>>   Event 2:
>> 
>>     [GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
>>                 mcn=599040238416510
>> 
>>     [GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
>>       clock=8006399342464295526 host_tsc=599040211994220
>>       master_cycle_now=599040238416510
>>       TSC DEFICIT: 26422290 cycles
>> 
>>     Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.
>> 
>> kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
>> vs stale master_cycle_now passed to __pvclock_read_cycles().
>> 
>> The simplest fix I can think of is guarding the __pvclock_read_cycles
>> call in __get_kvmclock():
>> 
>>     if (data->host_tsc >= hv_clock.tsc_timestamp)
>>         data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>>     else
>>         data->clock = hv_clock.system_time;
>
> That might kinda sorta work for one KVM-as-the-host path, but it's not a proper
> fix.  The actual guest-side (L2) reads in __pvclock_clocksource_read() will also
> be broken, because PVCLOCK_TSC_STABLE_BIT will be set.
>
> I don't see how this scenario can possibly work, KVM is effectively mixing two
> time domains.  The stable timestamp from the TSC page is (obviously) *derived*
> from the raw, *unstable* TSC, but they are two distinct domains.
>
> What really confuses me is why we thought this would work for Hyper-V but not for
> kvmclock (i.e. KVM-on-KVM).  Hyper-V's TSC page and kvmclock are the exact same
> concept, but vgettsc() only special cases VDSO_CLOCKMODE_HVCLOCK, not
> VDSO_CLOCKMODE_PVCLOCK.
>
> Shouldn't we just revert b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests
> when running nested on Hyper-V")?
>
> Vitaly, what am I missing?
>

It's probably me who's missing something :-) but my understanding is
that we can't use the TSC page clocksource with unsynchronized TSCs in
L1 at all, as the TSC page (unlike kvmclock) is always partition-wide and
thus can't lead to a sane result if raw TSC readings diverge. The
idea of b0c39dc68e3b was that in Hyper-V guests *with stable,
synchronized TSC* we may still be using the Hyper-V TSC page clocksource
and thus we can pass it to L2.

-- 
Vitaly


