* Re: [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot()
2026-05-26 13:57 [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot() David Woodhouse
@ 2026-05-26 23:04 ` David Woodhouse
2026-05-27 8:30 ` Vitaly Kuznetsov
2026-05-27 8:49 ` Paolo Bonzini
2 siblings, 0 replies; 5+ messages in thread
From: David Woodhouse @ 2026-05-26 23:04 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, John Stultz,
Michael Kelley
Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S. Hall,
Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2117 bytes --]
On Tue, 2026-05-26 at 14:57 +0100, David Woodhouse wrote:
>
> One simple option that occurs to me would be to add a 'cycles_raw'
> value to the system_time_snapshot, for PV clocksources like hyperv and
> kvmclock to populate with the original TSC reading.
>
> That might actually let us clean up some of the PTP code that currently
> has to deal with TSC vs. kvmclock in counter snapshots too. I think I
> could kill the use of get_cycles() in vmclock for the kvmclock case,
> which might make Thomas happy...
I hacked that up to see what it looks like, and it kind of seems to work...
Based on merging my kvmclock branch and Thomas's ktime_get_snapshot_id():
• https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/kvmclock5
• https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/log/?h=timers/ptp/timekeeping
I'll probably not post this for real until the above two are merged;
there's no rush but I think it's a worthwhile cleanup. For now it's at
• https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/kvm-ktime-snapshot
David Woodhouse (8):
  timekeeping: Add clocksource read_raw() method and raw_cycles to snapshot
  clocksource/hyperv: Implement read_raw() for TSC page clocksource
  x86/kvmclock: Implement read_raw() for kvmclock clocksource
  KVM: x86: Use ktime_get_snapshot_id() for master clock
  KVM: x86: Compute kvmclock base without pvclock_gtod_data
  KVM: x86: Replace pvclock_gtod_data vclock_mode with boolean
  KVM: x86: Remove pvclock_gtod_data and private timekeeping code
  ptp: vmclock: Use raw_cycles from snapshot for precise TSC pairing

 arch/x86/kernel/kvmclock.c         |  21 ++++
 arch/x86/kvm/x86.c                 | 239 ++++++++-----------------------------
 drivers/clocksource/hyperv_timer.c |  14 +++
 drivers/ptp/ptp_vmclock.c          |   4 +
 include/linux/clocksource.h        |   8 ++
 include/linux/timekeeping.h        |   6 +
 kernel/time/timekeeping.c          |  30 ++++-
 7 files changed, 130 insertions(+), 192 deletions(-)
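
For readers who have not looked at the branch, the idea behind the first patch can be modelled roughly as below. This is a userspace sketch only, not the code from the series: struct cs_model, struct snapshot_model, take_snapshot(), the fake TSC value and the divide-by-300 stand-in for the 10MHz scaling are all hypothetical names and assumptions made for illustration. The only things taken from the patch titles are that a clocksource may offer a read_raw() method and that the snapshot gains a raw_cycles value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a clocksource; NOT the kernel's struct clocksource. */
struct cs_model {
	uint64_t (*read)(void);
	/* Optional: returns the scaled count, stores the unscaled TSC in *raw. */
	uint64_t (*read_raw)(uint64_t *raw);
};

/* Hypothetical model of the snapshot with the extra raw counter value. */
struct snapshot_model {
	uint64_t cycles;	/* what the clocksource reports */
	uint64_t raw_cycles;	/* underlying TSC reading, if different */
};

/* Stand-in for RDTSC so the example is deterministic. */
static uint64_t fake_tsc = 3000000150ULL;

static uint64_t tsc_read(void)
{
	return fake_tsc;
}

/* PV-style clocksource: one TSC read feeds both the raw and scaled values. */
static uint64_t pv_read_raw(uint64_t *raw)
{
	uint64_t tsc = fake_tsc;

	*raw = tsc;
	return tsc / 300;	/* assumed 3GHz TSC scaled down to 10MHz */
}

static uint64_t pv_read(void)
{
	uint64_t raw;

	return pv_read_raw(&raw);
}

static struct snapshot_model take_snapshot(const struct cs_model *cs)
{
	struct snapshot_model snap;

	assert(cs && cs->read);
	if (cs->read_raw)
		snap.cycles = cs->read_raw(&snap.raw_cycles);
	else
		snap.cycles = snap.raw_cycles = cs->read();
	return snap;
}

static const struct cs_model tsc_cs = { .read = tsc_read, .read_raw = NULL };
static const struct cs_model pv_cs = { .read = pv_read, .read_raw = pv_read_raw };
```

In this model a consumer such as KVM would always look at raw_cycles: for a plain TSC clocksource it equals cycles, and for a PV clocksource it is the TSC value that was read to produce the scaled count, so no reverse conversion is needed.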
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot()
2026-05-26 13:57 [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot() David Woodhouse
2026-05-26 23:04 ` David Woodhouse
@ 2026-05-27 8:30 ` Vitaly Kuznetsov
2026-05-27 8:42 ` David Woodhouse
2026-05-27 8:49 ` Paolo Bonzini
2 siblings, 1 reply; 5+ messages in thread
From: Vitaly Kuznetsov @ 2026-05-27 8:30 UTC (permalink / raw)
To: David Woodhouse, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, John Stultz, Michael Kelley
Cc: Marcelo Tosatti, Christopher S. Hall, Stephen Boyd,
Miroslav Lichvar, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86, linux-kernel
David Woodhouse <dwmw2@infradead.org> writes:
...
>
> Then in 2018, Vitaly Kuznetsov added Hyper-V TSC page support in
> commit b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests when
> running nested on Hyper-V"), which extended vgettsc() to handle the
> HVCLOCK case.
>
> I'd quite like to kill it all with fire and make KVM use
> ktime_get_snapshot() instead.
The main motivation is reducing the complexity of KVM's timekeeping code
I guess?
>
> However, to correlate with the TSC provided to guests, KVM needs the
> underlying host TSC counter value, *not* the cycles count from the
> hyperv_clocksource_tsc_page clocksource which is scaled to 10MHz.
>
> If we wanted to support master clock mode while nesting under KVM and
> bizarrely using the kvmclock for system timing, we'd have the same
> problem with the kvmclock clocksource, which similarly scales to 1GHz.
>
> One option is to say "Don't Do That Then™": if you want to provide a
> masterclock kvmclock to guests then *don't* use the silly pvclocks for
> your own kernel's timekeeping, use the damn TSC. Because if the TSC
> *isn't* reliable then you can't do masterclock mode for your guests
> anyway.
The statement "TSC isn't reliable" deserves a book of its own :-)
Historically, we've seen all sorts of issues with it, but by the time of
b0c39dc68e3b, they were mostly gone. The real problem the Hyper-V/Azure
folks were solving back then was that while the TSC *was* reliable
(synchronized across CPUs, not jumping backwards, stable frequency,
...), tons of hardware out there (Azure is quite big) did not support
TSC scaling. VMs on Azure don't migrate very often, but they do migrate
when hardware maintenance is needed. Migrating to a host with a
different TSC frequency would've been a problem, so the Hyper-V TSC page
was introduced. Note: it is a *single* page for all CPUs, so the
clocksource was never intended to be used in a situation where TSCs are
unsynchronized across CPUs.
To deal with migrations, the Hyper-V folks came up with a mechanism
called 'reenlightenment notifications', and we support it in KVM. It's
not really great, as we need to stop all the nested VMs, but it does the
job: we can re-compute guest PV clocksources (kvmclock, TSC page,
... Xen?) and live happily ever after.
>
> Perhaps that should have been the response when commit b0c39dc68e3b was
> submitted, but I guess we're stuck supporting that mode now.
Times are changing, and it is becoming increasingly difficult to find
x86 hardware without TSC scaling support. Linux guests on Hyper-V now
prefer TSC if possible (HV_ACCESS_TSC_INVARIANT; see, e.g., commit
4c78738ead4e), so I expect that in a few years, there will be no need
for the Hyper-V TSC page clocksource or the reenlightenment logic
anyway.
> But I really do want to kill the KVM hacks and use ktime_get_snapshot().
>
> Reverse-engineering the original TSC reading from the clocksource
> counter value doesn't look sane, without a loss of precision and/or
> 128-bit division.
>
> One simple option that occurs to me would be to add a 'cycles_raw'
> value to the system_time_snapshot, for PV clocksources like hyperv and
> kvmclock to populate with the original TSC reading.
Personally, I don't see this as such an ugly hack.
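
To make the quoted point about reverse-engineering concrete, here is a small userspace sketch of the Hyper-V reference TSC page conversion, ((tsc * scale) >> 64) + offset, and an attempted inverse. The helper names (hv_scale_for(), tsc_to_ref(), ref_to_tsc()) and the 3GHz TSC frequency are assumptions for illustration, not kernel code, and unsigned __int128 is a GCC/Clang extension assumed to be available.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang extension; assumed available (e.g. on x86-64). */
typedef unsigned __int128 u128;

/* Hypothetical helper: 64.64 fixed-point scale mapping tsc_hz onto 10MHz. */
static uint64_t hv_scale_for(uint64_t tsc_hz)
{
	assert(tsc_hz > 10000000ULL);	/* keeps the scale below 2^64 */
	return (uint64_t)(((u128)10000000ULL << 64) / tsc_hz);
}

/* Forward direction: one widening multiply and a shift. */
static uint64_t tsc_to_ref(uint64_t tsc, uint64_t scale, uint64_t offset)
{
	return (uint64_t)(((u128)tsc * scale) >> 64) + offset;
}

/*
 * Reverse direction: needs a 128-bit division, and the bits discarded by
 * the forward shift cannot be recovered.
 */
static uint64_t ref_to_tsc(uint64_t ref, uint64_t scale, uint64_t offset)
{
	assert(scale != 0);
	return (uint64_t)(((u128)(ref - offset) << 64) / scale);
}
```

With the assumed 3GHz TSC, roughly 300 consecutive TSC values collapse onto each 100ns tick of the reference counter, so ref_to_tsc() can return at best one representative of that range rather than the value that was actually read.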
>
> That might actually let us clean up some of the PTP code that currently
> has to deal with TSC vs. kvmclock in counter snapshots too. I think I
> could kill the use of get_cycles() in vmclock for the kvmclock case,
> which might make Thomas happy...
>
> Any better ideas?
--
Vitaly
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot()
2026-05-27 8:30 ` Vitaly Kuznetsov
@ 2026-05-27 8:42 ` David Woodhouse
0 siblings, 0 replies; 5+ messages in thread
From: David Woodhouse @ 2026-05-27 8:42 UTC (permalink / raw)
To: Vitaly Kuznetsov, Sean Christopherson, Paolo Bonzini,
Thomas Gleixner, John Stultz, Michael Kelley
Cc: Marcelo Tosatti, Christopher S. Hall, Stephen Boyd,
Miroslav Lichvar, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1454 bytes --]
On Wed, 2026-05-27 at 10:30 +0200, Vitaly Kuznetsov wrote:
> David Woodhouse <dwmw2@infradead.org> writes:
>
> ...
>
> >
> > Then in 2018, Vitaly Kuznetsov added Hyper-V TSC page support in
> > commit b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests when
> > running nested on Hyper-V"), which extended vgettsc() to handle the
> > HVCLOCK case.
> >
> > I'd quite like to kill it all with fire and make KVM use
> > ktime_get_snapshot() instead.
>
> The main motivation is reducing the complexity of KVM's timekeeping code
> I guess?
Reduce the complexity, and clean it up to use the core kernel snapshot
facility as $DEITY intended, yes.
> The statement "TSC isn't reliable" deserves a book of its own :-)
... deserves a whole wing in a rehab facility...
Thanks for the clarification.
> > One simple option that occurs to me would be to add a 'cycles_raw'
> > value to the system_time_snapshot, for PV clocksources like hyperv and
> > kvmclock to populate with the original TSC reading.
>
> Personally, I don't see this as such an ugly hack.
Yeah, having thrown that together last night to see what it looks like,
I'm not all that unhappy with it. And it does indeed let me provide
precise vmclock paired time when the clocksource is kvmclock, too.
There are probably some details to tweak, and I'm sure Thomas will call
me a moron again, but I think it's worth it for the resulting cleanup
:)
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot()
2026-05-26 13:57 [RFC] KVM/x86: Killing kvm_get_time_and_clockread() in favour of ktime_get_snapshot() David Woodhouse
2026-05-26 23:04 ` David Woodhouse
2026-05-27 8:30 ` Vitaly Kuznetsov
@ 2026-05-27 8:49 ` Paolo Bonzini
2 siblings, 0 replies; 5+ messages in thread
From: Paolo Bonzini @ 2026-05-27 8:49 UTC (permalink / raw)
To: David Woodhouse, Sean Christopherson, Thomas Gleixner,
John Stultz, Michael Kelley
Cc: Vitaly Kuznetsov, Marcelo Tosatti, Christopher S. Hall,
Stephen Boyd, Miroslav Lichvar, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Daniel Lezcano, kvm, linux-hyperv, x86,
linux-kernel
On 5/26/26 15:57, David Woodhouse wrote:
> In 2012, as part of implementing the "master clock" mode for kvmclock,
> Marcelo added kvm_get_time_and_clockread() in commit d828199e8444
> ("KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag").
>
> In 2016, Christopher Hall added the generic ktime_get_snapshot() in
> commit 9da0f49c8767 ("time: Add timekeeping snapshot code capturing
> system time and counter"), which provides the same paired read of
> { time, counter } through the core timekeeping code.
>
> Then in 2018, Vitaly Kuznetsov added Hyper-V TSC page support in
> commit b0c39dc68e3b ("x86/kvm: Pass stable clocksource to guests when
> running nested on Hyper-V"), which extended vgettsc() to handle the
> HVCLOCK case.
>
> I'd quite like to kill it all with fire and make KVM use
> ktime_get_snapshot() instead.
>
> However, to correlate with the TSC provided to guests, KVM needs the
> underlying host TSC counter value, *not* the cycles count from the
> hyperv_clocksource_tsc_page clocksource which is scaled to 10MHz.
>
> If we wanted to support master clock mode while nesting under KVM and
> bizarrely using the kvmclock for system timing, we'd have the same
> problem with the kvmclock clocksource, which similarly scales to 1GHz.
>
> One option is to say "Don't Do That Then™": if you want to provide a
> masterclock kvmclock to guests then *don't* use the silly pvclocks for
> your own kernel's timekeeping, use the damn TSC. Because if the TSC
> *isn't* reliable then you can't do masterclock mode for your guests
> anyway.
>
> Perhaps that should have been the response when commit b0c39dc68e3b was
> submitted, but I guess we're stuck supporting that mode now. But I
> really do want to kill the KVM hacks and use ktime_get_snapshot().
>
> Reverse-engineering the original TSC reading from the clocksource
> counter value doesn't look sane, without a loss of precision and/or
> 128-bit division.
>
> One simple option that occurs to me would be to add a 'cycles_raw'
> value to the system_time_snapshot, for PV clocksources like hyperv and
> kvmclock to populate with the original TSC reading.
>
> That might actually let us clean up some of the PTP code that currently
> has to deal with TSC vs. kvmclock in counter snapshots too. I think I
> could kill the use of get_cycles() in vmclock for the kvmclock case,
> which might make Thomas happy...
Yeah, when reading I was thinking of PTP as well. Seems worthwhile.
Paolo
^ permalink raw reply [flat|nested] 5+ messages in thread