* [PATCH v4 36/47] x86/pvclock: Mark setup helpers and related various as __init/__ro_after_init
From: Sean Christopherson @ 2026-05-29 15:08 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Now that Xen PV clock and kvmclock explicitly do setup only during init,
tag the common PV clock flags/vsyscall variables and their mutators with
__init.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/pvclock.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index b3f81379c2fc..a51adce67f92 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -16,10 +16,10 @@
#include <asm/pvclock.h>
#include <asm/vgtod.h>
-static u8 valid_flags __read_mostly = 0;
-static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
+static u8 valid_flags __ro_after_init = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __ro_after_init;
-void pvclock_set_flags(u8 flags)
+void __init pvclock_set_flags(u8 flags)
{
valid_flags = flags;
}
@@ -153,7 +153,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
set_normalized_timespec64(ts, now.tv_sec, now.tv_nsec);
}
-void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+void __init pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
{
WARN_ON(vclock_was_used(VDSO_CLOCKMODE_PVCLOCK));
pvti_cpu0_va = pvti;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 35/47] x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
From: Sean Christopherson @ 2026-05-29 15:08 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Annotate xen_setup_vsyscall_time_info() as being used only during kernel
initialization; it's called only by xen_time_init(), which is already
tagged __init.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/xen/time.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 91ef83b1e540..8f4511f91d16 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -454,7 +454,7 @@ void xen_restore_time_memory_area(void)
xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
}
-static void xen_setup_vsyscall_time_info(void)
+static void __init xen_setup_vsyscall_time_info(void)
{
struct vcpu_register_time_memory_area t;
struct pvclock_vsyscall_time_info *ti;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 34/47] x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
From: Sean Christopherson @ 2026-05-29 15:08 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Move kvm_sched_clock_init() "down" so that it can reference the global
kvm_clock structure without needing a forward declaration.
Opportunistically mark the helper as "__init" instead of "inline" to make
its usage more obvious; modern compilers don't need a hint to inline a
single-use function, and an extra CALL+RET pair during boot is a complete
non-issue. And, if the compiler ignores the hint and does NOT inline the
function, the resulting code may not get discarded after boot due lack of
an __init annotation.
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 05bca2be0df3..6372b4dc7b0c 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -134,20 +134,6 @@ static void kvm_restore_sched_clock_state(void)
kvm_register_clock("primary cpu clock, resume");
}
-static inline void kvm_sched_clock_init(bool stable)
-{
- kvm_sched_clock_offset = kvm_clock_read();
- __paravirt_set_sched_clock(kvm_sched_clock_read, stable,
- kvm_save_sched_clock_state,
- kvm_restore_sched_clock_state);
-
- pr_info("kvm-clock: using sched offset of %llu cycles",
- kvm_sched_clock_offset);
-
- BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
- sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
-}
-
void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
{
/*
@@ -303,6 +289,20 @@ static int kvmclock_setup_percpu(unsigned int cpu)
return p ? 0 : -ENOMEM;
}
+static __init void kvm_sched_clock_init(bool stable)
+{
+ kvm_sched_clock_offset = kvm_clock_read();
+ __paravirt_set_sched_clock(kvm_sched_clock_read, stable,
+ kvm_save_sched_clock_state,
+ kvm_restore_sched_clock_state);
+
+ pr_info("kvm-clock: using sched offset of %llu cycles",
+ kvm_sched_clock_offset);
+
+ BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
+ sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+}
+
void __init kvmclock_init(bool prefer_tsc)
{
u8 flags;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 33/47] x86/paravirt: Pass sched_clock save/restore helpers during registration
From: Sean Christopherson @ 2026-05-29 15:07 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Pass in a PV clock's save/restore helpers when configuring sched_clock
instead of relying on each PV clock to manually set the save/restore hooks.
In addition to bringing sanity to the code, this will allow gracefully
"rejecting" a PV sched_clock, e.g. when running as a CoCo guest that has
access to a "secure" TSC.
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/timer.h | 9 ++++++---
arch/x86/kernel/cpu/vmware.c | 8 +++-----
arch/x86/kernel/kvmclock.c | 6 +++---
arch/x86/kernel/tsc.c | 5 ++++-
arch/x86/xen/time.c | 5 ++---
drivers/clocksource/hyperv_timer.c | 6 ++----
6 files changed, 20 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index fe41d40a9ae6..e97cd1ae03d1 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -14,11 +14,14 @@ extern int no_timer_check;
extern bool using_native_sched_clock(void);
#ifdef CONFIG_PARAVIRT
-void __paravirt_set_sched_clock(u64 (*func)(void), bool stable);
+void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
+ void (*save)(void), void (*restore)(void));
-static inline void paravirt_set_sched_clock(u64 (*func)(void))
+static inline void paravirt_set_sched_clock(u64 (*func)(void),
+ void (*save)(void),
+ void (*restore)(void))
{
- __paravirt_set_sched_clock(func, true);
+ __paravirt_set_sched_clock(func, true, save, restore);
}
#endif
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 051ef89029a7..f3ffc05c7c53 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -347,11 +347,9 @@ static void __init vmware_paravirt_ops_setup(void)
vmware_cyc2ns_setup();
- if (vmw_sched_clock) {
- paravirt_set_sched_clock(vmware_sched_clock);
- x86_platform.save_sched_clock_state = x86_init_noop;
- x86_platform.restore_sched_clock_state = x86_init_noop;
- }
+ if (vmw_sched_clock)
+ paravirt_set_sched_clock(vmware_sched_clock,
+ x86_init_noop, x86_init_noop);
if (vmware_is_stealclock_available()) {
has_steal_clock = true;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 0d4f2cf97246..05bca2be0df3 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -137,7 +137,9 @@ static void kvm_restore_sched_clock_state(void)
static inline void kvm_sched_clock_init(bool stable)
{
kvm_sched_clock_offset = kvm_clock_read();
- __paravirt_set_sched_clock(kvm_sched_clock_read, stable);
+ __paravirt_set_sched_clock(kvm_sched_clock_read, stable,
+ kvm_save_sched_clock_state,
+ kvm_restore_sched_clock_state);
pr_info("kvm-clock: using sched offset of %llu cycles",
kvm_sched_clock_offset);
@@ -345,8 +347,6 @@ void __init kvmclock_init(bool prefer_tsc)
#ifdef CONFIG_SMP
x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
- x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
- x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
kvm_get_preset_lpj();
/*
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 19da1a3d2126..7fbcfc2efd1d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -280,12 +280,15 @@ bool using_native_sched_clock(void)
return static_call_query(pv_sched_clock) == native_sched_clock;
}
-void __paravirt_set_sched_clock(u64 (*func)(void), bool stable)
+void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
+ void (*save)(void), void (*restore)(void))
{
if (!stable)
clear_sched_clock_stable();
static_call_update(pv_sched_clock, func);
+ x86_platform.save_sched_clock_state = save;
+ x86_platform.restore_sched_clock_state = restore;
}
#else
u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 640b71d22d97..91ef83b1e540 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -577,13 +577,12 @@ static void __init xen_init_time_common(void)
{
xen_sched_clock_offset = xen_clocksource_read();
static_call_update(pv_steal_clock, xen_steal_clock);
- paravirt_set_sched_clock(xen_sched_clock);
+
/*
* Xen has paravirtualized suspend/resume and so doesn't use the common
* x86 sched_clock save/restore hooks.
*/
- x86_platform.save_sched_clock_state = x86_init_noop;
- x86_platform.restore_sched_clock_state = x86_init_noop;
+ paravirt_set_sched_clock(xen_sched_clock, x86_init_noop, x86_init_noop);
x86_init.hyper.get_tsc_khz = xen_tsc_khz;
x86_platform.get_wallclock = xen_get_wallclock;
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index ac1d9f9c381c..dee59ce61c29 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -553,10 +553,8 @@ static void hv_restore_sched_clock_state(void)
static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
/* We're on x86/x64 *and* using PV ops */
- paravirt_set_sched_clock(sched_clock);
-
- x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
- x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
+ paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
+ hv_restore_sched_clock_state);
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 32/47] x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
From: Sean Christopherson @ 2026-05-29 15:07 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Now that all PV clocksources override the sched_clock save/restore hooks
when overriding sched_clock, WARN if the "default" TSC hooks are invoked
when using a PV sched_clock, e.g. to guard against regressions.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/tsc.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a9b6d3399c23..19da1a3d2126 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -972,9 +972,17 @@ EXPORT_SYMBOL_GPL(recalibrate_cpu_khz);
static unsigned long long cyc2ns_suspend;
+static __always_inline bool tsc_is_save_restore_needed(void)
+{
+ if (WARN_ON_ONCE(!using_native_sched_clock()))
+ return false;
+
+ return static_branch_likely(&__use_tsc) || sched_clock_stable();
+}
+
void tsc_save_sched_clock_state(void)
{
- if (!static_branch_likely(&__use_tsc) && !sched_clock_stable())
+ if (!tsc_is_save_restore_needed())
return;
cyc2ns_suspend = sched_clock();
@@ -994,7 +1002,7 @@ void tsc_restore_sched_clock_state(void)
unsigned long flags;
int cpu;
- if (!static_branch_likely(&__use_tsc) && !sched_clock_stable())
+ if (!tsc_is_save_restore_needed())
return;
local_irq_save(flags);
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 31/47] x86/vmware: NOP-ify save/restore hooks when using VMware's sched_clock
From: Sean Christopherson @ 2026-05-29 15:07 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
NOP-ify the sched_clock save/restore hooks when using VMware's version of
sched_clock. This will allow extending paravirt_set_sched_clock() to set
the save/restore hooks, without having to simultaneously change the
behavior of VMware guests.
Note, it's not at all obvious that it's safe/correct for VMware guests to
do nothing on suspend/resume, but that's a pre-existing problem. Leave it
for a VMware expert to sort out.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/cpu/vmware.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 2d0624c66799..051ef89029a7 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -347,8 +347,11 @@ static void __init vmware_paravirt_ops_setup(void)
vmware_cyc2ns_setup();
- if (vmw_sched_clock)
+ if (vmw_sched_clock) {
paravirt_set_sched_clock(vmware_sched_clock);
+ x86_platform.save_sched_clock_state = x86_init_noop;
+ x86_platform.restore_sched_clock_state = x86_init_noop;
+ }
if (vmware_is_stealclock_available()) {
has_steal_clock = true;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 30/47] x86/xen/time: NOP-ify x86_platform's sched_clock save/restore hooks
From: Sean Christopherson @ 2026-05-29 15:07 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
NOP-ify the x86_platform sched_clock save/restore hooks when setting up
Xen's PV clock to make it somewhat obvious the hooks aren't used when
running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/xen/time.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 36d66abf5379..640b71d22d97 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -578,6 +578,12 @@ static void __init xen_init_time_common(void)
xen_sched_clock_offset = xen_clocksource_read();
static_call_update(pv_steal_clock, xen_steal_clock);
paravirt_set_sched_clock(xen_sched_clock);
+ /*
+ * Xen has paravirtualized suspend/resume and so doesn't use the common
+ * x86 sched_clock save/restore hooks.
+ */
+ x86_platform.save_sched_clock_state = x86_init_noop;
+ x86_platform.restore_sched_clock_state = x86_init_noop;
x86_init.hyper.get_tsc_khz = xen_tsc_khz;
x86_platform.get_wallclock = xen_get_wallclock;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 29/47] x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
From: Sean Christopherson @ 2026-05-29 15:07 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Move kvmclock's sched_clock save/restore helper "up" so that they can
(eventually) be referenced by kvm_sched_clock_init().
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 108 ++++++++++++++++++-------------------
1 file changed, 54 insertions(+), 54 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 4e50e75ff43d..0d4f2cf97246 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -71,6 +71,25 @@ static int kvm_set_wallclock(const struct timespec64 *now)
return -ENODEV;
}
+static void kvm_register_clock(char *txt)
+{
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
+ u64 pa;
+
+ if (!src)
+ return;
+
+ pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
+ wrmsrq(msr_kvm_system_time, pa);
+ pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
+}
+
+static void kvmclock_disable(void)
+{
+ if (msr_kvm_system_time)
+ native_write_msr(msr_kvm_system_time, 0);
+}
+
static u64 kvm_clock_read(void)
{
u64 ret;
@@ -91,6 +110,30 @@ static noinstr u64 kvm_sched_clock_read(void)
return pvclock_clocksource_read_nowd(this_cpu_pvti()) - kvm_sched_clock_offset;
}
+static void kvm_save_sched_clock_state(void)
+{
+ /*
+ * Stop host writes to kvmclock immediately prior to suspend/hibernate.
+ * If the system is hibernating, then kvmclock will likely reside at a
+ * different physical address when the system awakens, and host writes
+ * to the old address prior to reconfiguring kvmclock would clobber
+ * random memory.
+ */
+ kvmclock_disable();
+}
+
+#ifdef CONFIG_SMP
+static void kvm_setup_secondary_clock(void)
+{
+ kvm_register_clock("secondary cpu clock");
+}
+#endif
+
+static void kvm_restore_sched_clock_state(void)
+{
+ kvm_register_clock("primary cpu clock, resume");
+}
+
static inline void kvm_sched_clock_init(bool stable)
{
kvm_sched_clock_offset = kvm_clock_read();
@@ -103,6 +146,17 @@ static inline void kvm_sched_clock_init(bool stable)
sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
}
+void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
+{
+ /*
+ * Don't disable kvmclock on the BSP during suspend. If kvmclock is
+ * being used for sched_clock, then it needs to be kept alive until the
+ * last minute, and restored as quickly as possible after resume.
+ */
+ if (action != KVM_GUEST_BSP_SUSPEND)
+ kvmclock_disable();
+}
+
/*
* If we don't do that, there is the possibility that the guest
* will calibrate under heavy load - thus, getting a lower lpj -
@@ -161,60 +215,6 @@ static struct clocksource kvm_clock = {
.enable = kvm_cs_enable,
};
-static void kvm_register_clock(char *txt)
-{
- struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
- u64 pa;
-
- if (!src)
- return;
-
- pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
- wrmsrq(msr_kvm_system_time, pa);
- pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
-}
-
-static void kvmclock_disable(void)
-{
- if (msr_kvm_system_time)
- native_write_msr(msr_kvm_system_time, 0);
-}
-
-static void kvm_save_sched_clock_state(void)
-{
- /*
- * Stop host writes to kvmclock immediately prior to suspend/hibernate.
- * If the system is hibernating, then kvmclock will likely reside at a
- * different physical address when the system awakens, and host writes
- * to the old address prior to reconfiguring kvmclock would clobber
- * random memory.
- */
- kvmclock_disable();
-}
-
-static void kvm_restore_sched_clock_state(void)
-{
- kvm_register_clock("primary cpu clock, resume");
-}
-
-void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
-{
- /*
- * Don't disable kvmclock on the BSP during suspend. If kvmclock is
- * being used for sched_clock, then it needs to be kept alive until the
- * last minute, and restored as quickly as possible after resume.
- */
- if (action != KVM_GUEST_BSP_SUSPEND)
- kvmclock_disable();
-}
-
-#ifdef CONFIG_SMP
-static void kvm_setup_secondary_clock(void)
-{
- kvm_register_clock("secondary cpu clock");
-}
-#endif
-
static void __init kvmclock_init_mem(void)
{
unsigned long ncpus;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 28/47] x86/paravirt: Move handling of unstable PV clocks into paravirt_set_sched_clock()
From: Sean Christopherson @ 2026-05-29 15:07 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Move the handling of unstable PV clocks, of which kvmclock is the only
example, into paravirt_set_sched_clock(). This will allow modifying
paravirt_set_sched_clock() to keep using the TSC for sched_clock in
certain scenarios without unintentionally marking the TSC-based clock as
unstable.
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/timer.h | 7 ++++++-
arch/x86/kernel/kvmclock.c | 5 +----
arch/x86/kernel/tsc.c | 5 ++++-
3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index c71b466d6ace..fe41d40a9ae6 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -14,7 +14,12 @@ extern int no_timer_check;
extern bool using_native_sched_clock(void);
#ifdef CONFIG_PARAVIRT
-void paravirt_set_sched_clock(u64 (*func)(void));
+void __paravirt_set_sched_clock(u64 (*func)(void), bool stable);
+
+static inline void paravirt_set_sched_clock(u64 (*func)(void))
+{
+ __paravirt_set_sched_clock(func, true);
+}
#endif
/*
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 13c4be3a7f0a..4e50e75ff43d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -12,7 +12,6 @@
#include <linux/hardirq.h>
#include <linux/cpuhotplug.h>
#include <linux/sched.h>
-#include <linux/sched/clock.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/set_memory.h>
@@ -94,10 +93,8 @@ static noinstr u64 kvm_sched_clock_read(void)
static inline void kvm_sched_clock_init(bool stable)
{
- if (!stable)
- clear_sched_clock_stable();
kvm_sched_clock_offset = kvm_clock_read();
- paravirt_set_sched_clock(kvm_sched_clock_read);
+ __paravirt_set_sched_clock(kvm_sched_clock_read, stable);
pr_info("kvm-clock: using sched offset of %llu cycles",
kvm_sched_clock_offset);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 888bd1cbd9bc..a9b6d3399c23 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -280,8 +280,11 @@ bool using_native_sched_clock(void)
return static_call_query(pv_sched_clock) == native_sched_clock;
}
-void paravirt_set_sched_clock(u64 (*func)(void))
+void __paravirt_set_sched_clock(u64 (*func)(void), bool stable)
{
+ if (!stable)
+ clear_sched_clock_stable();
+
static_call_update(pv_sched_clock, func);
}
#else
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 27/47] x86/paravirt: Remove unnecessary PARAVIRT=n stub for paravirt_set_sched_clock()
From: Sean Christopherson @ 2026-05-29 15:06 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H . Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Remove the unnecessary paravirt_set_sched_clock() stub for PARAVIRT=n, as
all callers are gated by PARAVIRT=y. Eliminating the stub will avoid a
pile of pointless churn as the "real" implementation evolves.
No functional change intended.
Fixes: 39965afb1151 ("x86/paravirt: Move paravirt_sched_clock() related code into tsc.c")
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/timer.h | 3 +++
arch/x86/kernel/tsc.c | 1 -
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index fda18bcb19b4..c71b466d6ace 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -12,7 +12,10 @@ extern void recalibrate_cpu_khz(void);
extern int no_timer_check;
extern bool using_native_sched_clock(void);
+
+#ifdef CONFIG_PARAVIRT
void paravirt_set_sched_clock(u64 (*func)(void));
+#endif
/*
* We use the full linear equation: f(x) = a + b*x, in order to allow
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index bdff8c988866..888bd1cbd9bc 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -288,7 +288,6 @@ void paravirt_set_sched_clock(u64 (*func)(void))
u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
bool using_native_sched_clock(void) { return true; }
-void paravirt_set_sched_clock(u64 (*func)(void)) { }
#endif
notrace u64 sched_clock(void)
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 26/47] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Don't disable kvmclock on the BSP during syscore_suspend(), as the BSP's
clock is NOT restored during syscore_resume(), but is instead restored
earlier via the sched_clock restore callback. If suspend is aborted, e.g.
due to a late wakeup, the BSP will run without its clock enabled, which
"works" only because KVM-the-hypervisor is kind enough to not clobber the
shared memory when the clock is disabled. But over time, the BSP's view
of time will drift from APs.
Plumb in an "action" to KVM-as-a-guest and kvmclock code in preparation
for additional cleanups to kvmclock's suspend/resume logic.
Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
Cc: stable@vger.kernel.org
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_para.h | 8 +++++++-
arch/x86/kernel/kvm.c | 15 ++++++++-------
arch/x86/kernel/kvmclock.c | 31 +++++++++++++++++++++++++------
3 files changed, 40 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 4a49fc286b4c..08686ff19caa 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,8 +118,14 @@ static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
}
#ifdef CONFIG_KVM_GUEST
+enum kvm_guest_cpu_action {
+ KVM_GUEST_BSP_SUSPEND,
+ KVM_GUEST_AP_OFFLINE,
+ KVM_GUEST_SHUTDOWN,
+};
+
void kvmclock_init(bool prefer_tsc);
-void kvmclock_disable(void);
+void kvmclock_cpu_action(enum kvm_guest_cpu_action action);
bool kvm_para_available(void);
unsigned int kvm_arch_para_features(void);
unsigned int kvm_arch_para_hints(void);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c81a24d0efdf..fd1c417b4f9b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -460,7 +460,7 @@ static void __init sev_map_percpu_data(void)
}
}
-static void kvm_guest_cpu_offline(bool shutdown)
+static void kvm_guest_cpu_offline(enum kvm_guest_cpu_action action)
{
kvm_disable_steal_time();
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
@@ -468,9 +468,10 @@ static void kvm_guest_cpu_offline(bool shutdown)
if (kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
wrmsrq(MSR_KVM_MIGRATION_CONTROL, 0);
kvm_pv_disable_apf();
- if (!shutdown)
+ if (action != KVM_GUEST_SHUTDOWN)
apf_task_wake_all();
- kvmclock_disable();
+
+ kvmclock_cpu_action(action);
}
static int kvm_cpu_online(unsigned int cpu)
@@ -726,7 +727,7 @@ static int kvm_cpu_down_prepare(unsigned int cpu)
unsigned long flags;
local_irq_save(flags);
- kvm_guest_cpu_offline(false);
+ kvm_guest_cpu_offline(KVM_GUEST_AP_OFFLINE);
local_irq_restore(flags);
return 0;
}
@@ -737,7 +738,7 @@ static int kvm_suspend(void *data)
{
u64 val = 0;
- kvm_guest_cpu_offline(false);
+ kvm_guest_cpu_offline(KVM_GUEST_BSP_SUSPEND);
#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
@@ -768,7 +769,7 @@ static struct syscore kvm_syscore = {
static void kvm_pv_guest_cpu_reboot(void *unused)
{
- kvm_guest_cpu_offline(true);
+ kvm_guest_cpu_offline(KVM_GUEST_SHUTDOWN);
}
static int kvm_pv_reboot_notify(struct notifier_block *nb,
@@ -792,7 +793,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
#ifdef CONFIG_CRASH_DUMP
static void kvm_crash_shutdown(struct pt_regs *regs)
{
- kvm_guest_cpu_offline(true);
+ kvm_guest_cpu_offline(KVM_GUEST_SHUTDOWN);
native_machine_crash_shutdown(regs);
}
#endif
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 13c728444e12..13c4be3a7f0a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -177,8 +177,22 @@ static void kvm_register_clock(char *txt)
pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
}
+static void kvmclock_disable(void)
+{
+ if (msr_kvm_system_time)
+ native_write_msr(msr_kvm_system_time, 0);
+}
+
static void kvm_save_sched_clock_state(void)
{
+ /*
+ * Stop host writes to kvmclock immediately prior to suspend/hibernate.
+ * If the system is hibernating, then kvmclock will likely reside at a
+ * different physical address when the system awakens, and host writes
+ * to the old address prior to reconfiguring kvmclock would clobber
+ * random memory.
+ */
+ kvmclock_disable();
}
static void kvm_restore_sched_clock_state(void)
@@ -186,6 +200,17 @@ static void kvm_restore_sched_clock_state(void)
kvm_register_clock("primary cpu clock, resume");
}
+void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
+{
+ /*
+ * Don't disable kvmclock on the BSP during suspend. If kvmclock is
+ * being used for sched_clock, then it needs to be kept alive until the
+ * last minute, and restored as quickly as possible after resume.
+ */
+ if (action != KVM_GUEST_BSP_SUSPEND)
+ kvmclock_disable();
+}
+
#ifdef CONFIG_SMP
static void kvm_setup_secondary_clock(void)
{
@@ -193,12 +218,6 @@ static void kvm_setup_secondary_clock(void)
}
#endif
-void kvmclock_disable(void)
-{
- if (msr_kvm_system_time)
- native_write_msr(msr_kvm_system_time, 0);
-}
-
static void __init kvmclock_init_mem(void)
{
unsigned long ncpus;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 25/47] x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Gate kvmclock's secondary CPU code on CONFIG_SMP, not CONFIG_X86_LOCAL_APIC.
Originally, kvmclock piggybacked PV APIC ops to setup secondary CPUs.
When that wart was fixed by commit df156f90a0f9 ("x86: Introduce
x86_cpuinit.early_percpu_clock_init hook"), the dependency on a local APIC
got carried forward unnecessarily.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 69a15fbfb779..13c728444e12 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -186,7 +186,7 @@ static void kvm_restore_sched_clock_state(void)
kvm_register_clock("primary cpu clock, resume");
}
-#ifdef CONFIG_X86_LOCAL_APIC
+#ifdef CONFIG_SMP
static void kvm_setup_secondary_clock(void)
{
kvm_register_clock("secondary cpu clock");
@@ -326,7 +326,7 @@ void __init kvmclock_init(bool prefer_tsc)
x86_init.hyper.get_cpu_khz = kvmclock_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
-#ifdef CONFIG_X86_LOCAL_APIC
+#ifdef CONFIG_SMP
x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 24/47] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Now that Hyper-V overrides the sched_clock save/restore hooks if and only
sched_clock itself is set to the Hyper-V reference counter, drop the
invocation of the "old" save/restore callbacks. When the registration of
the PV sched_clock was done separately from overriding the save/restore
hooks, it was possible for Hyper-V to clobber the TSC save/restore
callbacks without actually switching to the Hyper-V refcounter.
Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
revert to using TSC for sched_clock, and so there is no need to invoke the
TSC save/restore hooks (and if there was, it belongs in common PV code).
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
drivers/clocksource/hyperv_timer.c | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index 69c1c7264e5d..ac1d9f9c381c 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -527,9 +527,6 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
#include <asm/timer.h>
static u64 hv_ref_counter_at_suspend;
-static void (*old_save_sched_clock_state)(void);
-static void (*old_restore_sched_clock_state)(void);
-
/*
* Hyper-V clock counter resets during hibernation. Save and restore clock
* offset during suspend/resume, while also considering the time passed
@@ -539,8 +536,6 @@ static void (*old_restore_sched_clock_state)(void);
*/
static void hv_save_sched_clock_state(void)
{
- old_save_sched_clock_state();
-
hv_ref_counter_at_suspend = hv_read_reference_counter();
}
@@ -553,8 +548,6 @@ static void hv_restore_sched_clock_state(void)
* - reference counter (time) now.
*/
hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
-
- old_restore_sched_clock_state();
}
static __always_inline void hv_setup_sched_clock(void *sched_clock)
@@ -562,10 +555,7 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
/* We're on x86/x64 *and* using PV ops */
paravirt_set_sched_clock(sched_clock);
- old_save_sched_clock_state = x86_platform.save_sched_clock_state;
x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
-
- old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 23/47] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Now that all of the Hyper-V reference counter sched_clock code is located
in a single file, drop the superfluous wrappers for the save/restore flows.
No functional change intended.
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
drivers/clocksource/hyperv_timer.c | 34 +++++-------------------------
include/clocksource/hyperv_timer.h | 2 --
2 files changed, 5 insertions(+), 31 deletions(-)
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index 72b966340a46..69c1c7264e5d 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -472,17 +472,6 @@ static void resume_hv_clock_tsc(struct clocksource *arg)
hv_set_msr(HV_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
}
-/*
- * Called during resume from hibernation, from overridden
- * x86_platform.restore_sched_clock_state routine. This is to adjust offsets
- * used to calculate time for hv tsc page based sched_clock, to account for
- * time spent before hibernation.
- */
-void hv_adj_sched_clock_offset(u64 offset)
-{
- hv_sched_clock_offset -= offset;
-}
-
#ifdef HAVE_VDSO_CLOCKMODE_HVCLOCK
static int hv_cs_enable(struct clocksource *cs)
{
@@ -548,12 +537,14 @@ static void (*old_restore_sched_clock_state)(void);
* based clocksource, proceeds from where it left off during suspend and
* it shows correct time for the timestamps of kernel messages after resume.
*/
-static void save_hv_clock_tsc_state(void)
+static void hv_save_sched_clock_state(void)
{
+ old_save_sched_clock_state();
+
hv_ref_counter_at_suspend = hv_read_reference_counter();
}
-static void restore_hv_clock_tsc_state(void)
+static void hv_restore_sched_clock_state(void)
{
/*
* Adjust the offsets used by hv tsc clocksource to
@@ -561,23 +552,8 @@ static void restore_hv_clock_tsc_state(void)
* adjusted value = reference counter (time) at suspend
* - reference counter (time) now.
*/
- hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
-}
-/*
- * Functions to override save_sched_clock_state and restore_sched_clock_state
- * functions of x86_platform. The Hyper-V clock counter is reset during
- * suspend-resume and the offset used to measure time needs to be
- * corrected, post resume.
- */
-static void hv_save_sched_clock_state(void)
-{
- old_save_sched_clock_state();
- save_hv_clock_tsc_state();
-}
+ hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
-static void hv_restore_sched_clock_state(void)
-{
- restore_hv_clock_tsc_state();
old_restore_sched_clock_state();
}
diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
index d48dd4176fd3..a4c81a60f53d 100644
--- a/include/clocksource/hyperv_timer.h
+++ b/include/clocksource/hyperv_timer.h
@@ -38,8 +38,6 @@ extern void hv_remap_tsc_clocksource(void);
extern unsigned long hv_get_tsc_pfn(void);
extern struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
-extern void hv_adj_sched_clock_offset(u64 offset);
-
static __always_inline bool
hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
u64 *cur_tsc, u64 *time)
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 22/47] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Register the Hyper-V reference counter (refcounter) callbacks for saving
and restoring its PV sched_clock, if and only if the refcounter is
actually being used for sched_clock. Currently, Hyper-V overrides the
save/restore hooks if the reference TSC available, whereas the Hyper-V
refcounter code only overrides sched_clock if the reference TSC is
available *and* it's not invariant. The flaw is effectively papered over
by invoking the "old" save/restore callbacks as part of save/restore, but
that's unnecessary and fragile.
To avoid introducing more complexity, and to allow for additional cleanups
of the PV sched_clock code, move the save/restore hooks and logic into
hyperv_timer.c and simply wire up the hooks when overriding sched_clock
itself.
Note, while the Hyper-V refcounter code is intended to be architecture
neutral, CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of
x86 specific code (which will be reduced in future cleanups) doesn't
meaningfully pollute generic code.
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/cpu/mshyperv.c | 58 ------------------------------
drivers/clocksource/hyperv_timer.c | 50 ++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 58 deletions(-)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f8653fc05a40..2403231fd4b0 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -275,63 +275,6 @@ static void hv_guest_crash_shutdown(struct pt_regs *regs)
}
#endif /* CONFIG_CRASH_DUMP */
-static u64 hv_ref_counter_at_suspend;
-static void (*old_save_sched_clock_state)(void);
-static void (*old_restore_sched_clock_state)(void);
-
-/*
- * Hyper-V clock counter resets during hibernation. Save and restore clock
- * offset during suspend/resume, while also considering the time passed
- * before suspend. This is to make sure that sched_clock using hv tsc page
- * based clocksource, proceeds from where it left off during suspend and
- * it shows correct time for the timestamps of kernel messages after resume.
- */
-static void save_hv_clock_tsc_state(void)
-{
- hv_ref_counter_at_suspend = hv_read_reference_counter();
-}
-
-static void restore_hv_clock_tsc_state(void)
-{
- /*
- * Adjust the offsets used by hv tsc clocksource to
- * account for the time spent before hibernation.
- * adjusted value = reference counter (time) at suspend
- * - reference counter (time) now.
- */
- hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
-}
-
-/*
- * Functions to override save_sched_clock_state and restore_sched_clock_state
- * functions of x86_platform. The Hyper-V clock counter is reset during
- * suspend-resume and the offset used to measure time needs to be
- * corrected, post resume.
- */
-static void hv_save_sched_clock_state(void)
-{
- old_save_sched_clock_state();
- save_hv_clock_tsc_state();
-}
-
-static void hv_restore_sched_clock_state(void)
-{
- restore_hv_clock_tsc_state();
- old_restore_sched_clock_state();
-}
-
-static void __init x86_setup_ops_for_tsc_pg_clock(void)
-{
- if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
- return;
-
- old_save_sched_clock_state = x86_platform.save_sched_clock_state;
- x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
-
- old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
- x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
-}
-
#ifdef CONFIG_X86_64
DEFINE_STATIC_CALL(hv_hypercall, hv_std_hypercall);
EXPORT_STATIC_CALL_TRAMP_GPL(hv_hypercall);
@@ -739,7 +682,6 @@ static void __init ms_hyperv_init_platform(void)
/* Register Hyper-V specific clocksource */
hv_init_clocksource();
- x86_setup_ops_for_tsc_pg_clock();
hv_vtl_init_platform();
#endif
/*
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index e9f5034a1bc8..72b966340a46 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -537,10 +537,60 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
#elif defined CONFIG_PARAVIRT
#include <asm/timer.h>
+static u64 hv_ref_counter_at_suspend;
+static void (*old_save_sched_clock_state)(void);
+static void (*old_restore_sched_clock_state)(void);
+
+/*
+ * Hyper-V clock counter resets during hibernation. Save and restore clock
+ * offset during suspend/resume, while also considering the time passed
+ * before suspend. This is to make sure that sched_clock using hv tsc page
+ * based clocksource, proceeds from where it left off during suspend and
+ * it shows correct time for the timestamps of kernel messages after resume.
+ */
+static void save_hv_clock_tsc_state(void)
+{
+ hv_ref_counter_at_suspend = hv_read_reference_counter();
+}
+
+static void restore_hv_clock_tsc_state(void)
+{
+ /*
+ * Adjust the offsets used by hv tsc clocksource to
+ * account for the time spent before hibernation.
+ * adjusted value = reference counter (time) at suspend
+ * - reference counter (time) now.
+ */
+ hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
+}
+/*
+ * Functions to override save_sched_clock_state and restore_sched_clock_state
+ * functions of x86_platform. The Hyper-V clock counter is reset during
+ * suspend-resume and the offset used to measure time needs to be
+ * corrected, post resume.
+ */
+static void hv_save_sched_clock_state(void)
+{
+ old_save_sched_clock_state();
+ save_hv_clock_tsc_state();
+}
+
+static void hv_restore_sched_clock_state(void)
+{
+ restore_hv_clock_tsc_state();
+ old_restore_sched_clock_state();
+}
+
static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
/* We're on x86/x64 *and* using PV ops */
paravirt_set_sched_clock(sched_clock);
+
+ old_save_sched_clock_state = x86_platform.save_sched_clock_state;
+ x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
+
+ old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
+ x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 21/47] x86/xen: Obtain TSC frequency from CPUID if present
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
From: David Woodhouse <dwmw@amazon.co.uk>
The Xen CPUID leaf 3, sub-leaf 0, ECX provides the guest TSC frequency
in kHz directly. Use it when available instead of reverse-calculating
the frequency from the pvclock tsc_to_system_mul and tsc_shift values,
which loses precision.
This mirrors the equivalent change for KVM guests using the generic
0x40000010 timing leaf.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
[sean: drop non-Xen changes]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/xen/time.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 487ad838c441..36d66abf5379 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -42,6 +42,17 @@ static unsigned int __init xen_tsc_khz(void)
{
struct pvclock_vcpu_time_info *info =
&HYPERVISOR_shared_info->vcpu_info[0].time;
+ u32 base = xen_cpuid_base();
+ u32 eax, ebx, ecx, edx;
+
+ /*
+ * If Xen provides the guest TSC frequency directly in CPUID
+ * (leaf 3, sub-leaf 0, ECX), use that instead of reverse-
+ * calculating from the pvclock mul/shift.
+ */
+ cpuid_count(base + 3, 0, &eax, &ebx, &ecx, &edx);
+ if (ecx)
+ return ecx;
return pvclock_tsc_khz(info);
}
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 20/47] x86/kvm: Get CPU base frequency from CPUID when it's available
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
If CPUID.0x16 is present and valid, use the CPU frequency provided by
CPUID instead of assuming that the virtual CPU runs at the same
frequency as TSC and/or kvmclock. Back before constant TSCs were a
thing, treating the TSC and CPU frequencies as one and the same was
somewhat reasonable, but now it's nonsensical, especially if the
hypervisor explicitly enumerates the CPU frequency.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvm.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1139182121d..c81a24d0efdf 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -50,6 +50,7 @@
#include <asm/e820/api.h>
static unsigned int kvm_tsc_khz_cpuid __initdata;
+static unsigned int kvm_cpu_khz_cpuid __initdata;
DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
@@ -928,6 +929,11 @@ static unsigned int __init kvm_get_tsc_khz(void)
return kvm_tsc_khz_cpuid;
}
+static unsigned int __init kvm_get_cpu_khz(void)
+{
+ return kvm_cpu_khz_cpuid;
+}
+
unsigned int kvm_arch_para_features(void)
{
return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
@@ -1049,6 +1055,14 @@ static void __init kvm_init_platform(void)
#endif
}
+ /*
+ * Prefer CPUID.0x16 over KVM's PV CPUID when possible, as the base CPU
+ * frequency isn't necessarily the same as the TSC frequency.
+ */
+ kvm_cpu_khz_cpuid = __cpu_khz_from_cpuid();
+ if (kvm_cpu_khz_cpuid)
+ x86_init.hyper.get_cpu_khz = kvm_get_cpu_khz;
+
/*
* If the TSC counts at a constant frequency across P/T states, counts
* in deep C-states, and the TSC hasn't been marked unstable, treat the
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 19/47] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Extract the guts of cpu_khz_from_cpuid() to a standalone helper that
doesn't restrict the usage to Intel CPUs. This will allow sharing the
core logic with KVM-as-a-guest, as KVM generally doesn't restrict CPUID
based on vendor.
No functional change intended.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 32 +++++++++++++++-----------------
2 files changed, 16 insertions(+), 17 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 4a224f99c3b9..7ff2bfdcdf38 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -90,6 +90,7 @@ struct cpuid_tsc_info {
unsigned int tsc_khz;
};
extern int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
+extern unsigned int __cpu_khz_from_cpuid(void);
extern void tsc_early_init(void);
extern void tsc_init(void);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 3e911f0f7364..bdff8c988866 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -692,6 +692,18 @@ int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
return 0;
}
+unsigned int __cpu_khz_from_cpuid(void)
+{
+ unsigned int eax_base_mhz, ebx, ecx, edx;
+
+ if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
+ return 0;
+
+ cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
+
+ return eax_base_mhz * 1000;
+}
+
/**
* native_calibrate_tsc - determine TSC frequency
* Determine TSC frequency via CPUID, else return 0.
@@ -727,13 +739,8 @@ static unsigned long native_calibrate_tsc(void)
* clock, but we can easily calculate it to a high degree of accuracy
* by considering the crystal ratio and the CPU speed.
*/
- if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
- unsigned int eax_base_mhz, ebx, ecx, edx;
-
- cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
- info.crystal_khz = eax_base_mhz * 1000 *
- info.denominator / info.numerator;
- }
+ if (!info.crystal_khz)
+ info.crystal_khz = __cpu_khz_from_cpuid() * info.denominator / info.numerator;
if (!info.crystal_khz)
return 0;
@@ -760,19 +767,10 @@ static unsigned long native_calibrate_tsc(void)
static unsigned long cpu_khz_from_cpuid(void)
{
- unsigned int eax_base_mhz, ebx_max_mhz, ecx_bus_mhz, edx;
-
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
return 0;
- if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
- return 0;
-
- eax_base_mhz = ebx_max_mhz = ecx_bus_mhz = edx = 0;
-
- cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx_max_mhz, &ecx_bus_mhz, &edx);
-
- return eax_base_mhz * 1000;
+ return __cpu_khz_from_cpuid();
}
/*
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 18/47] x86/kvm: Get local APIC bus frequency from PV CPUID Timing Info
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
When running as a KVM guest with PV timing info provided by the host,
stuff the APIC timer period/frequency with the local APIC bus frequency
reported in CPUID.0x40000010.EBX instead of trying to calibrate/guess the
frequency.
Note, the unit of measurement for lapic_timer_period is "ticks per HZ", not
Khz.
See Documentation/virt/kvm/x86/cpuid.rst for details.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvm.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4fe9c69bf40b..c1139182121d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -977,6 +977,7 @@ static void __init kvm_init_platform(void)
.mask_lo = (u32)(~(SZ_4G - tolud - 1)) | MTRR_PHYSMASK_V,
.mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
};
+ u32 apic_khz __maybe_unused;
u32 timing_info_leaf;
bool tsc_is_reliable;
@@ -1039,6 +1040,13 @@ static void __init kvm_init_platform(void)
x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ /* The leaf also includes the local APIC bus/timer frequency.*/
+ apic_khz = cpuid_ebx(timing_info_leaf);
+ if (apic_khz)
+ lapic_timer_period = apic_khz * 1000 / HZ;
+#endif
}
/*
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 17/47] x86/kvm: Mark TSC as reliable when it's constant and nonstop
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Mark the TSC as reliable if the hypervisor (KVM) has enumerated the TSC
as constant and nonstop, and the admin hasn't explicitly marked the TSC
as unstable. Like most (all?) virtualization setups, any secondary
clocksource that's used as a watchdog is guaranteed to be less reliable
than a constant, nonstop TSC, as all clocksources the kernel uses as a
watchdog are all but guaranteed to be emulated when running as a KVM
guest. I.e. any observed discrepancies between the TSC and watchdog will
be due to jitter in the watchdog.
This is especially true for KVM, as the watchdog clocksource is usually
emulated in host userspace, i.e. reading the clock incurs a roundtrip
cost of thousands of cycles.
Marking the TSC reliable addresses a flaw where the TSC will occasionally
be marked unstable if the host is under moderate/heavy load.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_para.h | 2 +-
arch/x86/kernel/kvm.c | 16 +++++++++++++++-
arch/x86/kernel/kvmclock.c | 15 +++++----------
3 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 4a47c16e2df8..4a49fc286b4c 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,7 +118,7 @@ static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
}
#ifdef CONFIG_KVM_GUEST
-void kvmclock_init(void);
+void kvmclock_init(bool prefer_tsc);
void kvmclock_disable(void);
bool kvm_para_available(void);
unsigned int kvm_arch_para_features(void);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 909d3e5e5bcd..4fe9c69bf40b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -978,6 +978,7 @@ static void __init kvm_init_platform(void)
.mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
};
u32 timing_info_leaf;
+ bool tsc_is_reliable;
if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
@@ -1040,7 +1041,20 @@ static void __init kvm_init_platform(void)
}
}
- kvmclock_init();
+ /*
+ * If the TSC counts at a constant frequency across P/T states, counts
+ * in deep C-states, and the TSC hasn't been marked unstable, treat the
+ * TSC reliable, as guaranteed by KVM. Note, the TSC unstable check
+ * exists purely to honor the TSC being marked unstable via command
+ * line, any runtime detection of an unstable will happen after this.
+ */
+ tsc_is_reliable = boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+ !check_tsc_unstable();
+ if (tsc_is_reliable)
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
+ kvmclock_init(tsc_is_reliable);
x86_platform.apic_post_init = kvm_apic_init;
/*
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 404f60741aa8..69a15fbfb779 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -285,7 +285,7 @@ static int kvmclock_setup_percpu(unsigned int cpu)
return p ? 0 : -ENOMEM;
}
-void __init kvmclock_init(void)
+void __init kvmclock_init(bool prefer_tsc)
{
u8 flags;
@@ -334,16 +334,11 @@ void __init kvmclock_init(void)
kvm_get_preset_lpj();
/*
- * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
- * with P/T states and does not stop in deep C-states.
- *
- * Invariant TSC exposed by host means kvmclock is not necessary:
- * can use TSC as clocksource.
- *
+ * If TSC is preferred over kvmlock, drop kvmclock's rating so that TSC
+ * is chosen as the clocksource (but still register kvmclock in case
+ * the kernel doesn't want to use TSC for whatever reason).
*/
- if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
- boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
- !check_tsc_unstable())
+ if (prefer_tsc)
kvm_clock.rating = 299;
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 16/47] x86/kvm: Obtain TSC frequency from PV CPUID if present
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
From: David Woodhouse <dwmw@amazon.co.uk>
In https://lkml.org/lkml/2008/10/1/246 a proposal was made for generic
CPUID conventions across hypervisors. It was mostly shot down in flames,
but the leaf at 0x40000010 containing timing information didn't die.
It's used by XNU and FreeBSD guests under all hypervisors¹² to determine
the TSC frequency, and also exposed by the EC2 Nitro hypervisor (as
well as, presumably, VMware). FreeBSD's Bhyve is probably just about
to start exposing it too.
Use it under KVM to obtain the TSC frequency more accurately, instead of
reverse-calculating the frequency from the mul/shift values in the KVM
clock. Use the information to get the CPU frequency as well (kvmclock
feeds in kvm_get_tsc_khz() for both TSC and CPU calibration), as the info
from CPUID is superior in every way; whether or not kvmclock should be
overriding CPU calibration in the first place is an entirely different
question.
Use the info from CPUID even if the user explicitly disables kvmclock, or
if it's unsupported. The PV CPUID leaf has no dependency on kvmclock, and
is in fact more useful if kvmclock is disabled since the kernel won't be
able to use kvmclock to derive a derive the TSC frequency.
Before:
[ 0.000020] tsc: Detected 2900.014 MHz processor
After:
[ 0.000020] tsc: Detected 2900.015 MHz processor
$ cpuid -1 -l 0x40000010
CPU:
hypervisor generic timing information (0x40000010):
TSC frequency (Hz) = 2900015
bus frequency (Hz) = 1000000
Note! *Independently* query for non-null get_{cpu,tsc}_khz() overrides so
that kvmclock doesn't clobber x86_init.hyper.get_cpu_khz() if/when KVM adds
support for getting the CPU frequency separately from the TSC frequency.
¹ https://github.com/apple/darwin-xnu/blob/main/osfmk/i386/cpuid.c
² https://github.com/freebsd/freebsd-src/commit/4a432614f68
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvm.c | 33 +++++++++++++++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 6 ++++--
2 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index dcef84da304b..909d3e5e5bcd 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -49,6 +49,8 @@
#include <asm/svm.h>
#include <asm/e820/api.h>
+static unsigned int kvm_tsc_khz_cpuid __initdata;
+
DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
static int kvmapf = 1;
@@ -911,6 +913,21 @@ bool kvm_para_available(void)
}
EXPORT_SYMBOL_GPL(kvm_para_available);
+static u32 __init kvm_cpuid_timing_info_leaf(void)
+{
+ u32 base = kvm_cpuid_base();
+
+ if (!base || cpuid_eax(base) < (base | KVM_CPUID_TIMING_INFO))
+ return 0;
+
+ return base | KVM_CPUID_TIMING_INFO;
+}
+
+static unsigned int __init kvm_get_tsc_khz(void)
+{
+ return kvm_tsc_khz_cpuid;
+}
+
unsigned int kvm_arch_para_features(void)
{
return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
@@ -960,6 +977,7 @@ static void __init kvm_init_platform(void)
.mask_lo = (u32)(~(SZ_4G - tolud - 1)) | MTRR_PHYSMASK_V,
.mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
};
+ u32 timing_info_leaf;
if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
@@ -1007,6 +1025,21 @@ static void __init kvm_init_platform(void)
wrmsrq(MSR_KVM_MIGRATION_CONTROL,
KVM_MIGRATION_READY);
}
+
+ /*
+ * If KVM advertises the frequency directly in CPUID, use that instead
+ * of reverse-calculating it from the KVM clock data, or worse, trying
+ * to calibratate the TSC using an emulated device.
+ */
+ timing_info_leaf = kvm_cpuid_timing_info_leaf();
+ if (timing_info_leaf) {
+ kvm_tsc_khz_cpuid = cpuid_eax(timing_info_leaf);
+ if (kvm_tsc_khz_cpuid) {
+ x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
+ x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
+ }
+ }
+
kvmclock_init();
x86_platform.apic_post_init = kvm_apic_init;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index c4a782a0c903..404f60741aa8 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -320,8 +320,10 @@ void __init kvmclock_init(void)
flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
- x86_init.hyper.get_tsc_khz = kvmclock_get_tsc_khz;
- x86_init.hyper.get_cpu_khz = kvmclock_get_tsc_khz;
+ if (!x86_init.hyper.get_tsc_khz)
+ x86_init.hyper.get_tsc_khz = kvmclock_get_tsc_khz;
+ if (!x86_init.hyper.get_cpu_khz)
+ x86_init.hyper.get_cpu_khz = kvmclock_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 15/47] KVM: x86: Officially define CPUID 0x40000010 as PV Timing Info (TSC and Bus)
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
From: David Woodhouse <dwmw@amazon.co.uk>
Formally define and document CPUID 0x40000010 as providing TSC and local
APIC bus frequency information for KVM's PV CPUID range. Way back in
2008, VMware proposed (https://lkml.org/lkml/2008/10/1/246) carving out a
range of CPUID leaves for use by hypervisors. While the broader proposal
from VMware was mostly shot down in flames, use of CPUID 0x40000010 to
provide TSC and local APIC bus frequency information survived and made it's
way into multiple guest operating systems.
XNU unconditionally assumes CPUID 0x40000010 contains the frequency
information, if it's present on any hypervisor:
https://github.com/apple/darwin-xnu/blob/main/osfmk/i386/cpuid.c
As does FreeBSD:
https://github.com/freebsd/freebsd-src/commit/4a432614f68
More importantly, QEMU (the de facto "reference" VMM for KVM) has
conditionally provided timing information in CPUID 0x40000010 for almost
9 years, since commit 9954a1582e ("x86-KVM: Supply TSC and APIC clock
rates to guest like VMWare").
So at this point it would be daft for KVM (or any hypervisor) to expose
0x40000010 for any *other* content. Officially carve out and define the
CPUID leaf so that Linux-as-a-guest can follow suit and pull TSC and Local
APIC Bus frequency information from CPUID.
Defer providing userspace with the necessary information needed to
precisely and accurately enumerate the _actual_ configured TSC frequency
to the guest (that exact information, along with the scaled ratio, isn't
exposed to userspace). As evidenced by QEMU, providing CPUID 0x40000010
without help from KVM is entirely possible, just not ideal.
Link: https://lore.kernel.org/all/ea0d7f43d910cee9600b254e303f468722fa355b.camel@infradead.org
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
[sean: drop KVM filling of CPUID, add documentation, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
Documentation/virt/kvm/x86/cpuid.rst | 12 ++++++++++++
arch/x86/include/uapi/asm/kvm_para.h | 11 +++++++++++
2 files changed, 23 insertions(+)
diff --git a/Documentation/virt/kvm/x86/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst
index bda3e3e737d7..f02e395cfa9b 100644
--- a/Documentation/virt/kvm/x86/cpuid.rst
+++ b/Documentation/virt/kvm/x86/cpuid.rst
@@ -122,3 +122,15 @@ KVM_HINTS_REALTIME 0 guest checks this feature bit to
preempted for an unlimited time
allowing optimizations
================== ============ =================================
+
+function: KVM_CPUID_TIMING_INFO (0x40000010)
+
+returns::
+
+ eax = (Virtual) TSC frequency in kHz
+ ebx = (Virtual) Bus (local APIC timer) frequency in kHz
+ ecx = 0 (Reserved)
+ edx = 0 (Reserved)
+
+Note, KVM only defines the semantics of KVM_CPUID_TIMING_INFO; KVM does NOT
+advertise support via KVM_GET_SUPPORTED_CPUID.
\ No newline at end of file
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..c3a384711f3a 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -44,6 +44,17 @@
*/
#define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24
+/*
+ * The timing information leaf provides TSC and local APIC timer frequency
+ * information to the guest. Note, userspace is responsible for filling the
+ * leaf with the correct information.
+ *
+ * # EAX: (Virtual) TSC frequency in kHz.
+ * # EBX: (Virtual) Bus (local APIC timer) frequency in kHz.
+ * # ECX, EDX: Reserved (must be zero).
+ */
+#define KVM_CPUID_TIMING_INFO 0x40000010
+
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 14/47] x86/kvmclock: Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz()
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz() in anticipation of
adding support for getting TSC info from PV CPUID, i.e. in a KVM specific
way, but without non-kvmclock.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 69752b170e0a..c4a782a0c903 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -115,7 +115,7 @@ static inline void kvm_sched_clock_init(bool stable)
* poll of guests can be running and trouble each other. So we preset
* lpj here
*/
-static unsigned int __init kvm_get_tsc_khz(void)
+static unsigned int __init kvmclock_get_tsc_khz(void)
{
return pvclock_tsc_khz(this_cpu_pvti());
}
@@ -125,7 +125,7 @@ static void __init kvm_get_preset_lpj(void)
unsigned long khz;
u64 lpj;
- khz = kvm_get_tsc_khz();
+ khz = kvmclock_get_tsc_khz();
lpj = ((u64)khz * 1000);
do_div(lpj, HZ);
@@ -320,8 +320,8 @@ void __init kvmclock_init(void)
flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
- x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
- x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
+ x86_init.hyper.get_tsc_khz = kvmclock_get_tsc_khz;
+ x86_init.hyper.get_cpu_khz = kvmclock_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 13/47] x86/tsc: Fold native_calibrate_cpu() into recalibrate_cpu_khz()
From: Sean Christopherson @ 2026-05-29 14:44 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Fold the guts of native_calibrate_cpu() into its sole remaining caller,
recalibrate_cpu_khz() to eliminate the extra SMP=n #ifdef, and so that it's
more obvious that directly invoking the early vs. late calibration routines
in determine_cpu_tsc_frequencies() is intentional.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/tsc.c | 20 ++++----------------
1 file changed, 4 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 534462c81c78..3e911f0f7364 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -945,21 +945,6 @@ static unsigned long native_calibrate_cpu_early(void)
return fast_calibrate;
}
-#ifndef CONFIG_SMP
-/**
- * native_calibrate_cpu - calibrate the cpu
- */
-static unsigned long native_calibrate_cpu(void)
-{
- unsigned long tsc_freq = native_calibrate_cpu_early();
-
- if (!tsc_freq)
- tsc_freq = native_calibrate_cpu_late();
-
- return tsc_freq;
-}
-#endif
-
void recalibrate_cpu_khz(void)
{
#ifndef CONFIG_SMP
@@ -968,7 +953,10 @@ void recalibrate_cpu_khz(void)
if (!boot_cpu_has(X86_FEATURE_TSC))
return;
- cpu_khz = native_calibrate_cpu();
+ cpu_khz = native_calibrate_cpu_early();
+ if (!cpu_khz)
+ cpu_khz = native_calibrate_cpu_late();
+
if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ))
tsc_khz = native_calibrate_tsc();
if (tsc_khz == 0)
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
* [PATCH v4 12/47] x86/tsc: Rename pit_hpet_ptimer_calibrate_cpu() => native_calibrate_cpu_late()
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>
Rename the late CPU calibration routine so that its relationship to the
early routine is more obvious and intuitive.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/tsc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 5b4b6e43c94c..534462c81c78 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -779,7 +779,7 @@ static unsigned long cpu_khz_from_cpuid(void)
* calibrate cpu using pit, hpet, and ptimer methods. They are available
* later in boot after acpi is initialized.
*/
-static unsigned long pit_hpet_ptimer_calibrate_cpu(void)
+static unsigned long native_calibrate_cpu_late(void)
{
u64 tsc1, tsc2, delta, ref1, ref2;
unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
@@ -954,7 +954,7 @@ static unsigned long native_calibrate_cpu(void)
unsigned long tsc_freq = native_calibrate_cpu_early();
if (!tsc_freq)
- tsc_freq = pit_hpet_ptimer_calibrate_cpu();
+ tsc_freq = native_calibrate_cpu_late();
return tsc_freq;
}
@@ -1497,7 +1497,7 @@ static bool __init determine_cpu_tsc_frequencies(bool early,
else
tsc_khz = native_calibrate_tsc();
} else {
- cpu_khz = pit_hpet_ptimer_calibrate_cpu();
+ cpu_khz = native_calibrate_cpu_late();
}
/*
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox