* [PATCH v2 01/38] x86/tsc: Add standalone helpers for getting TSC info from CPUID.0x15
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 02/38] x86/tsc: Add standalone helper for getting CPU frequency from CPUID Sean Christopherson
` (37 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Extract retrieval of TSC frequency information from CPUID into standalone
helpers so that TDX guest support and kvmclock can reuse the logic. Provide
a version that includes the multiplier math, as TDX in particular does NOT
want to use native_calibrate_tsc()'s fallback logic that derives the TSC
frequency based on CPUID.0x16 when the core crystal frequency isn't known.
Opportunistically drop native_calibrate_tsc()'s "== 0" and "!= 0" checks
in favor of the kernel's preferred style.
No functional change intended.
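As a minimal sketch (hypothetical caller and function name, not part of this
patch), the new helpers are expected to be consumed like so:

	#include <asm/tsc.h>

	static unsigned long example_get_tsc_khz(void)
	{
		struct cpuid_tsc_info info;

		/*
		 * Fails if CPUID.0x15 is unsupported or doesn't enumerate the
		 * core crystal frequency; unlike native_calibrate_tsc(), no
		 * CPUID.0x16 fallback is attempted.
		 */
		if (cpuid_get_tsc_freq(&info))
			return 0;

		return info.tsc_khz;
	}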
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/tsc.h | 9 +++++
arch/x86/kernel/tsc.c | 67 +++++++++++++++++++++++++-------------
2 files changed, 53 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94408a784c8e..a4d84f721775 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -28,6 +28,15 @@ static inline cycles_t get_cycles(void)
}
#define get_cycles get_cycles
+struct cpuid_tsc_info {
+ unsigned int denominator;
+ unsigned int numerator;
+ unsigned int crystal_khz;
+ unsigned int tsc_khz;
+};
+extern int cpuid_get_tsc_info(struct cpuid_tsc_info *info);
+extern int cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
+
extern void tsc_early_init(void);
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 34dec0b72ea8..93713eb81f52 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -655,46 +655,67 @@ static unsigned long quick_pit_calibrate(void)
return delta;
}
+int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
+{
+ unsigned int ecx_hz, edx;
+
+ memset(info, 0, sizeof(*info));
+
+ if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+ return -ENOENT;
+
+ /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
+ cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
+
+ if (!info->denominator || !info->numerator)
+ return -ENOENT;
+
+ /*
+ * Note, some CPUs provide the multiplier information, but not the core
+ * crystal frequency. The multiplier information is still useful for
+ * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
+ */
+ info->crystal_khz = ecx_hz / 1000;
+ return 0;
+}
+
+int cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
+{
+ if (cpuid_get_tsc_info(info) || !info->crystal_khz)
+ return -ENOENT;
+
+ info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
+ return 0;
+}
+
/**
* native_calibrate_tsc - determine TSC frequency
* Determine TSC frequency via CPUID, else return 0.
*/
unsigned long native_calibrate_tsc(void)
{
- unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
- unsigned int crystal_khz;
+ struct cpuid_tsc_info info;
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
return 0;
- if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+ if (cpuid_get_tsc_info(&info))
return 0;
- eax_denominator = ebx_numerator = ecx_hz = edx = 0;
-
- /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
- cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
-
- if (ebx_numerator == 0 || eax_denominator == 0)
- return 0;
-
- crystal_khz = ecx_hz / 1000;
-
/*
* Denverton SoCs don't report crystal clock, and also don't support
* CPUID_LEAF_FREQ for the calculation below, so hardcode the 25MHz
* crystal clock.
*/
- if (crystal_khz == 0 &&
- boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
- crystal_khz = 25000;
+ if (!info.crystal_khz && boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
+ info.crystal_khz = 25000;
/*
* TSC frequency reported directly by CPUID is a "hardware reported"
* frequency and is the most accurate one so far we have. This
* is considered a known frequency.
*/
- if (crystal_khz != 0)
+ if (info.crystal_khz)
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
/*
@@ -702,15 +723,15 @@ unsigned long native_calibrate_tsc(void)
* clock, but we can easily calculate it to a high degree of accuracy
* by considering the crystal ratio and the CPU speed.
*/
- if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
+ if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
unsigned int eax_base_mhz, ebx, ecx, edx;
cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
- crystal_khz = eax_base_mhz * 1000 *
- eax_denominator / ebx_numerator;
+ info.crystal_khz = eax_base_mhz * 1000 *
+ info.denominator / info.numerator;
}
- if (crystal_khz == 0)
+ if (!info.crystal_khz)
return 0;
/*
@@ -727,10 +748,10 @@ unsigned long native_calibrate_tsc(void)
* lapic_timer_period here to avoid having to calibrate the APIC
* timer later.
*/
- lapic_timer_period = crystal_khz * 1000 / HZ;
+ lapic_timer_period = info.crystal_khz * 1000 / HZ;
#endif
- return crystal_khz * ebx_numerator / eax_denominator;
+ return info.crystal_khz * info.numerator / info.denominator;
}
static unsigned long cpu_khz_from_cpuid(void)
--
2.48.1.711.g2feabab25a-goog
* [PATCH v2 02/38] x86/tsc: Add standalone helper for getting CPU frequency from CPUID
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 01/38] x86/tsc: Add standalone helpers for getting TSC info from CPUID.0x15 Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 03/38] x86/tsc: Add helper to register CPU and TSC freq calibration routines Sean Christopherson
` (36 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Extract the guts of cpu_khz_from_cpuid() to a standalone helper that
doesn't restrict the usage to Intel CPUs. This will allow sharing the
core logic with kvmclock, as (a) CPUID.0x16 may be enumerated alongside
kvmclock, and (b) KVM generally doesn't restrict CPUID based on vendor.
No functional change intended.
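For a quick worked example (hypothetical values): if CPUID.0x16 reports a base
frequency of 2400 MHz in EAX, cpuid_get_cpu_freq() returns 0 and sets
*cpu_khz = 2400 * 1000 = 2400000, i.e. 2.4 GHz; if the leaf is unsupported or
EAX is zero, it returns -ENOENT and leaves *cpu_khz at zero.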
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 37 +++++++++++++++++++++++--------------
2 files changed, 24 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index a4d84f721775..c3a14df46327 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -36,6 +36,7 @@ struct cpuid_tsc_info {
};
extern int cpuid_get_tsc_info(struct cpuid_tsc_info *info);
extern int cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
+extern int cpuid_get_cpu_freq(unsigned int *cpu_khz);
extern void tsc_early_init(void);
extern void tsc_init(void);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 93713eb81f52..bb4619148161 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -688,6 +688,24 @@ int cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
return 0;
}
+int cpuid_get_cpu_freq(unsigned int *cpu_khz)
+{
+ unsigned int eax_base_mhz, ebx, ecx, edx;
+
+ *cpu_khz = 0;
+
+ if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
+ return -ENOENT;
+
+ cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
+
+ if (!eax_base_mhz)
+ return -ENOENT;
+
+ *cpu_khz = eax_base_mhz * 1000;
+ return 0;
+}
+
/**
* native_calibrate_tsc - determine TSC frequency
* Determine TSC frequency via CPUID, else return 0.
@@ -723,13 +741,8 @@ unsigned long native_calibrate_tsc(void)
* clock, but we can easily calculate it to a high degree of accuracy
* by considering the crystal ratio and the CPU speed.
*/
- if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
- unsigned int eax_base_mhz, ebx, ecx, edx;
-
- cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
- info.crystal_khz = eax_base_mhz * 1000 *
- info.denominator / info.numerator;
- }
+ if (!info.crystal_khz && !cpuid_get_cpu_freq(&cpu_khz))
+ info.crystal_khz = cpu_khz * info.denominator / info.numerator;
if (!info.crystal_khz)
return 0;
@@ -756,19 +769,15 @@ unsigned long native_calibrate_tsc(void)
static unsigned long cpu_khz_from_cpuid(void)
{
- unsigned int eax_base_mhz, ebx_max_mhz, ecx_bus_mhz, edx;
+ unsigned int cpu_khz;
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
return 0;
- if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
+ if (cpuid_get_cpu_freq(&cpu_khz))
return 0;
- eax_base_mhz = ebx_max_mhz = ecx_bus_mhz = edx = 0;
-
- cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx_max_mhz, &ecx_bus_mhz, &edx);
-
- return eax_base_mhz * 1000;
+ return cpu_khz;
}
/*
--
2.48.1.711.g2feabab25a-goog
* [PATCH v2 03/38] x86/tsc: Add helper to register CPU and TSC freq calibration routines
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 01/38] x86/tsc: Add standalone helpers for getting TSC info from CPUID.0x15 Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 02/38] x86/tsc: Add standalone helper for getting CPU frequency from CPUID Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-03-04 4:17 ` Michael Kelley
2025-02-27 2:18 ` [PATCH v2 04/38] x86/sev: Mark TSC as reliable when configuring Secure TSC Sean Christopherson
` (35 subsequent siblings)
38 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Add a helper to register non-native, i.e. PV and CoCo, CPU and TSC
frequency calibration routines. This will allow consolidating handling
of common TSC properties that are forced by the hypervisor (PV routines),
and will also allow adding sanity checks to guard against overriding a
TSC calibration routine with a routine that is less robust/trusted.
Make the CPU calibration routine optional, as Xen (very sanely) doesn't
assume the CPU runs at the same frequency as the TSC.
Wrap the helper in an #ifdef to document that the kernel overrides
the native routines when running as a VM, and to guard against unwanted
usage. Add a TODO to call out that AMD_MEM_ENCRYPT is a mess and doesn't
depend on HYPERVISOR_GUEST because it gates both guest and host code.
No functional change intended.
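As a sketch of the intended usage (hypothetical platform and symbol names,
mirroring the conversions below), a PV platform's setup code becomes:

	static unsigned long example_pv_tsc_khz;

	static unsigned long example_pv_get_tsc_khz(void)
	{
		/* TSC frequency in kHz, as reported by the hypervisor. */
		return example_pv_tsc_khz;
	}

	static void __init example_pv_init_platform(void)
	{
		/*
		 * A platform that (sanely) doesn't assume the CPU runs at
		 * the TSC frequency, e.g. Xen, passes NULL for the CPU
		 * calibration routine.
		 */
		tsc_register_calibration_routines(example_pv_get_tsc_khz,
						  example_pv_get_tsc_khz);
	}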
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 4 ++--
arch/x86/include/asm/tsc.h | 4 ++++
arch/x86/kernel/cpu/acrn.c | 4 ++--
arch/x86/kernel/cpu/mshyperv.c | 3 +--
arch/x86/kernel/cpu/vmware.c | 4 ++--
arch/x86/kernel/jailhouse.c | 4 ++--
arch/x86/kernel/kvmclock.c | 4 ++--
arch/x86/kernel/tsc.c | 17 +++++++++++++++++
arch/x86/xen/time.c | 2 +-
9 files changed, 33 insertions(+), 13 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 82492efc5d94..684cef70edc1 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -3291,6 +3291,6 @@ void __init snp_secure_tsc_init(void)
rdmsrl(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
snp_tsc_freq_khz = (unsigned long)(tsc_freq_mhz * 1000);
- x86_platform.calibrate_cpu = securetsc_get_tsc_khz;
- x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
+ tsc_register_calibration_routines(securetsc_get_tsc_khz,
+ securetsc_get_tsc_khz);
}
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index c3a14df46327..9318c74e8d13 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -40,6 +40,10 @@ extern int cpuid_get_cpu_freq(unsigned int *cpu_khz);
extern void tsc_early_init(void);
extern void tsc_init(void);
+#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
+extern void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
+ unsigned long (*calibrate_cpu)(void));
+#endif
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 2c5b51aad91a..c1506cb87d8c 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -29,8 +29,8 @@ static void __init acrn_init_platform(void)
/* Install system interrupt handler for ACRN hypervisor callback */
sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
- x86_platform.calibrate_tsc = acrn_get_tsc_khz;
- x86_platform.calibrate_cpu = acrn_get_tsc_khz;
+ tsc_register_calibration_routines(acrn_get_tsc_khz,
+ acrn_get_tsc_khz);
}
static bool acrn_x2apic_available(void)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f285757618fc..aa60491bf738 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -478,8 +478,7 @@ static void __init ms_hyperv_init_platform(void)
if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
- x86_platform.calibrate_tsc = hv_get_tsc_khz;
- x86_platform.calibrate_cpu = hv_get_tsc_khz;
+ tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
}
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 00189cdeb775..d6f079a75f05 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -416,8 +416,8 @@ static void __init vmware_platform_setup(void)
}
vmware_tsc_khz = tsc_khz;
- x86_platform.calibrate_tsc = vmware_get_tsc_khz;
- x86_platform.calibrate_cpu = vmware_get_tsc_khz;
+ tsc_register_calibration_routines(vmware_get_tsc_khz,
+ vmware_get_tsc_khz);
#ifdef CONFIG_X86_LOCAL_APIC
/* Skip lapic calibration since we know the bus frequency. */
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index cd8ed1edbf9e..b0a053692161 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -209,8 +209,6 @@ static void __init jailhouse_init_platform(void)
x86_init.mpparse.parse_smp_cfg = jailhouse_parse_smp_config;
x86_init.pci.arch_init = jailhouse_pci_arch_init;
- x86_platform.calibrate_cpu = jailhouse_get_tsc;
- x86_platform.calibrate_tsc = jailhouse_get_tsc;
x86_platform.get_wallclock = jailhouse_get_wallclock;
x86_platform.legacy.rtc = 0;
x86_platform.legacy.warm_reset = 0;
@@ -220,6 +218,8 @@ static void __init jailhouse_init_platform(void)
machine_ops.emergency_restart = jailhouse_no_restart;
+ tsc_register_calibration_routines(jailhouse_get_tsc, jailhouse_get_tsc);
+
while (pa_data) {
mapping = early_memremap(pa_data, sizeof(header));
memcpy(&header, mapping, sizeof(header));
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 5b2c15214a6b..b898b95a7d50 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -320,8 +320,8 @@ void __init kvmclock_init(void)
flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+ tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
+
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index bb4619148161..d65e85929d3e 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1294,6 +1294,23 @@ static void __init check_system_tsc_reliable(void)
tsc_disable_clocksource_watchdog();
}
+/*
+ * TODO: Disentangle AMD_MEM_ENCRYPT and make SEV guest support depend on
+ * HYPERVISOR_GUEST.
+ */
+#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
+void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
+ unsigned long (*calibrate_cpu)(void))
+{
+ if (WARN_ON_ONCE(!calibrate_tsc))
+ return;
+
+ x86_platform.calibrate_tsc = calibrate_tsc;
+ if (calibrate_cpu)
+ x86_platform.calibrate_cpu = calibrate_cpu;
+}
+#endif
+
/*
* Make an educated guess if the TSC is trustworthy and synchronized
* over all CPUs.
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 96521b1874ac..9e2e900dc0c7 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -566,7 +566,7 @@ static void __init xen_init_time_common(void)
static_call_update(pv_steal_clock, xen_steal_clock);
paravirt_set_sched_clock(xen_sched_clock);
- x86_platform.calibrate_tsc = xen_tsc_khz;
+ tsc_register_calibration_routines(xen_tsc_khz, NULL);
x86_platform.get_wallclock = xen_get_wallclock;
}
--
2.48.1.711.g2feabab25a-goog
* RE: [PATCH v2 03/38] x86/tsc: Add helper to register CPU and TSC freq calibration routines
2025-02-27 2:18 ` [PATCH v2 03/38] x86/tsc: Add helper to register CPU and TSC freq calibration routines Sean Christopherson
@ 2025-03-04 4:17 ` Michael Kelley
0 siblings, 0 replies; 50+ messages in thread
From: Michael Kelley @ 2025-03-04 4:17 UTC (permalink / raw)
To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86@kernel.org, Kirill A. Shutemov,
Paolo Bonzini, Juergen Gross, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
kvm@vger.kernel.org, virtualization@lists.linux.dev,
linux-hyperv@vger.kernel.org, xen-devel@lists.xenproject.org,
Tom Lendacky, Nikunj A Dadhania
From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, February 26, 2025 6:18 PM
>
> Add a helper to register non-native, i.e. PV and CoCo, CPU and TSC
> frequency calibration routines. This will allow consolidating handling
> of common TSC properties that are forced by the hypervisor (PV routines),
> and will also allow adding sanity checks to guard against overriding a
> TSC calibration routine with a routine that is less robust/trusted.
>
> Make the CPU calibration routine optional, as Xen (very sanely) doesn't
> assume the CPU runs at the same frequency as the TSC.
>
> Wrap the helper in an #ifdef to document that the kernel overrides
> the native routines when running as a VM, and to guard against unwanted
> usage. Add a TODO to call out that AMD_MEM_ENCRYPT is a mess and doesn't
> depend on HYPERVISOR_GUEST because it gates both guest and host code.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/coco/sev/core.c | 4 ++--
> arch/x86/include/asm/tsc.h | 4 ++++
> arch/x86/kernel/cpu/acrn.c | 4 ++--
> arch/x86/kernel/cpu/mshyperv.c | 3 +--
> arch/x86/kernel/cpu/vmware.c | 4 ++--
> arch/x86/kernel/jailhouse.c | 4 ++--
> arch/x86/kernel/kvmclock.c | 4 ++--
> arch/x86/kernel/tsc.c | 17 +++++++++++++++++
> arch/x86/xen/time.c | 2 +-
> 9 files changed, 33 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
> index 82492efc5d94..684cef70edc1 100644
> --- a/arch/x86/coco/sev/core.c
> +++ b/arch/x86/coco/sev/core.c
> @@ -3291,6 +3291,6 @@ void __init snp_secure_tsc_init(void)
> rdmsrl(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
> snp_tsc_freq_khz = (unsigned long)(tsc_freq_mhz * 1000);
>
> - x86_platform.calibrate_cpu = securetsc_get_tsc_khz;
> - x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
> + tsc_register_calibration_routines(securetsc_get_tsc_khz,
> + securetsc_get_tsc_khz);
> }
> diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
> index c3a14df46327..9318c74e8d13 100644
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -40,6 +40,10 @@ extern int cpuid_get_cpu_freq(unsigned int *cpu_khz);
>
> extern void tsc_early_init(void);
> extern void tsc_init(void);
> +#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
> +extern void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
> + unsigned long (*calibrate_cpu)(void));
> +#endif
> extern void mark_tsc_unstable(char *reason);
> extern int unsynchronized_tsc(void);
> extern int check_tsc_unstable(void);
> diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> index 2c5b51aad91a..c1506cb87d8c 100644
> --- a/arch/x86/kernel/cpu/acrn.c
> +++ b/arch/x86/kernel/cpu/acrn.c
> @@ -29,8 +29,8 @@ static void __init acrn_init_platform(void)
> /* Install system interrupt handler for ACRN hypervisor callback */
> sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
>
> - x86_platform.calibrate_tsc = acrn_get_tsc_khz;
> - x86_platform.calibrate_cpu = acrn_get_tsc_khz;
> + tsc_register_calibration_routines(acrn_get_tsc_khz,
> + acrn_get_tsc_khz);
> }
>
> static bool acrn_x2apic_available(void)
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index f285757618fc..aa60491bf738 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -478,8 +478,7 @@ static void __init ms_hyperv_init_platform(void)
>
> if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> - x86_platform.calibrate_tsc = hv_get_tsc_khz;
> - x86_platform.calibrate_cpu = hv_get_tsc_khz;
> + tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> }
>
For the Hyper-V guest code,
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index 00189cdeb775..d6f079a75f05 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c
> @@ -416,8 +416,8 @@ static void __init vmware_platform_setup(void)
> }
>
> vmware_tsc_khz = tsc_khz;
> - x86_platform.calibrate_tsc = vmware_get_tsc_khz;
> - x86_platform.calibrate_cpu = vmware_get_tsc_khz;
> + tsc_register_calibration_routines(vmware_get_tsc_khz,
> + vmware_get_tsc_khz);
>
> #ifdef CONFIG_X86_LOCAL_APIC
> /* Skip lapic calibration since we know the bus frequency. */
> diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
> index cd8ed1edbf9e..b0a053692161 100644
> --- a/arch/x86/kernel/jailhouse.c
> +++ b/arch/x86/kernel/jailhouse.c
> @@ -209,8 +209,6 @@ static void __init jailhouse_init_platform(void)
> x86_init.mpparse.parse_smp_cfg = jailhouse_parse_smp_config;
> x86_init.pci.arch_init = jailhouse_pci_arch_init;
>
> - x86_platform.calibrate_cpu = jailhouse_get_tsc;
> - x86_platform.calibrate_tsc = jailhouse_get_tsc;
> x86_platform.get_wallclock = jailhouse_get_wallclock;
> x86_platform.legacy.rtc = 0;
> x86_platform.legacy.warm_reset = 0;
> @@ -220,6 +218,8 @@ static void __init jailhouse_init_platform(void)
>
> machine_ops.emergency_restart = jailhouse_no_restart;
>
> + tsc_register_calibration_routines(jailhouse_get_tsc, jailhouse_get_tsc);
> +
> while (pa_data) {
> mapping = early_memremap(pa_data, sizeof(header));
> memcpy(&header, mapping, sizeof(header));
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 5b2c15214a6b..b898b95a7d50 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -320,8 +320,8 @@ void __init kvmclock_init(void)
> flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
>
> - x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> - x86_platform.calibrate_cpu = kvm_get_tsc_khz;
> + tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
> +
> x86_platform.get_wallclock = kvm_get_wallclock;
> x86_platform.set_wallclock = kvm_set_wallclock;
> #ifdef CONFIG_X86_LOCAL_APIC
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index bb4619148161..d65e85929d3e 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1294,6 +1294,23 @@ static void __init check_system_tsc_reliable(void)
> tsc_disable_clocksource_watchdog();
> }
>
> +/*
> + * TODO: Disentangle AMD_MEM_ENCRYPT and make SEV guest support depend on
> + * HYPERVISOR_GUEST.
> + */
> +#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
> +void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
> + unsigned long (*calibrate_cpu)(void))
> +{
> + if (WARN_ON_ONCE(!calibrate_tsc))
> + return;
> +
> + x86_platform.calibrate_tsc = calibrate_tsc;
> + if (calibrate_cpu)
> + x86_platform.calibrate_cpu = calibrate_cpu;
> +}
> +#endif
> +
> /*
> * Make an educated guess if the TSC is trustworthy and synchronized
> * over all CPUs.
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 96521b1874ac..9e2e900dc0c7 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -566,7 +566,7 @@ static void __init xen_init_time_common(void)
> static_call_update(pv_steal_clock, xen_steal_clock);
> paravirt_set_sched_clock(xen_sched_clock);
>
> - x86_platform.calibrate_tsc = xen_tsc_khz;
> + tsc_register_calibration_routines(xen_tsc_khz, NULL);
> x86_platform.get_wallclock = xen_get_wallclock;
> }
>
> --
> 2.48.1.711.g2feabab25a-goog
>
* [PATCH v2 04/38] x86/sev: Mark TSC as reliable when configuring Secure TSC
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (2 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 03/38] x86/tsc: Add helper to register CPU and TSC freq calibration routines Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 05/38] x86/sev: Move check for SNP Secure TSC support to tsc_early_init() Sean Christopherson
` (34 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Move the code to mark the TSC as reliable from sme_early_init() to
snp_secure_tsc_init(). The only reader of TSC_RELIABLE is the aptly
named check_system_tsc_reliable(), which runs in tsc_init(), i.e.
after snp_secure_tsc_init().
This will allow consolidating the handling of TSC_KNOWN_FREQ and
TSC_RELIABLE when overriding the TSC calibration routine.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 2 ++
arch/x86/mm/mem_encrypt_amd.c | 3 ---
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 684cef70edc1..e6ce4ca72465 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -3288,6 +3288,8 @@ void __init snp_secure_tsc_init(void)
return;
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
rdmsrl(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
snp_tsc_freq_khz = (unsigned long)(tsc_freq_mhz * 1000);
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index b56c5c073003..774f9677458f 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -541,9 +541,6 @@ void __init sme_early_init(void)
* kernel mapped.
*/
snp_update_svsm_ca();
-
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
}
void __init mem_encrypt_free_decrypted_mem(void)
--
2.48.1.711.g2feabab25a-goog
* [PATCH v2 05/38] x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (3 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 04/38] x86/sev: Mark TSC as reliable when configuring Secure TSC Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 06/38] x86/tdx: Override PV calibration routines with CPUID-based calibration Sean Christopherson
` (33 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Move the check on having a Secure TSC to the common tsc_early_init() so
that it's obvious that having a Secure TSC is conditional, and to prepare
for adding TDX to the mix (blindly initializing *both* SNP and TDX TSC
logic looks especially weird).
No functional change intended.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 3 ---
arch/x86/kernel/tsc.c | 3 ++-
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index e6ce4ca72465..dab386f782ce 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -3284,9 +3284,6 @@ void __init snp_secure_tsc_init(void)
{
unsigned long long tsc_freq_mhz;
- if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
- return;
-
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index d65e85929d3e..6a011cd1ff94 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1563,7 +1563,8 @@ void __init tsc_early_init(void)
if (is_early_uv_system())
return;
- snp_secure_tsc_init();
+ if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+ snp_secure_tsc_init();
if (!determine_cpu_tsc_frequencies(true))
return;
--
2.48.1.711.g2feabab25a-goog
* [PATCH v2 06/38] x86/tdx: Override PV calibration routines with CPUID-based calibration
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (4 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 05/38] x86/sev: Move check for SNP Secure TSC support to tsc_early_init() Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 13:15 ` Kirill A. Shutemov
2025-02-27 2:18 ` [PATCH v2 07/38] x86/acrn: Mark TSC frequency as known when using ACRN for calibration Sean Christopherson
` (32 subsequent siblings)
38 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
When running as a TDX guest, explicitly override the TSC frequency
calibration routine with CPUID-based calibration instead of potentially
relying on a hypervisor-controlled PV routine. For TDX guests, CPUID.0x15
is always emulated by the TDX-Module, i.e. the information from CPUID is
more trustworthy than the information provided by the hypervisor.
To maintain backwards compatibility with TDX guest kernels that use native
calibration, and because it's the least awful option, retain
native_calibrate_tsc()'s stuffing of the local APIC bus period using the
core crystal frequency. While it's entirely possible for the hypervisor
to emulate the APIC timer at a different frequency than the core crystal
frequency, the commonly accepted interpretation of Intel's SDM is that the
APIC timer runs at the core crystal frequency when the latter is enumerated via
CPUID:
The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is enumerated
in CPUID leaf 0x15).
If the hypervisor is malicious and deliberately runs the APIC timer at the
wrong frequency, nothing would stop the hypervisor from modifying the
frequency at any time, i.e. attempting to manually calibrate the frequency
out of paranoia would be futile.
Deliberately leave the CPU frequency calibration routine as is, since the
TDX-Module doesn't provide any guarantees with respect to CPUID.0x16.
Opportunistically add a comment explaining that CoCo TSC initialization
needs to come after hypervisor specific initialization.
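For a worked example of the math in tdx_get_tsc_khz() (hypothetical
CPUID.0x15 values): with denominator = 2, numerator = 120, and
crystal_khz = 25000 (a 25 MHz crystal), cpuid_get_tsc_freq() yields
tsc_khz = 25000 * 120 / 2 = 1500000, i.e. a 1.5 GHz TSC, and with HZ=250 the
stuffed lapic_timer_period is 25000 * 1000 / 250 = 100000 crystal cycles per
tick.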
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/tdx/tdx.c | 30 +++++++++++++++++++++++++++---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/kernel/tsc.c | 8 ++++++++
3 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 32809a06dab4..42cdaa98dc5e 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
#include <linux/export.h>
#include <linux/io.h>
#include <linux/kexec.h>
+#include <asm/apic.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -1063,9 +1064,6 @@ void __init tdx_early_init(void)
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
- /* TSC is the only reliable clock in TDX guest */
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
-
cc_vendor = CC_VENDOR_INTEL;
/* Configure the TD */
@@ -1122,3 +1120,29 @@ void __init tdx_early_init(void)
tdx_announce();
}
+
+static unsigned long tdx_get_tsc_khz(void)
+{
+ struct cpuid_tsc_info info;
+
+ if (WARN_ON_ONCE(cpuid_get_tsc_freq(&info)))
+ return 0;
+
+ lapic_timer_period = info.crystal_khz * 1000 / HZ;
+
+ return info.tsc_khz;
+}
+
+void __init tdx_tsc_init(void)
+{
+ /* TSC is the only reliable clock in TDX guest */
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+ /*
+ * Override the PV calibration routines (if set) with more trustworthy
+ * CPUID-based calibration. The TDX module emulates CPUID, whereas any
+ * PV information is provided by the hypervisor.
+ */
+ tsc_register_calibration_routines(tdx_get_tsc_khz, NULL);
+}
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index b4b16dafd55e..621fbdd101e2 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -53,6 +53,7 @@ struct ve_info {
#ifdef CONFIG_INTEL_TDX_GUEST
void __init tdx_early_init(void);
+void __init tdx_tsc_init(void);
void tdx_get_ve_info(struct ve_info *ve);
@@ -72,6 +73,7 @@ void __init tdx_dump_td_ctls(u64 td_ctls);
#else
static inline void tdx_early_init(void) { };
+static inline void tdx_tsc_init(void) { }
static inline void tdx_safe_halt(void) { };
static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 6a011cd1ff94..472d6c71d3a5 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -32,6 +32,7 @@
#include <asm/topology.h>
#include <asm/uv/uv.h>
#include <asm/sev.h>
+#include <asm/tdx.h>
unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -1563,8 +1564,15 @@ void __init tsc_early_init(void)
if (is_early_uv_system())
return;
+ /*
+ * Do CoCo specific "secure" TSC initialization *after* hypervisor
+ * platform initialization so that the secure variant can override the
+ * hypervisor's PV calibration routine with a more trusted method.
+ */
if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
snp_secure_tsc_init();
+ else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+ tdx_tsc_init();
if (!determine_cpu_tsc_frequencies(true))
return;
--
2.48.1.711.g2feabab25a-goog
* Re: [PATCH v2 06/38] x86/tdx: Override PV calibration routines with CPUID-based calibration
2025-02-27 2:18 ` [PATCH v2 06/38] x86/tdx: Override PV calibration routines with CPUID-based calibration Sean Christopherson
@ 2025-02-27 13:15 ` Kirill A. Shutemov
0 siblings, 0 replies; 50+ messages in thread
From: Kirill A. Shutemov @ 2025-02-27 13:15 UTC (permalink / raw)
To: Sean Christopherson
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Paolo Bonzini, Juergen Gross, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz, linux-kernel,
linux-coco, kvm, virtualization, linux-hyperv, xen-devel,
Tom Lendacky, Nikunj A Dadhania
On Wed, Feb 26, 2025 at 06:18:22PM -0800, Sean Christopherson wrote:
> When running as a TDX guest, explicitly override the TSC frequency
> calibration routine with CPUID-based calibration instead of potentially
> relying on a hypervisor-controlled PV routine. For TDX guests, CPUID.0x15
> is always emulated by the TDX-Module, i.e. the information from CPUID is
> more trustworthy than the information provided by the hypervisor.
>
> To maintain backwards compatibility with TDX guest kernels that use native
> calibration, and because it's the least awful option, retain
> native_calibrate_tsc()'s stuffing of the local APIC bus period using the
> core crystal frequency. While it's entirely possible for the hypervisor
> to emulate the APIC timer at a different frequency than the core crystal
> frequency, the commonly accepted interpretation of Intel's SDM is that the
> APIC timer runs at the core crystal frequency when the latter is enumerated via
> CPUID:
>
> The APIC timer frequency will be the processor’s bus clock or core
> crystal clock frequency (when TSC/core crystal clock ratio is enumerated
> in CPUID leaf 0x15).
>
> If the hypervisor is malicious and deliberately runs the APIC timer at the
> wrong frequency, nothing would stop the hypervisor from modifying the
> frequency at any time, i.e. attempting to manually calibrate the frequency
> out of paranoia would be futile.
>
> Deliberately leave the CPU frequency calibration routine as is, since the
> TDX-Module doesn't provide any guarantees with respect to CPUID.0x16.
>
> Opportunistically add a comment explaining that CoCo TSC initialization
> needs to come after hypervisor specific initialization.
>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/coco/tdx/tdx.c | 30 +++++++++++++++++++++++++++---
> arch/x86/include/asm/tdx.h | 2 ++
> arch/x86/kernel/tsc.c | 8 ++++++++
> 3 files changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 32809a06dab4..42cdaa98dc5e 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -8,6 +8,7 @@
> #include <linux/export.h>
> #include <linux/io.h>
> #include <linux/kexec.h>
> +#include <asm/apic.h>
> #include <asm/coco.h>
> #include <asm/tdx.h>
> #include <asm/vmx.h>
> @@ -1063,9 +1064,6 @@ void __init tdx_early_init(void)
>
> setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
>
> - /* TSC is the only reliable clock in TDX guest */
> - setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> -
> cc_vendor = CC_VENDOR_INTEL;
>
> /* Configure the TD */
> @@ -1122,3 +1120,29 @@ void __init tdx_early_init(void)
>
> tdx_announce();
> }
> +
> +static unsigned long tdx_get_tsc_khz(void)
> +{
> + struct cpuid_tsc_info info;
> +
> + if (WARN_ON_ONCE(cpuid_get_tsc_freq(&info)))
> + return 0;
> +
> + lapic_timer_period = info.crystal_khz * 1000 / HZ;
> +
> + return info.tsc_khz;
> +}
> +
> +void __init tdx_tsc_init(void)
> +{
> + /* TSC is the only reliable clock in TDX guest */
> + setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +
> + /*
> + * Override the PV calibration routines (if set) with more trustworthy
> + * CPUID-based calibration. The TDX module emulates CPUID, whereas any
> + * PV information is provided by the hypervisor.
> + */
> + tsc_register_calibration_routines(tdx_get_tsc_khz, NULL);
> +}
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index b4b16dafd55e..621fbdd101e2 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -53,6 +53,7 @@ struct ve_info {
> #ifdef CONFIG_INTEL_TDX_GUEST
>
> void __init tdx_early_init(void);
> +void __init tdx_tsc_init(void);
>
> void tdx_get_ve_info(struct ve_info *ve);
>
> @@ -72,6 +73,7 @@ void __init tdx_dump_td_ctls(u64 td_ctls);
> #else
>
> static inline void tdx_early_init(void) { };
> +static inline void tdx_tsc_init(void) { }
> static inline void tdx_safe_halt(void) { };
>
> static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 6a011cd1ff94..472d6c71d3a5 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -32,6 +32,7 @@
> #include <asm/topology.h>
> #include <asm/uv/uv.h>
> #include <asm/sev.h>
> +#include <asm/tdx.h>
>
> unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
> EXPORT_SYMBOL(cpu_khz);
> @@ -1563,8 +1564,15 @@ void __init tsc_early_init(void)
> if (is_early_uv_system())
> return;
>
> + /*
> + * Do CoCo specific "secure" TSC initialization *after* hypervisor
> + * platform initialization so that the secure variant can override the
> + * hypervisor's PV calibration routine with a more trusted method.
> + */
> if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> snp_secure_tsc_init();
> + else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> + tdx_tsc_init();
Maybe an x86_platform.guest callback for this?
--
Kiryl Shutsemau / Kirill A. Shutemov
* [PATCH v2 07/38] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (5 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 06/38] x86/tdx: Override PV calibration routines with CPUID-based calibration Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 08/38] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary Sean Christopherson
` (31 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Mark the TSC frequency as known when using ACRN's PV CPUID information.
Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
the TSC freq is explicitly provided by the hypervisor.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/cpu/acrn.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index c1506cb87d8c..2da3de4d470e 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -29,6 +29,7 @@ static void __init acrn_init_platform(void)
/* Install system interrupt handler for ACRN hypervisor callback */
sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
tsc_register_calibration_routines(acrn_get_tsc_khz,
acrn_get_tsc_khz);
}
--
2.48.1.711.g2feabab25a-goog
* [PATCH v2 08/38] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (6 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 07/38] x86/acrn: Mark TSC frequency as known when using ACRN for calibration Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-03-04 4:17 ` Michael Kelley
2025-02-27 2:18 ` [PATCH v2 09/38] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers Sean Christopherson
` (30 subsequent siblings)
38 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Register the Hyper-V timer callbacks or saving/restoring its PV sched_clock
if and only if the timer is actually being used for sched_clock.
Currently, Hyper-V overrides the save/restore hooks if the reference TSC is
available, whereas the Hyper-V timer code only overrides sched_clock if
the reference TSC is available *and* it's not invariant. The flaw is
effectively papered over by invoking the "old" save/restore callbacks as
part of save/restore, but that's unnecessary and fragile.
To avoid introducing more complexity, and to allow for additional cleanups
of the PV sched_clock code, move the save/restore hooks and logic into
hyperv_timer.c and simply wire up the hooks when overriding sched_clock
itself.
Note, while the Hyper-V timer code is intended to be architecture neutral,
CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of x86
specific code (which will be reduced in future cleanups) doesn't
meaningfully pollute generic code.
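Concretely (hypothetical numbers, per the "adjusted value" comment in the
code): if the reference counter read 10,000,000 at suspend and reads
2,000,000 on resume after the counter was reset across hibernation,
restore_hv_clock_tsc_state() adjusts the clocksource offset by
10,000,000 - 2,000,000 = +8,000,000, so sched_clock resumes from where it
left off instead of jumping backwards.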
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/cpu/mshyperv.c | 58 ------------------------------
drivers/clocksource/hyperv_timer.c | 50 ++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 58 deletions(-)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index aa60491bf738..174f6a71c899 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -223,63 +223,6 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs)
hyperv_cleanup();
}
#endif /* CONFIG_CRASH_DUMP */
-
-static u64 hv_ref_counter_at_suspend;
-static void (*old_save_sched_clock_state)(void);
-static void (*old_restore_sched_clock_state)(void);
-
-/*
- * Hyper-V clock counter resets during hibernation. Save and restore clock
- * offset during suspend/resume, while also considering the time passed
- * before suspend. This is to make sure that sched_clock using hv tsc page
- * based clocksource, proceeds from where it left off during suspend and
- * it shows correct time for the timestamps of kernel messages after resume.
- */
-static void save_hv_clock_tsc_state(void)
-{
- hv_ref_counter_at_suspend = hv_read_reference_counter();
-}
-
-static void restore_hv_clock_tsc_state(void)
-{
- /*
- * Adjust the offsets used by hv tsc clocksource to
- * account for the time spent before hibernation.
- * adjusted value = reference counter (time) at suspend
- * - reference counter (time) now.
- */
- hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
-}
-
-/*
- * Functions to override save_sched_clock_state and restore_sched_clock_state
- * functions of x86_platform. The Hyper-V clock counter is reset during
- * suspend-resume and the offset used to measure time needs to be
- * corrected, post resume.
- */
-static void hv_save_sched_clock_state(void)
-{
- old_save_sched_clock_state();
- save_hv_clock_tsc_state();
-}
-
-static void hv_restore_sched_clock_state(void)
-{
- restore_hv_clock_tsc_state();
- old_restore_sched_clock_state();
-}
-
-static void __init x86_setup_ops_for_tsc_pg_clock(void)
-{
- if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
- return;
-
- old_save_sched_clock_state = x86_platform.save_sched_clock_state;
- x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
-
- old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
- x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
-}
#endif /* CONFIG_HYPERV */
static uint32_t __init ms_hyperv_platform(void)
@@ -635,7 +578,6 @@ static void __init ms_hyperv_init_platform(void)
/* Register Hyper-V specific clocksource */
hv_init_clocksource();
- x86_setup_ops_for_tsc_pg_clock();
hv_vtl_init_platform();
#endif
/*
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index f00019b078a7..86a55167bf5d 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -534,10 +534,60 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
}
#elif defined CONFIG_PARAVIRT
+static u64 hv_ref_counter_at_suspend;
+static void (*old_save_sched_clock_state)(void);
+static void (*old_restore_sched_clock_state)(void);
+
+/*
+ * Hyper-V clock counter resets during hibernation. Save and restore clock
+ * offset during suspend/resume, while also considering the time passed
+ * before suspend. This is to make sure that sched_clock using hv tsc page
+ * based clocksource, proceeds from where it left off during suspend and
+ * it shows correct time for the timestamps of kernel messages after resume.
+ */
+static void save_hv_clock_tsc_state(void)
+{
+ hv_ref_counter_at_suspend = hv_read_reference_counter();
+}
+
+static void restore_hv_clock_tsc_state(void)
+{
+ /*
+ * Adjust the offsets used by hv tsc clocksource to
+ * account for the time spent before hibernation.
+ * adjusted value = reference counter (time) at suspend
+ * - reference counter (time) now.
+ */
+ hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
+}
+/*
+ * Functions to override save_sched_clock_state and restore_sched_clock_state
+ * functions of x86_platform. The Hyper-V clock counter is reset during
+ * suspend-resume and the offset used to measure time needs to be
+ * corrected, post resume.
+ */
+static void hv_save_sched_clock_state(void)
+{
+ old_save_sched_clock_state();
+ save_hv_clock_tsc_state();
+}
+
+static void hv_restore_sched_clock_state(void)
+{
+ restore_hv_clock_tsc_state();
+ old_restore_sched_clock_state();
+}
+
static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
/* We're on x86/x64 *and* using PV ops */
paravirt_set_sched_clock(sched_clock);
+
+ old_save_sched_clock_state = x86_platform.save_sched_clock_state;
+ x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
+
+ old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
+ x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
--
2.48.1.711.g2feabab25a-goog
* RE: [PATCH v2 08/38] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
2025-02-27 2:18 ` [PATCH v2 08/38] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary Sean Christopherson
@ 2025-03-04 4:17 ` Michael Kelley
2025-03-04 17:39 ` Sean Christopherson
0 siblings, 1 reply; 50+ messages in thread
From: Michael Kelley @ 2025-03-04 4:17 UTC (permalink / raw)
To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86@kernel.org, Kirill A. Shutemov,
Paolo Bonzini, Juergen Gross, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
kvm@vger.kernel.org, virtualization@lists.linux.dev,
linux-hyperv@vger.kernel.org, xen-devel@lists.xenproject.org,
Tom Lendacky, Nikunj A Dadhania
From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, February 26, 2025 6:18 PM
>
> Register the Hyper-V timer callbacks or saving/restoring its PV sched_clock
s/or/for/
> if and only if the timer is actually being used for sched_clock.
> Currently, Hyper-V overrides the save/restore hooks if the reference TSC is
> available, whereas the Hyper-V timer code only overrides sched_clock if
> the reference TSC is available *and* it's not invariant. The flaw is
> effectively papered over by invoking the "old" save/restore callbacks as
> part of save/restore, but that's unnecessary and fragile.
The Hyper-V specific terminology here isn't quite right. There is a
PV "Hyper-V timer", but it is loaded by the guest OS with a specific value
and generates an interrupt when that value is reached. In Linux, it is used
for clockevents, but it's not a clocksource and is not used for sched_clock.
The correct Hyper-V term is "Hyper-V reference counter" (or "refcounter"
for short). The refcounter behaves like the TSC -- it's a monotonically
increasing value that is read-only, and can serve as the sched_clock.
And yes, both the Hyper-V timer and Hyper-V refcounter code is in a
source file with a name containing "timer" but not "refcounter". But
that seems to be the pattern for many of the drivers in
drivers/clocksource. :-)
>
> To avoid introducing more complexity, and to allow for additional cleanups
> of the PV sched_clock code, move the save/restore hooks and logic into
> hyperv_timer.c and simply wire up the hooks when overriding sched_clock
> itself.
>
> Note, while the Hyper-V timer code is intended to be architecture neutral,
> CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of x86
> specific code (which will be reduced in future cleanups) doesn't
> meaningfully pollute generic code.
I'm good with this approach.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Modulo the terminology used in the commit message,
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
> arch/x86/kernel/cpu/mshyperv.c | 58 ------------------------------
> drivers/clocksource/hyperv_timer.c | 50 ++++++++++++++++++++++++++
> 2 files changed, 50 insertions(+), 58 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index aa60491bf738..174f6a71c899 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -223,63 +223,6 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs)
> hyperv_cleanup();
> }
> #endif /* CONFIG_CRASH_DUMP */
> -
> -static u64 hv_ref_counter_at_suspend;
> -static void (*old_save_sched_clock_state)(void);
> -static void (*old_restore_sched_clock_state)(void);
> -
> -/*
> - * Hyper-V clock counter resets during hibernation. Save and restore clock
> - * offset during suspend/resume, while also considering the time passed
> - * before suspend. This is to make sure that sched_clock using hv tsc page
> - * based clocksource, proceeds from where it left off during suspend and
> - * it shows correct time for the timestamps of kernel messages after resume.
> - */
> -static void save_hv_clock_tsc_state(void)
> -{
> - hv_ref_counter_at_suspend = hv_read_reference_counter();
> -}
> -
> -static void restore_hv_clock_tsc_state(void)
> -{
> - /*
> - * Adjust the offsets used by hv tsc clocksource to
> - * account for the time spent before hibernation.
> - * adjusted value = reference counter (time) at suspend
> - * - reference counter (time) now.
> - */
> - hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
> -}
> -
> -/*
> - * Functions to override save_sched_clock_state and restore_sched_clock_state
> - * functions of x86_platform. The Hyper-V clock counter is reset during
> - * suspend-resume and the offset used to measure time needs to be
> - * corrected, post resume.
> - */
> -static void hv_save_sched_clock_state(void)
> -{
> - old_save_sched_clock_state();
> - save_hv_clock_tsc_state();
> -}
> -
> -static void hv_restore_sched_clock_state(void)
> -{
> - restore_hv_clock_tsc_state();
> - old_restore_sched_clock_state();
> -}
> -
> -static void __init x86_setup_ops_for_tsc_pg_clock(void)
> -{
> - if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
> - return;
> -
> - old_save_sched_clock_state = x86_platform.save_sched_clock_state;
> - x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
> -
> - old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
> - x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
> -}
> #endif /* CONFIG_HYPERV */
>
> static uint32_t __init ms_hyperv_platform(void)
> @@ -635,7 +578,6 @@ static void __init ms_hyperv_init_platform(void)
>
> /* Register Hyper-V specific clocksource */
> hv_init_clocksource();
> - x86_setup_ops_for_tsc_pg_clock();
> hv_vtl_init_platform();
> #endif
> /*
> diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
> index f00019b078a7..86a55167bf5d 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -534,10 +534,60 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
> sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
> }
> #elif defined CONFIG_PARAVIRT
> +static u64 hv_ref_counter_at_suspend;
> +static void (*old_save_sched_clock_state)(void);
> +static void (*old_restore_sched_clock_state)(void);
> +
> +/*
> + * Hyper-V clock counter resets during hibernation. Save and restore clock
> + * offset during suspend/resume, while also considering the time passed
> + * before suspend. This is to make sure that sched_clock using hv tsc page
> + * based clocksource, proceeds from where it left off during suspend and
> + * it shows correct time for the timestamps of kernel messages after resume.
> + */
> +static void save_hv_clock_tsc_state(void)
> +{
> + hv_ref_counter_at_suspend = hv_read_reference_counter();
> +}
> +
> +static void restore_hv_clock_tsc_state(void)
> +{
> + /*
> + * Adjust the offsets used by hv tsc clocksource to
> + * account for the time spent before hibernation.
> + * adjusted value = reference counter (time) at suspend
> + * - reference counter (time) now.
> + */
> + hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
> +}
> +/*
> + * Functions to override save_sched_clock_state and restore_sched_clock_state
> + * functions of x86_platform. The Hyper-V clock counter is reset during
> + * suspend-resume and the offset used to measure time needs to be
> + * corrected, post resume.
> + */
> +static void hv_save_sched_clock_state(void)
> +{
> + old_save_sched_clock_state();
> + save_hv_clock_tsc_state();
> +}
> +
> +static void hv_restore_sched_clock_state(void)
> +{
> + restore_hv_clock_tsc_state();
> + old_restore_sched_clock_state();
> +}
> +
> static __always_inline void hv_setup_sched_clock(void *sched_clock)
> {
> /* We're on x86/x64 *and* using PV ops */
> paravirt_set_sched_clock(sched_clock);
> +
> + old_save_sched_clock_state = x86_platform.save_sched_clock_state;
> + x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
> +
> + old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
> + x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
> }
> #else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
> static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
> --
> 2.48.1.711.g2feabab25a-goog
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v2 08/38] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
2025-03-04 4:17 ` Michael Kelley
@ 2025-03-04 17:39 ` Sean Christopherson
0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-03-04 17:39 UTC (permalink / raw)
To: Michael Kelley
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86@kernel.org, Kirill A. Shutemov, Paolo Bonzini, Juergen Gross,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Ajay Kaher,
Jan Kiszka, Andy Lutomirski, Peter Zijlstra, Daniel Lezcano,
John Stultz, linux-kernel@vger.kernel.org,
linux-coco@lists.linux.dev, kvm@vger.kernel.org,
virtualization@lists.linux.dev, linux-hyperv@vger.kernel.org,
xen-devel@lists.xenproject.org, Tom Lendacky, Nikunj A Dadhania
On Tue, Mar 04, 2025, Michael Kelley wrote:
> From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, February 26, 2025 6:18 PM
> >
> > Register the Hyper-V timer callbacks or saving/restoring its PV sched_clock
>
> s/or/for/
>
> > if and only if the timer is actually being used for sched_clock.
> > Currently, Hyper-V overrides the save/restore hooks if the reference TSC
> > available, whereas the Hyper-V timer code only overrides sched_clock if
> > the reference TSC is available *and* it's not invariant. The flaw is
> > effectively papered over by invoking the "old" save/restore callbacks as
> > part of save/restore, but that's unnecessary and fragile.
>
> The Hyper-V specific terminology here isn't quite right. There is a
> PV "Hyper-V timer", but it is loaded by the guest OS with a specific value
> and generates an interrupt when that value is reached. In Linux, it is used
> for clockevents, but it's not a clocksource and is not used for sched_clock.
> The correct Hyper-V term is "Hyper-V reference counter" (or "refcounter"
> for short). The refcounter behaves like the TSC -- it's a monotonically
> increasing value that is read-only, and can serve as the sched_clock.
>
> And yes, both the Hyper-V timer and Hyper-V refcounter code is in a
> source file with a name containing "timer" but not "refcounter". But
> that seems to be the pattern for many of the drivers in
> drivers/clocksource. :-)
Heh, wading through misleading naming is basically a rite of passage in the kernel.
Thanks for the reviews and testing! I'll fixup all the changelogs.
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v2 09/38] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (7 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 08/38] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-03-04 4:17 ` Michael Kelley
2025-02-27 2:18 ` [PATCH v2 10/38] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock Sean Christopherson
` (29 subsequent siblings)
38 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Now that all of the Hyper-V timer sched_clock code is located in a single
file, drop the superfluous wrappers for the save/restore flows.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
drivers/clocksource/hyperv_timer.c | 34 +++++-------------------------
include/clocksource/hyperv_timer.h | 2 --
2 files changed, 5 insertions(+), 31 deletions(-)
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index 86a55167bf5d..5a52d0acf31f 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -471,17 +471,6 @@ static void resume_hv_clock_tsc(struct clocksource *arg)
hv_set_msr(HV_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
}
-/*
- * Called during resume from hibernation, from overridden
- * x86_platform.restore_sched_clock_state routine. This is to adjust offsets
- * used to calculate time for hv tsc page based sched_clock, to account for
- * time spent before hibernation.
- */
-void hv_adj_sched_clock_offset(u64 offset)
-{
- hv_sched_clock_offset -= offset;
-}
-
#ifdef HAVE_VDSO_CLOCKMODE_HVCLOCK
static int hv_cs_enable(struct clocksource *cs)
{
@@ -545,12 +534,14 @@ static void (*old_restore_sched_clock_state)(void);
* based clocksource, proceeds from where it left off during suspend and
* it shows correct time for the timestamps of kernel messages after resume.
*/
-static void save_hv_clock_tsc_state(void)
+static void hv_save_sched_clock_state(void)
{
+ old_save_sched_clock_state();
+
hv_ref_counter_at_suspend = hv_read_reference_counter();
}
-static void restore_hv_clock_tsc_state(void)
+static void hv_restore_sched_clock_state(void)
{
/*
* Adjust the offsets used by hv tsc clocksource to
@@ -558,23 +549,8 @@ static void restore_hv_clock_tsc_state(void)
* adjusted value = reference counter (time) at suspend
* - reference counter (time) now.
*/
- hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
-}
-/*
- * Functions to override save_sched_clock_state and restore_sched_clock_state
- * functions of x86_platform. The Hyper-V clock counter is reset during
- * suspend-resume and the offset used to measure time needs to be
- * corrected, post resume.
- */
-static void hv_save_sched_clock_state(void)
-{
- old_save_sched_clock_state();
- save_hv_clock_tsc_state();
-}
+ hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
-static void hv_restore_sched_clock_state(void)
-{
- restore_hv_clock_tsc_state();
old_restore_sched_clock_state();
}
diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
index d48dd4176fd3..a4c81a60f53d 100644
--- a/include/clocksource/hyperv_timer.h
+++ b/include/clocksource/hyperv_timer.h
@@ -38,8 +38,6 @@ extern void hv_remap_tsc_clocksource(void);
extern unsigned long hv_get_tsc_pfn(void);
extern struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
-extern void hv_adj_sched_clock_offset(u64 offset);
-
static __always_inline bool
hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
u64 *cur_tsc, u64 *time)
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* RE: [PATCH v2 09/38] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
2025-02-27 2:18 ` [PATCH v2 09/38] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers Sean Christopherson
@ 2025-03-04 4:17 ` Michael Kelley
0 siblings, 0 replies; 50+ messages in thread
From: Michael Kelley @ 2025-03-04 4:17 UTC (permalink / raw)
To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86@kernel.org, Kirill A. Shutemov,
Paolo Bonzini, Juergen Gross, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
kvm@vger.kernel.org, virtualization@lists.linux.dev,
linux-hyperv@vger.kernel.org, xen-devel@lists.xenproject.org,
Tom Lendacky, Nikunj A Dadhania
From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, February 26, 2025 6:18 PM
>
> Now that all of the Hyper-V timer sched_clock code is located in a single
Again, "Hyper-V refcounter", not "Hyper-V timer".
> file, drop the superfluous wrappers for the save/restore flows.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Modulo the terminology in the commit message,
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
> drivers/clocksource/hyperv_timer.c | 34 +++++-------------------------
> include/clocksource/hyperv_timer.h | 2 --
> 2 files changed, 5 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
> index 86a55167bf5d..5a52d0acf31f 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -471,17 +471,6 @@ static void resume_hv_clock_tsc(struct clocksource *arg)
> hv_set_msr(HV_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
> }
>
> -/*
> - * Called during resume from hibernation, from overridden
> - * x86_platform.restore_sched_clock_state routine. This is to adjust offsets
> - * used to calculate time for hv tsc page based sched_clock, to account for
> - * time spent before hibernation.
> - */
> -void hv_adj_sched_clock_offset(u64 offset)
> -{
> - hv_sched_clock_offset -= offset;
> -}
> -
> #ifdef HAVE_VDSO_CLOCKMODE_HVCLOCK
> static int hv_cs_enable(struct clocksource *cs)
> {
> @@ -545,12 +534,14 @@ static void (*old_restore_sched_clock_state)(void);
> * based clocksource, proceeds from where it left off during suspend and
> * it shows correct time for the timestamps of kernel messages after resume.
> */
> -static void save_hv_clock_tsc_state(void)
> +static void hv_save_sched_clock_state(void)
> {
> + old_save_sched_clock_state();
> +
> hv_ref_counter_at_suspend = hv_read_reference_counter();
> }
>
> -static void restore_hv_clock_tsc_state(void)
> +static void hv_restore_sched_clock_state(void)
> {
> /*
> * Adjust the offsets used by hv tsc clocksource to
> @@ -558,23 +549,8 @@ static void restore_hv_clock_tsc_state(void)
> * adjusted value = reference counter (time) at suspend
> * - reference counter (time) now.
> */
> - hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
> -}
> -/*
> - * Functions to override save_sched_clock_state and restore_sched_clock_state
> - * functions of x86_platform. The Hyper-V clock counter is reset during
> - * suspend-resume and the offset used to measure time needs to be
> - * corrected, post resume.
> - */
> -static void hv_save_sched_clock_state(void)
> -{
> - old_save_sched_clock_state();
> - save_hv_clock_tsc_state();
> -}
> + hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
>
> -static void hv_restore_sched_clock_state(void)
> -{
> - restore_hv_clock_tsc_state();
> old_restore_sched_clock_state();
> }
>
> diff --git a/include/clocksource/hyperv_timer.h b/include/clocksource/hyperv_timer.h
> index d48dd4176fd3..a4c81a60f53d 100644
> --- a/include/clocksource/hyperv_timer.h
> +++ b/include/clocksource/hyperv_timer.h
> @@ -38,8 +38,6 @@ extern void hv_remap_tsc_clocksource(void);
> extern unsigned long hv_get_tsc_pfn(void);
> extern struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
>
> -extern void hv_adj_sched_clock_offset(u64 offset);
> -
> static __always_inline bool
> hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
> u64 *cur_tsc, u64 *time)
> --
> 2.48.1.711.g2feabab25a-goog
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v2 10/38] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (8 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 09/38] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-03-04 4:17 ` Michael Kelley
2025-02-27 2:18 ` [PATCH v2 11/38] x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y Sean Christopherson
` (28 subsequent siblings)
38 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Now that Hyper-V overrides the sched_clock save/restore hooks if and only
if sched_clock itself is set to the Hyper-V timer, drop the invocation of the
"old" save/restore callbacks. When the registration of the PV sched_clock
was done separately from overriding the save/restore hooks, it was possible
for Hyper-V to clobber the TSC save/restore callbacks without actually
switching to the Hyper-V timer.
Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
revert to using TSC for sched_clock, and so there is no need to invoke the
TSC save/restore hooks (and if there were, it would belong in common PV code).
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
drivers/clocksource/hyperv_timer.c | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index 5a52d0acf31f..4a21874e91b9 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -524,9 +524,6 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
}
#elif defined CONFIG_PARAVIRT
static u64 hv_ref_counter_at_suspend;
-static void (*old_save_sched_clock_state)(void);
-static void (*old_restore_sched_clock_state)(void);
-
/*
* Hyper-V clock counter resets during hibernation. Save and restore clock
* offset during suspend/resume, while also considering the time passed
@@ -536,8 +533,6 @@ static void (*old_restore_sched_clock_state)(void);
*/
static void hv_save_sched_clock_state(void)
{
- old_save_sched_clock_state();
-
hv_ref_counter_at_suspend = hv_read_reference_counter();
}
@@ -550,8 +545,6 @@ static void hv_restore_sched_clock_state(void)
* - reference counter (time) now.
*/
hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
-
- old_restore_sched_clock_state();
}
static __always_inline void hv_setup_sched_clock(void *sched_clock)
@@ -559,10 +552,7 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
/* We're on x86/x64 *and* using PV ops */
paravirt_set_sched_clock(sched_clock);
- old_save_sched_clock_state = x86_platform.save_sched_clock_state;
x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
-
- old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* RE: [PATCH v2 10/38] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
2025-02-27 2:18 ` [PATCH v2 10/38] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock Sean Christopherson
@ 2025-03-04 4:17 ` Michael Kelley
0 siblings, 0 replies; 50+ messages in thread
From: Michael Kelley @ 2025-03-04 4:17 UTC (permalink / raw)
To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86@kernel.org, Kirill A. Shutemov,
Paolo Bonzini, Juergen Gross, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
kvm@vger.kernel.org, virtualization@lists.linux.dev,
linux-hyperv@vger.kernel.org, xen-devel@lists.xenproject.org,
Tom Lendacky, Nikunj A Dadhania
From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, February 26, 2025 6:18 PM
>
> Now that Hyper-V overrides the sched_clock save/restore hooks if and only
> if sched_clock itself is set to the Hyper-V timer, drop the invocation of the
> "old" save/restore callbacks. When the registration of the PV sched_clock
> was done separate from overriding the save/restore hooks, it was possible
> for Hyper-V to clobber the TSC save/restore callbacks without actually
> switching to the Hyper-V timer.
Terminology again.
>
> Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
> revert to using TSC for sched_clock, and so there is no need to invoke the
> TSC save/restore hooks (and if there was, it belongs in common PV code).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
I've tested the full patch series in a Hyper-V guest on x86 with
and without the Hyper-V TSC_INVARIANT option. In both cases,
the sched_clock is correctly maintained across hibernation and
resume from hibernation, and the "flags" in /proc/cpuinfo are
as expected. I also compiled hyperv_timer.c on arm64, and that is
clean with no errors. So all is good.
Modulo the terminology used in the commit message,
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
> ---
> drivers/clocksource/hyperv_timer.c | 10 ----------
> 1 file changed, 10 deletions(-)
>
> diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
> index 5a52d0acf31f..4a21874e91b9 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -524,9 +524,6 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
> }
> #elif defined CONFIG_PARAVIRT
> static u64 hv_ref_counter_at_suspend;
> -static void (*old_save_sched_clock_state)(void);
> -static void (*old_restore_sched_clock_state)(void);
> -
> /*
> * Hyper-V clock counter resets during hibernation. Save and restore clock
> * offset during suspend/resume, while also considering the time passed
> @@ -536,8 +533,6 @@ static void (*old_restore_sched_clock_state)(void);
> */
> static void hv_save_sched_clock_state(void)
> {
> - old_save_sched_clock_state();
> -
> hv_ref_counter_at_suspend = hv_read_reference_counter();
> }
>
> @@ -550,8 +545,6 @@ static void hv_restore_sched_clock_state(void)
> * - reference counter (time) now.
> */
> hv_sched_clock_offset -= (hv_ref_counter_at_suspend - hv_read_reference_counter());
> -
> - old_restore_sched_clock_state();
> }
>
> static __always_inline void hv_setup_sched_clock(void *sched_clock)
> @@ -559,10 +552,7 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
> /* We're on x86/x64 *and* using PV ops */
> paravirt_set_sched_clock(sched_clock);
>
> - old_save_sched_clock_state = x86_platform.save_sched_clock_state;
> x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
> -
> - old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
> x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
> }
> #else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
> --
> 2.48.1.711.g2feabab25a-goog
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v2 11/38] x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (9 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 10/38] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 12/38] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend() Sean Christopherson
` (27 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Gate kvmclock's secondary CPU code on CONFIG_SMP, not CONFIG_X86_LOCAL_APIC.
Originally, kvmclock piggybacked PV APIC ops to set up secondary CPUs.
When that wart was fixed by commit df156f90a0f9 ("x86: Introduce
x86_cpuinit.early_percpu_clock_init hook"), the dependency on a local APIC
got carried forward unnecessarily.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index b898b95a7d50..80d1a06609c8 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -186,7 +186,7 @@ static void kvm_restore_sched_clock_state(void)
kvm_register_clock("primary cpu clock, resume");
}
-#ifdef CONFIG_X86_LOCAL_APIC
+#ifdef CONFIG_SMP
static void kvm_setup_secondary_clock(void)
{
kvm_register_clock("secondary cpu clock");
@@ -324,7 +324,7 @@ void __init kvmclock_init(void)
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
-#ifdef CONFIG_X86_LOCAL_APIC
+#ifdef CONFIG_SMP
x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 12/38] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (10 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 11/38] x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 13/38] x86/paravirt: Move handling of unstable PV clocks into paravirt_set_sched_clock() Sean Christopherson
` (26 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Don't disable kvmclock on the BSP during syscore_suspend(), as the BSP's
clock is NOT restored during syscore_resume(), but is instead restored
earlier via the sched_clock restore callback. If suspend is aborted, e.g.
due to a late wakeup, the BSP will run without its clock enabled, which
"works" only because KVM-the-hypervisor is kind enough to not clobber the
shared memory when the clock is disabled. But over time, the BSP's view
of time will drift from that of the APs.
Plumb in an "action" to KVM-as-a-guest and kvmclock code in preparation
for additional cleanups to kvmclock's suspend/resume logic.
Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_para.h | 8 +++++++-
arch/x86/kernel/kvm.c | 15 ++++++++-------
arch/x86/kernel/kvmclock.c | 31 +++++++++++++++++++++++++------
3 files changed, 40 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 57bc74e112f2..8708598f5b8e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -118,8 +118,14 @@ static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
}
#ifdef CONFIG_KVM_GUEST
+enum kvm_guest_cpu_action {
+ KVM_GUEST_BSP_SUSPEND,
+ KVM_GUEST_AP_OFFLINE,
+ KVM_GUEST_SHUTDOWN,
+};
+
void kvmclock_init(void);
-void kvmclock_disable(void);
+void kvmclock_cpu_action(enum kvm_guest_cpu_action action);
bool kvm_para_available(void);
unsigned int kvm_arch_para_features(void);
unsigned int kvm_arch_para_hints(void);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7a422a6c5983..866b061ee0d9 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -447,7 +447,7 @@ static void __init sev_map_percpu_data(void)
}
}
-static void kvm_guest_cpu_offline(bool shutdown)
+static void kvm_guest_cpu_offline(enum kvm_guest_cpu_action action)
{
kvm_disable_steal_time();
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
@@ -455,9 +455,10 @@ static void kvm_guest_cpu_offline(bool shutdown)
if (kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
wrmsrl(MSR_KVM_MIGRATION_CONTROL, 0);
kvm_pv_disable_apf();
- if (!shutdown)
+ if (action != KVM_GUEST_SHUTDOWN)
apf_task_wake_all();
- kvmclock_disable();
+
+ kvmclock_cpu_action(action);
}
static int kvm_cpu_online(unsigned int cpu)
@@ -713,7 +714,7 @@ static int kvm_cpu_down_prepare(unsigned int cpu)
unsigned long flags;
local_irq_save(flags);
- kvm_guest_cpu_offline(false);
+ kvm_guest_cpu_offline(KVM_GUEST_AP_OFFLINE);
local_irq_restore(flags);
return 0;
}
@@ -724,7 +725,7 @@ static int kvm_suspend(void)
{
u64 val = 0;
- kvm_guest_cpu_offline(false);
+ kvm_guest_cpu_offline(KVM_GUEST_BSP_SUSPEND);
#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
@@ -751,7 +752,7 @@ static struct syscore_ops kvm_syscore_ops = {
static void kvm_pv_guest_cpu_reboot(void *unused)
{
- kvm_guest_cpu_offline(true);
+ kvm_guest_cpu_offline(KVM_GUEST_SHUTDOWN);
}
static int kvm_pv_reboot_notify(struct notifier_block *nb,
@@ -775,7 +776,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
#ifdef CONFIG_CRASH_DUMP
static void kvm_crash_shutdown(struct pt_regs *regs)
{
- kvm_guest_cpu_offline(true);
+ kvm_guest_cpu_offline(KVM_GUEST_SHUTDOWN);
native_machine_crash_shutdown(regs);
}
#endif
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 80d1a06609c8..223e5297f5ee 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -177,8 +177,22 @@ static void kvm_register_clock(char *txt)
pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
}
+static void kvmclock_disable(void)
+{
+ if (msr_kvm_system_time)
+ native_write_msr(msr_kvm_system_time, 0, 0);
+}
+
static void kvm_save_sched_clock_state(void)
{
+ /*
+ * Stop host writes to kvmclock immediately prior to suspend/hibernate.
+ * If the system is hibernating, then kvmclock will likely reside at a
+ * different physical address when the system awakens, and host writes
+ * to the old address prior to reconfiguring kvmclock would clobber
+ * random memory.
+ */
+ kvmclock_disable();
}
static void kvm_restore_sched_clock_state(void)
@@ -186,6 +200,17 @@ static void kvm_restore_sched_clock_state(void)
kvm_register_clock("primary cpu clock, resume");
}
+void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
+{
+ /*
+ * Don't disable kvmclock on the BSP during suspend. If kvmclock is
+ * being used for sched_clock, then it needs to be kept alive until the
+ * last minute, and restored as quickly as possible after resume.
+ */
+ if (action != KVM_GUEST_BSP_SUSPEND)
+ kvmclock_disable();
+}
+
#ifdef CONFIG_SMP
static void kvm_setup_secondary_clock(void)
{
@@ -193,12 +218,6 @@ static void kvm_setup_secondary_clock(void)
}
#endif
-void kvmclock_disable(void)
-{
- if (msr_kvm_system_time)
- native_write_msr(msr_kvm_system_time, 0, 0);
-}
-
static void __init kvmclock_init_mem(void)
{
unsigned long ncpus;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 13/38] x86/paravirt: Move handling of unstable PV clocks into paravirt_set_sched_clock()
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (11 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 12/38] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend() Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 14/38] x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c Sean Christopherson
` (25 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Move the handling of unstable PV clocks, of which kvmclock is the only
example, into paravirt_set_sched_clock(). This will allow modifying
paravirt_set_sched_clock() to keep using the TSC for sched_clock in
certain scenarios without unintentionally marking the TSC-based clock as
unstable.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/paravirt.h | 7 ++++++-
arch/x86/kernel/kvmclock.c | 5 +----
arch/x86/kernel/paravirt.c | 6 +++++-
3 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 041aff51eb50..cfceabd5f7e1 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -28,7 +28,12 @@ u64 dummy_sched_clock(void);
DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
-void paravirt_set_sched_clock(u64 (*func)(void));
+void __paravirt_set_sched_clock(u64 (*func)(void), bool stable);
+
+static inline void paravirt_set_sched_clock(u64 (*func)(void))
+{
+ __paravirt_set_sched_clock(func, true);
+}
static __always_inline u64 paravirt_sched_clock(void)
{
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 223e5297f5ee..aae6fba21331 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -12,7 +12,6 @@
#include <linux/hardirq.h>
#include <linux/cpuhotplug.h>
#include <linux/sched.h>
-#include <linux/sched/clock.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/set_memory.h>
@@ -93,10 +92,8 @@ static noinstr u64 kvm_sched_clock_read(void)
static inline void kvm_sched_clock_init(bool stable)
{
- if (!stable)
- clear_sched_clock_stable();
kvm_sched_clock_offset = kvm_clock_read();
- paravirt_set_sched_clock(kvm_sched_clock_read);
+ __paravirt_set_sched_clock(kvm_sched_clock_read, stable);
pr_info("kvm-clock: using sched offset of %llu cycles",
kvm_sched_clock_offset);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1ccaa3397a67..55c819673a9d 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -14,6 +14,7 @@
#include <linux/highmem.h>
#include <linux/kprobes.h>
#include <linux/pgtable.h>
+#include <linux/sched/clock.h>
#include <linux/static_call.h>
#include <asm/bug.h>
@@ -85,8 +86,11 @@ static u64 native_steal_clock(int cpu)
DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
-void paravirt_set_sched_clock(u64 (*func)(void))
+void __paravirt_set_sched_clock(u64 (*func)(void), bool stable)
{
+ if (!stable)
+ clear_sched_clock_stable();
+
static_call_update(pv_sched_clock, func);
}
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 14/38] x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (12 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 13/38] x86/paravirt: Move handling of unstable PV clocks into paravirt_set_sched_clock() Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 15/38] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks Sean Christopherson
` (24 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Move kvmclock's sched_clock save/restore helpers "up" so that they can
(eventually) be referenced by kvm_sched_clock_init().
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 108 ++++++++++++++++++-------------------
1 file changed, 54 insertions(+), 54 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index aae6fba21331..c78db52ae399 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -70,6 +70,25 @@ static int kvm_set_wallclock(const struct timespec64 *now)
return -ENODEV;
}
+static void kvm_register_clock(char *txt)
+{
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
+ u64 pa;
+
+ if (!src)
+ return;
+
+ pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
+ wrmsrl(msr_kvm_system_time, pa);
+ pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
+}
+
+static void kvmclock_disable(void)
+{
+ if (msr_kvm_system_time)
+ native_write_msr(msr_kvm_system_time, 0, 0);
+}
+
static u64 kvm_clock_read(void)
{
u64 ret;
@@ -90,6 +109,30 @@ static noinstr u64 kvm_sched_clock_read(void)
return pvclock_clocksource_read_nowd(this_cpu_pvti()) - kvm_sched_clock_offset;
}
+static void kvm_save_sched_clock_state(void)
+{
+ /*
+ * Stop host writes to kvmclock immediately prior to suspend/hibernate.
+ * If the system is hibernating, then kvmclock will likely reside at a
+ * different physical address when the system awakens, and host writes
+ * to the old address prior to reconfiguring kvmclock would clobber
+ * random memory.
+ */
+ kvmclock_disable();
+}
+
+#ifdef CONFIG_SMP
+static void kvm_setup_secondary_clock(void)
+{
+ kvm_register_clock("secondary cpu clock");
+}
+#endif
+
+static void kvm_restore_sched_clock_state(void)
+{
+ kvm_register_clock("primary cpu clock, resume");
+}
+
static inline void kvm_sched_clock_init(bool stable)
{
kvm_sched_clock_offset = kvm_clock_read();
@@ -102,6 +145,17 @@ static inline void kvm_sched_clock_init(bool stable)
sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
}
+void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
+{
+ /*
+ * Don't disable kvmclock on the BSP during suspend. If kvmclock is
+ * being used for sched_clock, then it needs to be kept alive until the
+ * last minute, and restored as quickly as possible after resume.
+ */
+ if (action != KVM_GUEST_BSP_SUSPEND)
+ kvmclock_disable();
+}
+
/*
* If we don't do that, there is the possibility that the guest
* will calibrate under heavy load - thus, getting a lower lpj -
@@ -161,60 +215,6 @@ static struct clocksource kvm_clock = {
.enable = kvm_cs_enable,
};
-static void kvm_register_clock(char *txt)
-{
- struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
- u64 pa;
-
- if (!src)
- return;
-
- pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
- wrmsrl(msr_kvm_system_time, pa);
- pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
-}
-
-static void kvmclock_disable(void)
-{
- if (msr_kvm_system_time)
- native_write_msr(msr_kvm_system_time, 0, 0);
-}
-
-static void kvm_save_sched_clock_state(void)
-{
- /*
- * Stop host writes to kvmclock immediately prior to suspend/hibernate.
- * If the system is hibernating, then kvmclock will likely reside at a
- * different physical address when the system awakens, and host writes
- * to the old address prior to reconfiguring kvmclock would clobber
- * random memory.
- */
- kvmclock_disable();
-}
-
-static void kvm_restore_sched_clock_state(void)
-{
- kvm_register_clock("primary cpu clock, resume");
-}
-
-void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
-{
- /*
- * Don't disable kvmclock on the BSP during suspend. If kvmclock is
- * being used for sched_clock, then it needs to be kept alive until the
- * last minute, and restored as quickly as possible after resume.
- */
- if (action != KVM_GUEST_BSP_SUSPEND)
- kvmclock_disable();
-}
-
-#ifdef CONFIG_SMP
-static void kvm_setup_secondary_clock(void)
-{
- kvm_register_clock("secondary cpu clock");
-}
-#endif
-
static void __init kvmclock_init_mem(void)
{
unsigned long ncpus;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 15/38] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (13 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 14/38] x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 16/38] x86/vmware: Nullify save/restore hooks when using VMware's sched_clock Sean Christopherson
` (23 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Nullify the x86_platform sched_clock save/restore hooks when setting up
Xen's PV clock to make it somewhat obvious the hooks aren't used when
running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/xen/time.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 9e2e900dc0c7..51eba986cd18 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -565,6 +565,12 @@ static void __init xen_init_time_common(void)
xen_sched_clock_offset = xen_clocksource_read();
static_call_update(pv_steal_clock, xen_steal_clock);
paravirt_set_sched_clock(xen_sched_clock);
+ /*
+ * Xen has paravirtualized suspend/resume and so doesn't use the common
+ * x86 sched_clock save/restore hooks.
+ */
+ x86_platform.save_sched_clock_state = NULL;
+ x86_platform.restore_sched_clock_state = NULL;
tsc_register_calibration_routines(xen_tsc_khz, NULL);
x86_platform.get_wallclock = xen_get_wallclock;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 16/38] x86/vmware: Nullify save/restore hooks when using VMware's sched_clock
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (14 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 15/38] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 17/38] x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock Sean Christopherson
` (22 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Nullify the sched_clock save/restore hooks when using VMware's version of
sched_clock. This will allow extending paravirt_set_sched_clock() to set
the save/restore hooks, without having to simultaneously change the
behavior of VMware guests.
Note, it's not at all obvious that it's safe/correct for VMware guests to
do nothing on suspend/resume, but that's a pre-existing problem. Leave it
for a VMware expert to sort out.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/cpu/vmware.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index d6f079a75f05..d6eadb5b37fd 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -344,8 +344,11 @@ static void __init vmware_paravirt_ops_setup(void)
vmware_cyc2ns_setup();
- if (vmw_sched_clock)
+ if (vmw_sched_clock) {
paravirt_set_sched_clock(vmware_sched_clock);
+ x86_platform.save_sched_clock_state = NULL;
+ x86_platform.restore_sched_clock_state = NULL;
+ }
if (vmware_is_stealclock_available()) {
has_steal_clock = true;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 17/38] x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (15 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 16/38] x86/vmware: Nullify save/restore hooks when using VMware's sched_clock Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 18/38] x86/paravirt: Pass sched_clock save/restore helpers during registration Sean Christopherson
` (21 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Now that all PV clocksources override the sched_clock save/restore hooks
when overriding sched_clock, WARN if the "default" TSC hooks are invoked
when using a PV sched_clock, e.g. to guard against regressions.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/tsc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 472d6c71d3a5..5501d76243c8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -990,7 +990,7 @@ static unsigned long long cyc2ns_suspend;
void tsc_save_sched_clock_state(void)
{
- if (!sched_clock_stable())
+ if (!sched_clock_stable() || WARN_ON_ONCE(!using_native_sched_clock()))
return;
cyc2ns_suspend = sched_clock();
@@ -1010,7 +1010,7 @@ void tsc_restore_sched_clock_state(void)
unsigned long flags;
int cpu;
- if (!sched_clock_stable())
+ if (!sched_clock_stable() || WARN_ON_ONCE(!using_native_sched_clock()))
return;
local_irq_save(flags);
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 18/38] x86/paravirt: Pass sched_clock save/restore helpers during registration
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (16 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 17/38] x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 19/38] x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c Sean Christopherson
` (20 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Pass in a PV clock's save/restore helpers when configuring sched_clock
instead of relying on each PV clock to manually set the save/restore hooks.
In addition to bringing sanity to the code, this will allow gracefully
"rejecting" a PV sched_clock, e.g. when running as a CoCo guest that has
access to a "secure" TSC.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/paravirt.h | 8 +++++---
arch/x86/kernel/cpu/vmware.c | 7 ++-----
arch/x86/kernel/kvmclock.c | 5 ++---
arch/x86/kernel/paravirt.c | 5 ++++-
arch/x86/xen/time.c | 5 ++---
drivers/clocksource/hyperv_timer.c | 6 ++----
6 files changed, 17 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cfceabd5f7e1..dc26a3c26527 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -28,11 +28,13 @@ u64 dummy_sched_clock(void);
DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
-void __paravirt_set_sched_clock(u64 (*func)(void), bool stable);
+void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
+ void (*save)(void), void (*restore)(void));
-static inline void paravirt_set_sched_clock(u64 (*func)(void))
+static inline void paravirt_set_sched_clock(u64 (*func)(void),
+ void (*save)(void), void (*restore)(void))
{
- __paravirt_set_sched_clock(func, true);
+ __paravirt_set_sched_clock(func, true, save, restore);
}
static __always_inline u64 paravirt_sched_clock(void)
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index d6eadb5b37fd..399cf3286a60 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -344,11 +344,8 @@ static void __init vmware_paravirt_ops_setup(void)
vmware_cyc2ns_setup();
- if (vmw_sched_clock) {
- paravirt_set_sched_clock(vmware_sched_clock);
- x86_platform.save_sched_clock_state = NULL;
- x86_platform.restore_sched_clock_state = NULL;
- }
+ if (vmw_sched_clock)
+ paravirt_set_sched_clock(vmware_sched_clock, NULL, NULL);
if (vmware_is_stealclock_available()) {
has_steal_clock = true;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index c78db52ae399..1ad3878cc1d9 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -136,7 +136,8 @@ static void kvm_restore_sched_clock_state(void)
static inline void kvm_sched_clock_init(bool stable)
{
kvm_sched_clock_offset = kvm_clock_read();
- __paravirt_set_sched_clock(kvm_sched_clock_read, stable);
+ __paravirt_set_sched_clock(kvm_sched_clock_read, stable,
+ kvm_save_sched_clock_state, kvm_restore_sched_clock_state);
pr_info("kvm-clock: using sched offset of %llu cycles",
kvm_sched_clock_offset);
@@ -343,8 +344,6 @@ void __init kvmclock_init(void)
#ifdef CONFIG_SMP
x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
- x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
- x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
kvm_get_preset_lpj();
/*
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 55c819673a9d..9673cd3a3f0a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -86,12 +86,15 @@ static u64 native_steal_clock(int cpu)
DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
-void __paravirt_set_sched_clock(u64 (*func)(void), bool stable)
+void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
+ void (*save)(void), void (*restore)(void))
{
if (!stable)
clear_sched_clock_stable();
static_call_update(pv_sched_clock, func);
+ x86_platform.save_sched_clock_state = save;
+ x86_platform.restore_sched_clock_state = restore;
}
/* These are in entry.S */
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 51eba986cd18..3179f850352d 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -564,13 +564,12 @@ static void __init xen_init_time_common(void)
{
xen_sched_clock_offset = xen_clocksource_read();
static_call_update(pv_steal_clock, xen_steal_clock);
- paravirt_set_sched_clock(xen_sched_clock);
+
/*
* Xen has paravirtualized suspend/resume and so doesn't use the common
* x86 sched_clock save/restore hooks.
*/
- x86_platform.save_sched_clock_state = NULL;
- x86_platform.restore_sched_clock_state = NULL;
+ paravirt_set_sched_clock(xen_sched_clock, NULL, NULL);
tsc_register_calibration_routines(xen_tsc_khz, NULL);
x86_platform.get_wallclock = xen_get_wallclock;
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index 4a21874e91b9..1c4ed9995cb2 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -550,10 +550,8 @@ static void hv_restore_sched_clock_state(void)
static __always_inline void hv_setup_sched_clock(void *sched_clock)
{
/* We're on x86/x64 *and* using PV ops */
- paravirt_set_sched_clock(sched_clock);
-
- x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
- x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
+ paravirt_set_sched_clock(sched_clock, hv_save_sched_clock_state,
+ hv_restore_sched_clock_state);
}
#else /* !CONFIG_GENERIC_SCHED_CLOCK && !CONFIG_PARAVIRT */
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 19/38] x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (17 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 18/38] x86/paravirt: Pass sched_clock save/restore helpers during registration Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 20/38] x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init Sean Christopherson
` (19 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Move kvm_sched_clock_init() "down" so that it can reference the global
kvm_clock structure without needing a forward declaration.
Opportunistically mark the helper as "__init" instead of "inline" to make
its usage more obvious; modern compilers don't need a hint to inline a
single-use function, and an extra CALL+RET pair during boot is a complete
non-issue. And, if the compiler ignores the hint and does NOT inline the
function, the resulting code may not get discarded after boot due to the lack of
an __init annotation.
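As a minimal sketch of the underlying pattern (illustrative, not code
from this patch): a function tagged __init is emitted into .init.text
and its memory is freed once boot completes, whereas an uninlined plain
static function would occupy .text for the lifetime of the kernel.
#include <linux/init.h>
static int example_lpj;	/* hypothetical, for illustration only */
static void __init example_boot_only_setup(void)
{
	/* Runs exactly once during boot; afterwards this code is freed. */
	example_lpj = 42;
}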
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1ad3878cc1d9..934ee4a4c6d4 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -133,19 +133,6 @@ static void kvm_restore_sched_clock_state(void)
kvm_register_clock("primary cpu clock, resume");
}
-static inline void kvm_sched_clock_init(bool stable)
-{
- kvm_sched_clock_offset = kvm_clock_read();
- __paravirt_set_sched_clock(kvm_sched_clock_read, stable,
- kvm_save_sched_clock_state, kvm_restore_sched_clock_state);
-
- pr_info("kvm-clock: using sched offset of %llu cycles",
- kvm_sched_clock_offset);
-
- BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
- sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
-}
-
void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
{
/*
@@ -302,6 +289,19 @@ static int kvmclock_setup_percpu(unsigned int cpu)
return p ? 0 : -ENOMEM;
}
+static void __init kvm_sched_clock_init(bool stable)
+{
+ kvm_sched_clock_offset = kvm_clock_read();
+ __paravirt_set_sched_clock(kvm_sched_clock_read, stable,
+ kvm_save_sched_clock_state, kvm_restore_sched_clock_state);
+
+ pr_info("kvm-clock: using sched offset of %llu cycles",
+ kvm_sched_clock_offset);
+
+ BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
+ sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+}
+
void __init kvmclock_init(void)
{
u8 flags;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 20/38] x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (18 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 19/38] x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 21/38] x86/pvclock: Mark setup helpers and related variables as __init/__ro_after_init Sean Christopherson
` (18 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Annotate xen_setup_vsyscall_time_info() as being used only during kernel
initialization; it's called only by xen_time_init(), which is already
tagged __init.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/xen/time.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 3179f850352d..13e5888c4501 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -441,7 +441,7 @@ void xen_restore_time_memory_area(void)
xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
}
-static void xen_setup_vsyscall_time_info(void)
+static void __init xen_setup_vsyscall_time_info(void)
{
struct vcpu_register_time_memory_area t;
struct pvclock_vsyscall_time_info *ti;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 21/38] x86/pvclock: Mark setup helpers and related variables as __init/__ro_after_init
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (19 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 20/38] x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 22/38] x86/pvclock: WARN if pvclock's valid_flags are overwritten Sean Christopherson
` (17 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Now that Xen PV clock and kvmclock explicitly do setup only during init,
tag the common PV clock flags/vsyscall variables and their mutators with
__init.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/pvclock.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index b3f81379c2fc..a51adce67f92 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -16,10 +16,10 @@
#include <asm/pvclock.h>
#include <asm/vgtod.h>
-static u8 valid_flags __read_mostly = 0;
-static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
+static u8 valid_flags __ro_after_init = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __ro_after_init;
-void pvclock_set_flags(u8 flags)
+void __init pvclock_set_flags(u8 flags)
{
valid_flags = flags;
}
@@ -153,7 +153,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
set_normalized_timespec64(ts, now.tv_sec, now.tv_nsec);
}
-void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+void __init pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
{
WARN_ON(vclock_was_used(VDSO_CLOCKMODE_PVCLOCK));
pvti_cpu0_va = pvti;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 22/38] x86/pvclock: WARN if pvclock's valid_flags are overwritten
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (20 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 21/38] x86/pvclock: Mark setup helpers and related variables as __init/__ro_after_init Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 23/38] x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during kvmclock_init() Sean Christopherson
` (16 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
WARN if the common PV clock valid_flags are overwritten; all PV clocks
expect that they are the one and only PV clock, i.e. don't guard against
another PV clock having modified the flags.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/pvclock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index a51adce67f92..8d098841a225 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -21,6 +21,7 @@ static struct pvclock_vsyscall_time_info *pvti_cpu0_va __ro_after_init;
void __init pvclock_set_flags(u8 flags)
{
+ WARN_ON(valid_flags);
valid_flags = flags;
}
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 23/38] x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during kvmclock_init()
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (21 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 22/38] x86/pvclock: WARN if pvclock's valid_flags are overwritten Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 24/38] timekeeping: Resume clocksources before reading persistent clock Sean Christopherson
` (15 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Clean up the setting of PVCLOCK_TSC_STABLE_BIT during kvmclock init to
make it somewhat obvious that pvclock_read_flags() must be called *after*
pvclock_set_flags().
Note, in theory, a different PV clock could have set PVCLOCK_TSC_STABLE_BIT
in the supported flags, i.e. reading flags only if
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT is set could very, very theoretically
result in a change in behavior. In practice, the kernel only supports a
single PV clock.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 934ee4a4c6d4..0580fe1aefa0 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -304,7 +304,7 @@ static void __init kvm_sched_clock_init(bool stable)
void __init kvmclock_init(void)
{
- u8 flags;
+ bool stable = false;
if (!kvm_para_available() || !kvmclock)
return;
@@ -331,11 +331,18 @@ void __init kvmclock_init(void)
kvm_register_clock("primary cpu clock");
pvclock_set_pvti_cpu0_va(hv_clock_boot);
- if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) {
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
- flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
- kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+ /*
+ * Check if the clock is stable *after* marking TSC_STABLE as a
+ * valid flag.
+ */
+ stable = pvclock_read_flags(&hv_clock_boot[0].pvti) &
+ PVCLOCK_TSC_STABLE_BIT;
+ }
+
+ kvm_sched_clock_init(stable);
tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 24/38] timekeeping: Resume clocksources before reading persistent clock
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (22 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 23/38] x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during kvmclock_init() Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 7:48 ` Thomas Gleixner
2025-02-27 2:18 ` [PATCH v2 25/38] x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't sched_clock Sean Christopherson
` (14 subsequent siblings)
38 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
When resuming timekeeping after suspend, restore clocksources prior to
reading the persistent clock. Paravirt clocks, e.g. kvmclock, tie the
validity of a PV persistent clock to a clocksource, i.e. reading the PV
persistent clock will return garbage if the underlying PV clocksource
hasn't been enabled. The flaw has gone unnoticed because kvmclock is a
mess and uses its own suspend/resume hooks instead of the clocksource
suspend/resume hooks, which happens to work by sheer dumb luck (the
kvmclock resume hook runs before timekeeping_resume()).
Note, there is no evidence that any clocksource supported by the kernel
depends on a persistent clock.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
kernel/time/timekeeping.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1e67d076f195..332d053fa9ce 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1794,11 +1794,16 @@ void timekeeping_resume(void)
u64 cycle_now, nsec;
unsigned long flags;
- read_persistent_clock64(&ts_new);
-
clockevents_resume();
clocksource_resume();
+ /*
+ * Read persistent time after clocksources have been resumed. Paravirt
+ * clocks have a nasty habit of piggybacking a persistent clock on a
+ * system clock, and may return garbage if the system clock is suspended.
+ */
+ read_persistent_clock64(&ts_new);
+
raw_spin_lock_irqsave(&tk_core.lock, flags);
/*
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* Re: [PATCH v2 24/38] timekeeping: Resume clocksources before reading persistent clock
2025-02-27 2:18 ` [PATCH v2 24/38] timekeeping: Resume clocksources before reading persistent clock Sean Christopherson
@ 2025-02-27 7:48 ` Thomas Gleixner
0 siblings, 0 replies; 50+ messages in thread
From: Thomas Gleixner @ 2025-02-27 7:48 UTC (permalink / raw)
To: Sean Christopherson, Ingo Molnar, Borislav Petkov, Dave Hansen,
x86, Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
On Wed, Feb 26 2025 at 18:18, Sean Christopherson wrote:
> When resuming timekeeping after suspend, restore clocksources prior to
> reading the persistent clock. Paravirt clocks, e.g. kvmclock, tie the
> validity of a PV persistent clock to a clocksource, i.e. reading the PV
> persistent clock will return garbage if the underlying PV clocksource
> hasn't been enabled. The flaw has gone unnoticed because kvmclock is a
> mess and uses its own suspend/resume hooks instead of the clocksource
> suspend/resume hooks, which happens to work by sheer dumb luck (the
> kvmclock resume hook runs before timekeeping_resume()).
>
> Note, there is no evidence that any clocksource supported by the kernel
> depends on a persistent clock.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v2 25/38] x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't sched_clock
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (23 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 24/38] timekeeping: Resume clocksources before reading persistent clock Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 26/38] x86/kvmclock: WARN if wall clock is read while kvmclock is suspended Sean Christopherson
` (13 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Save/restore kvmclock across suspend/resume via clocksource hooks when
kvmclock isn't being used for sched_clock. This will allow using kvmclock
as a clocksource (or for wallclock!) without also using it for sched_clock.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 0580fe1aefa0..319f8b2d0702 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -130,7 +130,17 @@ static void kvm_setup_secondary_clock(void)
static void kvm_restore_sched_clock_state(void)
{
- kvm_register_clock("primary cpu clock, resume");
+ kvm_register_clock("primary cpu, sched_clock resume");
+}
+
+static void kvmclock_suspend(struct clocksource *cs)
+{
+ kvmclock_disable();
+}
+
+static void kvmclock_resume(struct clocksource *cs)
+{
+ kvm_register_clock("primary cpu, clocksource resume");
}
void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
@@ -201,6 +211,8 @@ static struct clocksource kvm_clock = {
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
.id = CSID_X86_KVM_CLK,
.enable = kvm_cs_enable,
+ .suspend = kvmclock_suspend,
+ .resume = kvmclock_resume,
};
static void __init kvmclock_init_mem(void)
@@ -295,6 +307,15 @@ static void __init kvm_sched_clock_init(bool stable)
__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
kvm_save_sched_clock_state, kvm_restore_sched_clock_state);
+ /*
+ * The BSP's clock is managed via dedicated sched_clock save/restore
+ * hooks when kvmclock is used as sched_clock, as sched_clock needs to
+ * be kept alive until the very end of suspend entry, and restored as
+ * quickly as possible after resume.
+ */
+ kvm_clock.suspend = NULL;
+ kvm_clock.resume = NULL;
+
pr_info("kvm-clock: using sched offset of %llu cycles",
kvm_sched_clock_offset);
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 26/38] x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (24 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 25/38] x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't sched_clock Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 27/38] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock Sean Christopherson
` (12 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
WARN if kvmclock is still suspended when its wallclock is read, i.e. when
the kernel reads its persistent clock. The wallclock subtly depends on
the BSP's kvmclock being enabled, and returns garbage if kvmclock is
disabled.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 319f8b2d0702..0ce23f862cbd 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -52,6 +52,8 @@ static struct pvclock_vsyscall_time_info *hvclock_mem;
DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
EXPORT_PER_CPU_SYMBOL_GPL(hv_clock_per_cpu);
+static bool kvmclock_suspended;
+
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
@@ -59,6 +61,7 @@ EXPORT_PER_CPU_SYMBOL_GPL(hv_clock_per_cpu);
*/
static void kvm_get_wallclock(struct timespec64 *now)
{
+ WARN_ON_ONCE(kvmclock_suspended);
wrmsrl(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
preempt_disable();
pvclock_read_wallclock(&wall_clock, this_cpu_pvti(), now);
@@ -118,6 +121,7 @@ static void kvm_save_sched_clock_state(void)
* to the old address prior to reconfiguring kvmclock would clobber
* random memory.
*/
+ kvmclock_suspended = true;
kvmclock_disable();
}
@@ -130,16 +134,19 @@ static void kvm_setup_secondary_clock(void)
static void kvm_restore_sched_clock_state(void)
{
+ kvmclock_suspended = false;
kvm_register_clock("primary cpu, sched_clock resume");
}
static void kvmclock_suspend(struct clocksource *cs)
{
+ kvmclock_suspended = true;
kvmclock_disable();
}
static void kvmclock_resume(struct clocksource *cs)
{
+ kvmclock_suspended = false;
kvm_register_clock("primary cpu, clocksource resume");
}
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 27/38] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (25 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 26/38] x86/kvmclock: WARN if wall clock is read while kvmclock is suspended Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 28/38] x86/paravirt: Mark __paravirt_set_sched_clock() as __init Sean Christopherson
` (11 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
invoked if and only if kvmclock is set as sched_clock, ensure APs enable
their kvmclock during CPU online. While a redundant write to the MSR is
technically ok, skip the registration when kvmclock is sched_clock so that
it's somewhat obvious that kvmclock *needs* to be enabled during early
bringup when it's being used as sched_clock.
Plumb in the BSP's resume path purely for documentation purposes. Both
KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
not super obvious that using KVM's hooks would be flawed. E.g. it would
work today, because KVM's hooks happen to run after/before timekeeping's
hooks during suspend/resume, but that's sheer dumb luck as the order in
which syscore_ops are invoked depends entirely on when a subsystem is
initialized and thus registers its hooks.
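For background, a minimal sketch (hypothetical "foo" subsystem, not from
this patch) of why that ordering is fragile: syscore hooks are simply
appended to a list at registration time, so their invocation order falls
out of subsystem init order.

	#include <linux/syscore_ops.h>

	static struct syscore_ops foo_syscore_ops = {
		.suspend = foo_suspend,		/* int foo_suspend(void) */
		.resume = foo_resume,		/* void foo_resume(void) */
	};

	/* Runs before/after other hooks based purely on when this executes. */
	register_syscore_ops(&foo_syscore_ops);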
Opportunistically make the registration messages more precise to help
debug issues where kvmclock is enabled too late.
Opportunistically WARN in kvmclock_{suspend,resume}() to harden against
future bugs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_para.h | 2 ++
arch/x86/kernel/kvm.c | 24 +++++++++++-------
arch/x86/kernel/kvmclock.c | 44 +++++++++++++++++++++++++++------
3 files changed, 53 insertions(+), 17 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 8708598f5b8e..42d90bf8fde9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -120,6 +120,8 @@ static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
#ifdef CONFIG_KVM_GUEST
enum kvm_guest_cpu_action {
KVM_GUEST_BSP_SUSPEND,
+ KVM_GUEST_BSP_RESUME,
+ KVM_GUEST_AP_ONLINE,
KVM_GUEST_AP_OFFLINE,
KVM_GUEST_SHUTDOWN,
};
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 866b061ee0d9..5f093190df17 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -461,18 +461,24 @@ static void kvm_guest_cpu_offline(enum kvm_guest_cpu_action action)
kvmclock_cpu_action(action);
}
+static int __kvm_cpu_online(unsigned int cpu, enum kvm_guest_cpu_action action)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ kvmclock_cpu_action(action);
+ kvm_guest_cpu_init();
+ local_irq_restore(flags);
+ return 0;
+}
+
+#ifdef CONFIG_SMP
+
static int kvm_cpu_online(unsigned int cpu)
{
- unsigned long flags;
-
- local_irq_save(flags);
- kvm_guest_cpu_init();
- local_irq_restore(flags);
- return 0;
+ return __kvm_cpu_online(cpu, KVM_GUEST_AP_ONLINE);
}
-#ifdef CONFIG_SMP
-
static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
static bool pv_tlb_flush_supported(void)
@@ -737,7 +743,7 @@ static int kvm_suspend(void)
static void kvm_resume(void)
{
- kvm_cpu_online(raw_smp_processor_id());
+ __kvm_cpu_online(raw_smp_processor_id(), KVM_GUEST_BSP_RESUME);
#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL) && has_guest_poll)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 0ce23f862cbd..76884dfc77f4 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -52,6 +52,7 @@ static struct pvclock_vsyscall_time_info *hvclock_mem;
DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
EXPORT_PER_CPU_SYMBOL_GPL(hv_clock_per_cpu);
+static bool kvmclock_is_sched_clock;
static bool kvmclock_suspended;
/*
@@ -128,7 +129,7 @@ static void kvm_save_sched_clock_state(void)
#ifdef CONFIG_SMP
static void kvm_setup_secondary_clock(void)
{
- kvm_register_clock("secondary cpu clock");
+ kvm_register_clock("secondary cpu, sched_clock setup");
}
#endif
@@ -140,25 +141,51 @@ static void kvm_restore_sched_clock_state(void)
static void kvmclock_suspend(struct clocksource *cs)
{
+ if (WARN_ON_ONCE(kvmclock_is_sched_clock))
+ return;
+
kvmclock_suspended = true;
kvmclock_disable();
}
static void kvmclock_resume(struct clocksource *cs)
{
+ if (WARN_ON_ONCE(kvmclock_is_sched_clock))
+ return;
+
kvmclock_suspended = false;
kvm_register_clock("primary cpu, clocksource resume");
}
void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
{
- /*
- * Don't disable kvmclock on the BSP during suspend. If kvmclock is
- * being used for sched_clock, then it needs to be kept alive until the
- * last minute, and restored as quickly as possible after resume.
- */
- if (action != KVM_GUEST_BSP_SUSPEND)
+ switch (action) {
+ /*
+ * The BSP's clock is managed via clocksource suspend/resume,
+ * to ensure it's enabled/disabled when timekeeping needs it
+ * to be, e.g. before reading wallclock (which uses kvmclock).
+ */
+ case KVM_GUEST_BSP_SUSPEND:
+ case KVM_GUEST_BSP_RESUME:
+ break;
+ case KVM_GUEST_AP_ONLINE:
+ /*
+ * Secondary CPUs use dedicated sched_clock hooks to enable
+ * kvmclock early during bringup, i.e. there's nothing to be done
+ * during CPU online.
+ */
+ if (kvmclock_is_sched_clock)
+ break;
+ kvm_register_clock("secondary cpu, online");
+ break;
+ case KVM_GUEST_AP_OFFLINE:
+ case KVM_GUEST_SHUTDOWN:
kvmclock_disable();
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
}
/*
@@ -313,6 +340,7 @@ static void __init kvm_sched_clock_init(bool stable)
kvm_sched_clock_offset = kvm_clock_read();
__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
kvm_save_sched_clock_state, kvm_restore_sched_clock_state);
+ kvmclock_is_sched_clock = true;
/*
* The BSP's clock is managed via dedicated sched_clock save/restore
@@ -356,7 +384,7 @@ void __init kvmclock_init(void)
msr_kvm_system_time, msr_kvm_wall_clock);
this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
- kvm_register_clock("primary cpu clock");
+ kvm_register_clock("primary cpu, online");
pvclock_set_pvti_cpu0_va(hv_clock_boot);
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) {
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 28/38] x86/paravirt: Mark __paravirt_set_sched_clock() as __init
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (26 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 27/38] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 29/38] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock() Sean Christopherson
` (10 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Annotate __paravirt_set_sched_clock() as __init, and make its wrapper
__always_inline to ensure sanitizers don't result in a non-inline version
hanging around. All callers run during __init, and changing sched_clock
after boot would be all kinds of crazy.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/paravirt.h | 9 +++++----
arch/x86/kernel/paravirt.c | 4 ++--
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dc26a3c26527..e6d5e77753c4 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -28,11 +28,12 @@ u64 dummy_sched_clock(void);
DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
-void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
- void (*save)(void), void (*restore)(void));
+void __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
+ void (*save)(void), void (*restore)(void));
-static inline void paravirt_set_sched_clock(u64 (*func)(void),
- void (*save)(void), void (*restore)(void))
+static __always_inline void paravirt_set_sched_clock(u64 (*func)(void),
+ void (*save)(void),
+ void (*restore)(void))
{
__paravirt_set_sched_clock(func, true, save, restore);
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 9673cd3a3f0a..92bf831a63b1 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -86,8 +86,8 @@ static u64 native_steal_clock(int cpu)
DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
-void __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
- void (*save)(void), void (*restore)(void))
+void __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
+ void (*save)(void), void (*restore)(void))
{
if (!stable)
clear_sched_clock_stable();
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 29/38] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (27 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 28/38] x86/paravirt: Mark __paravirt_set_sched_clock() as __init Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 30/38] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC Sean Christopherson
` (9 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Add a return code to __paravirt_set_sched_clock() so that the kernel can
reject attempts to use a PV sched_clock without breaking the caller. E.g.
when running as a CoCo VM with a secure TSC, using a PV clock is generally
undesirable.
Note, kvmclock is the only PV clock that does anything "extra" beyond
simply registering itself as sched_clock, i.e. is the only caller that
needs to check the new return value.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/paravirt.h | 6 +++---
arch/x86/kernel/kvmclock.c | 7 +++++--
arch/x86/kernel/paravirt.c | 5 +++--
3 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index e6d5e77753c4..5de31b22aa5f 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -28,14 +28,14 @@ u64 dummy_sched_clock(void);
DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
-void __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
- void (*save)(void), void (*restore)(void));
+int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
+ void (*save)(void), void (*restore)(void));
static __always_inline void paravirt_set_sched_clock(u64 (*func)(void),
void (*save)(void),
void (*restore)(void))
{
- __paravirt_set_sched_clock(func, true, save, restore);
+ (void)__paravirt_set_sched_clock(func, true, save, restore);
}
static __always_inline u64 paravirt_sched_clock(void)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 76884dfc77f4..1dbe12ecb26e 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -337,9 +337,12 @@ static int kvmclock_setup_percpu(unsigned int cpu)
static void __init kvm_sched_clock_init(bool stable)
{
+ if (__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
+ kvm_save_sched_clock_state,
+ kvm_restore_sched_clock_state))
+ return;
+
kvm_sched_clock_offset = kvm_clock_read();
- __paravirt_set_sched_clock(kvm_sched_clock_read, stable,
- kvm_save_sched_clock_state, kvm_restore_sched_clock_state);
kvmclock_is_sched_clock = true;
/*
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 92bf831a63b1..a3a1359cfc26 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -86,8 +86,8 @@ static u64 native_steal_clock(int cpu)
DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
-void __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
- void (*save)(void), void (*restore)(void))
+int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
+ void (*save)(void), void (*restore)(void))
{
if (!stable)
clear_sched_clock_stable();
@@ -95,6 +95,7 @@ void __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
static_call_update(pv_sched_clock, func);
x86_platform.save_sched_clock_state = save;
x86_platform.restore_sched_clock_state = restore;
+ return 0;
}
/* These are in entry.S */
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* [PATCH v2 30/38] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (28 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 29/38] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock() Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 13:17 ` Kirill A. Shutemov
2025-02-27 2:18 ` [PATCH v2 31/38] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration Sean Christopherson
` (8 subsequent siblings)
38 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Silently ignore attempts to switch to a paravirt sched_clock when running
as a CoCo guest with trusted TSC. In hand-wavy theory, a misbehaving
hypervisor could attack the guest by manipulating the PV clock to affect
guest scheduling in some weird and/or predictable way. More importantly,
reading TSC on such platforms is faster than any PV clock, and sched_clock
is all about speed.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/paravirt.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a3a1359cfc26..c538c608d9fb 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -89,6 +89,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
void (*save)(void), void (*restore)(void))
{
+ /*
+ * Don't replace TSC with a PV clock when running as a CoCo guest and
+ * the TSC is secure/trusted; PV clocks are emulated by the hypervisor,
+ * which isn't in the guest's TCB.
+ */
+ if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC) ||
+ boot_cpu_has(X86_FEATURE_TDX_GUEST))
+ return -EPERM;
+
if (!stable)
clear_sched_clock_stable();
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* Re: [PATCH v2 30/38] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC
2025-02-27 2:18 ` [PATCH v2 30/38] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC Sean Christopherson
@ 2025-02-27 13:17 ` Kirill A. Shutemov
0 siblings, 0 replies; 50+ messages in thread
From: Kirill A. Shutemov @ 2025-02-27 13:17 UTC (permalink / raw)
To: Sean Christopherson
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Paolo Bonzini, Juergen Gross, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz, linux-kernel,
linux-coco, kvm, virtualization, linux-hyperv, xen-devel,
Tom Lendacky, Nikunj A Dadhania
On Wed, Feb 26, 2025 at 06:18:46PM -0800, Sean Christopherson wrote:
> Silently ignore attempts to switch to a paravirt sched_clock when running
> as a CoCo guest with trusted TSC. In hand-wavy theory, a misbehaving
> hypervisor could attack the guest by manipulating the PV clock to affect
> guest scheduling in some weird and/or predictable way. More importantly,
> reading TSC on such platforms is faster than any PV clock, and sched_clock
> is all about speed.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kernel/paravirt.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index a3a1359cfc26..c538c608d9fb 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -89,6 +89,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
> int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
> void (*save)(void), void (*restore)(void))
> {
> + /*
> + * Don't replace TSC with a PV clock when running as a CoCo guest and
> + * the TSC is secure/trusted; PV clocks are emulated by the hypervisor,
> + * which isn't in the guest's TCB.
> + */
> + if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC) ||
> + boot_cpu_has(X86_FEATURE_TDX_GUEST))
> + return -EPERM;
> +
Looks like a call for generic CC_ATTR_GUEST_SECURE_TSC that would be true
for TDX and SEV with CC_ATTR_GUEST_SNP_SECURE_TSC.
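Something along these lines, perhaps (untested sketch; until a generic
attribute exists, this just wraps the two vendor checks):

	static bool cc_guest_has_secure_tsc(void)
	{
		return cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC) ||
		       boot_cpu_has(X86_FEATURE_TDX_GUEST);
	}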
> if (!stable)
> clear_sched_clock_stable();
>
> --
> 2.48.1.711.g2feabab25a-goog
>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v2 31/38] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (29 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 30/38] x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted TSC Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-03-04 4:18 ` Michael Kelley
2025-02-27 2:18 ` [PATCH v2 32/38] x86/tsc: Reject attempts to override TSC calibration with a lesser routine Sean Christopherson
` (7 subsequent siblings)
38 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Add a "tsc_properties" set of flags and use it to annotate whether the
TSC operates at a known and/or reliable frequency when registering a
paravirtual TSC calibration routine. Currently, each PV flow manually
sets the associated feature flags, but often in haphazard fashion that
makes it difficult for unfamiliar readers to see the properties of the
TSC when running under a particular hypervisor.
The other, bigger issue with manually setting the feature flags is that
it decouples the flags from the calibration routine. E.g. in theory, PV
code could mark the TSC as having a known frequency, but then have its
PV calibration discarded in favor of a method that doesn't use that known
frequency. Passing the TSC properties along with the calibration routine
will allow adding sanity checks to guard against replacing a "better"
calibration routine with a "worse" routine.
As a bonus, the flags also give developers working on new PV code a heads
up that they should at least mark the TSC as having a known frequency.
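E.g. a hypothetical new "foo" hypervisor with a paravirtual frequency
interface would then register along the lines of:

	tsc_register_calibration_routines(foo_get_tsc_khz, NULL,
					  TSC_FREQUENCY_KNOWN);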
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 6 ++----
arch/x86/coco/tdx/tdx.c | 7 ++-----
arch/x86/include/asm/tsc.h | 8 +++++++-
arch/x86/kernel/cpu/acrn.c | 4 ++--
arch/x86/kernel/cpu/mshyperv.c | 10 +++++++---
arch/x86/kernel/cpu/vmware.c | 7 ++++---
arch/x86/kernel/jailhouse.c | 4 ++--
arch/x86/kernel/kvmclock.c | 4 ++--
arch/x86/kernel/tsc.c | 8 +++++++-
arch/x86/xen/time.c | 4 ++--
10 files changed, 37 insertions(+), 25 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index dab386f782ce..29dd50552715 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -3284,12 +3284,10 @@ void __init snp_secure_tsc_init(void)
{
unsigned long long tsc_freq_mhz;
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
-
rdmsrl(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
snp_tsc_freq_khz = (unsigned long)(tsc_freq_mhz * 1000);
tsc_register_calibration_routines(securetsc_get_tsc_khz,
- securetsc_get_tsc_khz);
+ securetsc_get_tsc_khz,
+ TSC_FREQ_KNOWN_AND_RELIABLE);
}
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 42cdaa98dc5e..ca31560d0dd3 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -1135,14 +1135,11 @@ static unsigned long tdx_get_tsc_khz(void)
void __init tdx_tsc_init(void)
{
- /* TSC is the only reliable clock in TDX guest */
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
-
/*
* Override the PV calibration routines (if set) with more trustworthy
* CPUID-based calibration. The TDX module emulates CPUID, whereas any
* PV information is provided by the hypervisor.
*/
- tsc_register_calibration_routines(tdx_get_tsc_khz, NULL);
+ tsc_register_calibration_routines(tdx_get_tsc_khz, NULL,
+ TSC_FREQ_KNOWN_AND_RELIABLE);
}
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 9318c74e8d13..360f47610258 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -41,8 +41,14 @@ extern int cpuid_get_cpu_freq(unsigned int *cpu_khz);
extern void tsc_early_init(void);
extern void tsc_init(void);
#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
+enum tsc_properties {
+ TSC_FREQUENCY_KNOWN = BIT(0),
+ TSC_RELIABLE = BIT(1),
+ TSC_FREQ_KNOWN_AND_RELIABLE = TSC_FREQUENCY_KNOWN | TSC_RELIABLE,
+};
extern void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
- unsigned long (*calibrate_cpu)(void));
+ unsigned long (*calibrate_cpu)(void),
+ enum tsc_properties properties);
#endif
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 2da3de4d470e..4f2f4f7ec334 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -29,9 +29,9 @@ static void __init acrn_init_platform(void)
/* Install system interrupt handler for ACRN hypervisor callback */
sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
tsc_register_calibration_routines(acrn_get_tsc_khz,
- acrn_get_tsc_khz);
+ acrn_get_tsc_khz,
+ TSC_FREQUENCY_KNOWN);
}
static bool acrn_x2apic_available(void)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 174f6a71c899..445ac3adfebc 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -421,8 +421,13 @@ static void __init ms_hyperv_init_platform(void)
if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
- tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
+
+ if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
+ tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
+
+ tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
+ tsc_properties);
}
if (ms_hyperv.priv_high & HV_ISOLATION) {
@@ -525,7 +530,6 @@ static void __init ms_hyperv_init_platform(void)
* is called.
*/
wrmsrl(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
}
/*
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 399cf3286a60..a3a71309214c 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -385,10 +385,10 @@ static void __init vmware_paravirt_ops_setup(void)
*/
static void __init vmware_set_capabilities(void)
{
+ /* TSC is non-stop and reliable even if the frequency isn't known. */
setup_force_cpu_cap(X86_FEATURE_CONSTANT_TSC);
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
- if (vmware_tsc_khz)
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMCALL)
setup_force_cpu_cap(X86_FEATURE_VMCALL);
else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
@@ -417,7 +417,8 @@ static void __init vmware_platform_setup(void)
vmware_tsc_khz = tsc_khz;
tsc_register_calibration_routines(vmware_get_tsc_khz,
- vmware_get_tsc_khz);
+ vmware_get_tsc_khz,
+ TSC_FREQ_KNOWN_AND_RELIABLE);
#ifdef CONFIG_X86_LOCAL_APIC
/* Skip lapic calibration since we know the bus frequency. */
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index b0a053692161..d73a4d0fb118 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -218,7 +218,8 @@ static void __init jailhouse_init_platform(void)
machine_ops.emergency_restart = jailhouse_no_restart;
- tsc_register_calibration_routines(jailhouse_get_tsc, jailhouse_get_tsc);
+ tsc_register_calibration_routines(jailhouse_get_tsc, jailhouse_get_tsc,
+ TSC_FREQUENCY_KNOWN);
while (pa_data) {
mapping = early_memremap(pa_data, sizeof(header));
@@ -256,7 +257,6 @@ static void __init jailhouse_init_platform(void)
pr_debug("Jailhouse: PM-Timer IO Port: %#x\n", pmtmr_ioport);
precalibrated_tsc_khz = setup_data.v1.tsc_khz;
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
pci_probe = 0;
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1dbe12ecb26e..ce676e735ced 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -199,7 +199,6 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
*/
static unsigned long kvm_get_tsc_khz(void)
{
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
return pvclock_tsc_khz(this_cpu_pvti());
}
@@ -403,7 +402,8 @@ void __init kvmclock_init(void)
kvm_sched_clock_init(stable);
- tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
+ tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz,
+ TSC_FREQUENCY_KNOWN);
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 5501d76243c8..be58df4fef66 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1301,11 +1301,17 @@ static void __init check_system_tsc_reliable(void)
*/
#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
- unsigned long (*calibrate_cpu)(void))
+ unsigned long (*calibrate_cpu)(void),
+ enum tsc_properties properties)
{
if (WARN_ON_ONCE(!calibrate_tsc))
return;
+ if (properties & TSC_FREQUENCY_KNOWN)
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ if (properties & TSC_RELIABLE)
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
x86_platform.calibrate_tsc = calibrate_tsc;
if (calibrate_cpu)
x86_platform.calibrate_cpu = calibrate_cpu;
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 13e5888c4501..4de06ea55397 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -40,7 +40,6 @@ static unsigned long xen_tsc_khz(void)
struct pvclock_vcpu_time_info *info =
&HYPERVISOR_shared_info->vcpu_info[0].time;
- setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
return pvclock_tsc_khz(info);
}
@@ -571,7 +570,8 @@ static void __init xen_init_time_common(void)
*/
paravirt_set_sched_clock(xen_sched_clock, NULL, NULL);
- tsc_register_calibration_routines(xen_tsc_khz, NULL);
+ tsc_register_calibration_routines(xen_tsc_khz, NULL,
+ TSC_FREQUENCY_KNOWN);
x86_platform.get_wallclock = xen_get_wallclock;
}
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread* RE: [PATCH v2 31/38] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
2025-02-27 2:18 ` [PATCH v2 31/38] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration Sean Christopherson
@ 2025-03-04 4:18 ` Michael Kelley
0 siblings, 0 replies; 50+ messages in thread
From: Michael Kelley @ 2025-03-04 4:18 UTC (permalink / raw)
To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86@kernel.org, Kirill A. Shutemov,
Paolo Bonzini, Juergen Gross, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
kvm@vger.kernel.org, virtualization@lists.linux.dev,
linux-hyperv@vger.kernel.org, xen-devel@lists.xenproject.org,
Tom Lendacky, Nikunj A Dadhania
From: Sean Christopherson <seanjc@google.com> Sent: Wednesday, February 26, 2025 6:19 PM
>
> Add a "tsc_properties" set of flags and use it to annotate whether the
> TSC operates at a known and/or reliable frequency when registering a
> paravirtual TSC calibration routine. Currently, each PV flow manually
> sets the associated feature flags, but often in haphazard fashion that
> makes it difficult for unfamiliar readers to see the properties of the
> TSC when running under a particular hypervisor.
>
> The other, bigger issue with manually setting the feature flags is that
> it decouples the flags from the calibration routine. E.g. in theory, PV
> code could mark the TSC as having a known frequency, but then have its
> PV calibration discarded in favor of a method that doesn't use that known
> frequency. Passing the TSC properties along with the calibration routine
> will allow adding sanity checks to guard against replacing a "better"
> calibration routine with a "worse" routine.
>
> As a bonus, the flags also give developers working on new PV code a heads
> up that they should at least mark the TSC as having a known frequency.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/coco/sev/core.c | 6 ++----
> arch/x86/coco/tdx/tdx.c | 7 ++-----
> arch/x86/include/asm/tsc.h | 8 +++++++-
> arch/x86/kernel/cpu/acrn.c | 4 ++--
> arch/x86/kernel/cpu/mshyperv.c | 10 +++++++---
> arch/x86/kernel/cpu/vmware.c | 7 ++++---
> arch/x86/kernel/jailhouse.c | 4 ++--
> arch/x86/kernel/kvmclock.c | 4 ++--
> arch/x86/kernel/tsc.c | 8 +++++++-
> arch/x86/xen/time.c | 4 ++--
> 10 files changed, 37 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
> index dab386f782ce..29dd50552715 100644
> --- a/arch/x86/coco/sev/core.c
> +++ b/arch/x86/coco/sev/core.c
> @@ -3284,12 +3284,10 @@ void __init snp_secure_tsc_init(void)
> {
> unsigned long long tsc_freq_mhz;
>
> - setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> - setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> -
> rdmsrl(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
> snp_tsc_freq_khz = (unsigned long)(tsc_freq_mhz * 1000);
>
> tsc_register_calibration_routines(securetsc_get_tsc_khz,
> - securetsc_get_tsc_khz);
> + securetsc_get_tsc_khz,
> + TSC_FREQ_KNOWN_AND_RELIABLE);
> }
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 42cdaa98dc5e..ca31560d0dd3 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -1135,14 +1135,11 @@ static unsigned long tdx_get_tsc_khz(void)
>
> void __init tdx_tsc_init(void)
> {
> - /* TSC is the only reliable clock in TDX guest */
> - setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> - setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> -
> /*
> * Override the PV calibration routines (if set) with more trustworthy
> * CPUID-based calibration. The TDX module emulates CPUID, whereas any
> * PV information is provided by the hypervisor.
> */
> - tsc_register_calibration_routines(tdx_get_tsc_khz, NULL);
> + tsc_register_calibration_routines(tdx_get_tsc_khz, NULL,
> + TSC_FREQ_KNOWN_AND_RELIABLE);
> }
> diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
> index 9318c74e8d13..360f47610258 100644
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -41,8 +41,14 @@ extern int cpuid_get_cpu_freq(unsigned int *cpu_khz);
> extern void tsc_early_init(void);
> extern void tsc_init(void);
> #if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
> +enum tsc_properties {
> + TSC_FREQUENCY_KNOWN = BIT(0),
> + TSC_RELIABLE = BIT(1),
> + TSC_FREQ_KNOWN_AND_RELIABLE = TSC_FREQUENCY_KNOWN | TSC_RELIABLE,
> +};
> extern void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
> - unsigned long (*calibrate_cpu)(void));
> + unsigned long (*calibrate_cpu)(void),
> + enum tsc_properties properties);
> #endif
> extern void mark_tsc_unstable(char *reason);
> extern int unsynchronized_tsc(void);
> diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> index 2da3de4d470e..4f2f4f7ec334 100644
> --- a/arch/x86/kernel/cpu/acrn.c
> +++ b/arch/x86/kernel/cpu/acrn.c
> @@ -29,9 +29,9 @@ static void __init acrn_init_platform(void)
> /* Install system interrupt handler for ACRN hypervisor callback */
> sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
>
> - setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> tsc_register_calibration_routines(acrn_get_tsc_khz,
> - acrn_get_tsc_khz);
> + acrn_get_tsc_khz,
> + TSC_FREQUENCY_KNOWN);
> }
>
> static bool acrn_x2apic_available(void)
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 174f6a71c899..445ac3adfebc 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -421,8 +421,13 @@ static void __init ms_hyperv_init_platform(void)
>
> if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> - tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> - setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> + enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
> +
> + if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> + tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> +
> + tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
> + tsc_properties);
> }
For the Hyper-V guest code,
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
>
> if (ms_hyperv.priv_high & HV_ISOLATION) {
> @@ -525,7 +530,6 @@ static void __init ms_hyperv_init_platform(void)
> * is called.
> */
> wrmsrl(HV_X64_MSR_TSC_INVARIANT_CONTROL,
> HV_EXPOSE_INVARIANT_TSC);
> - setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> }
>
> /*
> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
> index 399cf3286a60..a3a71309214c 100644
> --- a/arch/x86/kernel/cpu/vmware.c
> +++ b/arch/x86/kernel/cpu/vmware.c
> @@ -385,10 +385,10 @@ static void __init vmware_paravirt_ops_setup(void)
> */
> static void __init vmware_set_capabilities(void)
> {
> + /* TSC is non-stop and reliable even if the frequency isn't known. */
> setup_force_cpu_cap(X86_FEATURE_CONSTANT_TSC);
> setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> - if (vmware_tsc_khz)
> - setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> +
> if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMCALL)
> setup_force_cpu_cap(X86_FEATURE_VMCALL);
> else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
> @@ -417,7 +417,8 @@ static void __init vmware_platform_setup(void)
>
> vmware_tsc_khz = tsc_khz;
> tsc_register_calibration_routines(vmware_get_tsc_khz,
> - vmware_get_tsc_khz);
> + vmware_get_tsc_khz,
> + TSC_FREQ_KNOWN_AND_RELIABLE);
>
> #ifdef CONFIG_X86_LOCAL_APIC
> /* Skip lapic calibration since we know the bus frequency. */
> diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
> index b0a053692161..d73a4d0fb118 100644
> --- a/arch/x86/kernel/jailhouse.c
> +++ b/arch/x86/kernel/jailhouse.c
> @@ -218,7 +218,8 @@ static void __init jailhouse_init_platform(void)
>
> machine_ops.emergency_restart = jailhouse_no_restart;
>
> - tsc_register_calibration_routines(jailhouse_get_tsc, jailhouse_get_tsc);
> + tsc_register_calibration_routines(jailhouse_get_tsc, jailhouse_get_tsc,
> + TSC_FREQUENCY_KNOWN);
>
> while (pa_data) {
> mapping = early_memremap(pa_data, sizeof(header));
> @@ -256,7 +257,6 @@ static void __init jailhouse_init_platform(void)
> pr_debug("Jailhouse: PM-Timer IO Port: %#x\n", pmtmr_ioport);
>
> precalibrated_tsc_khz = setup_data.v1.tsc_khz;
> - setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>
> pci_probe = 0;
>
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 1dbe12ecb26e..ce676e735ced 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -199,7 +199,6 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
> */
> static unsigned long kvm_get_tsc_khz(void)
> {
> - setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> return pvclock_tsc_khz(this_cpu_pvti());
> }
>
> @@ -403,7 +402,8 @@ void __init kvmclock_init(void)
>
> kvm_sched_clock_init(stable);
>
> - tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
> + tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz,
> + TSC_FREQUENCY_KNOWN);
>
> x86_platform.get_wallclock = kvm_get_wallclock;
> x86_platform.set_wallclock = kvm_set_wallclock;
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 5501d76243c8..be58df4fef66 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1301,11 +1301,17 @@ static void __init check_system_tsc_reliable(void)
> */
> #if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
> void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
> - unsigned long (*calibrate_cpu)(void))
> + unsigned long (*calibrate_cpu)(void),
> + enum tsc_properties properties)
> {
> if (WARN_ON_ONCE(!calibrate_tsc))
> return;
>
> + if (properties & TSC_FREQUENCY_KNOWN)
> + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> + if (properties & TSC_RELIABLE)
> + setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> +
> x86_platform.calibrate_tsc = calibrate_tsc;
> if (calibrate_cpu)
> x86_platform.calibrate_cpu = calibrate_cpu;
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 13e5888c4501..4de06ea55397 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -40,7 +40,6 @@ static unsigned long xen_tsc_khz(void)
> struct pvclock_vcpu_time_info *info =
> &HYPERVISOR_shared_info->vcpu_info[0].time;
>
> - setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> return pvclock_tsc_khz(info);
> }
>
> @@ -571,7 +570,8 @@ static void __init xen_init_time_common(void)
> */
> paravirt_set_sched_clock(xen_sched_clock, NULL, NULL);
>
> - tsc_register_calibration_routines(xen_tsc_khz, NULL);
> + tsc_register_calibration_routines(xen_tsc_khz, NULL,
> + TSC_FREQUENCY_KNOWN);
> x86_platform.get_wallclock = xen_get_wallclock;
> }
>
> --
> 2.48.1.711.g2feabab25a-goog
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH v2 32/38] x86/tsc: Reject attempts to override TSC calibration with lesser routine
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (30 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 31/38] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 33/38] x86/kvmclock: Mark TSC as reliable when it's constant and nonstop Sean Christopherson
` (6 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
When registering a TSC frequency calibration routine, sanity check that
the incoming routine is as robust as the outgoing routine, and reject the
incoming routine if the sanity check fails.
Because native calibration routines only mark the TSC frequency as known
and reliable when they actually run, the effective progression of
capabilities is: None (native) => Known and maybe Reliable (PV) =>
Known and Reliable (CoCo). Violating that progression for a PV override
is relatively benign, but messing up the progression when CoCo is
involved is more problematic, as it likely means a trusted source of
information (hardware/firmware) is being discarded in favor of a less
trusted source (hypervisor).
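As a concrete example of the intended ordering (a sketch with made-up
function names, not code added by this series):

  /* CoCo code runs early and registers a fully trusted routine. */
  tsc_register_calibration_routines(coco_get_tsc_khz, NULL,
                                    TSC_FREQ_KNOWN_AND_RELIABLE);

  /*
   * A later PV registration with lesser properties now fails the
   * sanity check: X86_FEATURE_TSC_RELIABLE is already forced, so
   * the WARN_ON fires and the PV routine is rejected before it can
   * clobber x86_platform.calibrate_tsc.
   */
  tsc_register_calibration_routines(pv_get_tsc_khz, pv_get_cpu_khz,
                                    TSC_FREQUENCY_KNOWN);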
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/tsc.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index be58df4fef66..ebcfaf7dcd38 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1309,8 +1309,13 @@ void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
if (properties & TSC_FREQUENCY_KNOWN)
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ else if (WARN_ON(boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)))
+ return;
+
if (properties & TSC_RELIABLE)
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+ else if (WARN_ON(boot_cpu_has(X86_FEATURE_TSC_RELIABLE)))
+ return;
x86_platform.calibrate_tsc = calibrate_tsc;
if (calibrate_cpu)
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 33/38] x86/kvmclock: Mark TSC as reliable when it's constant and nonstop
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (31 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 32/38] x86/tsc: Reject attempts to override TSC calibration with lesser routine Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 34/38] x86/kvmclock: Get CPU base frequency from CPUID when it's available Sean Christopherson
` (5 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Mark the TSC as reliable if the hypervisor (KVM) has enumerated the TSC
as constant and nonstop, and the admin hasn't explicitly marked the TSC
as unstable. As in most (all?) virtualization setups, any secondary
clocksource that's used as a watchdog is all but guaranteed to be less
reliable than a constant, nonstop TSC, as every clocksource the kernel
uses as a watchdog is virtually certain to be emulated when running as
a KVM guest. I.e. any observed discrepancy between the TSC and the
watchdog will be due to jitter in the watchdog.
This is especially true for KVM, as the watchdog clocksource is usually
emulated in host userspace, i.e. reading the clock incurs a roundtrip
cost of thousands of cycles.
Marking the TSC reliable addresses a flaw where the TSC will occasionally
be marked unstable if the host is under moderate/heavy load.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ce676e735ced..b924b19e8f0f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -362,6 +362,7 @@ static void __init kvm_sched_clock_init(bool stable)
void __init kvmclock_init(void)
{
+ enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
bool stable = false;
if (!kvm_para_available() || !kvmclock)
@@ -400,18 +401,6 @@ void __init kvmclock_init(void)
PVCLOCK_TSC_STABLE_BIT;
}
- kvm_sched_clock_init(stable);
-
- tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz,
- TSC_FREQUENCY_KNOWN);
-
- x86_platform.get_wallclock = kvm_get_wallclock;
- x86_platform.set_wallclock = kvm_set_wallclock;
-#ifdef CONFIG_SMP
- x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
-#endif
- kvm_get_preset_lpj();
-
/*
* X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
* with P/T states and does not stop in deep C-states.
@@ -422,8 +411,22 @@ void __init kvmclock_init(void)
*/
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
- !check_tsc_unstable())
+ !check_tsc_unstable()) {
kvm_clock.rating = 299;
+ tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
+ }
+
+ kvm_sched_clock_init(stable);
+
+ tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz,
+ tsc_properties);
+
+ x86_platform.get_wallclock = kvm_get_wallclock;
+ x86_platform.set_wallclock = kvm_set_wallclock;
+#ifdef CONFIG_SMP
+ x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
+#endif
+ kvm_get_preset_lpj();
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
pv_info.name = "KVM";
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 34/38] x86/kvmclock: Get CPU base frequency from CPUID when it's available
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (32 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 33/38] x86/kvmclock: Mark TSC as reliable when it's constant and nonstop Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 35/38] x86/kvmclock: Get TSC frequency from CPUID when it's available Sean Christopherson
` (4 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
If CPUID.0x16 is present and valid, use the CPU frequency provided by
CPUID instead of assuming that the virtual CPU runs at the same
frequency as TSC and/or kvmclock. Back before constant TSCs were a
thing, treating the TSC and CPU frequencies as one and the same was
somewhat reasonable, but now it's nonsensical, especially if the
hypervisor explicitly enumerates the CPU frequency.
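For reference, the CPUID.0x16 lookup boils down to something like the
below (a minimal sketch; the actual helper is added earlier in this
series and may differ in detail, but CPUID.0x16 EAX architecturally
reports the base frequency in MHz):

  int cpuid_get_cpu_freq(unsigned int *cpu_khz)
  {
      unsigned int base_mhz, ebx, ecx, edx;

      if (boot_cpu_data.cpuid_level < 0x16)
          return -ENOENT;

      /* CPUID.0x16: EAX = processor base frequency, in MHz. */
      cpuid(0x16, &base_mhz, &ebx, &ecx, &edx);
      if (!base_mhz)
          return -ENOENT;

      *cpu_khz = base_mhz * 1000;
      return 0;
  }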
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index b924b19e8f0f..c45b321533e5 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -188,6 +188,20 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
}
}
+static unsigned long kvm_get_cpu_khz(void)
+{
+ unsigned int cpu_khz;
+
+ /*
+ * Prefer CPUID over kvmclock when possible, as the base CPU frequency
> + * isn't necessarily the same as the kvmclock "TSC" frequency.
+ */
+ if (!cpuid_get_cpu_freq(&cpu_khz))
+ return cpu_khz;
+
+ return pvclock_tsc_khz(this_cpu_pvti());
+}
+
/*
* If we don't do that, there is the possibility that the guest
* will calibrate under heavy load - thus, getting a lower lpj -
@@ -418,7 +432,7 @@ void __init kvmclock_init(void)
kvm_sched_clock_init(stable);
- tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz,
+ tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_cpu_khz,
tsc_properties);
x86_platform.get_wallclock = kvm_get_wallclock;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 35/38] x86/kvmclock: Get TSC frequency from CPUID when it's available
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (33 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 34/38] x86/kvmclock: Get CPU base frequency from CPUID when it's available Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 36/38] x86/kvmclock: Stuff local APIC bus period when core crystal freq comes from CPUID Sean Christopherson
` (3 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
When kvmclock and CPUID.0x15 are both present, use the TSC frequency from
CPUID.0x15 instead of kvmclock's frequency. Barring a misconfigured
setup, both sources should provide the same frequency, CPUID.0x15 is
arguably a better source when using the TSC over kvmclock, and most
importantly, using CPUID.0x15 will allow stuffing the local APIC timer
frequency based on the core crystal frequency, i.e. will allow skipping
APIC timer calibration.
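E.g. with a 25 MHz core crystal and a 188/2 TSC/crystal ratio (numbers
purely for illustration), CPUID.0x15 gives tsc_khz = 25000 * 188 / 2 =
2350000 kHz, i.e. a 2.35 GHz TSC, which should match the frequency
enumerated by kvmclock on any sanely configured host.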
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index c45b321533e5..3efb837c7406 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -188,6 +188,16 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
}
}
+static unsigned long kvm_get_tsc_khz(void)
+{
+ struct cpuid_tsc_info info;
+
+ if (!cpuid_get_tsc_freq(&info))
+ return info.tsc_khz;
+
+ return pvclock_tsc_khz(this_cpu_pvti());
+}
+
static unsigned long kvm_get_cpu_khz(void)
{
unsigned int cpu_khz;
@@ -211,11 +221,6 @@ static unsigned long kvm_get_cpu_khz(void)
* poll of guests can be running and trouble each other. So we preset
* lpj here
*/
-static unsigned long kvm_get_tsc_khz(void)
-{
- return pvclock_tsc_khz(this_cpu_pvti());
-}
-
static void __init kvm_get_preset_lpj(void)
{
unsigned long khz;
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 36/38] x86/kvmclock: Stuff local APIC bus period when core crystal freq comes from CPUID
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (34 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 35/38] x86/kvmclock: Get TSC frequency from CPUID when it's available Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 37/38] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop Sean Christopherson
` (2 subsequent siblings)
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
When running as a KVM guest with kvmclock support enabled, stuff the APIC
timer period/frequency with the core crystal frequency from CPUID.0x15 (if
CPUID.0x15 is provided). KVM's ABI adheres to Intel's SDM, which states
that the APIC timer runs at the core crystal frequency when said frequency
is enumerated via CPUID.0x15.
  "The APIC timer frequency will be the processor’s bus clock or core
   crystal clock frequency (when TSC/core crystal clock ratio is
   enumerated in CPUID leaf 0x15)."
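E.g. with a 25 MHz core crystal and HZ=250 (numbers purely for
illustration), the stuffed period is 25000 * 1000 / 250 = 100000 APIC
timer ticks per jiffy, and the guest can skip APIC timer calibration
entirely.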
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 3efb837c7406..80d9c86e0671 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -192,8 +192,18 @@ static unsigned long kvm_get_tsc_khz(void)
{
struct cpuid_tsc_info info;
- if (!cpuid_get_tsc_freq(&info))
+ /*
+ * Prefer CPUID over kvmclock when possible, as CPUID also includes the
+ * core crystal frequency, i.e. the APIC timer frequency. When the core
+ * crystal frequency is enumerated in CPUID.0x15, KVM's ABI is that the
+ * (virtual) APIC BUS runs at the same frequency.
+ */
+ if (!cpuid_get_tsc_freq(&info)) {
+#ifdef CONFIG_X86_LOCAL_APIC
+ lapic_timer_period = info.crystal_khz * 1000 / HZ;
+#endif
return info.tsc_khz;
+ }
return pvclock_tsc_khz(this_cpu_pvti());
}
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 37/38] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (35 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 36/38] x86/kvmclock: Stuff local APIC bus period when core crystal freq comes from CPUID Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-27 2:18 ` [PATCH v2 38/38] x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock Sean Christopherson
2025-02-28 11:23 ` [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC David Woodhouse
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Prefer the TSC over kvmclock for sched_clock if the TSC is constant,
nonstop, and not marked unstable via command line. I.e. use the same
criteria as tweaking the clocksource rating so that TSC is preferred over
kvmclock. Per the below comment from native_sched_clock(), sched_clock
is more tolerant of slop than clocksource; using TSC for clocksource but
not sched_clock makes little to no sense, especially now that KVM CoCo
guests with a trusted TSC use TSC, not kvmclock.
/*
* Fall back to jiffies if there's no TSC available:
* ( But note that we still use it if the TSC is marked
* unstable. We do this because unlike Time Of Day,
* the scheduler clock tolerates small errors and it's
* very important for it to be as fast as the platform
* can achieve it. )
*/
The only advantage of using kvmclock is that doing so allows for early
and common detection of PVCLOCK_GUEST_STOPPED, but that code has been
broken for nearly two years with nary a complaint, i.e. it can't be
_that_ valuable. And as above, certain types of KVM guests are losing
the functionality regardless, i.e. acknowledging PVCLOCK_GUEST_STOPPED
needs to be decoupled from sched_clock() no matter what.
Link: https://lore.kernel.org/all/Z4hDK27OV7wK572A@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kernel/kvmclock.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 80d9c86e0671..280bb964f30a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -431,22 +431,22 @@ void __init kvmclock_init(void)
}
/*
- * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
- * with P/T states and does not stop in deep C-states.
- *
- * Invariant TSC exposed by host means kvmclock is not necessary:
- * can use TSC as clocksource.
- *
+ * If the TSC counts at a constant frequency across P/T states, counts
+ * in deep C-states, and the TSC hasn't been marked unstable, prefer
+ * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
+ * that TSC is chosen as the clocksource. Note, the TSC unstable check
+ * exists purely to honor the TSC being marked unstable via command
> + * line; any runtime detection of an unstable TSC will happen after this.
*/
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
!check_tsc_unstable()) {
kvm_clock.rating = 299;
tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
+ } else {
+ kvm_sched_clock_init(stable);
}
- kvm_sched_clock_init(stable);
-
tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_cpu_khz,
tsc_properties);
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH v2 38/38] x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (36 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 37/38] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop Sean Christopherson
@ 2025-02-27 2:18 ` Sean Christopherson
2025-02-28 11:23 ` [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC David Woodhouse
38 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-02-27 2:18 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Sean Christopherson,
Juergen Gross, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
Rework the seemingly generic x86_cpuinit_ops.early_percpu_clock_init hook
into a dedicated PV sched_clock hook, as the only reason the hook exists
is to allow kvmclock to enable its PV clock on secondary CPUs before the
kernel tries to reference sched_clock, e.g. when grabbing a timestamp for
printk.
Rearranging the hook doesn't exactly reduce complexity; arguably it does
the opposite. But as-is, it's practically impossible to understand *why*
kvmclock needs to do early configuration.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/paravirt.h | 10 ++++++++--
arch/x86/include/asm/x86_init.h | 2 --
arch/x86/kernel/kvmclock.c | 13 ++++++-------
arch/x86/kernel/paravirt.c | 18 +++++++++++++++++-
arch/x86/kernel/smpboot.c | 2 +-
arch/x86/kernel/x86_init.c | 1 -
6 files changed, 32 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5de31b22aa5f..8550262fc710 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -29,13 +29,14 @@ DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock);
DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);
int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
- void (*save)(void), void (*restore)(void));
+ void (*save)(void), void (*restore)(void),
> + void (*start_secondary)(void));
static __always_inline void paravirt_set_sched_clock(u64 (*func)(void),
void (*save)(void),
void (*restore)(void))
{
- (void)__paravirt_set_sched_clock(func, true, save, restore);
+ (void)__paravirt_set_sched_clock(func, true, save, restore, NULL);
}
static __always_inline u64 paravirt_sched_clock(void)
@@ -43,6 +44,8 @@ static __always_inline u64 paravirt_sched_clock(void)
return static_call(pv_sched_clock)();
}
+void paravirt_sched_clock_start_secondary(void);
+
struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;
@@ -756,6 +759,9 @@ void native_pv_lock_init(void) __init;
static inline void native_pv_lock_init(void)
{
}
+static inline void paravirt_sched_clock_start_secondary(void)
+{
+}
#endif
#endif /* !CONFIG_PARAVIRT */
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 213cf5379a5a..e3456def5aea 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -187,13 +187,11 @@ struct x86_init_ops {
/**
* struct x86_cpuinit_ops - platform specific cpu hotplug setups
* @setup_percpu_clockev: set up the per cpu clock event device
- * @early_percpu_clock_init: early init of the per cpu clock event device
* @fixup_cpu_id: fixup function for cpuinfo_x86::topo.pkg_id
* @parallel_bringup: Parallel bringup control
*/
struct x86_cpuinit_ops {
void (*setup_percpu_clockev)(void);
- void (*early_percpu_clock_init)(void);
void (*fixup_cpu_id)(struct cpuinfo_x86 *c, int node);
bool parallel_bringup;
};
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 280bb964f30a..11f078b91f22 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -126,12 +126,13 @@ static void kvm_save_sched_clock_state(void)
kvmclock_disable();
}
-#ifdef CONFIG_SMP
-static void kvm_setup_secondary_clock(void)
+static void kvm_setup_secondary_sched_clock(void)
{
+ if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_SMP)))
+ return;
+
kvm_register_clock("secondary cpu, sched_clock setup");
}
-#endif
static void kvm_restore_sched_clock_state(void)
{
@@ -367,7 +368,8 @@ static void __init kvm_sched_clock_init(bool stable)
{
if (__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
kvm_save_sched_clock_state,
- kvm_restore_sched_clock_state))
+ kvm_restore_sched_clock_state,
+ kvm_setup_secondary_sched_clock))
return;
kvm_sched_clock_offset = kvm_clock_read();
@@ -452,9 +454,6 @@ void __init kvmclock_init(void)
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
-#ifdef CONFIG_SMP
- x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
-#endif
kvm_get_preset_lpj();
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c538c608d9fb..f93278ddb1d2 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -86,8 +86,13 @@ static u64 native_steal_clock(int cpu)
DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
+#ifdef CONFIG_SMP
+static void (*pv_sched_clock_start_secondary)(void) __ro_after_init;
+#endif
+
int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
- void (*save)(void), void (*restore)(void))
+ void (*save)(void), void (*restore)(void),
> + void (*start_secondary)(void))
{
/*
* Don't replace TSC with a PV clock when running as a CoCo guest and
@@ -104,9 +109,20 @@ int __init __paravirt_set_sched_clock(u64 (*func)(void), bool stable,
static_call_update(pv_sched_clock, func);
x86_platform.save_sched_clock_state = save;
x86_platform.restore_sched_clock_state = restore;
+#ifdef CONFIG_SMP
+ pv_sched_clock_start_secondary = start_secondary;
+#endif
return 0;
}
+#ifdef CONFIG_SMP
+void paravirt_sched_clock_start_secondary(void)
+{
+ if (pv_sched_clock_start_secondary)
+ pv_sched_clock_start_secondary();
+}
+#endif
+
/* These are in entry.S */
static struct resource reserve_ioports = {
.start = 0,
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c10850ae6f09..e6fff67dd264 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -278,7 +278,7 @@ static void notrace start_secondary(void *unused)
cpu_init();
fpu__init_cpu();
rcutree_report_cpu_starting(raw_smp_processor_id());
- x86_cpuinit.early_percpu_clock_init();
+ paravirt_sched_clock_start_secondary();
ap_starting();
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 0a2bbd674a6d..1d4cf071c74b 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -128,7 +128,6 @@ struct x86_init_ops x86_init __initdata = {
};
struct x86_cpuinit_ops x86_cpuinit = {
- .early_percpu_clock_init = x86_init_noop,
.setup_percpu_clockev = setup_secondary_APIC_clock,
.parallel_bringup = true,
};
--
2.48.1.711.g2feabab25a-goog
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC
2025-02-27 2:18 [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC Sean Christopherson
` (37 preceding siblings ...)
2025-02-27 2:18 ` [PATCH v2 38/38] x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock Sean Christopherson
@ 2025-02-28 11:23 ` David Woodhouse
2025-03-03 20:53 ` Sean Christopherson
38 siblings, 1 reply; 50+ messages in thread
From: David Woodhouse @ 2025-02-28 11:23 UTC (permalink / raw)
To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, Kirill A. Shutemov,
Paolo Bonzini, Juergen Gross, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Ajay Kaher, Jan Kiszka, Andy Lutomirski,
Peter Zijlstra, Daniel Lezcano, John Stultz
Cc: linux-kernel, linux-coco, kvm, virtualization, linux-hyperv,
xen-devel, Tom Lendacky, Nikunj A Dadhania
On Wed, 2025-02-26 at 18:18 -0800, Sean Christopherson wrote:
> This... snowballed a bit.
>
> The bulk of the changes are in kvmclock and TSC, but pretty much every
> hypervisor's guest-side code gets touched at some point. I am reasonably
> confident in the correctness of the KVM changes. For all other hypervisors,
> assume it's completely broken until proven otherwise.
>
> Note, I deliberately omitted:
>
> Alexey Makhalov <alexey.amakhalov@broadcom.com>
> jailhouse-dev@googlegroups.com
>
> from the To/Cc, as those emails bounced on the last version, and I have zero
> desire to get 38*2 emails telling me an email couldn't be delivered.
>
> The primary goal of this series is (or at least was, when I started) to
> fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
> hypervisor is used instead of the secure/trusted TSC that is controlled by
> trusted firmware.
>
> The secondary goal is to draft off of the SNP and TDX changes to slightly
> modernize running under KVM. Currently, KVM guests will use TSC for
> clocksource, but not sched_clock. And they ignore Intel's CPUID-based TSC
> and CPU frequency enumeration, even when using the TSC instead of kvmclock.
> And if the host provides the core crystal frequency in CPUID.0x15, then KVM
> guests can use that for the APIC timer period instead of manually calibrating
> the frequency.
>
> Lots more background on the SNP/TDX motivation:
> https://lore.kernel.org/all/20250106124633.1418972-13-nikunj@amd.com
Looks good; thanks for tackling this.
I think there are still some things from my older series at
https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org/
which this doesn't address. Specifically, the accuracy and consistency
of what KVM advertises to the guest as the KVM clock. And as the Xen
clock, more to the point — because guests generally *know* that the KVM
clock is awful, but expect better of the Xen clock.
With a sane and consistent TSC, the mul/shift factors that KVM presents
to the guest in the kvmclock structure should basically *never* change.
Not even on live update (or live migration between hosts with the same
host TSC frequency).
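To see why, recall the guest-side computation (a simplified sketch of
the pvclock ABI; the real pvclock_clocksource_read() also loops on the
version field, and the 64x32-bit multiply keeps the high bits rather
than truncating):

  u64 delta = rdtsc() - pvti->tsc_timestamp;

  if (pvti->tsc_shift >= 0)
      delta <<= pvti->tsc_shift;
  else
      delta >>= -pvti->tsc_shift;

  ns = pvti->system_time +
       ((delta * pvti->tsc_to_system_mul) >> 32);

The tsc_to_system_mul/tsc_shift pair is purely a function of the guest
TSC frequency, so as long as that frequency is fixed, there is no
reason for those factors ever to change.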
Take live update as the simple case: serializing the QEMU state and
restarting it immediately, just to update QEMU with the guest
experiencing only a few milliseconds of steal time.
The guest TSC has a fixed arithmetic relationship to the host TSC. That
should *not* change across the live update; not by a single count.
I don't believe the KVM APIs allow userspace to get that right, which
is resolved by the KVM_VCPU_TSC_SCALE ioctl in patch 7 of that series:
https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@infradead.org/
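In essence, the hardware relationship is

  guest_tsc = (host_tsc * tsc_scaling_ratio >> frac_bits) + tsc_offset

(frac_bits being 48 on VMX and 32 on SVM), so preserving it across the
update is just a matter of letting userspace read and restore the exact
ratio and offset, which is what KVM_VCPU_TSC_SCALE is for.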
And then the KVM clock should have a fixed arithmetic relationship to
the guest TSC, which should *also* not change. Not even over live
migration — userspace should ensure the guest TSC is as accurate as
possible given NTP synchronisation between the hosts, and then the KVM
clock remains a fixed function of the guest TSC (at least, if the guest
TSC is the same frequency on source and destination). The existing KVM
API doesn't allow userspace to get *that* right either, which is
addressed by Jack's patch 3 of the series:
https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/
The rest of the series is mostly fixing a bunch of places where KVM
gratuitously recalculates the KVM clock that it advertises to the
guest, and the fact that it does so *badly* in some cases, with a loss
of precision that causes errors in the guest. You may already have
addressed some of those; I'll go over my series and see what still
applies on top of yours.
>
> v2:
> - Add struct to hold the TSC CPUID output. [Boris]
> - Don't pointlessly inline the TSC CPUID helpers. [Boris]
> - Fix a variable goof in a helper, hopefully for real this time. [Dan]
> - Collect reviews. [Nikunj]
> - Override the sched_clock save/restore hooks if and only if a PV clock
> is successfully registered.
> - During resume, restore clocksources before reading persistent time.
> - Clean up more warts created by kvmclock.
> - Fix more bugs in kvmclock's suspend/resume handling.
> - Try to harden kvmclock against future bugs.
>
> v1: https://lore.kernel.org/all/20250201021718.699411-1-seanjc@google.com
>
> Sean Christopherson (38):
> x86/tsc: Add standalone helpers for getting TSC info from CPUID.0x15
> x86/tsc: Add standalone helper for getting CPU frequency from CPUID
> x86/tsc: Add helper to register CPU and TSC freq calibration routines
> x86/sev: Mark TSC as reliable when configuring Secure TSC
> x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
> x86/tdx: Override PV calibration routines with CPUID-based calibration
> x86/acrn: Mark TSC frequency as known when using ACRN for calibration
> clocksource: hyper-v: Register sched_clock save/restore iff it's
> necessary
> clocksource: hyper-v: Drop wrappers to sched_clock save/restore
> helpers
> clocksource: hyper-v: Don't save/restore TSC offset when using HV
> sched_clock
> x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
> x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
> x86/paravirt: Move handling of unstable PV clocks into
> paravirt_set_sched_clock()
> x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
> x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
> x86/vmware: Nullify save/restore hooks when using VMware's sched_clock
> x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
> x86/paravirt: Pass sched_clock save/restore helpers during
> registration
> x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
> x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
> x86/pvclock: Mark setup helpers and related various as
> __init/__ro_after_init
> x86/pvclock: WARN if pvclock's valid_flags are overwritten
> x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during
> kvmclock_init()
> timekeeping: Resume clocksources before reading persistent clock
> x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't
> sched_clock
> x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
> x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't
> sched_clock
> x86/paravirt: Mark __paravirt_set_sched_clock() as __init
> x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
> x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted
> TSC
> x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
> x86/tsc: Reject attempts to override TSC calibration with lesser
> routine
> x86/kvmclock: Mark TSC as reliable when it's constant and nonstop
> x86/kvmclock: Get CPU base frequency from CPUID when it's available
> x86/kvmclock: Get TSC frequency from CPUID when it's available
> x86/kvmclock: Stuff local APIC bus period when core crystal freq comes
> from CPUID
> x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
> x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
>
> arch/x86/coco/sev/core.c | 9 +-
> arch/x86/coco/tdx/tdx.c | 27 ++-
> arch/x86/include/asm/kvm_para.h | 10 +-
> arch/x86/include/asm/paravirt.h | 16 +-
> arch/x86/include/asm/tdx.h | 2 +
> arch/x86/include/asm/tsc.h | 20 +++
> arch/x86/include/asm/x86_init.h | 2 -
> arch/x86/kernel/cpu/acrn.c | 5 +-
> arch/x86/kernel/cpu/mshyperv.c | 69 +-------
> arch/x86/kernel/cpu/vmware.c | 11 +-
> arch/x86/kernel/jailhouse.c | 6 +-
> arch/x86/kernel/kvm.c | 39 +++--
> arch/x86/kernel/kvmclock.c | 260 +++++++++++++++++++++--------
> arch/x86/kernel/paravirt.c | 35 +++-
> arch/x86/kernel/pvclock.c | 9 +-
> arch/x86/kernel/smpboot.c | 2 +-
> arch/x86/kernel/tsc.c | 141 ++++++++++++----
> arch/x86/kernel/x86_init.c | 1 -
> arch/x86/mm/mem_encrypt_amd.c | 3 -
> arch/x86/xen/time.c | 13 +-
> drivers/clocksource/hyperv_timer.c | 38 +++--
> include/clocksource/hyperv_timer.h | 2 -
> kernel/time/timekeeping.c | 9 +-
> 23 files changed, 487 insertions(+), 242 deletions(-)
>
>
> base-commit: a64dcfb451e254085a7daee5fe51bf22959d52d3
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC
2025-02-28 11:23 ` [PATCH v2 00/38] x86: Try to wrangle PV clocks vs. TSC David Woodhouse
@ 2025-03-03 20:53 ` Sean Christopherson
0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2025-03-03 20:53 UTC (permalink / raw)
To: David Woodhouse
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
Kirill A. Shutemov, Paolo Bonzini, Juergen Gross,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Ajay Kaher,
Jan Kiszka, Andy Lutomirski, Peter Zijlstra, Daniel Lezcano,
John Stultz, linux-kernel, linux-coco, kvm, virtualization,
linux-hyperv, xen-devel, Tom Lendacky, Nikunj A Dadhania
On Fri, Feb 28, 2025, David Woodhouse wrote:
> On Wed, 2025-02-26 at 18:18 -0800, Sean Christopherson wrote:
> > This... snowballed a bit.
> >
> > The bulk of the changes are in kvmclock and TSC, but pretty much every
> > hypervisor's guest-side code gets touched at some point. I am reaonsably
> > confident in the correctness of the KVM changes. For all other hypervisors,
> > assume it's completely broken until proven otherwise.
> >
> > Note, I deliberately omitted:
> >
> > Alexey Makhalov <alexey.amakhalov@broadcom.com>
> > jailhouse-dev@googlegroups.com
> >
> > from the To/Cc, as those emails bounced on the last version, and I have zero
> > desire to get 38*2 emails telling me an email couldn't be delivered.
> >
> > The primary goal of this series is (or at least was, when I started) to
> > fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
> > hypervisor is used instead of the secure/trusted TSC that is controlled by
> > trusted firmware.
> >
> > The secondary goal is to draft off of the SNP and TDX changes to slightly
> > modernize running under KVM. Currently, KVM guests will use TSC for
> > clocksource, but not sched_clock. And they ignore Intel's CPUID-based TSC
> > and CPU frequency enumeration, even when using the TSC instead of kvmclock.
> > And if the host provides the core crystal frequency in CPUID.0x15, then KVM
> > guests can use that for the APIC timer period instead of manually calibrating
> > the frequency.
> >
> > Lots more background on the SNP/TDX motiviation:
> > https://lore.kernel.org/all/20250106124633.1418972-13-nikunj@amd.com
>
> Looks good; thanks for tackling this.
>
> I think there are still some things from my older series at
> https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org/
> which this doesn't address.
Most definitely. I was/am assuming you're going to send a v4 at some point?
^ permalink raw reply [flat|nested] 50+ messages in thread