* [PATCH v3 05/41] x86/tdx: Override PV calibration routines with CPUID-based calibration
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>
When running as a TDX guest, explicitly override the TSC frequency
calibration routine with CPUID-based calibration instead of potentially
relying on a hypervisor-controlled PV routine. For TDX guests, CPUID.0x15
is always emulated by the TDX-Module, i.e. the information from CPUID is
more trustworthy than the information provided by the hypervisor.
To maintain backwards compatibility with TDX guest kernels that use native
calibration, and because it's the least awful option, retain
native_calibrate_tsc()'s stuffing of the local APIC bus period using the
core crystal frequency. While it's entirely possible for the hypervisor
to emulate the APIC timer at a different frequency than the core crystal
frequency, the commonly accepted interpretation of Intel's SDM is that APIC
timer runs at the core crystal frequency when that latter is enumerated via
CPUID:
The APIC timer frequency will be the processor’s bus clock or core
crystal clock frequency (when TSC/core crystal clock ratio is enumerated
in CPUID leaf 0x15).
If the hypervisor is malicious and deliberately runs the APIC timer at the
wrong frequency, nothing would stop the hypervisor from modifying the
frequency at any time, i.e. attempting to manually calibrate the frequency
out of paranoia would be futile.
Deliberately leave the CPU frequency calibration routine as is, since the
TDX-Module doesn't provide any guarantees with respect to CPUID.0x16.
Opportunistically add a comment explaining that CoCo TSC initialization
needs to come after hypervisor specific initialization.
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/tdx/tdx.c | 30 +++++++++++++++++++++++++++---
arch/x86/include/asm/tdx.h | 2 ++
arch/x86/kernel/tsc.c | 8 ++++++++
3 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 29b6f1ed59ec..26890cea790b 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
#include <linux/export.h>
#include <linux/io.h>
#include <linux/kexec.h>
+#include <asm/apic.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -1123,9 +1124,6 @@ void __init tdx_early_init(void)
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
- /* TSC is the only reliable clock in TDX guest */
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
-
cc_vendor = CC_VENDOR_INTEL;
/* Configure the TD */
@@ -1195,3 +1193,29 @@ void __init tdx_early_init(void)
tdx_announce();
}
+
+static unsigned long tdx_get_tsc_khz(void)
+{
+ struct cpuid_tsc_info info;
+
+ if (WARN_ON_ONCE(cpuid_get_tsc_freq(&info)))
+ return 0;
+
+ lapic_timer_period = info.crystal_khz * 1000 / HZ;
+
+ return info.tsc_khz;
+}
+
+void __init tdx_tsc_init(void)
+{
+ /* TSC is the only reliable clock in TDX guest */
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+ /*
+ * Override the PV calibration routines (if set) with more trustworthy
+ * CPUID-based calibration. The TDX module emulates CPUID, whereas any
+ * PV information is provided by the hypervisor.
+ */
+ tsc_register_calibration_routines(tdx_get_tsc_khz, NULL);
+}
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 15eac89b0afb..60deab0ed979 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -57,6 +57,7 @@ struct ve_info {
#ifdef CONFIG_INTEL_TDX_GUEST
void __init tdx_early_init(void);
+void __init tdx_tsc_init(void);
void tdx_get_ve_info(struct ve_info *ve);
@@ -78,6 +79,7 @@ void __init tdx_dump_td_ctls(u64 td_ctls);
#else
static inline void tdx_early_init(void) { };
+static inline void tdx_tsc_init(void) { }
static inline void tdx_halt(void) { };
static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 243999692aea..e00f53e3dd8d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -34,6 +34,7 @@
#include <asm/topology.h>
#include <asm/uv/uv.h>
#include <asm/sev.h>
+#include <asm/tdx.h>
unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -1559,8 +1560,15 @@ void __init tsc_early_init(void)
if (is_early_uv_system())
return;
+ /*
+ * Do CoCo specific "secure" TSC initialization *after* hypervisor
+ * platform initialization so that the secure variant can override the
+ * hypervisor's PV calibration routine with a more trusted method.
+ */
if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
snp_secure_tsc_init();
+ else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
+ tdx_tsc_init();
if (!determine_cpu_tsc_frequencies(true))
return;
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related
* [PATCH v3 04/41] x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>
Move the check on having a Secure TSC to the common tsc_early_init() so
that it's obvious that having a Secure TSC is conditional, and to prepare
for adding TDX to the mix (blindly initializing *both* SNP and TDX TSC
logic looks especially weird).
No functional change intended.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 3 ---
arch/x86/kernel/tsc.c | 3 ++-
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 14ced854cd83..39fb50697542 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2029,9 +2029,6 @@ void __init snp_secure_tsc_init(void)
unsigned long tsc_freq_mhz;
void *mem;
- if (!cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
- return;
-
mem = early_memremap_encrypted(sev_secrets_pa, PAGE_SIZE);
if (!mem) {
pr_err("Unable to get TSC_FACTOR: failed to map the SNP secrets page.\n");
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7e639c0a94a2..243999692aea 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1559,7 +1559,8 @@ void __init tsc_early_init(void)
if (is_early_uv_system())
return;
- snp_secure_tsc_init();
+ if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+ snp_secure_tsc_init();
if (!determine_cpu_tsc_frequencies(true))
return;
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related
* [PATCH v3 03/41] x86/sev: Mark TSC as reliable when configuring Secure TSC
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>
Move the code to mark the TSC as reliable from sme_early_init() to
snp_secure_tsc_init(). The only reader of TSC_RELIABLE is the aptly
named check_system_tsc_reliable(), which runs in tsc_init(), i.e.
after snp_secure_tsc_init().
This will allow consolidating the handling of TSC_KNOWN_FREQ and
TSC_RELIABLE when overriding the TSC calibration routine.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 2 ++
arch/x86/mm/mem_encrypt_amd.c | 3 ---
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index d27cf8f8b025..14ced854cd83 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2041,6 +2041,8 @@ void __init snp_secure_tsc_init(void)
secrets = (__force struct snp_secrets_page *)mem;
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
rdmsrq(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
/* Extract the GUEST TSC MHZ from BIT[17:0], rest is reserved space */
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 2f8c32173972..6c3af974c7c2 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -535,9 +535,6 @@ void __init sme_early_init(void)
*/
x86_init.resources.dmi_setup = snp_dmi_setup;
}
-
- if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
- setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
}
void __init mem_encrypt_free_decrypted_mem(void)
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related
* [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>
Add a helper to register non-native, i.e. PV and CoCo, CPU and TSC
frequency calibration routines. This will allow consolidating handling
of common TSC properties that are forced by hypervisor (PV routines),
and will also allow adding sanity checks to guard against overriding a
TSC calibration routine with a routine that is less robust/trusted.
Make the CPU calibration routine optional, as Xen (very sanely) doesn't
assume the CPU runs as the same frequency as the TSC.
Wrap the helper in an #ifdef to document that the kernel overrides
the native routines when running as a VM, and to guard against unwanted
usage. Add a TODO to call out that AMD_MEM_ENCRYPT is a mess and doesn't
depend on HYPERVISOR_GUEST because it gates both guest and host code.
No functional change intended.
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/coco/sev/core.c | 4 ++--
arch/x86/include/asm/tsc.h | 4 ++++
arch/x86/kernel/cpu/acrn.c | 4 ++--
arch/x86/kernel/cpu/mshyperv.c | 3 +--
arch/x86/kernel/cpu/vmware.c | 4 ++--
arch/x86/kernel/jailhouse.c | 4 ++--
arch/x86/kernel/kvmclock.c | 4 ++--
arch/x86/kernel/tsc.c | 17 +++++++++++++++++
arch/x86/xen/time.c | 2 +-
9 files changed, 33 insertions(+), 13 deletions(-)
diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index 7ed3da998489..d27cf8f8b025 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2048,8 +2048,8 @@ void __init snp_secure_tsc_init(void)
snp_tsc_freq_khz = SNP_SCALE_TSC_FREQ(tsc_freq_mhz * 1000, secrets->tsc_factor);
- x86_platform.calibrate_cpu = securetsc_get_tsc_khz;
- x86_platform.calibrate_tsc = securetsc_get_tsc_khz;
+ tsc_register_calibration_routines(securetsc_get_tsc_khz,
+ securetsc_get_tsc_khz);
early_memunmap(mem, PAGE_SIZE);
}
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 0c57fadc4a39..bae709f5f44d 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -94,6 +94,10 @@ extern int cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
extern void tsc_early_init(void);
extern void tsc_init(void);
+#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
+extern void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
+ unsigned long (*calibrate_cpu)(void));
+#endif
extern void mark_tsc_unstable(char *reason);
extern int unsynchronized_tsc(void);
extern int check_tsc_unstable(void);
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 2c5b51aad91a..c1506cb87d8c 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -29,8 +29,8 @@ static void __init acrn_init_platform(void)
/* Install system interrupt handler for ACRN hypervisor callback */
sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
- x86_platform.calibrate_tsc = acrn_get_tsc_khz;
- x86_platform.calibrate_cpu = acrn_get_tsc_khz;
+ tsc_register_calibration_routines(acrn_get_tsc_khz,
+ acrn_get_tsc_khz);
}
static bool acrn_x2apic_available(void)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 640e6b223c2d..8d2401be420c 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -573,8 +573,7 @@ static void __init ms_hyperv_init_platform(void)
if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
- x86_platform.calibrate_tsc = hv_get_tsc_khz;
- x86_platform.calibrate_cpu = hv_get_tsc_khz;
+ tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
}
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 34b73573b108..b88d9ca01202 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -419,8 +419,8 @@ static void __init vmware_platform_setup(void)
}
vmware_tsc_khz = tsc_khz;
- x86_platform.calibrate_tsc = vmware_get_tsc_khz;
- x86_platform.calibrate_cpu = vmware_get_tsc_khz;
+ tsc_register_calibration_routines(vmware_get_tsc_khz,
+ vmware_get_tsc_khz);
#ifdef CONFIG_X86_LOCAL_APIC
/* Skip lapic calibration since we know the bus frequency. */
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index f58ce9220e0f..db8f31fdb480 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -210,8 +210,6 @@ static void __init jailhouse_init_platform(void)
x86_init.mpparse.parse_smp_cfg = jailhouse_parse_smp_config;
x86_init.pci.arch_init = jailhouse_pci_arch_init;
- x86_platform.calibrate_cpu = jailhouse_get_tsc;
- x86_platform.calibrate_tsc = jailhouse_get_tsc;
x86_platform.get_wallclock = jailhouse_get_wallclock;
x86_platform.legacy.rtc = 0;
x86_platform.legacy.warm_reset = 0;
@@ -221,6 +219,8 @@ static void __init jailhouse_init_platform(void)
machine_ops.emergency_restart = jailhouse_no_restart;
+ tsc_register_calibration_routines(jailhouse_get_tsc, jailhouse_get_tsc);
+
while (pa_data) {
mapping = early_memremap(pa_data, sizeof(header));
memcpy(&header, mapping, sizeof(header));
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index b5991d53fc0e..e9e7394140dd 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -321,8 +321,8 @@ void __init kvmclock_init(void)
flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
- x86_platform.calibrate_tsc = kvm_get_tsc_khz;
- x86_platform.calibrate_cpu = kvm_get_tsc_khz;
+ tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
+
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index f92236f40cbc..7e639c0a94a2 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1281,6 +1281,23 @@ static void __init check_system_tsc_reliable(void)
tsc_disable_clocksource_watchdog();
}
+/*
+ * TODO: Disentangle AMD_MEM_ENCRYPT and make SEV guest support depend on
+ * HYPERVISOR_GUEST.
+ */
+#if defined(CONFIG_HYPERVISOR_GUEST) || defined(CONFIG_AMD_MEM_ENCRYPT)
+void tsc_register_calibration_routines(unsigned long (*calibrate_tsc)(void),
+ unsigned long (*calibrate_cpu)(void))
+{
+ if (WARN_ON_ONCE(!calibrate_tsc))
+ return;
+
+ x86_platform.calibrate_tsc = calibrate_tsc;
+ if (calibrate_cpu)
+ x86_platform.calibrate_cpu = calibrate_cpu;
+}
+#endif
+
/*
* Make an educated guess if the TSC is trustworthy and synchronized
* over all CPUs.
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index d62c14334b35..3d3165eef821 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -569,7 +569,7 @@ static void __init xen_init_time_common(void)
static_call_update(pv_steal_clock, xen_steal_clock);
paravirt_set_sched_clock(xen_sched_clock);
- x86_platform.calibrate_tsc = xen_tsc_khz;
+ tsc_register_calibration_routines(xen_tsc_khz, NULL);
x86_platform.get_wallclock = xen_get_wallclock;
}
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related
* [PATCH v3 01/41] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>
Extract retrieval of TSC frequency information from CPUID into standalone
helpers so that TDX guest support can reuse the logic. Provide a version
that includes the multiplier math as TDX does NOT want to use
native_calibrate_tsc()'s fallback logic that derives the TSC frequency
based on CPUID.0x16, when the core crystal frequency isn't known.
Opportunsitically drop native_calibrate_tsc()'s "== 0" and "!= 0" checks
in favor of the kernel's preferred style.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/tsc.h | 9 +++++
arch/x86/kernel/tsc.c | 67 +++++++++++++++++++++++++-------------
2 files changed, 53 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 4f7f09f50552..0c57fadc4a39 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -83,6 +83,15 @@ static inline cycles_t get_cycles(void)
}
#define get_cycles get_cycles
+struct cpuid_tsc_info {
+ unsigned int denominator;
+ unsigned int numerator;
+ unsigned int crystal_khz;
+ unsigned int tsc_khz;
+};
+extern int cpuid_get_tsc_info(struct cpuid_tsc_info *info);
+extern int cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
+
extern void tsc_early_init(void);
extern void tsc_init(void);
extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index c5110eb554bc..f92236f40cbc 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -658,46 +658,67 @@ static unsigned long quick_pit_calibrate(void)
return delta;
}
+int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
+{
+ unsigned int ecx_hz, edx;
+
+ memset(info, 0, sizeof(*info));
+
+ if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+ return -ENOENT;
+
+ /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
+ cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
+
+ if (!info->denominator || !info->numerator)
+ return -ENOENT;
+
+ /*
+ * Note, some CPUs provide the multiplier information, but not the core
+ * crystal frequency. The multiplier information is still useful for
+ * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
+ */
+ info->crystal_khz = ecx_hz / 1000;
+ return 0;
+}
+
+int cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
+{
+ if (cpuid_get_tsc_info(info) || !info->crystal_khz)
+ return -ENOENT;
+
+ info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
+ return 0;
+}
+
/**
* native_calibrate_tsc - determine TSC frequency
* Determine TSC frequency via CPUID, else return 0.
*/
unsigned long native_calibrate_tsc(void)
{
- unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
- unsigned int crystal_khz;
+ struct cpuid_tsc_info info;
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
return 0;
- if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+ if (cpuid_get_tsc_info(&info))
return 0;
- eax_denominator = ebx_numerator = ecx_hz = edx = 0;
-
- /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
- cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
-
- if (ebx_numerator == 0 || eax_denominator == 0)
- return 0;
-
- crystal_khz = ecx_hz / 1000;
-
/*
* Denverton SoCs don't report crystal clock, and also don't support
* CPUID_LEAF_FREQ for the calculation below, so hardcode the 25MHz
* crystal clock.
*/
- if (crystal_khz == 0 &&
- boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
- crystal_khz = 25000;
+ if (!info.crystal_khz && boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
+ info.crystal_khz = 25000;
/*
* TSC frequency reported directly by CPUID is a "hardware reported"
* frequency and is the most accurate one so far we have. This
* is considered a known frequency.
*/
- if (crystal_khz != 0)
+ if (info.crystal_khz)
setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
/*
@@ -705,15 +726,15 @@ unsigned long native_calibrate_tsc(void)
* clock, but we can easily calculate it to a high degree of accuracy
* by considering the crystal ratio and the CPU speed.
*/
- if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
+ if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
unsigned int eax_base_mhz, ebx, ecx, edx;
cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
- crystal_khz = eax_base_mhz * 1000 *
- eax_denominator / ebx_numerator;
+ info.crystal_khz = eax_base_mhz * 1000 *
+ info.denominator / info.numerator;
}
- if (crystal_khz == 0)
+ if (!info.crystal_khz)
return 0;
/*
@@ -730,10 +751,10 @@ unsigned long native_calibrate_tsc(void)
* lapic_timer_period here to avoid having to calibrate the APIC
* timer later.
*/
- lapic_timer_period = crystal_khz * 1000 / HZ;
+ lapic_timer_period = info.crystal_khz * 1000 / HZ;
#endif
- return crystal_khz * ebx_numerator / eax_denominator;
+ return info.crystal_khz * info.numerator / info.denominator;
}
static unsigned long cpu_khz_from_cpuid(void)
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related
* [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: Sean Christopherson @ 2026-05-15 19:19 UTC (permalink / raw)
To: Kiryl Shutsemau, Paolo Bonzini, Sean Christopherson,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
Dave/Thomas/Peter/Boris, what's the going rate for bribes to take something
like this through the tip tree?
The bulk of the changes are in kvmclock and TSC, but pretty much every
hypervisor's guest-side code gets touched at some point. I am reaonsably
confident in the correctness of the KVM changes. Michael tested Hyper-V in
v2, and while there were conflicts when rebasing, they were largely
superficial (and I've just jinxed myself). For all other hypervisors, assume
the code is compile-tested only, but those changes are all quite small and
straightforward.
The only changes that are questionable/contentious are the last two patches,
which have KVM-as-a-guest use CPUID 0x16 to get the CPU frequency, even on
AMD (that's the dubious part). I very deliberately put them last, so that
they can be dropped at will (I don't care terribly if those patches land).
To merge them, I would want explicit Acks from Paolo and David W.
So, except for the last two patches, to get the stuff I really care about
landed, I think/hope it's just the TSC and guest-side CoCo changes that need
reviews/acks?
The primary goal of this series is (or at least was, when I started) to
fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
hypervisor is used instead of the secure/trusted TSC that is controlled by
trusted firmware.
The secondary goal is to draft off of the SNP and TDX changes to slightly
modernize running under KVM. Currently, KVM guests will use TSC for
clocksource, but not sched_clock. And they ignore Intel's CPUID-based TSC
and CPU frequency enumeration, even when using the TSC instead of kvmclock.
And if the host provides the core crystal frequency in CPUID.0x15, then KVM
guests can use that for the APIC timer period instead of manually calibrating
the frequency.
The tertiary goal is to clean up all of the PV clock code to deduplicate logic
across hypervisors, and to hopefully make it all easier to maintain going
forward.
Lots more background on the SNP/TDX motiviation:
https://lore.kernel.org/all/20250106124633.1418972-13-nikunj@amd.com
Note, I deliberately omitted:
jailhouse-dev@googlegroups.com
from the To/Cc, as those emails bounced on v1, AFAICT nothing has changed, and
I have zero desire to get 41 emails telling me an email couldn't be delivered.
v3:
- Collect reviews. [Michael, Thomas]
- Use Hyper-V reference counter / refcounter instead of Hyper-V timer. [Michael]
- Use the paravirt CPUID interface first proposed by VMware for KVM's
"official" mechanism for communicating frequency to KVM-aware guests,
instead of abusing Intel's CPUID leafs. [David]
- Deal with paravirt code being moved into asm/timers.h and
arch/x86/kernel/tsc.c.
v2:
- https://lore.kernel.org/all/Z8YWttWDtvkyCtdJ@google.com
- Add struct to hold the TSC CPUID output. [Boris]
- Don't pointlessly inline the TSC CPUID helpers. [Boris]
- Fix a variable goof in a helper, hopefully for real this time. [Dan]
- Collect reviews. [Nikunj]
- Override the sched_clock save/restore hooks if and only if a PV clock
is successfully registered.
- During resome, restore clocksources before reading persistent time.
- Clean up more warts created by kvmclock.
- Fix more bugs in kvmclock's suspend/resume handling.
- Try to harden kvmclock against future bugs.
v1: https://lore.kernel.org/all/20250201021718.699411-1-seanjc@google.com
David Woodhouse (2):
KVM: x86: Officially define CPUID 0x40000010 as PV Timing Info (TSC
and Bus)
x86/kvmclock: Obtain TSC frequency from CPUID if present
Sean Christopherson (39):
x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
x86/tsc: Add helper to register CPU and TSC freq calibration routines
x86/sev: Mark TSC as reliable when configuring Secure TSC
x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
x86/tdx: Override PV calibration routines with CPUID-based calibration
x86/acrn: Mark TSC frequency as known when using ACRN for calibration
clocksource: hyper-v: Register sched_clock save/restore iff it's
necessary
clocksource: hyper-v: Drop wrappers to sched_clock save/restore
helpers
clocksource: hyper-v: Don't save/restore TSC offset when using HV
sched_clock
x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
x86/paravirt: Remove unnecessary PARAVIRT=n stub for
paravirt_set_sched_clock()
x86/paravirt: Move handling of unstable PV clocks into
paravirt_set_sched_clock()
x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
x86/vmware: Nullify save/restore hooks when using VMware's sched_clock
x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
x86/paravirt: Pass sched_clock save/restore helpers during
registration
x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
x86/pvclock: Mark setup helpers and related various as
__init/__ro_after_init
x86/pvclock: WARN if pvclock's valid_flags are overwritten
x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during
kvmclock_init()
timekeeping: Resume clocksources before reading persistent clock
x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't
sched_clock
x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't
sched_clock
x86/paravirt: Mark __paravirt_set_sched_clock() as __init
x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted
TSC
x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
x86/tsc: Rejects attempts to override TSC calibration with lesser
routine
x86/kvmclock: Mark TSC as reliable when it's constant and nonstop
x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info
x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
x86/paravirt: Move using_native_sched_clock() stub into timer.h
x86/tsc: Add standalone helper for getting CPU frequency from CPUID
x86/kvmclock: Get CPU base frequency from CPUID when it's available
Documentation/virt/kvm/x86/cpuid.rst | 12 ++
arch/x86/coco/sev/core.c | 9 +-
arch/x86/coco/tdx/tdx.c | 27 ++-
arch/x86/include/asm/kvm_para.h | 12 +-
arch/x86/include/asm/tdx.h | 2 +
arch/x86/include/asm/timer.h | 18 +-
arch/x86/include/asm/tsc.h | 20 +++
arch/x86/include/asm/x86_init.h | 2 -
arch/x86/include/uapi/asm/kvm_para.h | 11 ++
arch/x86/kernel/cpu/acrn.c | 5 +-
arch/x86/kernel/cpu/mshyperv.c | 69 +-------
arch/x86/kernel/cpu/vmware.c | 11 +-
arch/x86/kernel/jailhouse.c | 6 +-
arch/x86/kernel/kvm.c | 62 +++++--
arch/x86/kernel/kvmclock.c | 250 +++++++++++++++++++--------
arch/x86/kernel/pvclock.c | 9 +-
arch/x86/kernel/smpboot.c | 3 +-
arch/x86/kernel/tsc.c | 184 +++++++++++++++-----
arch/x86/kernel/x86_init.c | 1 -
arch/x86/mm/mem_encrypt_amd.c | 3 -
arch/x86/xen/time.c | 13 +-
drivers/clocksource/hyperv_timer.c | 38 ++--
include/clocksource/hyperv_timer.h | 2 -
kernel/time/timekeeping.c | 9 +-
24 files changed, 537 insertions(+), 241 deletions(-)
base-commit: 1196e304db58189264bb5953b4e8da7e90cda615
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply
* Re: [PATCH v9 11/23] x86/virt/seamldr: Allocate and populate a module update request
From: Dave Hansen @ 2026-05-15 18:59 UTC (permalink / raw)
To: Edgecombe, Rick P, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, Chatre, Reinette, seanjc@google.com,
pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com,
x86@kernel.org
In-Reply-To: <ec32e59834ffa585a4145e51253f84e134758c16.camel@intel.com>
On 5/15/26 11:44, Edgecombe, Rick P wrote:
> On Fri, 2026-05-15 at 11:24 -0700, Dave Hansen wrote:
>>> + /* Check the calculated payload size against the data size. */
>>> + if (HEADER_SIZE + sigstruct_len + module_len != size)
>>> + return -EINVAL;
>>> +
>>> + /*
>>> + * Don't care about user passing the wrong file, but protect
>>> + * kernel ABI by preventing accepting garbage.
>>> + */
>> How does this "protect kernel ABI"?
> I think it means to not allow values in that field in case the kernel wants to
> make them mean something new, later.
Maybe I'm just being dense, but I don't have any idea what either of you
is saying. ;)
Could you try a concrete example, please?
^ permalink raw reply
* Re: [PATCH v9 11/23] x86/virt/seamldr: Allocate and populate a module update request
From: Edgecombe, Rick P @ 2026-05-15 18:44 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Hansen, Dave,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, Chatre, Reinette, seanjc@google.com,
pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com,
x86@kernel.org
In-Reply-To: <7d7fff5a-53a5-439e-9ff8-dcbd97f473cd@intel.com>
On Fri, 2026-05-15 at 11:24 -0700, Dave Hansen wrote:
> > + /* Check the calculated payload size against the data size. */
> > + if (HEADER_SIZE + sigstruct_len + module_len != size)
> > + return -EINVAL;
> > +
> > + /*
> > + * Don't care about user passing the wrong file, but protect
> > + * kernel ABI by preventing accepting garbage.
> > + */
>
> How does this "protect kernel ABI"?
I think it means to not allow values in that field in case the kernel wants to
make them mean something new, later.
^ permalink raw reply
* Re: [PATCH v9 11/23] x86/virt/seamldr: Allocate and populate a module update request
From: Dave Hansen @ 2026-05-15 18:24 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-12-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> There are two important ABIs here:
>
> 'struct tdx_image' - The on-disk and in-memory format for a TDX
> module update image.
> 'struct seamldr_params' - The in-memory ABI passed to the TDX module
> loader. Points to a single 'struct tdx_image'
... broken up into 4k pages
> Userspace supplies the update image in struct tdx_image format. The image
> consists of a header followed by a sigstruct and the module binary.
>
> P-SEAMLDR, however, consumes struct seamldr_params rather than the image
> directly. Parse the struct tdx_image provided by userspace and populate a
> matching struct seamldr_params.
This is doing it again. It's taking the key imperative and burying it.
It should be:
=>
Userspace supplies the update image in struct tdx_image format. The
image consists of a header followed by a sigstruct and the module
binary. P-SEAMLDR, however, consumes struct seamldr_params rather than
the image directly.
Parse the struct tdx_image provided by userspace and populate a matching
struct seamldr_params.
<=
> Validate the struct tdx_image header before using it, because the header is
> consumed solely by the kernel to locate the sigstruct and module within
> the image. Do not validate the payload itself. The sigstruct and module
> pages are passed through to P-SEAMLDR, which validates them as part of the
> update flow.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
> sigstruct_pages_pa_list currently has only one entry, but it will grow to
> four pages in the future. Keep it as an array for symmetry with
> module_pages_pa_list and for extensibility.
That's a good note. It can go above the ---.
> + * The seamldr_params "scenario" field specifies the operation mode:
> + * 0: Install TDX module from scratch (not used by kernel)
> + * 1: Update existing TDX module to a compatible version
> + */
> +#define SEAMLDR_SCENARIO_UPDATE 1
> +
> +/*
> + * This is called the "SEAMLDR_PARAMS" data structure and is defined
> + * in "SEAM Loader (SEAMLDR) Interface Specification".
> + *
> + * It describes the TDX module that will be installed.
Yeah, but that's not super useful.
This is the in-memory ABI that the kernel passes to the P-
SEAMLDR to update the TDX module. It breaks the TDX module image
up in page-size pieces.
Better?
> + */
> +struct seamldr_params {
> + u32 version;
> + u32 scenario;
> + u64 sigstruct_pages_pa_list[SEAMLDR_MAX_NR_SIG_PAGES];
> + u8 reserved[104];
> + u64 module_nr_pages;
> + u64 module_pages_pa_list[SEAMLDR_MAX_NR_MODULE_PAGES];
> +} __packed;
> +
> +static_assert(sizeof(struct seamldr_params) == 4096);
> +
> /*
> * Serialize P-SEAMLDR calls since the hardware only allows a single CPU to
> * interact with P-SEAMLDR simultaneously. Use raw version as the calls can
> @@ -43,6 +71,89 @@ int seamldr_get_info(struct seamldr_info *seamldr_info)
> }
> EXPORT_SYMBOL_FOR_MODULES(seamldr_get_info, "tdx-host");
>
> +#define TDX_IMAGE_VERSION_2 0x200
> +
> +struct tdx_image_header {
> + u16 version; // This ABI is always 0x200
That comment reads strangely in here. Did I ask you to write that?
> + u16 checksum;
> + u8 signature[8];
> + u32 sigstruct_nr_pages;
> + u32 module_nr_pages;
> + u8 reserved[4076];
> +} __packed;
> +
> +#define HEADER_SIZE sizeof(struct tdx_image_header)
> +static_assert(HEADER_SIZE == 4096);
> +
> +/* Intel TDX module update ABI structure. aka. "TDX module blob". */
> +struct tdx_image {
> + struct tdx_image_header header;
> + u8 payload[]; // Contains sigstruct pages followed by module pages
> +};
> +
> +static void populate_pa_list(u64 *pa_list, u32 max_entries, const u8 *start, u32 nr_pages)
The naming in there is painful. How about:
populate_pa_list(u64 *pa_list, u32 pa_list_len,
const u8 *vmalloc_addr, u32 vmalloc_len_pages)
> +{
> + int i;
> +
> + nr_pages = MIN(nr_pages, max_entries);
This seems wonky. Should it really be silently suppressing things if
either the allocation or source is too small? I get not wanting to
overflow, but this seems strange.
> + for (i = 0; i < nr_pages; i++) {
> + pa_list[i] = vmalloc_to_pfn(start) << PAGE_SHIFT;
> + start += PAGE_SIZE;
> + }
At the point that you modify 'start', it's not 'start' any more. Use
another variable. This would do, for instance:
for (i = 0; i < nr_pages; i++) {
unsigned long offset = i * PAGE_SIZE;
pa_list[i] = vmalloc_to_pfn(&start[offset]);
}
> +static void populate_seamldr_params(struct seamldr_params *params,
> + const u8 *sig, u32 sig_nr_pages,
> + const u8 *mod, u32 mod_nr_pages)
> +{
> + params->version = 0;
> + params->scenario = SEAMLDR_SCENARIO_UPDATE;
> + params->module_nr_pages = mod_nr_pages;
> +
> + populate_pa_list(params->sigstruct_pages_pa_list, SEAMLDR_MAX_NR_SIG_PAGES,
> + sig, sig_nr_pages);
> + populate_pa_list(params->module_pages_pa_list, SEAMLDR_MAX_NR_MODULE_PAGES,
> + mod, mod_nr_pages);
> +}
Yes, this is starting to look OK. Nit: vertically align the "*_PAGES" args:
populate_pa_list(params->sigstruct_pages_pa_list, SEAMLDR_...,
sig, sig_nr_pages);
populate_pa_list(params->module_pages_pa_list, SEAMLDR_...,
mod, mod_nr_pages);
> +static int init_seamldr_params(struct seamldr_params *params, const u8 *data, u32 size)
> +{
> + const struct tdx_image *image = (const void *)data;
> + const struct tdx_image_header *header = &image->header;
> +
> + u32 sigstruct_len = header->sigstruct_nr_pages * PAGE_SIZE;
> + u32 module_len = header->module_nr_pages * PAGE_SIZE;
> +
> + u8 *header_start = (u8 *)header;
> + u8 *header_end = header_start + HEADER_SIZE;
> +
> + u8 *sigstruct_start = header_end;
> + u8 *sigstruct_end = sigstruct_start + sigstruct_len;
> +
> + u8 *module_start = sigstruct_end;
> +
> + /* Check the calculated payload size against the data size. */
> + if (HEADER_SIZE + sigstruct_len + module_len != size)
> + return -EINVAL;
> +
> + /*
> + * Don't care about user passing the wrong file, but protect
> + * kernel ABI by preventing accepting garbage.
> + */
How does this "protect kernel ABI"?
> + if (header->version != TDX_IMAGE_VERSION_2)
> + return -EINVAL;
> +
> + if (memcmp(header->signature, "TDX-BLOB", sizeof(header->signature)))
> + return -EINVAL;
> +
> + if (memchr_inv(header->reserved, 0, sizeof(header->reserved)))
> + return -EINVAL;
> +
> + populate_seamldr_params(params, sigstruct_start, header->sigstruct_nr_pages,
> + module_start, header->module_nr_pages);
Please work on the vertical alignment if there's a pattern. For
instance, it makes things pop if the two "header->"'s are vertically
aligned.
^ permalink raw reply
* Re: [PATCH v9 10/23] coco/tdx-host: Implement firmware upload sysfs ABI for TDX module updates
From: Dave Hansen @ 2026-05-15 18:05 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-11-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> tl;dr: Select fw_upload for doing TDX module updates. The process of
> selecting among available update images is complicated and nuanced. Punt
> the selection policy out to userspace.
Shouldn't we also say that there is userspace out there to do this
today? Like it's not vaporware. Can we point to it?
> Long Version:
>
> Linux kernel supports two primary firmware update mechanisms:
> - request_firmware()
> - firmware upload (or fw_upload)
>
> The former is used by microcode updates, SEV firmware updates, etc. The
> latter is used by CXL and FPGA firmware updates.
Gah, another former/latter. Format it like this:
The kernel supports two primary firmware update mechanisms:
1. request_firmware() - used by microcode, SEV firmware, hundreds of
other drivers
2. 'struct fw_upload' - used by CXL, FPGA updates, dozens of others
Isn't that a billion times easier to parse?
> One key difference between them is: request_firmware() loads a named
> file from the filesystem where the filename is kernel-controlled, while
> fw_upload accepts firmware data directly from userspace.
One more thing to remove from your changelogs: this "is: " construct.
It's horribly awkward for a reader.
This, please:
The key difference between is that request_firmware() loads a
named file from the filesystem where the filename is kernel-
controlled, while fw_upload accepts firmware data directly from
userspace.
> Use fw_upload for TDX module updates as loading a named file isn't
> suitable for TDX (see below for more reasons). Specifically, register
> TDX faux device with fw_upload framework to expose sysfs interfaces
> and implement operations to process data blobs supplied by userspace.
This is just noise for the justification. We can't do it because it's
not suitable? That's not a reason.
Isn't this better?
TDX module firmware update selection policy is too complex for
the kernel. Leave it to userspace and use fw_upload.
> Why fw_upload instead of request_firmware()?
> ============================================
> The explicit file selection capabilities of fw_upload is preferred over
> the implicit file selection of request_firmware() for the following
> reasons:
> a. Intel distributes all versions of the TDX module, allowing admins to
> load any version rather than always defaulting to the latest. This
> flexibility is necessary because future extensions may require reverting to
> a previous version to clear fatal errors.
How about just: The fw_upload cleanly supports both upgrades and
reversions to earlier module versions. TDX users are expected to need to
do both.
> b. Some module version series are platform-specific. For example, the 1.5.x
> series is for certain platform generations, while the 2.0.x series is
> intended for others.
Not "Some".
x. A given module image can be compatible with several platforms. 1.5.2
runs on <example 1> and <example 2>
y. Not all modules images are compatible with all platforms. 2.0.x runs
<example 3> but not <example 1>.
z. A filesystem will have TDX module images for many platforms, the same
as how /lib/firmware/intel-ucode/ has ucode for many processor
models.
> c. The update policy for TDX module updates is non-linear at times. The
> latest TDX module may not be compatible. For example, TDX module 1.5.x
> may be updated to 1.5.y but not to 1.5.y+1. This policy is documented
> separately in a file released along with each TDX module release.
Again, I'd just give an actual example.
> So, the default policy of "request_firmware()" of "always load latest", is
> not suitable for TDX. Userspace needs to deploy a more sophisticated policy
> check (e.g., latest may not be compatible), and there is potential
> operator choice to consider.
Here's a flow in userspace that I can imagine:
1. Find all the available modules
2. Filter out modules which are incompatible with this system
3. Find the current running module version
4. Decide which direction: upgrade or downgrade. Filter out modules
which are not in the right direction
5. Filter out modules which have a functionally too distant version
(1.2.3=>1.2.4 is OK, but going to 1.2.999 is not)
6. Optimize for fewest updates, or smallest updates. If allowed, go:
1.2.3=>1.2.5, or 1.2.3=>1.2.4=>1.2.5?
Steps 4 and 6 are _pure_ policy.
> diff --git a/drivers/virt/coco/tdx-host/Kconfig b/drivers/virt/coco/tdx-host/Kconfig
> index d35d85ef91c0..ca600a39d97b 100644
> --- a/drivers/virt/coco/tdx-host/Kconfig
> +++ b/drivers/virt/coco/tdx-host/Kconfig
> @@ -1,6 +1,8 @@
> config TDX_HOST_SERVICES
> tristate "TDX Host Services Driver"
> depends on INTEL_TDX_HOST
> + select FW_LOADER
> + select FW_UPLOAD
> default m
> help
> Enable access to TDX host services like module update and
Doesn't this break the compile if FW_LOADER can't be selected? Or does
it error out at Kconfig time. I always forget.
> diff --git a/drivers/virt/coco/tdx-host/tdx-host.c b/drivers/virt/coco/tdx-host/tdx-host.c
> index a540d658757b..c4c099cf3de1 100644
> --- a/drivers/virt/coco/tdx-host/tdx-host.c
> +++ b/drivers/virt/coco/tdx-host/tdx-host.c
> @@ -6,6 +6,7 @@
> */
>
> #include <linux/device/faux.h>
> +#include <linux/firmware.h>
> #include <linux/module.h>
> #include <linux/mod_devicetable.h>
> #include <linux/sysfs.h>
> @@ -84,7 +85,7 @@ static struct attribute *seamldr_attrs[] = {
> NULL,
> };
>
> -static umode_t seamldr_group_visible(struct kobject *kobj, struct attribute *attr, int idx)
> +static bool supports_runtime_update(void)
> {
> const struct tdx_sys_info *sysinfo = tdx_get_sysinfo();
>
> @@ -99,7 +100,12 @@ static umode_t seamldr_group_visible(struct kobject *kobj, struct attribute *att
> if (boot_cpu_has_bug(X86_BUG_SEAMRET_INVD_VMCS))
> return 0;
>
> - return tdx_supports_runtime_update(sysinfo) ? attr->mode : 0;
> + return tdx_supports_runtime_update(sysinfo);
> +}
> +
> +static umode_t seamldr_group_visible(struct kobject *kobj, struct attribute *attr, int idx)
> +{
> + return supports_runtime_update() ? attr->mode : 0;
> }
>
> static const struct attribute_group seamldr_group = {
> @@ -113,6 +119,81 @@ static const struct attribute_group *tdx_host_groups[] = {
> NULL,
> };
>
> +static enum fw_upload_err tdx_fw_prepare(struct fw_upload *fwl,
> + const u8 *data, u32 size)
> +{
> + return FW_UPLOAD_ERR_NONE;
> +}
> +
> +static enum fw_upload_err tdx_fw_write(struct fw_upload *fwl, const u8 *data,
> + u32 offset, u32 size, u32 *written)
> +{
> + int ret;
> +
> + ret = seamldr_install_module(data, size);
> + switch (ret) {
> + case 0:
> + *written = size;
> + return FW_UPLOAD_ERR_NONE;
> + default:
> + return FW_UPLOAD_ERR_FW_INVALID;
> + }
> +}
This is rather ugly for a single condition. Plus, it puts the error path
and the success path on the same footing. That's not great. How about:
if (ret)
return FW_UPLOAD_ERR_FW_INVALID;
*written = size;
return FW_UPLOAD_ERR_NONE;
^ permalink raw reply
* Re: [PATCH v9 08/23] coco/tdx-host: Expose P-SEAMLDR information via sysfs
From: Dave Hansen @ 2026-05-15 17:28 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-9-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> +static umode_t seamldr_group_visible(struct kobject *kobj, struct attribute *attr, int idx)
> +{
> + const struct tdx_sys_info *sysinfo = tdx_get_sysinfo();
> +
> + if (!sysinfo)
> + return 0;
> +
> + return tdx_supports_runtime_update(sysinfo) ? attr->mode : 0;
> +}
Axe the tertiary form.
if (!tdx_supports_runtime_update(sysinfo))
return 0;
return attr->mode;
Please.
^ permalink raw reply
* Re: [PATCH v9 09/23] coco/tdx-host: Don't expose P-SEAMLDR information on CPUs with erratum
From: Dave Hansen @ 2026-05-15 17:26 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-10-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> Some TDX-capable CPUs have an erratum, as documented in Intel® Trust
> Domain CPU Architectural Extensions (May 2021 edition) Chapter 2.3:
2021, eh?
> SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
> to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR using
> SEAMCALL must reload the current-VMCS, if required, using the VMPTRLD
> instruction.
>
> Clearing the current VMCS behind KVM's back will break KVM.
>
> This erratum is not present when IA32_VMX_BASIC[60] is set. Add a CPU
> bug bit for this erratum and refuse to expose P-SEAMLDR information
> on affected CPUs, because even reading the P-SEAMLDR sysfs knobs would
> enter and exit P-SEAMLDR.
>
> Use a CPU bug bit to stay consistent with X86_BUG_TDX_PW_MCE. As a bonus,
> the bug bit is visible to userspace, which allows userspace to determine
> why these sysfs files are not exposed, and it can also be checked by other
> kernel components in the future if needed.
>
> == Alternatives ==
> Two workarounds were considered but both were rejected:
>
> 1. Save/restore the current VMCS around P-SEAMLDR calls. This produces ugly
> assembly code [1] and doesn't play well with #MCE or #NMI if they
> need to use the current VMCS.
>
> 2. Move KVM's VMCS tracking logic to the TDX core code, which would break
> the boundary between KVM and the TDX core code [2].
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v9 08/23] coco/tdx-host: Expose P-SEAMLDR information via sysfs
From: Dave Hansen @ 2026-05-15 17:23 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-9-chao.gao@intel.com>
> +What: /sys/devices/faux/tdx_host/seamldr_version
> +Contact: linux-coco@lists.linux.dev
> +Description: (RO) Report the version of the loaded P-SEAMLDR. The P-SEAMLDR
> + version is formatted as x.y.z, where "x" is the major version,
> + "y" is the minor version and "z" is the update version. Versions
> + are used for bug reporting and compatibility checks.
Rather than a copy/paste, I'd just say the format is the same as the
main module version.
...
> +static ssize_t num_remaining_updates_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct seamldr_info info;
> + int ret;
> +
> + ret = seamldr_get_info(&info);
> + if (ret)
> + return ret;
> +
> + return sysfs_emit(buf, "%u\n", info.num_remaining_updates);
> +}
> +
> +static DEVICE_ATTR_ADMIN_RO(seamldr_version);
> +static DEVICE_ATTR_ADMIN_RO(num_remaining_updates);
> +
> +static struct attribute *seamldr_attrs[] = {
> + &dev_attr_seamldr_version.attr,
> + &dev_attr_num_remaining_updates.attr,
> + NULL,
> +};
> +
> +static umode_t seamldr_group_visible(struct kobject *kobj, struct attribute *attr, int idx)
> +{
> + const struct tdx_sys_info *sysinfo = tdx_get_sysinfo();
> +
> + if (!sysinfo)
> + return 0;
> +
> + return tdx_supports_runtime_update(sysinfo) ? attr->mode : 0;
> +}
> +
> +static const struct attribute_group seamldr_group = {
> + .attrs = seamldr_attrs,
> + .is_visible = seamldr_group_visible,
> +};
I feel like we need to mention *somewhere* that these are kinda nasty.
tdx_get_sysinfo() is slow and single-threaded. These very much are and
need to stay 0400 for good reason.
Talk about the DEVICE_ATTR_ADMIN_RO() choice _somewhere_, please.
With those fixed:
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v9 07/23] x86/virt/seamldr: Add a helper to retrieve P-SEAMLDR information
From: Dave Hansen @ 2026-05-15 17:18 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-8-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> P-SEAMLDR reports its state via SEAMLDR.INFO, including its version and
> the number of remaining runtime updates.
>
> This information is useful for userspace. For example, the admin can use
> the P-SEAMLDR version to determine whether a candidate TDX module is
> compatible with the running loader, and can use the remaining update count
> to determine whether another runtime update is still possible.
>
> Add a helper to retrieve P-SEAMLDR information in preparation for
> exposing P-SEAMLDR version and other necessary information to userspace.
> Export the new kAPI for use by tdx-host.ko.
What is "tdx-host.ko"? My kernel doesn't have that. Remember a few
patches ago, the "tristate" in Kconfig? ;)
You also called in the "tdx_host" device. Please be consistent with the
naming.
> Note that there are two distinct P-SEAMLDR APIs with similar names:
>
> SEAMLDR.INFO: Returns a SEAMLDR_INFO structure containing SEAMLDR
> information such as version and remaining updates.
>
> SEAMLDR.SEAMINFO: Returns a SEAMLDR_SEAMINFO structure containing SEAM
> and system information such as Convertible Memory
> Regions (CMRs) and number of CPUs and sockets.
>
> The former is used here.
This doesn't help.
"SEAMLDR.INFO" is metadata about the loader. It's metadata for the
update process.
"SEAMLDR.SEAMINFO" is metadata about the module. It is for the module
init process, not for the update process.
Right? Isn't that a billion times more useful and actually helps
differentiate them?
Also, more nits: I hate former/latter too. It makes me stop reading and
have to go back *EVER* *TIME*. I hate that particular english construct.
It's horrid.
Just say:
Use SEAMLDR.INFO here.
It's even shorter than the passive "is used" and doesn't require me
going back and re-read the text.
> +++ b/arch/x86/include/asm/seamldr.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_SEAMLDR_H
> +#define _ASM_X86_SEAMLDR_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * This is called the "SEAMLDR_INFO" data structure and is defined
> + * in "SEAM Loader (SEAMLDR) Interface Specification".
More succinct:
* This is the "SEAMLDR_INFO" data structure defined in the
"SEAM Loader (SEAMLDR) Interface Specification".
> + * The SEAMLDR.INFO documentation requires this to be aligned to a
> + * 256-byte boundary.
> + */
Just say:
* Must be aligned to a 256-byte boundary.
With those fixed:
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v9 06/23] x86/virt/seamldr: Introduce a wrapper for P-SEAMLDR SEAMCALLs
From: Dave Hansen @ 2026-05-15 17:07 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel, linux-rt-devel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt
In-Reply-To: <20260513151045.1420990-7-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> The TDX architecture uses the "SEAMCALL" instruction to communicate with
> SEAM mode software. Right now, the only SEAM mode software that the kernel
> communicates with is the TDX module. But, there is actually another
> component that runs in SEAM mode but it is separate from the TDX module:
> the persistent SEAM loader or "P-SEAMLDR". Right now, the only component
> that communicates with it is the BIOS which loads the TDX module itself at
> boot. But, to support updating the TDX module, the kernel now needs to be
> able to talk to it.
>
> P-SEAMLDR SEAMCALLs differ from TDX module SEAMCALLs in areas such as
> concurrency requirements. Add a P-SEAMLDR wrapper to handle these
> differences and prepare for implementing concrete functions.
>
> Use seamcall_prerr() (not '_ret') because current P-SEAMLDR calls do not
> use any output registers other than RAX.
>
> Note that unlike P-SEAMLDR, there is also a non-persistent SEAM loader
> ("NP-SEAMLDR"). This is an authenticated code module (ACM) that is not
> callable at runtime. Only BIOS launches it to load P-SEAMLDR at boot;
> the kernel does not need to interact with it for runtime update.
Nit: the "main act" solution statement for this changelog is:
Add a P-SEAMLDR wrapper to handle these differences and prepare
for implementing concrete functions.
What's more important, that ^?
Or:
Use seamcall_prerr() (not '_ret') because current P-SEAMLDR
calls do not use any output registers other than RAX.
?
Now, what is in a more prominent place in the changelog?
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v9 06/23] x86/virt/seamldr: Introduce a wrapper for P-SEAMLDR SEAMCALLs
From: Dave Hansen @ 2026-05-15 17:02 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel, linux-rt-devel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt
In-Reply-To: <20260513151045.1420990-7-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> Note that unlike P-SEAMLDR, there is also a non-persistent SEAM loader
> ("NP-SEAMLDR"). This is an authenticated code module (ACM) that is not
> callable at runtime. Only BIOS launches it to load P-SEAMLDR at boot;
> the kernel does not need to interact with it for runtime update.
The "unlike" is the wrong word to use here.
Here's a rewrite, with some Claude help:
Note: Despite the similar name, the NP-SEAMLDR ("Non-Persistent")
differs sharply from the P-SEAMLDR. It is an authenticated code module
(ACM) invoked exclusively by the BIOS at boot rather than a component
running in SEAM mode. The kernel cannot call it at runtime. It exposes
no SEAMCALL interface.
^ permalink raw reply
* Re: [PATCH v9 05/23] coco/tdx-host: Expose TDX module version
From: Dave Hansen @ 2026-05-15 16:53 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-6-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> For TDX module updates, userspace needs to select compatible update
> versions based on the current module version. This design delegates
> module selection complexity to userspace because TDX module update
> policies are complex and version series are platform-specific.
I'm not sure exactly what a "version series" is. Do you need to say more
than that the policy is complex?
> For example, the 1.5.x series is for certain platform generations, while
> the 2.0.x series is intended for others. And TDX module 1.5.x may be
> updated to 1.5.y but not to 1.5.y+1.
That's not much of an example, IMNHO. How about:
For example, the 1.5.x series runs on Sapphire Rapids but not
Granite Rapids, which needs 2.0.x. Updates are also constrained
by version distance, so a 1.5.6 module might permit updates to
1.5.7 but not to 1.5.20.
> Expose the TDX module version to userspace via sysfs to aid module
> selection. Since the TDX faux device will drive module updates, expose
> the version as its attribute.
>
> One bonus of exposing TDX module version via sysfs is: TDX module
> version information remains available even after dmesg logs are cleared.
I honestly wouldn't even mention this bit. You don't need a bonus.
> +++ b/Documentation/ABI/testing/sysfs-devices-faux-tdx-host
> @@ -0,0 +1,6 @@
> +What: /sys/devices/faux/tdx_host/version
> +Contact: linux-coco@lists.linux.dev
> +Description: (RO) Report the version of the loaded TDX module. The TDX module
> + version is formatted as x.y.z, where "x" is the major version,
> + "y" is the minor version and "z" is the update version. Versions
> + are used for bug reporting, TDX module updates etc.
The "etc." is silly. Just zap it.
Description: (RO) Report the version of the loaded TDX module.
Formatted as "major.minor.update". Used by TDX module
update tooling. Example: "1.2.03"
That's at least a wee bit of warning to folks about the leading 0 so if
they are parsing it they are a wee bit careful with it.
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 8b739ac01479..b7f4396b5cc5 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -41,6 +41,12 @@
> #include <asm/tdx_global_metadata.h>
> #include <linux/pgtable.h>
>
> +/*
> + * TDX module and P-SEAMLDR version convention: "major.minor.update"
> + * (e.g., "1.5.08") with zero-padded two-digit update field.
> + */
> +#define TDX_VERSION_FMT "%u.%u.%02u"
I hate "e.g.". I'm not sure why, but maybe it it often misused, or that
it isn't allowed by the style guide I had to use in high school. Either
way, please stop using it.
You don't have to modify this patch for it, but please stop.
Second, this was an opportunity to peel out the creation of
"TDX_VERSION_FMT". It would have shrunk your series by ~10 lines and
made this patch more obvious.
Again, you don't have to go do it, but it was a missed opportunity.
With those updates:
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v9 04/23] coco/tdx-host: Introduce a "tdx_host" device
From: Dave Hansen @ 2026-05-15 16:21 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Dan Williams, Jonathan Cameron,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H. Peter Anvin
In-Reply-To: <20260513151045.1420990-5-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> Co-developed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
This SoB chain at _least_ needs a note. It looks quite bizarre.
> +config TDX_HOST_SERVICES
> + tristate "TDX Host Services Driver"
> + depends on INTEL_TDX_HOST
> + default m
> + help
> + Enable access to TDX host services like module update and
> + extensions (e.g. TDX Connect).
> +
> + Say y or m if enabling support for confidential virtual machine
> + support (CONFIG_INTEL_TDX_HOST). The module is called tdx_host.ko.
In what world will anyone ever set INTEL_TDX_HOST=y, but turn this off?
Is this even worth a Kconfig prompt?
I guess we need it for the module or built in choice. But otherwise it
seems a bit silly.
^ permalink raw reply
* Re: [PATCH v9 02/23] x86/virt/tdx: Move TDX_FEATURES0 bits to asm/tdx.h
From: Dave Hansen @ 2026-05-15 16:15 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-3-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
> This prepares for TDX module update [1] and Dynamic PAMT [2] support. Both
> add new TDX_FEATURES0 capability bits, and both need those capabilities to
> be queried from code outside arch/x86/virt. The corresponding feature-query
> helpers therefore need to live in the public asm/tdx.h header, so move the
> existing bit definitions there first.
Please don't add unnecessary changelog cruft. If you need this move for
this series, that's enough.
^ permalink raw reply
* Re: [PATCH v9 01/23] x86/virt/tdx: Consolidate TDX global initialization states
From: Dave Hansen @ 2026-05-15 16:14 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, linux-kernel
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, x86, H. Peter Anvin
In-Reply-To: <20260513151045.1420990-2-chao.gao@intel.com>
On 5/13/26 08:09, Chao Gao wrote:
...
> Group the states into a single structure so they can be reset together, for
> example with memset(), and so a newly added state won't be missed.
...
> +struct tdx_module_state {
> + bool initialized;
> + bool sysinit_done;
> + int sysinit_ret;
> +};
...
> @@ -113,30 +119,28 @@ static int try_init_module_global(void)
> {
> struct tdx_module_args args = {};
> static DEFINE_RAW_SPINLOCK(sysinit_lock);
> - static bool sysinit_done;
> - static int sysinit_ret;
This doesn't look right to me.
> raw_spin_lock(&sysinit_lock);
>
> - if (sysinit_done)
> + if (tdx_module_state.sysinit_done)
> goto out;
>
> /* RCX is module attributes and all bits are reserved */
> args.rcx = 0;
> - sysinit_ret = seamcall_prerr(TDH_SYS_INIT, &args);
> + tdx_module_state.sysinit_ret = seamcall_prerr(TDH_SYS_INIT, &args);
>
> /*
> * The first SEAMCALL also detects the TDX module, thus
> * it can fail due to the TDX module is not loaded.
> * Dump message to let the user know.
> */
> - if (sysinit_ret == -ENODEV)
> + if (tdx_module_state.sysinit_ret == -ENODEV)
> pr_err("module not loaded\n");
>
> - sysinit_done = true;
> + tdx_module_state.sysinit_done = true;
> out:
> raw_spin_unlock(&sysinit_lock);
> - return sysinit_ret;
> + return tdx_module_state.sysinit_ret;
> }
But I think it's because 'sysinit_ret' is really a funky thing. It's
just here so that the module only _tries_ to be initialized once. That
one-time init records its error code in sysinit_ret and then secondary
callers pick it up.
Here's how you do that in a non-confusing way (some context chopped out):
static int try_init_module_global(void)
{
struct tdx_module_args args = {};
static DEFINE_RAW_SPINLOCK(sysinit_lock);
int ret;
raw_spin_lock(&sysinit_lock);
/* Return the "cached" return code: */
if (tdx_module_state.sysinit_done) {
ret = tdx_module_state.sysinit_ret;
goto out;
}
ret = seamcall_prerr(TDH_SYS_INIT, &args);
/* Save the return code for later callers: */
tdx_module_state.sysinit_ret = ret;
tdx_module_state.sysinit_done = true;
out:
raw_spin_unlock(&sysinit_lock);
return ret;
}
See how it sets the module state in _one_ place? It also only touches
the module state under the lock so it's more obvious that it is correct
and there are no races or tearing or other nonsense.
*That* is a proper refactoring.
I'm also not sure we need to be saving the return code. It seems a bit
much, but we don't have to fix that now.
^ permalink raw reply
* Re: [PATCH v2 03/15] KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest
From: Sean Christopherson @ 2026-05-15 12:55 UTC (permalink / raw)
To: Binbin Wu
Cc: Paolo Bonzini, Vitaly Kuznetsov, Kiryl Shutsemau, David Woodhouse,
Paul Durrant, Dave Hansen, Rick Edgecombe, kvm, x86, linux-coco,
linux-kernel, Yosry Ahmed, Kai Huang
In-Reply-To: <27ba35fd-5563-4bbd-8f95-2285b50efa7a@linux.intel.com>
On Fri, May 15, 2026, Binbin Wu wrote:
>
>
> On 5/15/2026 5:53 AM, Sean Christopherson wrote:
> > Don't truncate RAX when handling a Xen hypercall for a guest with protected
> > state, as KVM's ABI is to assume the guest is in 64-bit for such cases
> > (the guest leaving garbage in 63:32 after a transition to 32-bit mode is
> > far less likely than 63:32 being necessary to complete the hypercall).
> >
> > Fixes: b5aead0064f3 ("KVM: x86: Assume a 64-bit hypercall for guests with protected state")
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
>
> The patch looks good to me, but one question below.
>
> > ---
> > arch/x86/kvm/xen.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index 6d9be74bb673..895095dc684e 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -1678,15 +1678,14 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
> > bool handled = false;
> > u8 cpl;
> >
> > - input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX);
> > -
> > /* Hyper-V hypercalls get bit 31 set in EAX */
> > - if ((input & 0x80000000) &&
> > + if ((kvm_rax_read(vcpu) & 0x80000000) &&
> > kvm_hv_hypercall_enabled(vcpu))
> > return kvm_hv_hypercall(vcpu);
> >
> > longmode = is_64_bit_hypercall(vcpu);
>
> Is the variable name misleading?
It most definitely is. However, @longmode is passed around quite a few locations
in xen.c, and so I don't want to opportunistically fix this one variable. Though
I'm definitely not opposed to a separate patch to rename them all to is_64bit or
something.
> If the vcpu is in compatible mode (when guest state is not protected),
> it's in long mode, but the code goes to !longmode path.
>
> > if (!longmode) {
> > + input = (u32)kvm_rax_read(vcpu);
> > params[0] = (u32)kvm_rbx_read(vcpu);
> > params[1] = (u32)kvm_rcx_read(vcpu);
> > params[2] = (u32)kvm_rdx_read(vcpu);
> > @@ -1696,6 +1695,7 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
> > }
> > else {
> > #ifdef CONFIG_X86_64
> > + input = (u64)kvm_rax_read(vcpu);
> > params[0] = (u64)kvm_rdi_read(vcpu);
> > params[1] = (u64)kvm_rsi_read(vcpu);
> > params[2] = (u64)kvm_rdx_read(vcpu);
>
^ permalink raw reply
* Re: [PATCH v2 12/15] KVM: x86: Harden is_64_bit_hypercall() against bugs on 32-bit kernels
From: Binbin Wu @ 2026-05-15 9:31 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Kiryl Shutsemau, David Woodhouse,
Paul Durrant, Dave Hansen, Rick Edgecombe, kvm, x86, linux-coco,
linux-kernel, Yosry Ahmed, Kai Huang
In-Reply-To: <20260514215355.1648463-13-seanjc@google.com>
On 5/15/2026 5:53 AM, Sean Christopherson wrote:
> Unconditionally return %false for is_64_bit_hypercall() on 32-bit kernels
> to guard against incorrectly setting guest_state_protected, and because
> in a (very) hypothetical world where 32-bit KVM supports protected guests,
> assuming a hypercall was made in 64-bit mode is flat out wrong.
>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> arch/x86/kvm/regs.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kvm/regs.h b/arch/x86/kvm/regs.h
> index 52bed14f43e3..d4d2a47a4968 100644
> --- a/arch/x86/kvm/regs.h
> +++ b/arch/x86/kvm/regs.h
> @@ -39,12 +39,16 @@ static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
>
> static inline bool is_64_bit_hypercall(struct kvm_vcpu *vcpu)
> {
> +#ifdef CONFIG_X86_64
> /*
> * If running with protected guest state, the CS register is not
> * accessible. The hypercall register values will have had to been
> * provided in 64-bit mode, so assume the guest is in 64-bit.
> */
> return vcpu->arch.guest_state_protected || is_64_bit_mode(vcpu);
> +#else
> + return false;
> +#endif
> }
>
> static __always_inline unsigned long kvm_reg_mode_mask(struct kvm_vcpu *vcpu)
^ permalink raw reply
* Re: [PATCH v2 11/15] Revert "KVM: VMX: Read 32-bit GPR values for ENCLS instructions outside of 64-bit mode"
From: Binbin Wu @ 2026-05-15 9:26 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Kiryl Shutsemau, David Woodhouse,
Paul Durrant, Dave Hansen, Rick Edgecombe, kvm, x86, linux-coco,
linux-kernel, Yosry Ahmed, Kai Huang
In-Reply-To: <20260514215355.1648463-12-seanjc@google.com>
On 5/15/2026 5:53 AM, Sean Christopherson wrote:
> Now that kvm_<reg>_read() are mode aware, i.e. are functionally equivalent
> to kvm_register_read(), revert aback to the less verbose versions.
>
> No functional change intended.
>
> This reverts commit 60919eccf6764c71cef31a1afeaa1a36b8e5ab85.
>
> Acked-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> arch/x86/kvm/vmx/sgx.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/sgx.c b/arch/x86/kvm/vmx/sgx.c
> index 2f5a1c58f3c5..876dc2814108 100644
> --- a/arch/x86/kvm/vmx/sgx.c
> +++ b/arch/x86/kvm/vmx/sgx.c
> @@ -225,8 +225,8 @@ static int handle_encls_ecreate(struct kvm_vcpu *vcpu)
> struct x86_exception ex;
> int r;
>
> - if (sgx_get_encls_gva(vcpu, kvm_register_read(vcpu, VCPU_REGS_RBX), 32, 32, &pageinfo_gva) ||
> - sgx_get_encls_gva(vcpu, kvm_register_read(vcpu, VCPU_REGS_RCX), 4096, 4096, &secs_gva))
> + if (sgx_get_encls_gva(vcpu, kvm_rbx_read(vcpu), 32, 32, &pageinfo_gva) ||
> + sgx_get_encls_gva(vcpu, kvm_rcx_read(vcpu), 4096, 4096, &secs_gva))
> return 1;
>
> /*
> @@ -302,9 +302,9 @@ static int handle_encls_einit(struct kvm_vcpu *vcpu)
> gpa_t sig_gpa, secs_gpa, token_gpa;
> int ret, trapnr;
>
> - if (sgx_get_encls_gva(vcpu, kvm_register_read(vcpu, VCPU_REGS_RBX), 1808, 4096, &sig_gva) ||
> - sgx_get_encls_gva(vcpu, kvm_register_read(vcpu, VCPU_REGS_RCX), 4096, 4096, &secs_gva) ||
> - sgx_get_encls_gva(vcpu, kvm_register_read(vcpu, VCPU_REGS_RDX), 304, 512, &token_gva))
> + if (sgx_get_encls_gva(vcpu, kvm_rbx_read(vcpu), 1808, 4096, &sig_gva) ||
> + sgx_get_encls_gva(vcpu, kvm_rcx_read(vcpu), 4096, 4096, &secs_gva) ||
> + sgx_get_encls_gva(vcpu, kvm_rdx_read(vcpu), 304, 512, &token_gva))
> return 1;
>
> /*
^ permalink raw reply
* Re: [PATCH v2 09/15] KVM: x86: Drop non-raw kvm_<reg>_write() helpers
From: Binbin Wu @ 2026-05-15 9:11 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Kiryl Shutsemau, David Woodhouse,
Paul Durrant, Dave Hansen, Rick Edgecombe, kvm, x86, linux-coco,
linux-kernel, Yosry Ahmed, Kai Huang
In-Reply-To: <20260514215355.1648463-10-seanjc@google.com>
On 5/15/2026 5:53 AM, Sean Christopherson wrote:
> Drop the non-raw, mode-aware kvm_<reg>_write() helpers as there is no
> usage in KVM, and in all likelihood there will never be usage in KVM as
> use of hardcoded registers in instructions is uncommon, and *modifying*
> hardcoded registers is practically unheard of. While there are a few
> instructions that modify registers in mode-aware ways, e.g. REP string
> and some ENCLS varieties, the odds of KVM needing to emulate such
> instructions (outside of the fully emulator) are vanishingly small.
>
> Drop kvm_<reg>_write() to prevent incorrect usage; _if_ a new instruction
> comes along that needs to modify a hardcoded register, this can be
> reverted.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> arch/x86/kvm/regs.h | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/arch/x86/kvm/regs.h b/arch/x86/kvm/regs.h
> index b28e71caed25..52bed14f43e3 100644
> --- a/arch/x86/kvm/regs.h
> +++ b/arch/x86/kvm/regs.h
> @@ -61,11 +61,6 @@ static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu *vcpu)
> { \
> return vcpu->arch.regs[VCPU_REGS_##uname] & kvm_reg_mode_mask(vcpu); \
> } \
> -static __always_inline void kvm_##lname##_write(struct kvm_vcpu *vcpu, \
> - unsigned long val) \
> -{ \
> - vcpu->arch.regs[VCPU_REGS_##uname] = val & kvm_reg_mode_mask(vcpu); \
> -} \
> static __always_inline unsigned long kvm_##lname##_read_raw(struct kvm_vcpu *vcpu) \
> { \
> return vcpu->arch.regs[VCPU_REGS_##uname]; \
^ permalink raw reply
* Re: [PATCH v2 08/15] KVM: x86: Add mode-aware versions of kvm_<reg>_{read,write}() helpers
From: Binbin Wu @ 2026-05-15 8:46 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Kiryl Shutsemau, David Woodhouse,
Paul Durrant, Dave Hansen, Rick Edgecombe, kvm, x86, linux-coco,
linux-kernel, Yosry Ahmed, Kai Huang
In-Reply-To: <20260514215355.1648463-9-seanjc@google.com>
On 5/15/2026 5:53 AM, Sean Christopherson wrote:
> Make kvm_<reg>_{read,write}() mode-aware (where the value is truncated to
> 32 bits if the vCPU isn't in 64-bit mode), and convert all the intentional
> "raw" accesses to kvm_<reg>_{read,write}_raw() versions. To avoid
> confusion and bikeshedding over whether or not explicit 32-bit accesses
> should use the "raw" or mode-aware variants, add and use "e" versions, e.g.
> for things like RDMSR, WRMSR, and CPUID, where the instruction uses only
> only bits 31:0, regardless of mode.
^
double "only"
>
> No functional change intended (all use of "e" versions is for cases where
> the value is already truncated due to bouncing through a u32).
>
> Cc: Binbin Wu <binbin.wu@linux.intel.com>
> Cc: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox