Linux Confidential Computing Development
 help / color / mirror / Atom feed
* [PATCH v4 03/47] x86/sev: Mark TSC as reliable when configuring Secure TSC
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Move the code to mark the TSC as reliable from sme_early_init() to
snp_secure_tsc_init().  The only reader of TSC_RELIABLE is the aptly
named check_system_tsc_reliable(), which runs in tsc_init(), i.e.
after snp_secure_tsc_init().

This will allow consolidating the handling of TSC_KNOWN_FREQ and
TSC_RELIABLE when overriding the TSC calibration routine.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/coco/sev/core.c      | 2 ++
 arch/x86/mm/mem_encrypt_amd.c | 3 ---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c
index ecd77d3217f3..ed0ac52a765e 100644
--- a/arch/x86/coco/sev/core.c
+++ b/arch/x86/coco/sev/core.c
@@ -2037,6 +2037,8 @@ void __init snp_secure_tsc_init(void)
 	secrets = (__force struct snp_secrets_page *)mem;
 
 	setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+	setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
 	rdmsrq(MSR_AMD64_GUEST_TSC_FREQ, tsc_freq_mhz);
 
 	/* Extract the GUEST TSC MHZ from BIT[17:0], rest is reserved space */
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 2f8c32173972..6c3af974c7c2 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -535,9 +535,6 @@ void __init sme_early_init(void)
 		 */
 		x86_init.resources.dmi_setup = snp_dmi_setup;
 	}
-
-	if (sev_status & MSR_AMD64_SNP_SECURE_TSC)
-		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
 }
 
 void __init mem_encrypt_free_decrypted_mem(void)
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Extract retrieval of TSC frequency information from CPUID into standalone
helpers so that TDX guest support can reuse the logic.  Provide a version
that includes the multiplier math as TDX does NOT want to use
native_calibrate_tsc()'s fallback logic that derives the TSC frequency
based on CPUID.0x16, when the core crystal frequency isn't known.

Opportunistically drop native_calibrate_tsc()'s "== 0" and "!= 0" checks
in favor of the kernel's preferred style.

No functional change intended.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/tsc.h |  8 +++++
 arch/x86/kernel/tsc.c      | 67 +++++++++++++++++++++++++-------------
 2 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 4f7f09f50552..6cf26e62e9a6 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -83,6 +83,14 @@ static inline cycles_t get_cycles(void)
 }
 #define get_cycles get_cycles
 
+struct cpuid_tsc_info {
+	unsigned int denominator;
+	unsigned int numerator;
+	unsigned int crystal_khz;
+	unsigned int tsc_khz;
+};
+extern int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info);
+
 extern void tsc_early_init(void);
 extern void tsc_init(void);
 extern void mark_tsc_unstable(char *reason);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 08cf6625d484..f7f561722efa 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -658,46 +658,67 @@ static unsigned long quick_pit_calibrate(void)
 	return delta;
 }
 
+static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
+{
+	unsigned int ecx_hz, edx;
+
+	memset(info, 0, sizeof(*info));
+
+	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+		return -ENOENT;
+
+	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
+	cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
+
+	if (!info->denominator || !info->numerator)
+		return -ENOENT;
+
+	/*
+	 * Note, some CPUs provide the multiplier information, but not the core
+	 * crystal frequency.  The multiplier information is still useful for
+	 * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
+	 */
+	info->crystal_khz = ecx_hz / 1000;
+	return 0;
+}
+
+int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
+{
+	if (cpuid_get_tsc_info(info) || !info->crystal_khz)
+		return -ENOENT;
+
+	info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
+	return 0;
+}
+
 /**
  * native_calibrate_tsc - determine TSC frequency
  * Determine TSC frequency via CPUID, else return 0.
  */
 unsigned long native_calibrate_tsc(void)
 {
-	unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
-	unsigned int crystal_khz;
+	struct cpuid_tsc_info info;
 
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
 		return 0;
 
-	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+	if (cpuid_get_tsc_info(&info))
 		return 0;
 
-	eax_denominator = ebx_numerator = ecx_hz = edx = 0;
-
-	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
-	cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
-
-	if (ebx_numerator == 0 || eax_denominator == 0)
-		return 0;
-
-	crystal_khz = ecx_hz / 1000;
-
 	/*
 	 * Denverton SoCs don't report crystal clock, and also don't support
 	 * CPUID_LEAF_FREQ for the calculation below, so hardcode the 25MHz
 	 * crystal clock.
 	 */
-	if (crystal_khz == 0 &&
-			boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
-		crystal_khz = 25000;
+	if (!info.crystal_khz && boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
+		info.crystal_khz = 25000;
 
 	/*
 	 * TSC frequency reported directly by CPUID is a "hardware reported"
 	 * frequency and is the most accurate one so far we have. This
 	 * is considered a known frequency.
 	 */
-	if (crystal_khz != 0)
+	if (info.crystal_khz)
 		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
 
 	/*
@@ -705,15 +726,15 @@ unsigned long native_calibrate_tsc(void)
 	 * clock, but we can easily calculate it to a high degree of accuracy
 	 * by considering the crystal ratio and the CPU speed.
 	 */
-	if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
+	if (!info.crystal_khz && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
 		unsigned int eax_base_mhz, ebx, ecx, edx;
 
 		cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
-		crystal_khz = eax_base_mhz * 1000 *
-			eax_denominator / ebx_numerator;
+		info.crystal_khz = eax_base_mhz * 1000 *
+			info.denominator / info.numerator;
 	}
 
-	if (crystal_khz == 0)
+	if (!info.crystal_khz)
 		return 0;
 
 	/*
@@ -730,10 +751,10 @@ unsigned long native_calibrate_tsc(void)
 	 * lapic_timer_period here to avoid having to calibrate the APIC
 	 * timer later.
 	 */
-	lapic_timer_period = crystal_khz * 1000 / HZ;
+	lapic_timer_period = info.crystal_khz * 1000 / HZ;
 #endif
 
-	return crystal_khz * ebx_numerator / eax_denominator;
+	return info.crystal_khz * info.numerator / info.denominator;
 }
 
 static unsigned long cpu_khz_from_cpuid(void)
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-1-seanjc@google.com>

Don't re-calibrate the TSC frequency if the TSC is known to run at a fixed
frequency.  In practice, this is likely one big nop, as re-calibration is
used only for SMP=n kernels, and only for hardware that is 20+ years old,
i.e. is extremely unlikely to collide with TSC_KNOWN_FREQ.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/tsc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index c5110eb554bc..08cf6625d484 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -946,7 +946,8 @@ void recalibrate_cpu_khz(void)
 		return;
 
 	cpu_khz = x86_platform.calibrate_cpu();
-	tsc_khz = x86_platform.calibrate_tsc();
+	if (!boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ))
+		tsc_khz = x86_platform.calibrate_tsc();
 	if (tsc_khz == 0)
 		tsc_khz = cpu_khz;
 	else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply related

* [PATCH v4 00/47] x86: Try to wrangle PV clocks vs. TSC
From: Sean Christopherson @ 2026-05-29 14:43 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, Sean Christopherson,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Andy Lutomirski,
	Peter Zijlstra, Juergen Gross, Daniel Lezcano, John Stultz
  Cc: H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner

Well, the number of patches in the series is going in the wrong direction,
but I'm much happier with this version, which eschews the x86_platform
overrides entirely in favor of a fixed sequence for selecting the TSC/CPU
frequency "routine".

Given that previous versions had fatal NULL pointer deref bugs that affected
VMware and Xen, this series needs testing and acks from those maintainers.

The primary goal of this series to fix flaws with SNP and TDX guests where a
PV clock provided by the untrusted hypervisor is used instead of the secure
TSC that is controlled by trusted firmware.

The secondary goal is modernize running under KVM.  Currently, KVM guests will
use TSC for clocksource, but not sched_clock.  And Linux-as-a-KVM-guest doesn't
support paravirt enumeration of the TSC/APIC frequencies, even though QEMU
provides that information by default.

The tertiary goal is to clean up the PV clock code to deduplicate logic across
hypervisors, and to hopefully make it all easier to maintain going forward.

v4 also adds a quaternary goal of cleaning up the TSC calibration code, which
was made stupidly hard to follow by hypervisor code mixing in with the native
calibration routines, instead of being implemented as a pure alternative.

Lots more background on the SNP/TDX motiviation:
https://lore.kernel.org/all/20250106124633.1418972-13-nikunj@amd.com

As before, I deliberately omitted jailhouse-dev@googlegroups.com from the To/Cc,
as those emails bounced on v1, AFAICT nothing has changed.

Note, I deliberately didn't collect a few reviews as the patches changed quite
a bit from what was reviewed in v3.

v4:
 - Use x86_init_noop() to skip save/restore on VMware and Xen instead of
   nullifying x86_platform.{save,restore}_sched_clock_state. [Sashiko]
 - Use '0' to indicate "failure" when getting the CPU frequency from CPUID, to
   avoid using an out-param and thus make it all but impossible to
   unintentionally clobber the global cpu_khz (which v3 did). [Sashiko]
 - Rename cpuid_get_cpu_freq() => __cpu_khz_from_cpuid() to capture its
   relationship with cpu_khz_from_cpuid().
 - Compute lapic_timer_period in units of ticks, not Khz. [Sashiko]
 - Kill off x86_platform_ops.calibrate_{cpu,tsc}(), and instead use dedicated
   hooks for hypervisor code, and direct calls for TDX and SNP. [David, loosely]
 - Drop SNP's secure TSC override of _CPU_ calibration, as there's zero
   evidence it's justified or a net positive.
 - Collect reviews/acks. [David, Wei]
 - Decouple getting TSC/APIC frequencies from KVM PV CPUID from kvmclock. [David]
 - Fix an amusing number of Opportunistically misspellings. [David]
 - Set kvm_sched_clock_offset _before_ registering kvmclock as sched_clock,
   and add a comment to guard against future goofs. [Sashiko]
 - Keep "setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)" in Hyper-V's handling
   of HV_ACCESS_TSC_INVARIANT, as it's technically possible to have a VM
   with HV_ACCESS_TSC_INVARIANT but not HV_ACCESS_FREQUENCY_MSRS.  Though as
   a _very_ nice side effect of using dedicated sequencing for selecting the
   TSC frequency source, this would have naturally happened anyways. [Sashiko]

v3:
 - https://lore.kernel.org/all/20260515191942.1892718-1-seanjc@google.com
 - Collect reviews. [Michael, Thomas]
 - Use Hyper-V reference counter / refcounter instead of Hyper-V timer. [Michael]
 - Use the paravirt CPUID interface first proposed by VMware for KVM's
   "official" mechanism for communicating frequency to KVM-aware guests,
   instead of abusing Intel's CPUID leafs. [David]
 - Deal with paravirt code being moved into asm/timers.h and
   arch/x86/kernel/tsc.c.

v2:
 - https://lore.kernel.org/all/Z8YWttWDtvkyCtdJ@google.com
 - Add struct to hold the TSC CPUID output. [Boris]
 - Don't pointlessly inline the TSC CPUID helpers. [Boris]
 - Fix a variable goof in a helper, hopefully for real this time. [Dan]
 - Collect reviews. [Nikunj]
 - Override the sched_clock save/restore hooks if and only if a PV clock
   is successfully registered.
 - During resome, restore clocksources before reading persistent time.
 - Clean up more warts created by kvmclock.
 - Fix more bugs in kvmclock's suspend/resume handling.
 - Try to harden kvmclock against future bugs.

v1: https://lore.kernel.org/all/20250201021718.699411-1-seanjc@google.com

David Woodhouse (3):
  KVM: x86: Officially define CPUID 0x40000010 as PV Timing Info (TSC
    and Bus)
  x86/kvm: Obtain TSC frequency from PV CPUID if present
  x86/xen: Obtain TSC frequency from CPUID if present

Sean Christopherson (44):
  x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
  x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
  x86/sev: Mark TSC as reliable when configuring Secure TSC
  x86/sev: Don't override CPU frequency calibration for SNP's Secure TSC
  x86/sev: Move check for SNP Secure TSC support to tsc_early_init()
  x86/sev: Shove SNP's secure/trusted TSC frequency directly into
    "calibration"
  x86/tdx: Force TSC frequency with CPUID-based info provided by the
    TDX-Module
  x86/tsc: Add dedicated hypervisor hooks for getting known TSC/CPU
    frequencies
  x86/acrn: Mark TSC frequency as known when using ACRN for calibration
  x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
  x86/tsc: Kill off x86_platform_ops.calibrate_{cpu,tsc}() hooks
  x86/tsc: Rename pit_hpet_ptimer_calibrate_cpu() =>
    native_calibrate_cpu_late()
  x86/tsc: Fold native_calibrate_cpu() into recalibrate_cpu_khz()
  x86/kvmclock: Rename kvm_get_tsc_khz() to kvmclock_get_tsc_khz()
  x86/kvm: Mark TSC as reliable when it's constant and nonstop
  x86/kvm: Get local APIC bus frequency from PV CPUID Timing Info
  x86/tsc: Add standalone helper for getting CPU frequency from CPUID
  x86/kvm: Get CPU base frequency from CPUID when it's available
  clocksource: hyper-v: Register sched_clock save/restore iff it's
    necessary
  clocksource: hyper-v: Drop wrappers to sched_clock save/restore
    helpers
  clocksource: hyper-v: Don't save/restore TSC offset when using HV
    sched_clock
  x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
  x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
  x86/paravirt: Remove unnecessary PARAVIRT=n stub for
    paravirt_set_sched_clock()
  x86/paravirt: Move handling of unstable PV clocks into
    paravirt_set_sched_clock()
  x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
  x86/xen/time: NOP-ify x86_platform's sched_clock save/restore hooks
  x86/vmware: NOP-ify save/restore hooks when using VMware's sched_clock
  x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
  x86/paravirt: Pass sched_clock save/restore helpers during
    registration
  x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
  x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
  x86/pvclock: Mark setup helpers and related various as
    __init/__ro_after_init
  x86/pvclock: WARN if pvclock's valid_flags are overwritten
  x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during
    kvmclock_init()
  timekeeping: Resume clocksources before reading persistent clock
  x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't
    sched_clock
  x86/kvmclock: WARN if wall clock is read while kvmclock is suspended
  x86/paravirt: Mark __paravirt_set_sched_clock() as __init
  x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
  x86/paravirt: Don't use a PV sched_clock in CoCo guests with trusted
    TSC
  x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
  x86/kvmclock: Plumb in AP-online and BSP-resume to kvmlock, for
    documentation
  x86/paravirt: Move using_native_sched_clock() stub into timer.h

 Documentation/virt/kvm/x86/cpuid.rst |  12 ++
 arch/x86/coco/sev/core.c             |  21 +--
 arch/x86/coco/tdx/tdx.c              |  19 ++-
 arch/x86/include/asm/acrn.h          |   5 -
 arch/x86/include/asm/kvm_para.h      |  12 +-
 arch/x86/include/asm/sev.h           |   4 +-
 arch/x86/include/asm/tdx.h           |   2 +
 arch/x86/include/asm/timer.h         |  15 +-
 arch/x86/include/asm/tsc.h           |  11 +-
 arch/x86/include/asm/x86_init.h      |   8 +-
 arch/x86/include/uapi/asm/kvm_para.h |  11 ++
 arch/x86/kernel/cpu/acrn.c           |  10 +-
 arch/x86/kernel/cpu/mshyperv.c       |  65 +-------
 arch/x86/kernel/cpu/vmware.c         |  13 +-
 arch/x86/kernel/jailhouse.c          |   7 +-
 arch/x86/kernel/kvm.c                | 108 +++++++++++--
 arch/x86/kernel/kvmclock.c           | 208 ++++++++++++++++---------
 arch/x86/kernel/pvclock.c            |   9 +-
 arch/x86/kernel/tsc.c                | 218 +++++++++++++++++----------
 arch/x86/kernel/x86_init.c           |   2 -
 arch/x86/mm/mem_encrypt_amd.c        |   3 -
 arch/x86/xen/time.c                  |  25 ++-
 drivers/clocksource/hyperv_timer.c   |  38 +++--
 include/clocksource/hyperv_timer.h   |   2 -
 kernel/time/timekeeping.c            |   9 +-
 25 files changed, 533 insertions(+), 304 deletions(-)


base-commit: 4678d11f294de0fd295a265e02955b5d1a4a2684
-- 
2.54.0.823.g6e5bcc1fc9-goog


^ permalink raw reply

* Re: [PATCH v2 0/6] KVM/x86: Drop "1" as MSR emulation return value
From: Juergen Gross @ 2026-05-29  9:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-kernel, x86, kvm, linux-coco, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Vitaly Kuznetsov, Kiryl Shutsemau, Rick Edgecombe,
	David Woodhouse, Paul Durrant
In-Reply-To: <6a7843b6-3c8a-4151-b3cf-3df59c93ce9f@suse.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 1823 bytes --]

On 28.05.26 17:50, Jürgen Groß wrote:
> On 28.05.26 15:21, Sean Christopherson wrote:
>> On Thu, May 28, 2026, Jürgen Groß wrote:
>>> On 28.05.26 15:09, Sean Christopherson wrote:
>>>> On Thu, May 28, 2026, Juergen Gross wrote:
>>>>> Please disregard this series, there is one complication sashiko made me
>>>>> aware of.
>>>>
>>>> Sashiko beat me to the punch. :-)
>>>>
>>>> See commit 2368048bf5c2 ("KVM: x86: Signal #GP, not -EPERM, on bad 
>>>> WRMSR(MCi_CTL/STATUS)")
>>>> for a real world example of how things can and will go wrong.
>>>
>>> Yeah, with Sashiko's pointer it was easy to spot.
>>>
>>> Question now is whether the already existing cases of -errno passed as return
>>> value are wrong or on purpose.
>>
>> What are the existing cases?
>>
>>> If the latter, there should be a comment for
>>> that, otherwise they need to be fixed..
>>>
>>> Disentangling the MSR emulation return values from the "normal" ones ("return
>>> to guest"/"return to user mode") will be quite interesting with the overloaded
>>> semantics of "1".
>>
>> LOL, "interesting".
> 
> What do you think about the following idea:
> 
> Lets pass struct msr_info * down to all functions which get their return
> value passed up. Then extend msr_info with a bool "return_to_guest" (valid
> only if !host_initiated), which should be set instead of passing "1" up to
> the caller (probably using an inline helper). Then the return value could
> be 0 or -errno, and after MSR emulation the return_to_guest indicator can
> be tested if needed.

In the end I think it is less error prone to define a struct kvm_msr_ret_t
used as return type, consisting of an int and a bool with the same semantics
as above. Helpers will still be a good idea IMHO.

I'll have a try how it looks like.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply

* Re: [PATCH kernel] crypto/ccp/tsm: Enable the root port after the endpoint
From: Herbert Xu @ 2026-05-29  6:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linux-crypto, linux-kernel, Tom Lendacky, David S. Miller,
	Dan Williams, x86, linux-coco, Pratik R . Sampat, Xu Yilun,
	Aneesh Kumar K . V
In-Reply-To: <20260521074301.2369293-1-aik@amd.com>

On Thu, May 21, 2026 at 05:43:01PM +1000, Alexey Kardashevskiy wrote:
> The PCIe r7.0, chapter "6.33.8 Other IDE Rules" mandates if selective IDE
> is enabled for config requersts, a stream must be enabled on the endpoint
> before enabling it on the rootport:
> 
> ===
> For Selective IDE, the Stream must not be used until it has been enabled in
> both Partner Ports. For cases where one of the Partner Ports is a Root Port
> and Selective IDE for Configuration Requests is enabled, the other
> Partner Port must be enabled prior to the Root Port. For other scenarios,
> the mechanisms to satisfy this requirement are implementation-specific.
> ===
> 
> Do what the spec says.
> 
> Fixes: 4be423572da1 ("crypto/ccp: Implement SEV-TIO PCIe IDE (phase1)")
> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> ---
>  drivers/crypto/ccp/sev-dev-tsm.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Kalra, Ashish @ 2026-05-29  0:29 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Dave Hansen, tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260529002613.GEahjdJRsX2uNz0GnH@fat_crate.local>


On 5/28/2026 7:26 PM, Borislav Petkov wrote:
> On Thu, May 28, 2026 at 02:55:44PM -0500, Kalra, Ashish wrote:
>> Hello Dave,
>>
>> On 5/28/2026 2:50 PM, Dave Hansen wrote:
>>> On 5/28/26 12:37, Kalra, Ashish wrote:
>>>> A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:
>>>>
>>>>   for_each_cpu(cpu, &rmpopt_cpumask)
>>>>       wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
>>>
>>> I'm glad we're on the same page finally. I just hope we can get to this
>>> point more quickly next time. I started off with exactly this
>>> suggestion, but someone chimed in to the thread and said it was "slower":
>>>
>>>> https://lore.kernel.org/lkml/6a50d050-f602-43fd-a44a-cecedd9823eb@amd.com/
>>>
>>
>> Yes, actually i should have made it explicitly clear that we need to do it in
>> parallel especially for issuing the RMPOPT instruction itself, as that is
>> in a performance critical path (and for that we are using on_each_cpu_mask()).
> 
> So which is it? Do we need the wrmsrq_on_cpus() helper or not?
> 
> I'm confused.
> 

No, we don't need it, i will drop this helper function patch.

Thanks,
Ashish

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Borislav Petkov @ 2026-05-29  0:26 UTC (permalink / raw)
  To: Kalra, Ashish
  Cc: Dave Hansen, tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <3334a64a-9a5e-4ad5-94f3-01fef788df2e@amd.com>

On Thu, May 28, 2026 at 02:55:44PM -0500, Kalra, Ashish wrote:
> Hello Dave,
> 
> On 5/28/2026 2:50 PM, Dave Hansen wrote:
> > On 5/28/26 12:37, Kalra, Ashish wrote:
> >> A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:
> >>
> >>   for_each_cpu(cpu, &rmpopt_cpumask)
> >>       wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
> > 
> > I'm glad we're on the same page finally. I just hope we can get to this
> > point more quickly next time. I started off with exactly this
> > suggestion, but someone chimed in to the thread and said it was "slower":
> > 
> >> https://lore.kernel.org/lkml/6a50d050-f602-43fd-a44a-cecedd9823eb@amd.com/
> > 
> 
> Yes, actually i should have made it explicitly clear that we need to do it in
> parallel especially for issuing the RMPOPT instruction itself, as that is
> in a performance critical path (and for that we are using on_each_cpu_mask()).

So which is it? Do we need the wrmsrq_on_cpus() helper or not?

I'm confused.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCH v5 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish @ 2026-05-28 23:52 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgGfyb7zvZ1u1j7YLomD+JdAxnVW36gtvNG9gxgZ80vMyQ@mail.gmail.com>

Hello Ackerley,

On 5/28/2026 9:45 AM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
> Thank you Ashish!
> 
>> From: Ashish Kalra <ashish.kalra@amd.com>
>>
>> When SEV-SNP is enabled, all writes to memory are checked to ensure
>> integrity of SNP guest memory. This imposes performance overhead on the
>> whole system.
>>
>> RMPOPT is a new instruction that minimizes the performance overhead of
>> RMP checks on the hypervisor and on non-SNP guests by allowing RMP
>> checks to be skipped for 1GB regions of memory that are known not to
>> contain any SEV-SNP guest memory.
>>
>> Add support for performing RMP optimizations asynchronously using a
>> dedicated workqueue.
>>
>> Enable RMPOPT optimizations globally for all system RAM up to 2TB at
> 
> This should also be updated to say "Enable RMPOPT optimizations for up
> to 2TB worth of system RAM at..."
> 
> The current phrasing sounds like only addresses [0, 2TB) are allowed to
> be optimized, but actually any address [start, start + 2TB) can be
> optimized?

Yes, i will update it.

> 
>> RMP initialization time. RMP checks can initially be skipped for 1GB
>> memory ranges that do not contain SEV-SNP guest memory (excluding
>> preassigned pages such as the RMP table and firmware pages). As SNP
>> guests are launched, RMPUPDATE will disable the corresponding RMPOPT
>> optimizations.
>>
>> Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
>> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
>> ---
>>  arch/x86/virt/svm/sev.c | 167 +++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 164 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
>> index 82f9dc7a57c3..8876cac052d5 100644
>> --- a/arch/x86/virt/svm/sev.c
>> +++ b/arch/x86/virt/svm/sev.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/iommu.h>
>>  #include <linux/amd-iommu.h>
>>  #include <linux/nospec.h>
>> +#include <linux/workqueue.h>
>>
>>  #include <asm/sev.h>
>>  #include <asm/processor.h>
>> @@ -125,7 +126,18 @@ static void *rmp_bookkeeping __ro_after_init;
>>  static u64 probed_rmp_base, probed_rmp_size;
>>
>>  static cpumask_t rmpopt_cpumask;
>> -static phys_addr_t rmpopt_pa_start;
>> +static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
>> +
>> +enum rmpopt_function {
>> +	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
>> +	RMPOPT_FUNC_REPORT_STATUS
>> +};
>> +
>> +#define RMPOPT_WORK_TIMEOUT	10000
>> +
>> +static struct workqueue_struct *rmpopt_wq;
>> +static struct delayed_work rmpopt_delayed_work;
>> +static DEFINE_MUTEX(rmpopt_wq_mutex);
>>
>>  static LIST_HEAD(snp_leaked_pages_list);
>>  static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
>> @@ -564,12 +576,21 @@ EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
>>
>>  static void rmpopt_cleanup(void)
>>  {
>> +	guard(mutex)(&rmpopt_wq_mutex);
>> +
>> +	if (!rmpopt_wq)
>> +		return;
>> +
>> +	cancel_delayed_work_sync(&rmpopt_delayed_work);
>> +	destroy_workqueue(rmpopt_wq);
>> +
>>  	cpus_read_lock();
>>  	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, 0);
>>  	cpus_read_unlock();
>>
>>  	cpumask_clear(&rmpopt_cpumask);
>> -	rmpopt_pa_start = 0;
>> +	rmpopt_pa_start = rmpopt_pa_end = 0;
>> +	rmpopt_wq = NULL;
>>  }
>>
>>  void snp_shutdown(void)
>> @@ -587,6 +608,105 @@ void snp_shutdown(void)
>>  }
>>  EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
>>
>> +static inline bool __rmpopt(u64 rax, u64 rcx)
> 
> Perhaps use pa_start instead of rax and op_type for rcx?
> 

I used these parameters to align with the RMPOPT specifications (rax and rcx)
which i think makes more sense.

>> +{
>> +	bool optimized;
>> +
>> +	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
>> +		     : "=@ccc" (optimized)
>> +		     : "a" (rax), "c" (rcx)
>> +		     : "memory", "cc");
>> +
>> +	return optimized;
>> +}
>> +
>> +static void rmpopt(u64 pa)
>> +{
>> +	u64 rax = ALIGN_DOWN(pa, SZ_1G);
>> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
> 
> And pa_start and op_type here too.
> 
>> +	__rmpopt(rax, rcx);
>> +}
>> +
>> +/*
>> + * 'val' is a system physical address.
>> + */
>> +static void rmpopt_smp(void *val)
>> +{
>> +	rmpopt((u64)val);
>> +}
>> +
>> +/*
>> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
>> + * range of memory does not contain any SNP guest memory.
>> + */
>> +static void rmpopt_work_handler(struct work_struct *work)
>> +{
>> +	bool current_cpu_cleared = false;
>> +	phys_addr_t pa;
>> +	int this_cpu;
>> +
>> +	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
>> +		rmpopt_pa_start, rmpopt_pa_end);
>> +
>> +	/*
>> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
>> +	 * reserved processor memory. The RMP scan is the most expensive
>> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
>> +	 * if they can see a cached result in the reserved processor memory.
>> +	 *
>> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> +	 * on every other primary thread. This potentially allows the
> 
> I like the leader and follower comments below, thanks! With this
> leader/follower setup, will the followers definitely see the cached scan
> results, or might the followers still potentially not benefit from the
> caching? If it's still only "potentially", why?

I am verifying with the H/W architects if this is always going to be true or not,
will the followers always benefit from the scan results cached by the leader (first CPU)
or there is a possibility that the followers cannot see/access/get the cached results
and instead do full RMP scanning ?

> 
>> +	 * followers to use the "cached" scan results to avoid repeating
>> +	 * full scans.
>> +	 */
>> +
>> +	/*
>> +	 * Pin the worker to the current CPU for the leader loop so that
>> +	 * this_cpu remains valid and the RMPOPT instruction executes on
>> +	 * the CPU that was cleared from the cpumask.  The workqueue is
>> +	 * WQ_UNBOUND, so without pinning, the scheduler could migrate
>> +	 * the worker between the cpumask manipulation and the leader
>> +	 * loop, causing the leader to run on a different CPU while
>> +	 * this_cpu's core is skipped entirely.
>> +	 *
>> +	 * Use migrate_disable() rather than get_cpu() to prevent
>> +	 * migration while still allowing preemption.
>> +	 *
>> +	 * Note: rmpopt_cpumask is modified here without holding
>> +	 * rmpopt_wq_mutex.  This is safe because the delayed_work
>> +	 * mechanism guarantees single-threaded execution of this
>> +	 * handler, and rmpopt_cleanup() calls cancel_delayed_work_sync()
>> +	 * to ensure handler completion before tearing down the cpumask.
>> +	 */
>> +	migrate_disable();
>> +	this_cpu = smp_processor_id();
>> +	if (cpumask_test_cpu(this_cpu, &rmpopt_cpumask)) {
>> +		cpumask_clear_cpu(this_cpu, &rmpopt_cpumask);
>> +		current_cpu_cleared = true;
>> +	}
>> +
> 
> Instead of reusing the global rmpopt_cpumask, why not make a copy of
> rmpopt_cpumask for this function? Then this function won't have to
> figure out current_cpu_cleared or restore rmpopt_cpumask at the end.
> 
> I'm thinking to also drop the test and clear, this function can just
> always clear, like
> 
>   cpumask_clear_cpu(smp_processor_id(), followers_cpumask);
> 
> and later
> 
>   on_each_cpu_mask(&followers_cpumask, ...);

That's surely a much cleaner approach. Instead of modifying global
rmpopt_cpumask and using a local copy:

  cpumask_var_t follower_mask;
  alloc_cpumask_var(&follower_mask, GFP_KERNEL);
  cpumask_copy(follower_mask, &rmpopt_cpumask);

  migrate_disable();
  this_cpu = smp_processor_id();
  cpumask_clear_cpu(this_cpu, follower_mask);  // modify local only

  // leader loop on this_cpu...
  migrate_enable();

  // follower loop with follower_mask...
  on_each_cpu_mask(follower_mask, rmpopt_smp, ...);

  free_cpumask_var(follower_mask);

This eliminates:
- current_cpu_cleared variable
- The restore at the end
  
Additionally, the global rmpopt_cpumask is never modified, so no concurrency concerns with debugfs or other readers.

> 
> Actually, if for whatever reason cpumask_test_cpu(this_cpu,
> &rmpopt_cpumask) above returns false, would that mean somehow some cpu
> exists that wasn't enabled right when rmpopt was initialized? 

The work handler can always execute on a cpu which is not in the rmpopt_cpumask, so i believe the
cpumask_test_cpu() needs to be there.

The leader loop must only run on a CPU that has RMPOPT_BASE MSR programmed. If the WQ_UNBOUND scheduler puts the
handler on a CPU not in rmpopt_cpumask, that CPU's core never had RMPOPT enabled -> RMPOPT instruction causes #UD.

  So the leader should be conditional:

  cpumask_copy(follower_mask, &rmpopt_cpumask);

  migrate_disable();
  this_cpu = smp_processor_id();

  if (cpumask_test_cpu(this_cpu, follower_mask)) {
      cpumask_clear_cpu(this_cpu, follower_mask);

      /* Leader: prime the RMPOPT cache on this CPU */
      for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
          rmpopt(pa);
  }

  migrate_enable();

  /* Followers: run RMPOPT on remaining cores */
  cpus_read_lock();
  for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
      on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
      cond_resched();
  }
  cpus_read_unlock();

If the current CPU isn't in rmpopt_cpumask, the leader is skipped and all cores run as followers — they lose the caching
optimization from a leader priming pass, but correctness is maintained.

Alternatively, i could pick the first CPU from rmpopt_cpumask as the explicit leader instead of relying on whichever CPU the
scheduler chose.

 	cpumask_copy(follower_mask, &rmpopt_cpumask);

        migrate_disable();
        this_cpu = smp_processor_id();

        if (cpumask_test_cpu(this_cpu, follower_mask)) {
                /* Fast path: leader runs locally, no IPIs */
                cpumask_clear_cpu(this_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else {
                /*
                 * Current CPU does not have RMPOPT_BASE MSR programmed.
                 * Pick an explicit leader from the cpumask to avoid #UD.
                 */
                int leader_cpu = cpumask_first(follower_mask);

                cpumask_clear_cpu(leader_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        smp_call_function_single(leader_cpu, rmpopt_smp,
                                                 (void *)pa, true);
        }

        migrate_enable();

        /* Followers: run RMPOPT on remaining cores */
        cpus_read_lock();
        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                on_each_cpu_mask(follower_mask, rmpopt_smp,
                                 (void *)pa, true);

                /* Give a chance for other threads to run */
                cond_resched();
        }
        cpus_read_unlock();

        free_cpumask_var(follower_mask);
  }

> If yes, what happens if we call rmpopt() on a cpu where it wasn't initialized?

That will cause a #UD exception, if RMPOPT instruction is issued on 
a CPU where RMPOPT is not enabled (RMPOPT_BASE.RMPOPT_EN==0), so
it is essential to issue RMPOPT instruction only on the cpumask (covers both
primary and secondary threads) which was setup initially when rmpopt was
initialized and on which the RMPOPT_BASE MSR was setup and RMPOPT enabled.

I believe, there are actually three cases to be considered here: 

  1. Current CPU is in rmpopt_cpumask -> primary thread, run leader locally, remove from followers
  2. Current CPU's sibling is in rmpopt_cpumask -> sibling thread, RMPOPT_BASE per-core is programmed, run leader locally, 
     remove the sibling's primary from the follower mask
  3. Neither -> new/unknown CPU, RMPOPT_BASE never programmed on this core, fall back to explicit leader via IPI.

So this seems to the *correct* implementation of the RMPOPT loop: 

   	cpumask_copy(follower_mask, &rmpopt_cpumask);

        migrate_disable();
        this_cpu = smp_processor_id();

        if (cpumask_test_cpu(this_cpu, follower_mask)) {
                /*
                 * Current CPU is a primary thread in rmpopt_cpumask.
                 * Run leader locally and remove from follower mask.
                 */
                cpumask_clear_cpu(this_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
                                      follower_mask)) {
                /*
                 * Current CPU is a sibling thread whose primary is in
                 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
                 * is safe to run the leader locally.  Remove the sibling's
                 * primary from the follower mask as this core is already
                 * covered by the leader.
                 */
                cpumask_andnot(follower_mask, follower_mask,
                               topology_sibling_cpumask(this_cpu));

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        rmpopt(pa);
        } else {
                /*
                 * Current CPU's core does not have RMPOPT_BASE MSR
                 * programmed.  Pick an explicit leader from the cpumask
                 * to avoid #UD.
                 */
                int leader_cpu = cpumask_first(follower_mask);

                cpumask_clear_cpu(leader_cpu, follower_mask);

                for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
                        smp_call_function_single(leader_cpu, rmpopt_smp,
                                                 (void *)pa, true);
        }

        migrate_enable();

        /* Followers: run RMPOPT on remaining cores */
        cpus_read_lock();
        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                on_each_cpu_mask(follower_mask, rmpopt_smp,
                                 (void *)pa, true);

                /* Give a chance for other threads to run */
                cond_resched();
        }
        cpus_read_unlock();

        free_cpumask_var(follower_mask);
  }

Thanks,
Ashish

> 
>> +	/* Leader: prime the RMPOPT cache on this CPU */
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>> +		rmpopt(pa);
>> +
>> +	migrate_enable();
>> +
>> +	/* Followers: run RMPOPT on all other cores */
>> +	cpus_read_lock();
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
>> +				 (void *)pa, true);
>> +
>> +		 /* Give a chance for other threads to run */
>> +		cond_resched();
>> +	}
>> +	cpus_read_unlock();
>> +
>> +	if (current_cpu_cleared)
>> +		cpumask_set_cpu(this_cpu, &rmpopt_cpumask);
>> +}
>> +
>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [RFC PATCH 09/15] x86/virt/tdx: Add interface to generate a Quote
From: Edgecombe, Rick P @ 2026-05-28 22:30 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-10-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> +void *tdx_quote_generate(struct tdx_td *td, void *in_data, u32 in_data_len,
> +			 u32 *quote_len)
> +{
> +	void *quote_dup = NULL;
> +	u64 r, out_len;
> +
> +	if (!tdx_quote_enabled())
> +		return NULL;
> +
> +	/* TDH.QUOTE.GET expects the input data to fit in a page */
> +	if (in_data_len > PAGE_SIZE)
> +		return NULL;

Do we really need this check? We can't trust the caller to pass the right size?

> +
> +	mutex_lock(&tdx_quote_lock);
> +
> +	/*
> +	 * Use the first page of the quote buffer for input data. The buffer
> +	 * must be at least one page in size. @in_data may not be page-aligned,
> +	 * but TDH.QUOTE.GET expects page-aligned addresses.
> +	 */
> +	memcpy(quote_data.buf, in_data, (size_t)in_data_len);
> +
> +	r = tdx_quote_get(td, quote_data.hpa_list[0], (u64)in_data_len,
> +			  quote_data.hpa_list_pa, quote_data.buf_len, &out_len);
> +	if (r || !out_len || out_len > quote_data.buf_len)


How do these various error conditions happen?

> +		goto out;
> +
> +	/*
> +	 * The quote buffer is a shared resource, so use it only for the
> +	 * SEAMCALL and copy the data out as soon as possible.
> +	 */
> +	quote_dup = kvmemdup(quote_data.buf, out_len, GFP_KERNEL);

So at init time we allocate a vmalloc for the quote and pre-populate the
hpa_list. Then we use it every time and copy the contents to a new vmalloc.
Would it really be that hard to keep the hpa list allocation around, do a
vmalloc here and update the pfn list. Then do get quote on that and pass back
the vmalloc we just allocated? Just feels like global reuse way has extra pieces
in it. Compared to the whole quoting operation, this vmalloc_to_pfn() loop is
probably not very expensive.

> +
> +out:
> +	mutex_unlock(&tdx_quote_lock);
> +
> +	*quote_len = (u32)out_len;
> +
> +	return quote_dup;
> +}
> +EXPORT_SYMBOL_FOR_KVM(tdx_quote_generate);
> +


^ permalink raw reply

* Re: [RFC PATCH 07/15] x86/virt/tdx: Prepare Quote buffer during extension bringup
From: Edgecombe, Rick P @ 2026-05-28 22:30 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-8-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> The host uses a Quote buffer to communicate with the TDX module when
> generating Quotes.
> 

Can this be put in common terms. This is going to mean nothing to someone
reading this that doesn't already know the feature.

>  Because the Quote buffer is shared with TDX guests,

Why capitalize "Quote"?

> prepare the required metadata during Quoting extension bringup.

What does prepare the required metadata mean?

How does it being shared with TDX guest suggest this? Just that TDX guests will
need them? Is the reason just that only one is needed, so do it during global
init? 

> 
> This mostly involves determining the physical addresses of the Quote
> buffer pages and arranging them in the HPA_LINKED_LIST format defined by
> the Intel TDX Module ABI specification.
> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.c | 85 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index fb84fb6d952b..9d04293394d7 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -32,6 +32,7 @@
>  #include <linux/idr.h>
>  #include <linux/kvm_types.h>
>  #include <linux/bitfield.h>
> +#include <linux/vmalloc.h>
>  #include <asm/page.h>
>  #include <asm/special_insns.h>
>  #include <asm/msr-index.h>
> @@ -61,6 +62,13 @@ static LIST_HEAD(tdx_memlist);
>  static struct tdx_sys_info tdx_sysinfo __ro_after_init;
>  static bool tdx_module_initialized __ro_after_init;
>  
> +static struct quote_data {
> +	void *buf;
> +	u64 buf_len;
> +	u64 *hpa_list;
> +	phys_addr_t hpa_list_pa;
> +} quote_data;

Hmm, I think this should separate the type and variable declaration. It's not a
common pattern. I don't think there is an official rule.

> +
>  typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>  
>  static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
> @@ -1205,9 +1213,78 @@ static inline u64 tdx_tdr_pa(struct tdx_td *td)
>  	return page_to_phys(td->tdr_page);
>  }
>  
> +#define HPAS_PER_PAGE			(PAGE_SIZE / sizeof(u64))
> +
> +static int tdx_quote_create_buf(unsigned int nr_pages, struct quote_data *qdata)
> +{
> +	unsigned long pfn;
> +	u64 qlist_npages;
> +	int err, i, j;
> +	u64 *qlist;
> +	void *qbuf;
> +
> +	if (!nr_pages)
> +		return -EINVAL;
> +
> +	/* The last entry of a linked list page points to the next page	*/
> +	qlist_npages = (u64)DIV_ROUND_UP(nr_pages, HPAS_PER_PAGE - 1);
> +
> +	qlist = vmalloc_array(qlist_npages, PAGE_SIZE);
> +	if (!qlist) {
> +		err = -ENOMEM;
> +		goto out_err;

Just return ENOMEM here. vfree() doesn't do any work if passed NULL, but it's
weird flow.

> +	}
> +
> +	/*
> +	 * Make sure unfilled entries are always -1, which means NULL in TDX.

Huh?

> +	 * Only the last page needs to be filled. All the other pages will be
> +	 * fully populated.
> +	 */
> +	memset((u8 *)qlist + (qlist_npages - 1) * PAGE_SIZE, 0xff, PAGE_SIZE);

What are the entries? And what is a -1 in u8? Or is it supposed to be u64?
Please make this a lot clearer.

> +
> +	qbuf = vcalloc(nr_pages, PAGE_SIZE);
> +	if (!qbuf) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	/* Populate HPA_LINKED_LIST as per TDX ABI spec */
> +	for (i = 0, j = 0; j < nr_pages; i++) {
> +		if ((i % HPAS_PER_PAGE) == HPAS_PER_PAGE - 1) {
> +			/*
> +			 * The last entry always points to the next page. The
> +			 * address of the following entry must be on next page's
> +			 * boundary.
> +			 */

Can you maybe just explain this format that you are building in like one
sentence at the beginning of the function? "The quote buffer is passed to the
tdx module in a format that like... (some common terms that have no TDX
jargon)."

> +			pfn = vmalloc_to_pfn(&qlist[i + 1]);
> +			qlist[i] = PFN_PHYS(pfn);
> +			continue;
> +		}
> +
> +		pfn = vmalloc_to_pfn((u8 *)qbuf + j * PAGE_SIZE);
> +		qlist[i] = PFN_PHYS(pfn);
> +		j++;
> +	}
> +
> +	qdata->buf = qbuf;
> +	qdata->buf_len = (u64)nr_pages * PAGE_SIZE;
> +	qdata->hpa_list = qlist;
> +
> +	pfn = vmalloc_to_pfn(qlist);

Do we need a vmalloc_to_pa() helper? Maybe put it in terms of tdx format. Like
vmalloc_pfn_to_tdxpa() and keep it here? The tdx update stuff does this a bunch
too.

> +	qdata->hpa_list_pa = PFN_PHYS(pfn);
> +
> +	return 0;
> +
> +out_err:
> +	vfree(qlist);
> +
> +	return err;

It only returns -ENOMEM, so do we need the err var?

> +}
> +
>  static void tdx_quote_init(void)
>  {
>  	struct tdx_module_args args = {};
> +	unsigned int nr_quote_pages;
>  	u64 r;
>  
>  	do {
> @@ -1218,7 +1295,13 @@ static void tdx_quote_init(void)
>  		return;
>  
>  	/* Quoting metadata is valid only after initialization */
> -	get_tdx_sys_info_quote(&tdx_sysinfo.quote);
> +	if (get_tdx_sys_info_quote(&tdx_sysinfo.quote))
> +		return;

How come this patch gets error handling? Why is it needed now when it wasn't
before?

> +
> +	nr_quote_pages = PAGE_ALIGN(tdx_sysinfo.quote.max_quote_size) /
> +			 PAGE_SIZE;
> +	if (tdx_quote_create_buf(nr_quote_pages, &quote_data))
> +		pr_err("Failed to create quote buffer\n");

Err... what happens in ENOMEM scenario? NULL pointer later?

>  }
>  
>  /* Initialize the TDX Module Extensions then Extension-SEAMCALLs can be used */


^ permalink raw reply

* Re: [RFC PATCH 06/15] x86/virt/tdx: Initialize Quoting extension during bringup
From: Edgecombe, Rick P @ 2026-05-28 21:35 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-7-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> Initialize the Quoting extension and fetch its metadata during TDX
> bringup.
> 
> Because Quoting is an optional TDX feature, do not let its
> initialization failures cause TDX bringup to fail.
> 
> This patch
> 

Don't say "this patch" in tip logs. The patch is a temporary format, and some
x86 maintainers hate the term in logs.

>  does not include the opt-in portion of the initialization.
> It mainly lays the groundwork for TDX Quoting support. Opt-in will be
> added in a follow-up patch once the feature can be properly used by the
> system.

This could be imperative mood.

> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>


^ permalink raw reply

* Re: [RFC PATCH 05/15] x86/virt/tdx: Move tdx_tdr_pa() up in the file
From: Edgecombe, Rick P @ 2026-05-28 21:32 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-6-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> Move the tdx_tdr_pa() in preparation for upcoming changes to use them
> during TDX bringup.
> 
> No functional change intended.
> 
> Signed-off-by: Peter Fang <peter.fang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply

* Re: [PATCH 04/15] x86/virt/tdx: Enable the Extensions right after basic TDX Module init
From: Edgecombe, Rick P @ 2026-05-28 21:32 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-5-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> The detailed initialization flow for TDX Module Extensions has been
> fully implemented.
> 

I'm not sure what this means exactly. Why "detailed". Is that important?

>  Enable the flow after basic TDX Module
> initialization.
> 
> Theoretically, the Extensions doesn't need to be enabled right after
> basic TDX initialization. It could be enabled right before the first
> Extension SEAMCALL is issued. That would save or postpone memory usage.
> But it isn't worth the complexity, the needs for the Extensions are vast
> but the savings are little for a typical TDX capable system (about
> 0.001% of memory). So the Linux decision is to just enable it along with
> the basic TDX.

The Linux decision is whatever this patch turns out to be after community
review. So for the patch log we just need to justify why it's a good idea, not
not make an argument to defer to authority.

> 
> Note that the Extensions initialization flow will still not start if no
> add-on features require Extensions. The enabling of add-on features will
> be in later patches. Until then, the system hasn't consumed extra memory.

Hmm, this patch reads like we are finally doing the initialization up until this
point. Then it turns out we don't actually light up the new code yet... 

A lot of this diff is adding __init to the function added in the earlier
patches. Do we need to do this? Why not add them as __init in the original
patches?


I think we maybe want to say instead that we are setting up to enable extensions
at TDX module init time, and do the explanation of why. Then without the __init
stuff, the patch is just about the init time decision. Which seems about right
sized.

^ permalink raw reply

* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Edgecombe, Rick P @ 2026-05-28 21:17 UTC (permalink / raw)
  To: kas@kernel.org, yilun.xu@linux.intel.com
  Cc: Xu, Yilun, x86@kernel.org, baolu.lu@linux.intel.com, Li, Xiaoyao,
	djbw@kernel.org, linux-kernel@vger.kernel.org, Duan, Zhenzhong,
	Mehta, Sohil, kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	Fang, Peter
In-Reply-To: <ahfD1KMYnTrXJziq@yilunxu-OptiPlex-7050>

On Thu, 2026-05-28 at 12:25 +0800, Xu Yilun wrote:
> > > 
> > > If I read the TDX module base spec correctly, the amount of memory for
> > > extensions and EXT_REQUIRED field depends on the enabled features, which
> > > is
> > > determined by TDH.SYS.CONFIG/TDH.SYS.UPDATE ?
> 
> Yes.
> 
> > 
> > This is my read too. Looks like we need a separate step after
> > config_tdx_module() to readout config-dependatant metadata.
> 
> 
> The timing for when metadata becomes valid is now variable, e.g., the
> TDX QUOTING metadata is only valid after TDH.QUOTE.INIT [1].
> 
> Based on recent discussion, I think we should introduce runtime metadata
> reading interfaces for specific metadata sets as needed, rather than
> another catch-all step right after config_tdx_module(). See [2] for the
> proposed approach for Extensions metadata.
> 
> [1]:
> https://lore.kernel.org/all/20260522034128.3144354-7-yilun.xu@linux.intel.com/
> [2]: https://lore.kernel.org/all/ahXAL41ZmIDHmgfu@yilunxu-OptiPlex-7050/

Yea It is going to get confusing as to which metadata is populated at which
step. And if anything updates it.

I'm not sure we need to have all the metadata stored permanently. Some of the
metadata is needed for KVM and someday TSM. But a lot of it is onetime internal
use. There is some handiness in referring to a global var, but also those
reference add confusion as to when it got populated.

We only use ext_required, max_quote_size and memory_pool_required_pages each
once. So why not just read them to the stack and leave them out of struct
tdx_sys_info? Making it so there is not confusion of when it was read. And also
saving a global var that is never used again is a bit wrong.

How about for struct tdx_sys_info_ext read it to the stack in init_tdx_ext() and
pass it into init_tdx_ext_features(). For max_quote_size read it where it is
already read, but not into the global struct.

Do you see a problem?

^ permalink raw reply

* Re: [PATCH 01/15] x86/virt/tdx: Read global metadata for TDX Module Extensions
From: Edgecombe, Rick P @ 2026-05-28 21:00 UTC (permalink / raw)
  To: Fang, Peter, kas@kernel.org, djbw@kernel.org,
	yilun.xu@linux.intel.com, x86@kernel.org
  Cc: Xu, Yilun, Duan, Zhenzhong, baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260522034128.3144354-2-yilun.xu@linux.intel.com>

On Fri, 2026-05-22 at 11:41 +0800, Xu Yilun wrote:
> +struct tdx_sys_info_ext {
> +	u16 memory_pool_required_pages;

> +	u8 ext_required;

The docs say this is a bool.

> +};
> +


^ permalink raw reply

* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: David Laight @ 2026-05-28 19:58 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Rick Edgecombe,
	Kuppuswamy Sathyanarayanan, Kai Huang, Sean Christopherson,
	Borys Tsyrulnikov, linux-kernel, linux-coco, kvm, stable
In-Reply-To: <ahgUBLjBRGhxULu3@thinkstation>

On Thu, 28 May 2026 11:14:38 +0100
Kiryl Shutsemau <kas@kernel.org> wrote:

> On Wed, May 27, 2026 at 10:45:28AM -0700, Dave Hansen wrote:
> > On 5/27/26 05:05, Kiryl Shutsemau (Meta) wrote:
> > ...  
> > > -	/* Update part of the register affected by the emulated instruction */
> > > -	regs->ax &= ~mask;
> > > +	/*
> > > +	 * IN writes the result into a sub-register of RAX. Only the
> > > +	 * 32-bit form zero-extends; the smaller forms leave the upper
> > > +	 * bits untouched:
> > > +	 *
> > > +	 *   insn  dest  size  bits written     bits preserved
> > > +	 *   inb   AL    1     RAX[ 7: 0]       RAX[63: 8]
> > > +	 *   inw   AX    2     RAX[15: 0]       RAX[63:16]
> > > +	 *   inl   EAX   4     RAX[63: 0]       (none, zero-extended)
> > > +	 *
> > > +	 * 'mask' only covers the low 'size' bytes, which is exactly the
> > > +	 * range affected for size 1 and 2. For size 4 the write also
> > > +	 * clears RAX[63:32], so widen the clear-mask.
> > > +	 */
> > > +	if (size == 4)
> > > +		regs->ax = 0;
> > > +	else
> > > +		regs->ax &= ~mask;
> > > +  
> > 
> > Is there any way we could do this with fewer comments and more code?
> > 
> > I mean, there's only three cases. Why have;
> > 
> > 	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
> > 
> > When there are only 3 possible cases:
> > 
> > 	1 => 0xf
> > 	2 => 0xff
> > 	4 => 0xffff
> > 
> > and one of those cases needs a special case on top of it.
> > 
> > Maybe something like this?
> > 
> > 	/* Clear out part of RAX so part of args.r11 can be OR'd in: */
> > 	switch (size) {
> > 	case 1:
> > 		/* inb consumes lower 8 bits of r11: */
> > 		regs->ax &= ~GENMASK_ULL(7, 0);
> > 		args.r11 &=  GENMASK_ULL(7, 0);
> > 		break;
> > 	case 2:
> > 		/* inw consumes lower 16 bits of r11: */
> > 		regs->ax &= ~GENMASK_ULL(15, 0);
> > 		args.r11 &=  GENMASK_ULL(15, 0);
> > 		break;
> > 	case 4:
> > 		/* inl is weird and zeros the whole register: */
> > 		regs->ax &= ~GENMASK_ULL(63, 0);
> > 		/* But only consumes 32-bits from r11: */
> > 		args.r11 &=  GENMASK_ULL(31, 0);
> > 		break;
> > 	default:
> > 		/* Probable TDX module bug. Illegal in[bwl] size: */
> > 		WARN_ON_ONCE(1);
> > 		success = 0;
> > 	}
> > 
> > 	if (success)
> > 		regs->ax |= args.r11;
> > 
> > It might need a temporary variable for args.r11, but you get the point.
> > That's basically the data from the comment but written as code.  
> 
> I hate how verbose it is. All these GENMASK_ULL() make it hard to
> follow.
> 
> What about the patch below. Inspired by kvm's assign_register().
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 65119362f9a2..460b9fbabf14 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -693,8 +693,8 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  		.r13 = PORT_READ,
>  		.r14 = port,
>  	};
> -	u64 mask = GENMASK(BITS_PER_BYTE * size - 1, 0);
>  	bool success;
> +	u32 val;
>  
>  	/*
>  	 * Emulate the I/O read via hypercall. More info about ABI can be found
> @@ -703,10 +703,33 @@ static bool handle_in(struct pt_regs *regs, int size, int port)
>  	 */
>  	success = !__tdx_hypercall(&args);
>  
> -	/* Update part of the register affected by the emulated instruction */
> -	regs->ax &= ~mask;
>  	if (success)
> -		regs->ax |= args.r11 & mask;
> +		val = args.r11;
> +	else
> +		val = 0;
> +
> +	/*
> +	 * IN writes the result into a sub-register of RAX.
> +	 *
> +	 * Only the 32-bit form zero-extends; the smaller forms leave
> +	 * the upper bits untouched.
> +	 */
> +	switch (size) {
> +	case 1:
> +		*(u8 *)&regs->ax = (u8)val;
> +		break;
> +	case 2:
> +		*(u16 *)&regs->ax = (u16)val;
> +		break;
> +	case 4:
> +		/* zero-extended */
> +		regs->ax = val;
> +		break;
> +	default:
> +		/* Probable TDX module bug. Illegal in[bwl] size. */
> +		WARN_ON_ONCE(1);
> +		break;
> +	}

Just write it as normal arithmetic code:

	/* IN writes the result into a sub-register of RAX. */
	switch (size) {
	case 1:
		regs->ax = (regs->ax & ~0xfful) | (val & 0xff);
		break;
	case 2:
		regs->ax = (regs->ax & ~0xfffful) | (val & 0xffff);
		break;
	case 4:
	default:
		/* 32bit 'INB' will zero the high bits. */
		regs->ax = val
		break;
	}

Succinct, obvious and readable.

-- David


>  
>  	return success;
>  }


^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Kalra, Ashish @ 2026-05-28 19:55 UTC (permalink / raw)
  To: Dave Hansen, Borislav Petkov
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <c40dcb8c-5706-4c0f-ac85-c22957b9e192@intel.com>

Hello Dave,

On 5/28/2026 2:50 PM, Dave Hansen wrote:
> On 5/28/26 12:37, Kalra, Ashish wrote:
>> A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:
>>
>>   for_each_cpu(cpu, &rmpopt_cpumask)
>>       wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
> 
> I'm glad we're on the same page finally. I just hope we can get to this
> point more quickly next time. I started off with exactly this
> suggestion, but someone chimed in to the thread and said it was "slower":
> 
>> https://lore.kernel.org/lkml/6a50d050-f602-43fd-a44a-cecedd9823eb@amd.com/
> 

Yes, actually i should have made it explicitly clear that we need to do it in
parallel especially for issuing the RMPOPT instruction itself, as that is
in a performance critical path (and for that we are using on_each_cpu_mask()).

Thanks,
Ashish

^ permalink raw reply

* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Sohil Mehta @ 2026-05-28 19:50 UTC (permalink / raw)
  To: Xu Yilun
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, yilun.xu, baolu.lu, zhenzhong.duan, xiaoyao.li
In-Reply-To: <ahfJ9MW+kJp+kE6A@yilunxu-OptiPlex-7050>

On 5/27/2026 9:52 PM, Xu Yilun wrote:

> No the memory needed varies depends on the feature or the number of
> features. But currently I see the total requirement is ~50MB.
> 
This is important consideration when defining the default policy. Could
you please elaborate on how this will scale in the future?

How are the memory requirements expected to grow with additional features?

Let's say a future platform has a lot more features and needs
significantly more memory. Wouldn't loading a legacy kernel with this
default policy lead to excessive wastage?

Maybe I am missing something obvious. The struct in patch 1,
memory_pool_required_pages is u16. So, will the Extensions support never
require more than 256MB?

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Dave Hansen @ 2026-05-28 19:50 UTC (permalink / raw)
  To: Kalra, Ashish, Borislav Petkov
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <2d164e19-5cc6-47ca-9150-f4d432dd10c4@amd.com>

On 5/28/26 12:37, Kalra, Ashish wrote:
> A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:
> 
>   for_each_cpu(cpu, &rmpopt_cpumask)
>       wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);

I'm glad we're on the same page finally. I just hope we can get to this
point more quickly next time. I started off with exactly this
suggestion, but someone chimed in to the thread and said it was "slower":

> https://lore.kernel.org/lkml/6a50d050-f602-43fd-a44a-cecedd9823eb@amd.com/


^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Kalra, Ashish @ 2026-05-28 19:37 UTC (permalink / raw)
  To: Borislav Petkov, Dave Hansen
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260528004332.GDahePtGqVp2boiEJL@fat_crate.local>

Hello Boris and Dave,

On 5/27/2026 7:43 PM, Borislav Petkov wrote:
> On Wed, May 27, 2026 at 02:38:05PM -0700, Dave Hansen wrote:
>> This one is my doing.
> 
> I know.
> 
> But hey, maybe we should not disagree on the public ML because the submitter
> might disappear like the last one. :-P
> 
>> wrmsr_on_cpus() is kinda a mess. I think it only has a single user. It's
>> also not very flexible because it needs a 'struct msr __percpu *msrs'
>> argument where each MSR has a value in memory.
> 
> Right, we did that a looong time ago.
> 
> The only reason I'd have for per-CPU MSR structs is reading different MSR
> values on different cores, modifying only the bits you need and then *keeping*
> the remaining values as they were. And that interface allows you to do that
> while this new thing won't.
> 
> And I'm going to venture a guess here that adding a simpler interface which
> simply forces a new value ontop of a whole MSR could cause a lot of subtle
> bugs when people don't pay attention to keep the old values.
> 
>> The use case for RMPOPT is that all CPUs get the same value. It'd be a
>> little awkward to go create a percpu data structure to duplcate the same
>> value to call wrmsr_on_cpus(). The RMPOPT case is also arguably
>> performance sensitive since it's done during boot. It should do the IPIs
>> in parallel.
> 
> Oh sure, my meaning was to create something that serves both purposes.
> 
>> toggle_ecc_err_reporting(), on the other hand, is done at module init
>> time. It's not really performance sensitive. It's probably pretty easy
>> to zap wrmsr_on_cpus() and just have toggle_ecc_err_reporting() do
>> something slightly less efficient.
> 
> Sure. That's fine.
> 
>> Yeah, the
>>
>> 	wrmsr_on_cpus()
>> 	wrmsrq_on_cpus()
>>
>> naming pain is real. There's little chance of bugs coming from it
>> because the function signatures are *SO* different. But, it certainly
>> could confuse humans for a minute.
> 
> Yap.
> 
>> But the real solution to this is axing wrmsr_on_cpus(). 
> 
> Yap, for example. Basically reingeneering the whole
> write-MSRs-on-multiple-CPUs functionality is what I meant.
> 
>> Which I think we could do after killing its one user which the attached
>> (completely untested) patch does. The only downside of the patch is that it
>> does RDMSR via IPIs one CPU at a time. But, looking at the code, I'm not
>> sure anyone would care. If anyone did, I _think_ all those MSRs have the
>> same value and the code could be simplified further. But that would take
>> more than 3 minutes.
>>
>> It's also possible that my grepping was bad or I'm completely
>> misunderstanding amd64_edac.c. Cluebat welcome if I'm being dense.
> 
> Looks ok to me, we can surely do that. I even hw to test it. I think...
> 
>> BTW, I also don't feel the need to make Ashish go do any of this edac
>> cleanup. I think it can just be done in parallel. But I wouldn't stop
>> him if he volunteered.
> 
> Why not?
> 
> It has always been the case: cleanups and bug fixes first, new features ontop.
> 
> So yeah, modulo figuring out how to redefine the *msr_on_cpus() interface,
> I think this all makes sense.

snp_setup_rmpopt() runs once during init and rmpopt_cleanup() runs once during shutdown. The batch IPI optimization
is irrelevant here. This RMPOPT_BASE MSR setup/programming is not in a performance critical path.

A simple loop would be perfectly fine and avoids the need for the wrmsrq_on_cpus() helper entirely:

  for_each_cpu(cpu, &rmpopt_cpumask)
      wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);

Calling wrmsrq_on_cpus() here for programming RMPOPT_BASE MSR:

-       wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+       for_each_cpu(cpu, &rmpopt_cpumask)
+               wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);

So i will drop this helper patch.

Thanks,
Ashish

> 
> Thx.
> 

^ permalink raw reply

* Re: [PATCH v13 07/22] KVM: selftests: Introduce structures for TDX guest boot parameters
From: Yosry Ahmed @ 2026-05-28 19:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
	Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <CAO9r8zMaiGL8v=f72EAwWbwofoUHOkH8r6Se22k2TVxnUCQLOQ@mail.gmail.com>

On Fri, May 22, 2026 at 04:50:07PM -0700, Yosry Ahmed wrote:
> > > Sean, is this the preferred way to expose offsets to asm files (or asm
> > > code blocks) -- as opposed to say using .equ [*]?
> >
> > For actual .S assembly, yes.  For inline asm, maybe?  If it looks prettier, go
> > for it.
> >
> > > If yes, I can rework my nVMX GPR fixes to use the same approach for
> > > register offsets. I wonder if the non-TDX part of this patch (i.e.
> > > Makefile stuff) can be split, then patch 6 and the Makefile stuff can
> > > land independently and allow development on top.
> > >
> > > I can also split them out and include them in the next version of my
> > > series, then whichever series lands first will land the offsets
> > > support.
> > >
> > > WDYT?
> >
> > Hmm, I'd say keep your series as-is for now.  The OFFSET() infrastructure really
> > shines for proper assembly.  For what you're doing, AFAICT it's only marginally
> > better.  So I don't think it's worth juggling dependencies to use it right away,
> > we can always convert if/when the TDX series lands the fancy stuff.
> 
> Ack. We can do the switch later like you say.

I take this back. My series builds with the internal toolchain, but not
when I just use make with LLVM. Probably different compiler versions or
build options, but the fact the .equ thing doesn't always work means I
can't use it.

I would paste the error here, but the compiler literally spits out
incomprehensible garbage.

Lisa, if you will send a new version of this series for other reasons,
do you mind splitting out the non-TDX parts of this patch? Ideally we'd
have 1-2 patches that introduce the OFFSET() infrastructure without any
TDX parts, which should make it easier to pick up separately or include
with other series.

If a new version won't be needed anyway, I will just wait for this to
land before refreshing my series on top.

^ permalink raw reply

* RE: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Michael Kelley @ 2026-05-28 18:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm), iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <20260522042815.370873-6-aneesh.kumar@kernel.org>

From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>Sent: Thursday, May 21, 2026 9:28 PM
> 
> Teach the atomic DMA pool code to distinguish between encrypted and
> unencrypted pools, and make pool allocation select the matching pool based
> on DMA attributes.
> 
> Introduce a dma_gen_pool wrapper that records whether a pool is
> unencrypted, initialize that state when the atomic pools are created, and
> use it when expanding and resizing the pools. Update dma_alloc_from_pool()
> to take attrs and skip pools whose encrypted state does not match
> DMA_ATTR_CC_SHARED. Update dma_free_from_pool() accordingly.
> 
> Also pass DMA_ATTR_CC_SHARED from the swiotlb atomic allocation path so
> decrypted swiotlb allocations are taken from the correct atomic pool.
> 
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Reviewed-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
>  drivers/iommu/dma-iommu.c   |   2 +-
>  include/linux/dma-map-ops.h |   2 +-
>  kernel/dma/direct.c         |  11 ++-
>  kernel/dma/pool.c           | 167 +++++++++++++++++++++++-------------
>  kernel/dma/swiotlb.c        |   7 +-
>  5 files changed, 123 insertions(+), 66 deletions(-)
>

[snip]
 
> +static __init struct dma_gen_pool *__dma_atomic_pool_init(struct dma_gen_pool *dma_pool,
> +		size_t pool_size, gfp_t gfp)
>  {
> -	struct gen_pool *pool;
>  	int ret;
> 
> -	pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> -	if (!pool)
> +	dma_pool->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> +	if (!dma_pool->pool)
>  		return NULL;
> 
> -	gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
> +	gen_pool_set_algo(dma_pool->pool, gen_pool_first_fit_order_align, NULL);
> +
> +	/* if platform is using memory encryption atomic pools are by default decrypted. */
> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> +		dma_pool->unencrypted = true;
> +	else
> +		dma_pool->unencrypted = false;

I'm curious about the name of the "unencrypted" field in struct dma_gen_pool,
and similarly in Patch 7 of the series for the swiotlb struct io_tlb_pool and
struct io_tlb_mem. Up through v3 of this series, you used "decrypted", but
starting in v4 switched to "unencrypted".

To me, the above "if" statement has some cognitive dissonance in that if
CC_ATTR_MEM_ENCRYPT is false (i.e., a normal VM), "unencrypted" is set
to false. But I think of memory in a normal VM as "unencrypted" since it
was never encrypted. A similar "if" statement occurs in your swiotlb changes.

Two related concepts are captured by the field:
1) Is some action needed to put the memory into the unencrypted state,
and to remove it from that state? This applies when assigning memory to the
pool, or freeing the memory in the pool.
2) Is the memory currently in the unencrypted state? This applies when
allocating memory from the pool to a caller.

It's hard to capture all that in a short field name. But I think I prefer "decrypted"
over "unencrypted".  The former implies that some action was taken. It's a
little easier to think of a normal VM as *not* having decrypted memory. The
memory was never encrypted in the first place, so no decryption action was taken.

Throughout the kernel, "decrypted" occurs much more frequently than
"unencrypted".  We have set_memory_encrypted() and set_memory_decrypted()
that are "take action" names.  But we also have force_dma_unencrypted(),
phys_to_dma_unencrypted(), and dma_addr_unencrypted(). So it's a bit
of a mess.

But maybe there's more background here that led to the change
between your v3 and v4.

Michael

^ permalink raw reply

* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Dave Hansen @ 2026-05-28 17:25 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
	Kai Huang, Sean Christopherson, Borys Tsyrulnikov, linux-kernel,
	linux-coco, kvm, stable
In-Reply-To: <ahgUBLjBRGhxULu3@thinkstation>

On 5/28/26 03:14, Kiryl Shutsemau wrote:
> What about the patch below. Inspired by kvm's assign_register().

I think I could stand this if it consolidated this site with kvm's
assign_register(). The copy/paste is too much to bear.



^ permalink raw reply

* Re: [PATCH v3 2/2] x86/tdx: Fix zero-extension for 32-bit port I/O
From: Dave Hansen @ 2026-05-28 16:43 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
	Kai Huang, Sean Christopherson, Borys Tsyrulnikov, linux-kernel,
	linux-coco, kvm, stable
In-Reply-To: <ahgUBLjBRGhxULu3@thinkstation>

On 5/28/26 03:14, Kiryl Shutsemau wrote:
> +	switch (size) {
> +	case 1:
> +		*(u8 *)&regs->ax = (u8)val;
> +		break;
> +	case 2:
> +		*(u16 *)&regs->ax = (u16)val;
> +		break;

Is this intentionally clever to only work on little endian, or
accidentally clever? This seems like a great IOCCC thing to do, but it's
far too clever for my taste.

I mean, it's making a pointer to a 64-bit value with an 8-bit name and
casting that to an 8-bit pointer and then assigning that to a 32-bit
value cast to a u8.

Is it just my tiny brain that thinks this will be unintelligible on Monday?

How about we just make the CPU do the thinking for us? IN[BWL] and
MOV[BWL] have the same semantics here, right? So even if 'rax' and 'val'
are 64-bit values here, the following should have all the right
behaviors, I think.

I generally loathe inline assembly. But we have a CPU that kinda knows
the rules already. No need for us to laboriously reimplement it. Right?

Thanks to the friendly LLM that knows inline assembly better than I do.
The resulting compiled assembly looks right to me.

/*
 * Use MOV[BWL] to/from registers to match the IN[BWL] behavior
 * including the fact that INL zeros the upper 64-bits while
 * IN[BW] don't zero anything.
 */

    switch (size) {
    case 1:
	// Just write 1 byte of RAX:
        __asm__ volatile ("movb %b1, %b0" : "+q"(rax)
                                          : "q"(val));
        break;
    case 2:
	// Write 2 bytes of RAX:
        __asm__ volatile ("movw %w1, %w0" : "+r"(rax)
                                          : "r"(val));
        break;
    case 4:
	// Write 'val' into lower 32 bits. Zero the upper 32 bits:
        __asm__ volatile ("movl %k1, %k0" : "=r"(rax)
                                          : "r"(val));
        break;
    default:
	// WARN
    }

Thoughts?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox