Linux Confidential Computing Development
 help / color / mirror / Atom feed
* [PATCH v4 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Ashish Kalra @ 2026-04-13 19:42 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The existing wrmsr_on_cpus() takes a per-cpu struct msr array, requiring
callers to allocate and populate per-cpu storage even when every CPU
receives the same value. This is unnecessary overhead for the common
case of writing a single uniform u64 to a per-CPU MSR across multiple
CPUs.

Add wrmsrq_on_cpus() which writes the same u64 value to the specified
MSR on all CPUs in the given cpumask.

Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/msr.h |  5 +++++
 arch/x86/lib/msr-smp.c     | 20 ++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 9c2ea29e12a9..f5f63b4115c8 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -260,6 +260,7 @@ int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
 int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
 int rdmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 *q);
 int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q);
+void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q);
 void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr __percpu *msrs);
 void wrmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr __percpu *msrs);
 int rdmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
@@ -289,6 +290,10 @@ static inline int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q)
 	wrmsrq(msr_no, q);
 	return 0;
 }
+static inline void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q)
+{
+	wrmsrq_on_cpu(0, msr_no, q);
+}
 static inline void rdmsr_on_cpus(const struct cpumask *m, u32 msr_no,
 				struct msr __percpu *msrs)
 {
diff --git a/arch/x86/lib/msr-smp.c b/arch/x86/lib/msr-smp.c
index b8f63419e6ae..d2c91c9bb47b 100644
--- a/arch/x86/lib/msr-smp.c
+++ b/arch/x86/lib/msr-smp.c
@@ -94,6 +94,26 @@ int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q)
 }
 EXPORT_SYMBOL(wrmsrq_on_cpu);
 
+void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q)
+{
+	struct msr_info rv;
+	int this_cpu;
+
+	memset(&rv, 0, sizeof(rv));
+
+	rv.msr_no = msr_no;
+	rv.reg.q = q;
+
+	this_cpu = get_cpu();
+
+	if (cpumask_test_cpu(this_cpu, mask))
+		__wrmsr_on_cpu(&rv);
+
+	smp_call_function_many(mask, __wrmsr_on_cpu, &rv, 1);
+	put_cpu();
+}
+EXPORT_SYMBOL(wrmsrq_on_cpus);
+
 static void __rwmsr_on_cpus(const struct cpumask *mask, u32 msr_no,
 			    struct msr __percpu *msrs,
 			    void (*msr_func) (void *info))
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 3/7] x86/sev: Initialize RMPOPT configuration MSRs
From: Ashish Kalra @ 2026-04-13 19:43 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The new RMPOPT instruction helps manage per-CPU RMP optimization
structures inside the CPU. It takes a 1GB-aligned physical address
and either returns the status of the optimizations or tries to enable
the optimizations.

Per-CPU RMPOPT tables support at most 2 TB of addressable memory for
RMP optimizations.

Initialize the per-CPU RMPOPT table base to the starting physical
address. This enables RMP optimization for up to 2 TB of system RAM on
all CPUs.

Additionally, add support to setup and enable RMPOPT once SNP is
enabled and initialized.

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/coco/core.c             |  1 +
 arch/x86/include/asm/msr-index.h |  3 +++
 arch/x86/include/asm/sev.h       |  2 ++
 arch/x86/virt/svm/sev.c          | 41 +++++++++++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.c     |  3 +++
 5 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 989ca9f72ba3..7fdef00ca8f2 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -172,6 +172,7 @@ static void amd_cc_platform_clear(enum cc_attr attr)
 	switch (attr) {
 	case CC_ATTR_HOST_SEV_SNP:
 		cc_flags.host_sev_snp = 0;
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 		break;
 	default:
 		break;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index be3e3cc963b2..9c8a6dfd7891 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -758,6 +758,9 @@
 #define MSR_AMD64_SEG_RMP_ENABLED_BIT	0
 #define MSR_AMD64_SEG_RMP_ENABLED	BIT_ULL(MSR_AMD64_SEG_RMP_ENABLED_BIT)
 #define MSR_AMD64_RMP_SEGMENT_SHIFT(x)	(((x) & GENMASK_ULL(13, 8)) >> 8)
+#define MSR_AMD64_RMPOPT_BASE		0xc0010139
+#define MSR_AMD64_RMPOPT_ENABLE_BIT	0
+#define MSR_AMD64_RMPOPT_ENABLE		BIT_ULL(MSR_AMD64_RMPOPT_ENABLE_BIT)
 
 #define MSR_SVSM_CAA			0xc001f000
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 09e605c85de4..409ab3372f7c 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 void snp_prepare(void);
+void snp_setup_rmpopt(void);
 void snp_shutdown(void);
 #else
 static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -680,6 +681,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline void snp_prepare(void) {}
+static inline void snp_setup_rmpopt(void) {}
 static inline void snp_shutdown(void) {}
 #endif
 
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 41f76f15caa1..4f942abaf86e 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -124,6 +124,9 @@ static void *rmp_bookkeeping __ro_after_init;
 
 static u64 probed_rmp_base, probed_rmp_size;
 
+static cpumask_t rmpopt_cpumask;
+static phys_addr_t rmpopt_pa_start;
+
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
@@ -488,9 +491,12 @@ static bool __init setup_segmented_rmptable(void)
 static bool __init setup_rmptable(void)
 {
 	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
-		if (!setup_segmented_rmptable())
+		if (!setup_segmented_rmptable()) {
+			setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 			return false;
+		}
 	} else {
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 		if (!setup_contiguous_rmptable())
 			return false;
 	}
@@ -554,6 +560,39 @@ void snp_shutdown(void)
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
 
+void snp_setup_rmpopt(void)
+{
+	u64 rmpopt_base;
+	int cpu;
+
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
+		return;
+
+	/*
+	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
+	 * setup RMPOPT_BASE MSR.
+	 */
+
+	for_each_online_cpu(cpu) {
+		if (!topology_is_primary_thread(cpu))
+			continue;
+
+		cpumask_set_cpu(cpu, &rmpopt_cpumask);
+	}
+
+	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
+
+	/*
+	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
+	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
+	 * to the starting physical address to enable RMP optimizations for
+	 * up to 2 TB of system RAM on all CPUs.
+	 */
+	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
+
 /*
  * Do the necessary preparations which are verified by the firmware as
  * described in the SNP_INIT_EX firmware command description in the SNP
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 939fa8aa155c..901395ad7d51 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1476,6 +1476,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
 	}
 
 	snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+
+	snp_setup_rmpopt();
+
 	sev->snp_initialized = true;
 	dev_dbg(sev->dev, "SEV-SNP firmware initialized, SEV-TIO is %s\n",
 		data.tio_en ? "enabled" : "disabled");
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ashish Kalra @ 2026-04-13 19:43 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

When SEV-SNP is enabled, all writes to memory are checked to ensure
integrity of SNP guest memory. This imposes performance overhead on the
whole system.

RMPOPT is a new instruction that minimizes the performance overhead of
RMP checks on the hypervisor and on non-SNP guests by allowing RMP
checks to be skipped for 1GB regions of memory that are known not to
contain any SEV-SNP guest memory.

Add support for performing RMP optimizations asynchronously using a
dedicated workqueue.

Enable RMPOPT optimizations globally for all system RAM up to 2TB at
RMP initialization time. RMP checks can initially be skipped for 1GB
memory ranges that do not contain SEV-SNP guest memory (excluding
preassigned pages such as the RMP table and firmware pages). As SNP
guests are launched, RMPUPDATE will disable the corresponding RMPOPT
optimizations.

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 117 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 115 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 4f942abaf86e..56c9fc3fe53a 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -19,6 +19,7 @@
 #include <linux/iommu.h>
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
+#include <linux/workqueue.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -125,7 +126,17 @@ static void *rmp_bookkeeping __ro_after_init;
 static u64 probed_rmp_base, probed_rmp_size;
 
 static cpumask_t rmpopt_cpumask;
-static phys_addr_t rmpopt_pa_start;
+static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
+
+enum rmpopt_function {
+	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
+	RMPOPT_FUNC_REPORT_STATUS
+};
+
+#define RMPOPT_WORK_TIMEOUT	10000
+
+static struct workqueue_struct *rmpopt_wq;
+static struct delayed_work rmpopt_delayed_work;
 
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -560,6 +571,83 @@ void snp_shutdown(void)
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
 
+static inline bool __rmpopt(u64 rax, u64 rcx)
+{
+	bool optimized;
+
+	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
+		     : "=@ccc" (optimized)
+		     : "a" (rax), "c" (rcx)
+		     : "memory", "cc");
+
+	return optimized;
+}
+
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_smp(void *val)
+{
+	u64 rax = ALIGN_DOWN((u64)val, SZ_1G);
+	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+	__rmpopt(rax, rcx);
+}
+
+static void rmpopt(u64 pa)
+{
+	u64 rax = ALIGN_DOWN(pa, SZ_1G);
+	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+	__rmpopt(rax, rcx);
+}
+
+/*
+ * RMPOPT optimizations skip RMP checks at 1GB granularity if this
+ * range of memory does not contain any SNP guest memory.
+ */
+static void rmpopt_work_handler(struct work_struct *work)
+{
+	bool current_cpu_cleared = false;
+	phys_addr_t pa;
+
+	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
+		rmpopt_pa_start, rmpopt_pa_end);
+
+	/*
+	 * RMPOPT scans the RMP table, stores the result of the scan in the
+	 * reserved processor memory. The RMP scan is the most expensive
+	 * part. If a second RMPOPT occurs, it can skip the expensive scan
+	 * if they can see a cached result in the reserved processor memory.
+	 *
+	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
+	 * on every other primary thread. This potentially allows the
+	 * followers to use the "cached" scan results to avoid repeating
+	 * full scans.
+	 */
+
+	if (cpumask_test_cpu(smp_processor_id(), &rmpopt_cpumask)) {
+		cpumask_clear_cpu(smp_processor_id(), &rmpopt_cpumask);
+		current_cpu_cleared = true;
+	}
+
+	/* current CPU */
+	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
+		rmpopt(pa);
+
+	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
+				 (void *)pa, true);
+
+		 /* Give a chance for other threads to run */
+		cond_resched();
+
+	}
+
+	if (current_cpu_cleared)
+		cpumask_set_cpu(smp_processor_id(), &rmpopt_cpumask);
+}
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
@@ -568,9 +656,20 @@ void snp_setup_rmpopt(void)
 	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
 		return;
 
+	/*
+	 * Create an RMPOPT-specific workqueue to avoid scheduling
+	 * RMPOPT workitem on the global system workqueue.
+	 */
+	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
+	if (!rmpopt_wq) {
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
+		return;
+	}
+
 	/*
 	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
-	 * setup RMPOPT_BASE MSR.
+	 * setup RMPOPT_BASE MSR. Additionally only one thread per core needs
+	 * to issue the RMPOPT instruction.
 	 */
 
 	for_each_online_cpu(cpu) {
@@ -590,6 +689,20 @@ void snp_setup_rmpopt(void)
 	 * up to 2 TB of system RAM on all CPUs.
 	 */
 	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+
+	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
+
+	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+
+	/* Limit memory scanning to the first 2 TB of RAM */
+	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T)
+		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
+
+	/*
+	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
+	 * optimizations on all physical memory.
+	 */
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 5/7] x86/sev: Add interface to re-enable RMP optimizations.
From: Ashish Kalra @ 2026-04-13 19:43 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

RMPOPT table is a per-CPU table which indicates if 1GB regions of
physical memory are entirely hypervisor-owned or not.

When performing host memory accesses in hypervisor mode as well as
non-SNP guest mode, the processor may consult the RMPOPT table to
potentially skip an RMP access and improve performance.

Events such as RMPUPDATE can clear RMP optimizations. Add an interface
to re-enable those optimizations.

Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |  2 ++
 arch/x86/virt/svm/sev.c    | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 409ab3372f7c..dc9086057060 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 void snp_prepare(void);
+void snp_rmpopt_all_physmem(void);
 void snp_setup_rmpopt(void);
 void snp_shutdown(void);
 #else
@@ -681,6 +682,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline void snp_prepare(void) {}
+static inline void snp_rmpopt_all_physmem(void) {}
 static inline void snp_setup_rmpopt(void) {}
 static inline void snp_shutdown(void) {}
 #endif
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 56c9fc3fe53a..74ba8ec9de35 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -648,6 +648,16 @@ static void rmpopt_work_handler(struct work_struct *work)
 		cpumask_set_cpu(smp_processor_id(), &rmpopt_cpumask);
 }
 
+void snp_rmpopt_all_physmem(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
+		return;
+
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
+			   msecs_to_jiffies(RMPOPT_WORK_TIMEOUT));
+}
+EXPORT_SYMBOL_GPL(snp_rmpopt_all_physmem);
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 6/7] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ashish Kalra @ 2026-04-13 19:44 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Pages are converted from shared to private as SNP guests are launched.
This destroys exisiting RMPOPT optimizations in the regions where
pages are converted.

Conversely, guest pages are converted back to shared during SNP guest
termination and their region may become eligible for RMPOPT
optimization.

To take advantage of this, perform RMPOPT after guest termination.
Do it after a delay so that a single RMPOPT pass can be done if
multiple guests terminate in a short period of time.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/kvm/svm/sev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 3f9c1aa39a0a..e0f4f8ebef68 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2942,6 +2942,8 @@ void sev_vm_destroy(struct kvm *kvm)
 	if (sev_snp_guest(kvm)) {
 		snp_guest_req_cleanup(kvm);
 
+		snp_rmpopt_all_physmem();
+
 		/*
 		 * Decomission handles unbinding of the ASID. If it fails for
 		 * some unexpected reason, just leak the ASID.
-- 
2.43.0


^ permalink raw reply related

* [PATCH v4 7/7] x86/sev: Add debugfs support for RMPOPT
From: Ashish Kalra @ 2026-04-13 19:44 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Add a debugfs interface to report per-CPU RMPOPT status across all
system RAM.

To dump the per-CPU RMPOPT status for all system RAM:

/sys/kernel/debug/rmpopt# cat rmpopt-table

Memory @  0GB: CPU(s): none
Memory @  1GB: CPU(s): none
Memory @  2GB: CPU(s): 0-1023
Memory @  3GB: CPU(s): 0-1023
Memory @  4GB: CPU(s): none
Memory @  5GB: CPU(s): 0-1023
Memory @  6GB: CPU(s): 0-1023
Memory @  7GB: CPU(s): 0-1023
...
Memory @1025GB: CPU(s): 0-1023
Memory @1026GB: CPU(s): 0-1023
Memory @1027GB: CPU(s): 0-1023
Memory @1028GB: CPU(s): 0-1023
Memory @1029GB: CPU(s): 0-1023
Memory @1030GB: CPU(s): 0-1023
Memory @1031GB: CPU(s): 0-1023
Memory @1032GB: CPU(s): 0-1023
Memory @1033GB: CPU(s): 0-1023
Memory @1034GB: CPU(s): 0-1023
Memory @1035GB: CPU(s): 0-1023
Memory @1036GB: CPU(s): 0-1023
Memory @1037GB: CPU(s): 0-1023
Memory @1038GB: CPU(s): none

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 107 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 74ba8ec9de35..dee2e853b4ad 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -20,6 +20,8 @@
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
 #include <linux/workqueue.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -143,6 +145,13 @@ static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
 static unsigned long snp_nr_leaked_pages;
 
+static cpumask_t rmpopt_report_cpumask;
+static struct dentry *rmpopt_debugfs;
+
+struct seq_paddr {
+	phys_addr_t next_seq_paddr;
+};
+
 #undef pr_fmt
 #define pr_fmt(fmt)	"SEV-SNP: " fmt
 
@@ -580,6 +589,8 @@ static inline bool __rmpopt(u64 rax, u64 rcx)
 		     : "a" (rax), "c" (rcx)
 		     : "memory", "cc");
 
+	assign_cpu(smp_processor_id(), &rmpopt_report_cpumask, optimized);
+
 	return optimized;
 }
 
@@ -602,6 +613,100 @@ static void rmpopt(u64 pa)
 	__rmpopt(rax, rcx);
 }
 
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_report_status(void *val)
+{
+	u64 rax = ALIGN_DOWN((u64)val, SZ_1G);
+	u64 rcx = RMPOPT_FUNC_REPORT_STATUS;
+
+	__rmpopt(rax, rcx);
+}
+
+/*
+ * start() can be called multiple times if allocated buffer has overflowed
+ * and bigger buffer is allocated.
+ */
+static void *rmpopt_table_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	phys_addr_t end_paddr = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+	struct seq_paddr *p = seq->private;
+
+	if (*pos == 0) {
+		p->next_seq_paddr = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+		return &p->next_seq_paddr;
+	}
+
+	if (p->next_seq_paddr == end_paddr)
+		return NULL;
+
+	return &p->next_seq_paddr;
+}
+
+static void *rmpopt_table_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	phys_addr_t end_paddr = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+	phys_addr_t *curr_paddr = v;
+
+	(*pos)++;
+	if (*curr_paddr == end_paddr)
+		return NULL;
+	*curr_paddr += SZ_1G;
+
+	return curr_paddr;
+}
+
+static void rmpopt_table_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static int rmpopt_table_seq_show(struct seq_file *seq, void *v)
+{
+	phys_addr_t *curr_paddr = v;
+
+	seq_printf(seq, "Memory @%3lluGB: ",
+		   *curr_paddr >> (get_order(SZ_1G) + PAGE_SHIFT));
+
+	cpumask_clear(&rmpopt_report_cpumask);
+	on_each_cpu_mask(cpu_online_mask, rmpopt_report_status,
+			 (void *)*curr_paddr, true);
+
+	if (cpumask_empty(&rmpopt_report_cpumask))
+		seq_puts(seq, "CPU(s): none\n");
+	else
+		seq_printf(seq, "CPU(s): %*pbl\n", cpumask_pr_args(&rmpopt_report_cpumask));
+
+	return 0;
+}
+
+static const struct seq_operations rmpopt_table_seq_ops = {
+	.start = rmpopt_table_seq_start,
+	.next = rmpopt_table_seq_next,
+	.stop = rmpopt_table_seq_stop,
+	.show = rmpopt_table_seq_show
+};
+
+static int rmpopt_table_open(struct inode *inode, struct file *file)
+{
+	return seq_open_private(file, &rmpopt_table_seq_ops, sizeof(struct seq_paddr));
+}
+
+static const struct file_operations rmpopt_table_fops = {
+	.open = rmpopt_table_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release_private,
+};
+
+static void rmpopt_debugfs_setup(void)
+{
+	rmpopt_debugfs = debugfs_create_dir("rmpopt", arch_debugfs_dir);
+
+	debugfs_create_file("rmpopt-table", 0444, rmpopt_debugfs,
+			    NULL, &rmpopt_table_fops);
+}
+
 /*
  * RMPOPT optimizations skip RMP checks at 1GB granularity if this
  * range of memory does not contain any SNP guest memory.
@@ -713,6 +818,8 @@ void snp_setup_rmpopt(void)
 	 * optimizations on all physical memory.
 	 */
 	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
+
+	rmpopt_debugfs_setup();
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v7 21/22] x86/virt/tdx: Document TDX module update
From: Edgecombe, Rick P @ 2026-04-13 19:54 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Gao, Chao
  Cc: corbet@lwn.net, Li, Xiaoyao, Huang, Kai, Zhao, Yan Y,
	dave.hansen@linux.intel.com, kas@kernel.org, seanjc@google.com,
	binbin.wu@linux.intel.com, pbonzini@redhat.com, Chatre, Reinette,
	Verma, Vishal L, nik.borisov@suse.com, mingo@redhat.com,
	Weiny, Ira, skhan@linuxfoundation.org,
	tony.lindgren@linux.intel.com, Annapurve, Vishal,
	sagis@google.com, hpa@zytor.com, tglx@kernel.org,
	paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
	dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-22-chao.gao@intel.com>

On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> Document TDX module update as a subsection of "TDX Host Kernel Support" to
> provide background information and cover key points that developers and
> users may need to know, for example:
> 
>  - update is done in stop_machine() context
>  - update instructions and results
>  - update policy and tooling
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> ---
> v5:
>  - use "update" when refer to the update feature/concept [Kai]

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>




^ permalink raw reply

* Re: [PATCH v7 22/22] x86/virt/seamldr: Log TDX module update failures
From: Edgecombe, Rick P @ 2026-04-13 20:04 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Gao, Chao
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
	pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal,
	sagis@google.com, hpa@zytor.com, tglx@kernel.org,
	paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
	dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-23-chao.gao@intel.com>

On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> Currently, there is no way to restore a TDX module from shutdown state to
> running state. This means if errors occur after a successful module
> shutdown, they are unrecoverable since the old module is gone but the new
> module isn't installed. All subsequent SEAMCALLs to the TDX module will
> fail, so TDs will be killed due to SEAMCALL failures.
> 
> Log a message to clarify that SEAMCALL errors are expected in this
> scenario. This ensures that after update failures, the first message in
> dmesg explains the situation rather than showing confusing call traces from
> various code paths.

Why is this patch at the end? Is it not valid initial behavior?

> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Acked-by: Kai Huang <kai.huang@intel.com>
> ---
> v4:
>  - Use pr_warn_once() instead of reinventing it [Yilun]
> v3:
>  - Rephrase the changelog to eliminate the confusing uses of 'i.e.' and 'e.g.'
>    [Dave/Yilun]
> ---
>  arch/x86/virt/vmx/tdx/seamldr.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index b3f3b40627c3..a972f9ba6877 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -252,6 +252,11 @@ static void ack_state(void)
>  		set_target_state(update_data.state + 1);
>  }
>  
> +static void print_update_failure_message(void)
> +{
> +	pr_err_once("update failed, SEAMCALLs will report failure until TDs killed\n");
> +}

Ehh, it seems ok. But is very KVM specific. I guess TDX connect will need to
update the message. The sys disable will fail too. But it does communicate the
main current actionable step for users.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

> +
>  /*
>   * See multi_cpu_stop() from where this multi-cpu state-machine was
>   * adopted, and the rationale for touch_nmi_watchdog().
> @@ -291,10 +296,13 @@ static int do_seamldr_install_module(void *seamldr_params)
>  				break;
>  			}
>  
> -			if (ret)
> +			if (ret) {
>  				WRITE_ONCE(update_data.failed, true);
> -			else
> +				if (curstate > MODULE_UPDATE_SHUTDOWN)
> +					print_update_failure_message();
> +			} else {
>  				ack_state();
> +			}
>  		} else {
>  			touch_nmi_watchdog();
>  			rcu_momentary_eqs();


^ permalink raw reply

* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Huang, Kai @ 2026-04-13 23:03 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Bae, Chang Seok, kvm@vger.kernel.org, pbonzini@redhat.com,
	kas@kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, x86@kernel.org
In-Reply-To: <ad0DrDUsUKTMfrDW@google.com>

On Mon, 2026-04-13 at 07:54 -0700, Sean Christopherson wrote:
> On Mon, Apr 13, 2026, Kai Huang wrote:
> > On Thu, 2026-04-09 at 15:42 -0700, Sean Christopherson wrote:
> > > -#define TDX_REGS_AVAIL_SET	(BIT_ULL(VCPU_REG_EXIT_INFO_1) | \
> > > -				 BIT_ULL(VCPU_REG_EXIT_INFO_2) | \
> > > -				 BIT_ULL(VCPU_REGS_RAX) | \
> > > -				 BIT_ULL(VCPU_REGS_RBX) | \
> > > -				 BIT_ULL(VCPU_REGS_RCX) | \
> > > -				 BIT_ULL(VCPU_REGS_RDX) | \
> > > -				 BIT_ULL(VCPU_REGS_RBP) | \
> > > -				 BIT_ULL(VCPU_REGS_RSI) | \
> > > -				 BIT_ULL(VCPU_REGS_RDI) | \
> > > -				 BIT_ULL(VCPU_REGS_R8) | \
> > > -				 BIT_ULL(VCPU_REGS_R9) | \
> > > -				 BIT_ULL(VCPU_REGS_R10) | \
> > > -				 BIT_ULL(VCPU_REGS_R11) | \
> > > -				 BIT_ULL(VCPU_REGS_R12) | \
> > > -				 BIT_ULL(VCPU_REGS_R13) | \
> > > -				 BIT_ULL(VCPU_REGS_R14) | \
> > > -				 BIT_ULL(VCPU_REGS_R15))
> > > +#define TDX_REGS_AVAIL_SET	(BIT(VCPU_REG_EXIT_INFO_1) | \
> > > +				 BIT(VCPU_REG_EXIT_INFO_2) | \
> > > +				 BIT(VCPU_REGS_RAX) | \
> > > +				 BIT(VCPU_REGS_RBX) | \
> > > +				 BIT(VCPU_REGS_RCX) | \
> > > +				 BIT(VCPU_REGS_RDX) | \
> > > +				 BIT(VCPU_REGS_RBP) | \
> > > +				 BIT(VCPU_REGS_RSI) | \
> > > +				 BIT(VCPU_REGS_RDI) | \
> > > +				 BIT(VCPU_REGS_R8) | \
> > > +				 BIT(VCPU_REGS_R9) | \
> > > +				 BIT(VCPU_REGS_R10) | \
> > > +				 BIT(VCPU_REGS_R11) | \
> > > +				 BIT(VCPU_REGS_R12) | \
> > > +				 BIT(VCPU_REGS_R13) | \
> > > +				 BIT(VCPU_REGS_R14) | \
> > > +				 BIT(VCPU_REGS_R15))
> > >  
> > 
> > Not related to this series, but this made me look into whether these
> > registers are truly needed to be set as available for TDX.
> > 
> > Firstly, all the listed registers are marked as available immediately after
> > exiting from tdh_vp_enter(), but except VCPU_REG_EXIT_INFO_1 and
> > VCPU_REG_EXIT_INFO_2 are immediately saved to the common 'struct vcpu_vt',
> > all other GPRs are not saved to vcpu->arch.regs[], which means marking GPRs
> > available immediately doesn't quite make sense.
> > 
> > In fact, IIUC other than when the TD exits with TDVMCALL on which TD shares
> > couple of GPRs with KVM, KVM has no way to get TD's GPRs.  So perhaps it
> > makes more sense is to mark the shared GPRs available upon TDVMCALL.
> > 
> > But even that does not make sense from KVM's "GPR available" perspective,
> > because TDVMCALL has a different ABI from KVM's existing infrastructure for
> > e.g., CPUID/MSR emulation.  E.g.,  KVM uses RCX/RAX/RDX for MSR emulation,
> > but TDVMCALL<MSR.WRITE> uses R12 and R13 to convey MSR index/value:
> > 
> >         case EXIT_REASON_MSR_WRITE:                 
> >                 kvm_rcx_write(vcpu, tdx->vp_enter_args.r12);         
> >                 kvm_rax_write(vcpu, tdx->vp_enter_args.r13 & -1u);   
> >                 kvm_rdx_write(vcpu, tdx->vp_enter_args.r13 >> 32);
> > 
> > So I think the most accurate way is to explicitly mark the relevant GPRs
> > available for each type of TDVMCALL. I am not sure whether it's worth to do
> > though, because AFAICT there's no real bug in the existing code, other than
> > "marking GPRs not in vcpu->arch.regs[] as available looks wrong".
> > 
> > A less invasive way is to mark all possible GPRs that can be used in
> > TDVMCALL emulation available once after TD exits.  AFAICT the KVM hypercall
> > uses most GPRs (RAX/RBX/RCX/RDX/RSI) and all other TDVMCALLs only use a
> > subset, so maybe we can remove other GPRs from the available list (the diff
> > in [*] passed my test of booting/destroying TD).
> > 
> > Bug again, not sure whether it's worth doing.
> 
> Not worth doing.  
> 

Fine to me. :-)

> Because VMX and SVM make all GRPs available immediately, except
> for RSP, KVM ignores avail/dirty for GPRs.  I.e. "fixing" TDX will just shift the
> "bugs" elsewhere.

Just want to understand:

I thought the fix could be we simply remove the wrong GPRs from the list. 
Not sure how fixing TDX will shift bugs elsewhere?

> 
> More importantly, because the TDX-Module *requires* RCX (the GPR that holds the
> mask of registers to expose to the VMM) to be hidden on TDVMCALL, KVM *can't*
> do any kind of meaningful "available" tracking.  
> 

Hmm I think RCX conveys the shared GPRs and VMM can read.  Per "Table 5.323:
TDH.VP.ENTER Output Operands Format #5 Definition: On TDCALL(TDG.VP.VMCALL)
Following a TD Entry":

  RCX   ...
	Bit(s) Name         Description

	31:0   PARAMS_MASK  Value as passed into TDCALL(TDG.VP.VMCALL) by
			    the guest TD: indicates which part of the guest
			    TD GPR and XMM state is passed as-is to the
VMM 
			    and back. For details, see the description of
			    TDG.VP.VMCALL in 5.5.26.

I think the problem is, as said previously, currently KVM TDX code uses
KVM's existing infrastructure to emulate MSR, KVM hypercall etc,  but
TDVMCALL has a different ABI, thus there's a mismatch here.

^ permalink raw reply

* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Xiaoyao Li @ 2026-04-14  2:12 UTC (permalink / raw)
  To: Huang, Kai, seanjc@google.com
  Cc: Bae, Chang Seok, kvm@vger.kernel.org, pbonzini@redhat.com,
	kas@kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, x86@kernel.org
In-Reply-To: <d6f05e5fa781ccb465d27a4fb1c7c1ac1e9e95ff.camel@intel.com>

On 4/14/2026 7:03 AM, Huang, Kai wrote:
>> Because VMX and SVM make all GRPs available immediately, except
>> for RSP, KVM ignores avail/dirty for GPRs.  I.e. "fixing" TDX will just shift the
>> "bugs" elsewhere.
> Just want to understand:
> 
> I thought the fix could be we simply remove the wrong GPRs from the list.
> Not sure how fixing TDX will shift bugs elsewhere?

I'm curious too.

>> More importantly, because the TDX-Module*requires* RCX (the GPR that holds the
>> mask of registers to expose to the VMM) to be hidden on TDVMCALL, KVM*can't*
>> do any kind of meaningful "available" tracking.
>>
> Hmm I think RCX conveys the shared GPRs and VMM can read.  Per "Table 5.323:
> TDH.VP.ENTER Output Operands Format #5 Definition: On TDCALL(TDG.VP.VMCALL)
> Following a TD Entry":
> 
>    RCX   ...
> 	Bit(s) Name         Description
> 
> 	31:0   PARAMS_MASK  Value as passed into TDCALL(TDG.VP.VMCALL) by
> 			    the guest TD: indicates which part of the guest
> 			    TD GPR and XMM state is passed as-is to the
> VMM
> 			    and back. For details, see the description of
> 			    TDG.VP.VMCALL in 5.5.26.
> 
> I think the problem is, as said previously, currently KVM TDX code uses
> KVM's existing infrastructure to emulate MSR, KVM hypercall etc,  but
> TDVMCALL has a different ABI, thus there's a mismatch here.

I once had patch for it internally.

It adds back the available check for GPRs when accessing instead of 
assuming they are always available. For normal VMX and SVM, all the GPRs 
are still always available. But for TDX, only EXIT_INFO_1 and 
EXIT_INFO_2 are always marked available, while others need to be 
explicitly set case by case.

The good thing is it makes TDX safer that KVM won't consume invalid data 
silently for TDX. But it adds additional overhead of checking the 
unnecessary register availability for VMX and SVM case.

-----------------------------&<-------------------------------------
From: Xiaoyao Li <xiaoyao.li@intel.com>
Date: Tue, 11 Mar 2025 07:13:29 -0400
Subject: [PATCH] KVM: x86: Add available check for GPRs

Since commit de3cd117ed2f ("KVM: x86: Omit caching logic for
always-available GPRs"), KVM doesn't check the availability of GPRs
except RSP and RIP when accessing them, because they are always
available.

However, it's not true when it comes to TDX. The GPRs are not available
after TD vcpu exits actually. And it relies on KVM manually sets the
GPRs value when needed, e.g.

  - setting rax, rbx, rcx, rdx, rsi, for hypercall emulation in
    tdx_emulate_tdvmall();

  - setting rax, rcx and rdx before MSR write emulation;

Add the available check of GPRs read, and WARN_ON_ONCE() when unavailable.
It can help capture the cases of undesired GPRs consumption by TDX.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
  arch/x86/kvm/kvm_cache_regs.h | 60 +++++++++++++++++++----------------
  arch/x86/kvm/vmx/tdx.c        | 25 +++------------
  2 files changed, 37 insertions(+), 48 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 8ddb01191d6f..b2fa01ee2b4b 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -16,34 +16,6 @@

  static_assert(!(KVM_POSSIBLE_CR0_GUEST_BITS & X86_CR0_PDPTR_BITS));

-#define BUILD_KVM_GPR_ACCESSORS(lname, uname)				      \
-static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu 
*vcpu)\
-{									      \
-	return vcpu->arch.regs[VCPU_REGS_##uname];			      \
-}									      \
-static __always_inline void kvm_##lname##_write(struct kvm_vcpu *vcpu,	 
      \
-						unsigned long val)	      \
-{									      \
-	vcpu->arch.regs[VCPU_REGS_##uname] = val;			      \
-}
-BUILD_KVM_GPR_ACCESSORS(rax, RAX)
-BUILD_KVM_GPR_ACCESSORS(rbx, RBX)
-BUILD_KVM_GPR_ACCESSORS(rcx, RCX)
-BUILD_KVM_GPR_ACCESSORS(rdx, RDX)
-BUILD_KVM_GPR_ACCESSORS(rbp, RBP)
-BUILD_KVM_GPR_ACCESSORS(rsi, RSI)
-BUILD_KVM_GPR_ACCESSORS(rdi, RDI)
-#ifdef CONFIG_X86_64
-BUILD_KVM_GPR_ACCESSORS(r8,  R8)
-BUILD_KVM_GPR_ACCESSORS(r9,  R9)
-BUILD_KVM_GPR_ACCESSORS(r10, R10)
-BUILD_KVM_GPR_ACCESSORS(r11, R11)
-BUILD_KVM_GPR_ACCESSORS(r12, R12)
-BUILD_KVM_GPR_ACCESSORS(r13, R13)
-BUILD_KVM_GPR_ACCESSORS(r14, R14)
-BUILD_KVM_GPR_ACCESSORS(r15, R15)
-#endif
-
  /*
   * Using the register cache from interrupt context is generally not 
allowed, as
   * caching a register and marking it available/dirty can't be done 
atomically,
@@ -92,6 +64,38 @@ static inline void kvm_register_mark_dirty(struct 
kvm_vcpu *vcpu,
  	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
  }

+#define BUILD_KVM_GPR_ACCESSORS(lname, uname)				      \
+static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu 
*vcpu)\
+{									      \
+	if (WARN_ON_ONCE(!kvm_register_is_available(vcpu, VCPU_REGS_##uname)))\
+		return 0;						      \
+									      \
+	return vcpu->arch.regs[VCPU_REGS_##uname];			      \
+}									      \
+static __always_inline void kvm_##lname##_write(struct kvm_vcpu *vcpu,	 
      \
+						unsigned long val)	      \
+{									      \
+	vcpu->arch.regs[VCPU_REGS_##uname] = val;			      \
+	kvm_register_mark_available(vcpu, VCPU_REGS_##uname);	      	      \
+}
+BUILD_KVM_GPR_ACCESSORS(rax, RAX)
+BUILD_KVM_GPR_ACCESSORS(rbx, RBX)
+BUILD_KVM_GPR_ACCESSORS(rcx, RCX)
+BUILD_KVM_GPR_ACCESSORS(rdx, RDX)
+BUILD_KVM_GPR_ACCESSORS(rbp, RBP)
+BUILD_KVM_GPR_ACCESSORS(rsi, RSI)
+BUILD_KVM_GPR_ACCESSORS(rdi, RDI)
+#ifdef CONFIG_X86_64
+BUILD_KVM_GPR_ACCESSORS(r8,  R8)
+BUILD_KVM_GPR_ACCESSORS(r9,  R9)
+BUILD_KVM_GPR_ACCESSORS(r10, R10)
+BUILD_KVM_GPR_ACCESSORS(r11, R11)
+BUILD_KVM_GPR_ACCESSORS(r12, R12)
+BUILD_KVM_GPR_ACCESSORS(r13, R13)
+BUILD_KVM_GPR_ACCESSORS(r14, R14)
+BUILD_KVM_GPR_ACCESSORS(r15, R15)
+#endif
+
  /*
   * kvm_register_test_and_mark_available() is a special snowflake that 
uses an
   * arch bitop directly to avoid the explicit instrumentation that 
comes with
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cefe6cdd60a9..2b90a60e6f64 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -969,6 +969,9 @@ static __always_inline u32 
tdx_to_vmx_exit_reason(struct kvm_vcpu *vcpu)
  	return exit_reason;
  }

+#define TDX_REGS_AVAIL_SET	(BIT_ULL(VCPU_EXREG_EXIT_INFO_1) | \
+				 BIT_ULL(VCPU_EXREG_EXIT_INFO_2))
+
  static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
  {
  	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -985,6 +988,8 @@ static noinstr void tdx_vcpu_enter_exit(struct 
kvm_vcpu *vcpu)
  	tdx->exit_gpa = tdx->vp_enter_args.r8;
  	vt->exit_intr_info = tdx->vp_enter_args.r9;

+	vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET;
+
  	vmx_handle_nmi(vcpu);

  	guest_state_exit_irqoff();
@@ -1017,24 +1022,6 @@ static fastpath_t 
tdx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
  	return EXIT_FASTPATH_NONE;
  }

-#define TDX_REGS_AVAIL_SET	(BIT_ULL(VCPU_EXREG_EXIT_INFO_1) | \
-				 BIT_ULL(VCPU_EXREG_EXIT_INFO_2) | \
-				 BIT_ULL(VCPU_REGS_RAX) | \
-				 BIT_ULL(VCPU_REGS_RBX) | \
-				 BIT_ULL(VCPU_REGS_RCX) | \
-				 BIT_ULL(VCPU_REGS_RDX) | \
-				 BIT_ULL(VCPU_REGS_RBP) | \
-				 BIT_ULL(VCPU_REGS_RSI) | \
-				 BIT_ULL(VCPU_REGS_RDI) | \
-				 BIT_ULL(VCPU_REGS_R8) | \
-				 BIT_ULL(VCPU_REGS_R9) | \
-				 BIT_ULL(VCPU_REGS_R10) | \
-				 BIT_ULL(VCPU_REGS_R11) | \
-				 BIT_ULL(VCPU_REGS_R12) | \
-				 BIT_ULL(VCPU_REGS_R13) | \
-				 BIT_ULL(VCPU_REGS_R14) | \
-				 BIT_ULL(VCPU_REGS_R15))
-
  static void tdx_load_host_xsave_state(struct kvm_vcpu *vcpu)
  {
  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -1108,8 +1095,6 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 
run_flags)

  	tdx_load_host_xsave_state(vcpu);

-	vcpu->arch.regs_avail &= TDX_REGS_AVAIL_SET;
-
  	if (unlikely(tdx->vp_enter_ret == EXIT_REASON_EPT_MISCONFIG))
  		return EXIT_FASTPATH_NONE;

-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v7 08/22] x86/virt/seamldr: Allocate and populate a module update request
From: Chao Gao @ 2026-04-14  9:43 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
	seanjc@google.com, binbin.wu@linux.intel.com, pbonzini@redhat.com,
	Chatre, Reinette, Verma, Vishal L, nik.borisov@suse.com,
	mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
	Annapurve, Vishal, sagis@google.com, hpa@zytor.com,
	tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
	yilun.xu@linux.intel.com, dan.j.williams@intel.com,
	x86@kernel.org
In-Reply-To: <a6a3426ba9b84ea25f3653e33a181dab0096b35f.camel@intel.com>

On Sat, Apr 11, 2026 at 09:14:36AM +0800, Edgecombe, Rick P wrote:
>On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
>> P-SEAMLDR uses the SEAMLDR_PARAMS structure to describe TDX module
>> update requests. This structure contains physical addresses pointing to
>> the module binary and its signature file (or sigstruct), along with an
>> update scenario field.
>> 
>> TDX modules are distributed in the tdx_blob format defined in
>> blob_structure.txt from the "Intel TDX module Binaries Repository". A
>> tdx_blob contains a header, sigstruct, and module binary. This is also the
>> format supplied by the userspace to the kernel.
>> 
>> Parse the tdx_blob format and populate a SEAMLDR_PARAMS structure
>> accordingly. This structure will be passed to P-SEAMLDR to initiate the
>> update.
>
>The thing that confused me about this earlier was the exact reason why we are
>checking all the fields. We discussed that we need to check the fields that
>kernel processes, but we don't need to double check data that the TDX module
>processes.
>
>Should we explain it?

Yes. how about adding the following in the changelog:

Parse the tdx_blob format and populate a SEAMLDR_PARAMS structure. The
header is consumed solely by the kernel to extract the sigstruct and
module, so validate it before processing. The sigstruct and module are
passed to and validated by P-SEAMLDR, so don't duplicate any validation
in the kernel.

>And how it explains the checks below?


>> +static void free_seamldr_params(struct seamldr_params *params)
>> +{
>> +	free_page((unsigned long)params);
>> +}
>
>Do we really need this helper? This doesn't work?

No. This works. Will do.

>
>DEFINE_FREE(free_seamldr_params, struct seamldr_params *,
>	    if (!IS_ERR_OR_NULL(_T)) free_page((unsigned long)_T))
>
>
>> +
>> +static struct seamldr_params *alloc_seamldr_params(const void *module, unsigned int module_size,
>> +						   const void *sig, unsigned int sig_size)
>> +{
>> +	struct seamldr_params *params;
>> +	const u8 *ptr;
>> +	int i;
>> +
>> +	if (module_size > SEAMLDR_MAX_NR_MODULE_4KB_PAGES * SZ_4K)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	if (sig_size > SEAMLDR_MAX_NR_SIG_4KB_PAGES * SZ_4K)
>> +		return ERR_PTR(-EINVAL);
>
>I don't know if it's worth that much, but we could do a MIN thing here to
>protect the loop, and lose the conditionals. If userspace passes a blob that is
>out of spec they can deal with the module error, no?

I think we all agree: we need to do something to protect the loop. So, the
question is just how.

looks like you prefer:

	module_size = MIN(module_size, SEAMLDR_MAX_NR_MODULE_4KB_PAGES * SZ_4K);
	sig_size = MIN(sig_size, SEAMLDR_MAX_NR_SIG_4KB_PAGES * SZ_4K);

But I think the MIN approach actually requires more justification than a plain
bounds check.

With MIN, the reasoning chain is: we don't want conditionals -> userspace
can deal with the module error -> and that error won't be fatal to the
system (or even if it is, it's admin-initiated). That's a lot of assumptions
to validate.

With a bounds check and early return, the reasoning is simply: the input is out
of range -> reject it as invalid.

>
>> +
>> +	/*
>> +	 * Check that input buffers satisfy P-SEAMLDR's size and alignment
>> +	 * constraints so they can be passed directly to P-SEAMLDR without
>> +	 * relocation or copy.
>> +	 */
>> +	if (!IS_ALIGNED(module_size, SZ_4K) || !IS_ALIGNED(sig_size, SZ_4K) ||
>> +	    !IS_ALIGNED((unsigned long)module, SZ_4K) ||
>> +	    !IS_ALIGNED((unsigned long)sig, SZ_4K))
>> +		return ERR_PTR(-EINVAL);
>
>I thought you are going to reduce this checking to just to reject invalid input
>that the kernel processes.
>
>What happens if we don't check this?

If the blob->offset_of_module or blob->size is misaligned, the loops
silently drop the unaligned portion. P-SEAMLDR will then reject the update
after module shutdown and TDs will be killed. Since this is admin-initiated
with an intentionally invalid blob, the consequence is acceptable.

>The vmallocs are all going to be page
>aligned anyway. But even still, does it mess up the below loops somehow in a way
>that hurts anything?

I agree addresses/sizes are page-aligned for valid blobs, but they are not
guaranteed (e.g., if offset_of_module is misaligned) for all inputs. I
removed the check once in v6 and Sashiko questioned it, so I thought the
assumption isn't obviously correct to everyone.

Without the check, we'd need something like:

	/*
	 * Assume sig/module base addresses and sizes are page-aligned.
	 * If violated, P-SEAMLDR will reject the update.
	 */

It's a few lines of straightforward validation vs. a comment that requires
readers to verify the assumption chain (vmalloc internals, userspace
contract, P-SEAMLDR behavior). Not sure the trade is worth it.

If you think the comment is sufficient, I'm fine dropping the checks.

>
>I might be confused, but it seems different then we discussed.
>
>> +
>> +	params = (struct seamldr_params *)get_zeroed_page(GFP_KERNEL);
>> +	if (!params)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	/*
>> +	 * Only use version 1 when required (sigstruct > 4KB) for backward
>> +	 * compatibility with P-SEAMLDR that lacks version 1 support.
>> +	 */
>> +	if (sig_size > SZ_4K)
>> +		params->version = 1;
>> +	else
>> +		params->version = 0;
>
>I'm a bit confused by this part. What does it mean to support old P-SEAMLDRs?

All current P-SEAMLDRs only support version 0. Version 1 is a planned
extension but not yet implemented by any P-SEAMLDR. Always requiring
version 1 just saves us a conditional but disables updates for all existing
P-SEAMLDRs for no reason. I think it is not worthwhile.

>But also could it be:
>
>params->version = sig_size > SZ_4K;

ok with me. I just wanted to make it match with "version 1" in the
comment right above.

>> +static struct seamldr_params *init_seamldr_params(const u8 *data, u32 size)
>> +{
>> +	const struct tdx_blob *blob = (const void *)data;
>> +	int module_size, sig_size;
>> +	const void *sig, *module;
>> +
>> +	/*
>> +	 * Ensure the size is valid otherwise reading any field from the
>> +	 * blob may overflow.
>> +	 */
>> +	if (size <= sizeof(struct tdx_blob) || size <= blob->offset_of_module)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	if (blob->version != TDX_BLOB_VERSION_1)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	if (blob->reserved0 || memchr_inv(blob->reserved1, 0, sizeof(blob->reserved1)))
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	/* Split the blob into a sigstruct and a module. */
>> +	sig		= blob->data;
>> +	sig_size	= blob->offset_of_module - sizeof(struct tdx_blob);
>> +	module		= data + blob->offset_of_module;
>> +	module_size	= size - blob->offset_of_module;
>
>Did you consider just passing the tdx_blob into alloc_seamldr_params()?
>Basically, this function checks the blob fields, then alloc_seamldr_params()
>turns blob into  struct seamldr_params without checks. The way it is, the work
>seems kind of spread around two functions with various checks.

Fine with merging them. How about:

static struct seamldr_params *alloc_seamldr_params(const u8 *data, u32 size)
{
	const struct tdx_blob *blob = (const void *)data;
	struct seamldr_params *params;
	int module_size, sig_size;
	const void *sig, *module;
	const u8 *ptr;
	int i;

	/*
	 * Ensure the size is valid otherwise reading any field from the
	 * blob may overflow.
	 */
	if (size <= sizeof(struct tdx_blob))
		return ERR_PTR(-EINVAL);

	if (blob->version != TDX_BLOB_VERSION_1)
		return ERR_PTR(-EINVAL);
	
	if (blob->length != size)
		return ERR_PTR(-EINVAL);

	if (memcmp(blob->signature, "TDX-BLOB", 8))
		return ERR_PTR(-EINVAL);

	if (blob->reserved0 || memchr_inv(blob->reserved1, 0, sizeof(blob->reserved1)))
		return ERR_PTR(-EINVAL);

	/* Split the blob into a sigstruct and a module. */
	sig		= blob->data;
	sig_size	= blob->offset_of_module - sizeof(struct tdx_blob);
	module		= data + blob->offset_of_module;
	module_size	= size - blob->offset_of_module;

	if (module_size > SEAMLDR_MAX_NR_MODULE_4KB_PAGES * SZ_4K || module_size <= 0)
		return ERR_PTR(-EINVAL);

	if (sig_size > SEAMLDR_MAX_NR_SIG_4KB_PAGES * SZ_4K || sig_size <= 0)
		return ERR_PTR(-EINVAL);

	params = (struct seamldr_params *)get_zeroed_page(GFP_KERNEL);
	if (!params)
		return ERR_PTR(-ENOMEM);

	/*
	 * Only use version 1 when required (sigstruct > 4KB) for backward
	 * compatibility with P-SEAMLDR that lacks version 1 support.
	 */
	params->version = sig_size > SZ_4K;
	params->scenario = SEAMLDR_SCENARIO_UPDATE;

	ptr = sig;
	/*
	 * Assume sig/module base addresses and sizes are page-aligned.
	 * If violated, P-SEAMLDR will reject the update.
	 */
	for (i = 0; i < sig_size / SZ_4K; i++) {
		params->sigstruct_pa[i] = vmalloc_to_pfn(ptr) << PAGE_SHIFT;
		ptr += SZ_4K;
	}

	params->num_module_pages = module_size / SZ_4K;

	ptr = module;
	for (i = 0; i < params->num_module_pages; i++) {
		params->mod_pages_pa_list[i] = vmalloc_to_pfn(ptr) << PAGE_SHIFT;
		ptr += SZ_4K;
	}

	return params;
}

^ permalink raw reply

* Re: [PATCH v7 07/22] coco/tdx-host: Implement firmware upload sysfs ABI for TDX module updates
From: Chao Gao @ 2026-04-14  9:50 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
	seanjc@google.com, binbin.wu@linux.intel.com, pbonzini@redhat.com,
	Chatre, Reinette, Verma, Vishal L, nik.borisov@suse.com,
	mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
	Annapurve, Vishal, sagis@google.com, hpa@zytor.com,
	tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
	yilun.xu@linux.intel.com, dan.j.williams@intel.com,
	x86@kernel.org
In-Reply-To: <91f8645bff41f5b28bf964b215429b8c7ad53715.camel@intel.com>

On Sat, Apr 11, 2026 at 08:26:23AM +0800, Edgecombe, Rick P wrote:
>On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
>> +static enum fw_upload_err tdx_fw_write(struct fw_upload *fwl, const u8 *data,
>> +				       u32 offset, u32 size, u32 *written)
>> +{
>> +	int ret;
>> +
>> +	/*
>> +	 * tdx_fw_write() always processes all data on the first call with
>> +	 * offset == 0. Since it never returns partial success (it either
>> +	 * succeeds completely or fails), there is no subsequent call with
>> +	 * non-zero offsets.
>> +	 */
>> +	WARN_ON_ONCE(offset);
>> +	ret = seamldr_install_module(data, size);
>> +	switch (ret) {
>> +	case 0:
>> +		*written = size;
>> +		return FW_UPLOAD_ERR_NONE;
>> +	case -EBUSY:
>> +		return FW_UPLOAD_ERR_BUSY;
>> +	case -EIO:
>> +		return FW_UPLOAD_ERR_HW_ERROR;
>> +	case -ENOMEM:
>> +		return FW_UPLOAD_ERR_RW_ERROR;
>> +	default:
>> +		return FW_UPLOAD_ERR_FW_INVALID;
>
>It's hard to review whether these error codes match because the function doesn't
>return them yet. Why isn't this patch just done later in the series after
>everything is in place?

Good point.

The series is ordered top-down (user interface first, implementation details
later). I could either move this patch later, or add each error mapping in the
patch that introduces the corresponding return value. I'd prefer the latter to
keep the top-down ordering. Would that work for you?

^ permalink raw reply

* Re: [PATCH v7 10/22] x86/virt/seamldr: Abort updates if errors occurred midway
From: Chao Gao @ 2026-04-14  9:59 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
	seanjc@google.com, binbin.wu@linux.intel.com, pbonzini@redhat.com,
	Chatre, Reinette, Verma, Vishal L, nik.borisov@suse.com,
	mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
	Annapurve, Vishal, sagis@google.com, hpa@zytor.com,
	tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
	yilun.xu@linux.intel.com, dan.j.williams@intel.com,
	x86@kernel.org
In-Reply-To: <fb472138f6ba06efd90a3834a5748662d7b7ac44.camel@intel.com>

On Sat, Apr 11, 2026 at 09:26:58AM +0800, Edgecombe, Rick P wrote:
>On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
>> The TDX module update process has multiple steps, each of which may
>> encounter failures.
>> 
>> The current state machine of updates proceeds to the next step regardless
>> of errors. But continuing updates when errors occur midway is pointless.
>
>This kind of begs the question of how much it matters if some pointless work
>happens in error condition during a rare operation. I'm thinking at this point,
>aha!, do we need this?
>
>> 
>> Abort the update by setting a flag to indicate that a CPU has encountered
>> an error, forcing all CPUs to exit the execution loop. Note that failing
>> CPUs do not acknowledge the current step. This keeps all other CPUs waiting
>> in the current step (since advancing to the next step requires all CPUs to
>> acknowledge the current step) until they detect the fault flag and exit the
>> loop.
>
>So is the point of the patch to prevent the operation from getting stuck? Or
>saving the user experiencing a failed update a little time?

Good question.

The main point is correctness, not saving time.

If shutdown fails midway, the update is still recoverable — TDs can continue
running. But if we proceed to seamldr.install anyway, it becomes destructive.
Aborting early on shutdown failure preserves recoverability (this is needed to
handle races between updates and TD build/migration).

If seamldr.install itself fails, it's already destructive, so aborting early
there just saves time. But using the same abort mechanism for both keeps the
error handling uniform.

^ permalink raw reply

* Re: [PATCH v7 11/22] x86/virt/seamldr: Shut down the current TDX module
From: Chao Gao @ 2026-04-14 10:09 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
	seanjc@google.com, binbin.wu@linux.intel.com, pbonzini@redhat.com,
	Chatre, Reinette, Verma, Vishal L, nik.borisov@suse.com,
	mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
	Annapurve, Vishal, sagis@google.com, hpa@zytor.com,
	tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
	yilun.xu@linux.intel.com, dan.j.williams@intel.com,
	x86@kernel.org
In-Reply-To: <a50bb8a4d8c2bdcb0795eca4551567fb5a577215.camel@intel.com>

On Sat, Apr 11, 2026 at 09:35:32AM +0800, Edgecombe, Rick P wrote:
>On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
>> The first step of TDX module updates is shutting down the current TDX
>> Module. This step also packs state information that needs to be
>> preserved across updates as handoff data, which will be consumed by the
>> updated module. The handoff data is stored internally in the SEAM range
>> and is hidden from the kernel.
>> 
>> To ensure a successful update, the new module must be able to consume
>> the handoff data generated by the old module. Since handoff data layout
>> may change between modules, the handoff data is versioned. Each module
>> has a native handoff version and provides backward support for several
>> older versions.
>> 
>> The complete handoff versioning protocol is complex as it supports both
>> module upgrades and downgrades. See details in Intel® Trust Domain
>> Extensions (Intel® TDX) Module Base Architecture Specification, Chapter
>> "Handoff Versioning".
>> 
>> Ideally, the kernel needs to retrieve the handoff versions supported by
>> the current module and the new module and select a version supported by
>> both. But, since this implementation chooses to only support module
>> upgrades, simply request the current module to generate handoff data
>> using its highest supported version, expecting that the new module will
>> likely support it.
>
>I feel like somewhere it's missing what this patch does. It explains the
>reasoning for the handoff version selection, but nothing about implement
>"MODULE_UPDATE_SHUTDOWN" or anything like that.

Yes. How about:

Retrieve the module's handoff version from TDX global metadata and add an
update step to shut down the module. Module shutdown has global effect, so
it only needs to run on one CPU.

^ permalink raw reply

* Re: [PATCH 0/3] arm64/virt: Add Arm CCA measurement register support
From: Suzuki K Poulose @ 2026-04-14 10:10 UTC (permalink / raw)
  To: Jason Gunthorpe, Sami Mujawar, Dan Williams
  Cc: linux-arm-kernel, linux-kernel, catalin.marinas, will, thuth,
	steven.price, gshan, YeoReum.Yun, cedric.xing, Dan Williams,
	Dionna Glaze, Aneesh Kumar K . V, Alexey Kardashevskiy,
	linux-coco@lists.linux.dev
In-Reply-To: <20260413125925.GK3694781@ziepe.ca>

Cc: Dan, Cedric, Dionna, Aneesh, Alexey. linux-coco

Hi Jason,

On 13/04/2026 13:59, Jason Gunthorpe wrote:
> On Mon, Apr 13, 2026 at 09:49:54AM +0100, Sami Mujawar wrote:
>> This series adds support for Arm Confidential Compute Architecture (CCA)
>> measurement registers in the Linux kernel, enabling guest Realms to
>> access, extend, and expose measurement values for attestation and runtime
>> integrity tracking.
>>
>> The Realm Management Monitor (RMM) defines a set of measurement registers
>> consisting of a Realm Initial Measurement (RIM) and a number of Realm
>> Extensible Measurements (REMs). This series introduces the necessary
>> infrastructure to interact with these registers via the RSI interface
>> and exposes them to userspace through the TSM measurement framework.
>>
>> At a high level, the series includes:
>>   - Helper interfaces for reading and extending measurement
>>     registers via RSI
>>   - Definitions for Realm hash algorithms as defined by the
>>     RMM specification
>>   - Integration with the TSM measurement subsystem and sysfs
>>     exposure for userspace visibility and interaction
>>
>> After applying this series, measurement registers are exposed under:
>>      /sys/devices/virtual/misc/arm_cca_guest/measurements/
> 
> I'm surprised we get some random sysfs files? How does some more
> generic userspace figure out to use this vs a TPM or some other
> platform's version of it?

That is true. This is the infrastructure for exposing Runtime
Measurement registers (R/W) for use by the OS, complementing the
TSM_REPORTS (Read Only Platform measurements+Attestation Reports, e.g.
on CCA Attestation Report from RMM). Unlike the TSM reports,
this doesn't have a generic interface for userspace.


> I also think exposing PCRs as was done for TPM in sysfs was something
> of a mistake.. Allowing extension without logging is too low level and
> is very hard to build an entire attestation system around.
> 
> I really think we are missing a subsystem here, TPM has sort of been
> filling this role in a non-generic way, but we should have a
> common uAPI for platform measurement & attestation:

Agreed, such a subsystem would solve the below.

>   - Discover available measurements
>   - Report signed measurements, with ingesting a nonce
>   - Report measurement logs
>   - Extend measurements and udpate logs
>   - Report certificates used in signing
>   - General reporting of various kinds of attestation evidence
> 
> And it would be nice for the PCI devices and others to plug into the
> general framework as well instead of building a parallel TSM framework
> for handling evidence.

That makes sense and AFAIU, there are efforts in progress to expose
the Device measurements+Certificates in a different form. May be a good
idea to intervene early enough to see if we can find a common ground.

> 
> Isn't this also sort of incomplete?  Doesn't anything serious need
> signed measurements? Isnt't there alot more data that comes out of RMM
> than just a few measurement registers?
As mentioned above, this series adds the support for Runtime Extendible
Measurements (REM in CCA, RTMR on TDX). The RIM+Platform Attestation is 
already provided via the TSM_REPORT


Kind regards
Suzuki

> 
> Jason


^ permalink raw reply

* Re: [PATCH v2 05/31] x86/virt/tdx: Extend tdx_page_array to support IOMMU_MT
From: Xu Yilun @ 2026-04-14  9:57 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: Gao, Chao, Xu, Yilun, x86@kernel.org, kas@kernel.org,
	baolu.lu@linux.intel.com, dave.hansen@linux.intel.com,
	Li, Xiaoyao, Williams, Dan J, Jiang, Dave,
	linux-pci@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Duan, Zhenzhong, Verma, Vishal L,
	kvm@vger.kernel.org
In-Reply-To: <f38d0a080aee052937cb6721683d55155c657717.camel@intel.com>

On Wed, Apr 01, 2026 at 12:17:45AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2026-03-31 at 22:19 +0800, Xu Yilun wrote:
> > > Consider the amount of tricks that are needed to coax the tdx_page_array to
> > > populate the handoff page as needed. It adds 2 pages here, then subtracts
> > > them
> > > later in the callback. Then tweaks the pa in tdx_page_array_populate() to
> > > add
> > > the length...
> > 
> > mm.. The tricky part is the specific memory requirement/allocation, the
> > common part is the pa list contained in a root page. Maybe we only model
> > the later, let the specific user does the memory allocation. Is that
> > closer to your "break concepts apart" idea?
> 
> I haven't wrapped my head around this enough to suggest anything is definitely
> the right approach.
> 
> But yes, the idea would be that the allocation of the list of pages to give to
> the TDX module would be a separate allocation and set of management functions.
> And the the allocation of the pages that are used to communicate the list of
> pages (and in this case other args) with the module would be another set. So
> each type of TDX module arg page format (IOMMU_MT, etc) would be separable, but
> share the page list allocation part only. It looks like Nikolay was probing
> along the same path. Not sure if he had the same solution in mind.
> 
> So for this:
> 1. Allocate a list or array of pages using a generic method.
> 2. Allocate these two IOMMU special pages.
> 3. Allocate memory needed for the seamcall (root pages)
> 
> Hand all three to the wrapper and have it shove them all through in the special
> way it prefers.

I'm drafting some changes and make the tdx_page_array look like:

  struct tdx_page_array {
	/* public: */
	unsigned int nr_pages;
	struct page **pages;

	/* private: */
	u64 *root;
	bool flush_on_free;
  };

  - I removed the page allocations for tdx_page_array kAPIs. Now the
    caller needs to allocate the struct page **pages and the page list,
    then create the tdx_page_array by providing these pages.

    struct tdx_page_array *tdx_page_array_create(struct page **pages,
						 unsigned int nr_pages)

    This also means tdx_page_array doesn't have to hold more than 512
    pages anymore, it now an exact descriptor for the TDX Module's
    definitions rather than a manager. It's a chunk of the required
    memory when we need more than 512 pages. This eliminates the need
    for 'offset' field and the slide window operations so make the
    helpers simpler.

  - I still keep the generic struct tdx_page_array to represent all
    kinds of object types (HPA_ARRAY_T, HPA_LIST_INFO, IOMMU_MT), and
    provide the tdx_page_array to SEAMCALL helpers as parameters. I
    think this structure is generally good enough to represent a list of
    pages, keeps type safety compared to a list of HPAs.

  - I still record both the page list (struct page **pages) and the HPA
    list (in u64 *root). struct page **pages works with kernel memory
    management (e.g. vmap) well while the populated root works with
    SEAMCALLs.

  - I'm not introducing more structures each for an object type, like 
    struct hpa_array, struct hpa_list_info, struct iommu_metadata. They
    are conceptually the same thing. The iommu_mt supports multi-order
    pages, hpa_array_t & hpa_list_info don't support. But their bit
    definitions don't conflict. I can use the same piece of code to
    populate their root page content.

  - Add a flush_on_free field to mark if a cache write back is needed on
    tdx_page_array_free(), then we don't need 2 free APIs.

I want to clean up my code, then post an incremental patch for preview.

Thanks.

^ permalink raw reply

* Re: [PATCH v7 13/22] x86/virt/seamldr: Install a new TDX module
From: Chao Gao @ 2026-04-14 10:19 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
	seanjc@google.com, binbin.wu@linux.intel.com, pbonzini@redhat.com,
	Chatre, Reinette, Verma, Vishal L, nik.borisov@suse.com,
	mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
	Annapurve, Vishal, sagis@google.com, hpa@zytor.com,
	tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
	yilun.xu@linux.intel.com, dan.j.williams@intel.com,
	x86@kernel.org
In-Reply-To: <2eec98d012b85113b4e6876d07eb6513618e2404.camel@intel.com>

>> +static int seamldr_install(const struct seamldr_params *params)
>> +{
>> +	struct tdx_module_args args = { .rcx = __pa(params) };
>
>In an earlier patch you have a wrapper as:
>struct tdx_module_args args = = {};
>args.rxx = foo;
>
>Why the style difference? It would be good to standardize, but the existing code
>isn't standardized. What do you think about going with this style through the
>series for the one arg ones?

Sure, happy to standardize. To confirm, you prefer this style?

	struct tdx_module_args args = { .rcx = __pa(params) };

it looks more common than the other way.

^ permalink raw reply

* Re: [PATCH v7 22/22] x86/virt/seamldr: Log TDX module update failures
From: Chao Gao @ 2026-04-14 10:25 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
	seanjc@google.com, binbin.wu@linux.intel.com, pbonzini@redhat.com,
	Chatre, Reinette, Verma, Vishal L, nik.borisov@suse.com,
	mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
	Annapurve, Vishal, sagis@google.com, hpa@zytor.com,
	tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
	yilun.xu@linux.intel.com, dan.j.williams@intel.com,
	x86@kernel.org
In-Reply-To: <e4cdab7d83bfc1c7fe3b5759a987c71b922585b7.camel@intel.com>

On Tue, Apr 14, 2026 at 04:04:05AM +0800, Edgecombe, Rick P wrote:
>On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
>> Currently, there is no way to restore a TDX module from shutdown state to
>> running state. This means if errors occur after a successful module
>> shutdown, they are unrecoverable since the old module is gone but the new
>> module isn't installed. All subsequent SEAMCALLs to the TDX module will
>> fail, so TDs will be killed due to SEAMCALL failures.
>> 
>> Log a message to clarify that SEAMCALL errors are expected in this
>> scenario. This ensures that after update failures, the first message in
>> dmesg explains the situation rather than showing confusing call traces from
>> various code paths.
>
>Why is this patch at the end? Is it not valid initial behavior?

It's not strictly required for the update to function. It just improves
debuggability by explaining expected SEAMCALL errors after a failure.

I placed it last so maintainers can easily drop it if they consider it
non-essential, similar to the guidance in *.

*: https://lore.kernel.org/kvm/63082cd1-15ab-4aaf-83ad-f72d94b9bb8e@intel.com/

^ permalink raw reply

* Re: [PATCH v7 06/22] coco/tdx-host: Expose P-SEAMLDR information via sysfs
From: Chao Gao @ 2026-04-14 11:20 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
	seanjc@google.com, binbin.wu@linux.intel.com, pbonzini@redhat.com,
	Chatre, Reinette, Verma, Vishal L, nik.borisov@suse.com,
	mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
	Annapurve, Vishal, sagis@google.com, hpa@zytor.com,
	tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
	yilun.xu@linux.intel.com, dan.j.williams@intel.com,
	x86@kernel.org
In-Reply-To: <fbcce1bd98ff8a7ff106f7583e3b576b170358cf.camel@intel.com>

On Tue, Apr 14, 2026 at 03:08:33AM +0800, Edgecombe, Rick P wrote:
>On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
>> TDX module updates require userspace to select the appropriate module
>> to load. Expose necessary information to facilitate this decision. Two
>> values are needed:
>> 
>> - P-SEAMLDR version: for compatibility checks between TDX module and
>> 		     P-SEAMLDR
>> - num_remaining_updates: indicates how many updates can be performed
>
>Can you explain how all of these overlap?
> - TDX module supports module update
> - SEAMLDR supports NUM_REMAINING_UPDATES info
> - SEAMLDR supports VERSION info
>
>If the TDX module supports module update, do we know the SEAMLDR supports this
>other stuff somehow? It might be worth a comment the reasoning.

VERSION and NUM_REMAINING_UPDATES are always available for any P-SEAMLDR. They
don't depend on TDX module's update support.

>
>> 
>> Expose them as tdx-host device attributes. Make seamldr attributes
>> visible only when the update feature is supported, as that's their sole
>> purpose.
>> 
>> Unconditional exposure is also problematic because reading them
>> triggers P-SEAMLDR calls that break KVM on CPUs with a specific erratum
>> (to be enumerated and handled in a later patch).
>
>Since this is later handled with the errata check, what is the point being made
>here?

I will drop it. Dave also questioned mentioning the erratum here.

^ permalink raw reply

* Re: [PATCH 0/3] arm64/virt: Add Arm CCA measurement register support
From: Jason Gunthorpe @ 2026-04-14 12:29 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: Sami Mujawar, Dan Williams, linux-arm-kernel, linux-kernel,
	catalin.marinas, will, thuth, steven.price, gshan, YeoReum.Yun,
	cedric.xing, Dan Williams, Dionna Glaze, Aneesh Kumar K . V,
	Alexey Kardashevskiy, linux-coco@lists.linux.dev
In-Reply-To: <514ddb27-137b-4223-84fe-2152737db3a6@arm.com>

On Tue, Apr 14, 2026 at 11:10:51AM +0100, Suzuki K Poulose wrote:

> > Isn't this also sort of incomplete?  Doesn't anything serious need
> > signed measurements? Isnt't there alot more data that comes out of RMM
> > than just a few measurement registers?
> As mentioned above, this series adds the support for Runtime Extendible
> Measurements (REM in CCA, RTMR on TDX). The RIM+Platform Attestation is
> already provided via the TSM_REPORT

Okay, but what actual use is this?

Extendable measrements with no log
Measurement read back without signature

What is the use case? What do you imagine any userspace will do with
this? Put it in the cover letter.

I don't think the raw rmm calls are sufficiently developed to be
usable directly by userspace. They are less capable than TPM and even
TPM has a lot of software around it to make it useful.

Jason

^ permalink raw reply

* Re: [PATCH v2 1/6] KVM: x86: Add dedicated storage for guest RIP
From: Xiaoyao Li @ 2026-04-14 12:31 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Kiryl Shutsemau
  Cc: kvm, x86, linux-coco, linux-kernel, Chang S . Bae
In-Reply-To: <20260409224236.2021562-2-seanjc@google.com>

On 4/10/2026 6:42 AM, Sean Christopherson wrote:
> Add kvm_vcpu_arch.rip to track guest RIP instead of including it in the
> generic regs[] array.  Decoupling RIP from regs[] will allow using a
> *completely* arbitrary index for RIP, as opposed to the mostly-arbitrary
> index that is currently used.  That in turn will allow using indices
> 16-31 to track R16-R31 that are coming with APX.

Even leave RIP in regs[], what is the problem by just allocating the 
index 16-31 to R16-R31 and making RIP the index 32? (I think I need go 
read the APX discussion to better understand the reason)

> Note, although RIP can used for addressing, it does NOT have an
                          ^
missing a 'be'

> architecturally defined index, and so can't be reached via flows like
> get_vmx_mem_address() where KVM "blindly" reads a general purpose register
> given the SIB information reported by hardware.  For RIP-relative
> addressing, hardware reports the full "offset" in vmcs.EXIT_QUALIFICATION.
> 
> Note #2, keep the available/dirty tracking as RSP is context switched

s/RSP/RIP

> through the VMCS, i.e. needs to be cached for VMX.
> 
> Opportunistically rename NR_VCPU_REGS to NR_VCPU_GENERAL_PURPOSE_REGS to
> better capture what it tracks, and so that KVM can slot in R16-R13 without

s/R16-R13/R16-R31

> running into weirdness where KVM's definition of "EXREG" doesn't line up
> with APX's definition of "extended reg".
> 
> No functional change intended.
> 

^ permalink raw reply

* Re: [PATCH 0/3] arm64/virt: Add Arm CCA measurement register support
From: Suzuki K Poulose @ 2026-04-14 13:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sami Mujawar, Dan Williams, linux-arm-kernel, linux-kernel,
	catalin.marinas, will, thuth, steven.price, gshan, YeoReum.Yun,
	cedric.xing, Dan Williams, Dionna Glaze, Aneesh Kumar K . V,
	Alexey Kardashevskiy, linux-coco@lists.linux.dev
In-Reply-To: <20260414122950.GW3694781@ziepe.ca>

On 14/04/2026 13:29, Jason Gunthorpe wrote:
> On Tue, Apr 14, 2026 at 11:10:51AM +0100, Suzuki K Poulose wrote:
> 
>>> Isn't this also sort of incomplete?  Doesn't anything serious need
>>> signed measurements? Isnt't there alot more data that comes out of RMM
>>> than just a few measurement registers?
>> As mentioned above, this series adds the support for Runtime Extendible
>> Measurements (REM in CCA, RTMR on TDX). The RIM+Platform Attestation is
>> already provided via the TSM_REPORT
> 
> Okay, but what actual use is this?
> 

Good point. This REMs are planned to be used for 
EFI_CC_MEASUREMENT_PROTOCOL as described below:

https://github.com/tianocore/edk2/issues/11383

At the moment they are exposed as raw, similar to the Intel TDX RTMRs.
This may eventually need to be connected to IMA subsystem.

> Extendable measrements with no log
> Measurement read back without signature
> 
> What is the use case? What do you imagine any userspace will do with
> this? Put it in the cover letter.

Agreed.

> 
> I don't think the raw rmm calls are sufficiently developed to be
> usable directly by userspace. They are less capable than TPM and even
> TPM has a lot of software around it to make it useful.

See above.

Kind regards
Suzuki

> 
> Jason


^ permalink raw reply

* Re: [PATCH 0/3] arm64/virt: Add Arm CCA measurement register support
From: Jason Gunthorpe @ 2026-04-14 13:35 UTC (permalink / raw)
  To: Suzuki K Poulose
  Cc: Sami Mujawar, Dan Williams, linux-arm-kernel, linux-kernel,
	catalin.marinas, will, thuth, steven.price, gshan, YeoReum.Yun,
	cedric.xing, Dan Williams, Dionna Glaze, Aneesh Kumar K . V,
	Alexey Kardashevskiy, linux-coco@lists.linux.dev
In-Reply-To: <a1831d3d-af31-4fa7-b107-f4001841f051@arm.com>

On Tue, Apr 14, 2026 at 02:26:58PM +0100, Suzuki K Poulose wrote:
> On 14/04/2026 13:29, Jason Gunthorpe wrote:
> > On Tue, Apr 14, 2026 at 11:10:51AM +0100, Suzuki K Poulose wrote:
> > 
> > > > Isn't this also sort of incomplete?  Doesn't anything serious need
> > > > signed measurements? Isnt't there alot more data that comes out of RMM
> > > > than just a few measurement registers?
> > > As mentioned above, this series adds the support for Runtime Extendible
> > > Measurements (REM in CCA, RTMR on TDX). The RIM+Platform Attestation is
> > > already provided via the TSM_REPORT
> > 
> > Okay, but what actual use is this?
> > 
> 
> Good point. This REMs are planned to be used for EFI_CC_MEASUREMENT_PROTOCOL
> as described below:
> 
> https://github.com/tianocore/edk2/issues/11383

So this is tying it to the same FW event log that TPM uses.

I think that strengthens my point this should all be uninform. TPM
drivers are directly exposing the event log today, but I guess that
needs generalization if non-TPM drivers are going to present it as
well.

How do you imagine getting and manipulating the EFI event log to use
with this?

Jason

^ permalink raw reply

* Re: [PATCH v2 1/6] KVM: x86: Add dedicated storage for guest RIP
From: Chang S. Bae @ 2026-04-14 13:59 UTC (permalink / raw)
  To: Xiaoyao Li, Sean Christopherson, Paolo Bonzini, Kiryl Shutsemau
  Cc: kvm, x86, linux-coco, linux-kernel
In-Reply-To: <20b82b65-b156-4a2c-8094-b86dccfb3025@intel.com>

On 4/14/2026 5:31 AM, Xiaoyao Li wrote:
> Even leave RIP in regs[], what is the problem by just allocating the 
> index 16-31 to R16-R31 and making RIP the index 32?

But why?

Even though the array isn't explicitly labeled as GPRs, that's 
effectively how it's being used, and RIP isn't part of that set.

I don't think there is any benefit of leaving it in regs[]. Instead, It 
can be stored like that simple, period.

Thanks,
Chang

^ permalink raw reply

* Re: [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Sean Christopherson @ 2026-04-14 14:04 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Kai Huang, Chang Seok Bae, kvm@vger.kernel.org,
	pbonzini@redhat.com, kas@kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, x86@kernel.org
In-Reply-To: <95a931f8-42cc-4834-953c-30c9167bfdc1@intel.com>

On Tue, Apr 14, 2026, Xiaoyao Li wrote:
> On 4/14/2026 7:03 AM, Huang, Kai wrote:
> > > Because VMX and SVM make all GRPs available immediately, except
> > > for RSP, KVM ignores avail/dirty for GPRs.  I.e. "fixing" TDX will just shift the
> > > "bugs" elsewhere.
> > Just want to understand:
> > 
> > I thought the fix could be we simply remove the wrong GPRs from the list.
> > Not sure how fixing TDX will shift bugs elsewhere?
> 
> I'm curious too.

What I'm saying is that, _if_ there are bugs where KVM uses a register that isn't
available, then modifying TDX's list won't actually fix anything (without more
changes), it will just change which code is technically buggy (hence all the quotes
above).

> > > More importantly, because the TDX-Module*requires* RCX (the GPR that holds the
> > > mask of registers to expose to the VMM) to be hidden on TDVMCALL, KVM*can't*
> > > do any kind of meaningful "available" tracking.
> > > 
> > Hmm I think RCX conveys the shared GPRs and VMM can read.  Per "Table 5.323:
> > TDH.VP.ENTER Output Operands Format #5 Definition: On TDCALL(TDG.VP.VMCALL)
> > Following a TD Entry":
> > 
> >    RCX   ...
> > 	Bit(s) Name         Description
> > 
> > 	31:0   PARAMS_MASK  Value as passed into TDCALL(TDG.VP.VMCALL) by
> > 			    the guest TD: indicates which part of the guest
> > 			    TD GPR and XMM state is passed as-is to the
> > VMM
> > 			    and back. For details, see the description of
> > 			    TDG.VP.VMCALL in 5.5.26.
> > 
> > I think the problem is, as said previously, currently KVM TDX code uses
> > KVM's existing infrastructure to emulate MSR, KVM hypercall etc,  but
> > TDVMCALL has a different ABI, thus there's a mismatch here.
> 
> I once had patch for it internally.
> 
> It adds back the available check for GPRs when accessing instead of assuming
> they are always available. For normal VMX and SVM, all the GPRs are still
> always available. But for TDX, only EXIT_INFO_1 and EXIT_INFO_2 are always
> marked available, while others need to be explicitly set case by case.
> 
> The good thing is it makes TDX safer that KVM won't consume invalid data
> silently for TDX. But it adds additional overhead of checking the
> unnecessary register availability for VMX and SVM case.
> 
> -----------------------------&<-------------------------------------
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> Date: Tue, 11 Mar 2025 07:13:29 -0400
> Subject: [PATCH] KVM: x86: Add available check for GPRs
> 
> Since commit de3cd117ed2f ("KVM: x86: Omit caching logic for
> always-available GPRs"), KVM doesn't check the availability of GPRs
> except RSP and RIP when accessing them, because they are always
> available.
> 
> However, it's not true when it comes to TDX. The GPRs are not available
> after TD vcpu exits actually.

> And it relies on KVM manually sets the
> GPRs value when needed, e.g.
> 
>  - setting rax, rbx, rcx, rdx, rsi, for hypercall emulation in
>    tdx_emulate_tdvmall();
> 
>  - setting rax, rcx and rdx before MSR write emulation;
> 
> Add the available check of GPRs read, and WARN_ON_ONCE() when unavailable.
> It can help capture the cases of undesired GPRs consumption by TDX.

Sorry, but NAK.  I am strongly against adding any code to the GPR accessors/mutators
just for TDX.  It's a _lot_ of code.  From commit de3cd117ed2f ("KVM: x86: Omit
caching logic for always-available GPRs"):

    E.g. on x86_64, kvm_emulate_cpuid() is reduced from 342 to 182 bytes and
    kvm_emulate_hypercall() from 1362 to 1143, with the total size of KVM
    dropping by ~1000 bytes.  With CONFIG_RETPOLINE=y, the numbers are even
    more pronounced, e.g.: 353->182, 1418->1172 and well over 2000 bytes.

Note that updating only the "available" masks is wrong, as TDX needs to marshall
written registers back to their correct location.

In the end, the available/dirty tracking isn't about hardening against bugs, it's
about deferring expensive VMREAD and VMWRITE (and guest memory) operations until
action is required.

We could bury sanity checks behind a Kconfig of some kind, but I genuinely don't
see much value in doing so.  These emulation flows are very static (all register
usage is hardcoded), and so it's very much a "get it right once" sort of thing,
i.e. the odds of a runtime check finding a bug after initial development are
basically zero.

An alternative for TDX would be to avoid bouncing through GPRs in the first place,
e.g. by reworking __kvm_emulate_rdmsr() to not access any registers.  But I'm
probably opposed to even that, because I doubt the end result would be an overall
net positive for KVM.  We'd end up with duplicate code, harder to read common
code (because of the new abstractions), and likely without meaningfully moving
the needle in terms of finding/preventing bugs.  KVM still needs to get operands
to/from the right parameters, though only difference is that for TDX, the parameters
would be very "direct".

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox