Linux Confidential Computing Development
 help / color / mirror / Atom feed
* [PATCH v7 5/6] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ashish Kalra @ 2026-06-08 18:57 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780903370.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Pages are converted from shared to private as SNP guests are launched.
This destroys exisiting RMPOPT optimizations in the regions where
pages are converted.

Conversely, guest pages are converted back to shared during SNP guest
termination and their region may become eligible for RMPOPT
optimization.

To take advantage of this, perform RMPOPT after guest termination.
Do it after a delay so that a single RMPOPT pass can be done if
multiple guests terminate in a short period of time.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/kvm/svm/sev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e107f368ed2d..29af6f6e603c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3005,6 +3005,8 @@ void sev_vm_destroy(struct kvm *kvm)
 		 */
 		if (snp_decommission_context(kvm))
 			return;
+
+		snp_rmpopt_all_physmem();
 	} else {
 		sev_unbind_asid(kvm, sev->handle);
 	}
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 4/6] x86/sev: Add interface to re-enable RMP optimizations.
From: Ashish Kalra @ 2026-06-08 18:57 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780903370.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

RMPOPT table is a per-CPU table which indicates if 1GB regions of
physical memory are entirely hypervisor-owned or not.

When performing host memory accesses in hypervisor mode as well as
non-SNP guest mode, the processor may consult the RMPOPT table to
potentially skip an RMP access and improve performance.

Events such as RMPUPDATE can clear RMP optimizations. Add an interface
to re-enable those optimizations.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |  2 ++
 arch/x86/virt/svm/sev.c    | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0d662221615a..a11306f25336 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 int snp_prepare(void);
+void snp_rmpopt_all_physmem(void);
 void snp_setup_rmpopt(void);
 void snp_clear_rmpopt_configured(void);
 void snp_shutdown(void);
@@ -682,6 +683,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_rmpopt_all_physmem(void) {}
 static inline void snp_setup_rmpopt(void) {}
 static inline void snp_clear_rmpopt_configured(void) {}
 static inline void snp_shutdown(void) {}
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index b42788a66d40..db2d4c1f5dd7 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -760,6 +760,21 @@ static void rmpopt_work_handler(struct work_struct *work)
 	free_cpumask_var(follower_mask);
 }
 
+void snp_rmpopt_all_physmem(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
+		return;
+
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	if (!rmpopt_wq)
+		return;
+
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
+			   msecs_to_jiffies(RMPOPT_WORK_TIMEOUT));
+}
+EXPORT_SYMBOL_GPL(snp_rmpopt_all_physmem);
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 3/6] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ashish Kalra @ 2026-06-08 18:56 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780903370.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

When SEV-SNP is enabled, all writes to memory are checked to ensure
integrity of SNP guest memory. This imposes performance overhead on the
whole system.

RMPOPT is a new instruction that minimizes the performance overhead of
RMP checks on the hypervisor and on non-SNP guests by allowing RMP
checks to be skipped for 1GB regions of memory that are known not to
contain any SEV-SNP guest memory.

Add support for performing RMP optimizations asynchronously using a
dedicated workqueue.

Enable RMPOPT optimizations for up to 2TB of system RAM starting from
the lowest physical memory address aligned down to a 1GB boundary at
RMP initialization time. RMP checks can initially be skipped for 1GB
memory ranges that do not contain SEV-SNP guest memory (excluding
preassigned pages such as the RMP table and firmware pages). As SNP
guests are launched, RMPUPDATE will disable the corresponding RMPOPT
optimizations.

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 208 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 205 insertions(+), 3 deletions(-)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 482008bb07e4..b42788a66d40 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -19,6 +19,7 @@
 #include <linux/iommu.h>
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
+#include <linux/workqueue.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -125,9 +126,20 @@ static void *rmp_bookkeeping __ro_after_init;
 static u64 probed_rmp_base, probed_rmp_size;
 
 static cpumask_t rmpopt_cpumask;
-static phys_addr_t rmpopt_pa_start;
+static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
 static bool rmpopt_configured;
 
+enum rmpopt_function {
+	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
+	RMPOPT_FUNC_REPORT_STATUS
+};
+
+#define RMPOPT_WORK_TIMEOUT	10000
+
+static struct workqueue_struct *rmpopt_wq;
+static struct delayed_work rmpopt_delayed_work;
+static DEFINE_MUTEX(rmpopt_wq_mutex);
+
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
@@ -568,6 +580,14 @@ static void rmpopt_cleanup(void)
 {
 	int cpu;
 
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	if (!rmpopt_wq)
+		return;
+
+	cancel_delayed_work_sync(&rmpopt_delayed_work);
+	destroy_workqueue(rmpopt_wq);
+
 	cpus_read_lock();
 
 	for_each_cpu(cpu, &rmpopt_cpumask)
@@ -576,7 +596,8 @@ static void rmpopt_cleanup(void)
 	cpus_read_unlock();
 
 	cpumask_clear(&rmpopt_cpumask);
-	rmpopt_pa_start = 0;
+	rmpopt_pa_start = rmpopt_pa_end = 0;
+	rmpopt_wq = NULL;
 }
 
 void snp_shutdown(void)
@@ -599,6 +620,146 @@ void snp_clear_rmpopt_configured(void)
 	rmpopt_configured = false;
 }
 
+/*
+ * RMPOPT: F2 0F 01 FC
+ *   Input:  RAX = system physical address (1GB aligned)
+ *           RCX = operation type
+ *   Output: CF set if the range was optimized
+ */
+static inline bool __rmpopt(u64 pa_start, u64 op_type)
+{
+	bool optimized;
+
+	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
+		     : "=@ccc" (optimized)
+		     : "a" (pa_start), "c" (op_type)
+		     : "memory", "cc");
+
+	return optimized;
+}
+
+static void rmpopt(u64 pa)
+{
+	u64 pa_start = ALIGN_DOWN(pa, SZ_1G);
+	u64 op_type = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+	__rmpopt(pa_start, op_type);
+}
+
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_smp(void *val)
+{
+	rmpopt((u64)val);
+}
+
+/*
+ * RMPOPT optimizations skip RMP checks at 1GB granularity if this
+ * range of memory does not contain any SNP guest memory.
+ */
+static void rmpopt_work_handler(struct work_struct *work)
+{
+	cpumask_var_t follower_mask;
+	phys_addr_t pa;
+	int this_cpu;
+
+	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
+		rmpopt_pa_start, rmpopt_pa_end);
+
+	if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
+		return;
+
+	/*
+	 * RMPOPT scans the RMP table, stores the result of the scan in the
+	 * reserved processor memory. The RMP scan is the most expensive
+	 * part. If a second RMPOPT occurs, it can skip the expensive scan
+	 * if they can see a cached result in the reserved processor memory.
+	 *
+	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
+	 * on every other primary thread. Followers are "designed to"
+	 * skip the scan if they see the "cached" scan results.
+	 */
+	cpumask_copy(follower_mask, &rmpopt_cpumask);
+
+	/*
+	 * Pin the worker to the current CPU for the leader loop so that
+	 * this_cpu remains valid and the RMPOPT instruction executes on
+	 * the correct CPU.
+	 *
+	 * Use migrate_disable() rather than get_cpu() to prevent
+	 * migration while still allowing preemption.
+	 */
+	migrate_disable();
+	this_cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(this_cpu, follower_mask)) {
+		/*
+		 * Current CPU is a primary thread in rmpopt_cpumask.
+		 * Run leader locally and remove from follower mask.
+		 */
+		cpumask_clear_cpu(this_cpu, follower_mask);
+
+		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+			rmpopt(pa);
+			cond_resched();
+		}
+	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
+				      follower_mask)) {
+		/*
+		 * Current CPU is a sibling thread whose primary is in
+		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
+		 * is safe to run the leader locally.  Remove the sibling's
+		 * primary from the follower mask as this core is already
+		 * covered by the leader.
+		 */
+		cpumask_andnot(follower_mask, follower_mask,
+			       topology_sibling_cpumask(this_cpu));
+
+		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+			rmpopt(pa);
+			cond_resched();
+		}
+	} else {
+		/*
+		 * Current CPU does not have RMPOPT_BASE MSR programmed.
+		 * Pick an explicit leader from the cpumask to avoid #UD.
+		 */
+		int leader_cpu = cpumask_first(follower_mask);
+
+		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
+			migrate_enable();
+			goto out;
+		}
+
+		cpumask_clear_cpu(leader_cpu, follower_mask);
+
+		cpus_read_lock();
+		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+			smp_call_function_single(leader_cpu, rmpopt_smp,
+						 (void *)pa, true);
+			cond_resched();
+		}
+		cpus_read_unlock();
+	}
+
+	migrate_enable();
+
+	/* Followers: run RMPOPT on remaining cores */
+	cpus_read_lock();
+	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+		on_each_cpu_mask(follower_mask, rmpopt_smp,
+				 (void *)pa, true);
+
+		 /* Give a chance for other threads to run */
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+out:
+	free_cpumask_var(follower_mask);
+}
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
@@ -607,11 +768,35 @@ void snp_setup_rmpopt(void)
 	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
 		return;
 
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	/*
+	 * Guard against re-initialization.  When SNP_SHUTDOWN_EX is issued
+	 * with x86_snp_shutdown=0, snp_shutdown() is not called and
+	 * rmpopt_cleanup() is skipped, but snp_initialized is still cleared.
+	 * A subsequent __sev_snp_init_locked() would call snp_setup_rmpopt()
+	 * again, leaking the existing workqueue, delayed work, debugfs
+	 * entries, and cpumask state.
+	 */
+	if (rmpopt_wq)
+		return;
+
+	/*
+	 * Create an RMPOPT-specific workqueue to avoid scheduling
+	 * RMPOPT workitem on the global system workqueue.
+	 */
+	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
+	if (!rmpopt_wq) {
+		pr_err("Failed to allocate RMPOPT workqueue\n");
+		return;
+	}
+
 	cpus_read_lock();
 
 	/*
 	 * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
-	 * to set up the RMPOPT_BASE MSR.
+	 * to set up the RMPOPT_BASE MSR. Likewise, only one thread per core
+	 * needs to issue the RMPOPT instruction.
 	 *
 	 * Note: only online primary threads are included.  If a core's
 	 * primary thread is offline, that core is not covered.  CPU hotplug
@@ -635,6 +820,23 @@ void snp_setup_rmpopt(void)
 		wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
 
 	cpus_read_unlock();
+
+	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
+
+	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+
+	/* Limit memory scanning to 2TB of RAM */
+	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T) {
+		pr_info("RMPOPT coverage limited to 2TB; memory above 0x%llx not optimized\n",
+			rmpopt_pa_start + SZ_2T);
+		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
+	}
+
+	/*
+	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
+	 * optimizations on all physical memory.
+	 */
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 2/6] x86/sev: Initialize RMPOPT configuration MSRs
From: Ashish Kalra @ 2026-06-08 18:56 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780903370.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The new RMPOPT instruction helps manage per-CPU RMP optimization
structures inside the CPU. It takes a 1GB-aligned physical address
and either returns the status of the optimizations or tries to enable
the optimizations.

Per-CPU RMPOPT tables support at most 2 TB of addressable memory for
RMP optimizations.

Initialize the per-CPU RMPOPT table base to the starting physical
address. This enables RMP optimization for up to 2 TB of system RAM on
all CPUs.

Additionally, add support to setup and enable RMPOPT once SNP is
enabled and initialized.

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/coco/core.c             |  3 ++
 arch/x86/include/asm/msr-index.h |  3 ++
 arch/x86/include/asm/sev.h       |  4 ++
 arch/x86/virt/svm/sev.c          | 72 +++++++++++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.c     |  3 ++
 5 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 989ca9f72ba3..a8fc2ae50298 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -16,6 +16,7 @@
 #include <asm/archrandom.h>
 #include <asm/coco.h>
 #include <asm/processor.h>
+#include <asm/sev.h>
 
 enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
 SYM_PIC_ALIAS(cc_vendor);
@@ -172,6 +173,8 @@ static void amd_cc_platform_clear(enum cc_attr attr)
 	switch (attr) {
 	case CC_ATTR_HOST_SEV_SNP:
 		cc_flags.host_sev_snp = 0;
+		snp_clear_rmpopt_configured();
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 		break;
 	default:
 		break;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 86554de9a3f5..28540744f1eb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -761,6 +761,9 @@
 #define MSR_AMD64_SEG_RMP_ENABLED_BIT	0
 #define MSR_AMD64_SEG_RMP_ENABLED	BIT_ULL(MSR_AMD64_SEG_RMP_ENABLED_BIT)
 #define MSR_AMD64_RMP_SEGMENT_SHIFT(x)	(((x) & GENMASK_ULL(13, 8)) >> 8)
+#define MSR_AMD64_RMPOPT_BASE		0xc0010139
+#define MSR_AMD64_RMPOPT_ENABLE_BIT	0
+#define MSR_AMD64_RMPOPT_ENABLE		BIT_ULL(MSR_AMD64_RMPOPT_ENABLE_BIT)
 
 #define MSR_SVSM_CAA			0xc001f000
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..0d662221615a 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,8 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 int snp_prepare(void);
+void snp_setup_rmpopt(void);
+void snp_clear_rmpopt_configured(void);
 void snp_shutdown(void);
 #else
 static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -680,6 +682,8 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_setup_rmpopt(void) {}
+static inline void snp_clear_rmpopt_configured(void) {}
 static inline void snp_shutdown(void) {}
 #endif
 
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 8bcdce98f6dc..482008bb07e4 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -124,6 +124,10 @@ static void *rmp_bookkeeping __ro_after_init;
 
 static u64 probed_rmp_base, probed_rmp_size;
 
+static cpumask_t rmpopt_cpumask;
+static phys_addr_t rmpopt_pa_start;
+static bool rmpopt_configured;
+
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
@@ -488,9 +492,14 @@ static bool __init setup_segmented_rmptable(void)
 static bool __init setup_rmptable(void)
 {
 	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
-		if (!setup_segmented_rmptable())
+		if (!setup_segmented_rmptable()) {
+			setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 			return false;
+		}
+		rmpopt_configured = true;
 	} else {
+		/* Note that Segmented RMP must be enabled to enable RMPOPT. */
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 		if (!setup_contiguous_rmptable())
 			return false;
 	}
@@ -555,6 +564,21 @@ int snp_prepare(void)
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
 
+static void rmpopt_cleanup(void)
+{
+	int cpu;
+
+	cpus_read_lock();
+
+	for_each_cpu(cpu, &rmpopt_cpumask)
+		wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0);
+
+	cpus_read_unlock();
+
+	cpumask_clear(&rmpopt_cpumask);
+	rmpopt_pa_start = 0;
+}
+
 void snp_shutdown(void)
 {
 	u64 syscfg;
@@ -563,11 +587,57 @@ void snp_shutdown(void)
 	if (syscfg & MSR_AMD64_SYSCFG_SNP_EN)
 		return;
 
+	rmpopt_cleanup();
+
 	clear_rmp();
 	on_each_cpu(mfd_reconfigure, NULL, 1);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
 
+void snp_clear_rmpopt_configured(void)
+{
+	rmpopt_configured = false;
+}
+
+void snp_setup_rmpopt(void)
+{
+	u64 rmpopt_base;
+	int cpu;
+
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
+		return;
+
+	cpus_read_lock();
+
+	/*
+	 * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
+	 * to set up the RMPOPT_BASE MSR.
+	 *
+	 * Note: only online primary threads are included.  If a core's
+	 * primary thread is offline, that core is not covered.  CPU hotplug
+	 * is not currently supported with SNP enabled.
+	 */
+
+	for_each_online_cpu(cpu)
+		if (topology_is_primary_thread(cpu))
+			cpumask_set_cpu(cpu, &rmpopt_cpumask);
+
+	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
+
+	/*
+	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
+	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
+	 * to the starting physical address to enable RMP optimizations for
+	 * up to 2 TB of system RAM on all CPUs.
+	 */
+	for_each_cpu(cpu, &rmpopt_cpumask)
+		wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+
+	cpus_read_unlock();
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
+
 /*
  * Do the necessary preparations which are verified by the firmware as
  * described in the SNP_INIT_EX firmware command description in the SNP
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 78f98aee7a66..217b6b19802e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1478,6 +1478,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
 	}
 
 	snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+
+	snp_setup_rmpopt();
+
 	sev->snp_initialized = true;
 	dev_dbg(sev->dev, "SEV-SNP firmware initialized, SEV-TIO is %s\n",
 		data.tio_en ? "enabled" : "disabled");
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 1/6] x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
From: Ashish Kalra @ 2026-06-08 18:56 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780903370.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Add a flag indicating whether RMPOPT instruction is supported.

RMPOPT is a new instruction that reduces the performance overhead of
RMP checks for the hypervisor and non-SNP guests by allowing those
checks to be skipped when 1-GB memory regions are known to contain no
SEV-SNP guest memory.

For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.

Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/cpufeatures.h       | 2 +-
 arch/x86/kernel/cpu/scattered.c          | 1 +
 tools/arch/x86/include/asm/cpufeatures.h | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 1d506e5d6f46..794cc96b8493 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
 #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
 #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
 #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free                                 ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
 #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
 #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
 #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 937129ce6a96..021c0bf22de2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -67,6 +67,7 @@ static const struct cpuid_bit cpuid_bits[] = {
 	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_PMC_FREEZE,	CPUID_EAX,  2, 0x80000022, 0 },
+	{ X86_FEATURE_RMPOPT,			CPUID_EDX,  0, 0x80000025, 0 },
 	{ X86_FEATURE_AMD_HTR_CORES,		CPUID_EAX, 30, 0x80000026, 0 },
 	{ 0, 0, 0, 0, 0 }
 };
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 86d17b195e79..7ce681af1dd7 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
 #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
 #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
 #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free                                 ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
 #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
 #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
 #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 0/6] Add RMPOPT support.
From: Ashish Kalra @ 2026-06-08 18:56 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco

From: Ashish Kalra <ashish.kalra@amd.com>

In the SEV-SNP architecture, hypervisor and non-SNP guests are subject
to RMP checks on writes to provide integrity of SEV-SNP guest memory.

The RMPOPT architecture enables optimizations whereby the RMP checks
can be skipped if 1GB regions of memory are known to not contain any
SNP guest memory.

RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks for the hypervisor and non-SNP guests.

RMPOPT instruction currently supports two functions. In case of the
verify and report status function the CPU will read the RMP contents,
verify the entire 1GB region starting at the provided SPA is HV-owned.
For the entire 1GB region it checks that all RMP entries in this region
are HV-owned (i.e, not in assigned state) and then accordingly updates
the RMPOPT table to indicate if optimization has been enabled and
provide indication to software if the optimization was successful.

In case of report status function, the CPU returns the optimization
status for the 1GB region.

The RMPOPT table is managed by a combination of software and hardware.
Software uses the RMPOPT instruction to set bits in the table,
indicating that regions of memory are entirely HV-owned.  Hardware
automatically clears bits in the RMPOPT table when RMP contents are
changed during RMPUPDATE instruction.

For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.

As SNP is enabled by default the hypervisor and non-SNP guests are
subject to RMP write checks to provide integrity of SNP guest memory.

This patch-series adds support to enable RMP optimizations for up to
2TB of system RAM across the system and allow RMPUPDATE to disable
those optimizations as SNP guests are launched.

Support for RAM larger than 2 TB will be added in follow-on series.

This series also introduces support to re-enable RMP optimizations
during SNP guest termination, after guest pages have been converted
back to shared.

RMP optimizations are performed asynchronously by queuing work on a
dedicated workqueue after a 10 second delay.

Delaying work allows batching of multiple SNP guest terminations.

Once 1GB hugetlb guest_memfd support is merged, support for
re-enabling RMPOPT optimizations during 1GB page cleanup will be added
in follow-on series.

Additionally add debugfs interface to report per-CPU RMPOPT status
across all system RAM.

v7:
- Sync tools/arch/x86/include/asm/cpufeatures.h to mirror the kernel
  header for X86_FEATURE_RMPOPT.
- Fix commit title to use X86_FEATURE_RMPOPT to match the code
  (was X86_FEATURE_AMD_RMPOPT).
- Add static bool rmpopt_configured, set only when segmented RMP setup
  succeeds in setup_rmptable().  Check rmpopt_configured alongside
  cpu_feature_enabled(X86_FEATURE_RMPOPT) in snp_setup_rmpopt() and
  snp_rmpopt_all_physmem(), because setup_clear_cpu_cap() is unreliable
  after alternatives are patched.  Add snp_clear_rmpopt_configured()
  called from amd_cc_platform_clear() when CC_ATTR_HOST_SEV_SNP is
  cleared.  Do not use __ro_after_init on rmpopt_configured since the
  writer snp_clear_rmpopt_configured() is not __init.
- Add cond_resched() to all three leader loops in rmpopt_work_handler()
  to prevent soft lockups on systems with up to 2TB of RAM.
- Add comment above __rmpopt() documenting the RMPOPT instruction
  encoding (F2 0F 01 FC) and register interface (RAX = system physical
  address input, RCX = operation type input, RFLAGS.CF = output).
  Note: RMPOPT does not modify RAX unlike PVALIDATE/RMPUPDATE, so
  the existing "a" (input-only) constraint is correct.

  Sashiko AI code review identified several of the above issues.

v6:
- Drop wrmsrq_on_cpus() helper; use for_each_cpu() with wrmsrq_on_cpu()
  instead, as RMPOPT_BASE MSR programming is not performance-critical.
- Rewrite rmpopt_work_handler() leader selection to use a local
  follower_mask copy instead of modifying the global rmpopt_cpumask.
  This eliminates the current_cpu_cleared tracking and the restore at
  the end, and removes the need for synchronization comments about
  transient cpumask inconsistency.
- Add three-way leader selection in rmpopt_work_handler():
  1. Current CPU is a primary thread in cpumask: run leader locally.
  2. Current CPU is a sibling thread whose primary is in cpumask:
     run leader locally (RMPOPT_BASE MSR is per-core), remove the
     primary from followers via cpumask_andnot(topology_sibling_cpumask).
  3. Current CPU's core has no RMPOPT_BASE MSR programmed: pick an
     explicit leader via cpumask_first() + smp_call_function_single()
     to avoid #UD, with cpus_read_lock() around the IPI loop.
- Add WARN_ON_ONCE guard for empty cpumask in the explicit leader
  fallback path, with migrate_enable() before goto out.
- Add .llseek = seq_lseek to rmpopt_table_fops for consistency with
  other seq_file-based debugfs files and to support tools like "less".
- Change debugfs file permissions from 0444 to 0400 to restrict access
  to root only.
- Add comment in rmpopt_table_seq_show() explaining why cpu_online_mask
  is safe: RMPOPT_BASE MSR is per-core and snp_prepare() ensures all
  CPUs are online when the MSR is programmed.

  Sashiko AI code review identified several of the above issues.

v5:
- Introduce rmpopt_cleanup() to tear down workqueue, debugfs, cpumask,
  and MSR state, called from snp_shutdown().
- Introduce rmpopt_wq_mutex to serialize snp_setup_rmpopt(),
  snp_rmpopt_all_physmem(), and rmpopt_cleanup().
- Introduce rmpopt_show_mutex to serialize debugfs reporting of
  rmpopt_report_cpumask.
- Move snp_rmpopt_all_physmem() call after SNP DECOMMISSION during
  guest shutdown.
- Use migrate_disable()/migrate_enable() for CPU pinning in the
  rmpopt_work_handler() leader loop to maintain CPU affinity without
  disabling preemption for the entire RMPOPT scan.
- Add cpus_read_lock()/cpus_read_unlock() around the follower
  on_each_cpu_mask() loop in rmpopt_work_handler().
- Guard snp_setup_rmpopt() against re-initialization when
  SNP_SHUTDOWN_EX with x86_snp_shutdown=0 skips rmpopt_cleanup()
  but clears snp_initialized, preventing workqueue and resource
  leaks on repeated init/shutdown cycles.
- Replace setup_clear_cpu_cap() with pr_err() on alloc_workqueue()
  failure in snp_setup_rmpopt(), as setup_clear_cpu_cap() cannot be
  used after alternatives are patched; callers check rmpopt_wq != NULL
  as the runtime guard instead.
- Add pr_info() when RMPOPT coverage is capped at 2TB.
- Add comments noting CPU hotplug is not supported with SNP enabled
  and only online primary threads are covered by rmpopt_cpumask.
- Add comment in setup_rmptable() noting Segmented RMP must be
  enabled to enable RMPOPT.
- Simplify cpumask setup loop to set if primary thread rather than
  skip if not primary.
- Improve grammar and clarity in snp_setup_rmpopt() comments.
- Added Reviewed-by's.

  Sashiko AI code review identified several of the above issues.

v4:
- Add new wrmsrq_on_cpus() helper to write same u64 value to a
  per-CPU MSR across a cpumask without per-cpu struct allocation
  overhead. 
- Rename configure_and_enable_rmpopt() to snp_setup_rmpopt().
- Use wrmsrq_on_cpus() instead of wrmsrq_on_cpu() loop for
  programming RMPOPT_BASE MSRs.
- Add setup_clear_cpu_cap(X86_FEATURE_RMPOPT) if segmented RMP
  setup fails or workqueue allocation fails.
- Add X86_FEATURE_RMPOPT feature clear logic in amd_cc_platform_clear()
  for CC_ATTR_HOST_SEV_SNP.
- All of the above allow checking for only X86_FEATURE_RMPOPT for both
  RMPOPT setup/enable and RMP re-optimizations.
- Rename snp_perform_rmp_optimization() to snp_rmpopt_all_physmem().
- Split rmpopt() into rmpopt() and rmpopt_smp() for SMP callback use.
- Introduce separate rmpopt_report_cpumask for debugfs reporting,
  distinct from rmpopt_cpumask used for primary thread tracking.
- Remove snp_perform_rmp_optimization() call from __sev_snp_init_locked() 
  and instead setup and enable RMPOPT after SNP is enabled and 
  initialized.

v3:
- Drop all RMPOPT kthread support and introduce adding custom and
  dedicated workqueue to schedule delayed and asynchronous RMPOPT work.
- Drop the guest_memfd inode cleanup interface and add support to
  re-enable RMP optimizations during guest shutdown using the
  asynchronous and delayed workqueue interface.
- Introduce new __rmpopt() helper and rmpopt() and
  rmpopt_report_status() wrappers on top which use rax and rcx
  parameters to closely match RMPOPT specs.
- Use new optimized RMPOPT loop to issue RMPOPT instructions on all
  system RAM upto 2TB and all CPUs, by optimizing each range on one CPU
  first, then let other CPUs execute RMPOPT in parallel so they can skip
  most work as the range has already been optimized.
- Also add support for running the optimized RMPOPT loop only on
  one thread per core.
- Replace all PUD_SIZE references with SZ_1G to conform to 1GB regions
  as specified by RMPOPT specifications and not be dependent on PUD_SIZE
  which makes the RMPOPT patch-set independent of x86 page table sizes.
- Use wrmsrq_on_cpu() to program the RMPOPT_BASE MSR registers on
  all CPUs that removes all ugly casting to use on_each_cpu_mask().
- Fix inline commits and patch commit messages


v2:
- Drop all NUMA and Socket configuration and enablement support and
  enable RMPOPT support for up to 2TB of system RAM.
- Drop get_cpumask_of_primary_threads() and enable per-core RMPOPT
  base MSRs and issue RMPOPT instruction on all CPUs.
- Drop the configfs interface to manually re-enable RMP optimizations.
- Add new guest_memfd cleanup interface to automatically re-enable
  RMP optimizations during guest shutdown.
- Include references to the public RMPOPT documentation.
- Move debugfs directory for RMPOPT under architecuture specific
  parent directory.

Ashish Kalra (6):
  x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
  x86/sev: Initialize RMPOPT configuration MSRs
  x86/sev: Add support to perform RMP optimizations asynchronously
  x86/sev: Add interface to re-enable RMP optimizations.
  KVM: SEV: Perform RMP optimizations on SNP guest shutdown
  x86/sev: Add debugfs support for RMPOPT

 arch/x86/coco/core.c                     |   3 +
 arch/x86/include/asm/cpufeatures.h       |   2 +-
 arch/x86/include/asm/msr-index.h         |   3 +
 arch/x86/include/asm/sev.h               |   6 +
 arch/x86/kernel/cpu/scattered.c          |   1 +
 arch/x86/kvm/svm/sev.c                   |   2 +
 arch/x86/virt/svm/sev.c                  | 417 ++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.c             |   3 +
 tools/arch/x86/include/asm/cpufeatures.h |   2 +-
 9 files changed, 436 insertions(+), 3 deletions(-)

-- 
2.43.0


^ permalink raw reply

* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Adrian Hunter @ 2026-06-08 18:31 UTC (permalink / raw)
  To: Xu Yilun, kas, djbw, rick.p.edgecombe, x86, peter.fang
  Cc: linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <20260522034128.3144354-1-yilun.xu@linux.intel.com>

On 22/05/2026 06:41, Xu Yilun wrote:
> This posting is just to collect initial review.
> 
> Sean, Paolo, Dave please feel free to ignore for now. Sean, especially
> the x86 KVM stuff is only here as an example for the init code, and not
> ready for review.
> 
> Kiryl and Dan, we are trying to get acks for the first 4 patches of the
> series so they can be serve as a settled base for all the other work
> that uses Extensions. Please review the first 4 patches and treat the
> later ones as an example for the Extensions initialization.
> 
> == Why it's being posted ==
> 
> The TDX Module is introducing a new concept called "TDX Module
> Extensions", and several upcoming features depend on them. The
> Extensions need some extra setup at TDX module init time, and the code
> to do this is expected to be somewhat generic.
> 
> We want to get the basics of this TDX module extensions piece sorted so
> that all of the extension-based work can build on it. This series
> includes those basics, and an example usage called DICE-based TDX
> Quoting. Only the first 4 patches are about initializing the TDX module
> Extensions. I'd like some review on them. The later DICE patches are
> just included to serve as a usage example for the TDX module extension
> code.
> 
> The first 4 patches will eventually need an ack by an x86 maintainer, so
> please review with that in mind.
> 
> == Overview ==
> 
> TDX Module introduces the "TDX Module Extensions" to support long
> running / hard-irq preemptible flows inside. This makes TDX Module
> capable of handling complex tasks through "Extension SEAMCALLs".

For me it would be easier to understand by starting higher level,
like:

"TDX Module Extensions enables optional but important TDX features
 - such as DICE-based attestation quoting, TDX Connect, and live
migration - that require substantially more processing time than
core TDX operations, and also additional memory."

Also I would find it helpful to clarify how "TDX Module Extensions"
enhances interruptibility for Extension SEAMCALLs compared with
regular SEAMCALLs, since "hard-irq preemptible flows" had me
initially thinking along the wrong lines.

> 
> TDX Module allows some add-on features to use the Extension. The first
> feature to use Extensions is DICE-based TDX Quoting [1]. DICE is an
> industry-standard, certificate-backed attestation framework that layers
> evidence through a chain of certificates.
> 
> This series adds infrastructure to enable the Extensions and then
> implement DICE-based TDX Quoting.
> 
> The Extensions consumes relatively large amount of memory (~50MB). So it
> is designed to be off by default. It must be enabled after basic TDX
> Module initialization and when add-on features require it. To enable
> the Extensions, host first adds extra memory to TDX Module via a
> SEAMCALL (TDH.EXT.MEM.ADD), then uses another SEAMCALL (TDH.EXT.INIT) to
> initialize Extensions, and then some add-on features, e.g. DICE, could
> use Extension SEAMCALLs for work. Note that host can never get the added
> memory back.
> 
> Theoretically, the Extensions doesn't need to be enabled right after
> basic TDX initialization. It could be enabled right before the first
> Extension SEAMCALL is issued. That would save or postpone memory usage.
> But it isn't worth the complexity, the needs for the Extensions are vast
> but the savings are little for a typical TDX capable system (about
> 0.001% of memory). So the Linux decision is to just enable it along with
> the basic TDX.
> 
> This series has 2 distinct parts:
> 
>   Patches  1-4:  TDX Module Extensions enabling
>   Patches  5-15: DICE-based TDX Quoting, primarily Peter's work.
> 
> == Some history ==
> 
> The TDX Module Extensions part was first posted along with TDX
> Connect [2]. Now this part is remarkably smaller because we've removed
> the generic tdx_page_array abstraction for HPA_LIST_INFO. TDX Module
> Extensions is the first user of HPA_LIST_INFO, and doesn't use it in a
> typical way (HPA_LIST_INFO can only hold at most 2MB memory). There
> isn't enough justification to make the abstraction in this series. A
> possible plan is to rebuild tdx_page_array iteratively when more use
> cases arise.
> 
> == Misc ==
> 
> This series is based on tip/x86/tdx [3], because we need a small
> being-merged patch [4] before our work.
> 
> 
> Link: https://cdrdv2.intel.com/v1/dl/getContent/874303 # [1]
> Link: https://lore.kernel.org/all/20260327160132.2946114-1-yilun.xu@linux.intel.com/ # [2]
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=x86/tdx # [3]
> Link: https://patch.msgid.link/20260402-fuller_tdx_kexec_support-v3-1-34438d7094bf@intel.com # [4]
> 
> 
> Peter Fang (10):
>   x86/virt/tdx: Move tdx_tdr_pa() up in the file
>   x86/virt/tdx: Initialize Quoting extension during bringup
>   x86/virt/tdx: Prepare Quote buffer during extension bringup
>   x86/virt/tdx: Add interface to check Quoting availability
>   x86/virt/tdx: Add interface to generate a Quote
>   x86/tdx: Move and rename Quote request structure
>   KVM: TDX: Factor out userspace return path from tdx_get_quote()
>   KVM: TDX: Add in-kernel Quote generation
>   KVM: TDX: Support event-notify interrupts only with userspace quoting
>   x86/virt/tdx: Enable TDX Quoting extension
> 
> Xu Yilun (5):
>   x86/virt/tdx: Read global metadata for TDX Module Extensions
>   x86/virt/tdx: Add extra memory to TDX Module for Extensions
>   x86/virt/tdx: Make TDX Module initialize Extensions
>   x86/virt/tdx: Enable the Extensions right after basic TDX Module init
>   x86/virt/tdx: Embed version info in SEAMCALL leaf function definitions
> 
>  Documentation/virt/kvm/api.rst              |   8 +-
>  arch/x86/include/asm/tdx.h                  |  34 ++
>  arch/x86/include/asm/tdx_global_metadata.h  |  11 +
>  arch/x86/kvm/vmx/tdx.h                      |   6 +
>  arch/x86/virt/vmx/tdx/tdx.h                 |  32 +-
>  arch/x86/kvm/vmx/tdx.c                      | 176 ++++++++-
>  arch/x86/virt/vmx/tdx/tdx.c                 | 387 +++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx_global_metadata.c |  27 ++
>  drivers/virt/coco/tdx-guest/tdx-guest.c     |  25 +-
>  virt/kvm/kvm_main.c                         |   1 +
>  10 files changed, 655 insertions(+), 52 deletions(-)
> 
> 
> base-commit: 5209e5bfe5cab593476c3e7754e42c5e47ce36de


^ permalink raw reply

* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Suzuki K Poulose @ 2026-06-08 16:27 UTC (permalink / raw)
  To: Sudeep Holla, Aneesh Kumar K.V
  Cc: linux-coco, linux-arm-kernel, linux-kernel, Catalin Marinas,
	Greg KH, Jeremy Linton, Jonathan Cameron, Lorenzo Pieralisi,
	Mark Rutland, Will Deacon, Steven Price
In-Reply-To: <20260608-hot-fascinating-tortoise-cccc61@sudeepholla>

On 08/06/2026 13:32, Sudeep Holla wrote:
> On Mon, Jun 08, 2026 at 04:56:29PM +0530, Aneesh Kumar K.V wrote:
>> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
>>
>>> On 08/06/2026 09:19, Aneesh Kumar K.V wrote:
>>>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>>>>
>>>>> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
>>>>>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>>>>>>
>>>>>> ...
>>
>> ...
>>
>>>>> I was trying to avoid conditional compilation altogether and hence the
>>>>> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
>>>>> in above snippet must come as some condition to this generic probe.
>>>>>
>>>>> Adding any more logic or callback defeats the bus idea here if we need
>>>>> to rely/depend on multiple conditional compilation or callbacks IMO.
>>>>>
>>>>> Let's find see if it can work with what we are adding now and may add in
>>>>> near future and then decide.
>>>>>
>>>>
>>>> If we move all the conditional checks to the driver probe path, then I
>>>> think this can work. Something like the below:
>>>>
>>>> struct smccc_device_info {
>>>> 	u32 func_id;
>>>> 	bool requires_smc;
>>>> 	const char *device_name;
>>>> };
>>>>
>>>> static const struct smccc_device_info smccc_devices[] __initconst = {
>>>> 	{
>>>> 		.func_id        = ARM_SMCCC_TRNG_VERSION,
>>>> 		.requires_smc   = false,
>>>> 		.device_name    = "arm-smccc-trng",
>>>> 	},
>>>>
>>>> 	{
>>>> 		.func_id        = RSI_ABI_VERSION,
>>>
>>> Don't we need parameters passed to this (Requested Interface version for
>>> e.g.) ? See more below.
>>>
>>
>> The idea is that we only check whether the function ID is supported. All
>> other conditional logic should be handled in the driver probe path, as
>> demonstrated by the changes in drivers/char/hw_random/arm_smccc_trng.c.
>>
> 
> +1. Yes, we just want to know whether the firmware is aware of that feature
> before creating the `smccc_device` for it. The device probe can then perform a
> more thorough, feature-specific check to determine whether the device/feature
> is usable.
> 
> That is the main idea behind the approach I suggested. Please let me know if
> you still see any issues or think this may not work.

Ok, yea, I kind of forgot that we are boot strapping a driver based on
this "smccc device" and the device is only an indication of the firmware
service, the driver would decide if the service matches its expectation.

Thanks for the clarification, apologies for the noise.

Suzuki



> 


^ permalink raw reply

* Re: [RFC PATCH] mm/vmalloc: add vmalloc_decrypted() and vzalloc_decrypted()
From: Catalin Marinas @ 2026-06-08 15:37 UTC (permalink / raw)
  To: Kameron Carr
  Cc: akpm, urezki, linux-mm, linux-kernel, rppt, mhklinux, linux-coco,
	Suzuki K Poulose
In-Reply-To: <20260521205834.1012925-1-kameroncarr@linux.microsoft.com>

+ linux-coco, Suzuki (for the arm64 behaviour)

On Thu, May 21, 2026 at 01:58:34PM -0700, Kameron Carr wrote:
> In confidential computing environments (arm64 CCA, x86 SEV/TDX), guest
> memory is encrypted by default and must be explicitly transitioned to a
> decrypted/shared state for host-visible access.  Calling
> set_memory_decrypted() on a vmalloc address is not supported, and not
> recommended as it would be inefficient to decrypt the pages after they
> have been mapped.
> 
> Add vmalloc_decrypted() and vzalloc_decrypted() which decrypt pages on
> the linear map before creating the vmalloc mapping via vmap(), so
> physical pages are never mapped with conflicting encryption attributes
> across aliases.  A new VM_DECRYPTED flag marks these allocations so that
> vfree() automatically re-encrypts pages before returning them to the
> page allocator.
> 
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Link: https://lore.kernel.org/linux-arm-kernel/ZmNJdSxSz-sYpVgI@arm.com/
> Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
> ---
>  include/linux/vmalloc.h |   7 ++
>  mm/vmalloc.c            | 163 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 170 insertions(+)

There are a few Sashiko comments worth reviewing:

https://sashiko.dev/#/patchset/20260521205834.1012925-1-kameroncarr@linux.microsoft.com

[...]
> +/*
> + * Re-encrypt the linear-map alias of all pages backing a VM_DECRYPTED area.
> + * Best-effort: on per-block failure the loop continues so as many pages as
> + * possible are returned to the encrypted state.  Pages that fail to
> + * transition are left out of area->pages and leaked.
> + */
> +static int vm_pages_encrypt(struct vm_struct *area)
> +{
> +	unsigned int nr = 1U << vm_area_page_order(area);
> +	unsigned int i;
> +	unsigned int leaked;
> +	int ret = 0;
> +
> +	for (i = 0; i < area->nr_pages; i += nr) {
> +		int err = __vm_pages_enc_dec(area, i, nr, true);
> +
> +		if (err && !ret)
> +			ret = err;
> +	}
> +
> +	leaked = vm_compact_leaked_pages(area);
> +	if (leaked)
> +		pr_warn("vmalloc: re-encryption failed, leaked %u pages\n",
> +			leaked);
> +	return ret;
> +}
> +
> +/*
> + * Decrypt the linear-map alias of all pages backing a VM_DECRYPTED area.
> + * On failure, the already-decrypted prefix is rolled back to encrypted.
> + * Pages that fail either the initial decrypt or the rollback re-encrypt are
> + * left out of area->pages and leaked.
> + */
> +static int vm_pages_decrypt(struct vm_struct *area)
> +{
> +	unsigned int nr = 1U << vm_area_page_order(area);
> +	unsigned int i;
> +	unsigned int leaked;
> +	int ret = 0;
> +
> +	for (i = 0; i < area->nr_pages; i += nr) {
> +		ret = __vm_pages_enc_dec(area, i, nr, false);
> +		if (ret)
> +			goto rollback;
> +	}
> +	return 0;
> +
> +rollback:
> +	while (i) {
> +		i -= nr;
> +		__vm_pages_enc_dec(area, i, nr, true);
> +	}
> +
> +	leaked = vm_compact_leaked_pages(area);
> +	if (leaked)
> +		pr_warn("vmalloc: decryption failed, leaked %u pages\n",
> +			leaked);
> +	return ret;
> +}
> +
>  /**
>   * vfree - Release memory allocated by vmalloc()
>   * @addr:  Memory base address
> @@ -3457,6 +3554,9 @@ void vfree(const void *addr)
>  		return;
>  	}
>  
> +	if (unlikely(vm->flags & VM_DECRYPTED))
> +		vm_pages_encrypt(vm);

I think we still have the vmalloc aliases at this point as we lazily
reclaim them. We should call vm_unmap_aliases() before
vm_pages_encrypt(). It matches the x86 __set_memory_enc_pgtable() as
well with the explicit call to vm_unmap_aliases().

The vrealloc() path may have some issues as well but I haven't looked in
detail. Not sure it actually re-allocs decrypted pages. The simplest is
to reject vrealloc() for such vms until we have a use-case.

> +/**
> + * vzalloc_decrypted - allocate zeroed virtually contiguous decrypted memory
> + * @size:    allocation size
> + *
> + * Like vmalloc_decrypted(), but the memory is set to zero.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vzalloc_decrypted_noprof(unsigned long size)
> +{
> +	void *addr;
> +
> +	addr = __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
> +					   GFP_KERNEL,
> +					   pgprot_decrypted(PAGE_KERNEL),
> +					   VM_DECRYPTED, NUMA_NO_NODE,
> +					   __builtin_return_address(0));
> +	if (addr)
> +		memset(addr, 0, size);

Talking to Suzuki, the small window between set_memory_decrypted() and
memset() potentially exposing stale data is safe, at least for Arm CCA
as the memory would be scrubbed (there are other places in the kernel
where we do something similar). I assume that's also the case for other
architectures, although not sure what pKVM does.

-- 
Catalin

^ permalink raw reply

* Re: [RFC PATCH 15/15] x86/virt/tdx: Enable TDX Quoting extension
From: Xu Yilun @ 2026-06-08 15:10 UTC (permalink / raw)
  To: Kishen Maloor
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <caf5de99-bce3-4dc6-8757-cf2b4aaf85c9@intel.com>

> > diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> > index 10aff23cd01f..524a14c01aa6 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.h
> > +++ b/arch/x86/virt/vmx/tdx/tdx.h
> > @@ -58,7 +58,8 @@
> >   #define TDH_PHYMEM_CACHE_WB		40
> >   #define TDH_PHYMEM_PAGE_WBINVD		41
> >   #define TDH_VP_WR			43
> > -#define TDH_SYS_CONFIG			45
> > +#define TDH_SYS_CONFIG_V0		45
> 
> Is it necessary to add _Vx macros when multiple versions can co-exist?
> 
> Just wondering if it would be cleaner in the following way?
> 
> - Leave the macros set at the current (non-deprecated) baseline version.
> - Select vX using SEAMCALL_LEAF_VER() in config_tdx_module() when a vX feature
>   is enabled.

My perference is to have the most simple rule for the unversioned macro.
And leave all workarounds to _Vx macros, they will eventually been
removed or deprecated.

Anyway this patch is not expected to merge with this series, maybe we
will have a public release with V1 supported when it is to be merged,
then we don't have to make such a hard choice.

> 
> u64 seamcall_fn = TDH_SYS_CONFIG;
> ...
> if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_QUOTE) {
> ...
>     seamcall_fn = SEAMCALL_LEAF_VER(TDH_SYS_CONFIG, 1);

^ permalink raw reply

* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Steven Price @ 2026-06-08 13:53 UTC (permalink / raw)
  To: Suzuki K Poulose, Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <ad45c64c-18ab-4d42-9813-0ac1704bb127@arm.com>

On 08/06/2026 10:41, Suzuki K Poulose wrote:
> On 08/06/2026 10:36, Steven Price wrote:
>> On 28/05/2026 06:30, Gavin Shan wrote:
>>> Hi Steve,
>>>
>>> On 5/13/26 11:17 PM, Steven Price wrote:
>>>> The VMM needs to populate the realm with some data before starting
>>>> (e.g.
>>>> a kernel and initrd). This is measured by the RMM and used as part of
>>>> the attestation later on.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
> 
> ...
> 
>>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>>> index a89873a5eb77..209087bcf399 100644
>>>> --- a/arch/arm64/kvm/rmi.c
>>>> +++ b/arch/arm64/kvm/rmi.c
>>>> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>>>> unsigned long start,
>>>>            realm_unmap_private_range(kvm, start, end, may_block);
>>>>    }
>>>>    +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>>>> +                   kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
>>>> +                   unsigned long flags)
>>>> +{
>>>> +    struct realm *realm = &kvm->arch.realm;
>>>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>>>> +    phys_addr_t dst_phys, src_phys;
>>>> +    int ret;
>>>> +
>>>> +    dst_phys = __pfn_to_phys(dst_pfn);
>>>> +    src_phys = __pfn_to_phys(src_pfn);
>>>> +
>>>> +    if (rmi_delegate_page(dst_phys))
>>>> +        return -ENXIO;
>>>> +
>>>> +    ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>>>> +    if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>>> +        /* Create missing RTTs and retry */
>>>> +        int level = RMI_RETURN_INDEX(ret);
>>>> +
>>>> +        KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
>>>
>>>          KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);
>>
>> Ack.
>>
> 
> Thinking more about this, I guess a buggy VMM can trigger this
> by populating twice ? (level == KVM_PGTABLE_LAST_LEVEL). So, we should
> return the error back, than warning here and suppressing the error ?

Populating twice causes rmi_delegate_page() to be run twice on the same
page and the second one will then fail. So I don't think this is
possible (please correct me if I've missed something!)

Thanks,
Steve

^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-08 12:58 UTC (permalink / raw)
  To: Steven Price, Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <21abf014-a68c-49b6-a113-4ec59c520a30@arm.com>

On 08/06/2026 11:56, Steven Price wrote:
> On 08/06/2026 10:30, Suzuki K Poulose wrote:
>> On 05/06/2026 07:23, Gavin Shan wrote:
>>> Hi Steve,
>>>
>>> On 5/13/26 11:17 PM, Steven Price wrote:
>>>> At runtime if the realm guest accesses memory which hasn't yet been
>>>> mapped then KVM needs to either populate the region or fault the guest.
>>>>
>>>> For memory in the lower (protected) region of IPA a fresh page is
>>>> provided to the RMM which will zero the contents. For memory in the
>>>> upper (shared) region of IPA, the memory from the memslot is mapped
>>>> into the realm VM non secure.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>> Changes since v13:
>>>>    * Numerous changes due to rebasing.
>>>>    * Fix addr_range_desc() to encode the correct block size.
>>>> Changes since v12:
>>>>    * Switch to RMM v2.0 range based APIs.
>>>> Changes since v11:
>>>>    * Adapt to upstream changes.
>>>> Changes since v10:
>>>>    * RME->RMI renaming.
>>>>    * Adapt to upstream gmem changes.
>>>> Changes since v9:
>>>>    * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>>>>      may_block to avoid stall warnings.
>>>>    * Minor coding style fixes.
>>>> Changes since v8:
>>>>    * Propagate the may_block flag.
>>>>    * Minor comments and coding style changes.
>>>> Changes since v7:
>>>>    * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>>>>      internally WARN when necessary.
>>>> Changes since v6:
>>>>    * Handle PAGE_SIZE being larger than RMM granule size.
>>>>    * Some minor renaming following review comments.
>>>> Changes since v5:
>>>>    * Reduce use of struct page in preparation for supporting the RMM
>>>>      having a different page size to the host.
>>>>    * Handle a race when delegating a page where another CPU has
>>>> faulted on
>>>>      a the same page (and already delegated the physical page) but not
>>>> yet
>>>>      mapped it. In this case simply return to the guest to either use the
>>>>      mapping from the other CPU (or refault if the race is lost).
>>>>    * The changes to populate_par_region() are moved into the previous
>>>>      patch where they belong.
>>>> Changes since v4:
>>>>    * Code cleanup following review feedback.
>>>>    * Drop the PTE_SHARED bit when creating unprotected page table
>>>> entries.
>>>>      This is now set by the RMM and the host has no control of it and the
>>>>      spec requires the bit to be set to zero.
>>>> Changes since v2:
>>>>    * Avoid leaking memory if failing to map it in the realm.
>>>>    * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>>>>    * Adapt to changes in previous patches.
>>>> ---
>>>>    arch/arm64/include/asm/kvm_emulate.h |   8 ++
>>>>    arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>>>>    arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>>>>    arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>>>>    4 files changed, 301 insertions(+), 20 deletions(-)
>>>>
> 
> [...]
> 
>>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>>> index cae29fd3353c..761b38a4071c 100644
>>>> --- a/arch/arm64/kvm/rmi.c
>>>> +++ b/arch/arm64/kvm/rmi.c
>>>> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm,
>>>> unsigned long ipa,
>>>>        return ret;
>>>>    }
>>>> +static unsigned long addr_range_desc(unsigned long phys, unsigned
>>>> long size)
>>>> +{
>>>> +    unsigned long out = 0;
>>>> +
>>>> +    switch (size) {
>>>> +    case P4D_SIZE:
>>>> +        out = 3 | (1 << 2);
>>>> +        break;
>>>> +    case PUD_SIZE:
>>>> +        out = 2 | (1 << 2);
>>>> +        break;
>>>> +    case PMD_SIZE:
>>>> +        out = 1 | (1 << 2);
>>>> +        break;
>>>> +    case PAGE_SIZE:
>>>> +        out = 0 | (1 << 2);
>>>> +        break;
>>>> +    default:
>>>> +        /*
>>>> +         * Only support mapping at the page level granulatity when
>>>> +         * it's an unusual length. This should get us back onto a
>>>> larger
>>>> +         * block size for the subsequent mappings.
>>>> +         */
>>>> +        out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
>>>> +        break;
>>>> +    }
>>>> +
>>>> +    WARN_ON(phys & ~PAGE_MASK);
>>>> +
>>>> +    out |= phys & PAGE_MASK;
>>>> +
>>>> +    return out;
>>>> +}
>>>> +
>>>> +int realm_map_protected(struct kvm *kvm,
>>>> +            unsigned long ipa,
>>>> +            kvm_pfn_t pfn,
>>>> +            unsigned long map_size,
>>>> +            struct kvm_mmu_memory_cache *memcache)
>>>> +{
>>>> +    struct realm *realm = &kvm->arch.realm;
>>>> +    phys_addr_t phys = __pfn_to_phys(pfn);
>>>> +    phys_addr_t base_phys = phys;
>>>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>>>> +    unsigned long base_ipa = ipa;
>>>> +    unsigned long ipa_top = ipa + map_size;
>>>> +    int ret = 0;
>>>> +
>>>> +    if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
>>>> +            !IS_ALIGNED(ipa, map_size)))
>>>> +        return -EINVAL;
>>>> +
>>>> +    if (rmi_delegate_range(phys, map_size)) {
>>>> +        /*
>>>> +         * It's likely we raced with another VCPU on the same
>>>> +         * fault. Assume the other VCPU has handled the fault
>>>> +         * and return to the guest.
>>>> +         */
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    while (ipa < ipa_top) {
>>>> +        unsigned long flags = RMI_ADDR_TYPE_SINGLE;
>>>> +        unsigned long range_desc = addr_range_desc(phys, ipa_top -
>>>> ipa);
>>>> +        unsigned long out_top;
>>>> +
>>>> +        ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
>>>> +                       &out_top);
>>>> +
>>>> +        if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>>> +            /* Create missing RTTs and retry */
>>>> +            int level = RMI_RETURN_INDEX(ret);
>>>> +
>>>> +            WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>>>> +            ret = realm_create_rtt_levels(realm, ipa, level,
>>>> +                              KVM_PGTABLE_LAST_LEVEL,
>>>> +                              memcache);
>>
>> Could we give the RMM a chance to make use of the Block mappings by
>> creating the Missing RTTs to the level that may work for the current
>> range_desc ? i.e., if the range_desc is a 2M block size, we could create
>> tables upto L2 in the first go and if the RMM still needs RTT, we could
>> go further down to the KVM_PGTABLE_LAST_LEVEL. I understand this is
>> kind of an optimisation, so may be we could defer it. (Same applies for
>> the non_secure map below).
> 
> A simple change would be just to create one level at a time like this:
> 
> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
> index b79b96f7dffb..3f3ade1d3895 100644
> --- a/arch/arm64/kvm/rmi.c
> +++ b/arch/arm64/kvm/rmi.c
> @@ -767,15 +767,15 @@ static int realm_map_protected(struct kvm *kvm,
>   			/* Create missing RTTs and retry */
>   			int level = RMI_RETURN_INDEX(ret);
>   
> -			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
> +			if (WARN_ON(level >= KVM_PGTABLE_LAST_LEVEL))
> +				goto err_undelegate;
>   			ret = realm_create_rtt_levels(realm, ipa, level,
> -						      KVM_PGTABLE_LAST_LEVEL,
> +						      level + 1,
>   						      memcache);
>   			if (ret)
>   				goto err_undelegate;
>   
> -			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
> -					       range_desc, &out_top);
> +			continue;
>   		}

That looks good to me.

Cheers
Suzuki


>   
>   		if (WARN_ON(ret))
> 
> Thanks,
> Steve
> 


^ permalink raw reply

* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Sudeep Holla @ 2026-06-08 12:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Suzuki K Poulose
  Cc: linux-coco, Sudeep Holla, linux-arm-kernel, linux-kernel,
	Catalin Marinas, Greg KH, Jeremy Linton, Jonathan Cameron,
	Lorenzo Pieralisi, Mark Rutland, Will Deacon, Steven Price
In-Reply-To: <yq5aldcpz316.fsf@kernel.org>

On Mon, Jun 08, 2026 at 04:56:29PM +0530, Aneesh Kumar K.V wrote:
> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
> 
> > On 08/06/2026 09:19, Aneesh Kumar K.V wrote:
> >> Sudeep Holla <sudeep.holla@kernel.org> writes:
> >> 
> >>> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
> >>>> Sudeep Holla <sudeep.holla@kernel.org> writes:
> >>>>
> >>>> ...
> 
> ...
> 
> >>> I was trying to avoid conditional compilation altogether and hence the
> >>> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
> >>> in above snippet must come as some condition to this generic probe.
> >>>
> >>> Adding any more logic or callback defeats the bus idea here if we need
> >>> to rely/depend on multiple conditional compilation or callbacks IMO.
> >>>
> >>> Let's find see if it can work with what we are adding now and may add in
> >>> near future and then decide.
> >>>
> >> 
> >> If we move all the conditional checks to the driver probe path, then I
> >> think this can work. Something like the below:
> >> 
> >> struct smccc_device_info {
> >> 	u32 func_id;
> >> 	bool requires_smc;
> >> 	const char *device_name;
> >> };
> >> 
> >> static const struct smccc_device_info smccc_devices[] __initconst = {
> >> 	{
> >> 		.func_id        = ARM_SMCCC_TRNG_VERSION,
> >> 		.requires_smc   = false,
> >> 		.device_name    = "arm-smccc-trng",
> >> 	},
> >> 
> >> 	{
> >> 		.func_id        = RSI_ABI_VERSION,
> >
> > Don't we need parameters passed to this (Requested Interface version for 
> > e.g.) ? See more below.
> >
> 
> The idea is that we only check whether the function ID is supported. All
> other conditional logic should be handled in the driver probe path, as
> demonstrated by the changes in drivers/char/hw_random/arm_smccc_trng.c.
> 

+1. Yes, we just want to know whether the firmware is aware of that feature
before creating the `smccc_device` for it. The device probe can then perform a
more thorough, feature-specific check to determine whether the device/feature
is usable.

That is the main idea behind the approach I suggested. Please let me know if
you still see any issues or think this may not work.

-- 
Regards,
Sudeep

^ permalink raw reply

* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Aneesh Kumar K.V @ 2026-06-08 11:26 UTC (permalink / raw)
  To: Suzuki K Poulose, Sudeep Holla
  Cc: linux-coco, linux-arm-kernel, linux-kernel, Catalin Marinas,
	Greg KH, Jeremy Linton, Jonathan Cameron, Lorenzo Pieralisi,
	Mark Rutland, Will Deacon, Steven Price
In-Reply-To: <e0071e90-c2f5-4d15-b3c6-fe05bf1464e4@arm.com>

Suzuki K Poulose <suzuki.poulose@arm.com> writes:

> On 08/06/2026 09:19, Aneesh Kumar K.V wrote:
>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>> 
>>> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
>>>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>>>>
>>>> ...

...

>>> I was trying to avoid conditional compilation altogether and hence the
>>> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
>>> in above snippet must come as some condition to this generic probe.
>>>
>>> Adding any more logic or callback defeats the bus idea here if we need
>>> to rely/depend on multiple conditional compilation or callbacks IMO.
>>>
>>> Let's find see if it can work with what we are adding now and may add in
>>> near future and then decide.
>>>
>> 
>> If we move all the conditional checks to the driver probe path, then I
>> think this can work. Something like the below:
>> 
>> struct smccc_device_info {
>> 	u32 func_id;
>> 	bool requires_smc;
>> 	const char *device_name;
>> };
>> 
>> static const struct smccc_device_info smccc_devices[] __initconst = {
>> 	{
>> 		.func_id        = ARM_SMCCC_TRNG_VERSION,
>> 		.requires_smc   = false,
>> 		.device_name    = "arm-smccc-trng",
>> 	},
>> 
>> 	{
>> 		.func_id        = RSI_ABI_VERSION,
>
> Don't we need parameters passed to this (Requested Interface version for 
> e.g.) ? See more below.
>

The idea is that we only check whether the function ID is supported. All
other conditional logic should be handled in the driver probe path, as
demonstrated by the changes in drivers/char/hw_random/arm_smccc_trng.c.

>
>> 		.requires_smc   = true,
>> 		.device_name    = RSI_DEV_NAME,
>> 	},
>> };
>> 
>> static bool __init smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
>> {
>> 	unsigned long ret;
>> 	struct arm_smccc_res res;
>> 
>> 	if (smccc_conduit == SMCCC_CONDUIT_NONE)
>> 		return false;
>> 
>> 	if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
>> 		return false;
>> 
>> 	arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
>> 	ret = res.a0;
>> 
>> 	if ((s32)ret == SMCCC_RET_NOT_SUPPORTED)
>
> Is this a reliable check for all possible SMCCC services ? i.e., Are we 
> expected to get RET_NOT_SUPPORTED for any service for which the backend
> is not available ?
>

That is correct. That is what the SMCCC specification says

> Also, as pointed out RSI_ABI_VERSION may return other errors based on 
> the input (requested version, e.g., RSI_ERROR_INPUT) and we may still go
> ahead and register the device ?
>

Correct, and the driver probe will check for the minimum and maximum supported versions.

>> 		return false;
>> 
>> 	return true;
>> }
>> 
>> static int __init smccc_devices_init(void)
>> {
>> 	struct arm_smccc_device *sdev;
>> 	const struct smccc_device_info *smccc_dev;
>> 
>> 	for (int i = 0; i < ARRAY_SIZE(smccc_devices); i++) {
>> 		smccc_dev = &smccc_devices[i];
>> 
>> 		if (!smccc_probe_smccc_device(smccc_dev))
>> 			continue;
>> 
>>                 sdev = arm_smccc_device_register(smccc_dev->device_name);
>>                 if (IS_ERR(sdev))
>>                         pr_err("%s: could not register device: %ld\n",
>>                                smccc_dev->device_name, PTR_ERR(sdev));
>> 
>> 	}
>> 
>> 	return 0;
>> }
>> device_initcall(smccc_devices_init);
>> 
>> with the diff to hw_random/smccc_trng
>> 
>> modified   arch/arm64/include/asm/archrandom.h
>> @@ -12,7 +12,7 @@
>>   
>>   extern bool smccc_trng_available;
>>   
>> -static inline bool __init smccc_probe_trng(void)
>> +static inline bool smccc_probe_trng(void)
>>   {
>>   	struct arm_smccc_res res;
>>   
>> modified   drivers/char/hw_random/arm_smccc_trng.c
>> @@ -19,6 +19,8 @@
>>   #include <linux/arm-smccc.h>
>>   #include <linux/arm-smccc-bus.h>
>>   
>> +#include <asm/archrandom.h>
>> +
>>   #ifdef CONFIG_ARM64
>>   #define ARM_SMCCC_TRNG_RND	ARM_SMCCC_TRNG_RND64
>>   #define MAX_BITS_PER_CALL	(3 * 64UL)
>> @@ -98,6 +100,10 @@ static int smccc_trng_probe(struct arm_smccc_device *sdev)
>>   {
>>   	struct hwrng *trng;
>>   
>> +	/* validate the minimum version requirement */
>> +	if (!smccc_probe_trng())
>> +		return -ENODEV;
>> +
>>   	trng = devm_kzalloc(&sdev->dev, sizeof(*trng), GFP_KERNEL);
>>   	if (!trng)
>>   		return -ENOMEM;
>> 
>> We can also move arch/arm64/include/asm/rsi_smc.h to
>> include/linux/arm-rsi-smccc.h. There was a suggestion to move these
>
> super minor nit: arm-smccc-rsi.h ?
>

-aneesh

^ permalink raw reply

* Re: [PATCH v14 32/44] KVM: arm64: Handle Realm PSCI requests
From: Steven Price @ 2026-06-08 11:15 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <775a0d29-4d92-4ecc-96dd-5b0eaeff1528@redhat.com>

On 28/05/2026 07:55, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> The RMM needs to be informed of the target REC when a PSCI call is made
>> with an MPIDR argument.
>>
>> This requirement will be removed in a future release of the RMM 2.0
>> specification but is still required for v2.0-bet1.
>>
>> Co-developed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Chanegs since v13:
>>   * The ioctl KVM_ARM_VCPU_RMI_PSCI_COMPLETE has gone. The RMI call is
>>     made automatically just before entering the REC again.
>> Changes since v12:
>>   * Chance return code for non-realms to -ENXIO to better represent that
>>     the ioctl is invalid for non-realms (checkpatch is insistent that
>>     "ENOSYS means 'invalid syscall nr' and nothing else").
>> Changes since v11:
>>   * RMM->RMI renaming.
>> Changes since v6:
>>   * Use vcpu_is_rec() rather than kvm_is_realm(vcpu->kvm).
>>   * Minor renaming/formatting fixes.
>> ---
>>   arch/arm64/include/asm/kvm_rmi.h |  3 ++
>>   arch/arm64/kvm/psci.c            | 15 ++++++++-
>>   arch/arm64/kvm/rmi.c             | 58 ++++++++++++++++++++++++++++++++
>>   3 files changed, 75 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/
>> asm/kvm_rmi.h
>> index b65cfec10dee..eacf82a7467d 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -109,6 +109,9 @@ int realm_map_non_secure(struct realm *realm,
>>                unsigned long size,
>>                enum kvm_pgtable_prot prot,
>>                struct kvm_mmu_memory_cache *memcache);
>> +int realm_psci_complete(struct kvm_vcpu *source,
>> +            struct kvm_vcpu *target,
>> +            unsigned long status);
>>     static inline bool kvm_realm_is_private_address(struct realm *realm,
>>                           unsigned long addr)
>> diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
>> index 3b5dbe9a0a0e..a2cd55dc7b5b 100644
>> --- a/arch/arm64/kvm/psci.c
>> +++ b/arch/arm64/kvm/psci.c
>> @@ -103,7 +103,6 @@ static unsigned long kvm_psci_vcpu_on(struct
>> kvm_vcpu *source_vcpu)
>>         reset_state->reset = true;
>>       kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
>> -
> 
> This change isn't supposed to be part of this patch :-)

Whoops - indeed it isn't!

>>       /*
>>        * Make sure the reset request is observed if the RUNNABLE
>> mp_state is
>>        * observed.
>> @@ -142,6 +141,20 @@ static unsigned long
>> kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu)
>>       /* Ignore other bits of target affinity */
>>       target_affinity &= target_affinity_mask;
>>   +    if (vcpu_is_rec(vcpu)) {
>> +        struct kvm_vcpu *target_vcpu;
>> +
>> +        /* RMM supports only zero affinity level */
>> +        if (lowest_affinity_level != 0)
>> +            return PSCI_RET_INVALID_PARAMS;
>> +
>> +        target_vcpu = kvm_mpidr_to_vcpu(kvm, target_affinity);
>> +        if (!target_vcpu)
>> +            return PSCI_RET_INVALID_PARAMS;
>> +
>> +        return PSCI_RET_SUCCESS;
>> +    }
>> +
>>       /*
>>        * If one or more VCPU matching target affinity are running
>>        * then ON else OFF
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 761b38a4071c..2b03e962ee41 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -3,6 +3,7 @@
>>    * Copyright (C) 2023-2025 ARM Ltd.
>>    */
>>   +#include <uapi/linux/psci.h>
>>   #include <linux/kvm_host.h>
>>     #include <asm/kvm_emulate.h>
>> @@ -127,6 +128,25 @@ static void free_rtt(phys_addr_t phys)
>>       kvm_account_pgtable_pages(phys_to_virt(phys), -1);
>>   }
>>   +int realm_psci_complete(struct kvm_vcpu *source, struct kvm_vcpu
>> *target,
>> +            unsigned long status)
>> +{
>> +    int ret;
>> +
>> +    /*
>> +     * XXX: RMM-v2.0 doesn't require the target REC address for
>> completing
>> +     * PSCI requests. Temporary hack until RMM implementation catches up
>> +     * to the full spec.
>> +     */
>> +    ret = rmi_psci_complete(virt_to_phys(source->arch.rec.rec_page),
>> +                virt_to_phys(target->arch.rec.rec_page),
>> +                status);
>> +    if (ret)
>> +        return -EINVAL;
> 
>         return -ENXIO;

Ack, although as the comment says this should be going away.

Thanks,
Steve

>> +
>> +    return 0;
>> +}
>> +
>>   static int realm_rtt_create(struct realm *realm,
>>                   unsigned long addr,
>>                   int level,
>> @@ -1004,6 +1024,41 @@ static void kvm_complete_ripas_change(struct
>> kvm_vcpu *vcpu)
>>       rec->run->exit.ripas_base = base;
>>   }
>>   +static void kvm_rec_complete_psci(struct kvm_vcpu *vcpu)
>> +{
>> +    struct rec_run *run = vcpu->arch.rec.run;
>> +    unsigned long status = PSCI_RET_DENIED;
>> +    unsigned long ret = vcpu_get_reg(vcpu, 0);
>> +    struct kvm_vcpu *target;
>> +
>> +    switch (run->exit.gprs[0]) {
>> +    /*
>> +     * XXX: RMM-v2.0 doesn't cause RMI_EXIT_PSCI for AFFINITY_INFO
>> +     * Temporary hack until tf-RMM gets the REC to MPIDR mapping via
>> +     * RD Auxiliary granules.
>> +     * For now always report SUCCESS
>> +     */
>> +    case PSCI_0_2_FN64_AFFINITY_INFO:
>> +        status = PSCI_RET_SUCCESS;
>> +        break;
>> +    case PSCI_0_2_FN64_CPU_ON: {
>> +        if (ret != PSCI_RET_SUCCESS &&
>> +            ret != PSCI_RET_ALREADY_ON)
>> +            status = PSCI_RET_DENIED;
>> +        else
>> +            status = PSCI_RET_SUCCESS;
>> +        break;
>> +    }
>> +    default:
>> +        return;
>> +    }
>> +
>> +    target = kvm_mpidr_to_vcpu(vcpu->kvm, run->exit.gprs[1]);
>> +    /* RMM makes sure that we don't get RMI_EXIT_PSCI for invalid
>> mpidrs */
>> +    if (target)
>> +        realm_psci_complete(vcpu, target, status);
>> +}
>> +
>>   /*
>>    * kvm_rec_pre_enter - Complete operations before entering a REC
>>    *
>> @@ -1028,6 +1083,9 @@ int kvm_rec_pre_enter(struct kvm_vcpu *vcpu)
>>           for (int i = 0; i < REC_RUN_GPRS; i++)
>>               rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
>>           break;
>> +    case RMI_EXIT_PSCI:
>> +        kvm_rec_complete_psci(vcpu);
>> +        break;
>>       case RMI_EXIT_RIPAS_CHANGE:
>>           kvm_complete_ripas_change(vcpu);
>>           break;
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Steven Price @ 2026-06-08 10:56 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <d08ca4d6-1e7c-473d-909e-89642bd8c4c2@redhat.com>

On 05/06/2026 12:20, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> At runtime if the realm guest accesses memory which hasn't yet been
>> mapped then KVM needs to either populate the region or fault the guest.
>>
>> For memory in the lower (protected) region of IPA a fresh page is
>> provided to the RMM which will zero the contents. For memory in the
>> upper (shared) region of IPA, the memory from the memslot is mapped
>> into the realm VM non secure.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * Numerous changes due to rebasing.
>>   * Fix addr_range_desc() to encode the correct block size.
>> Changes since v12:
>>   * Switch to RMM v2.0 range based APIs.
>> Changes since v11:
>>   * Adapt to upstream changes.
>> Changes since v10:
>>   * RME->RMI renaming.
>>   * Adapt to upstream gmem changes.
>> Changes since v9:
>>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>>     may_block to avoid stall warnings.
>>   * Minor coding style fixes.
>> Changes since v8:
>>   * Propagate the may_block flag.
>>   * Minor comments and coding style changes.
>> Changes since v7:
>>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>>     internally WARN when necessary.
>> Changes since v6:
>>   * Handle PAGE_SIZE being larger than RMM granule size.
>>   * Some minor renaming following review comments.
>> Changes since v5:
>>   * Reduce use of struct page in preparation for supporting the RMM
>>     having a different page size to the host.
>>   * Handle a race when delegating a page where another CPU has faulted on
>>     a the same page (and already delegated the physical page) but not yet
>>     mapped it. In this case simply return to the guest to either use the
>>     mapping from the other CPU (or refault if the race is lost).
>>   * The changes to populate_par_region() are moved into the previous
>>     patch where they belong.
>> Changes since v4:
>>   * Code cleanup following review feedback.
>>   * Drop the PTE_SHARED bit when creating unprotected page table entries.
>>     This is now set by the RMM and the host has no control of it and the
>>     spec requires the bit to be set to zero.
>> Changes since v2:
>>   * Avoid leaking memory if failing to map it in the realm.
>>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>>   * Adapt to changes in previous patches.
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>>   4 files changed, 301 insertions(+), 20 deletions(-)
>>
> 
> [...]
> 
>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct
>> kvm_s2_fault_desc *s2fd)
>>       bool write_fault, exec_fault;
>>       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>> -    struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>> +    struct kvm_vcpu *vcpu = s2fd->vcpu;
>> +    struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>> +    gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>       unsigned long mmu_seq;
>>       struct page *page;
>> -    struct kvm *kvm = s2fd->vcpu->kvm;
>> +    struct kvm *kvm = vcpu->kvm;
>>       void *memcache;
>>       kvm_pfn_t pfn;
>>       gfn_t gfn;
>>       int ret;
>>   -    memcache = get_mmu_memcache(s2fd->vcpu);
>> -    ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>> +    if (kvm_is_realm(vcpu->kvm)) {
>> +        /* check for memory attribute mismatch */
>> +        bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
>> +        /*
>> +         * For Realms, the shared address is an alias of the private
>> +         * PA with the top bit set. Thus if the fault address matches
>> +         * the GPA then it is the private alias.
>> +         */
>> +        bool is_priv_fault = (gpa == s2fd->fault_ipa);
>> +
>> +        if (is_priv_gfn != is_priv_fault) {
>> +            kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>> +                              kvm_is_write_fault(vcpu),
>> +                              false,
>> +                              is_priv_fault);
>> +            /*
>> +             * KVM_EXIT_MEMORY_FAULT requires an return code of
>> +             * -EFAULT, see the API documentation
>> +             */
>> +            return -EFAULT;
>> +        }
>> +    }
>> +
> 
> For a Realm, gmem_abort() is called by kvm_handle_guest_abort() only when
> we're faulting in the private (protected) space.
> 
>     if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm,
> fault_ipa))
>         ret = gmem_abort(&s2fd);
>     else
>         ret = user_mem_abort(&s2fd);
> 
> With the condition, this block of code can be simplied to handle conversion
> (shared -> private) instead of both directions.
> 
>     /* Convert the shared address to the private adress for Realm */
>     if (kvm_is_realm(vcpu->kvm) &&
>         !kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT)) {
>         /*
>          * KVM_EXIT_MEMORY_FAULT requires an return code of
>          * -EFAULT, see the API documentation
>          */
>         kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>                                       kvm_is_write_fault(vcpu),
>                                       false, true);
>         return -EFAULT;
>     }
> 
> 
> [...]
> 
>> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>                   !write_fault &&
>>                   !kvm_vcpu_trap_is_exec_fault(vcpu));
>>   -        if (kvm_slot_has_gmem(memslot))
>> +        if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu-
>> >kvm, fault_ipa))
>>               ret = gmem_abort(&s2fd);
>>           else
>>               ret = user_mem_abort(&s2fd);
> gmem_abort() is only called for faults in the protected (private) space.

You're absolutely correct - that's a nice simplification!

Thanks,
Steve

^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Steven Price @ 2026-06-08 10:56 UTC (permalink / raw)
  To: Suzuki K Poulose, Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <cecbd148-5d33-49c8-928f-572f71b3dd69@arm.com>

On 08/06/2026 10:30, Suzuki K Poulose wrote:
> On 05/06/2026 07:23, Gavin Shan wrote:
>> Hi Steve,
>>
>> On 5/13/26 11:17 PM, Steven Price wrote:
>>> At runtime if the realm guest accesses memory which hasn't yet been
>>> mapped then KVM needs to either populate the region or fault the guest.
>>>
>>> For memory in the lower (protected) region of IPA a fresh page is
>>> provided to the RMM which will zero the contents. For memory in the
>>> upper (shared) region of IPA, the memory from the memslot is mapped
>>> into the realm VM non secure.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>> Changes since v13:
>>>   * Numerous changes due to rebasing.
>>>   * Fix addr_range_desc() to encode the correct block size.
>>> Changes since v12:
>>>   * Switch to RMM v2.0 range based APIs.
>>> Changes since v11:
>>>   * Adapt to upstream changes.
>>> Changes since v10:
>>>   * RME->RMI renaming.
>>>   * Adapt to upstream gmem changes.
>>> Changes since v9:
>>>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>>>     may_block to avoid stall warnings.
>>>   * Minor coding style fixes.
>>> Changes since v8:
>>>   * Propagate the may_block flag.
>>>   * Minor comments and coding style changes.
>>> Changes since v7:
>>>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>>>     internally WARN when necessary.
>>> Changes since v6:
>>>   * Handle PAGE_SIZE being larger than RMM granule size.
>>>   * Some minor renaming following review comments.
>>> Changes since v5:
>>>   * Reduce use of struct page in preparation for supporting the RMM
>>>     having a different page size to the host.
>>>   * Handle a race when delegating a page where another CPU has
>>> faulted on
>>>     a the same page (and already delegated the physical page) but not
>>> yet
>>>     mapped it. In this case simply return to the guest to either use the
>>>     mapping from the other CPU (or refault if the race is lost).
>>>   * The changes to populate_par_region() are moved into the previous
>>>     patch where they belong.
>>> Changes since v4:
>>>   * Code cleanup following review feedback.
>>>   * Drop the PTE_SHARED bit when creating unprotected page table
>>> entries.
>>>     This is now set by the RMM and the host has no control of it and the
>>>     spec requires the bit to be set to zero.
>>> Changes since v2:
>>>   * Avoid leaking memory if failing to map it in the realm.
>>>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>>>   * Adapt to changes in previous patches.
>>> ---
>>>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>>>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>>>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>>>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>>>   4 files changed, 301 insertions(+), 20 deletions(-)
>>>

[...]

>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index cae29fd3353c..761b38a4071c 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm,
>>> unsigned long ipa,
>>>       return ret;
>>>   }
>>> +static unsigned long addr_range_desc(unsigned long phys, unsigned
>>> long size)
>>> +{
>>> +    unsigned long out = 0;
>>> +
>>> +    switch (size) {
>>> +    case P4D_SIZE:
>>> +        out = 3 | (1 << 2);
>>> +        break;
>>> +    case PUD_SIZE:
>>> +        out = 2 | (1 << 2);
>>> +        break;
>>> +    case PMD_SIZE:
>>> +        out = 1 | (1 << 2);
>>> +        break;
>>> +    case PAGE_SIZE:
>>> +        out = 0 | (1 << 2);
>>> +        break;
>>> +    default:
>>> +        /*
>>> +         * Only support mapping at the page level granulatity when
>>> +         * it's an unusual length. This should get us back onto a
>>> larger
>>> +         * block size for the subsequent mappings.
>>> +         */
>>> +        out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
>>> +        break;
>>> +    }
>>> +
>>> +    WARN_ON(phys & ~PAGE_MASK);
>>> +
>>> +    out |= phys & PAGE_MASK;
>>> +
>>> +    return out;
>>> +}
>>> +
>>> +int realm_map_protected(struct kvm *kvm,
>>> +            unsigned long ipa,
>>> +            kvm_pfn_t pfn,
>>> +            unsigned long map_size,
>>> +            struct kvm_mmu_memory_cache *memcache)
>>> +{
>>> +    struct realm *realm = &kvm->arch.realm;
>>> +    phys_addr_t phys = __pfn_to_phys(pfn);
>>> +    phys_addr_t base_phys = phys;
>>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>>> +    unsigned long base_ipa = ipa;
>>> +    unsigned long ipa_top = ipa + map_size;
>>> +    int ret = 0;
>>> +
>>> +    if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
>>> +            !IS_ALIGNED(ipa, map_size)))
>>> +        return -EINVAL;
>>> +
>>> +    if (rmi_delegate_range(phys, map_size)) {
>>> +        /*
>>> +         * It's likely we raced with another VCPU on the same
>>> +         * fault. Assume the other VCPU has handled the fault
>>> +         * and return to the guest.
>>> +         */
>>> +        return 0;
>>> +    }
>>> +
>>> +    while (ipa < ipa_top) {
>>> +        unsigned long flags = RMI_ADDR_TYPE_SINGLE;
>>> +        unsigned long range_desc = addr_range_desc(phys, ipa_top -
>>> ipa);
>>> +        unsigned long out_top;
>>> +
>>> +        ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
>>> +                       &out_top);
>>> +
>>> +        if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>> +            /* Create missing RTTs and retry */
>>> +            int level = RMI_RETURN_INDEX(ret);
>>> +
>>> +            WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>>> +            ret = realm_create_rtt_levels(realm, ipa, level,
>>> +                              KVM_PGTABLE_LAST_LEVEL,
>>> +                              memcache);
> 
> Could we give the RMM a chance to make use of the Block mappings by
> creating the Missing RTTs to the level that may work for the current
> range_desc ? i.e., if the range_desc is a 2M block size, we could create
> tables upto L2 in the first go and if the RMM still needs RTT, we could
> go further down to the KVM_PGTABLE_LAST_LEVEL. I understand this is
> kind of an optimisation, so may be we could defer it. (Same applies for
> the non_secure map below).

A simple change would be just to create one level at a time like this:

diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
index b79b96f7dffb..3f3ade1d3895 100644
--- a/arch/arm64/kvm/rmi.c
+++ b/arch/arm64/kvm/rmi.c
@@ -767,15 +767,15 @@ static int realm_map_protected(struct kvm *kvm,
 			/* Create missing RTTs and retry */
 			int level = RMI_RETURN_INDEX(ret);
 
-			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
+			if (WARN_ON(level >= KVM_PGTABLE_LAST_LEVEL))
+				goto err_undelegate;
 			ret = realm_create_rtt_levels(realm, ipa, level,
-						      KVM_PGTABLE_LAST_LEVEL,
+						      level + 1,
 						      memcache);
 			if (ret)
 				goto err_undelegate;
 
-			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
-					       range_desc, &out_top);
+			continue;
 		}
 
 		if (WARN_ON(ret))

Thanks,
Steve


^ permalink raw reply related

* Re: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Marek Szyprowski @ 2026-06-08 10:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Aneesh Kumar K.V
  Cc: Michael Kelley, iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Robin Murphy, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Mostafa Saleh,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <20260604182419.GC2487554@ziepe.ca>

On 04.06.2026 20:24, Jason Gunthorpe wrote:
> On Thu, Jun 04, 2026 at 08:27:36PM +0530, Aneesh Kumar K.V wrote:
>> I already sent a v6 in the hope of getting this merged for the next
>> merge window. Should I send a v7, or would you prefer that I do the
>> rename on top of v6?
> I think it is too late for such a major change, but this should be
> imaginged to be for rc2ish next cycle. You also have to spell out how
> the pkvm patch will get sequenced as well, it would be best to push
> that it gets picked up right away.


I would like to give this a bit of time in linux-next so it is a bit too
late for v7.2, but before merging it I would also like to have this code
reviewed by someone with confidential computing knowledge.


Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply

* Re: [PATCH 04/15] x86/virt/tdx: Enable the Extensions right after basic TDX Module init
From: Xu Yilun @ 2026-06-08 10:12 UTC (permalink / raw)
  To: Kishen Maloor
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <f8b7aea2-228c-4e51-b79e-094da656f4c9@intel.com>

> > +static __init int init_tdx_ext(void)
> >   {
> >   	int ret;
> > @@ -1373,6 +1373,10 @@ static __init int init_tdx_module(void)
> >   	if (ret)
> >   		goto err_reset_pamts;
> > +	ret = init_tdx_ext();
> > +	if (ret)
> > +		goto err_reset_pamts;
> 
> Is it a reasonable policy to fail TDX bringup entirely upon failing
> initialization of extensions (which are "add-on features")?

mm.. I think TDX Extension is not strictly an add-on feature from OS
POV. It is still a fundamental TDX infrastructure. Host should not
look into the Module and create substates like Base-good-Extension-bad or
both-good. There are some considerations:

 - Extension cannot be explicitly configured by TDH.SYS.CONFIG, it is
   implicitly configured by TDX Module if an add-on feature requires it.

 - There is no enumeration of which SEAMCALLs are Extension-SEAMCALLs so
   Base-good-Extension-bad actually brings more chaos later.

So the series is making all effort to make TDX bringup a stateless
process, no intermediate state.

> 
> The handling of tdx_quote_init() in Patch 6 suggests a more
> best-effort approach.

TDX Quoting is however a clear self-contained add-on feature from OS POV.
Though I'm not sure if a TDX platform is still a safe TCB with DICE
available but failed, and good for "best-effort" policy? Maybe Peter
could answer.
> 

^ permalink raw reply

* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Yan Zhao @ 2026-06-08  9:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kiryl Shutsemau, Chao Gao, Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Huang, Kai, seanjc@google.com,
	mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <572868d7-4794-4fec-b80f-97d8434d5fb6@intel.com>

On Fri, Jun 05, 2026 at 09:23:21AM -0700, Dave Hansen wrote:
> On 6/5/26 04:42, Kiryl Shutsemau wrote:
> >>> I don't see a reason why we can't keep the scoped_guard() on get side.
> >> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
> >> with goto, which is discouraged. See [*]
> >>
> >>  :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
> >>  :that the “goto” statement can jump between scopes, the expectation is that
> >>  :usage of “goto” and cleanup helpers is never mixed in the same function.
> > Fair enough.
> > 
> > But it can also be address if we free the PAMT page array with the guard
> > too :P
> 
> How important is this patch? I see "Optimize" but I read "Optional".
This patch reduces the number of global pamt_lock acquisitions.

Reference testing data with/without the optimization:
(collected on my SPR test machine)

Booting/teardown of 1 TD (8 vcpus/8G memory) per iteration:
                |--------------|-------------|------------|
                |    avg (us)  |   max (us)  |   min (us) | 
                |  w/o  |  w/  |  w/o  | w/  | w/o  |  w/ |
----------------|-------|------|-------|-----|------|-----|
__tdx_pamt_get()|   2   |  0   |  578  | 505 |  2   |  0  |
__tdx_pamt_put()|   0   |  0   |  563  | 496 |  0   |  0  |
----------------|--------------|-------------|------------|

Boot/teardown of 5 TDs (each TD: 8 vcpus/8G memory) concurrently:
                |--------------|-------------|------------|
                |    avg (us)  |   max (us)  |   min (us) | 
                |  w/o  |  w/  |  w/o  | w/  | w/o  |  w/ |
----------------|-------|------|-------|-----|------|-----|
__tdx_pamt_get()|  15   |  0   |  1723 | 1386|  2   |  0  |
__tdx_pamt_put()|   0   |  0   |   562 |  733|  0   |  0  |
----------------|--------------|-------------|------------|


> If we're arguing about it, maybe we should just kick it out and focus on
> the more important bits.
DPAMT still works fine without this optimization. The optimization can reduce
the average time spent on the global lock, especially when there's high
contention.

^ permalink raw reply

* Re: [PATCH 02/15] x86/virt/tdx: Add extra memory to TDX Module for Extensions
From: Xu Yilun @ 2026-06-08  9:41 UTC (permalink / raw)
  To: Kishen Maloor
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <f44d997e-49fe-4d48-84e3-e260bb9d3164@intel.com>

> > +static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = to_hpa_list_info(root, nr_pages),
> > +	};
> > +	u64 r;
> > +
> > +	tdx_clflush_hpa_list(root, nr_pages);
> > +
> > +	do {
> > +		/*
> > +		 * TDH_EXT_MEM_ADD is designed to use output parameter RCX to
> > +		 * override/update input parameter RCX, so the caller doesn't
> > +		 * have to do manual parameter update on retry call.
> > +		 */
> > +		r = seamcall_ret(TDH_EXT_MEM_ADD, &args);
> > +	} while (r == TDX_INTERRUPTED_RESUMABLE);
> 
> The retry loop compares the full return value against TDX_INTERRUPTED_RESUMABLE. Should
> it mask with TDX_SEAMCALL_STATUS_MASK first, in case the module sets any
> lower detail bits?

mm.. there is an existing case for TDX_INTERRUPTED_RESUMABLE which
doesn't do the mask:

  err = tdh_phymem_cache_wb(resume);
  switch (err) {
	case TDX_INTERRUPTED_RESUMABLE:
		continue;

I believe we don't mask it. TDX_INTERRUPTED_RESUMABLE should not carry
any lower bits according to its bit definition, if it does it's a
problem we should not skip.

> 
> Ditto for TDH.EXT.INIT in patch 3.
> 
> > +
> > +	if (r != TDX_SUCCESS)
> > +		return -EFAULT;
> > +
> > +	return 0;
> > +}
> > +
> > +static int tdx_ext_mem_setup(void)
> > +{
> > +	unsigned int nr_pages;
> > +	struct page *page;
> > +	u64 *root;
> > +	unsigned int i;
> > +	int ret;
> > +
> > +	nr_pages = tdx_sysinfo.ext.memory_pool_required_pages;
> > +	/*
> > +	 * memory_pool_required_pages == 0 means no need to add pages,
> > +	 * skip the memory setup.
> > +	 */
> > +	if (!nr_pages)
> > +		return 0;
> > +
> > +	root = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > +	if (!root)
> > +		return -ENOMEM;
> > +
> > +	page = alloc_contig_pages(nr_pages, GFP_KERNEL, numa_mem_id(),
> > +				  &node_online_map);
> 
> The SEAMCALL takes a scatter list (HPA_LIST_INFO), so the module
> doesn't require contiguity. If the goal is just to avoid scattering
> pages across many 2MB regions, maybe dense, 2MB-aligned allocations should
> achieve that without a single pool-wide contiguous block.
> 
> > +	if (!page) {
> > +		ret = -ENOMEM;
> > +		goto out_free_root;
> > +	}
> > +
> > +	for (i = 0; i < nr_pages;) {
> > +		unsigned int nents = min(nr_pages - i,
> > +					 PAGE_SIZE / sizeof(*root));
> > +		int j;
> > +
> > +		for (j = 0; j < nents; j++)
> > +			root[j] = page_to_phys(page + i + j);
> 
> Would it be better to allocate per-batch (i.e. one root page's worth
> at a time) rather than the whole pool up front?
> 
> That way an intermediate TDH.EXT.MEM.ADD failure wouldn't leak
> all nr_pages. Also, a batch is up to 512 pages (= 2MB) and its allocation
> could be 2MB-aligned, addressing your fragmentation concern.

So IIUC allocating 2MB by 2MB has the pros:

  - Larger chance to get the memory.
  - Less memory waste when TDH.EXT.MEM.ADD failed.

and the cons:

  - Still fragment 4M & 1G memory region.


I think first of all we should focus on the normal path when Extension is
successfully initialized and memory is added, note these memory can
never be reclaimed in this case, so memory fragmentation becomes the
primary considration.

And in TDX platform, the TDH.EXT.MEM.ADD failure is not expected to
happen, which means the TDX module is buggy and from Confidential
Computing POV we should not continue, we should change to a new module
and reboot. So less memory waste doesn't matter much actually.

Then, the Extension initialization is done at bootup time. We can get
the memory in big chance. If we really can't, it is a signal that the
system is not well configured for TDX, and failing earlier isn't such
a bad thing to me.

So for now I still think alloc_contig_pages() is better than 2M-by-2M
allocation.

> 
> > +
> > +		ret = tdx_ext_mem_add(virt_to_page(root), nents);
> > +		/*
> > +		 * No SEAMCALLs to reclaim the added pages. For simple error
> > +		 * handling, leak all pages.
> > +		 */
> > +		WARN_ON_ONCE(ret);
> > +		if (ret)
> > +			break;
> > +
> > +		i += nents;
> > +	}
> > +
> > +	/*
> > +	 * Extensions memory can't be reclaimed once added, print out the
> > +	 * amount, stop tracking it and free the root page, no matter success
> > +	 * or failure.
> > +	 */
> > +	pr_info("%lu KB allocated for TDX Module Extensions\n",
> > +		nr_pages * PAGE_SIZE / 1024);
> > +
> > +out_free_root:
> > +	kfree(root);
> > +
> > +	return ret;
> > +}
> > +
> > +static int __maybe_unused init_tdx_ext(void)
> 
> Could this be named init_tdx_extensions() instead to disambiguate
> from tdx_ext_init() in patch 3?

Yes, good to me.

I'm changing all Extensions to Extension, cause the SPEC says "TDX
Module Extension". So I'll use init_tdx_extension().

Thanks,
Yilun

^ permalink raw reply

* Re: [PATCH v14 28/44] arm64: RMI: Create the realm descriptor
From: Steven Price @ 2026-06-08  9:56 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <80b6ab2d-1a3e-4c15-b06b-00aaa23fcf74@redhat.com>

On 28/05/2026 06:51, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> Creating a realm involves first creating a realm descriptor (RD). This
>> involves passing the configuration information to the RMM. Do this as
>> part of realm_ensure_created() so that the realm is created when it is
>> first needed.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * The RMM no longer uses AUX granules, so no need to ask it how many it
>>     needs.
>>   * Adapted to other changes.
>> Changes since v12:
>>   * Since RMM page size is now equal to the host's page size various
>>     calculations are simplified.
>>   * Switch to using range based APIs to delegate/undelegate.
>>   * VMID handling is now handled entirely by the RMM.
>> ---
>>   arch/arm64/kvm/rmi.c | 88 +++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index fb96bcaa73ed..cae29fd3353c 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -418,6 +418,77 @@ static void realm_unmap_shared_range(struct kvm
>> *kvm,
>>                    start, end);
>>   }
>>   +static int realm_create_rd(struct kvm *kvm)
>> +{
>> +    struct realm *realm = &kvm->arch.realm;
>> +    struct realm_params *params = realm->params;
>> +    void *rd = NULL;
>> +    phys_addr_t rd_phys, params_phys;
>> +    size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> +    int r;
>> +
>> +    realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +
>> +    if (WARN_ON(realm->rd || !realm->params))
>> +        return -EEXIST;
>> +
>> +    rd = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
>> +    if (!rd)
>> +        return -ENOMEM;
>> +
>> +    rd_phys = virt_to_phys(rd);
>> +    if (rmi_delegate_page(rd_phys)) {
>> +        r = -ENXIO;
>> +        goto free_rd;
>> +    }
>> +
>> +    if (rmi_delegate_range(kvm->arch.mmu.pgd_phys, pgd_size)) {
>> +        r = -ENXIO;
>> +        goto out_undelegate_tables;
>> +    }
>> +
>> +    params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +    params->rtt_level_start = get_start_level(realm);
>> +    params->rtt_num_start = pgd_size / PAGE_SIZE;
>> +    params->rtt_base = kvm->arch.mmu.pgd_phys;
>> +
>> +    if (kvm->arch.arm_pmu) {
>> +        params->pmu_num_ctrs = kvm->arch.nr_pmu_counters;
>> +        params->flags |= RMI_REALM_PARAM_FLAG_PMU;
>> +    }
>> +
>> +    if (kvm_lpa2_is_enabled())
>> +        params->flags |= RMI_REALM_PARAM_FLAG_LPA2;
>> +
>> +    params_phys = virt_to_phys(params);
>> +
>> +    if (rmi_realm_create(rd_phys, params_phys)) {
>> +        r = -ENXIO;
>> +        goto out_undelegate_tables;
>> +    }
>> +
>> +    realm->rd = rd;
>> +    kvm_set_realm_state(kvm, REALM_STATE_NEW);
>> +    /* The realm is up, free the parameters.  */
>> +    free_page((unsigned long)realm->params);
>> +    realm->params = NULL;
>> +
>> +    return 0;
>> +
>> +out_undelegate_tables:
>> +    if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys,
>> pgd_size))) {
>> +        /* Leak the pages if they cannot be returned */
>> +        kvm->arch.mmu.pgt = NULL;
>> +    }
> 
> In the latest RMM implementation (topics/rmm-v2.0-poc_2),
> rmi_delegate_range() works
> with the granularity of granule (4KB) and it can fail on any granule.
> For example,
> we have 16x granule as the root RTT and rmi_delegate_range() fails on
> the first
> granule, we're going to undelegate all these 16x granules, which were
> never delegated
> to RMM. It eventually leads to error and memory leakage.
> 
> For this, rmi_delegate_range() could be improved to return the number of
> granules that
> have been delegated. The return value can be used by the caller to
> handle the erroneous
> case by passing the correct range to rmi_undelegate_page().

Well spotted - yes the current situation where the entire region is
leaked if the delegate only partially completes is less than ideal! I'll
add a third argument to rmi_delegate_range() to return the top of the
region that was successfully delegated. The caller can then attempt an
undelegate on just the range which was delegated.

Thanks,
Steve

>> +    if (WARN_ON(rmi_undelegate_page(rd_phys))) {
>> +        /* Leak the page if it isn't returned */
>> +        return r;
>> +    }
>> +free_rd:
>> +    free_page((unsigned long)rd);
>> +    return r;
>> +}
>> +
>>   static void realm_unmap_private_range(struct kvm *kvm,
>>                         unsigned long start,
>>                         unsigned long end,
>> @@ -647,8 +718,21 @@ static int realm_init_ipa_state(struct kvm *kvm,
>>     static int realm_ensure_created(struct kvm *kvm)
>>   {
>> -    /* Provided in later patch */
>> -    return -ENXIO;
>> +    int ret;
>> +
>> +    switch (kvm_realm_state(kvm)) {
>> +    case REALM_STATE_NONE:
>> +        break;
>> +    case REALM_STATE_NEW:
>> +        return 0;
>> +    case REALM_STATE_DEAD:
>> +        return -ENXIO;
>> +    default:
>> +        return -EBUSY;
>> +    }
>> +
>> +    ret = realm_create_rd(kvm);
>> +    return ret;
>>   }
>>     static int set_ripas_of_protected_regions(struct kvm *kvm)
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 28/44] arm64: RMI: Create the realm descriptor
From: Steven Price @ 2026-06-08  9:49 UTC (permalink / raw)
  To: Wei-Lin Chang, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <wshgh4i6uqytq65rh3j2risam2y2evjnfyztoee46soemp5i4x@qhzj4lcs33yj>

On 26/05/2026 23:47, Wei-Lin Chang wrote:
> Hi,
> 
> On Wed, May 13, 2026 at 02:17:36PM +0100, Steven Price wrote:
>> Creating a realm involves first creating a realm descriptor (RD). This
>> involves passing the configuration information to the RMM. Do this as
>> part of realm_ensure_created() so that the realm is created when it is
>> first needed.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>  * The RMM no longer uses AUX granules, so no need to ask it how many it
>>    needs.
>>  * Adapted to other changes.
>> Changes since v12:
>>  * Since RMM page size is now equal to the host's page size various
>>    calculations are simplified.
>>  * Switch to using range based APIs to delegate/undelegate.
>>  * VMID handling is now handled entirely by the RMM.
>> ---
>>  arch/arm64/kvm/rmi.c | 88 +++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index fb96bcaa73ed..cae29fd3353c 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -418,6 +418,77 @@ static void realm_unmap_shared_range(struct kvm *kvm,
>>  			     start, end);
>>  }
>>  
>> +static int realm_create_rd(struct kvm *kvm)
>> +{
>> +	struct realm *realm = &kvm->arch.realm;
>> +	struct realm_params *params = realm->params;
>> +	void *rd = NULL;
>> +	phys_addr_t rd_phys, params_phys;
>> +	size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> +	int r;
>> +
>> +	realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +
>> +	if (WARN_ON(realm->rd || !realm->params))
>> +		return -EEXIST;
>> +
>> +	rd = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
>> +	if (!rd)
>> +		return -ENOMEM;
>> +
>> +	rd_phys = virt_to_phys(rd);
>> +	if (rmi_delegate_page(rd_phys)) {
>> +		r = -ENXIO;
>> +		goto free_rd;
>> +	}
>> +
>> +	if (rmi_delegate_range(kvm->arch.mmu.pgd_phys, pgd_size)) {
>> +		r = -ENXIO;
>> +		goto out_undelegate_tables;
>> +	}
>> +
>> +	params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +	params->rtt_level_start = get_start_level(realm);
>> +	params->rtt_num_start = pgd_size / PAGE_SIZE;
>> +	params->rtt_base = kvm->arch.mmu.pgd_phys;
>> +
>> +	if (kvm->arch.arm_pmu) {
>> +		params->pmu_num_ctrs = kvm->arch.nr_pmu_counters;
>> +		params->flags |= RMI_REALM_PARAM_FLAG_PMU;
>> +	}
>> +
>> +	if (kvm_lpa2_is_enabled())
>> +		params->flags |= RMI_REALM_PARAM_FLAG_LPA2;
>> +
>> +	params_phys = virt_to_phys(params);
>> +
>> +	if (rmi_realm_create(rd_phys, params_phys)) {
>> +		r = -ENXIO;
>> +		goto out_undelegate_tables;
>> +	}
>> +
>> +	realm->rd = rd;
>> +	kvm_set_realm_state(kvm, REALM_STATE_NEW);
>> +	/* The realm is up, free the parameters.  */
>> +	free_page((unsigned long)realm->params);
>> +	realm->params = NULL;
>> +
>> +	return 0;
>> +
>> +out_undelegate_tables:
>> +	if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size))) {
>> +		/* Leak the pages if they cannot be returned */
>> +		kvm->arch.mmu.pgt = NULL;
>> +	}
>> +	if (WARN_ON(rmi_undelegate_page(rd_phys))) {
>> +		/* Leak the page if it isn't returned */
>> +		return r;
>> +	}
>> +free_rd:
>> +	free_page((unsigned long)rd);
>> +	return r;
>> +}
>> +
>>  static void realm_unmap_private_range(struct kvm *kvm,
>>  				      unsigned long start,
>>  				      unsigned long end,
>> @@ -647,8 +718,21 @@ static int realm_init_ipa_state(struct kvm *kvm,
>>  
>>  static int realm_ensure_created(struct kvm *kvm)
>>  {
>> -	/* Provided in later patch */
>> -	return -ENXIO;
>> +	int ret;
>> +
>> +	switch (kvm_realm_state(kvm)) {
>> +	case REALM_STATE_NONE:
>> +		break;
>> +	case REALM_STATE_NEW:
>> +		return 0;
>> +	case REALM_STATE_DEAD:
>> +		return -ENXIO;
>> +	default:
>> +		return -EBUSY;
>> +	}
>> +
>> +	ret = realm_create_rd(kvm);
>> +	return ret;
>>  }
> 
> I think ret can be simplified out.
Indeed.

Thanks,
Steve

> Thanks,
> Wei-Lin Chang
> 
>>  
>>  static int set_ripas_of_protected_regions(struct kvm *kvm)
>> -- 
>> 2.43.0
>>


^ permalink raw reply

* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Suzuki K Poulose @ 2026-06-08  9:41 UTC (permalink / raw)
  To: Steven Price, Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <0c71b4b8-ad0b-4a24-9f4a-180b2aaacdb6@arm.com>

On 08/06/2026 10:36, Steven Price wrote:
> On 28/05/2026 06:30, Gavin Shan wrote:
>> Hi Steve,
>>
>> On 5/13/26 11:17 PM, Steven Price wrote:
>>> The VMM needs to populate the realm with some data before starting (e.g.
>>> a kernel and initrd). This is measured by the RMM and used as part of
>>> the attestation later on.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>

...

>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index a89873a5eb77..209087bcf399 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>>> unsigned long start,
>>>            realm_unmap_private_range(kvm, start, end, may_block);
>>>    }
>>>    +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>>> +                   kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
>>> +                   unsigned long flags)
>>> +{
>>> +    struct realm *realm = &kvm->arch.realm;
>>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>>> +    phys_addr_t dst_phys, src_phys;
>>> +    int ret;
>>> +
>>> +    dst_phys = __pfn_to_phys(dst_pfn);
>>> +    src_phys = __pfn_to_phys(src_pfn);
>>> +
>>> +    if (rmi_delegate_page(dst_phys))
>>> +        return -ENXIO;
>>> +
>>> +    ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>>> +    if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>> +        /* Create missing RTTs and retry */
>>> +        int level = RMI_RETURN_INDEX(ret);
>>> +
>>> +        KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
>>
>>          KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);
> 
> Ack.
> 

Thinking more about this, I guess a buggy VMM can trigger this
by populating twice ? (level == KVM_PGTABLE_LAST_LEVEL). So, we should
return the error back, than warning here and suppressing the error ?


Suzuki

^ permalink raw reply

* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Steven Price @ 2026-06-08  9:36 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <ea4be6c5-9506-4253-80c5-c76c9ac3b77d@redhat.com>

On 28/05/2026 06:30, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> The VMM needs to populate the realm with some data before starting (e.g.
>> a kernel and initrd). This is measured by the RMM and used as part of
>> the attestation later on.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * Rename realm_create_protected_data_page() to realm_data_map_init().
>> Changes since v12:
>>   * The ioctl now updates the structure with the amount populated rather
>>     than returning this through the ioctl return code.
>>   * Use the new RMM v2.0 range based RMI calls.
>>   * Adapt to upstream changes in kvm_gmem_populate().
>> Changes since v11:
>>   * The multiplex CAP is gone and there's a new ioctl which makes use of
>>     the generic kvm_gmem_populate() functionality.
>> Changes since v7:
>>   * Improve the error codes.
>>   * Other minor changes from review.
>> Changes since v6:
>>   * Handle host potentially having a larger page size than the RMM
>>     granule.
>>   * Drop historic "par" (protected address range) from
>>     populate_par_region() - it doesn't exist within the current
>>     architecture.
>>   * Add a cond_resched() call in kvm_populate_realm().
>> Changes since v5:
>>   * Refactor to use PFNs rather than tracking struct page in
>>     realm_create_protected_data_page().
>>   * Pull changes from a later patch (in the v5 series) for accessing
>>     pages from a guest memfd.
>>   * Do the populate in chunks to avoid holding locks for too long and
>>     triggering RCU stall warnings.
>> ---
>>   arch/arm64/include/asm/kvm_rmi.h |   4 ++
>>   arch/arm64/kvm/Kconfig           |   1 +
>>   arch/arm64/kvm/arm.c             |  13 ++++
>>   arch/arm64/kvm/rmi.c             | 106 +++++++++++++++++++++++++++++++
>>   4 files changed, 124 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/
>> asm/kvm_rmi.h
>> index 007249a13dbc..a2b6bc412a22 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -88,6 +88,10 @@ int kvm_rec_enter(struct kvm_vcpu *vcpu);
>>   int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
>>   int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>>   +struct kvm_arm_rmi_populate;
>> +
>> +int kvm_arm_rmi_populate(struct kvm *kvm,
>> +             struct kvm_arm_rmi_populate *arg);
>>   void kvm_realm_unmap_range(struct kvm *kvm,
>>                  unsigned long ipa,
>>                  unsigned long size,
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 4e16719fda22..d0cd011cf672 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -38,6 +38,7 @@ menuconfig KVM
>>       select GUEST_PERF_EVENTS if PERF_EVENTS
>>       select KVM_GUEST_MEMFD
>>       select KVM_GENERIC_MEMORY_ATTRIBUTES
>> +    select HAVE_KVM_ARCH_GMEM_POPULATE
>>       help
>>         Support hosting virtualized guest machines.
>>   diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index ed88a203b892..073ba9181da9 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -2131,6 +2131,19 @@ int kvm_arch_vm_ioctl(struct file *filp,
>> unsigned int ioctl, unsigned long arg)
>>               return -EFAULT;
>>           return kvm_vm_ioctl_get_reg_writable_masks(kvm, &range);
>>       }
>> +    case KVM_ARM_RMI_POPULATE: {
>> +        struct kvm_arm_rmi_populate req;
>> +        int ret;
>> +
>> +        if (!kvm_is_realm(kvm))
>> +            return -ENXIO;
>> +        if (copy_from_user(&req, argp, sizeof(req)))
>> +            return -EFAULT;
>> +        ret = kvm_arm_rmi_populate(kvm, &req);
>> +        if (copy_to_user(argp, &req, sizeof(req)))
>> +            return -EFAULT;
>> +        return ret;
>> +    }
> 
> s/return ret/return 0; The variable 'ret' can be dropped.

kvm_arm_rmi_populate() may return an error though. E.g. if the
"reserved" field is set then it's kvm_arm_rmi_populate() that detects
that and returns -EINVAL.

>>       default:
>>           return -EINVAL;
>>       }
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index a89873a5eb77..209087bcf399 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>> unsigned long start,
>>           realm_unmap_private_range(kvm, start, end, may_block);
>>   }
>>   +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>> +                   kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
>> +                   unsigned long flags)
>> +{
>> +    struct realm *realm = &kvm->arch.realm;
>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>> +    phys_addr_t dst_phys, src_phys;
>> +    int ret;
>> +
>> +    dst_phys = __pfn_to_phys(dst_pfn);
>> +    src_phys = __pfn_to_phys(src_pfn);
>> +
>> +    if (rmi_delegate_page(dst_phys))
>> +        return -ENXIO;
>> +
>> +    ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>> +    if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> +        /* Create missing RTTs and retry */
>> +        int level = RMI_RETURN_INDEX(ret);
>> +
>> +        KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
> 
>         KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);

Ack.

>> +        ret = realm_create_rtt_levels(realm, ipa, level,
>> +                          KVM_PGTABLE_LAST_LEVEL, NULL);
>> +        if (!ret) {
>> +            ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys,
>> +                            flags);
>> +        }
>> +    }
>> +
>> +    if (ret) {
>> +        if (WARN_ON(rmi_undelegate_page(dst_phys))) {
>> +            /* Undelegate failed, so we leak the page */
>> +            get_page(pfn_to_page(dst_pfn));
>> +        }
>> +    }
>> +
> 
>     if (ret && WARN_ON(rmi_undelegate_page(dst_phys)) {
>         /* Leak the page that fails to be undelegated */
>         get_page(pfn_to_page(dst_pfn));
>     }

Ack

>> +    return ret;
>> +}
>> +
>> +static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> +                  struct page *src_page, void *opaque)
>> +{
>> +    unsigned long data_flags = *(unsigned long *)opaque;
>> +    phys_addr_t ipa = gfn_to_gpa(gfn);
>> +
>> +    if (!src_page)
>> +        return -EOPNOTSUPP;
>> +
>> +    return realm_data_map_init(kvm, ipa, pfn, page_to_pfn(src_page),
>> +                   data_flags);
>> +}
>> +
>> +static long populate_region(struct kvm *kvm,
>> +                gfn_t base_gfn,
>> +                unsigned long pages,
>> +                u64 uaddr,
>> +                unsigned long data_flags)
>> +{
>> +    long ret = 0;
>> +
>> +    mutex_lock(&kvm->slots_lock);
>> +    ret = kvm_gmem_populate(kvm, base_gfn, u64_to_user_ptr(uaddr),
>> pages,
>> +                populate_region_cb, &data_flags);
>> +    mutex_unlock(&kvm->slots_lock);
>> +
>> +    return ret;
>> +}
>> +
>>   enum ripas_action {
>>       RIPAS_INIT,
>>       RIPAS_SET,
>> @@ -574,6 +643,43 @@ static int realm_ensure_created(struct kvm *kvm)
>>       return -ENXIO;
>>   }
>>   +int kvm_arm_rmi_populate(struct kvm *kvm,
>> +             struct kvm_arm_rmi_populate *args)
>> +{
>> +    unsigned long data_flags = 0;
>> +    unsigned long ipa_start = args->base;
>> +    unsigned long ipa_end = ipa_start + args->size;
>> +    long pages_populated;
>> +    int ret;
>> +
>> +    if (args->reserved ||
>> +        (args->flags & ~KVM_ARM_RMI_POPULATE_FLAGS_MEASURE) ||
>> +        !IS_ALIGNED(ipa_start, PAGE_SIZE) ||
>> +        !IS_ALIGNED(ipa_end, PAGE_SIZE) ||
>> +        !IS_ALIGNED(args->source_uaddr, PAGE_SIZE))
>> +        return -EINVAL;
>> +
> 
> There are more conditions missed here:
> 
>     args->size == 0, return 0;
>     args->base + args->size < args->base, return -EINVAL;  // wrapped range

Good catch. args->size == 0 can trigger a WARN_ON currently. I'll put
the "return 0" after the realm_ensure_created() call so the behaviour
matches.

I don't think the wrapped range is quite such a problem - but detecting
it and rejecting it early seems like a good idea.

>> +    ret = realm_ensure_created(kvm);
>> +    if (ret)
>> +        return ret;
>> +
>> +    if (args->flags & KVM_ARM_RMI_POPULATE_FLAGS_MEASURE)
>> +        data_flags |= RMI_MEASURE_CONTENT;
>> +
>> +    pages_populated = populate_region(kvm, gpa_to_gfn(ipa_start),
>> +                      args->size >> PAGE_SHIFT,
>> +                      args->source_uaddr, data_flags);
>> +
>> +    if (pages_populated < 0)
>> +        return pages_populated;
> 
> pages_populaged is 'unsigned long', this function returns a 'int' value.

pages_populated is *signed* long. This is handling an error code - so if
it's negative we expect the error code to be between -1 and -MAX_ERRNO
which should easily fit within the 'int' return.

For positive values we continue below (encoding the potentially larger
number in the args outputs) and return 0.

Thanks,
Steve

>> +
>> +    args->size -= pages_populated << PAGE_SHIFT;
>> +    args->source_uaddr += pages_populated << PAGE_SHIFT;
>> +    args->base += pages_populated << PAGE_SHIFT;
>> +
>> +    return 0;
>> +}
>> +
>>   static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu)
>>   {
>>       struct kvm *kvm = vcpu->kvm;
> 
> Thanks,
> Gavin
> 


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox