* [PATCH v9 6/6] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ashish Kalra @ 2026-06-24 21:59 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1782336473.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
Pages are converted from shared to private as SNP guests are launched.
This destroys exisiting RMPOPT optimizations in the regions where
pages are converted.
Conversely, guest pages are converted back to shared during SNP guest
termination and their region may become eligible for RMPOPT
optimization.
To take advantage of this, perform RMPOPT after guest termination.
Do it after a delay so that a single RMPOPT pass can be done if
multiple guests terminate in a short period of time.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/kvm/svm/sev.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e107f368ed2d..23e236b13ccd 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3005,6 +3005,16 @@ void sev_vm_destroy(struct kvm *kvm)
*/
if (snp_decommission_context(kvm))
return;
+
+ /*
+ * Perform RMP optimizations on memory freed by terminating
+ * guests. The scan is deferred, so it normally runs after
+ * sev_gmem_invalidate() has converted this guest's pages back to
+ * shared, and picks them up then. A very large guest whose
+ * conversion has not finished by then is picked up by a later
+ * teardown's scan.
+ */
+ snp_rmpopt_all_physmem();
} else {
sev_unbind_asid(kvm, sev->handle);
}
--
2.43.0
^ permalink raw reply related
* [PATCH v9 5/6] x86/sev: Add interface to re-enable RMP optimizations.
From: Ashish Kalra @ 2026-06-24 21:58 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1782336473.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
RMPOPT table is a per-CPU table which indicates if 1GB regions of
physical memory are entirely hypervisor-owned or not.
When performing host memory accesses in hypervisor mode as well as
non-SNP guest mode, the processor may consult the RMPOPT table to
potentially skip an RMP access and improve performance.
Normal guest events clear RMP optimizations: pages are converted from
shared to private as SNP guests are launched, and large pages are split
and collapsed during guest operation -- both clear the RMPOPT
optimizations for the affected 1GB regions. Conversely, guest pages are
converted back to shared during SNP guest termination, so those regions
may become eligible for RMPOPT optimization again.
Without some intervention, all RMP optimizations would eventually be
lost. Add an interface to re-optimize all of physical memory.
The interface uses mod_delayed_work() instead of queue_delayed_work()
so that the delay timer is reset on each call. This provides proper
batching semantics: re-optimization runs 10 seconds after the *last*
VM termination rather than after the first. mod_delayed_work() also
re-queues work that is already in-flight, so a re-scan request
during an active scan is not silently dropped.
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/include/asm/sev.h | 2 ++
arch/x86/virt/svm/sev.c | 15 +++++++++++++++
2 files changed, 17 insertions(+)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 440c813fedde..d40beafbebb6 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
__snp_leak_pages(pfn, pages, true);
}
int snp_prepare(void);
+void snp_rmpopt_all_physmem(void);
void snp_setup_rmpopt(void);
void snp_clear_rmpopt_capable(void);
void snp_disable_cpu_hotplug(void);
@@ -683,6 +684,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
static inline void kdump_sev_callback(void) { }
static inline void snp_fixup_e820_tables(void) {}
static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_rmpopt_all_physmem(void) {}
static inline void snp_setup_rmpopt(void) {}
static inline void snp_clear_rmpopt_capable(void) {}
static inline void snp_disable_cpu_hotplug(void) {}
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 5f99cbbc6cbd..4661e5271a2d 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -743,6 +743,21 @@ static void rmpopt_work_handler(struct work_struct *work)
free_cpumask_var(follower_mask);
}
+void snp_rmpopt_all_physmem(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_capable)
+ return;
+
+ guard(mutex)(&rmpopt_wq_mutex);
+
+ if (!rmpopt_wq)
+ return;
+
+ mod_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
+ msecs_to_jiffies(RMPOPT_WORK_TIMEOUT));
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_rmpopt_all_physmem, "kvm-amd");
+
void snp_setup_rmpopt(void)
{
u64 rmpopt_base;
--
2.43.0
^ permalink raw reply related
* [PATCH v9 4/6] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ashish Kalra @ 2026-06-24 21:57 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1782336473.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
When SEV-SNP is enabled, all writes to memory are checked to ensure
integrity of SNP guest memory. This imposes performance overhead on the
whole system.
RMPOPT is a new instruction that minimizes the performance overhead of
RMP checks on the hypervisor and on non-SNP guests by allowing RMP
checks to be skipped for 1GB regions of memory that are known not to
contain any SEV-SNP guest memory.
Add support for performing RMP optimizations asynchronously using a
dedicated workqueue.
Enable RMPOPT optimizations for up to 2TB of system RAM starting from
the lowest physical memory address aligned down to a 1GB boundary at
RMP initialization time. RMP checks can initially be skipped for 1GB
memory ranges that do not contain SEV-SNP guest memory (excluding
preassigned pages such as the RMP table and firmware pages). As SNP
guests are launched, RMPUPDATE will disable the corresponding RMPOPT
optimizations.
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/virt/svm/sev.c | 165 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 162 insertions(+), 3 deletions(-)
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 60984f76b4e9..5f99cbbc6cbd 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -19,6 +19,7 @@
#include <linux/iommu.h>
#include <linux/amd-iommu.h>
#include <linux/nospec.h>
+#include <linux/workqueue.h>
#include <asm/sev.h>
#include <asm/processor.h>
@@ -125,9 +126,20 @@ static void *rmp_bookkeeping __ro_after_init;
static u64 probed_rmp_base, probed_rmp_size;
static cpumask_var_t rmpopt_cpumask;
-static phys_addr_t rmpopt_pa_start;
+static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
static bool rmpopt_capable;
+enum rmpopt_function {
+ RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
+ RMPOPT_FUNC_REPORT_STATUS
+};
+
+#define RMPOPT_WORK_TIMEOUT 10000
+
+static struct workqueue_struct *rmpopt_wq;
+static struct delayed_work rmpopt_delayed_work;
+static DEFINE_MUTEX(rmpopt_wq_mutex);
+
static LIST_HEAD(snp_leaked_pages_list);
static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -571,13 +583,22 @@ static void rmpopt_cleanup(void)
{
int cpu;
+ guard(mutex)(&rmpopt_wq_mutex);
+
+ if (!rmpopt_wq)
+ return;
+
+ cancel_delayed_work_sync(&rmpopt_delayed_work);
+ destroy_workqueue(rmpopt_wq);
+
scoped_guard(cpus_read_lock) {
for_each_cpu(cpu, rmpopt_cpumask)
wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0);
}
free_cpumask_var(rmpopt_cpumask);
- rmpopt_pa_start = 0;
+ rmpopt_pa_start = rmpopt_pa_end = 0;
+ rmpopt_wq = NULL;
}
/*
@@ -627,6 +648,101 @@ void snp_clear_rmpopt_capable(void)
rmpopt_capable = false;
}
+/*
+ * RMPOPT: F2 0F 01 FC
+ * Input: RAX = system physical address (1GB aligned)
+ * RCX = operation type
+ * Output: CF set if the range was optimized
+ */
+static inline bool __rmpopt(u64 pa_start, u64 op_type)
+{
+ bool optimized;
+
+ asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
+ : "=@ccc" (optimized)
+ : "a" (pa_start), "c" (op_type)
+ : "memory", "cc");
+
+ return optimized;
+}
+
+static void rmpopt(u64 pa)
+{
+ u64 pa_start = ALIGN_DOWN(pa, SZ_1G);
+ u64 op_type = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+ __rmpopt(pa_start, op_type);
+}
+
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_smp(void *val)
+{
+ rmpopt((u64)val);
+}
+
+/*
+ * RMPOPT optimizations skip RMP checks at 1GB granularity if this
+ * range of memory does not contain any SNP guest memory.
+ */
+static void rmpopt_work_handler(struct work_struct *work)
+{
+ cpumask_var_t follower_mask;
+ phys_addr_t pa;
+ int this_cpu;
+
+ pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
+ rmpopt_pa_start, rmpopt_pa_end);
+
+ if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
+ return;
+
+ /*
+ * RMPOPT scans the RMP table, stores the result of the scan in the
+ * reserved processor memory. The RMP scan is the most expensive
+ * part. If a second RMPOPT occurs, it can skip the expensive scan
+ * if they can see a cached result in the reserved processor memory.
+ *
+ * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
+ * on every other primary thread. Followers are "designed to"
+ * skip the scan if they see the "cached" scan results.
+ *
+ * Pin the worker to the current CPU for the leader loop so that
+ * this_cpu remains valid and the RMPOPT instruction executes on
+ * the correct CPU. Use migrate_disable() rather than get_cpu() to
+ * prevent migration while still allowing preemption.
+ */
+ migrate_disable();
+ this_cpu = smp_processor_id();
+
+ cpumask_andnot(follower_mask, rmpopt_cpumask,
+ topology_sibling_cpumask(this_cpu));
+
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+ rmpopt(pa);
+ cond_resched();
+ }
+ migrate_enable();
+
+ /*
+ * Followers: run RMPOPT on remaining cores. CPUs cannot go offline
+ * while SNP is active, so the follower set stays valid across the
+ * scan and cpus_read_lock() is uncontended.
+ */
+ scoped_guard(cpus_read_lock) {
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+ on_each_cpu_mask(follower_mask, rmpopt_smp,
+ (void *)pa, true);
+
+ /* Give a chance for other threads to run */
+ cond_resched();
+ }
+ }
+
+ free_cpumask_var(follower_mask);
+}
+
void snp_setup_rmpopt(void)
{
u64 rmpopt_base;
@@ -635,14 +751,42 @@ void snp_setup_rmpopt(void)
if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_capable)
return;
+ guard(mutex)(&rmpopt_wq_mutex);
+
+ /*
+ * Guard against re-initialization. When SNP_SHUTDOWN_EX is issued
+ * with x86_snp_shutdown=0, snp_shutdown() is not called and
+ * rmpopt_cleanup() is skipped, but snp_initialized is still cleared.
+ * A subsequent __sev_snp_init_locked() would call snp_setup_rmpopt()
+ * again, leaking the existing workqueue, delayed work, and cpumask
+ * state.
+ */
+ if (rmpopt_wq)
+ return;
+
+ /*
+ * Create an RMPOPT-specific workqueue to avoid scheduling
+ * RMPOPT workitem on the global system workqueue.
+ */
+ rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
+ if (!rmpopt_wq) {
+ pr_err("Failed to allocate RMPOPT workqueue\n");
+ return;
+ }
+
+ INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
+
if (!zalloc_cpumask_var(&rmpopt_cpumask, GFP_KERNEL)) {
pr_err("Failed to allocate RMPOPT cpumask\n");
+ destroy_workqueue(rmpopt_wq);
+ rmpopt_wq = NULL;
return;
}
/*
* The RMPOPT_BASE MSR is per-core, so only one thread per core needs
- * to set up the RMPOPT_BASE MSR.
+ * to set up the RMPOPT_BASE MSR. Likewise, only one thread per core
+ * needs to issue the RMPOPT instruction.
*
* Note: only online primary threads are included. If a core's
* primary thread is offline, that core is not covered. CPU hotplug
@@ -665,6 +809,21 @@ void snp_setup_rmpopt(void)
for_each_cpu(cpu, rmpopt_cpumask)
wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
}
+
+ rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+
+ /* Limit memory scanning to 2TB of RAM */
+ if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T) {
+ pr_info("RMPOPT coverage limited to 2TB; memory above 0x%llx not optimized\n",
+ rmpopt_pa_start + SZ_2T);
+ rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
+ }
+
+ /*
+ * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
+ * optimizations on all physical memory.
+ */
+ queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
}
EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
--
2.43.0
^ permalink raw reply related
* [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Ashish Kalra @ 2026-06-24 21:56 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1782336473.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
While SNP is active, every memory write is checked against the RMP to
protect the integrity of SEV-SNP guest memory. By the SNP architecture
these checks cannot be disabled on a subset of CPUs: they are gated
per-core by SYSCFG[SNP_EN], which the SEV firmware requires to be set on
every present CPU before SNP initialization. A CPU that does not have
SNP_EN set and was not initialized via SNP_INIT performs no RMP checks at
all, so there is no valid configuration with SNP active and any CPU exempt
from RMP checks.
The firmware determines which CPUs are present from the processor and the
BIOS/UEFI configuration (e.g. SMT disabled in the BIOS) and enumerates
them at SNP init; it is not aware of the OS bringing CPUs online or
offline afterwards. A CPU brought online after SNP init was not
enumerated at SNP_INIT and does not have SNP_EN set, so writes from it are
not RMP-checked and could corrupt SEV-SNP guest memory, and there is no
way to keep work off such a CPU once it is online. OS CPU hotplug can thus
diverge from the firmware's expectations and break SNP. Disable CPU
hotplug while SNP is active.
Use cpu_hotplug_disable() at SNP init and cpu_hotplug_enable() only on the
full x86_snp_shutdown path; the legacy SNP_SHUTDOWN_EX path leaves SNP
active and must keep hotplug disabled. A flag in built-in SNP code keeps
the disable balanced across the teardown paths, re-init and kexec, and
survives a ccp module reload.
This also keeps the CPU set stable for the asynchronous RMPOPT scan added
later in this series, and ensures cpus_read_lock() in the scan is
uncontended.
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/include/asm/sev.h | 2 ++
arch/x86/virt/svm/sev.c | 30 ++++++++++++++++++++++++++++++
drivers/crypto/ccp/sev-dev.c | 3 +++
3 files changed, 35 insertions(+)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0243989f229b..440c813fedde 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -664,6 +664,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
int snp_prepare(void);
void snp_setup_rmpopt(void);
void snp_clear_rmpopt_capable(void);
+void snp_disable_cpu_hotplug(void);
void snp_shutdown(void);
#else
static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -684,6 +685,7 @@ static inline void snp_fixup_e820_tables(void) {}
static inline int snp_prepare(void) { return -ENODEV; }
static inline void snp_setup_rmpopt(void) {}
static inline void snp_clear_rmpopt_capable(void) {}
+static inline void snp_disable_cpu_hotplug(void) {}
static inline void snp_shutdown(void) {}
#endif
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index dab6e1c290bc..60984f76b4e9 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -133,6 +133,9 @@ static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
static unsigned long snp_nr_leaked_pages;
+/* Set while SNP has CPU hotplug disabled (kernel-lifetime; survives ccp reload). */
+static bool snp_cpu_hotplug_disabled;
+
#undef pr_fmt
#define pr_fmt(fmt) "SEV-SNP: " fmt
@@ -577,6 +580,22 @@ static void rmpopt_cleanup(void)
rmpopt_pa_start = 0;
}
+/*
+ * Disable CPU hotplug while SNP is active. Applied once per SNP-active
+ * window and balanced by cpu_hotplug_enable() in snp_shutdown().
+ * The legacy SNP_SHUTDOWN_EX path leaves SNP enabled without re-enabling
+ * hotplug, so a re-init while SNP is still active must not stack the
+ * disable count.
+ */
+void snp_disable_cpu_hotplug(void)
+{
+ if (!snp_cpu_hotplug_disabled) {
+ cpu_hotplug_disable();
+ snp_cpu_hotplug_disabled = true;
+ }
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_disable_cpu_hotplug, "ccp");
+
void snp_shutdown(void)
{
u64 syscfg;
@@ -587,6 +606,17 @@ void snp_shutdown(void)
rmpopt_cleanup();
+ /*
+ * Re-enable CPU hotplug now that SNP is fully shut down. Done here
+ * (x86_snp_shutdown path) only -- the legacy path leaves SNP active
+ * and must keep hotplug disabled. After rmpopt_cleanup() so the
+ * per-core RMPOPT_BASE MSRs are cleared with hotplug still disabled.
+ */
+ if (snp_cpu_hotplug_disabled) {
+ cpu_hotplug_enable();
+ snp_cpu_hotplug_disabled = false;
+ }
+
clear_rmp();
on_each_cpu(mfd_reconfigure, NULL, 1);
}
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 217b6b19802e..66475145b3fa 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1479,6 +1479,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+ /* Disable CPU hotplug while SNP is active (see snp_disable_cpu_hotplug). */
+ snp_disable_cpu_hotplug();
+
snp_setup_rmpopt();
sev->snp_initialized = true;
--
2.43.0
^ permalink raw reply related
* [PATCH v9 2/6] x86/sev: Initialize RMPOPT configuration MSRs
From: Ashish Kalra @ 2026-06-24 21:56 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1782336473.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
The new RMPOPT instruction helps manage per-CPU RMP optimization
structures inside the CPU. It takes a 1GB-aligned physical address
and either returns the status of the optimizations or tries to enable
the optimizations.
Per-CPU RMPOPT tables support at most 2 TB of addressable memory for
RMP optimizations.
Initialize the per-CPU RMPOPT table base to the starting physical
address. This enables RMP optimization for up to 2 TB of system RAM on
all CPUs.
Additionally, add support to setup and enable RMPOPT once SNP is
enabled and initialized.
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/coco/core.c | 2 +
arch/x86/include/asm/msr-index.h | 3 ++
arch/x86/include/asm/sev.h | 4 ++
arch/x86/virt/svm/sev.c | 70 ++++++++++++++++++++++++++++++++
drivers/crypto/ccp/sev-dev.c | 3 ++
5 files changed, 82 insertions(+)
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 989ca9f72ba3..f0ed6c62d86c 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -16,6 +16,7 @@
#include <asm/archrandom.h>
#include <asm/coco.h>
#include <asm/processor.h>
+#include <asm/sev.h>
enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
SYM_PIC_ALIAS(cc_vendor);
@@ -172,6 +173,7 @@ static void amd_cc_platform_clear(enum cc_attr attr)
switch (attr) {
case CC_ATTR_HOST_SEV_SNP:
cc_flags.host_sev_snp = 0;
+ snp_clear_rmpopt_capable();
break;
default:
break;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 86554de9a3f5..28540744f1eb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -761,6 +761,9 @@
#define MSR_AMD64_SEG_RMP_ENABLED_BIT 0
#define MSR_AMD64_SEG_RMP_ENABLED BIT_ULL(MSR_AMD64_SEG_RMP_ENABLED_BIT)
#define MSR_AMD64_RMP_SEGMENT_SHIFT(x) (((x) & GENMASK_ULL(13, 8)) >> 8)
+#define MSR_AMD64_RMPOPT_BASE 0xc0010139
+#define MSR_AMD64_RMPOPT_ENABLE_BIT 0
+#define MSR_AMD64_RMPOPT_ENABLE BIT_ULL(MSR_AMD64_RMPOPT_ENABLE_BIT)
#define MSR_SVSM_CAA 0xc001f000
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..0243989f229b 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,8 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
__snp_leak_pages(pfn, pages, true);
}
int snp_prepare(void);
+void snp_setup_rmpopt(void);
+void snp_clear_rmpopt_capable(void);
void snp_shutdown(void);
#else
static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -680,6 +682,8 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
static inline void kdump_sev_callback(void) { }
static inline void snp_fixup_e820_tables(void) {}
static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_setup_rmpopt(void) {}
+static inline void snp_clear_rmpopt_capable(void) {}
static inline void snp_shutdown(void) {}
#endif
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 8bcdce98f6dc..dab6e1c290bc 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -124,6 +124,10 @@ static void *rmp_bookkeeping __ro_after_init;
static u64 probed_rmp_base, probed_rmp_size;
+static cpumask_var_t rmpopt_cpumask;
+static phys_addr_t rmpopt_pa_start;
+static bool rmpopt_capable;
+
static LIST_HEAD(snp_leaked_pages_list);
static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -490,6 +494,11 @@ static bool __init setup_rmptable(void)
if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
if (!setup_segmented_rmptable())
return false;
+ /*
+ * RMPOPT requires a segmented RMP, so indicate that the
+ * system is capable of configuring and running RMPOPT.
+ */
+ rmpopt_capable = true;
} else {
if (!setup_contiguous_rmptable())
return false;
@@ -555,6 +564,19 @@ int snp_prepare(void)
}
EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
+static void rmpopt_cleanup(void)
+{
+ int cpu;
+
+ scoped_guard(cpus_read_lock) {
+ for_each_cpu(cpu, rmpopt_cpumask)
+ wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0);
+ }
+
+ free_cpumask_var(rmpopt_cpumask);
+ rmpopt_pa_start = 0;
+}
+
void snp_shutdown(void)
{
u64 syscfg;
@@ -563,11 +585,59 @@ void snp_shutdown(void)
if (syscfg & MSR_AMD64_SYSCFG_SNP_EN)
return;
+ rmpopt_cleanup();
+
clear_rmp();
on_each_cpu(mfd_reconfigure, NULL, 1);
}
EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
+void snp_clear_rmpopt_capable(void)
+{
+ rmpopt_capable = false;
+}
+
+void snp_setup_rmpopt(void)
+{
+ u64 rmpopt_base;
+ int cpu;
+
+ if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_capable)
+ return;
+
+ if (!zalloc_cpumask_var(&rmpopt_cpumask, GFP_KERNEL)) {
+ pr_err("Failed to allocate RMPOPT cpumask\n");
+ return;
+ }
+
+ /*
+ * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
+ * to set up the RMPOPT_BASE MSR.
+ *
+ * Note: only online primary threads are included. If a core's
+ * primary thread is offline, that core is not covered. CPU hotplug
+ * is not currently supported with SNP enabled.
+ */
+ scoped_guard(cpus_read_lock) {
+ for_each_online_cpu(cpu)
+ if (topology_is_primary_thread(cpu))
+ cpumask_set_cpu(cpu, rmpopt_cpumask);
+
+ rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+ rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
+
+ /*
+ * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
+ * for RMP optimizations. Initialize the per-CPU RMPOPT table base
+ * to the starting physical address to enable RMP optimizations for
+ * up to 2 TB of system RAM on all CPUs.
+ */
+ for_each_cpu(cpu, rmpopt_cpumask)
+ wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+ }
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
+
/*
* Do the necessary preparations which are verified by the firmware as
* described in the SNP_INIT_EX firmware command description in the SNP
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 78f98aee7a66..217b6b19802e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1478,6 +1478,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
}
snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+
+ snp_setup_rmpopt();
+
sev->snp_initialized = true;
dev_dbg(sev->dev, "SEV-SNP firmware initialized, SEV-TIO is %s\n",
data.tio_en ? "enabled" : "disabled");
--
2.43.0
^ permalink raw reply related
* [PATCH v9 1/6] x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
From: Ashish Kalra @ 2026-06-24 21:56 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1782336473.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
Add a flag indicating whether RMPOPT instruction is supported.
RMPOPT is a new instruction that reduces the performance overhead of
RMP checks for the hypervisor and non-SNP guests by allowing those
checks to be skipped when 1-GB memory regions are known to contain no
SEV-SNP guest memory.
For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/kernel/cpu/scattered.c | 1 +
tools/arch/x86/include/asm/cpufeatures.h | 2 +-
3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 1d506e5d6f46..794cc96b8493 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
#define X86_FEATURE_K8 ( 3*32+ 4) /* Opteron, Athlon64 */
#define X86_FEATURE_ZEN5 ( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
#define X86_FEATURE_ZEN6 ( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT ( 3*32+ 7) /* Support for AMD RMPOPT instruction */
#define X86_FEATURE_CONSTANT_TSC ( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
#define X86_FEATURE_UP ( 3*32+ 9) /* "up" SMP kernel running on UP */
#define X86_FEATURE_ART ( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 937129ce6a96..021c0bf22de2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -67,6 +67,7 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_V2, CPUID_EAX, 1, 0x80000022, 0 },
{ X86_FEATURE_AMD_LBR_PMC_FREEZE, CPUID_EAX, 2, 0x80000022, 0 },
+ { X86_FEATURE_RMPOPT, CPUID_EDX, 0, 0x80000025, 0 },
{ X86_FEATURE_AMD_HTR_CORES, CPUID_EAX, 30, 0x80000026, 0 },
{ 0, 0, 0, 0, 0 }
};
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 86d17b195e79..7ce681af1dd7 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
#define X86_FEATURE_K8 ( 3*32+ 4) /* Opteron, Athlon64 */
#define X86_FEATURE_ZEN5 ( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
#define X86_FEATURE_ZEN6 ( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT ( 3*32+ 7) /* Support for AMD RMPOPT instruction */
#define X86_FEATURE_CONSTANT_TSC ( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
#define X86_FEATURE_UP ( 3*32+ 9) /* "up" SMP kernel running on UP */
#define X86_FEATURE_ART ( 3*32+10) /* "art" Always running timer (ART) */
--
2.43.0
^ permalink raw reply related
* [PATCH v9 0/6] Add RMPOPT support.
From: Ashish Kalra @ 2026-06-24 21:55 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
From: Ashish Kalra <ashish.kalra@amd.com>
In the SEV-SNP architecture, hypervisor and non-SNP guests are subject
to RMP checks on writes to provide integrity of SEV-SNP guest memory.
The RMPOPT architecture enables optimizations whereby the RMP checks
can be skipped if 1GB regions of memory are known to not contain any
SNP guest memory.
RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks for the hypervisor and non-SNP guests.
RMPOPT instruction currently supports two functions. In case of the
verify and report status function the CPU will read the RMP contents,
verify the entire 1GB region starting at the provided SPA is HV-owned.
For the entire 1GB region it checks that all RMP entries in this region
are HV-owned (i.e, not in assigned state) and then accordingly updates
the RMPOPT table to indicate if optimization has been enabled and
provide indication to software if the optimization was successful.
In case of report status function, the CPU returns the optimization
status for the 1GB region.
The RMPOPT table is managed by a combination of software and hardware.
Software uses the RMPOPT instruction to set bits in the table,
indicating that regions of memory are entirely HV-owned. Hardware
automatically clears bits in the RMPOPT table when RMP contents are
changed during RMPUPDATE instruction.
For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
As SNP is enabled by default the hypervisor and non-SNP guests are
subject to RMP write checks to provide integrity of SNP guest memory.
This patch-series adds support to enable RMP optimizations for up to
2TB of system RAM across the system and allow RMPUPDATE to disable
those optimizations as SNP guests are launched.
Support for RAM larger than 2 TB will be added in follow-on series.
This series also adds support to disable CPU hotplug while SNP is
active, as the SEV firmware enumerates CPUs at SNP initialization and is
not aware of the OS bringing CPUs online or offline afterwards. This
also keeps the set of CPUs stable for the asynchronous RMPOPT scan, so
the per-core RMPOPT_BASE MSRs programmed during setup remain valid.
This series also introduces support to re-enable RMP optimizations
during SNP guest termination, after guest pages have been converted
back to shared.
RMP optimizations are performed asynchronously by queuing work on a
dedicated workqueue after a 10 second delay.
Delaying work allows batching of multiple SNP guest terminations.
Once 1GB hugetlb guest_memfd support is merged, support for
re-enabling RMPOPT optimizations during 1GB page cleanup will be added
in follow-on series.
v9:
- Rename rmpopt_configured to rmpopt_capable.
- Make rmpopt_cpumask a cpumask_var_t (allocated/freed at setup/cleanup)
instead of a static cpumask_t.
- Drop the v8 WARN_ON_ONCE() on the RMPOPT_BASE writes; use a plain
wrmsrq_on_cpu(), matching the SNP MSR-write convention in this file.
- Disable CPU hotplug with cpu_hotplug_disable()/cpu_hotplug_enable()
(per tglx); re-enable only on the full x86_snp_shutdown path.
- Simplify rmpopt_work_handler() to a single leader-then-followers path:
with CPU hotplug disabled while SNP is active and snp_prepare()
requiring all CPUs online when RMPOPT_BASE is programmed, every core is
always programmed, so the explicit-leader fallback is now unreachable.
Drop it along with the v8 work_on_cpu()/rmpopt_leader_fn() helper.
- Drop the debugfs interface (was patch 7/7) and its report-only
plumbing; observability will be revisited after this series is merged.
- Restrict snp_rmpopt_all_physmem()'s export to the kvm-amd module.
- Use scoped_guard(cpus_read_lock) for the per-CPU MSR and follower
loops.
Sashiko AI upstream review identified several of the above issues.
v8:
- Add a new patch to disable CPU hotplug while SNP is active, keeping
the CPU set stable for the RMPOPT work handler.
- Drop the setup_clear_cpu_cap(X86_FEATURE_RMPOPT) calls; the
rmpopt_configured bool is the runtime guard.
- WARN_ON_ONCE() on the RMPOPT_BASE MSR writes that previously ignored
their return value.
- Simplify rmpopt_work_handler() by removing the explicit-leader
fallback: with CPU hotplug disabled while SNP is active and
snp_prepare() requiring all CPUs online when RMPOPT_BASE is programmed,
every core is always programmed, so the running CPU can always be the
leader. This drops the smp_call_function_single() fallback (and with
it the AB-BA deadlock and IRQ-latency concerns) and collapses the
leader selection into a single leader-then-followers path.
- Use mod_delayed_work() in snp_rmpopt_all_physmem() so the batching
delay tracks the last SNP guest termination.
Sashiko AI code review identified several of the above issues.
v7:
- Sync tools/arch/x86/include/asm/cpufeatures.h to mirror the kernel
header for X86_FEATURE_RMPOPT.
- Fix commit title to use X86_FEATURE_RMPOPT to match the code
(was X86_FEATURE_AMD_RMPOPT).
- Add static bool rmpopt_configured, set only when segmented RMP setup
succeeds in setup_rmptable(). Check rmpopt_configured alongside
cpu_feature_enabled(X86_FEATURE_RMPOPT) in snp_setup_rmpopt() and
snp_rmpopt_all_physmem(), because setup_clear_cpu_cap() is unreliable
after alternatives are patched. Add snp_clear_rmpopt_configured()
called from amd_cc_platform_clear() when CC_ATTR_HOST_SEV_SNP is
cleared. Do not use __ro_after_init on rmpopt_configured since the
writer snp_clear_rmpopt_configured() is not __init.
- Add cond_resched() to all three leader loops in rmpopt_work_handler()
to prevent soft lockups on systems with up to 2TB of RAM.
- Add comment above __rmpopt() documenting the RMPOPT instruction
encoding (F2 0F 01 FC) and register interface (RAX = system physical
address input, RCX = operation type input, RFLAGS.CF = output).
Note: RMPOPT does not modify RAX unlike PVALIDATE/RMPUPDATE, so
the existing "a" (input-only) constraint is correct.
Sashiko AI code review identified several of the above issues.
v6:
- Drop wrmsrq_on_cpus() helper; use for_each_cpu() with wrmsrq_on_cpu()
instead, as RMPOPT_BASE MSR programming is not performance-critical.
- Rewrite rmpopt_work_handler() leader selection to use a local
follower_mask copy instead of modifying the global rmpopt_cpumask.
This eliminates the current_cpu_cleared tracking and the restore at
the end, and removes the need for synchronization comments about
transient cpumask inconsistency.
- Add three-way leader selection in rmpopt_work_handler():
1. Current CPU is a primary thread in cpumask: run leader locally.
2. Current CPU is a sibling thread whose primary is in cpumask:
run leader locally (RMPOPT_BASE MSR is per-core), remove the
primary from followers via cpumask_andnot(topology_sibling_cpumask).
3. Current CPU's core has no RMPOPT_BASE MSR programmed: pick an
explicit leader via cpumask_first() + smp_call_function_single()
to avoid #UD, with cpus_read_lock() around the IPI loop.
- Add WARN_ON_ONCE guard for empty cpumask in the explicit leader
fallback path, with migrate_enable() before goto out.
- Add .llseek = seq_lseek to rmpopt_table_fops for consistency with
other seq_file-based debugfs files and to support tools like "less".
- Change debugfs file permissions from 0444 to 0400 to restrict access
to root only.
- Add comment in rmpopt_table_seq_show() explaining why cpu_online_mask
is safe: RMPOPT_BASE MSR is per-core and snp_prepare() ensures all
CPUs are online when the MSR is programmed.
Sashiko AI code review identified several of the above issues.
v5:
- Introduce rmpopt_cleanup() to tear down workqueue, debugfs, cpumask,
and MSR state, called from snp_shutdown().
- Introduce rmpopt_wq_mutex to serialize snp_setup_rmpopt(),
snp_rmpopt_all_physmem(), and rmpopt_cleanup().
- Introduce rmpopt_show_mutex to serialize debugfs reporting of
rmpopt_report_cpumask.
- Move snp_rmpopt_all_physmem() call after SNP DECOMMISSION during
guest shutdown.
- Use migrate_disable()/migrate_enable() for CPU pinning in the
rmpopt_work_handler() leader loop to maintain CPU affinity without
disabling preemption for the entire RMPOPT scan.
- Add cpus_read_lock()/cpus_read_unlock() around the follower
on_each_cpu_mask() loop in rmpopt_work_handler().
- Guard snp_setup_rmpopt() against re-initialization when
SNP_SHUTDOWN_EX with x86_snp_shutdown=0 skips rmpopt_cleanup()
but clears snp_initialized, preventing workqueue and resource
leaks on repeated init/shutdown cycles.
- Replace setup_clear_cpu_cap() with pr_err() on alloc_workqueue()
failure in snp_setup_rmpopt(), as setup_clear_cpu_cap() cannot be
used after alternatives are patched; callers check rmpopt_wq != NULL
as the runtime guard instead.
- Add pr_info() when RMPOPT coverage is capped at 2TB.
- Add comments noting CPU hotplug is not supported with SNP enabled
and only online primary threads are covered by rmpopt_cpumask.
- Add comment in setup_rmptable() noting Segmented RMP must be
enabled to enable RMPOPT.
- Simplify cpumask setup loop to set if primary thread rather than
skip if not primary.
- Improve grammar and clarity in snp_setup_rmpopt() comments.
- Added Reviewed-by's.
Sashiko AI code review identified several of the above issues.
v4:
- Add new wrmsrq_on_cpus() helper to write same u64 value to a
per-CPU MSR across a cpumask without per-cpu struct allocation
overhead.
- Rename configure_and_enable_rmpopt() to snp_setup_rmpopt().
- Use wrmsrq_on_cpus() instead of wrmsrq_on_cpu() loop for
programming RMPOPT_BASE MSRs.
- Add setup_clear_cpu_cap(X86_FEATURE_RMPOPT) if segmented RMP
setup fails or workqueue allocation fails.
- Add X86_FEATURE_RMPOPT feature clear logic in amd_cc_platform_clear()
for CC_ATTR_HOST_SEV_SNP.
- All of the above allow checking for only X86_FEATURE_RMPOPT for both
RMPOPT setup/enable and RMP re-optimizations.
- Rename snp_perform_rmp_optimization() to snp_rmpopt_all_physmem().
- Split rmpopt() into rmpopt() and rmpopt_smp() for SMP callback use.
- Introduce separate rmpopt_report_cpumask for debugfs reporting,
distinct from rmpopt_cpumask used for primary thread tracking.
- Remove snp_perform_rmp_optimization() call from __sev_snp_init_locked()
and instead setup and enable RMPOPT after SNP is enabled and
initialized.
v3:
- Drop all RMPOPT kthread support and introduce adding custom and
dedicated workqueue to schedule delayed and asynchronous RMPOPT work.
- Drop the guest_memfd inode cleanup interface and add support to
re-enable RMP optimizations during guest shutdown using the
asynchronous and delayed workqueue interface.
- Introduce new __rmpopt() helper and rmpopt() and
rmpopt_report_status() wrappers on top which use rax and rcx
parameters to closely match RMPOPT specs.
- Use new optimized RMPOPT loop to issue RMPOPT instructions on all
system RAM upto 2TB and all CPUs, by optimizing each range on one CPU
first, then let other CPUs execute RMPOPT in parallel so they can skip
most work as the range has already been optimized.
- Also add support for running the optimized RMPOPT loop only on
one thread per core.
- Replace all PUD_SIZE references with SZ_1G to conform to 1GB regions
as specified by RMPOPT specifications and not be dependent on PUD_SIZE
which makes the RMPOPT patch-set independent of x86 page table sizes.
- Use wrmsrq_on_cpu() to program the RMPOPT_BASE MSR registers on
all CPUs that removes all ugly casting to use on_each_cpu_mask().
- Fix inline commits and patch commit messages
v2:
- Drop all NUMA and Socket configuration and enablement support and
enable RMPOPT support for up to 2TB of system RAM.
- Drop get_cpumask_of_primary_threads() and enable per-core RMPOPT
base MSRs and issue RMPOPT instruction on all CPUs.
- Drop the configfs interface to manually re-enable RMP optimizations.
- Add new guest_memfd cleanup interface to automatically re-enable
RMP optimizations during guest shutdown.
- Include references to the public RMPOPT documentation.
- Move debugfs directory for RMPOPT under architecuture specific
parent directory.
Ashish Kalra (6):
x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
x86/sev: Initialize RMPOPT configuration MSRs
x86/sev: Disable CPU hotplug while SNP is active
x86/sev: Add support to perform RMP optimizations asynchronously
x86/sev: Add interface to re-enable RMP optimizations.
KVM: SEV: Perform RMP optimizations on SNP guest shutdown
arch/x86/coco/core.c | 2 +
arch/x86/include/asm/cpufeatures.h | 2 +-
arch/x86/include/asm/msr-index.h | 3 +
arch/x86/include/asm/sev.h | 8 +
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kvm/svm/sev.c | 10 +
arch/x86/virt/svm/sev.c | 274 +++++++++++++++++++++++
drivers/crypto/ccp/sev-dev.c | 6 +
tools/arch/x86/include/asm/cpufeatures.h | 2 +-
9 files changed, 306 insertions(+), 2 deletions(-)
--
2.43.0
^ permalink raw reply
* Re: [PATCH v8 13/46] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-06-24 21:10 UTC (permalink / raw)
To: Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <ede86ac4-d560-49a6-82d6-b33ac5fc9355@linux.intel.com>
Binbin Wu <binbin.wu@linux.intel.com> writes:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
>> just updates attributes tracked by guest_memfd.
>>
>> Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
>> by making sure requested attributes are supported for this instance of kvm.
>>
>> A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
>> KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
>> details to userspace. This will be used in a later patch.
>>
>> The two ioctls use their corresponding structs with no overlap, but
>> backward compatibility is baked in for future support of
>> KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
>> ioctl.
>>
>> The process of setting memory attributes is set up such that the later half
>> will not fail due to allocation. Any necessary checks are performed before
>> the point of no return.
>>
>> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
>> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
>> Co-developed-by: Sean Christoperson <seanjc@google.com>
>> Signed-off-by: Sean Christoperson <seanjc@google.com>
>
> s/Christoperson /Christopherson
>
Thanks!
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>> include/uapi/linux/kvm.h | 13 ++++++
>> virt/kvm/Kconfig | 1 +
>> virt/kvm/guest_memfd.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++
>> virt/kvm/kvm_main.c | 12 +++++
>> 4 files changed, 142 insertions(+)
>>
>>
>
> [...]
>
>> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
>> index 297e4399fbd49..cfa2c78ba5fb9 100644
>> --- a/virt/kvm/Kconfig
>> +++ b/virt/kvm/Kconfig
>> @@ -102,6 +102,7 @@ config KVM_MMU_LOCKLESS_AGING
>>
>> config KVM_GUEST_MEMFD
>> select XARRAY_MULTI
>> + select KVM_MEMORY_ATTRIBUTES
>
> What's this?
> This config is gone.
>
I'm surprised this compiles... I'll fix it, thanks!
>> bool
>>
^ permalink raw reply
* Re: SVSM Development Call June 24th, 2026
From: Jörg Rödel @ 2026-06-24 21:08 UTC (permalink / raw)
To: coconut-svsm, linux-coco
In-Reply-To: <ajrYD8ojVpEiQ20D@8bytes.org>
Meeting minutes are here:
https://github.com/coconut-svsm/governance/pull/114
-Joerg
^ permalink raw reply
* Re: [PATCH v8 13/46] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-06-24 21:03 UTC (permalink / raw)
To: Fuad Tabba, Sean Christopherson
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTwLPCvZJgPv=8u3pgp+kwEwQbsXn_13FL3xUbJ7HRfXzw@mail.gmail.com>
Fuad Tabba <fuad.tabba@linux.dev> writes:
>
> [...snip...]
>
>> >
>> > Note sure if it's user error on my part, if I'm applying this to the
>> > wrong base, but I found a build break here on patch 13:
>> > kvm_gmem_invalidate_start() doesn't exist in the base tree. The
>> > function is kvm_gmem_invalidate_begin() here. The rename
>> > (190cc5370a8b6) landed via a different merge path and isn't an
>> > ancestor of the stated base.
>> >
>> > Patches 19 and 20 have the same mismatch. Fix for all three is
>> > s/kvm_gmem_invalidate_start/kvm_gmem_invalidate_begin/.
I took Sean's patches (off-list) and tried to combine it onto my
existing state. (I'm using b4 [1] to manage these series and I didn't
know I had to manually update the base-commit. Will try again next
revision.
[1] https://b4.docs.kernel.org/en/latest/
>>
>> Ya, Ackerley used a slightly older kvm/next to send the patches. I at least was
>> testing against kvm-x86/next, which does have the rename.
>>
>> Other than noting that this should be applied against the current kvm/next, I
>> don't think there's anything else to be done?
Should I base v9 on kvm/next, or kvm-x86/next?
>
> Agree. Sorry, didn't mean to be nit-picky, but this really threw me off :)
>
> Cheers,
> /fuad
^ permalink raw reply
* Re: [PATCH v8 10/46] KVM: guest_memfd: Wire up core private/shared attribute interfaces
From: Ackerley Tng @ 2026-06-24 20:44 UTC (permalink / raw)
To: Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <2ef455c3-a3f5-4ba1-86ea-b96416d163ce@linux.intel.com>
Binbin Wu <binbin.wu@linux.intel.com> writes:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
>
> [...]
>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index bca912db5be6e..e0e544ef47d69 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -926,6 +926,24 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>>
>> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
>> +static bool kvm_gmem_range_is_private(struct file *file, pgoff_t index,
>> + size_t nr_pages, struct kvm *kvm, gfn_t gfn)
>> +{
>> + struct maple_tree *mt = &GMEM_I(file_inode(file))->attributes;
>> + pgoff_t end = index + nr_pages - 1;
>> + void *entry;
>> +
>> + if (!gmem_in_place_conversion)
>> + return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
>> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
>> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
>> +
>> + mt_for_each(mt, entry, index, end) {
>> + if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
>> + return false;
>> + }
>
> Patch 1 noted that "Ensuring every index is represented in the maple tree at all times".
> So I think the queried range should not be a hole in the maple tree.
> However, there is a inconsistency: in patch 1 kvm_gmem_get_attributes() explicitly
> checks for holes, but this patch does not.
>
>> + return true;
>> +}
>>
With Sean's suggestion for patch 1, I'll update this one to default to
the "init" state if xa_to_value(entry) is NULL.
Thanks!
^ permalink raw reply
* Re: [PATCH v8 32/46] KVM: selftests: Test conversion flow when INIT_SHARED
From: Fuad Tabba @ 2026-06-24 19:55 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-32-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Add a test case to verify that conversions between private and shared
> memory work correctly when the memory is initially created as shared.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> .../testing/selftests/kvm/x86/guest_memfd_conversions_test.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> index 8e09e241723e5..5b070d3374eae 100644
> --- a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> @@ -95,6 +95,12 @@ static void __gmem_conversions_##test(test_data_t *t, int nr_pages) \
> #define GMEM_CONVERSION_TEST_INIT_PRIVATE(test) \
> __GMEM_CONVERSION_TEST_INIT_PRIVATE(test, 1)
>
> +#define __GMEM_CONVERSION_TEST_INIT_SHARED(test, __nr_pages) \
> + GMEM_CONVERSION_TEST(test, __nr_pages, GUEST_MEMFD_FLAG_INIT_SHARED)
> +
> +#define GMEM_CONVERSION_TEST_INIT_SHARED(test) \
> + __GMEM_CONVERSION_TEST_INIT_SHARED(test, 1)
> +
> struct guest_check_data {
> void *mem;
> char expected_val;
> @@ -186,6 +192,12 @@ GMEM_CONVERSION_TEST_INIT_PRIVATE(init_private)
> test_convert_to_private(t, 0, 'C', 'E');
> }
>
> +GMEM_CONVERSION_TEST_INIT_SHARED(init_shared)
> +{
> + test_shared(t, 0, 0, 'A', 'B');
> + test_convert_to_private(t, 0, 'B', 'C');
> + test_convert_to_shared(t, 0, 'C', 'D', 'E');
> +}
>
> int main(int argc, char *argv[])
> {
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 31/46] KVM: selftests: Test basic single-page conversion flow
From: Fuad Tabba @ 2026-06-24 19:45 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-31-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Add a selftest for the guest_memfd memory attribute conversion ioctls.
> The test starts the guest_memfd as all-private (the default state), and
> verifies the basic flow of converting a single page to shared and then back
> to private.
>
> Add infrastructure that supports extensions to other conversion flow
> tests. This infrastructure will be used in upcoming patches for other
> conversion tests.
>
> Add test as an x86-specific test since guest_memfd's testing
> vehicle (KVM_X86_SW_PROTECTED_VM) is x86-specific.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/Makefile.kvm | 1 +
> .../kvm/x86/guest_memfd_conversions_test.c | 199 +++++++++++++++++++++
> 2 files changed, 200 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> index 4ace12606e937..b0e64a6dde21a 100644
> --- a/tools/testing/selftests/kvm/Makefile.kvm
> +++ b/tools/testing/selftests/kvm/Makefile.kvm
> @@ -152,6 +152,7 @@ TEST_GEN_PROGS_x86 += x86/max_vcpuid_cap_test
> TEST_GEN_PROGS_x86 += x86/triple_fault_event_test
> TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test
> TEST_GEN_PROGS_x86 += x86/aperfmperf_test
> +TEST_GEN_PROGS_x86 += x86/guest_memfd_conversions_test
> TEST_GEN_PROGS_x86 += access_tracking_perf_test
> TEST_GEN_PROGS_x86 += coalesced_io_test
> TEST_GEN_PROGS_x86 += dirty_log_perf_test
> diff --git a/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> new file mode 100644
> index 0000000000000..8e09e241723e5
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86/guest_memfd_conversions_test.c
> @@ -0,0 +1,199 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2024, Google LLC.
> + */
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +#include <linux/align.h>
> +#include <linux/kvm.h>
> +#include <linux/sizes.h>
> +
> +#include "kvm_util.h"
> +#include "kselftest_harness.h"
> +#include "test_util.h"
> +#include "ucall_common.h"
> +
> +FIXTURE(gmem_conversions) {
> + struct kvm_vcpu *vcpu;
> + int gmem_fd;
> + /* HVA of the first byte of the memory mmap()-ed from gmem_fd. */
> + char *mem;
> +};
> +
> +typedef FIXTURE_DATA(gmem_conversions) test_data_t;
> +
> +FIXTURE_SETUP(gmem_conversions) { }
> +
> +static size_t page_size;
> +
> +static void guest_do_rmw(void);
> +#define GUEST_MEMFD_SHARING_TEST_GVA 0x90000000ULL
> +
> +/*
> + * Defer setup until the individual test is invoked so that tests can specify
> + * the number of pages and flags for the guest_memfd instance.
> + */
> +static void gmem_conversions_do_setup(test_data_t *t, int nr_pages,
> + int gmem_flags)
> +{
> + const struct vm_shape shape = {
> + .mode = VM_MODE_DEFAULT,
> + .type = KVM_X86_SW_PROTECTED_VM,
> + };
> + /*
> + * Use high GPA above APIC_DEFAULT_PHYS_BASE to avoid clashing with
> + * APIC_DEFAULT_PHYS_BASE.
> + */
> + const gpa_t gpa = SZ_4G;
> + const u32 slot = 1;
> + struct kvm_vm *vm;
> +
> + vm = __vm_create_shape_with_one_vcpu(shape, &t->vcpu, nr_pages, guest_do_rmw);
> +
> + vm_mem_add(vm, VM_MEM_SRC_SHMEM, gpa, slot, nr_pages,
> + KVM_MEM_GUEST_MEMFD, -1, 0, gmem_flags);
> +
> + t->gmem_fd = kvm_slot_to_fd(vm, slot);
> + t->mem = addr_gpa2hva(vm, gpa);
> + virt_map(vm, GUEST_MEMFD_SHARING_TEST_GVA, gpa, nr_pages);
> +}
> +
> +static void gmem_conversions_do_teardown(test_data_t *t)
> +{
> + /* No need to close gmem_fd, it's owned by the VM structure. */
> + kvm_vm_free(t->vcpu->vm);
> +}
> +
> +FIXTURE_TEARDOWN(gmem_conversions)
> +{
> + gmem_conversions_do_teardown(self);
> +}
> +
> +/*
> + * In these test definition macros, __nr_pages and nr_pages is used to set up
> + * the total number of pages in the guest_memfd under test. This will be
> + * available in the test definitions as nr_pages.
> + */
> +
> +#define __GMEM_CONVERSION_TEST(test, __nr_pages, flags) \
> +static void __gmem_conversions_##test(test_data_t *t, int nr_pages); \
> + \
> +TEST_F(gmem_conversions, test) \
> +{ \
> + gmem_conversions_do_setup(self, __nr_pages, flags); \
> + __gmem_conversions_##test(self, __nr_pages); \
> +} \
> +static void __gmem_conversions_##test(test_data_t *t, int nr_pages) \
> +
> +#define GMEM_CONVERSION_TEST(test, __nr_pages, flags) \
> + __GMEM_CONVERSION_TEST(test, __nr_pages, (flags) | GUEST_MEMFD_FLAG_MMAP)
> +
> +#define __GMEM_CONVERSION_TEST_INIT_PRIVATE(test, __nr_pages) \
> + GMEM_CONVERSION_TEST(test, __nr_pages, 0)
> +
> +#define GMEM_CONVERSION_TEST_INIT_PRIVATE(test) \
> + __GMEM_CONVERSION_TEST_INIT_PRIVATE(test, 1)
> +
> +struct guest_check_data {
> + void *mem;
> + char expected_val;
> + char write_val;
> +};
> +static struct guest_check_data guest_data;
> +
> +static void guest_do_rmw(void)
> +{
> + for (;;) {
> + char *mem = READ_ONCE(guest_data.mem);
> +
> + GUEST_ASSERT_EQ(READ_ONCE(*mem), READ_ONCE(guest_data.expected_val));
> + WRITE_ONCE(*mem, READ_ONCE(guest_data.write_val));
> +
> + GUEST_SYNC(0);
> + }
> +}
> +
> +static void run_guest_do_rmw(struct kvm_vcpu *vcpu, u64 pgoff,
> + char expected_val, char write_val)
> +{
> + struct ucall uc;
> + int r;
> +
> + guest_data.mem = (void *)GUEST_MEMFD_SHARING_TEST_GVA + pgoff * page_size;
> + guest_data.expected_val = expected_val;
> + guest_data.write_val = write_val;
> + sync_global_to_guest(vcpu->vm, guest_data);
> +
> + do {
> + r = __vcpu_run(vcpu);
> + } while (r == -1 && errno == EINTR);
> +
> + TEST_ASSERT_EQ(r, 0);
> +
> + switch (get_ucall(vcpu, &uc)) {
> + case UCALL_ABORT:
> + REPORT_GUEST_ASSERT(uc);
> + case UCALL_SYNC:
> + break;
> + default:
> + TEST_FAIL("Unexpected ucall %lu", uc.cmd);
> + }
> +}
> +
> +static void host_do_rmw(char *mem, u64 pgoff, char expected_val,
> + char write_val)
> +{
> + TEST_ASSERT_EQ(READ_ONCE(mem[pgoff * page_size]), expected_val);
> + WRITE_ONCE(mem[pgoff * page_size], write_val);
> +}
> +
> +static void test_private(test_data_t *t, u64 pgoff, char starting_val,
> + char write_val)
> +{
> + TEST_EXPECT_SIGBUS(WRITE_ONCE(t->mem[pgoff * page_size], write_val));
> + run_guest_do_rmw(t->vcpu, pgoff, starting_val, write_val);
> + TEST_EXPECT_SIGBUS(READ_ONCE(t->mem[pgoff * page_size]));
> +}
> +
> +static void test_convert_to_private(test_data_t *t, u64 pgoff,
> + char starting_val, char write_val)
> +{
> + gmem_set_private(t->gmem_fd, pgoff * page_size, page_size);
> + test_private(t, pgoff, starting_val, write_val);
> +}
> +
> +static void test_shared(test_data_t *t, u64 pgoff, char starting_val,
> + char host_write_val, char write_val)
> +{
> + host_do_rmw(t->mem, pgoff, starting_val, host_write_val);
> + run_guest_do_rmw(t->vcpu, pgoff, host_write_val, write_val);
> + TEST_ASSERT_EQ(READ_ONCE(t->mem[pgoff * page_size]), write_val);
> +}
> +
> +static void test_convert_to_shared(test_data_t *t, u64 pgoff,
> + char starting_val, char host_write_val,
> + char write_val)
> +{
> + gmem_set_shared(t->gmem_fd, pgoff * page_size, page_size);
> + test_shared(t, pgoff, starting_val, host_write_val, write_val);
> +}
> +
> +GMEM_CONVERSION_TEST_INIT_PRIVATE(init_private)
> +{
> + test_private(t, 0, 0, 'A');
> + test_convert_to_shared(t, 0, 'A', 'B', 'C');
> + test_convert_to_private(t, 0, 'C', 'E');
> +}
> +
> +
> +int main(int argc, char *argv[])
> +{
> + TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
> + TEST_REQUIRE(kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) &
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +
> + page_size = getpagesize();
> +
> + return test_harness_run(argc, argv);
> +}
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 30/46] KVM: selftests: Add helpers for calling ioctls on guest_memfd
From: Fuad Tabba @ 2026-06-24 19:26 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-30-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Add helper functions to kvm_util.h to support calling ioctls, specifically
> KVM_SET_MEMORY_ATTRIBUTES2, on a guest_memfd file descriptor.
>
> Introduce gmem_ioctl() and __gmem_ioctl() macros, modeled after the
> existing vm_ioctl() helpers, to provide a standard way to call ioctls
> on a guest_memfd.
>
> Add gmem_set_memory_attributes() and its derivatives (gmem_set_private(),
> gmem_set_shared()) to set memory attributes on a guest_memfd region.
> Also provide "__" variants that return the ioctl error code instead of
> aborting the test. These helpers will be used by upcoming guest_memfd
> tests.
>
> To avoid code duplication, factor out the check for supported memory
> attributes into a new macro, TEST_ASSERT_SUPPORTED_ATTRIBUTES, and use
> it in both the existing vm_set_memory_attributes() and the new
> gmem_set_memory_attributes() helpers.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/include/kvm_util.h | 94 +++++++++++++++++++++++---
> 1 file changed, 86 insertions(+), 8 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index 0cacf3698b259..323d06b5699ec 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -392,6 +392,16 @@ static __always_inline void static_assert_is_vcpu(struct kvm_vcpu *vcpu) { }
> __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, (vcpu)->vm); \
> })
>
> +#define __gmem_ioctl(gmem_fd, cmd, arg) \
> + kvm_do_ioctl(gmem_fd, cmd, arg)
> +
> +#define gmem_ioctl(gmem_fd, cmd, arg) \
> +({ \
> + int ret = __gmem_ioctl(gmem_fd, cmd, arg); \
> + \
> + TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(#cmd, ret)); \
> +})
> +
> /*
> * Looks up and returns the value corresponding to the capability
> * (KVM_CAP_*) given by cap.
> @@ -418,8 +428,16 @@ static inline void vm_enable_cap(struct kvm_vm *vm, u32 cap, u64 arg0)
> vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
> }
>
> +/*
> + * KVM_SET_MEMORY_ATTRIBUTES{,2} overwrites _all_ attributes. These
> + * flows need significant enhancements to support multiple attributes.
> + */
> +#define TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes) \
> + TEST_ASSERT(!(attributes) || (attributes) == KVM_MEMORY_ATTRIBUTE_PRIVATE, \
> + "Update me to support multiple attributes!")
> +
> static inline void vm_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
> - u64 size, u64 attributes)
> + size_t size, u64 attributes)
> {
> struct kvm_memory_attributes attr = {
> .attributes = attributes,
> @@ -428,17 +446,11 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, gpa_t gpa,
> .flags = 0,
> };
>
> - /*
> - * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes. These flows
> - * need significant enhancements to support multiple attributes.
> - */
> - TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
> - "Update me to support multiple attributes!");
> + TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes);
>
> vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
> }
>
> -
> static inline void vm_mem_set_private(struct kvm_vm *vm, gpa_t gpa,
> u64 size)
> {
> @@ -451,6 +463,72 @@ static inline void vm_mem_set_shared(struct kvm_vm *vm, gpa_t gpa,
> vm_set_memory_attributes(vm, gpa, size, 0);
> }
>
> +static inline int __gmem_set_memory_attributes(int fd, u64 offset,
> + size_t size, u64 attributes,
> + u64 *error_offset)
> +{
> + struct kvm_memory_attributes2 attr = {
> + .attributes = attributes,
> + .offset = offset,
> + .size = size,
> + .flags = 0,
> + .error_offset = 0,
> + };
> + int r;
> +
> + r = __gmem_ioctl(fd, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
> +
> + /* Copy error_offset regardless of r so caller can check. */
> + if (error_offset)
> + *error_offset = attr.error_offset;
> +
> + return r;
> +}
> +
> +static inline int __gmem_set_private(int fd, u64 offset, size_t size,
> + u64 *error_offset)
> +{
> + return __gmem_set_memory_attributes(fd, offset, size,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
> + error_offset);
> +}
> +
> +static inline int __gmem_set_shared(int fd, u64 offset, size_t size,
> + u64 *error_offset)
> +{
> + return __gmem_set_memory_attributes(fd, offset, size, 0,
> + error_offset);
> +}
> +
> +static inline void gmem_set_memory_attributes(int fd, u64 offset,
> + size_t size, u64 attributes)
> +{
> + struct kvm_memory_attributes2 attr = {
> + .attributes = attributes,
> + .offset = offset,
> + .size = size,
> + .flags = 0,
> + };
> +
> + TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes);
> +
> + __TEST_REQUIRE(kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) > 0,
> + "No valid attributes for guest_memfd ioctl!");
> +
> + gmem_ioctl(fd, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
> +}
> +
> +static inline void gmem_set_private(int fd, u64 offset, size_t size)
> +{
> + gmem_set_memory_attributes(fd, offset, size,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +static inline void gmem_set_shared(int fd, u64 offset, size_t size)
> +{
> + gmem_set_memory_attributes(fd, offset, size, 0);
> +}
> +
> void vm_guest_mem_fallocate(struct kvm_vm *vm, gpa_t gpa, u64 size,
> bool punch_hole);
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 29/46] KVM: selftests: Add selftests global for guest memory attributes capability
From: Fuad Tabba @ 2026-06-24 19:26 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-29-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Add a global variable, kvm_has_gmem_attributes, to make the result of
> checking for KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES available to all tests.
>
> kvm_has_gmem_attributes is true if guest_memfd tracks memory attributes, as
> opposed to VM-level tracking.
>
> This global variable is synced to the guest for testing convenience, to
> avoid introducing subtle bugs when host/guest state is desynced.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/include/test_util.h | 2 ++
> tools/testing/selftests/kvm/lib/kvm_util.c | 5 +++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
> index a56271c237ae9..51287fac8138a 100644
> --- a/tools/testing/selftests/kvm/include/test_util.h
> +++ b/tools/testing/selftests/kvm/include/test_util.h
> @@ -115,6 +115,8 @@ struct guest_random_state {
> extern u32 guest_random_seed;
> extern struct guest_random_state guest_rng;
>
> +extern bool kvm_has_gmem_attributes;
> +
> struct guest_random_state new_guest_random_state(u32 seed);
> u32 guest_random_u32(struct guest_random_state *state);
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index d5bbc80b2bf1c..b73817f7bc803 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -24,6 +24,8 @@ u32 guest_random_seed;
> struct guest_random_state guest_rng;
> static u32 last_guest_seed;
>
> +bool kvm_has_gmem_attributes;
> +
> static size_t vcpu_mmap_sz(void);
>
> int __open_path_or_exit(const char *path, int flags, const char *enoent_help)
> @@ -521,6 +523,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
> }
> guest_rng = new_guest_random_state(guest_random_seed);
> sync_global_to_guest(vm, guest_rng);
> + sync_global_to_guest(vm, kvm_has_gmem_attributes);
>
> kvm_arch_vm_post_create(vm, nr_runnable_vcpus);
>
> @@ -2286,6 +2289,8 @@ void __attribute((constructor)) kvm_selftest_init(void)
> guest_random_seed = last_guest_seed = random();
> pr_info("Random seed: 0x%x\n", guest_random_seed);
>
> + kvm_has_gmem_attributes = kvm_has_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES);
> +
> kvm_selftest_arch_init();
> }
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 28/46] KVM: selftests: Add support for mmap() on guest_memfd in core library
From: Fuad Tabba @ 2026-06-24 19:07 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-28-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:32, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Accept gmem_flags in vm_mem_add() to be able to create a guest_memfd within
> vm_mem_add().
>
> When vm_mem_add() is used to set up a guest_memfd for a memslot, set up the
> provided (or created) gmem_fd as the fd for the user memory region. This
> makes it available to be mmap()-ed from just like fds from other memory
> sources. mmap() from guest_memfd using the provided gmem_flags and
> gmem_offset.
>
> Add a kvm_slot_to_fd() helper to provide convenient access to the file
> descriptor of a memslot.
>
> Update existing callers of vm_mem_add() to pass 0 for gmem_flags to
> preserve existing behavior.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> [For guest_memfds, mmap() using gmem_offset instead of 0 all the time.]
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> tools/testing/selftests/kvm/include/kvm_util.h | 7 +++++-
> tools/testing/selftests/kvm/lib/kvm_util.c | 27 ++++++++++++----------
> .../kvm/x86/private_mem_conversions_test.c | 2 +-
> 3 files changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index d4c104cb0418f..0cacf3698b259 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -700,7 +700,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
> gpa_t gpa, u32 slot, u64 npages, u32 flags);
> void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
> gpa_t gpa, u32 slot, u64 npages, u32 flags,
> - int gmem_fd, u64 gmem_offset);
> + int gmem_fd, u64 gmem_offset, u64 gmem_flags);
>
> #ifndef vm_arch_has_protected_memory
> static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm)
> @@ -732,6 +732,11 @@ void *addr_gva2hva(struct kvm_vm *vm, gva_t gva);
> gpa_t addr_hva2gpa(struct kvm_vm *vm, void *hva);
> void *addr_gpa2alias(struct kvm_vm *vm, gpa_t gpa);
>
> +static inline int kvm_slot_to_fd(struct kvm_vm *vm, u32 slot)
> +{
> + return memslot2region(vm, slot)->fd;
> +}
> +
> #ifndef vcpu_arch_put_guest
> #define vcpu_arch_put_guest(mem, val) do { (mem) = (val); } while (0)
> #endif
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 9b482778f7379..d5bbc80b2bf1c 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -978,12 +978,13 @@ void vm_set_user_memory_region2(struct kvm_vm *vm, u32 slot, u32 flags,
> /* FIXME: This thing needs to be ripped apart and rewritten. */
> void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
> gpa_t gpa, u32 slot, u64 npages, u32 flags,
> - int gmem_fd, u64 gmem_offset)
> + int gmem_fd, u64 gmem_offset, u64 gmem_flags)
> {
> int ret;
> struct userspace_mem_region *region;
> size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
> size_t mem_size = npages * vm->page_size;
> + off_t mmap_offset = 0;
> size_t alignment = 1;
>
> TEST_REQUIRE_SET_USER_MEMORY_REGION2();
> @@ -1055,8 +1056,6 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>
> if (flags & KVM_MEM_GUEST_MEMFD) {
> if (gmem_fd < 0) {
> - u32 gmem_flags = 0;
> -
> TEST_ASSERT(!gmem_offset,
> "Offset must be zero when creating new guest_memfd");
> gmem_fd = vm_create_guest_memfd(vm, mem_size, gmem_flags);
> @@ -1077,13 +1076,17 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
> }
>
> region->fd = -1;
> - if (backing_src_is_shared(src_type))
> + if (flags & KVM_MEM_GUEST_MEMFD && gmem_flags & GUEST_MEMFD_FLAG_MMAP) {
> + region->fd = kvm_dup(gmem_fd);
> + mmap_offset = gmem_offset;
> + } else if (backing_src_is_shared(src_type)) {
> region->fd = kvm_memfd_alloc(region->mmap_size,
> src_type == VM_MEM_SRC_SHARED_HUGETLB);
> + }
>
> - region->mmap_start = kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
> - vm_mem_backing_src_alias(src_type)->flag,
> - region->fd);
> + region->mmap_start = __kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
> + vm_mem_backing_src_alias(src_type)->flag,
> + region->fd, mmap_offset);
>
> TEST_ASSERT(!is_backing_src_hugetlb(src_type) ||
> region->mmap_start == align_ptr_up(region->mmap_start, backing_src_pagesz),
> @@ -1129,10 +1132,10 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>
> /* If shared memory, create an alias. */
> if (region->fd >= 0) {
> - region->mmap_alias = kvm_mmap(region->mmap_size,
> - PROT_READ | PROT_WRITE,
> - vm_mem_backing_src_alias(src_type)->flag,
> - region->fd);
> + region->mmap_alias = __kvm_mmap(region->mmap_size,
> + PROT_READ | PROT_WRITE,
> + vm_mem_backing_src_alias(src_type)->flag,
> + region->fd, mmap_offset);
>
> /* Align host alias address */
> region->host_alias = align_ptr_up(region->mmap_alias, alignment);
> @@ -1143,7 +1146,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
> enum vm_mem_backing_src_type src_type,
> gpa_t gpa, u32 slot, u64 npages, u32 flags)
> {
> - vm_mem_add(vm, src_type, gpa, slot, npages, flags, -1, 0);
> + vm_mem_add(vm, src_type, gpa, slot, npages, flags, -1, 0, 0);
> }
>
> /*
> diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> index 1d2f5d4fd45d7..861baff201e78 100644
> --- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
> @@ -399,7 +399,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, u32 nr_v
> for (i = 0; i < nr_memslots; i++)
> vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
> BASE_DATA_SLOT + i, slot_size / vm->page_size,
> - KVM_MEM_GUEST_MEMFD, memfd, slot_size * i);
> + KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, 0);
>
> for (i = 0; i < nr_vcpus; i++) {
> gpa_t gpa = BASE_DATA_GPA + i * per_cpu_size;
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Fuad Tabba @ 2026-06-24 18:57 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park,
Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Baoquan He,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-24-9d2959357853@google.com>
On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Make in-place conversion the default if the arch has private mem.
>
> The default can be overridden at compile type by enabling
compile _time_
> CONFIG_KVM_VM_MEMORY_ATTRIBUTES, or at KVM load time through a module
> parameter.
>
> In-place conversion also implies tracking a guest's private/shared state in
> guest_memfd. To avoid inconsistencies in the way memory attributes are
> tracked between the per-VM or by guest_memfd, make the module_param
> read-only (0444).
>
> Document that using per-VM attributes for tracking private/shared state of
> guest memory is deprecated in favor of tracking in guest_memfd.
>
> Warn if the admin sets gmem_in_place_conversion as false when
> CONFIG_KVM_VM_MEMORY_ATTRIBUTES is not enabled. Add warning in the code
> path where guest memory is populated for a CoCo VM, since that's the
> earliest point in a CoCo VM's lifecycle where memory attributes are
> queried. Unlike other query sites, this site is exclusively used by CoCo
> VMs.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/Kconfig | 7 ++++++-
> virt/kvm/guest_memfd.c | 5 +++++
> virt/kvm/kvm_main.c | 3 ++-
> 3 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index c28393dc664eb..a3c189d765150 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -85,7 +85,12 @@ config KVM_VM_MEMORY_ATTRIBUTES
> bool "Enable per-VM PRIVATE vs. SHARED attributes (for CoCo VMs)"
> help
> Enable support for tracking PRIVATE vs. SHARED memory using per-VM
> - memory attributes.
> + memory attributes. Using per-VM attributes are deprecated in favor
nit:
are->is
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> + of tracking PRIVATE state in guest_memfd. Select this if you need
> + to run CoCo VMs using a VMM that doesn't support guest_memfd memory
> + attributes.
> +
> + If unsure, say N.
>
> config KVM_SW_PROTECTED_VM
> bool "Enable support for KVM software-protected VMs"
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 86c9f5b0863cb..5cb73543c03c8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1193,10 +1193,15 @@ static bool kvm_gmem_range_is_private(struct file *file, pgoff_t index,
> {
> struct maple_tree *mt = &GMEM_I(file_inode(file))->attributes;
>
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> if (!gmem_in_place_conversion)
> return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
> KVM_MEMORY_ATTRIBUTE_PRIVATE,
> KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +#else
> + if (WARN_ON_ONCE(!gmem_in_place_conversion))
> + return false;
> +#endif
>
> return kvm_gmem_range_has_attributes(mt, index, nr_pages,
> KVM_MEMORY_ATTRIBUTE_PRIVATE);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dd1d18a1d2f68..46e92b5dc3804 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -102,7 +102,8 @@ static bool __ro_after_init allow_unsafe_mappings;
> module_param(allow_unsafe_mappings, bool, 0444);
>
> #ifdef kvm_arch_has_private_mem
> -bool __ro_after_init gmem_in_place_conversion = false;
> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
> +module_param(gmem_in_place_conversion, bool, 0444);
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(gmem_in_place_conversion);
> #endif
>
>
> --
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
>
>
^ permalink raw reply
* Re: [PATCH v8 15/46] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Ackerley Tng @ 2026-06-24 17:46 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ajneQVLriUshjFIO@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, Jun 19, 2026, Fuad Tabba wrote:
>> On Fri, 19 Jun 2026 at 01:31, Ackerley Tng via B4 Relay
>> <devnull+ackerleytng.google.com@kernel.org> wrote:
>> >
>> > From: Ackerley Tng <ackerleytng@google.com>
>> >
>> > When memory in guest_memfd is converted from private to shared, the
>> > platform-specific state associated with the guest-private pages must be
>> > invalidated or cleaned up.
>> >
>> > Iterate over the folios in the affected range and call the
>> > kvm_arch_gmem_invalidate() hook for each PFN range. This allows
>> > architectures to perform necessary teardown, such as updating hardware
>> > metadata or encryption states, before the pages are transitioned to the
>> > shared state.
>> >
>> > Invoke this helper after indicating to KVM's mmu code that an invalidation
>> > is in progress to stop in-flight page faults from succeeding.
>> >
>> > Reviewed-by: Fuad Tabba <tabba@google.com>
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> Coming back to this after working through the arm64/pKVM side. My
>> Reviewed-by here is from the previous round and the patch hasn't
>> changed, but I missed an implication for arm64.
>>
>> kvm_arch_gmem_invalidate() is now called from two paths with the same
>> (start, end) signature: folio teardown (kvm_gmem_free_folio) and
>> private->shared conversion (here). For SNP/TDX that's fine, conversion is
>> destructive anyway. For pKVM the two need opposite content semantics:
>> conversion must preserve the page in place (same physical page, the point
>> of in-place conversion without encryption), while teardown must scrub it
>> before returning it to the host.
>>
>> The hook gets only a pfn range with no indication of which caller it's
>> serving, so arm64 can't give the two paths the behaviour they need. It
>> would help to signal intent on the conversion path: a reason/flag, a
>> separate hook, or not routing non-destructive conversion through the
>> teardown hook.
>>
>> arm64 isn't here yet, so this isn't urgent, but the hook is gaining a
>> second caller now, and it's cheaper to leave room for the distinction
>> than to change a generic contract other arches depend on later.
>
> Crud. It may not be urgent for arm64, but it's urgent for other reasons that
> I "can't" describe in detail at the moment, and even if that weren't the case, I
> think we should clean things up now. More below.
>
>> > virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
>> > 1 file changed, 41 insertions(+)
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index 433f79047b9d1..3c94442bc8131 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -607,6 +607,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>> > return safe;
>> > }
>> >
>> > +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
>> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>
> Not your fault, but kvm_arch_gmem_invalidate() is badly misnamed. It's not
> "invalidating" anything, it's much more of a "free" callback, as SNP uses it to
> put physical pages back into a shared state when a maybe-private folio is freed.
>
> As Fuad points out, (ab)using that hook for the private=>shared conversion case
> "works", but not broadly. And it makes the bad name worse, because it's called
> from code that _is_ doing true invalidations. For pKVM, it may not even need to
> do anything invalidation-like.
>
Thanks, I also didn't like the naming of kvm_gmem_invalidate(),
especially when conversions also calls
kvm_gmem_invalidate_{start,end}() and those do different things.
> To avoid a conflict with patches that are going to have priority over this series,
> to set the stage for arm64 support, and to avoid avoid bleeding vendor details
> into guest_memfd, as if they are core guest_memfd behavior (only SNP needs the
> "invalidation" on this specific transition), I think we should add an arch hook
> to do conversions straightaway.
>
> Unless there's a clever option I'm missing, it'll mean adding yet another
> HAVE_KVM_ARCH_GMEM_XXX flag? Hmm, especially because IIUC, arm64/pKVM doesn't
> need a callback for this case, only the free_folio case.
>
>> > +{
>> > + struct folio_batch fbatch;
>> > + pgoff_t next = start;
>> > + int i;
>> > +
>> > + folio_batch_init(&fbatch);
>> > + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
>> > + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> > + struct folio *folio = fbatch.folios[i];
>> > + pgoff_t start_index, end_index;
>> > + kvm_pfn_t start_pfn, end_pfn;
>> > +
>> > + start_index = max(start, folio->index);
>> > + end_index = min(end, folio_next_index(folio));
>> > + /*
>> > + * end_index is either in folio or points to
>> > + * the first page of the next folio. Hence,
>> > + * all pages in range [start_index, end_index)
>> > + * are contiguous.
>> > + */
>> > + start_pfn = folio_file_pfn(folio, start_index);
>> > + end_pfn = start_pfn + end_index - start_index;
>> > +
>> > + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
>> > + }
>> > +
>> > + folio_batch_release(&fbatch);
>> > + cond_resched();
>> > + }
>> > +}
>> > +#else
>> > +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
>> > +#endif
>> > +
>> > static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> > size_t nr_pages, uint64_t attrs,
>> > pgoff_t *err_index)
>> > @@ -647,7 +683,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> > */
>> >
>> > kvm_gmem_invalidate_start(inode, start, end);
>> > +
>> > + if (!to_private)
>> > + kvm_gmem_invalidate(inode, start, end);
>
> E.g. instead make this something like this?
>
> kvm_gmem_set_pfn_attributes(...)
>
> Hrm, though that wastes folio lookups in the to_private case. So maybe just this,
> assuming pKVM doesn't need to take additional action on conversions?
>
> if (!to_private)
> kvm_gmem_make_shared(...)
>
> Actually, if we do that, then we don't need a separate arch hook, just a separate
> config. It'll still bleed SNP details into guest_memfd, but it'll at least be
> done in a way that's more explicitly arch specific (and it's no different than
> what we already do for PREPARE...).
>
pKVM needs some arch guest_memfd lifecycle functions that
+ for conversion, doesn't do anything,
+ for teardown, resets page state (IIUC it'll be reset to
PKVM_PAGE_OWNED (by the host))
So I think we need different functions for those two stages in the
lifecycle of a page with guest_memfd? What if we have
CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES, which gates
+ kvm_gmem_should_set_pfn_attributes(attributes) and
.gmem_should_set_pfn_attributes
+ kvm_gmem_set_pfn_attributes(start_pfn, end_pfn, attributes) and
.gmem_set_pfn_attributes
CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN, which gates
+ kvm_gmem_teardown() and .gmem_teardown
SNP:
+ .gmem_should_set_pfn_attributes = sev_gmem_should_set_pfn_attributes,
and sev_gmem_should_set_pfn_attributes returns !is_private
+ Rename .gmem_invalidate and sev_gmem_invalidate to *set_pfn_attributes
+ .gmem_teardown = sev_gmem_set_pfn_attributes
TDX:
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_TEARDOWN
pKVM:
+ Disable CONFIG_HAVE_KVM_ARCH_GMEM_SET_PFN_ATTRIBUTES
+ .gmem_teardown = pkvm_gmem_set_pfn_attributes
Suzuki, does this work for ARM CCA?
This way,
+ The if (is_private) check doesn't leak SNP details into guest_memfd
+ .gmem_make_shared doesn't stick out without a .gmem_make_private
+ .gmem_set_pfn_attributes, .gmem_prepare and .gmem_teardown are aligned
conceptually as lifecycle hooks
+ I think the private/shared check for prepare can also be folded into
preparation.
+ Preparation perhaps doesn't need a should_prepare equivalent since
there's no iteration and getting the gfn is just doing some math?
+ In another patch series?
> E.g. this? There will still be a looming rename conflict, but that's easy enough
> to handle.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index 9ce5be7843f2..8aead0abd788 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -648,8 +648,8 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> return safe;
> }
>
> -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +#ifdef CONFIG_KVM_ARCH_GMEM_FREE_ON_SHARED_CONVERSION
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end)
> {
> struct folio_batch fbatch;
> pgoff_t next = start;
> @@ -681,7 +681,7 @@ static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> }
> }
> #else
> -static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +static void kvm_gmem_make_shared(struct inode *inode, pgoff_t start, pgoff_t end) { }
> #endif
>
> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> @@ -729,7 +729,7 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> kvm_gmem_invalidate_start(inode, start, end);
>
> if (!to_private)
> - kvm_gmem_invalidate(inode, start, end);
> + kvm_gmem_make_shared(inode, start, end);
>
> mas_store_prealloc(&mas, xa_mk_value(attrs));
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-24 17:01 UTC (permalink / raw)
To: Binbin Wu
Cc: ackerleytng, aik, andrew.jones, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <6fc7f450-6d0a-494d-b295-297e4703148d@linux.intel.com>
On Tue, Jun 23, 2026, Binbin Wu wrote:
> On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> > @@ -606,12 +608,20 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> > next = start;
> > while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
> >
> > - for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> > + for (i = 0; i < folio_batch_count(&fbatch);) {
> > struct folio *folio = fbatch.folios[i];
> >
> > - if (folio_ref_count(folio) !=
> > - folio_nr_pages(folio) + filemap_get_folios_refcount) {
> > - safe = false;
> > + safe = (folio_ref_count(folio) ==
> > + folio_nr_pages(folio) +
> > + filemap_get_folios_refcount);
> > +
> > + if (safe) {
> > + ++i;
> > + } else if (folio_may_be_lru_cached(folio) &&
> > + !lru_drained) {
> > + lru_add_drain_all();
>
> It seems unprivileged userspace is able to trigger lru_add_drain_all() repeatedly
> by invoking KVM_SET_MEMORY_ATTRIBUTES2 in a loop, which could lead to DoS risk?
FIW, if there's a risk, then AFAICT fadvise() and memfd's F_ADD_SEALS already
have the same risk.
^ permalink raw reply
* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-24 16:57 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260618-gmem-inplace-conversion-v8-18-9d2959357853@google.com>
On Thu, Jun 18, 2026, Ackerley Tng wrote:
> When checking if a guest_memfd folio is safe for conversion, its refcount
> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> temporarily increases its refcount.
Under what circumstances does this happen, and what alternatives are there for
userspace to work around the issue?
^ permalink raw reply
* Re: [PATCH v13 11/22] KVM: selftests: Set up TDX boot parameters region
From: Xiaoyao Li @ 2026-06-24 16:07 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-11-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> +void tdx_vm_load_common_boot_parameters(struct kvm_vm *vm)
> +{
> + struct td_boot_parameters *params =
> + addr_gpa2hva(vm, TD_BOOT_PARAMETERS_GPA);
> + u32 cr4;
> +
> + TEST_ASSERT_EQ(vm->mode, VM_MODE_PXXVYY_4K);
Why TDX needs to check it explicitly?
x86's virt_arch_pgd_alloc() already has such assertation. I think we can
just drop it.
^ permalink raw reply
* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Xiaoyao Li @ 2026-06-24 15:43 UTC (permalink / raw)
To: Ackerley Tng, Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgEBiivp9tOHZnFmWvF6ek-Ar-m2B+hb=-H0dgAcM2=z8Q@mail.gmail.com>
On 6/16/2026 7:46 AM, Ackerley Tng wrote:
> Lisa Wang <wyihan@google.com> writes:
>
>> Set the initial state of the first memory region as shared if it is
>> backed by guest_memfd, so that the KVM selftest framework functions can
>> populate mmap()-ed guest_memfd memory the same way memory from other
>> memory providers are populated.
>>
>> For CoCo VMs, pages that need to be private are explicitly set to
>> private before executing the VM.
>>
>>
>> [...snip...]
>>
>> @@ -495,14 +497,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
>> vm = ____vm_create(shape);
>>
>> /*
>> - * Force GUEST_MEMFD for the primary memory region if necessary, e.g.
>> - * for CoCo VMs that require GUEST_MEMFD backed private memory.
>> + * Force GUEST_MEMFD for the primary memory region if necessary, and
>> + * initialize it as shared so the selftest framework can populate it
>> + * exactly like other memory providers.
>> */
>> - flags = 0;
>> - if (is_guest_memfd_required(shape))
>> + if (is_guest_memfd_required(shape)) {
>> flags |= KVM_MEM_GUEST_MEMFD;
>> + gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
>> + }
>>
>
> Just noticed this while hacking some SNP tests.
>
>> - vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
>> + vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
>> for (i = 0; i < NR_MEM_REGIONS; i++)
>> vm->memslots[i] = 0;
>>
>>
>> --
>> 2.54.0.746.g67dd491aae-goog
>
> I think this patch should fully buy into in-place conversions, so we
> need to also set GUEST_MEMFD_FLAG_MMAP:
>
> @@ -483,6 +483,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape,
> u32 nr_runnable_vcpus,
> {
> u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
> nr_extra_pages);
> + enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS;
> struct userspace_mem_region *slot0;
> u64 gmem_flags = 0;
> struct kvm_vm *vm;
> @@ -503,10 +504,16 @@ struct kvm_vm *__vm_create(struct vm_shape
> shape, u32 nr_runnable_vcpus,
> */
> if (is_guest_memfd_required(shape)) {
> flags |= KVM_MEM_GUEST_MEMFD;
> - gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> + gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED | GUEST_MEMFD_FLAG_MMAP;
GUEST_MEMFD_FLAG_INIT_SHARED is valid only when the memory attributes is
per-gmem.
we need to check KVM_CAP_GUEST_MEMFD_FLAGS or kvm_has_gmem_attributes.
^ permalink raw reply
* Re: [PATCH v8 04/46] KVM: Decouple kvm_has_arch_private_mem from CONFIG_KVM_VM_MEMORY_ATTRIBUTES
From: Sean Christopherson @ 2026-06-24 15:12 UTC (permalink / raw)
To: Ackerley Tng
Cc: Binbin Wu, aik, andrew.jones, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgGF+O7r-YHqcLp-ZgoXTCbqjuUhpOdD5eE5w2wu3YYYpw@mail.gmail.com>
On Tue, Jun 23, 2026, Ackerley Tng wrote:
> Binbin Wu <binbin.wu@linux.intel.com> writes:
>
> > On 6/19/2026 8:31 AM, Ackerley Tng via B4 Relay wrote:
> >> From: Sean Christopherson <seanjc@google.com>
> >>
> >> When memory attributes become trackable in guest_memfd, the concept of
> >> having private memory is no longer dependent on
> >> CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
> >>
> >> With this, on x86, kvm_arch_has_private_mem() is defined if some CoCo
> >> platform support (or the testing CONFIG_KVM_SW_PROTECTED_VM) is compiled
> >> in.
> >>
> >> Signed-off-by: Sean Christopherson <seanjc@google.com>
> >> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> >> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> >
> > Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> >
> > One nit below.
> >
> >> ---
> >> arch/x86/include/asm/kvm_host.h | 4 +++-
> >> include/linux/kvm_host.h | 2 +-
> >> 2 files changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index 8e8eb8a5e8a6b..1bde67cf6eb0e 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -2394,7 +2394,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> >> int tdp_max_root_level, int tdp_huge_page_level);
> >>
> >>
> >> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> >> +#if defined(CONFIG_KVM_SW_PROTECTED_VM) || \
> >> + defined(CONFIG_KVM_INTEL_TDX) || \
> >> + defined(CONFIG_KVM_AMD_SEV)
> >
> > Nit:
> > Vertically align the defined(XXX) statements for better readability?
> >
>
> Sean had this aligned with spaces, and checkpatch complained about
checkpatch is a tool, it is neither omniscient nor authoritative. And for things
like this, the *entire* purpose for rules/guildlines like "no tabs after spaces"
is to help ensure the code is easier to read, e.g. doesn't end up with wonky
formatting when viewed in certain editors or whatever. So, ignore checkpatch if
it complains about formatting that is visually superior to what makes checkpatch
happy.
> having no spaces before tabs, so I switched it to tabs instead since I
> don't think alignment like that is officially documented either way.
This exact case may not be "officially" documented, but the general gist is in
Documentation/process/maintainer-tip.rst:
When splitting function declarations or function calls, then please align
the first argument in the second line with the first argument in the first
line::
And there is lots and lots of prior art on-list (from me and others) that is more
or less as good as official documentation.
> Either way is fine :)
Please restore the alignment.
^ permalink raw reply
* Re: [PATCH v8 09/46] KVM: guest_memfd: Introduce function to check GFN private/shared status
From: Ackerley Tng @ 2026-06-24 14:38 UTC (permalink / raw)
To: Binbin Wu
Cc: aik, andrew.jones, brauner, chao.p.peng, david, jmattson,
jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <1b59fec2-a464-4429-8532-880394912af5@linux.intel.com>
Binbin Wu <binbin.wu@linux.intel.com> writes:
>
> [...snip...]
>
>> +bool kvm_gmem_is_private(struct kvm *kvm, gfn_t gfn)
>> +{
>> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>> + struct inode *inode;
>> +
>> + /*
>> + * If this gfn has no associated memslot, there's no chance of the gfn
>> + * being backed by private memory, since guest_memfd must be used for
>> + * private memory,
>
> "guest_memfd must be used for private memory" is a bit confusing to me.
>
Hmm good point. Is the source of confusion that guest_memfd can be used
for both shared and private memory?
Perhaps this can be rephrased as:
guest_memfd is the only provider of private memory and guest_memfd must
be used with a memslot, hence if there's no associated memslot, there's
no chance of this gfn being private.
>> and guest_memfd must be associated with some memslot.
>> + */
>> + if (!slot)
>> + return 0;
>> +
>>
>> [...snip...]
>>
^ permalink raw reply
* Re: [RFCv2 PATCH 1/6] efi/unaccepted: Support hotplug memory in unaccepted bitmap via SRAT
From: Pratik R. Sampat @ 2026-06-24 14:23 UTC (permalink / raw)
To: Kiryl Shutsemau, Zhenzhong Duan
Cc: marcandre.lureau, david, rick.p.edgecombe, pbonzini, mst, peterx,
chenyi.qiang, elena.reshetova, michael.roth, ackerleytng,
linux-kernel, linux-coco, virtualization, x86, yilun.xu,
xiaoyao.li, chao.p.peng
In-Reply-To: <ajvLaBs62bDoxC3W@thinkstation>
On 6/24/26 8:25 AM, Kiryl Shutsemau wrote:
> On Tue, Jun 23, 2026 at 06:17:32AM -0400, Zhenzhong Duan wrote:
>> Currently, allocate_unaccepted_bitmap() only scans the initial EFI
>> boot memory map. This misses hotpluggable ranges described in the
>> ACPI SRAT. Without early tracking, hotplug pages are accessed without
>> acceptance and this triggers guest crash.
>>
>> Introduce a lightweight ACPI SRAT parser to scan these regions early.
>> If a region has both ACPI_SRAT_MEM_ENABLED and ACPI_SRAT_MEM_HOT_PLUGGABLE
>> flags, expand the tracking boundaries. This avoids pulling in the full
>> ACPI subsystem while ensuring the bitmap covers both static memory and
>> hotplug memory.
>
> Ugh.. Parsing SRAT there is ugly. I would rather avoid it.
>
I agree. Parsing it here means SRAT gets parsed twice, which doesn't make much
sense.
> Do I understand correctly that we don't have a way represent pluggable,
> but not present memory in EFI memory map?
>
> IIUC, EFI_MEMORY_HOT_PLUGGABLE is actually present, but unpluggable
> memory.
>
Right. And repurposing EFI_MEMORY_HOT_PLUGGABLE (plus updating the spec) would
likely make this messier: by its current definition it describes cold-plugged
pages that may be removed, not pages that may be hot-added later.
> Maybe it would be better just allocate bitmap upto maxmem?
>
> And fix EFI spec to add pluggable-but-not-present attribute.
>
I am currently working with the UEFI community around two proposals for a spec
change:
1. Add a new attribute, as Kiryl suggested, or
2. Add a generic new hotplug memory type that represents all the memory that
could be added later.
In either case, we could then precisely allocate the bitmap by parsing the
region with the attribute/type.
I prefer (1), but I have RFC proposals, code-first edk2 changes, and the Linux
plumbing ready for both approaches, and plan to post them in the following week
after ironing out a few kinks.
Thanks,
--Pratik
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox