* Re: [PATCH v13 00/22] TDX KVM selftests
From: Ackerley Tng @ 2026-06-16 17:51 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Kiryl Shutsemau, linux-kselftest,
Paolo Bonzini, Pratik R. Sampat, Reinette Chatre, Rick Edgecombe,
Roger Wang, Ryan Afranji, Sagi Shahar, Sean Christopherson,
Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86,
Adrian Hunter
In-Reply-To: <20260521-tdx-selftests-v13-v13-0-6983ae4c3a4d@google.com>
Lisa Wang <wyihan@google.com> writes:
> This patch series focuses on setting up a TDX VM and adding all code
> necessary to run a basic lifecycle test.
>
> Unlike standard KVM selftests can set up the VM through guest registers,
> TDX module protects TDs' register state from the host. This feature of
> TDX causes problems on VM boot state initialization and the ucall
> implementation.
>
> In standard KVM selftests, the host directly initializes the guest state
> by manipulating Special Registers (SREGs) and General Purpose Registers
> (GPRs) via IOCTLs (KVM_SET_SREGS, etc.) before the first KVM_RUN.
>
> To bypass direct register initialization by the host, we utilize the
> standard x86 reset vector as the default entry point.
>
> The mechanism works as follows:
> 1. The host places register values into a specific memory region and
> inserts boot code at the VM's default starting point.
> 2. When the VM starts, it executes this boot code to "pull" values from
> memory and manually set up its own SREGs and GPRs.
> 3. Once the environment is ready, the boot code jumps to the guest code.
>
> The standard x86 ucall() implementation uses PIO, but it does not
> actually transmit data through the 4-byte PIO data. Instead, it relies
> on the host reading the ucall address directly from the guest's RDI
> register.
>
> TDX selftests cannot utilize the standard x86 ucall implementation,
> because the host is unable to access the guest's RDI register. Based on
> this restriction, we considered these potential solutions for the TDX
> ucall implementation.
>
> 1. TDCALL PIO with RCX-bits Passthrough
> We first considered passing the RDI value through RCX bits to bypass the
> hardware's register protection, which could be the closest approach to
> the non-TDX implementation as per Sean's suggestion[1]. However, this
> approach is blocked by the software-side implementation: KVM_GET_REGS
> currently does not support TDX VMs and returns -EINVAL. To make this
> work, the KVM ioctl would need a test-only hack.
>
> 2. TDCALL PIO with buffer indexing
> To keep a PIO-based approach and unify the get_ucall implementation for
> both TDX and non-TDX VMs, we considered TDCALL PIO with buffer indexing.
> Since the ucall buffer is initialized prior to execution, the VM could
> just pass a buffer index rather than an 8-byte ucall address to fit
> within the 4-byte PIO data limit. The host, already knowing the ucall
> buffer's base address, could then resolve the ucall content via this
> index. We abandoned this solution because it would require changes to
> the common ucall structure and impact other non-x86 architectures.
>
> 3. TDCALL MMIO (Selected solution)
> We ultimately selected TDCALL with an 8-byte MMIO data. This method only
> requires initializing an MMIO GPA and adding TDCALL MMIO implementation
> for TDX under the original x86 ucall path. While this diverges from the
> non-TDX PIO, it provides the cleanest implementation with minimal
> disruption to the overall ucall architecture.
>
Sean, Lisa evaluated your suggestion [1] (summarized as 1. above) but we
think TDCALL MMIO is better, what do you think?
+ Jump directly to where the mmio is used: [2]
+ And here's [3] how tdx_mmio_write() is implemented, with no more
throwing everything in a structure. It's also not macroed/prototyped
like you suggested in [4], but I think those prototypes can evolve out
of future tdx functions?
Let us know so Lisa can try another option (if necessary) while we
collect more reviews :)
[1] https://lore.kernel.org/all/aQTcDH9LRezI30dm@google.com/
[2] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-21-6983ae4c3a4d@google.com/
[3] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-20-6983ae4c3a4d@google.com/
[4] https://lore.kernel.org/all/aQTdTkMIukzt-YlA@google.com/
> 4. A note on #VE and x86 ucall simplification
> It is worth noting that the use of a Virtualization Exception (#VE)
> is orthogonal to the PIO vs. MMIO discussion; rather, it is a question
> of how much we want to simplify the x86 ucall implementation. A #VE
> handler is one option to allow VMs use PIO/MMIO identical to the
> non-TDX case. Alternatively, having an MMIO_WRITE wrapper macro, as Sean
> suggested[2], is another option. Either way, discussion for this is
> likely a premature optimization right now, since the PIO/MMIO call is
> only used under ucall_arch_do_ucall(), and standard and TDX VMs use
> different ones now. We should optimize this in the future, but for now,
> invoking TDCALL directly is more robust and concise.
>
>
> [...snip...]
>
^ permalink raw reply
* Re: [PATCH] PCI/TSM: Resume device to D0 for CMA-SPDM operation
From: Dan Williams (nvidia) @ 2026-06-16 17:34 UTC (permalink / raw)
To: Lukas Wunner, Dan Williams, Ashish Kalra, Tom Lendacky
Cc: Vivaik Balasubrawmanian, John Allen, Bjorn Helgaas, linux-coco,
linux-pci, Jonathan Cameron, Aneesh Kumar K.V, Yilun Xu,
Zhenzhong Duan, Alexey Kardashevskiy
In-Reply-To: <7bdfaf14d7e5a466f3f650150c688a60e947a7a9.1781527060.git.lukas@wunner.de>
Lukas Wunner wrote:
> Per PCIe r7.0 sec 6.31.3, CMA-SPDM operation in non-D0 states is optional.
> The spec does not define a way to determine if it's supported, so resume
> to D0 unconditionally for the duration of a CMA-SPDM exchange. Vivaik has
> talked to Windows engineers and they said that Windows does the same.
>
> Note that for plain DOE operation, it is sufficient for the device to be
> in D3hot and its parents in D0 because config space remains accessible in
> D3hot. So CMA-SPDM goes beyond the requirements of plain DOE and hence
> resuming to D0 needs to (only) be done in code paths which use DOE
> specifically for CMA-SPDM.
>
> The pattern used herein for runtime resume is the best practice introduced
> by commit ef8057b07c72 ("PM: runtime: Wrapper macros for ACQUIRE()/
> ACQUIRE_ERR()").
>
> Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v6.19+
> Cc: Vivaik Balasubrawmanian <vivaik.balasubrawmanian@intel.com>
> ---
> We're in the merge window for v7.2 and this isn't super urgent,
> so it's targeting v7.3 via tsm.git/next.
>
> Technically I'd have permission to apply myself,
> but I wouldn't want to without acks from Dan and AMD!
> Thanks for taking a look!
Thanks, Lukas. A few questions:
This says Fixes, but I assume it is based on inspection and not a
report?
There are no upstream usages of pci_tsm_doe_transfer() yet, but the ones
in flight would suffer from the "D0 -> D3hot -> D0 -> D3hot" bounce that
you described to sashiko. I.e. the runtime acquire should be done at a
higher level.
I think the natural place to add PM_RUNTIME_ACQUIRE() that covers all
cases is withing pci_tsm_connect() and pci_tsm_disconnect().
I also think failure to power manage the device in the disconnect path
should not be fatal to performing the rest of the cleanup.
^ permalink raw reply
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-16 17:06 UTC (permalink / raw)
To: Ackerley Tng
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar, Shuah Khan,
Oliver Upton, Jeremiah McReynolds, kvm, linux-coco, linux-kernel,
x86
In-Reply-To: <CAEvNRgFPKC2uOMaams7SS9B7LxvfU4h8DrPM5vXFb=pmXsgPbA@mail.gmail.com>
On Tue, Jun 16, 2026, Ackerley Tng wrote:
> >> 1. What do you think of a kvm_arch_vm_finalize() that calls
> >> vm_sev_launch() and tdx_vm_finalize()? My key issue is that
> >> kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
> >> rather than the whole VM.
> >
> > Key word "seems". I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
> > the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
> > 'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
> > finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.
> >
> > We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
> > we'll need to decide between automagic vs. manual finalization, and it definitely
> > should be a separate discussion.
> >
>
> This definitely should not block this series.
>
> It's coming together for me now with your explanation:
> kvm_arch_vm_finalize_vcpus() actually means finalizing vCPUs! vGIC ==
> Virtual Generic Interrupt Controller, which has to be done after all the
> vCPUs are set up. Since the name is describing where in the VM
> creation/setup flow the hook is called (after creating VM and after
> creating vCPUs), maybe something like kvm_arch_vm_post_vcpu_create()?
No, because I would expect post_vcpu_create() to run after creating each vCPU,
not after creating all vCPUs. E.g. see KVM's kvm_arch_vcpu_{pre,post}create().
> Renaming this to kvm_arch_vm_finalize() makes it sound like it is
> finalizing the VM, but this function shouldn't finalize the VM since for
> CoCo finalizing the VM also loads the guest image into the guest - deals
> with memory, not just vCPUs.
>
> 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for 'default'
> VMs") also includes a test_disable_default_vgic() function, we could
> also use something like that to skip CoCo VM finalization for some
> tests? Maybe that's a good middle ground.
That probably won't work well, and in practice it's just shuffling deck chairs
on the Titanic. For vGIC, and pre-create hook works because the tests that opt
out of automatic vGIC instantiation want that behavior to apply to all VMs that
the test creates. That's not the case for sev_smoke_test though, because some
testcases need deferred launch (test_sync_vmsa()), whereas others can use
automatic launch (test_sev()).
The other wrinkle is that SEV at least needs to provide the policy, which again
varies per VM within a single test.
^ permalink raw reply
* Re: [PATCH v13 01/22] KVM: selftests: Add macros to simplify creating VM shapes for non-default types
From: Sean Christopherson @ 2026-06-16 16:51 UTC (permalink / raw)
To: Xiaoyao Li
Cc: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <e0b99e9a-c20f-4def-ac4b-0070996c10ef@intel.com>
On Tue, Jun 16, 2026, Xiaoyao Li wrote:
> On 5/22/2026 7:16 AM, Lisa Wang wrote:
> > diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> > index dc70c6da63fa..041bdbfb93f7 100644
> > --- a/tools/testing/selftests/kvm/include/kvm_util.h
> > +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> > @@ -233,6 +233,19 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
> > shape; \
> > })
> > +#define __VM_TYPE(__mode, __type) \
>
> It seems the name "__VM_SHAPE" fits better?
>
> > +({ \
> > + struct vm_shape shape = { \
> > + .mode = (__mode), \
> > + .type = (__type) \
> > + }; \
> > + \
> > + shape; \
> > +})
> > +
> > +#define VM_TYPE(__type) \
> > + __VM_TYPE(VM_MODE_DEFAULT, __type)
>
> and I think making it one line would be OK?
>
> So something on top:
>
> ---8<---
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h
> b/tools/testing/selftests/kvm/include/kvm_util.h
> index 041bdbfb93f7..a1b5d2029d05 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -223,17 +223,7 @@ kvm_static_assert(sizeof(struct vm_shape) ==
> sizeof(u64));
>
> #define VM_TYPE_DEFAULT 0
>
> -#define VM_SHAPE(__mode) \
> -({ \
> - struct vm_shape shape = { \
> - .mode = (__mode), \
> - .type = VM_TYPE_DEFAULT \
> - }; \
> - \
> - shape; \
> -})
> -
> -#define __VM_TYPE(__mode, __type) \
> +#define __VM_SHAPE(__mode, __type) \
> ({ \
> struct vm_shape shape = { \
> .mode = (__mode), \
> @@ -243,8 +233,8 @@ kvm_static_assert(sizeof(struct vm_shape) ==
> sizeof(u64));
> shape; \
> })
>
> -#define VM_TYPE(__type) \
> - __VM_TYPE(VM_MODE_DEFAULT, __type)
> +#define VM_SHAPE(__mode) __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
> +#define VM_TYPE(__type) __VM_SHAPE(VM_MODE_DEFAULT, __type)
Oh, that's way better! I say we go straight there:
--
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 28 Oct 2025 21:20:27 +0000
Subject: [PATCH] KVM: selftests: Add macros to simplify creating VM shapes for
non-default types
Add VM_TYPE() and __VM_SHAPE() macros to create a vm_shape structure given
a type (and mode), and use the macros to define VM_SHAPE_{SEV,SEV_ES,SNP}
shapes for x86's SEV family of VM shapes. Providing common infrastructure
will avoid having to copy+paste vm_sev_create_with_one_vcpu() for TDX.
Use the new SEV+ shapes and drop vm_sev_create_with_one_vcpu().
Opportunistically move the existing VM_SHAPE() (now __VM_SHAPE()) macro
below the definitions of VM_MODE_DEFAULT so that all of the SHAPE/TYPE
macros are bundled together.
No functional change intended.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
.../testing/selftests/kvm/include/kvm_util.h | 28 +++++++------
.../selftests/kvm/include/x86/processor.h | 4 ++
tools/testing/selftests/kvm/include/x86/sev.h | 2 -
tools/testing/selftests/kvm/lib/x86/sev.c | 16 --------
.../selftests/kvm/x86/sev_smoke_test.c | 40 +++++++++----------
5 files changed, 40 insertions(+), 50 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index dc70c6da63fa..46bae183d7fc 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -221,18 +221,6 @@ struct vm_shape {
kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
-#define VM_TYPE_DEFAULT 0
-
-#define VM_SHAPE(__mode) \
-({ \
- struct vm_shape shape = { \
- .mode = (__mode), \
- .type = VM_TYPE_DEFAULT \
- }; \
- \
- shape; \
-})
-
extern enum vm_guest_mode vm_mode_default;
#if defined(__aarch64__)
@@ -270,8 +258,24 @@ extern enum vm_guest_mode vm_mode_default;
#endif
+#define VM_TYPE_DEFAULT 0
+
+#define __VM_SHAPE(__mode, __type) \
+({ \
+ struct vm_shape shape = { \
+ .mode = (__mode), \
+ .type = (__type), \
+ }; \
+ \
+ shape; \
+})
+
+
+#define VM_SHAPE(__mode) __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
#define VM_SHAPE_DEFAULT VM_SHAPE(VM_MODE_DEFAULT)
+#define VM_TYPE(__type) __VM_SHAPE(VM_MODE_DEFAULT, __type)
+
#define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT)
#define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE)
diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index 77f576ee7789..0aa6eecfcbde 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -365,6 +365,10 @@ static inline unsigned int x86_model(unsigned int eax)
return ((eax >> 12) & 0xf0) | ((eax >> 4) & 0x0f);
}
+#define VM_SHAPE_SEV VM_TYPE(KVM_X86_SEV_VM)
+#define VM_SHAPE_SEV_ES VM_TYPE(KVM_X86_SEV_ES_VM)
+#define VM_SHAPE_SNP VM_TYPE(KVM_X86_SNP_VM)
+
#define PHYSICAL_PAGE_MASK GENMASK_ULL(51, 12)
#define PAGE_SHIFT 12
diff --git a/tools/testing/selftests/kvm/include/x86/sev.h b/tools/testing/selftests/kvm/include/x86/sev.h
index 1af44c151d60..944c59dbe510 100644
--- a/tools/testing/selftests/kvm/include/x86/sev.h
+++ b/tools/testing/selftests/kvm/include/x86/sev.h
@@ -53,8 +53,6 @@ void snp_vm_launch_start(struct kvm_vm *vm, u64 policy);
void snp_vm_launch_update(struct kvm_vm *vm);
void snp_vm_launch_finish(struct kvm_vm *vm);
-struct kvm_vm *vm_sev_create_with_one_vcpu(u32 type, void *guest_code,
- struct kvm_vcpu **cpu);
void vm_sev_launch(struct kvm_vm *vm, u64 policy, u8 *measurement);
kvm_static_assert(SEV_RET_SUCCESS == 0);
diff --git a/tools/testing/selftests/kvm/lib/x86/sev.c b/tools/testing/selftests/kvm/lib/x86/sev.c
index 93f916903461..95d8520eea34 100644
--- a/tools/testing/selftests/kvm/lib/x86/sev.c
+++ b/tools/testing/selftests/kvm/lib/x86/sev.c
@@ -158,22 +158,6 @@ void snp_vm_launch_finish(struct kvm_vm *vm)
vm_sev_ioctl(vm, KVM_SEV_SNP_LAUNCH_FINISH, &launch_finish);
}
-struct kvm_vm *vm_sev_create_with_one_vcpu(u32 type, void *guest_code,
- struct kvm_vcpu **cpu)
-{
- struct vm_shape shape = {
- .mode = VM_MODE_DEFAULT,
- .type = type,
- };
- struct kvm_vm *vm;
- struct kvm_vcpu *cpus[1];
-
- vm = __vm_create_with_vcpus(shape, 1, 0, guest_code, cpus);
- *cpu = cpus[0];
-
- return vm;
-}
-
void vm_sev_launch(struct kvm_vm *vm, u64 policy, u8 *measurement)
{
if (is_sev_snp_vm(vm)) {
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 1a49ee391586..fe2c438882ae 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -104,7 +104,7 @@ static void compare_xsave(u8 *from_host, u8 *from_guest)
abort();
}
-static void test_sync_vmsa(u32 type, u64 policy)
+static void test_sync_vmsa(struct vm_shape shape, u64 policy)
{
struct kvm_vcpu *vcpu;
struct kvm_vm *vm;
@@ -114,7 +114,7 @@ static void test_sync_vmsa(u32 type, u64 policy)
double x87val = M_PI;
struct kvm_xsave __attribute__((aligned(64))) xsave = { 0 };
- vm = vm_sev_create_with_one_vcpu(type, guest_code_xsave, &vcpu);
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code_xsave);
gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
MEM_REGION_TEST_DATA);
hva = addr_gva2hva(vm, gva);
@@ -150,13 +150,13 @@ static void test_sync_vmsa(u32 type, u64 policy)
kvm_vm_free(vm);
}
-static void test_sev(void *guest_code, u32 type, u64 policy)
+static void test_sev(void *guest_code, struct vm_shape shape, u64 policy)
{
struct kvm_vcpu *vcpu;
struct kvm_vm *vm;
struct ucall uc;
- vm = vm_sev_create_with_one_vcpu(type, guest_code, &vcpu);
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
/* TODO: Validate the measurement is as expected. */
vm_sev_launch(vm, policy, NULL);
@@ -201,12 +201,12 @@ static void guest_shutdown_code(void)
__asm__ __volatile__("ud2");
}
-static void test_sev_shutdown(u32 type, u64 policy)
+static void test_sev_shutdown(struct vm_shape shape, u64 policy)
{
struct kvm_vcpu *vcpu;
struct kvm_vm *vm;
- vm = vm_sev_create_with_one_vcpu(type, guest_shutdown_code, &vcpu);
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_shutdown_code);
vm_sev_launch(vm, policy, NULL);
@@ -218,28 +218,28 @@ static void test_sev_shutdown(u32 type, u64 policy)
kvm_vm_free(vm);
}
-static void test_sev_smoke(void *guest, u32 type, u64 policy)
+static void test_sev_smoke(void *guest, struct vm_shape shape, u64 policy)
{
const u64 xf_mask = XFEATURE_MASK_X87_AVX;
- if (type == KVM_X86_SNP_VM)
- test_sev(guest, type, policy | SNP_POLICY_DBG);
+ if (shape.type == KVM_X86_SNP_VM)
+ test_sev(guest, shape, policy | SNP_POLICY_DBG);
else
- test_sev(guest, type, policy | SEV_POLICY_NO_DBG);
- test_sev(guest, type, policy);
+ test_sev(guest, shape, policy | SEV_POLICY_NO_DBG);
+ test_sev(guest, shape, policy);
- if (type == KVM_X86_SEV_VM)
+ if (shape.type == KVM_X86_SEV_VM)
return;
- test_sev_shutdown(type, policy);
+ test_sev_shutdown(shape, policy);
if (kvm_has_cap(KVM_CAP_XCRS) &&
(xgetbv(0) & kvm_cpu_supported_xcr0() & xf_mask) == xf_mask) {
- test_sync_vmsa(type, policy);
- if (type == KVM_X86_SNP_VM)
- test_sync_vmsa(type, policy | SNP_POLICY_DBG);
+ test_sync_vmsa(shape, policy);
+ if (shape.type == KVM_X86_SNP_VM)
+ test_sync_vmsa(shape, policy | SNP_POLICY_DBG);
else
- test_sync_vmsa(type, policy | SEV_POLICY_NO_DBG);
+ test_sync_vmsa(shape, policy | SEV_POLICY_NO_DBG);
}
}
@@ -247,13 +247,13 @@ int main(int argc, char *argv[])
{
TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV));
- test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+ test_sev_smoke(guest_sev_code, VM_SHAPE_SEV, 0);
if (kvm_cpu_has(X86_FEATURE_SEV_ES))
- test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
+ test_sev_smoke(guest_sev_es_code, VM_SHAPE_SEV_ES, SEV_POLICY_ES);
if (kvm_cpu_has(X86_FEATURE_SEV_SNP))
- test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+ test_sev_smoke(guest_snp_code, VM_SHAPE_SNP, snp_default_policy());
return 0;
}
base-commit: e49bb0b5e1e3a8d7783bc7222c02cc6ff90fa2aa
--
^ permalink raw reply related
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-16 16:13 UTC (permalink / raw)
To: Sean Christopherson
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar, Shuah Khan,
Oliver Upton, Jeremiah McReynolds, kvm, linux-coco, linux-kernel,
x86
In-Reply-To: <ajFfb9u6dU47Nj3v@google.com>
>
> [...snip...]
>
>>
>> I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
>> finalizing the VM.
>
> That's literally why the function exists though. The one and only existing
> implementation (on arm64) uses it to initialize the vGIC.
>
> void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
> {
> if (vm->arch.has_gic)
> __vgic_v3_init(vm->arch.gic_fd);
> }
>
> That's *very* similar to the proposed TDX usage, where some per-VM asset(s) can
> be initialized/frozen only after all vCPUs have been added. In other words, the
> name is describing where in the VM creation/setup flow the hook is called, and
> perhaps more importantly, the impact of the call: vCPUs are finalized (obviously
> with a different definition of "finalized" based on the VM properties).
>
>> I would prefer to not have to explicitly call some function like
>> kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
>> calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
>> for test setup flexibility, without too much magic.
>
> We can't have our cake and eat it too. Either we launch/finalize SEV/TDX VMs as
> part of the common VM creation flows (as proposed for TDX), or we force tests to
> manually launch/finalize the VM after additional setup. The only way to have it
> both ways is to utilize crazy macro shenanigans to execute arbitrary code between
> creating the VM and launching/finalizing the VM, and even I would balk at that.
>
> I honestly don't see any reason to try to figure out which of the two approaches
> is optimal at this time. Whatever decision we make isn't set in stone, and in
> fact should be relative easy to change. And without being able to see all the
> TDX/SEV tests that are waiting in the wings, it's more or less impossible to make
> an informed decision.
>
> I definitely want to have SEV and TDX use the same core approach in the end, but
> I don't think we should force the issue right now, because the cost of reworking
> the SEV and/or TDX infrastructure when there are only a bare handful of tests is
> so low. It will cost more to try to predict the future of a 50/50 outcome than
> it will to make a completely wild guess between the two options and rework things
> if we guess wrong.
>
Makes sense. I'm good with merging this as it is done in this
patch. Thanks :)
>> For now, I can't think of many uses of __shared. ucall shared memory is
>> allocated dynamically, and we can also make it shared cleanly within
>> ucall code.
>
> Uh, every selftest that uses global variables to communicate between guest and
> host?
>
>> The global variables (sync_global_to_guest()) will appear in the guest
>> as long as sync_global_to_guest() is called before
>> kvm_arch_vm_finalize(), which I think makes sense to people writing
>> tests for CoCo.
>
> Yes, but that's completely orthogonal to all of this.
>
>> So 2 questions to push this along:
>>
>> 1. What do you think of a kvm_arch_vm_finalize() that calls
>> vm_sev_launch() and tdx_vm_finalize()? My key issue is that
>> kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
>> rather than the whole VM.
>
> Key word "seems". I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
> the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
> 'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
> finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.
>
> We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
> we'll need to decide between automagic vs. manual finalization, and it definitely
> should be a separate discussion.
>
This definitely should not block this series.
It's coming together for me now with your explanation:
kvm_arch_vm_finalize_vcpus() actually means finalizing vCPUs! vGIC ==
Virtual Generic Interrupt Controller, which has to be done after all the
vCPUs are set up. Since the name is describing where in the VM
creation/setup flow the hook is called (after creating VM and after
creating vCPUs), maybe something like kvm_arch_vm_post_vcpu_create()?
Renaming this to kvm_arch_vm_finalize() makes it sound like it is
finalizing the VM, but this function shouldn't finalize the VM since for
CoCo finalizing the VM also loads the guest image into the guest - deals
with memory, not just vCPUs.
8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for 'default'
VMs") also includes a test_disable_default_vgic() function, we could
also use something like that to skip CoCo VM finalization for some
tests? Maybe that's a good middle ground.
>> 3. Would you like __shared implemented together with this series, as a
>> prerequisite, or later?
>
> Only if __shared is a hard requirement for base TDX support, which I assume is
> not the case.
Yup!
^ permalink raw reply
* SVSM Development Call June 17th, 2026
From: Jörg Rödel @ 2026-06-16 16:10 UTC (permalink / raw)
To: coconut-svsm, linux-coco
Hi,
Here is the call for agenda items for this weeks SVSM development call. Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.
We will use the LF Zoom instance. Details of the meeting can be found in our
governance repository at:
https://github.com/coconut-svsm/governance
The link to the COCONUT-SVSM calendar is:
https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week
The meeting will be recorded and the recording eventually published.
Regards,
Jörg
^ permalink raw reply
* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Xu Yilun @ 2026-06-16 15:19 UTC (permalink / raw)
To: Dave Hansen
Cc: Dan Williams (nvidia), kas, rick.p.edgecombe, x86, peter.fang,
linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
zhenzhong.duan, xiaoyao.li, aneesh.kumar, aik
In-Reply-To: <1fc76b45-a25b-41a7-b7fa-06650c8052fb@intel.com>
On Mon, Jun 15, 2026 at 08:57:09AM -0700, Dave Hansen wrote:
> On 6/15/26 08:22, Xu Yilun wrote:
> >> The TDX "Extension SEAMCALL" capability is akin to ARM CCA's "Stateful
> >> RMI Operations (SRO)", and achieves similar externalized complexity
> >> relief as a dedicated hardware coprocessor like AMD SEV-SNP. The
> > I may not include the ARM/AMD examples, not sure I can explain them
> > well.
>
> I actually think they're pretty important proof points. One of the big
OK, I can include this section that Dan provides.
> challenges as a maintainer evaluating these things is judging the
> solution itself.
>
> Is this architecture a good one? Is it overly complex? Are the avenues
> for simplification?
>
> If five vendors pop up all with similar problems and solutions, then
> it's a pretty good bet that they're all on the right track. But, if
> there are four going one direction and one going off by itself, it's a
> sign that the errant one might need a course correction.
>
> It would honestly be worth your time to go *talk* to the AMD and ARM
> folks and ensure that you are all on the same page. Last I checked, they
Yes, I queried ARM/AMD TDISP folks offline and CCed them in this thread.
Correct me if anything wrong:
AFAIK, AMD firmware run on an external physical core (PSP), firmware call
execution won't occupy host CPU, and the two partners communicate
asynchronously, so no worry about interruptibility and preemptibility.
From Alexey:
"The AMD CPU puts a request in a queue, writes to doorbell, and wait for
an interrupt. The PSP (a separate physical core) will see this, handle,
put the data in the CPU memory (if needed), trigger the interrupt. Done.
The host CPU can be rescheduled while waiting"
ARM SRO is something I don't familiar with. ARM has no co-processor for
CC, host invokes RMI and trap into RMM for secure execution, stateless
RMI blocks interrupt so should be short lived. This is very similar to
Intel SEAMCALL.
Stateful RMI, however, from their RMM 2.0bet1 SPEC [1] B4.3.2 Stateful
RMI operations, could be used "When an RMI operation cannot be completed
within an IMPLEMENTATION DEFINED time limit". It is "guaranteed to yield
within an IMPLEMENTATION DEFINED time limit from the point at which an
interrupt becomes pending." I see it tries to solve the same problem as
extension SEAMCALLs.
I see SRO is WIP in [2], and is used for TDISP [3].
[1] https://developer.arm.com/documentation/den0137/2-0bet1/
[2] https://lore.kernel.org/all/20260318155413.793430-49-steven.price@arm.com/
[3] https://lore.kernel.org/all/20260427065121.916615-3-aneesh.kumar@kernel.org/
^ permalink raw reply
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-16 14:36 UTC (permalink / raw)
To: Ackerley Tng
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgHjno=XDnMpTvZcNCR6D5RAi1rbynrL4CnnEa7Y-M3VLg@mail.gmail.com>
On Mon, Jun 15, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
> > On Fri, Jun 05, 2026, Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@google.com> writes:
> >> Many tests use vm_userspace_mem_region_add(), CoCo tests that require
> >> finalizing shouldn't be disallowed that option.
> >
> > What does that have to do with finalizing the VM?
>
> I could add more memslots after finalizing the VM, but then I'd have to
> make the guest accept more memory. Hence, I'd rather set up all the
> memslots and then finalize the VM.
Why? It's not like the performance of this code matters terribly.
Regardless, nothing *requires* a test to go down that route. As I said before,
my goal with the selftests infrastructure is to effectively optimize the entire
experience, i.e. to provide default behavior that works well for the majority of
tests. Attempting to provide default behavior that perfectly fits *every* test
is simply impossible.
> sev_smoke_test currently has a separate vm_sev_launch(), which is the
> equivalent of tdx_init_mem_region(), and that's called in the tests
> themselves.
>
> sev_smoke_test also uses vcpu_args_set() after creating the VM with
> vm_create_shape_with_one_vcpu(). Would that work if vm_sev_launch() got
> moved into kvm_arch_vm_finalize_vcpus()?
No, because vCPU state would be encrypted at that point. Moving vm_sev_launch()
would also break test_sev_dbg(), which sets a global buffer to all 0xaa prior to
encrypting that data.
> >> >> It's also possible to have some kvm_vm_finalize() call that can be
> >> >> explicitly and manually invoked from selftests just for CoCo selftests.
> >> >
> >> > Why bother? It's obviously possible to all kvm_arch_vm_finalize_vcpus() directly.
> >>
> >> Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
> >> is the right function where the TD is finalized?
> >>
> >> For tests that need to do more setup after creating a vm, is the only
> >> way out to call __vm_create() then vm_vcpu_add() to avoid premature
> >> finalization in __vm_create_with_vcpus() when
> >> kvm_arch_vm_finalize_vcpus() is called?
> >
> > Depends on what you're doing. Sometimes, the answer will be yes. That's why
> > there are "low level" APIs, so that some tests can do fancy things, while most
> > tests can leave the details to the infrastructure.
> >
> > If there's a recurring problem, or we anticipate one, then we can and should
> > figure out how to minimize the pain so that tests don't have to deal with the
> > same boilerplate issues over and over. Hence the __shared idea.
>
> I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
> finalizing the VM.
That's literally why the function exists though. The one and only existing
implementation (on arm64) uses it to initialize the vGIC.
void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
{
if (vm->arch.has_gic)
__vgic_v3_init(vm->arch.gic_fd);
}
That's *very* similar to the proposed TDX usage, where some per-VM asset(s) can
be initialized/frozen only after all vCPUs have been added. In other words, the
name is describing where in the VM creation/setup flow the hook is called, and
perhaps more importantly, the impact of the call: vCPUs are finalized (obviously
with a different definition of "finalized" based on the VM properties).
> I would prefer to not have to explicitly call some function like
> kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
> calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
> for test setup flexibility, without too much magic.
We can't have our cake and eat it too. Either we launch/finalize SEV/TDX VMs as
part of the common VM creation flows (as proposed for TDX), or we force tests to
manually launch/finalize the VM after additional setup. The only way to have it
both ways is to utilize crazy macro shenanigans to execute arbitrary code between
creating the VM and launching/finalizing the VM, and even I would balk at that.
I honestly don't see any reason to try to figure out which of the two approaches
is optimal at this time. Whatever decision we make isn't set in stone, and in
fact should be relative easy to change. And without being able to see all the
TDX/SEV tests that are waiting in the wings, it's more or less impossible to make
an informed decision.
I definitely want to have SEV and TDX use the same core approach in the end, but
I don't think we should force the issue right now, because the cost of reworking
the SEV and/or TDX infrastructure when there are only a bare handful of tests is
so low. It will cost more to try to predict the future of a 50/50 outcome than
it will to make a completely wild guess between the two options and rework things
if we guess wrong.
> For now, I can't think of many uses of __shared. ucall shared memory is
> allocated dynamically, and we can also make it shared cleanly within
> ucall code.
Uh, every selftest that uses global variables to communicate between guest and
host?
> The global variables (sync_global_to_guest()) will appear in the guest
> as long as sync_global_to_guest() is called before
> kvm_arch_vm_finalize(), which I think makes sense to people writing
> tests for CoCo.
Yes, but that's completely orthogonal to all of this.
> So 2 questions to push this along:
>
> 1. What do you think of a kvm_arch_vm_finalize() that calls
> vm_sev_launch() and tdx_vm_finalize()? My key issue is that
> kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
> rather than the whole VM.
Key word "seems". I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.
We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
we'll need to decide between automagic vs. manual finalization, and it definitely
should be a separate discussion.
> 3. Would you like __shared implemented together with this series, as a
> prerequisite, or later?
Only if __shared is a hard requirement for base TDX support, which I assume is
not the case.
^ permalink raw reply
* Re: [PATCH] PCI/TSM: Resume device to D0 for CMA-SPDM operation
From: Jonathan Cameron @ 2026-06-16 12:57 UTC (permalink / raw)
To: Lukas Wunner
Cc: Dan Williams, Ashish Kalra, Tom Lendacky, Vivaik Balasubrawmanian,
John Allen, Bjorn Helgaas, linux-coco, linux-pci,
Aneesh Kumar K.V, Yilun Xu, Zhenzhong Duan, Alexey Kardashevskiy
In-Reply-To: <7bdfaf14d7e5a466f3f650150c688a60e947a7a9.1781527060.git.lukas@wunner.de>
On Mon, 15 Jun 2026 15:19:30 +0200
Lukas Wunner <lukas@wunner.de> wrote:
> Per PCIe r7.0 sec 6.31.3, CMA-SPDM operation in non-D0 states is optional.
> The spec does not define a way to determine if it's supported, so resume
> to D0 unconditionally for the duration of a CMA-SPDM exchange. Vivaik has
> talked to Windows engineers and they said that Windows does the same.
>
> Note that for plain DOE operation, it is sufficient for the device to be
> in D3hot and its parents in D0 because config space remains accessible in
> D3hot. So CMA-SPDM goes beyond the requirements of plain DOE and hence
> resuming to D0 needs to (only) be done in code paths which use DOE
> specifically for CMA-SPDM.
>
> The pattern used herein for runtime resume is the best practice introduced
> by commit ef8057b07c72 ("PM: runtime: Wrapper macros for ACQUIRE()/
> ACQUIRE_ERR()").
>
> Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v6.19+
> Cc: Vivaik Balasubrawmanian <vivaik.balasubrawmanian@intel.com>
Seems reasonable to me and your replies to sashiko stuff seem to have that well
covered. So FWIW
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
^ permalink raw reply
* Re: [PATCH v13 08/22] KVM: selftests: Add TDX boot code
From: Chenyi Qiang @ 2026-06-16 9:21 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-8-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Erdem Aktas <erdemaktas@google.com>
>
> Add code to boot a TDX test VM. Since TDX registers are inaccessible to
> KVM, the boot code loads the relevant values from memory into the
> registers before jumping to the guest code.
>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Signed-off-by: Erdem Aktas <erdemaktas@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
> tools/testing/selftests/kvm/Makefile.kvm | 1 +
> .../selftests/kvm/include/x86/tdx/td_boot.h | 5 ++
> .../selftests/kvm/include/x86/tdx/td_boot_asm.h | 16 ++++++
> tools/testing/selftests/kvm/lib/x86/tdx/td_boot.S | 60 ++++++++++++++++++++++
> 4 files changed, 82 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> index 02fad7b35eac..929965ca4b75 100644
> --- a/tools/testing/selftests/kvm/Makefile.kvm
> +++ b/tools/testing/selftests/kvm/Makefile.kvm
> @@ -31,6 +31,7 @@ LIBKVM_x86 += lib/x86/sev.c
> LIBKVM_x86 += lib/x86/svm.c
> LIBKVM_x86 += lib/x86/ucall.c
> LIBKVM_x86 += lib/x86/vmx.c
> +LIBKVM_x86 += lib/x86/tdx/td_boot.S
>
> LIBKVM_arm64 += lib/arm64/gic.c
> LIBKVM_arm64 += lib/arm64/gic_v3.c
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h b/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> index af4474dee387..e5d54a20ed72 100644
> --- a/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> @@ -66,4 +66,9 @@ struct td_boot_parameters {
> struct td_per_vcpu_parameters per_vcpu[];
> };
>
> +void td_boot(void);
> +void td_boot_code_end(void);
> +
> +#define TD_BOOT_CODE_SIZE (td_boot_code_end - td_boot)
> +
> #endif /* SELFTEST_TDX_TD_BOOT_H */
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h b/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
> new file mode 100644
> index 000000000000..10b4b527595c
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef SELFTEST_TDX_TD_BOOT_ASM_H
> +#define SELFTEST_TDX_TD_BOOT_ASM_H
> +
> +/*
> + * GPA where TD boot parameters will be loaded.
> + *
> + * TD_BOOT_PARAMETERS_GPA is arbitrarily chosen to
> + *
> + * + be within the 4GB address space
> + * + provide enough contiguous memory for the struct td_boot_parameters such
> + * that there is one struct td_per_vcpu_parameters for KVM_MAX_VCPUS
> + */
> +#define TD_BOOT_PARAMETERS_GPA 0xffff0000
> +
> +#endif // SELFTEST_TDX_TD_BOOT_ASM_H
It would be better to maintain consistency by using /* ... */ for single-line comments in this series, such as the SELFTESTS_TDX_TDX_H in patch 20 and other license Identifier.
^ permalink raw reply
* Re: [PATCH v13 01/22] KVM: selftests: Add macros to simplify creating VM shapes for non-default types
From: Xiaoyao Li @ 2026-06-16 8:57 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-1-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Sean Christopherson<seanjc@google.com>
>
> Add VM_TYPE() and __VM_TYPE() macros to create a vm_shape structure given
> a type (and mode), and use the macros to define VM_SHAPE_{SEV,SEV_ES,SNP}
> shapes for x86's SEV family of VM shapes. Providing common infrastructure
> will avoid having to copy+paste vm_sev_create_with_one_vcpu() for TDX.
>
> Use the new SEV+ shapes and drop vm_sev_create_with_one_vcpu().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson<seanjc@google.com>
> Signed-off-by: Sagi Shahar<sagis@google.com>
> Reviewed-by: Binbin Wu<binbin.wu@linux.intel.com>
> Reviewed-by: Ira Weiny<ira.weiny@intel.com>
> Signed-off-by: Lisa Wang<wyihan@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> tools/testing/selftests/kvm/include/kvm_util.h | 13 +++++++
> .../testing/selftests/kvm/include/x86/processor.h | 4 +++
> tools/testing/selftests/kvm/include/x86/sev.h | 2 --
> tools/testing/selftests/kvm/lib/x86/sev.c | 16 ---------
> tools/testing/selftests/kvm/x86/sev_smoke_test.c | 40 +++++++++++-----------
> 5 files changed, 37 insertions(+), 38 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index dc70c6da63fa..041bdbfb93f7 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -233,6 +233,19 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
> shape; \
> })
>
> +#define __VM_TYPE(__mode, __type) \
It seems the name "__VM_SHAPE" fits better?
> +({ \
> + struct vm_shape shape = { \
> + .mode = (__mode), \
> + .type = (__type) \
> + }; \
> + \
> + shape; \
> +})
> +
> +#define VM_TYPE(__type) \
> + __VM_TYPE(VM_MODE_DEFAULT, __type)
and I think making it one line would be OK?
So something on top:
---8<---
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h
b/tools/testing/selftests/kvm/include/kvm_util.h
index 041bdbfb93f7..a1b5d2029d05 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -223,17 +223,7 @@ kvm_static_assert(sizeof(struct vm_shape) ==
sizeof(u64));
#define VM_TYPE_DEFAULT 0
-#define VM_SHAPE(__mode) \
-({ \
- struct vm_shape shape = { \
- .mode = (__mode), \
- .type = VM_TYPE_DEFAULT \
- }; \
- \
- shape; \
-})
-
-#define __VM_TYPE(__mode, __type) \
+#define __VM_SHAPE(__mode, __type) \
({ \
struct vm_shape shape = { \
.mode = (__mode), \
@@ -243,8 +233,8 @@ kvm_static_assert(sizeof(struct vm_shape) ==
sizeof(u64));
shape; \
})
-#define VM_TYPE(__type) \
- __VM_TYPE(VM_MODE_DEFAULT, __type)
+#define VM_SHAPE(__mode) __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
+#define VM_TYPE(__type) __VM_SHAPE(VM_MODE_DEFAULT, __type)
extern enum vm_guest_mode vm_mode_default;
^ permalink raw reply related
* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: K Prateek Nayak @ 2026-06-16 7:27 UTC (permalink / raw)
To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <de274c2fb3f794ff1f19f0c96184ee50d04d1282.1781419998.git.ashish.kalra@amd.com>
Hello Ashish,
On 6/16/2026 1:19 AM, Ashish Kalra wrote:
> + /*
> + * RMPOPT scans the RMP table, stores the result of the scan in the
> + * reserved processor memory. The RMP scan is the most expensive
> + * part. If a second RMPOPT occurs, it can skip the expensive scan
> + * if they can see a cached result in the reserved processor memory.
> + *
> + * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
> + * on every other primary thread. Followers are "designed to"
> + * skip the scan if they see the "cached" scan results.
> + */
> + cpumask_copy(follower_mask, &rmpopt_cpumask);
rmpopt_cpumask is constructed after hotplug is disabled but ...
> +
> + /*
> + * Pin the worker to the current CPU for the leader loop so that
> + * this_cpu remains valid and the RMPOPT instruction executes on
> + * the correct CPU.
> + *
> + * Use migrate_disable() rather than get_cpu() to prevent
> + * migration while still allowing preemption.
> + */
> + migrate_disable();
> + this_cpu = smp_processor_id();
> +
> + if (cpumask_test_cpu(this_cpu, follower_mask)) {
> + /*
> + * Current CPU is a primary thread in rmpopt_cpumask.
> + * Run leader locally and remove from follower mask.
> + */
> + cpumask_clear_cpu(this_cpu, follower_mask);
> +
> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> + rmpopt(pa);
> + cond_resched();
> + }
> + } else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
> + follower_mask)) {
> + /*
> + * Current CPU is a sibling thread whose primary is in
> + * rmpopt_cpumask. RMPOPT_BASE MSR is per-core, so it
> + * is safe to run the leader locally. Remove the sibling's
> + * primary from the follower mask as this core is already
> + * covered by the leader.
> + */
> + cpumask_andnot(follower_mask, follower_mask,
> + topology_sibling_cpumask(this_cpu));
> +
> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> + rmpopt(pa);
> + cond_resched();
> + }
> + } else {
> + /*
> + * Current CPU does not have RMPOPT_BASE MSR programmed.
> + * Pick an explicit leader from the cpumask to avoid #UD.
> + * Use work_on_cpu() to run in process context on the leader,
> + * avoiding IPI latency.
> + */
... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
siblings on "rmpopt_cpumask".
How does that happen?
> + int leader_cpu = cpumask_first(follower_mask);
> +
> + if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
> + migrate_enable();
> + goto out;
> + }
> +
> + cpumask_clear_cpu(leader_cpu, follower_mask);
> +
> + /* Release migration pin before work_on_cpu(). */
> + migrate_enable();
> +
> + work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);
This creates a delayed work and also waits for it to finish execution
which will add more latency than a simple IPI if the comment about IPI
latency above is accurate.
I think there is some corner case in construction of the
"rmpopt_cpumask" that requires this not-so-pretty else block. Can you
elaborate why this is required?
Perhaps the "rmpopt_cpumask" construction needs:
for_each_online_cpu(cpu) {
/* Nominate the first CPU on the sibling mask for RMPOPT */
if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
continue;
cpumask_set_cpu(cpu, &rmpopt_cpumask);
}
and all you need here is:
/* Do RMPOPt for local core */
for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
rmpopt(pa);
/* Skip this core from concurrent RMPOPT */
cpumask_and_not(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));
No?
> +
> + goto followers;
> + }
> +
> + migrate_enable();
> +
--
Thanks and Regards,
Prateek
^ permalink raw reply
* Re: [PATCH v8 2/7] x86/sev: Initialize RMPOPT configuration MSRs
From: K Prateek Nayak @ 2026-06-16 6:03 UTC (permalink / raw)
To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <6a4d0ec9e037d91c0fdba9c525942ca288e1b1b2.1781419998.git.ashish.kalra@amd.com>
Hello Ashish,
On 6/16/2026 1:18 AM, Ashish Kalra wrote:
> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
> index 8bcdce98f6dc..1b5c18408f0b 100644
> --- a/arch/x86/virt/svm/sev.c
> +++ b/arch/x86/virt/svm/sev.c
> @@ -124,6 +124,10 @@ static void *rmp_bookkeeping __ro_after_init;
>
> static u64 probed_rmp_base, probed_rmp_size;
>
> +static cpumask_t rmpopt_cpumask;
nit.
I believe you can use cpumask_var_t here and do a zalloc_cpumask_var()
during snp_setup_rmpopt(). That way !X86_FEATURE_RMPOPT configs don't
have to needlessly waste space to keep a redundant cpumask around.
Same comment for rmpopt_report_cpumask in Patch 7 which can be
allocated dynamically during rmpopt_debugfs_setup().
> +static phys_addr_t rmpopt_pa_start;
> +static bool rmpopt_configured;
> +
> static LIST_HEAD(snp_leaked_pages_list);
> static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
>
> @@ -490,7 +494,12 @@ static bool __init setup_rmptable(void)
> if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
> if (!setup_segmented_rmptable())
> return false;
> + rmpopt_configured = true;
> } else {
> + /*
> + * RMPOPT requires a segmented RMP table, so leave
> + * rmpopt_configured clear on contiguous RMP systems.
> + */
> if (!setup_contiguous_rmptable())
> return false;
> }
> @@ -555,6 +564,21 @@ int snp_prepare(void)
> }
> EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
>
> +static void rmpopt_cleanup(void)
> +{
> + int cpu;
> +
> + cpus_read_lock();
nit.
You can use guard(cpus_read_lock)() unless there is a complicated
locking pattern where you need to drop and re-acquire the read lock.
> +
> + for_each_cpu(cpu, &rmpopt_cpumask)
> + WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0));
> +
> + cpus_read_unlock();
> +
> + cpumask_clear(&rmpopt_cpumask);
> + rmpopt_pa_start = 0;
> +}
> +
> void snp_shutdown(void)
> {
> u64 syscfg;
--
Thanks and Regards,
Prateek
^ permalink raw reply
* Re: [PATCH] PCI/TSM: fix use-after-free in find_dsm_dev()
From: Lukas Wunner @ 2026-06-16 3:16 UTC (permalink / raw)
To: Wentao Liang; +Cc: djbw, bhelgaas, linux-coco, linux-pci, linux-kernel, stable
In-Reply-To: <20260616030243.1661791-1-vulab@iscas.ac.cn>
On Tue, Jun 16, 2026 at 03:02:43AM +0000, Wentao Liang wrote:
> In find_dsm_dev(), pf0 is obtained via pf0_dev_get() which returns a
> reference-counted pointer. It is declared with __free(pci_dev_put),
> so pci_dev_put() will be called when the variable goes out of scope.
> Returning 'pf0' directly while it still has __free cleanup causes the
> reference to be dropped before the caller can use the pointer, leading
> to a use-after-free.
No, the code comment preceding find_dsm_dev() explicitly states:
"Note that no additional reference is held for the resulting device
because that resulting object always has a registered lifetime
greater-than-or-equal to that of the @pdev argument."
Your patch looks like it may be an LLM-generated hallucination.
Did you use an LLM to come up with the patch? If so, please use
an Assisted-by tag per Documentation/process/coding-assistants.rst
so that we know to expect hallucinations.
Thanks,
Lukas
^ permalink raw reply
* [PATCH] PCI/TSM: fix use-after-free in find_dsm_dev()
From: Wentao Liang @ 2026-06-16 3:02 UTC (permalink / raw)
To: djbw, bhelgaas; +Cc: linux-coco, linux-pci, linux-kernel, Wentao Liang, stable
In find_dsm_dev(), pf0 is obtained via pf0_dev_get() which returns a
reference-counted pointer. It is declared with __free(pci_dev_put),
so pci_dev_put() will be called when the variable goes out of scope.
Returning 'pf0' directly while it still has __free cleanup causes the
reference to be dropped before the caller can use the pointer, leading
to a use-after-free.
Fix by using return no_free_ptr(pf0) to suppress the automatic
cleanup and properly transfer ownership to the caller.
Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
drivers/pci/tsm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/tsm.c b/drivers/pci/tsm.c
index 5fdcd7f2e820..dd4e0cb0c6aa 100644
--- a/drivers/pci/tsm.c
+++ b/drivers/pci/tsm.c
@@ -670,7 +670,7 @@ static struct pci_dev *find_dsm_dev(struct pci_dev *pdev)
return NULL;
if (is_dsm(pf0))
- return pf0;
+ return no_free_ptr(pf0);
/*
* For cases where a switch may be hosting TDISP services on behalf of
--
2.34.1
^ permalink raw reply related
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-16 0:26 UTC (permalink / raw)
To: Sean Christopherson
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <aiM2Nm9Eh68mIVX3@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Fri, Jun 05, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> >> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
>> >> instead?
>> >>
>> >> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
>> >> __vm_create_with_vcpus().
>> >>
>> >> While building this POC to test conversions [1] I only wanted to create
>> >> the vm and vcpus and didn't want to finalize yet, since I still needed
>> >> to do more mappings in the guest (and I needed the vm pointer to do
>> >> mappings in the guest).
>> >
>> > Hmm, I would argue this is a flaw in the selftests infrastructure. IMO, as a
>> > developer, it's quite surprising that the current value of a global variable
>> > doesn't show up in the VM automagically. I totally understand why selftests
>> > work that way, but it's certainly odd and annoying. If _that_ were solved, then
>> > the kludginess of what you're doing goes away.
>> >
>> > The other way this could be solved is by adding support for annotating globals
>> > with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
>> > into the VM can automatically mark the associated globals' pages as shared.
>> >
>>
>> More generally, is your opinion that tests should not have to add extra
>> memslots?
>
> I don't care? What I care about is making it as easy and intuitive as possible
> for people to write tests, and to minimize maintenance costs.
>
>> If I wanted a shared page, would I have to do
>>
>> static __shared test_page[4096] = {0};
>>
>> and then rely on ELF loading to put that in the guest for me? Are there
>> some compiler flags/how will I require that test_page be page aligned?
>
> Compilere and linker shenanigans.
>
>> If I mark 10 globals as __shared, would the compiler automatically
>> consolidate the shared memory together?
>
> Yes, follow the __bss_decrypted breadcrumbs.
>
> #define __bss_decrypted __section(".bss..decrypted")
>
>> I think it's a bit constraining to require that all guest memory be set
>> up statically. It's nice to have but I'd like another option...
>
> You do have options, they just require more work.
>
>> Many tests use vm_userspace_mem_region_add(), CoCo tests that require
>> finalizing shouldn't be disallowed that option.
>
> What does that have to do with finalizing the VM?
>
I could add more memslots after finalizing the VM, but then I'd have to
make the guest accept more memory. Hence, I'd rather set up all the
memslots and then finalize the VM.
sev_smoke_test currently has a separate vm_sev_launch(), which is the
equivalent of tdx_init_mem_region(), and that's called in the tests
themselves.
sev_smoke_test also uses vcpu_args_set() after creating the VM with
vm_create_shape_with_one_vcpu(). Would that work if vm_sev_launch() got
moved into kvm_arch_vm_finalize_vcpus()?
>> >> It's also possible to have some kvm_vm_finalize() call that can be
>> >> explicitly and manually invoked from selftests just for CoCo selftests.
>> >
>> > Why bother? It's obviously possible to all kvm_arch_vm_finalize_vcpus() directly.
>>
>> Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
>> is the right function where the TD is finalized?
>>
>> For tests that need to do more setup after creating a vm, is the only
>> way out to call __vm_create() then vm_vcpu_add() to avoid premature
>> finalization in __vm_create_with_vcpus() when
>> kvm_arch_vm_finalize_vcpus() is called?
>
> Depends on what you're doing. Sometimes, the answer will be yes. That's why
> there are "low level" APIs, so that some tests can do fancy things, while most
> tests can leave the details to the infrastructure.
>
> If there's a recurring problem, or we anticipate one, then we can and should
> figure out how to minimize the pain so that tests don't have to deal with the
> same boilerplate issues over and over. Hence the __shared idea.
I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
finalizing the VM.
I would prefer to not have to explicitly call some function like
kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
for test setup flexibility, without too much magic.
For now, I can't think of many uses of __shared. ucall shared memory is
allocated dynamically, and we can also make it shared cleanly within
ucall code.
The global variables (sync_global_to_guest()) will appear in the guest
as long as sync_global_to_guest() is called before
kvm_arch_vm_finalize(), which I think makes sense to people writing
tests for CoCo.
So 2 questions to push this along:
1. What do you think of a kvm_arch_vm_finalize() that calls
vm_sev_launch() and tdx_vm_finalize()? My key issue is that
kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
rather than the whole VM.
3. Would you like __shared implemented together with this series, as a
prerequisite, or later?
^ permalink raw reply
* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Lisa Wang @ 2026-06-16 0:04 UTC (permalink / raw)
To: Binbin Wu
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <eab3711d-c5b1-4132-92e4-7f7040fcd17b@linux.intel.com>
On Mon, Jun 08, 2026 at 04:03:23PM +0800, Binbin Wu wrote:
> On 5/22/2026 7:16 AM, Lisa Wang wrote:
> > + * exactly like other memory providers.
> > */
> > - flags = 0;
> > - if (is_guest_memfd_required(shape))
> > + if (is_guest_memfd_required(shape)) {
> > flags |= KVM_MEM_GUEST_MEMFD;
> > + gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> > + }
> >
> > - vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> > + vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
>
> The build failed due to this:
>
> lib/kvm_util.c: In function ‘__vm_create’:
> lib/kvm_util.c:507:9: error: too many arguments to function ‘vm_mem_add’
> 507 | vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
> | ^~~~~~~~~~
> In file included from lib/kvm_util.c:9:
> include/kvm_util.h:714:6: note: declared here
> 714 | void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
> | ^~~~~~~~~~
> lib/kvm_util.c: At top level:
> cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
>
> It seems the patch set doesn't wire gmem_flags parameter to vm_mem_add().
This series need to based on "KVM: selftests: Add support for mmap() on
guest_memfd in core library" for the gmem_flags parameter.
Lisa
> > for (i = 0; i < NR_MEM_REGIONS; i++)
> > vm->memslots[i] = 0;
> >
> >
>
^ permalink raw reply
* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Ackerley Tng @ 2026-06-15 23:46 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-13-6983ae4c3a4d@google.com>
Lisa Wang <wyihan@google.com> writes:
> Set the initial state of the first memory region as shared if it is
> backed by guest_memfd, so that the KVM selftest framework functions can
> populate mmap()-ed guest_memfd memory the same way memory from other
> memory providers are populated.
>
> For CoCo VMs, pages that need to be private are explicitly set to
> private before executing the VM.
>
>
> [...snip...]
>
> @@ -495,14 +497,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
> vm = ____vm_create(shape);
>
> /*
> - * Force GUEST_MEMFD for the primary memory region if necessary, e.g.
> - * for CoCo VMs that require GUEST_MEMFD backed private memory.
> + * Force GUEST_MEMFD for the primary memory region if necessary, and
> + * initialize it as shared so the selftest framework can populate it
> + * exactly like other memory providers.
> */
> - flags = 0;
> - if (is_guest_memfd_required(shape))
> + if (is_guest_memfd_required(shape)) {
> flags |= KVM_MEM_GUEST_MEMFD;
> + gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> + }
>
Just noticed this while hacking some SNP tests.
> - vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> + vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
> for (i = 0; i < NR_MEM_REGIONS; i++)
> vm->memslots[i] = 0;
>
>
> --
> 2.54.0.746.g67dd491aae-goog
I think this patch should fully buy into in-place conversions, so we
need to also set GUEST_MEMFD_FLAG_MMAP:
@@ -483,6 +483,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape,
u32 nr_runnable_vcpus,
{
u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
nr_extra_pages);
+ enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS;
struct userspace_mem_region *slot0;
u64 gmem_flags = 0;
struct kvm_vm *vm;
@@ -503,10 +504,16 @@ struct kvm_vm *__vm_create(struct vm_shape
shape, u32 nr_runnable_vcpus,
*/
if (is_guest_memfd_required(shape)) {
flags |= KVM_MEM_GUEST_MEMFD;
- gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+ gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED | GUEST_MEMFD_FLAG_MMAP;
+ /*
+ * TODO: Clean this up together with vm_mem_add(). Use
+ * VM_MEM_SRC_SHMEM to tell vm_mem_add() to mmap
+ * guest_memfd with MAP_SHARED.
+ */
+ src_type = VM_MEM_SRC_SHMEM;
}
- vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0,
gmem_flags);
+ vm_mem_add(vm, src_type, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
for (i = 0; i < NR_MEM_REGIONS; i++)
vm->memslots[i] = 0;
The VM_MEM_SRC_SHMEM is a bit of a hack but imo that refactor should go
with another series to clean up vm_mem_add().
The above code was working for TDX because vm_mem_add() without the
guest_memfd MMAP flag would leave region->fd as -1. Without
setting src_type to VM_MEM_SRC_ANONYMOUS, the assertion in vm_mem_add()
region->fd == -1 || backing_src_is_shared(src_type)
would pass.
tdx_init_mem_region() was called with the anonymously mmap()-ed address,
which turns out fine.
With the above change, we would also need to call tdx_init_mem_region()
with NULL for source_address.
^ permalink raw reply
* Re: [PATCH v13 03/22] KVM: selftests: Initialize the TDX VM
From: Lisa Wang @ 2026-06-15 23:33 UTC (permalink / raw)
To: Binbin Wu
Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
Sean Christopherson, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <a58e2941-77f9-43cf-a54d-023506dd7eb0@linux.intel.com>
On Mon, Jun 08, 2026 at 01:57:24PM +0800, Binbin Wu wrote:
[...snip...]
> + } tdx_cmd = { .c = { \
> + .id = (cmd), \
> + .flags = (u32)(_flags), \
> > + .data = (u64)(arg), \
>
> Nit:
> The two lines' backslashes are misaligned.
>
> > +
> > +#define tdx_vm_ioctl(vm, cmd, flags, arg) \
> > +({ \
> > + int ret = __tdx_vm_ioctl(vm, cmd, flags, arg); \
>
> tdx_cmd.c.hw_error is u64 and it could be assigned to ret, which is a int,
> the upper bits could be truncated if the upper 32-bit is set.
Thanks. Will fix these in the next version.
Lisa
> > + \
> > + __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, vm); \
> > +})
> > +
> > +void tdx_init_vm(struct kvm_vm *vm, u64 attributes);
> > +
> > #endif /* SELFTESTS_TDX_TDX_UTIL_H */[...]
^ permalink raw reply
* [PATCH v8 7/7] x86/sev: Add debugfs support for RMPOPT
From: Ashish Kalra @ 2026-06-15 19:50 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
Add a debugfs interface to report per-CPU RMPOPT status across all
system RAM.
To dump the per-CPU RMPOPT status for all system RAM:
/sys/kernel/debug/rmpopt# cat rmpopt-table
Memory @ 0GB: CPU(s): none
Memory @ 1GB: CPU(s): none
Memory @ 2GB: CPU(s): 0-1023
Memory @ 3GB: CPU(s): 0-1023
Memory @ 4GB: CPU(s): none
Memory @ 5GB: CPU(s): 0-1023
Memory @ 6GB: CPU(s): 0-1023
Memory @ 7GB: CPU(s): 0-1023
...
Memory @1025GB: CPU(s): 0-1023
Memory @1026GB: CPU(s): 0-1023
Memory @1027GB: CPU(s): 0-1023
Memory @1028GB: CPU(s): 0-1023
Memory @1029GB: CPU(s): 0-1023
Memory @1030GB: CPU(s): 0-1023
Memory @1031GB: CPU(s): 0-1023
Memory @1032GB: CPU(s): 0-1023
Memory @1033GB: CPU(s): 0-1023
Memory @1034GB: CPU(s): 0-1023
Memory @1035GB: CPU(s): 0-1023
Memory @1036GB: CPU(s): 0-1023
Memory @1037GB: CPU(s): 0-1023
Memory @1038GB: CPU(s): none
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/virt/svm/sev.c | 128 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 128 insertions(+)
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 253a534b9a0d..b8b00c50ce41 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -20,6 +20,8 @@
#include <linux/amd-iommu.h>
#include <linux/nospec.h>
#include <linux/workqueue.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
#include <asm/sev.h>
#include <asm/processor.h>
@@ -145,6 +147,15 @@ static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
static unsigned long snp_nr_leaked_pages;
+/* All users of rmpopt_report_cpumask must hold rmpopt_show_mutex. */
+static cpumask_t rmpopt_report_cpumask;
+static struct dentry *rmpopt_debugfs;
+static DEFINE_MUTEX(rmpopt_show_mutex);
+
+struct seq_paddr {
+ phys_addr_t next_seq_paddr;
+};
+
#undef pr_fmt
#define pr_fmt(fmt) "SEV-SNP: " fmt
@@ -587,6 +598,8 @@ static void rmpopt_cleanup(void)
cancel_delayed_work_sync(&rmpopt_delayed_work);
destroy_workqueue(rmpopt_wq);
+ debugfs_remove_recursive(rmpopt_debugfs);
+ rmpopt_debugfs = NULL;
cpus_read_lock();
@@ -635,6 +648,10 @@ static inline bool __rmpopt(u64 pa_start, u64 op_type)
: "a" (pa_start), "c" (op_type)
: "memory", "cc");
+ if (op_type == RMPOPT_FUNC_REPORT_STATUS)
+ assign_cpu(smp_processor_id(), &rmpopt_report_cpumask,
+ optimized);
+
return optimized;
}
@@ -669,6 +686,115 @@ static long rmpopt_leader_fn(void *arg)
return 0;
}
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_report_status(void *val)
+{
+ u64 pa_start = ALIGN_DOWN((u64)val, SZ_1G);
+ u64 op_type = RMPOPT_FUNC_REPORT_STATUS;
+
+ __rmpopt(pa_start, op_type);
+}
+
+/*
+ * start() can be called multiple times if allocated buffer has overflowed
+ * and bigger buffer is allocated.
+ */
+static void *rmpopt_table_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ phys_addr_t end_paddr = rmpopt_pa_end;
+ struct seq_paddr *p = seq->private;
+
+ if (*pos == 0) {
+ p->next_seq_paddr = rmpopt_pa_start;
+ if (p->next_seq_paddr >= end_paddr)
+ return NULL;
+ return &p->next_seq_paddr;
+ }
+
+ if (p->next_seq_paddr >= end_paddr)
+ return NULL;
+
+ return &p->next_seq_paddr;
+}
+
+static void *rmpopt_table_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ phys_addr_t end_paddr = rmpopt_pa_end;
+ phys_addr_t *curr_paddr = v;
+
+ (*pos)++;
+ *curr_paddr += SZ_1G;
+ if (*curr_paddr >= end_paddr)
+ return NULL;
+
+ return curr_paddr;
+}
+
+static void rmpopt_table_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static int rmpopt_table_seq_show(struct seq_file *seq, void *v)
+{
+ phys_addr_t *curr_paddr = v;
+
+ guard(mutex)(&rmpopt_show_mutex);
+
+ seq_printf(seq, "Memory @%3lluGB: ",
+ *curr_paddr >> (get_order(SZ_1G) + PAGE_SHIFT));
+
+ /*
+ * Query all online CPUs rather than just rmpopt_cpumask (primary
+ * threads only). The RMPOPT instruction only needs to run on one
+ * thread per core for the optimization to take effect, but debugfs
+ * reporting requires the RMPOPT status across all CPUs.
+ * Performance is not a concern for this diagnostic interface.
+ *
+ * This is safe because RMPOPT_BASE MSR is per-core and
+ * snp_prepare() ensures all CPUs are online when the MSR is
+ * programmed during snp_setup_rmpopt().
+ */
+ cpumask_clear(&rmpopt_report_cpumask);
+ on_each_cpu_mask(cpu_online_mask, rmpopt_report_status,
+ (void *)*curr_paddr, true);
+
+ if (cpumask_empty(&rmpopt_report_cpumask))
+ seq_puts(seq, "CPU(s): none\n");
+ else
+ seq_printf(seq, "CPU(s): %*pbl\n", cpumask_pr_args(&rmpopt_report_cpumask));
+
+ return 0;
+}
+
+static const struct seq_operations rmpopt_table_seq_ops = {
+ .start = rmpopt_table_seq_start,
+ .next = rmpopt_table_seq_next,
+ .stop = rmpopt_table_seq_stop,
+ .show = rmpopt_table_seq_show
+};
+
+static int rmpopt_table_open(struct inode *inode, struct file *file)
+{
+ return seq_open_private(file, &rmpopt_table_seq_ops, sizeof(struct seq_paddr));
+}
+
+static const struct file_operations rmpopt_table_fops = {
+ .open = rmpopt_table_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+static void rmpopt_debugfs_setup(void)
+{
+ rmpopt_debugfs = debugfs_create_dir("rmpopt", arch_debugfs_dir);
+
+ debugfs_create_file("rmpopt-table", 0400, rmpopt_debugfs,
+ NULL, &rmpopt_table_fops);
+}
+
/*
* RMPOPT optimizations skip RMP checks at 1GB granularity if this
* range of memory does not contain any SNP guest memory.
@@ -874,6 +1000,8 @@ void snp_setup_rmpopt(void)
* optimizations on all physical memory.
*/
queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
+
+ rmpopt_debugfs_setup();
}
EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
--
2.43.0
^ permalink raw reply related
* [PATCH v8 6/7] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ashish Kalra @ 2026-06-15 19:50 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
Pages are converted from shared to private as SNP guests are launched.
This destroys exisiting RMPOPT optimizations in the regions where
pages are converted.
Conversely, guest pages are converted back to shared during SNP guest
termination and their region may become eligible for RMPOPT
optimization.
To take advantage of this, perform RMPOPT after guest termination.
Do it after a delay so that a single RMPOPT pass can be done if
multiple guests terminate in a short period of time.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/kvm/svm/sev.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e107f368ed2d..29af6f6e603c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3005,6 +3005,8 @@ void sev_vm_destroy(struct kvm *kvm)
*/
if (snp_decommission_context(kvm))
return;
+
+ snp_rmpopt_all_physmem();
} else {
sev_unbind_asid(kvm, sev->handle);
}
--
2.43.0
^ permalink raw reply related
* [PATCH v8 5/7] x86/sev: Add interface to re-enable RMP optimizations.
From: Ashish Kalra @ 2026-06-15 19:49 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
RMPOPT table is a per-CPU table which indicates if 1GB regions of
physical memory are entirely hypervisor-owned or not.
When performing host memory accesses in hypervisor mode as well as
non-SNP guest mode, the processor may consult the RMPOPT table to
potentially skip an RMP access and improve performance.
Events such as RMPUPDATE can clear RMP optimizations. Add an interface
to re-enable those optimizations.
The interface uses mod_delayed_work() instead of queue_delayed_work()
so that the delay timer is reset on each call. This provides proper
batching semantics: re-optimization runs 10 seconds after the *last*
VM termination rather than after the first. mod_delayed_work() also
re-queues work that is already in-flight, so a re-scan request
during an active scan is not silently dropped.
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/include/asm/sev.h | 2 ++
arch/x86/virt/svm/sev.c | 15 +++++++++++++++
2 files changed, 17 insertions(+)
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0d662221615a..a11306f25336 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
__snp_leak_pages(pfn, pages, true);
}
int snp_prepare(void);
+void snp_rmpopt_all_physmem(void);
void snp_setup_rmpopt(void);
void snp_clear_rmpopt_configured(void);
void snp_shutdown(void);
@@ -682,6 +683,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
static inline void kdump_sev_callback(void) { }
static inline void snp_fixup_e820_tables(void) {}
static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_rmpopt_all_physmem(void) {}
static inline void snp_setup_rmpopt(void) {}
static inline void snp_clear_rmpopt_configured(void) {}
static inline void snp_shutdown(void) {}
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index b63b639bfc30..253a534b9a0d 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -782,6 +782,21 @@ static void rmpopt_work_handler(struct work_struct *work)
free_cpumask_var(follower_mask);
}
+void snp_rmpopt_all_physmem(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
+ return;
+
+ guard(mutex)(&rmpopt_wq_mutex);
+
+ if (!rmpopt_wq)
+ return;
+
+ mod_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
+ msecs_to_jiffies(RMPOPT_WORK_TIMEOUT));
+}
+EXPORT_SYMBOL_GPL(snp_rmpopt_all_physmem);
+
void snp_setup_rmpopt(void)
{
u64 rmpopt_base;
--
2.43.0
^ permalink raw reply related
* [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ashish Kalra @ 2026-06-15 19:49 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
When SEV-SNP is enabled, all writes to memory are checked to ensure
integrity of SNP guest memory. This imposes performance overhead on the
whole system.
RMPOPT is a new instruction that minimizes the performance overhead of
RMP checks on the hypervisor and on non-SNP guests by allowing RMP
checks to be skipped for 1GB regions of memory that are known not to
contain any SEV-SNP guest memory.
Add support for performing RMP optimizations asynchronously using a
dedicated workqueue.
Enable RMPOPT optimizations for up to 2TB of system RAM starting from
the lowest physical memory address aligned down to a 1GB boundary at
RMP initialization time. RMP checks can initially be skipped for 1GB
memory ranges that do not contain SEV-SNP guest memory (excluding
preassigned pages such as the RMP table and firmware pages). As SNP
guests are launched, RMPUPDATE will disable the corresponding RMPOPT
optimizations.
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/virt/svm/sev.c | 230 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 227 insertions(+), 3 deletions(-)
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 1b5c18408f0b..b63b639bfc30 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -19,6 +19,7 @@
#include <linux/iommu.h>
#include <linux/amd-iommu.h>
#include <linux/nospec.h>
+#include <linux/workqueue.h>
#include <asm/sev.h>
#include <asm/processor.h>
@@ -125,9 +126,20 @@ static void *rmp_bookkeeping __ro_after_init;
static u64 probed_rmp_base, probed_rmp_size;
static cpumask_t rmpopt_cpumask;
-static phys_addr_t rmpopt_pa_start;
+static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
static bool rmpopt_configured;
+enum rmpopt_function {
+ RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
+ RMPOPT_FUNC_REPORT_STATUS
+};
+
+#define RMPOPT_WORK_TIMEOUT 10000
+
+static struct workqueue_struct *rmpopt_wq;
+static struct delayed_work rmpopt_delayed_work;
+static DEFINE_MUTEX(rmpopt_wq_mutex);
+
static LIST_HEAD(snp_leaked_pages_list);
static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -568,6 +580,14 @@ static void rmpopt_cleanup(void)
{
int cpu;
+ guard(mutex)(&rmpopt_wq_mutex);
+
+ if (!rmpopt_wq)
+ return;
+
+ cancel_delayed_work_sync(&rmpopt_delayed_work);
+ destroy_workqueue(rmpopt_wq);
+
cpus_read_lock();
for_each_cpu(cpu, &rmpopt_cpumask)
@@ -576,7 +596,8 @@ static void rmpopt_cleanup(void)
cpus_read_unlock();
cpumask_clear(&rmpopt_cpumask);
- rmpopt_pa_start = 0;
+ rmpopt_pa_start = rmpopt_pa_end = 0;
+ rmpopt_wq = NULL;
}
void snp_shutdown(void)
@@ -599,6 +620,168 @@ void snp_clear_rmpopt_configured(void)
rmpopt_configured = false;
}
+/*
+ * RMPOPT: F2 0F 01 FC
+ * Input: RAX = system physical address (1GB aligned)
+ * RCX = operation type
+ * Output: CF set if the range was optimized
+ */
+static inline bool __rmpopt(u64 pa_start, u64 op_type)
+{
+ bool optimized;
+
+ asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
+ : "=@ccc" (optimized)
+ : "a" (pa_start), "c" (op_type)
+ : "memory", "cc");
+
+ return optimized;
+}
+
+static void rmpopt(u64 pa)
+{
+ u64 pa_start = ALIGN_DOWN(pa, SZ_1G);
+ u64 op_type = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+ __rmpopt(pa_start, op_type);
+}
+
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_smp(void *val)
+{
+ rmpopt((u64)val);
+}
+
+/*
+ * Leader function for work_on_cpu(): runs the full RMPOPT scan in
+ * process context on a CPU that has RMPOPT_BASE MSR programmed.
+ */
+static long rmpopt_leader_fn(void *arg)
+{
+ phys_addr_t pa;
+
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+ rmpopt(pa);
+ cond_resched();
+ }
+ return 0;
+}
+
+/*
+ * RMPOPT optimizations skip RMP checks at 1GB granularity if this
+ * range of memory does not contain any SNP guest memory.
+ */
+static void rmpopt_work_handler(struct work_struct *work)
+{
+ cpumask_var_t follower_mask;
+ phys_addr_t pa;
+ int this_cpu;
+
+ pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
+ rmpopt_pa_start, rmpopt_pa_end);
+
+ if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
+ return;
+
+ /*
+ * RMPOPT scans the RMP table, stores the result of the scan in the
+ * reserved processor memory. The RMP scan is the most expensive
+ * part. If a second RMPOPT occurs, it can skip the expensive scan
+ * if they can see a cached result in the reserved processor memory.
+ *
+ * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
+ * on every other primary thread. Followers are "designed to"
+ * skip the scan if they see the "cached" scan results.
+ */
+ cpumask_copy(follower_mask, &rmpopt_cpumask);
+
+ /*
+ * Pin the worker to the current CPU for the leader loop so that
+ * this_cpu remains valid and the RMPOPT instruction executes on
+ * the correct CPU.
+ *
+ * Use migrate_disable() rather than get_cpu() to prevent
+ * migration while still allowing preemption.
+ */
+ migrate_disable();
+ this_cpu = smp_processor_id();
+
+ if (cpumask_test_cpu(this_cpu, follower_mask)) {
+ /*
+ * Current CPU is a primary thread in rmpopt_cpumask.
+ * Run leader locally and remove from follower mask.
+ */
+ cpumask_clear_cpu(this_cpu, follower_mask);
+
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+ rmpopt(pa);
+ cond_resched();
+ }
+ } else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
+ follower_mask)) {
+ /*
+ * Current CPU is a sibling thread whose primary is in
+ * rmpopt_cpumask. RMPOPT_BASE MSR is per-core, so it
+ * is safe to run the leader locally. Remove the sibling's
+ * primary from the follower mask as this core is already
+ * covered by the leader.
+ */
+ cpumask_andnot(follower_mask, follower_mask,
+ topology_sibling_cpumask(this_cpu));
+
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+ rmpopt(pa);
+ cond_resched();
+ }
+ } else {
+ /*
+ * Current CPU does not have RMPOPT_BASE MSR programmed.
+ * Pick an explicit leader from the cpumask to avoid #UD.
+ * Use work_on_cpu() to run in process context on the leader,
+ * avoiding IPI latency.
+ */
+ int leader_cpu = cpumask_first(follower_mask);
+
+ if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
+ migrate_enable();
+ goto out;
+ }
+
+ cpumask_clear_cpu(leader_cpu, follower_mask);
+
+ /* Release migration pin before work_on_cpu(). */
+ migrate_enable();
+
+ work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);
+
+ goto followers;
+ }
+
+ migrate_enable();
+
+followers:
+ /*
+ * Followers: run RMPOPT on remaining cores.
+ * CPU hotplug is disabled while SNP is active
+ * (cpu_hotplug_disable() in __sev_snp_init_locked()),
+ * so cpus_read_lock() is uncontended.
+ */
+ cpus_read_lock();
+ for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+ on_each_cpu_mask(follower_mask, rmpopt_smp,
+ (void *)pa, true);
+
+ /* Give a chance for other threads to run */
+ cond_resched();
+ }
+ cpus_read_unlock();
+
+out:
+ free_cpumask_var(follower_mask);
+}
+
void snp_setup_rmpopt(void)
{
u64 rmpopt_base;
@@ -607,11 +790,37 @@ void snp_setup_rmpopt(void)
if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
return;
+ guard(mutex)(&rmpopt_wq_mutex);
+
+ /*
+ * Guard against re-initialization. When SNP_SHUTDOWN_EX is issued
+ * with x86_snp_shutdown=0, snp_shutdown() is not called and
+ * rmpopt_cleanup() is skipped, but snp_initialized is still cleared.
+ * A subsequent __sev_snp_init_locked() would call snp_setup_rmpopt()
+ * again, leaking the existing workqueue, delayed work, debugfs
+ * entries, and cpumask state.
+ */
+ if (rmpopt_wq)
+ return;
+
+ /*
+ * Create an RMPOPT-specific workqueue to avoid scheduling
+ * RMPOPT workitem on the global system workqueue.
+ */
+ rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
+ if (!rmpopt_wq) {
+ pr_err("Failed to allocate RMPOPT workqueue\n");
+ return;
+ }
+
+ INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
+
cpus_read_lock();
/*
* The RMPOPT_BASE MSR is per-core, so only one thread per core needs
- * to set up the RMPOPT_BASE MSR.
+ * to set up the RMPOPT_BASE MSR. Likewise, only one thread per core
+ * needs to issue the RMPOPT instruction.
*
* Note: only online primary threads are included. If a core's
* primary thread is offline, that core is not covered. CPU hotplug
@@ -635,6 +844,21 @@ void snp_setup_rmpopt(void)
WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base));
cpus_read_unlock();
+
+ rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+
+ /* Limit memory scanning to 2TB of RAM */
+ if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T) {
+ pr_info("RMPOPT coverage limited to 2TB; memory above 0x%llx not optimized\n",
+ rmpopt_pa_start + SZ_2T);
+ rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
+ }
+
+ /*
+ * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
+ * optimizations on all physical memory.
+ */
+ queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
}
EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
--
2.43.0
^ permalink raw reply related
* [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: Ashish Kalra @ 2026-06-15 19:49 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
The SEV firmware enumerates the CPUs at SNP initialization and is not
aware of the OS bringing CPUs online or offline afterwards, so OS CPU
hotplug can diverge from the firmware's expectations and break SNP.
Disable CPU hotplug while SNP is active.
SNP is fully torn down only on the SNP_SHUTDOWN_EX x86_snp_shutdown
path; the legacy path leaves SNP enabled in hardware while clearing
snp_initialized, so __sev_snp_init_locked() can run again. Track the
disable with a flag so it is balanced by a matching enable rather than
stacked, and re-enable hotplug only on the x86_snp_shutdown path, after
snp_shutdown() has cleared the per-core RMPOPT_BASE MSRs with hotplug
still disabled.
This also keeps the CPU set stable for the asynchronous RMPOPT scan
added later in this series, and ensures cpus_read_lock() in the scan
is uncontended.
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
drivers/crypto/ccp/sev-dev.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 217b6b19802e..c8c3c577463c 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -106,6 +106,9 @@ struct snp_hv_fixed_pages_entry {
static LIST_HEAD(snp_hv_fixed_pages);
+/* Set while SNP has CPU hotplug disabled. */
+static bool snp_cpu_hotplug_disabled;
+
/* Trusted Memory Region (TMR):
* The TMR is a 1MB area that must be 1MB aligned. Use the page allocator
* to allocate the memory, which will return aligned memory for the specified
@@ -1479,6 +1482,17 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+ /*
+ * Disable CPU hotplug while SNP is active. Guard against stacking
+ * the disable count: the legacy SNP_SHUTDOWN_EX path clears
+ * snp_initialized without re-enabling hotplug, so this can run
+ * again while hotplug is already disabled.
+ */
+ if (!snp_cpu_hotplug_disabled) {
+ cpu_hotplug_disable();
+ snp_cpu_hotplug_disabled = true;
+ }
+
snp_setup_rmpopt();
sev->snp_initialized = true;
@@ -2083,8 +2097,21 @@ static int __sev_snp_shutdown_locked(int *error, bool panic)
}
if (data.x86_snp_shutdown) {
- if (!panic)
+ if (!panic) {
snp_shutdown();
+ /*
+ * snp_shutdown() fully tears SNP down (clear_rmp()) and
+ * has already cleared the per-core RMPOPT_BASE MSRs via
+ * rmpopt_cleanup() with hotplug still disabled. Re-enable
+ * CPU hotplug now. On the legacy path SNP stays
+ * enabled in hardware, so hotplug is correctly left
+ * disabled.
+ */
+ if (snp_cpu_hotplug_disabled) {
+ cpu_hotplug_enable();
+ snp_cpu_hotplug_disabled = false;
+ }
+ }
snp_hv_fixed_pages_state_update(sev, ALLOCATED);
} else {
/*
--
2.43.0
^ permalink raw reply related
* [PATCH v8 2/7] x86/sev: Initialize RMPOPT configuration MSRs
From: Ashish Kalra @ 2026-06-15 19:48 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>
From: Ashish Kalra <ashish.kalra@amd.com>
The new RMPOPT instruction helps manage per-CPU RMP optimization
structures inside the CPU. It takes a 1GB-aligned physical address
and either returns the status of the optimizations or tries to enable
the optimizations.
Per-CPU RMPOPT tables support at most 2 TB of addressable memory for
RMP optimizations.
Initialize the per-CPU RMPOPT table base to the starting physical
address. This enables RMP optimization for up to 2 TB of system RAM on
all CPUs.
Additionally, add support to setup and enable RMPOPT once SNP is
enabled and initialized.
Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/coco/core.c | 2 +
arch/x86/include/asm/msr-index.h | 3 ++
arch/x86/include/asm/sev.h | 4 ++
arch/x86/virt/svm/sev.c | 70 ++++++++++++++++++++++++++++++++
drivers/crypto/ccp/sev-dev.c | 3 ++
5 files changed, 82 insertions(+)
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 989ca9f72ba3..8c1393ddc5df 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -16,6 +16,7 @@
#include <asm/archrandom.h>
#include <asm/coco.h>
#include <asm/processor.h>
+#include <asm/sev.h>
enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
SYM_PIC_ALIAS(cc_vendor);
@@ -172,6 +173,7 @@ static void amd_cc_platform_clear(enum cc_attr attr)
switch (attr) {
case CC_ATTR_HOST_SEV_SNP:
cc_flags.host_sev_snp = 0;
+ snp_clear_rmpopt_configured();
break;
default:
break;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 86554de9a3f5..28540744f1eb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -761,6 +761,9 @@
#define MSR_AMD64_SEG_RMP_ENABLED_BIT 0
#define MSR_AMD64_SEG_RMP_ENABLED BIT_ULL(MSR_AMD64_SEG_RMP_ENABLED_BIT)
#define MSR_AMD64_RMP_SEGMENT_SHIFT(x) (((x) & GENMASK_ULL(13, 8)) >> 8)
+#define MSR_AMD64_RMPOPT_BASE 0xc0010139
+#define MSR_AMD64_RMPOPT_ENABLE_BIT 0
+#define MSR_AMD64_RMPOPT_ENABLE BIT_ULL(MSR_AMD64_RMPOPT_ENABLE_BIT)
#define MSR_SVSM_CAA 0xc001f000
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..0d662221615a 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,8 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
__snp_leak_pages(pfn, pages, true);
}
int snp_prepare(void);
+void snp_setup_rmpopt(void);
+void snp_clear_rmpopt_configured(void);
void snp_shutdown(void);
#else
static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -680,6 +682,8 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
static inline void kdump_sev_callback(void) { }
static inline void snp_fixup_e820_tables(void) {}
static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_setup_rmpopt(void) {}
+static inline void snp_clear_rmpopt_configured(void) {}
static inline void snp_shutdown(void) {}
#endif
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 8bcdce98f6dc..1b5c18408f0b 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -124,6 +124,10 @@ static void *rmp_bookkeeping __ro_after_init;
static u64 probed_rmp_base, probed_rmp_size;
+static cpumask_t rmpopt_cpumask;
+static phys_addr_t rmpopt_pa_start;
+static bool rmpopt_configured;
+
static LIST_HEAD(snp_leaked_pages_list);
static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -490,7 +494,12 @@ static bool __init setup_rmptable(void)
if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
if (!setup_segmented_rmptable())
return false;
+ rmpopt_configured = true;
} else {
+ /*
+ * RMPOPT requires a segmented RMP table, so leave
+ * rmpopt_configured clear on contiguous RMP systems.
+ */
if (!setup_contiguous_rmptable())
return false;
}
@@ -555,6 +564,21 @@ int snp_prepare(void)
}
EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
+static void rmpopt_cleanup(void)
+{
+ int cpu;
+
+ cpus_read_lock();
+
+ for_each_cpu(cpu, &rmpopt_cpumask)
+ WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0));
+
+ cpus_read_unlock();
+
+ cpumask_clear(&rmpopt_cpumask);
+ rmpopt_pa_start = 0;
+}
+
void snp_shutdown(void)
{
u64 syscfg;
@@ -563,11 +587,57 @@ void snp_shutdown(void)
if (syscfg & MSR_AMD64_SYSCFG_SNP_EN)
return;
+ rmpopt_cleanup();
+
clear_rmp();
on_each_cpu(mfd_reconfigure, NULL, 1);
}
EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
+void snp_clear_rmpopt_configured(void)
+{
+ rmpopt_configured = false;
+}
+
+void snp_setup_rmpopt(void)
+{
+ u64 rmpopt_base;
+ int cpu;
+
+ if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
+ return;
+
+ cpus_read_lock();
+
+ /*
+ * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
+ * to set up the RMPOPT_BASE MSR.
+ *
+ * Note: only online primary threads are included. If a core's
+ * primary thread is offline, that core is not covered. CPU hotplug
+ * is not currently supported with SNP enabled.
+ */
+
+ for_each_online_cpu(cpu)
+ if (topology_is_primary_thread(cpu))
+ cpumask_set_cpu(cpu, &rmpopt_cpumask);
+
+ rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+ rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
+
+ /*
+ * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
+ * for RMP optimizations. Initialize the per-CPU RMPOPT table base
+ * to the starting physical address to enable RMP optimizations for
+ * up to 2 TB of system RAM on all CPUs.
+ */
+ for_each_cpu(cpu, &rmpopt_cpumask)
+ WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base));
+
+ cpus_read_unlock();
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
+
/*
* Do the necessary preparations which are verified by the firmware as
* described in the SNP_INIT_EX firmware command description in the SNP
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 78f98aee7a66..217b6b19802e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1478,6 +1478,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
}
snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+
+ snp_setup_rmpopt();
+
sev->snp_initialized = true;
dev_dbg(sev->dev, "SEV-SNP firmware initialized, SEV-TIO is %s\n",
data.tio_en ? "enabled" : "disabled");
--
2.43.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox