Linux Confidential Computing Development
* SVSM Development Call June 17th, 2026
From: Jörg Rödel @ 2026-06-16 16:10 UTC
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this week's SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

We will use the LF Zoom instance. Details of the meeting can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg


* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Xu Yilun @ 2026-06-16 15:19 UTC
  To: Dave Hansen
  Cc: Dan Williams (nvidia), kas, rick.p.edgecombe, x86, peter.fang,
	linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li, aneesh.kumar, aik
In-Reply-To: <1fc76b45-a25b-41a7-b7fa-06650c8052fb@intel.com>

On Mon, Jun 15, 2026 at 08:57:09AM -0700, Dave Hansen wrote:
> On 6/15/26 08:22, Xu Yilun wrote:
> >> The TDX "Extension SEAMCALL" capability is akin to ARM CCA's "Stateful
> >> RMI Operations (SRO)", and achieves similar externalized complexity
> >> relief as a dedicated hardware coprocessor like AMD SEV-SNP. The
> > I may not include the ARM/AMD examples, not sure I can explain them
> > well.
> 
> I actually think they're pretty important proof points. One of the big

OK, I can include this section that Dan provides.

> challenges as a maintainer evaluating these things is judging the
> solution itself.
> 
> Is this architecture a good one? Is it overly complex? Are there avenues
> for simplification?
> 
> If five vendors pop up all with similar problems and solutions, then
> it's a pretty good bet that they're all on the right track. But, if
> there are four going one direction and one going off by itself, it's a
> sign that the errant one might need a course correction.
> 
> It would honestly be worth your time to go *talk* to the AMD and ARM
> folks and ensure that you are all on the same page. Last I checked, they

Yes, I queried the ARM/AMD TDISP folks offline and CCed them on this thread.
Please correct me if anything is wrong:

AFAIK, AMD firmware runs on a separate physical core (the PSP), so firmware
call execution doesn't occupy a host CPU. The two sides communicate
asynchronously, so there is no concern about interruptibility or preemptibility.

From Alexey:

  "The AMD CPU puts a request in a queue, writes to doorbell, and wait for
   an interrupt. The PSP (a separate physical core) will see this, handle,
   put the data in the CPU memory (if needed), trigger the interrupt. Done.
   The host CPU can be rescheduled while waiting"


I'm not familiar with ARM SRO. ARM has no co-processor for CC: the host
invokes an RMI and traps into the RMM for secure execution. A stateless RMI
blocks interrupts, so it should be short-lived. This is very similar to an
Intel SEAMCALL.

However, according to section B4.3.2 "Stateful RMI operations" of the RMM
2.0bet1 spec [1], a stateful RMI can be used "When an RMI operation cannot be
completed within an IMPLEMENTATION DEFINED time limit". It is "guaranteed to
yield within an IMPLEMENTATION DEFINED time limit from the point at which an
interrupt becomes pending." So it appears to solve the same problem as
extension SEAMCALLs.

I see SRO is WIP in [2], and is used for TDISP [3].

[1] https://developer.arm.com/documentation/den0137/2-0bet1/
[2] https://lore.kernel.org/all/20260318155413.793430-49-steven.price@arm.com/
[3] https://lore.kernel.org/all/20260427065121.916615-3-aneesh.kumar@kernel.org/


* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-16 14:36 UTC
  To: Ackerley Tng
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgHjno=XDnMpTvZcNCR6D5RAi1rbynrL4CnnEa7Y-M3VLg@mail.gmail.com>

On Mon, Jun 15, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Fri, Jun 05, 2026, Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@google.com> writes:
> >> Many tests use vm_userspace_mem_region_add(), CoCo tests that require
> >> finalizing shouldn't be disallowed that option.
> >
> > What does that have to do with finalizing the VM?
> 
> I could add more memslots after finalizing the VM, but then I'd have to
> make the guest accept more memory. Hence, I'd rather set up all the
> memslots and then finalize the VM.

Why?  It's not like the performance of this code matters terribly.

Regardless, nothing *requires* a test to go down that route.  As I said before,
my goal with the selftests infrastructure is to effectively optimize the entire
experience, i.e. to provide default behavior that works well for the majority of
tests.  Attempting to provide default behavior that perfectly fits *every* test
is simply impossible.

> sev_smoke_test currently has a separate vm_sev_launch(), which is the
> equivalent of tdx_init_mem_region(), and that's called in the tests
> themselves.
> 
> sev_smoke_test also uses vcpu_args_set() after creating the VM with
> vm_create_shape_with_one_vcpu(). Would that work if vm_sev_launch() got
> moved into kvm_arch_vm_finalize_vcpus()?

No, because vCPU state would be encrypted at that point.  Moving vm_sev_launch()
would also break test_sev_dbg(), which sets a global buffer to all 0xaa prior to
encrypting that data.

> >> >> It's also possible to have some kvm_vm_finalize() call that can be
> >> >> explicitly and manually invoked from selftests just for CoCo selftests.
> >> >
> >> > Why bother?  It's obviously possible to call kvm_arch_vm_finalize_vcpus() directly.
> >>
> >> Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
> >> is the right function where the TD is finalized?
> >>
> >> For tests that need to do more setup after creating a vm, is the only
> >> way out to call __vm_create() then vm_vcpu_add() to avoid premature
> >> finalization in __vm_create_with_vcpus() when
> >> kvm_arch_vm_finalize_vcpus() is called?
> >
> > Depends on what you're doing.  Sometimes, the answer will be yes.  That's why
> > there are "low level" APIs, so that some tests can do fancy things, while most
> > tests can leave the details to the infrastructure.
> >
> > If there's a recurring problem, or we anticipate one, then we can and should
> > figure out how to minimize the pain so that tests don't have to deal with the
> > same boilerplate issues over and over.  Hence the __shared idea.
> 
> I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
> finalizing the VM.

That's literally why the function exists though.  The one and only existing
implementation (on arm64) uses it to initialize the vGIC.

  void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
  {
	if (vm->arch.has_gic)
		__vgic_v3_init(vm->arch.gic_fd);
  }

That's *very* similar to the proposed TDX usage, where some per-VM asset(s) can
be initialized/frozen only after all vCPUs have been added.  In other words, the
name is describing where in the VM creation/setup flow the hook is called, and
perhaps more importantly, the impact of the call: vCPUs are finalized (obviously
with a different definition of "finalized" based on the VM properties).
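
As an illustration of that ordering, here is a minimal userspace sketch. It
is not selftests code: struct model_vm and the model_*() helpers are
hypothetical names, and the rule that no vCPU can be added after the hook has
run is an assumption of the model.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a VM; not the selftests' struct kvm_vm. */
struct model_vm {
	int nr_vcpus;
	bool finalized;
	int frozen_vcpus;	/* per-VM state captured by the arch hook */
};

/* Adding a vCPU is only allowed before the VM's vCPUs are finalized. */
int model_vcpu_add(struct model_vm *vm)
{
	if (vm->finalized)
		return -1;
	vm->nr_vcpus++;
	return 0;
}

/* Arch hook: runs once, after all vCPUs exist, and freezes per-VM state. */
void model_arch_finalize_vcpus(struct model_vm *vm)
{
	vm->frozen_vcpus = vm->nr_vcpus;
	vm->finalized = true;
}

/* Common creation flow: add every vCPU first, then call the hook last. */
struct model_vm model_create_with_vcpus(int nr_vcpus)
{
	struct model_vm vm = { 0 };

	for (int i = 0; i < nr_vcpus; i++)
		model_vcpu_add(&vm);

	model_arch_finalize_vcpus(&vm);
	return vm;
}
```

A test that needs extra setup between creation and finalization would, in
this model, call model_vcpu_add() and model_arch_finalize_vcpus() itself
instead of using model_create_with_vcpus().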

> I would prefer to not have to explicitly call some function like
> kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
> calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
> for test setup flexibility, without too much magic.

We can't have our cake and eat it too.  Either we launch/finalize SEV/TDX VMs as
part of the common VM creation flows (as proposed for TDX), or we force tests to
manually launch/finalize the VM after additional setup.  The only way to have it
both ways is to utilize crazy macro shenanigans to execute arbitrary code between
creating the VM and launching/finalizing the VM, and even I would balk at that.

I honestly don't see any reason to try to figure out which of the two approaches
is optimal at this time.  Whatever decision we make isn't set in stone, and in
fact should be relatively easy to change.  And without being able to see all the
TDX/SEV tests that are waiting in the wings, it's more or less impossible to make
an informed decision.

I definitely want to have SEV and TDX use the same core approach in the end, but
I don't think we should force the issue right now, because the cost of reworking
the SEV and/or TDX infrastructure when there are only a bare handful of tests is
so low.  It will cost more to try to predict the future of a 50/50 outcome than
it will to make a completely wild guess between the two options and rework things
if we guess wrong.

> For now, I can't think of many uses of __shared. ucall shared memory is
> allocated dynamically, and we can also make it shared cleanly within
> ucall code.

Uh, every selftest that uses global variables to communicate between guest and
host?

> The global variables (sync_global_to_guest()) will appear in the guest
> as long as sync_global_to_guest() is called before
> kvm_arch_vm_finalize(), which I think makes sense to people writing
> tests for CoCo.

Yes, but that's completely orthogonal to all of this.

> So 2 questions to push this along:
> 
> 1. What do you think of a kvm_arch_vm_finalize() that calls
>    vm_sev_launch() and tdx_vm_finalize()? My key issue is that
>    kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
>    rather than the whole VM.

Key word "seems".  I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.

We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
we'll need to decide between automagic vs. manual finalization, and it definitely
should be a separate discussion.

> 2. Would you like __shared implemented together with this series, as a
>    prerequisite, or later?

Only if __shared is a hard requirement for base TDX support, which I assume is
not the case.


* Re: [PATCH] PCI/TSM: Resume device to D0 for CMA-SPDM operation
From: Jonathan Cameron @ 2026-06-16 12:57 UTC
  To: Lukas Wunner
  Cc: Dan Williams, Ashish Kalra, Tom Lendacky, Vivaik Balasubrawmanian,
	John Allen, Bjorn Helgaas, linux-coco, linux-pci,
	Aneesh Kumar K.V, Yilun Xu, Zhenzhong Duan, Alexey Kardashevskiy
In-Reply-To: <7bdfaf14d7e5a466f3f650150c688a60e947a7a9.1781527060.git.lukas@wunner.de>

On Mon, 15 Jun 2026 15:19:30 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> Per PCIe r7.0 sec 6.31.3, CMA-SPDM operation in non-D0 states is optional.
> The spec does not define a way to determine if it's supported, so resume
> to D0 unconditionally for the duration of a CMA-SPDM exchange.  Vivaik has
> talked to Windows engineers and they said that Windows does the same.
> 
> Note that for plain DOE operation, it is sufficient for the device to be
> in D3hot and its parents in D0 because config space remains accessible in
> D3hot.  So CMA-SPDM goes beyond the requirements of plain DOE and hence
> resuming to D0 needs to (only) be done in code paths which use DOE
> specifically for CMA-SPDM.
> 
> The pattern used herein for runtime resume is the best practice introduced
> by commit ef8057b07c72 ("PM: runtime: Wrapper macros for ACQUIRE()/
> ACQUIRE_ERR()").
> 
> Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v6.19+
> Cc: Vivaik Balasubrawmanian <vivaik.balasubrawmanian@intel.com>
Seems reasonable to me and your replies to sashiko stuff seem to have that well
covered.  So FWIW

Reviewed-by: Jonathan Cameron <jic23@kernel.org>


* Re: [PATCH v13 08/22] KVM: selftests: Add TDX boot code
From: Chenyi Qiang @ 2026-06-16  9:21 UTC
  To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
  Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-8-6983ae4c3a4d@google.com>



On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Erdem Aktas <erdemaktas@google.com>
> 
> Add code to boot a TDX test VM. Since TDX registers are inaccessible to
> KVM, the boot code loads the relevant values from memory into the
> registers before jumping to the guest code.
> 
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Signed-off-by: Erdem Aktas <erdemaktas@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
>  tools/testing/selftests/kvm/Makefile.kvm           |  1 +
>  .../selftests/kvm/include/x86/tdx/td_boot.h        |  5 ++
>  .../selftests/kvm/include/x86/tdx/td_boot_asm.h    | 16 ++++++
>  tools/testing/selftests/kvm/lib/x86/tdx/td_boot.S  | 60 ++++++++++++++++++++++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> index 02fad7b35eac..929965ca4b75 100644
> --- a/tools/testing/selftests/kvm/Makefile.kvm
> +++ b/tools/testing/selftests/kvm/Makefile.kvm
> @@ -31,6 +31,7 @@ LIBKVM_x86 += lib/x86/sev.c
>  LIBKVM_x86 += lib/x86/svm.c
>  LIBKVM_x86 += lib/x86/ucall.c
>  LIBKVM_x86 += lib/x86/vmx.c
> +LIBKVM_x86 += lib/x86/tdx/td_boot.S
>  
>  LIBKVM_arm64 += lib/arm64/gic.c
>  LIBKVM_arm64 += lib/arm64/gic_v3.c
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h b/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> index af4474dee387..e5d54a20ed72 100644
> --- a/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> @@ -66,4 +66,9 @@ struct td_boot_parameters {
>  	struct td_per_vcpu_parameters per_vcpu[];
>  };
>  
> +void td_boot(void);
> +void td_boot_code_end(void);
> +
> +#define TD_BOOT_CODE_SIZE (td_boot_code_end - td_boot)
> +
>  #endif /* SELFTEST_TDX_TD_BOOT_H */
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h b/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
> new file mode 100644
> index 000000000000..10b4b527595c
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef SELFTEST_TDX_TD_BOOT_ASM_H
> +#define SELFTEST_TDX_TD_BOOT_ASM_H
> +
> +/*
> + * GPA where TD boot parameters will be loaded.
> + *
> + * TD_BOOT_PARAMETERS_GPA is arbitrarily chosen to
> + *
> + * + be within the 4GB address space
> + * + provide enough contiguous memory for the struct td_boot_parameters such
> + *   that there is one struct td_per_vcpu_parameters for KVM_MAX_VCPUS
> + */
> +#define TD_BOOT_PARAMETERS_GPA 0xffff0000
> +
> +#endif  // SELFTEST_TDX_TD_BOOT_ASM_H

It would be better to stay consistent and use /* ... */ for single-line
comments throughout this series, as is done for SELFTESTS_TDX_TDX_H in patch 20
and for the license identifiers.



* Re: [PATCH v13 01/22] KVM: selftests: Add macros to simplify creating VM shapes for non-default types
From: Xiaoyao Li @ 2026-06-16  8:57 UTC
  To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
	Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
  Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-1-6983ae4c3a4d@google.com>

On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Sean Christopherson<seanjc@google.com>
> 
> Add VM_TYPE() and __VM_TYPE() macros to create a vm_shape structure given
> a type (and mode), and use the macros to define VM_SHAPE_{SEV,SEV_ES,SNP}
> shapes for x86's SEV family of VM shapes.  Providing common infrastructure
> will avoid having to copy+paste vm_sev_create_with_one_vcpu() for TDX.
> 
> Use the new SEV+ shapes and drop vm_sev_create_with_one_vcpu().
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson<seanjc@google.com>
> Signed-off-by: Sagi Shahar<sagis@google.com>
> Reviewed-by: Binbin Wu<binbin.wu@linux.intel.com>
> Reviewed-by: Ira Weiny<ira.weiny@intel.com>
> Signed-off-by: Lisa Wang<wyihan@google.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

> ---
>   tools/testing/selftests/kvm/include/kvm_util.h     | 13 +++++++
>   .../testing/selftests/kvm/include/x86/processor.h  |  4 +++
>   tools/testing/selftests/kvm/include/x86/sev.h      |  2 --
>   tools/testing/selftests/kvm/lib/x86/sev.c          | 16 ---------
>   tools/testing/selftests/kvm/x86/sev_smoke_test.c   | 40 +++++++++++-----------
>   5 files changed, 37 insertions(+), 38 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index dc70c6da63fa..041bdbfb93f7 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -233,6 +233,19 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
>   	shape;					\
>   })
>   
> +#define __VM_TYPE(__mode, __type)		\

It seems the name "__VM_SHAPE" fits better?

> +({						\
> +	struct vm_shape shape = {		\
> +		.mode = (__mode),		\
> +		.type = (__type)		\
> +	};					\
> +						\
> +	shape;					\
> +})
> +
> +#define VM_TYPE(__type)				\
> +	__VM_TYPE(VM_MODE_DEFAULT, __type)

and I think making it one line would be OK?

So something on top:

---8<---
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 041bdbfb93f7..a1b5d2029d05 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -223,17 +223,7 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));

  #define VM_TYPE_DEFAULT                        0

-#define VM_SHAPE(__mode)                       \
-({                                             \
-       struct vm_shape shape = {               \
-               .mode = (__mode),               \
-               .type = VM_TYPE_DEFAULT         \
-       };                                      \
-                                               \
-       shape;                                  \
-})
-
-#define __VM_TYPE(__mode, __type)              \
+#define __VM_SHAPE(__mode, __type)             \
  ({                                             \
         struct vm_shape shape = {               \
                 .mode = (__mode),               \
@@ -243,8 +233,8 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
         shape;                                  \
  })

-#define VM_TYPE(__type)                                \
-       __VM_TYPE(VM_MODE_DEFAULT, __type)
+#define VM_SHAPE(__mode)       __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
+#define VM_TYPE(__type)                __VM_SHAPE(VM_MODE_DEFAULT, __type)

  extern enum vm_guest_mode vm_mode_default;
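
For reference, the consolidated macros can be exercised outside the selftests
with a small sketch like the one below. The struct layout and the
VM_MODE_DEFAULT placeholder value are assumptions made so the example
compiles on its own (the real definitions live in kvm_util.h), and
shape_from_mode()/shape_from_type() are hypothetical wrappers. It relies on
GNU statement expressions, so it needs GCC or Clang.

```c
#include <stdint.h>

/* Assumed layout for this sketch; only the 64-bit size is from the patch. */
struct vm_shape {
	uint32_t type;
	uint8_t  mode;
	uint8_t  pad0;
	uint16_t pad1;
};

#define VM_TYPE_DEFAULT		0
#define VM_MODE_DEFAULT		1	/* placeholder value for the sketch */

/* Single statement-expression helper that takes both mode and type. */
#define __VM_SHAPE(__mode, __type)		\
({						\
	struct vm_shape shape = {		\
		.mode = (__mode),		\
		.type = (__type)		\
	};					\
						\
	shape;					\
})

/* Both public macros are now one-line wrappers around __VM_SHAPE(). */
#define VM_SHAPE(__mode)	__VM_SHAPE(__mode, VM_TYPE_DEFAULT)
#define VM_TYPE(__type)		__VM_SHAPE(VM_MODE_DEFAULT, __type)

/* Hypothetical wrappers so the macros can be called like functions. */
struct vm_shape shape_from_mode(uint8_t mode)
{
	return VM_SHAPE(mode);
}

struct vm_shape shape_from_type(uint32_t type)
{
	return VM_TYPE(type);
}
```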




* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: K Prateek Nayak @ 2026-06-16  7:27 UTC
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <de274c2fb3f794ff1f19f0c96184ee50d04d1282.1781419998.git.ashish.kalra@amd.com>

Hello Ashish,

On 6/16/2026 1:19 AM, Ashish Kalra wrote:
> +	/*
> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
> +	 * reserved processor memory. The RMP scan is the most expensive
> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
> +	 * if they can see a cached result in the reserved processor memory.
> +	 *
> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
> +	 * on every other primary thread. Followers are "designed to"
> +	 * skip the scan if they see the "cached" scan results.
> +	 */
> +	cpumask_copy(follower_mask, &rmpopt_cpumask);

rmpopt_cpumask is constructed after hotplug is disabled but ...

> +
> +	/*
> +	 * Pin the worker to the current CPU for the leader loop so that
> +	 * this_cpu remains valid and the RMPOPT instruction executes on
> +	 * the correct CPU.
> +	 *
> +	 * Use migrate_disable() rather than get_cpu() to prevent
> +	 * migration while still allowing preemption.
> +	 */
> +	migrate_disable();
> +	this_cpu = smp_processor_id();
> +
> +	if (cpumask_test_cpu(this_cpu, follower_mask)) {
> +		/*
> +		 * Current CPU is a primary thread in rmpopt_cpumask.
> +		 * Run leader locally and remove from follower mask.
> +		 */
> +		cpumask_clear_cpu(this_cpu, follower_mask);
> +
> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +			rmpopt(pa);
> +			cond_resched();
> +		}
> +	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
> +				      follower_mask)) {
> +		/*
> +		 * Current CPU is a sibling thread whose primary is in
> +		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
> +		 * is safe to run the leader locally.  Remove the sibling's
> +		 * primary from the follower mask as this core is already
> +		 * covered by the leader.
> +		 */
> +		cpumask_andnot(follower_mask, follower_mask,
> +			       topology_sibling_cpumask(this_cpu));
> +
> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +			rmpopt(pa);
> +			cond_resched();
> +		}
> +	} else {
> +		/*
> +		 * Current CPU does not have RMPOPT_BASE MSR programmed.
> +		 * Pick an explicit leader from the cpumask to avoid #UD.
> +		 * Use work_on_cpu() to run in process context on the leader,
> +		 * avoiding IPI latency.
> +		 */

... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
siblings on "rmpopt_cpumask".

How does that happen?

> +		int leader_cpu = cpumask_first(follower_mask);
> +
> +		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
> +			migrate_enable();
> +			goto out;
> +		}
> +
> +		cpumask_clear_cpu(leader_cpu, follower_mask);
> +
> +		/* Release migration pin before work_on_cpu(). */
> +		migrate_enable();
> +
> +		work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);

This creates a delayed work and also waits for it to finish execution
which will add more latency than a simple IPI if the comment about IPI
latency above is accurate.

I think there is some corner case in construction of the
"rmpopt_cpumask" that requires this not-so-pretty else block. Can you
elaborate why this is required?

Perhaps the "rmpopt_cpumask" construction needs:

    for_each_online_cpu(cpu) {
        /* Nominate the first CPU on the sibling mask for RMPOPT */
        if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
            continue;
        cpumask_set_cpu(cpu, &rmpopt_cpumask);
    }


and all you need here is:

    /* Do RMPOPT for local core */
    for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
        rmpopt(pa);

    /* Skip this core from concurrent RMPOPT */
    cpumask_andnot(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));

No?
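
To make the idea concrete, here is a userspace model of that construction
using plain 64-bit bitmasks instead of cpumasks. It is only a sketch: the
8-CPU topology (CPU n and CPU n + 4 are siblings) and all model_*() names
are assumptions for illustration, not kernel APIs.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 8

/*
 * Assumed topology for this sketch: CPU n and CPU n + 4 are SMT siblings.
 * Only online CPUs appear in a sibling mask.
 */
uint64_t model_sibling_mask(int cpu, uint64_t online)
{
	int primary = cpu % 4;

	return ((1ULL << primary) | (1ULL << (primary + 4))) & online;
}

/* Lowest set bit, or MODEL_NR_CPUS if the mask is empty. */
int model_mask_first(uint64_t mask)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (mask & (1ULL << cpu))
			return cpu;

	return MODEL_NR_CPUS;
}

/* Nominate the first online CPU of every core. */
uint64_t model_build_rmpopt_mask(uint64_t online)
{
	uint64_t mask = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(online & (1ULL << cpu)))
			continue;
		if (cpu != model_mask_first(model_sibling_mask(cpu, online)))
			continue;
		mask |= 1ULL << cpu;
	}

	return mask;
}

/* Followers are all nominated CPUs except the one on the local core. */
uint64_t model_follower_mask(uint64_t rmpopt_mask, int this_cpu,
			     uint64_t online)
{
	return rmpopt_mask & ~model_sibling_mask(this_cpu, online);
}
```

In this model every online CPU has a nominated CPU in its own sibling mask,
so the "neither this CPU nor a sibling is in the mask" case cannot occur as
long as the online set does not change after the mask is built.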

> +
> +		goto followers;
> +	}
> +
> +	migrate_enable();
> +
-- 
Thanks and Regards,
Prateek



* Re: [PATCH v8 2/7] x86/sev: Initialize RMPOPT configuration MSRs
From: K Prateek Nayak @ 2026-06-16  6:03 UTC
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <6a4d0ec9e037d91c0fdba9c525942ca288e1b1b2.1781419998.git.ashish.kalra@amd.com>

Hello Ashish,

On 6/16/2026 1:18 AM, Ashish Kalra wrote:
> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
> index 8bcdce98f6dc..1b5c18408f0b 100644
> --- a/arch/x86/virt/svm/sev.c
> +++ b/arch/x86/virt/svm/sev.c
> @@ -124,6 +124,10 @@ static void *rmp_bookkeeping __ro_after_init;
>  
>  static u64 probed_rmp_base, probed_rmp_size;
>  
> +static cpumask_t rmpopt_cpumask;

nit.

I believe you can use cpumask_var_t here and do a zalloc_cpumask_var()
during snp_setup_rmpopt(). That way !X86_FEATURE_RMPOPT configs don't
have to needlessly waste space to keep a redundant cpumask around.

Same comment for rmpopt_report_cpumask in Patch 7 which can be
allocated dynamically during rmpopt_debugfs_setup().

> +static phys_addr_t rmpopt_pa_start;
> +static bool rmpopt_configured;
> +
>  static LIST_HEAD(snp_leaked_pages_list);
>  static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
>  
> @@ -490,7 +494,12 @@ static bool __init setup_rmptable(void)
>  	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
>  		if (!setup_segmented_rmptable())
>  			return false;
> +		rmpopt_configured = true;
>  	} else {
> +		/*
> +		 * RMPOPT requires a segmented RMP table, so leave
> +		 * rmpopt_configured clear on contiguous RMP systems.
> +		 */
>  		if (!setup_contiguous_rmptable())
>  			return false;
>  	}
> @@ -555,6 +564,21 @@ int snp_prepare(void)
>  }
>  EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
>  
> +static void rmpopt_cleanup(void)
> +{
> +	int cpu;
> +
> +	cpus_read_lock();

nit.

You can use guard(cpus_read_lock)() unless there is a complicated
locking pattern where you need to drop and re-acquire the read lock.
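
For readers less familiar with scope-based guards, the mechanism can be
modeled in userspace with the GCC/Clang cleanup attribute. This is only a
sketch of the pattern, not the kernel's guard() implementation: MODEL_GUARD()
and the model_*() helpers are hypothetical, and the "lock" is just a counter.

```c
/* Hypothetical stand-in for the CPU hotplug read lock: a depth counter. */
static int model_lock_depth;

void model_read_lock(void)
{
	model_lock_depth++;
}

void model_read_unlock(void)
{
	model_lock_depth--;
}

/* Called automatically when the guard variable goes out of scope. */
static void model_guard_release(int *token)
{
	(void)token;
	model_read_unlock();
}

/* Take the lock now, drop it at scope exit (models guard(...)()). */
#define MODEL_GUARD()							\
	int model_guard_token						\
	__attribute__((cleanup(model_guard_release), unused)) =	\
		(model_read_lock(), 0)

/*
 * Clear every entry while holding the lock.  Returns the lock depth seen
 * inside the critical section, or -1 on a bad entry.  Both return paths
 * release the lock without an explicit unlock call.
 */
int model_clear_all(int *vals, int n)
{
	MODEL_GUARD();
	int depth_inside = model_lock_depth;

	for (int i = 0; i < n; i++) {
		if (vals[i] < 0)
			return -1;
		vals[i] = 0;
	}

	return depth_inside;
}

int model_current_depth(void)
{
	return model_lock_depth;
}
```

The benefit in a function like rmpopt_cleanup() is mostly readability: there
is no separate unlock call to keep paired with the lock.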

> +
> +	for_each_cpu(cpu, &rmpopt_cpumask)
> +		WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0));
> +
> +	cpus_read_unlock();
> +
> +	cpumask_clear(&rmpopt_cpumask);
> +	rmpopt_pa_start = 0;
> +}
> +
>  void snp_shutdown(void)
>  {
>  	u64 syscfg;

-- 
Thanks and Regards,
Prateek



* Re: [PATCH] PCI/TSM: fix use-after-free in find_dsm_dev()
From: Lukas Wunner @ 2026-06-16  3:16 UTC
  To: Wentao Liang; +Cc: djbw, bhelgaas, linux-coco, linux-pci, linux-kernel, stable
In-Reply-To: <20260616030243.1661791-1-vulab@iscas.ac.cn>

On Tue, Jun 16, 2026 at 03:02:43AM +0000, Wentao Liang wrote:
> In find_dsm_dev(), pf0 is obtained via pf0_dev_get() which returns a
> reference-counted pointer.  It is declared with __free(pci_dev_put),
> so pci_dev_put() will be called when the variable goes out of scope.
> Returning 'pf0' directly while it still has __free cleanup causes the
> reference to be dropped before the caller can use the pointer, leading
> to a use-after-free.

No, the code comment preceding find_dsm_dev() explicitly states:

   "Note that no additional reference is held for the resulting device
    because that resulting object always has a registered lifetime
    greater-than-or-equal to that of the @pdev argument."

Your patch looks like it may be an LLM-generated hallucination.
Did you use an LLM to come up with the patch?  If so, please use
an Assisted-by tag per Documentation/process/coding-assistants.rst
so that we know to expect hallucinations.
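
For anyone following along, the difference between the two return styles can
be shown with a userspace model built on the GCC/Clang cleanup attribute.
This is a sketch only: struct obj and the helpers are hypothetical and do not
reproduce the kernel's __free() or no_free_ptr() macros.

```c
#include <stddef.h>

/* Hypothetical refcounted object; not struct pci_dev. */
struct obj {
	int refs;
};

struct obj *obj_get(struct obj *o)
{
	if (o)
		o->refs++;
	return o;
}

void obj_put(struct obj *o)
{
	if (o)
		o->refs--;
}

/* Cleanup handler: drops the reference held by the annotated variable. */
static void obj_put_cleanup(struct obj **p)
{
	obj_put(*p);
}

#define MODEL_FREE_OBJ __attribute__((cleanup(obj_put_cleanup)))

/*
 * Returns a borrowed pointer: the reference taken here is dropped at scope
 * exit, so the caller must rely on some other lifetime guarantee.
 */
struct obj *find_borrowed(struct obj *o)
{
	struct obj *tmp MODEL_FREE_OBJ = obj_get(o);

	return tmp;
}

/*
 * Returns an owned pointer: clearing tmp before returning defeats the
 * cleanup (this models no_free_ptr()), so the caller must obj_put() later.
 */
struct obj *find_owned(struct obj *o)
{
	struct obj *tmp MODEL_FREE_OBJ = obj_get(o);
	struct obj *ret = tmp;

	tmp = NULL;
	return ret;
}
```

In the model, switching from the borrowed to the owned style is only correct
if every caller is also changed to drop the extra reference.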

Thanks,

Lukas


* [PATCH] PCI/TSM: fix use-after-free in find_dsm_dev()
From: Wentao Liang @ 2026-06-16  3:02 UTC
  To: djbw, bhelgaas; +Cc: linux-coco, linux-pci, linux-kernel, Wentao Liang, stable

In find_dsm_dev(), pf0 is obtained via pf0_dev_get() which returns a
reference-counted pointer.  It is declared with __free(pci_dev_put),
so pci_dev_put() will be called when the variable goes out of scope.
Returning 'pf0' directly while it still has __free cleanup causes the
reference to be dropped before the caller can use the pointer, leading
to a use-after-free.

Fix by using return no_free_ptr(pf0) to suppress the automatic
cleanup and properly transfer ownership to the caller.

Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 drivers/pci/tsm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/tsm.c b/drivers/pci/tsm.c
index 5fdcd7f2e820..dd4e0cb0c6aa 100644
--- a/drivers/pci/tsm.c
+++ b/drivers/pci/tsm.c
@@ -670,7 +670,7 @@ static struct pci_dev *find_dsm_dev(struct pci_dev *pdev)
 		return NULL;
 
 	if (is_dsm(pf0))
-		return pf0;
+		return no_free_ptr(pf0);
 
 	/*
 	 * For cases where a switch may be hosting TDISP services on behalf of
-- 
2.34.1



* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-16  0:26 UTC
  To: Sean Christopherson
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <aiM2Nm9Eh68mIVX3@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Fri, Jun 05, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> >> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
>> >> instead?
>> >>
>> >> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
>> >> __vm_create_with_vcpus().
>> >>
>> >> While building this POC to test conversions [1] I only wanted to create
>> >> the vm and vcpus and didn't want to finalize yet, since I still needed
>> >> to do more mappings in the guest (and I needed the vm pointer to do
>> >> mappings in the guest).
>> >
>> > Hmm, I would argue this is a flaw in the selftests infrastructure.  IMO, as a
>> > developer, it's quite surprising that the current value of a global variable
>> > doesn't show up in the VM automagically.  I totally understand why selftests
>> > work that way, but it's certainly odd and annoying.  If _that_ were solved, then
>> > the kludginess of what you're doing goes away.
>> >
>> > The other way this could be solved is by adding support for annotating globals
>> > with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
>> > into the VM can automatically mark the associated globals' pages as shared.
>> >
>>
>> More generally, is your opinion that tests should not have to add extra
>> memslots?
>
> I don't care?  What I care about is making it as easy and intuitive as possible
> for people to write tests, and to minimize maintenance costs.
>
>> If I wanted a shared page, would I have to do
>>
>>   static __shared test_page[4096] = {0};
>>
>> and then rely on ELF loading to put that in the guest for me? Are there
>> some compiler flags/how will I require that test_page be page aligned?
>
> Compiler and linker shenanigans.
>
>> If I mark 10 globals as __shared, would the compiler automatically
>> consolidate the shared memory together?
>
> Yes, follow the __bss_decrypted breadcrumbs.
>
>   #define __bss_decrypted __section(".bss..decrypted")
>
>> I think it's a bit constraining to require that all guest memory be set
>> up statically. It's nice to have but I'd like another option...
>
> You do have options, they just require more work.
>
>> Many tests use vm_userspace_mem_region_add(), CoCo tests that require
>> finalizing shouldn't be disallowed that option.
>
> What does that have to do with finalizing the VM?
>

I could add more memslots after finalizing the VM, but then I'd have to
make the guest accept more memory. Hence, I'd rather set up all the
memslots and then finalize the VM.

sev_smoke_test currently has a separate vm_sev_launch(), which is the
equivalent of tdx_init_mem_region(), and that's called in the tests
themselves.

sev_smoke_test also uses vcpu_args_set() after creating the VM with
vm_create_shape_with_one_vcpu(). Would that work if vm_sev_launch() got
moved into kvm_arch_vm_finalize_vcpus()?

>> >> It's also possible to have some kvm_vm_finalize() call that can be
>> >> explicitly and manually invoked from selftests just for CoCo selftests.
>> >
>> > Why bother?  It's obviously possible to call kvm_arch_vm_finalize_vcpus() directly.
>>
>> Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
>> is the right function where the TD is finalized?
>>
>> For tests that need to do more setup after creating a vm, is the only
>> way out to call __vm_create() then vm_vcpu_add() to avoid premature
>> finalization in __vm_create_with_vcpus() when
>> kvm_arch_vm_finalize_vcpus() is called?
>
> Depends on what you're doing.  Sometimes, the answer will be yes.  That's why
> there are "low level" APIs, so that some tests can do fancy things, while most
> tests can leave the details to the infrastructure.
>
> If there's a recurring problem, or we anticipate one, then we can and should
> figure out how to minimize the pain so that tests don't have to deal with the
> same boilerplate issues over and over.  Hence the __shared idea.

I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
finalizing the VM.

I would prefer to not have to explicitly call some function like
kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
for test setup flexibility, without too much magic.

For now, I can't think of many uses of __shared. ucall shared memory is
allocated dynamically, and we can also make it shared cleanly within
ucall code.

The global variables (sync_global_to_guest()) will appear in the guest
as long as sync_global_to_guest() is called before
kvm_arch_vm_finalize(), which I think makes sense to people writing
tests for CoCo.


So 2 questions to push this along:

1. What do you think of a kvm_arch_vm_finalize() that calls
   vm_sev_launch() and tdx_vm_finalize()? My key issue is that
   kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
   rather than the whole VM.

2. Would you like __shared implemented together with this series, as a
   prerequisite, or later?

^ permalink raw reply

* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Lisa Wang @ 2026-06-16  0:04 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <eab3711d-c5b1-4132-92e4-7f7040fcd17b@linux.intel.com>

On Mon, Jun 08, 2026 at 04:03:23PM +0800, Binbin Wu wrote:
> On 5/22/2026 7:16 AM, Lisa Wang wrote:
> > +	 * exactly like other memory providers.
> >  	 */
> > -	flags = 0;
> > -	if (is_guest_memfd_required(shape))
> > +	if (is_guest_memfd_required(shape)) {
> >  		flags |= KVM_MEM_GUEST_MEMFD;
> > +		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> > +	}
> >  
> > -	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> > +	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
> 
> The build failed due to this:
> 
> lib/kvm_util.c: In function ‘__vm_create’:
> lib/kvm_util.c:507:9: error: too many arguments to function ‘vm_mem_add’
>   507 |         vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
>       |         ^~~~~~~~~~
> In file included from lib/kvm_util.c:9:
> include/kvm_util.h:714:6: note: declared here
>   714 | void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>       |      ^~~~~~~~~~
> lib/kvm_util.c: At top level:
> cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
> 
> It seems the patch set doesn't wire gmem_flags parameter to vm_mem_add().

This series needs to be based on "KVM: selftests: Add support for mmap() on
guest_memfd in core library" for the gmem_flags parameter.

Lisa

> >  	for (i = 0; i < NR_MEM_REGIONS; i++)
> >  		vm->memslots[i] = 0;
> >  
> > 
> 

^ permalink raw reply

* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Ackerley Tng @ 2026-06-15 23:46 UTC (permalink / raw)
  To: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
  Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-13-6983ae4c3a4d@google.com>

Lisa Wang <wyihan@google.com> writes:

> Set the initial state of the first memory region as shared if it is
> backed by guest_memfd, so that the KVM selftest framework functions can
> populate mmap()-ed guest_memfd memory the same way memory from other
> memory providers are populated.
>
> For CoCo VMs, pages that need to be private are explicitly set to
> private before executing the VM.
>
>
> [...snip...]
>
> @@ -495,14 +497,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
>  	vm = ____vm_create(shape);
>
>  	/*
> -	 * Force GUEST_MEMFD for the primary memory region if necessary, e.g.
> -	 * for CoCo VMs that require GUEST_MEMFD backed private memory.
> +	 * Force GUEST_MEMFD for the primary memory region if necessary, and
> +	 * initialize it as shared so the selftest framework can populate it
> +	 * exactly like other memory providers.
>  	 */
> -	flags = 0;
> -	if (is_guest_memfd_required(shape))
> +	if (is_guest_memfd_required(shape)) {
>  		flags |= KVM_MEM_GUEST_MEMFD;
> +		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> +	}
>

Just noticed this while hacking some SNP tests.

> -	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> +	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
>  	for (i = 0; i < NR_MEM_REGIONS; i++)
>  		vm->memslots[i] = 0;
>
>
> --
> 2.54.0.746.g67dd491aae-goog

I think this patch should fully buy into in-place conversions, so we
need to also set GUEST_MEMFD_FLAG_MMAP:

@@ -483,6 +483,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
 {
 	u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
 						 nr_extra_pages);
+	enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS;
 	struct userspace_mem_region *slot0;
 	u64 gmem_flags = 0;
 	struct kvm_vm *vm;
@@ -503,10 +504,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
 	 */
 	if (is_guest_memfd_required(shape)) {
 		flags |= KVM_MEM_GUEST_MEMFD;
-		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED | GUEST_MEMFD_FLAG_MMAP;
+		/*
+		 * TODO: Clean this up together with vm_mem_add(). Use
+		 * VM_MEM_SRC_SHMEM to tell vm_mem_add() to mmap
+		 * guest_memfd with MAP_SHARED.
+		 */
+		src_type = VM_MEM_SRC_SHMEM;
 	}

-	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
+	vm_mem_add(vm, src_type, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
 	for (i = 0; i < NR_MEM_REGIONS; i++)
 		vm->memslots[i] = 0;

The VM_MEM_SRC_SHMEM is a bit of a hack but imo that refactor should go
with another series to clean up vm_mem_add().

The above code was working for TDX because vm_mem_add() without the
guest_memfd MMAP flag would leave region->fd as -1. Even with src_type
left as VM_MEM_SRC_ANONYMOUS, the assertion in vm_mem_add()

  region->fd == -1 || backing_src_is_shared(src_type)

would pass.

tdx_init_mem_region() was called with the anonymously mmap()-ed address,
which turns out fine.

With the above change, we would also need to call tdx_init_mem_region()
with NULL for source_address.

^ permalink raw reply

* Re: [PATCH v13 03/22] KVM: selftests: Initialize the TDX VM
From: Lisa Wang @ 2026-06-15 23:33 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <a58e2941-77f9-43cf-a54d-023506dd7eb0@linux.intel.com>

On Mon, Jun 08, 2026 at 01:57:24PM +0800, Binbin Wu wrote:
[...snip...]
> > +	} tdx_cmd = { .c = {						\
> > +		.id = (cmd),						\
> > +		.flags = (u32)(_flags),				\
> > +		.data = (u64)(arg),				\
> 
> Nit:
> The two lines' backslashes are misaligned.
> 
> > +
> > +#define tdx_vm_ioctl(vm, cmd, flags, arg)				\
> > +({									\
> > +	int ret = __tdx_vm_ioctl(vm, cmd, flags, arg);			\
> 
> tdx_cmd.c.hw_error is u64 and it could be assigned to ret, which is a int,
> the upper bits could be truncated if the upper 32-bit is set.

Thanks. Will fix these in the next version.
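
To make the truncation concern concrete, here is a small standalone
sketch. Both helper names are hypothetical and are not the selftest
macros; it only shows what happens when a 64-bit status whose set bits
all sit in the upper half is narrowed to int. The narrowing result is
implementation-defined in C; GCC and Clang keep the low 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a helper that narrows a 64-bit status to int. */
static int status_as_int(uint64_t hw_error)
{
	/* On GCC/Clang this keeps only the low 32 bits. */
	return (int)hw_error;
}

/* Alternative: report failure whenever any bit of the status is set. */
static int status_checked(uint64_t hw_error)
{
	return hw_error ? -1 : 0;
}
```

With a status of 0x8000000000000000, status_as_int() returns 0, so a
"!ret" style assertion would pass even though an error was reported,
while status_checked() still returns nonzero.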

Lisa

> > +									\
> > +	__TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd,	ret, vm);		\
> > +})
> > +
> > +void tdx_init_vm(struct kvm_vm *vm, u64 attributes);
> > +
> >  #endif /* SELFTESTS_TDX_TDX_UTIL_H */[...]

^ permalink raw reply

* [PATCH v8 7/7] x86/sev: Add debugfs support for RMPOPT
From: Ashish Kalra @ 2026-06-15 19:50 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Add a debugfs interface to report per-CPU RMPOPT status across all
system RAM.

To dump the per-CPU RMPOPT status for all system RAM:

/sys/kernel/debug/rmpopt# cat rmpopt-table

Memory @  0GB: CPU(s): none
Memory @  1GB: CPU(s): none
Memory @  2GB: CPU(s): 0-1023
Memory @  3GB: CPU(s): 0-1023
Memory @  4GB: CPU(s): none
Memory @  5GB: CPU(s): 0-1023
Memory @  6GB: CPU(s): 0-1023
Memory @  7GB: CPU(s): 0-1023
...
Memory @1025GB: CPU(s): 0-1023
Memory @1026GB: CPU(s): 0-1023
Memory @1027GB: CPU(s): 0-1023
Memory @1028GB: CPU(s): 0-1023
Memory @1029GB: CPU(s): 0-1023
Memory @1030GB: CPU(s): 0-1023
Memory @1031GB: CPU(s): 0-1023
Memory @1032GB: CPU(s): 0-1023
Memory @1033GB: CPU(s): 0-1023
Memory @1034GB: CPU(s): 0-1023
Memory @1035GB: CPU(s): 0-1023
Memory @1036GB: CPU(s): 0-1023
Memory @1037GB: CPU(s): 0-1023
Memory @1038GB: CPU(s): none
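
As a sanity check on the row labels above, a userspace sketch of the
index arithmetic used when printing each row. It assumes 4KB base pages;
order_of() is a hypothetical stand-in for the kernel's get_order(), and
gb_index() and format_row() are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SHIFT	12		/* assumption: 4KB base pages */
#define SZ_1G		(1ULL << 30)

/* Stand-in for get_order(): log2 of the size in pages, rounded up. */
static unsigned int order_of(uint64_t size)
{
	uint64_t pages = (size + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
	unsigned int order = 0;

	while ((1ULL << order) < pages)
		order++;
	return order;
}

/* 1GB region index of a physical address: a shift by 18 + 12 = 30 bits. */
static uint64_t gb_index(uint64_t paddr)
{
	return paddr >> (order_of(SZ_1G) + PAGE_SHIFT);
}

/* Format the row prefix the same way as the table above. */
static int format_row(char *buf, size_t len, uint64_t paddr)
{
	return snprintf(buf, len, "Memory @%3lluGB: ",
			(unsigned long long)gb_index(paddr));
}
```

The %3llu width pads indices below 100 with spaces and simply grows for
four-digit indices, which matches the "@  2GB" and "@1025GB" rows.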

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 128 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 253a534b9a0d..b8b00c50ce41 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -20,6 +20,8 @@
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
 #include <linux/workqueue.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -145,6 +147,15 @@ static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
 static unsigned long snp_nr_leaked_pages;
 
+/* All users of rmpopt_report_cpumask must hold rmpopt_show_mutex. */
+static cpumask_t rmpopt_report_cpumask;
+static struct dentry *rmpopt_debugfs;
+static DEFINE_MUTEX(rmpopt_show_mutex);
+
+struct seq_paddr {
+	phys_addr_t next_seq_paddr;
+};
+
 #undef pr_fmt
 #define pr_fmt(fmt)	"SEV-SNP: " fmt
 
@@ -587,6 +598,8 @@ static void rmpopt_cleanup(void)
 
 	cancel_delayed_work_sync(&rmpopt_delayed_work);
 	destroy_workqueue(rmpopt_wq);
+	debugfs_remove_recursive(rmpopt_debugfs);
+	rmpopt_debugfs = NULL;
 
 	cpus_read_lock();
 
@@ -635,6 +648,10 @@ static inline bool __rmpopt(u64 pa_start, u64 op_type)
 		     : "a" (pa_start), "c" (op_type)
 		     : "memory", "cc");
 
+	if (op_type == RMPOPT_FUNC_REPORT_STATUS)
+		assign_cpu(smp_processor_id(), &rmpopt_report_cpumask,
+			   optimized);
+
 	return optimized;
 }
 
@@ -669,6 +686,115 @@ static long rmpopt_leader_fn(void *arg)
 	return 0;
 }
 
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_report_status(void *val)
+{
+	u64 pa_start = ALIGN_DOWN((u64)val, SZ_1G);
+	u64 op_type = RMPOPT_FUNC_REPORT_STATUS;
+
+	__rmpopt(pa_start, op_type);
+}
+
+/*
+ * start() can be called multiple times if allocated buffer has overflowed
+ * and bigger buffer is allocated.
+ */
+static void *rmpopt_table_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	phys_addr_t end_paddr = rmpopt_pa_end;
+	struct seq_paddr *p = seq->private;
+
+	if (*pos == 0) {
+		p->next_seq_paddr = rmpopt_pa_start;
+		if (p->next_seq_paddr >= end_paddr)
+			return NULL;
+		return &p->next_seq_paddr;
+	}
+
+	if (p->next_seq_paddr >= end_paddr)
+		return NULL;
+
+	return &p->next_seq_paddr;
+}
+
+static void *rmpopt_table_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	phys_addr_t end_paddr = rmpopt_pa_end;
+	phys_addr_t *curr_paddr = v;
+
+	(*pos)++;
+	*curr_paddr += SZ_1G;
+	if (*curr_paddr >= end_paddr)
+		return NULL;
+
+	return curr_paddr;
+}
+
+static void rmpopt_table_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static int rmpopt_table_seq_show(struct seq_file *seq, void *v)
+{
+	phys_addr_t *curr_paddr = v;
+
+	guard(mutex)(&rmpopt_show_mutex);
+
+	seq_printf(seq, "Memory @%3lluGB: ",
+		   *curr_paddr >> (get_order(SZ_1G) + PAGE_SHIFT));
+
+	/*
+	 * Query all online CPUs rather than just rmpopt_cpumask (primary
+	 * threads only). The RMPOPT instruction only needs to run on one
+	 * thread per core for the optimization to take effect, but debugfs
+	 * reporting requires the RMPOPT status across all CPUs.
+	 * Performance is not a concern for this diagnostic interface.
+	 *
+	 * This is safe because RMPOPT_BASE MSR is per-core and
+	 * snp_prepare() ensures all CPUs are online when the MSR is
+	 * programmed during snp_setup_rmpopt().
+	 */
+	cpumask_clear(&rmpopt_report_cpumask);
+	on_each_cpu_mask(cpu_online_mask, rmpopt_report_status,
+			 (void *)*curr_paddr, true);
+
+	if (cpumask_empty(&rmpopt_report_cpumask))
+		seq_puts(seq, "CPU(s): none\n");
+	else
+		seq_printf(seq, "CPU(s): %*pbl\n", cpumask_pr_args(&rmpopt_report_cpumask));
+
+	return 0;
+}
+
+static const struct seq_operations rmpopt_table_seq_ops = {
+	.start = rmpopt_table_seq_start,
+	.next = rmpopt_table_seq_next,
+	.stop = rmpopt_table_seq_stop,
+	.show = rmpopt_table_seq_show
+};
+
+static int rmpopt_table_open(struct inode *inode, struct file *file)
+{
+	return seq_open_private(file, &rmpopt_table_seq_ops, sizeof(struct seq_paddr));
+}
+
+static const struct file_operations rmpopt_table_fops = {
+	.open = rmpopt_table_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release_private,
+};
+
+static void rmpopt_debugfs_setup(void)
+{
+	rmpopt_debugfs = debugfs_create_dir("rmpopt", arch_debugfs_dir);
+
+	debugfs_create_file("rmpopt-table", 0400, rmpopt_debugfs,
+			    NULL, &rmpopt_table_fops);
+}
+
 /*
  * RMPOPT optimizations skip RMP checks at 1GB granularity if this
  * range of memory does not contain any SNP guest memory.
@@ -874,6 +1000,8 @@ void snp_setup_rmpopt(void)
 	 * optimizations on all physical memory.
 	 */
 	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
+
+	rmpopt_debugfs_setup();
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v8 6/7] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ashish Kalra @ 2026-06-15 19:50 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Pages are converted from shared to private as SNP guests are launched.
This destroys existing RMPOPT optimizations in the regions where
pages are converted.

Conversely, guest pages are converted back to shared during SNP guest
termination and their region may become eligible for RMPOPT
optimization.

To take advantage of this, perform RMPOPT after guest termination.
Do it after a delay so that a single RMPOPT pass can be done if
multiple guests terminate in a short period of time.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/kvm/svm/sev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e107f368ed2d..29af6f6e603c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3005,6 +3005,8 @@ void sev_vm_destroy(struct kvm *kvm)
 		 */
 		if (snp_decommission_context(kvm))
 			return;
+
+		snp_rmpopt_all_physmem();
 	} else {
 		sev_unbind_asid(kvm, sev->handle);
 	}
-- 
2.43.0


^ permalink raw reply related

* [PATCH v8 5/7] x86/sev: Add interface to re-enable RMP optimizations.
From: Ashish Kalra @ 2026-06-15 19:49 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

RMPOPT table is a per-CPU table which indicates if 1GB regions of
physical memory are entirely hypervisor-owned or not.

When performing host memory accesses in hypervisor mode as well as
non-SNP guest mode, the processor may consult the RMPOPT table to
potentially skip an RMP access and improve performance.

Events such as RMPUPDATE can clear RMP optimizations. Add an interface
to re-enable those optimizations.

The interface uses mod_delayed_work() instead of queue_delayed_work()
so that the delay timer is reset on each call. This provides proper
batching semantics: re-optimization runs 10 seconds after the *last*
VM termination rather than after the first. mod_delayed_work() also
re-queues work that is already in-flight, so a re-scan request
during an active scan is not silently dropped.
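
The timer-reset difference can be illustrated with a toy userspace
model. This is only a sketch under simplifying assumptions: struct
fake_delayed_work, fake_queue() and fake_mod() are hypothetical
stand-ins, time is a plain millisecond counter, and in-flight work is
not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

#define DELAY_MS 10000	/* mirrors the 10 second delay described above */

struct fake_delayed_work {
	bool pending;
	long expires;	/* absolute time in ms at which the work would run */
};

/* queue_delayed_work()-like: does nothing if the work is already pending. */
static bool fake_queue(struct fake_delayed_work *w, long now)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = now + DELAY_MS;
	return true;
}

/* mod_delayed_work()-like: always re-arms; returns whether it was pending. */
static bool fake_mod(struct fake_delayed_work *w, long now)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + DELAY_MS;
	return was_pending;
}
```

For three terminations at t = 0, 4000 and 9000 ms, the queue-style
helper leaves the expiry at 10000 ms (10 seconds after the first call),
while the mod-style helper moves it to 19000 ms (10 seconds after the
last call).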

Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |  2 ++
 arch/x86/virt/svm/sev.c    | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 0d662221615a..a11306f25336 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 int snp_prepare(void);
+void snp_rmpopt_all_physmem(void);
 void snp_setup_rmpopt(void);
 void snp_clear_rmpopt_configured(void);
 void snp_shutdown(void);
@@ -682,6 +683,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_rmpopt_all_physmem(void) {}
 static inline void snp_setup_rmpopt(void) {}
 static inline void snp_clear_rmpopt_configured(void) {}
 static inline void snp_shutdown(void) {}
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index b63b639bfc30..253a534b9a0d 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -782,6 +782,21 @@ static void rmpopt_work_handler(struct work_struct *work)
 	free_cpumask_var(follower_mask);
 }
 
+void snp_rmpopt_all_physmem(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
+		return;
+
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	if (!rmpopt_wq)
+		return;
+
+	mod_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
+			 msecs_to_jiffies(RMPOPT_WORK_TIMEOUT));
+}
+EXPORT_SYMBOL_GPL(snp_rmpopt_all_physmem);
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ashish Kalra @ 2026-06-15 19:49 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

When SEV-SNP is enabled, all writes to memory are checked to ensure
integrity of SNP guest memory. This imposes performance overhead on the
whole system.

RMPOPT is a new instruction that minimizes the performance overhead of
RMP checks on the hypervisor and on non-SNP guests by allowing RMP
checks to be skipped for 1GB regions of memory that are known not to
contain any SEV-SNP guest memory.

Add support for performing RMP optimizations asynchronously using a
dedicated workqueue.

Enable RMPOPT optimizations for up to 2TB of system RAM starting from
the lowest physical memory address aligned down to a 1GB boundary at
RMP initialization time. RMP checks can initially be skipped for 1GB
memory ranges that do not contain SEV-SNP guest memory (excluding
preassigned pages such as the RMP table and firmware pages). As SNP
guests are launched, RMPUPDATE will disable the corresponding RMPOPT
optimizations.
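
A standalone sketch of the range computation described above, for
illustration only. The struct and function names are hypothetical, and
the inputs stand in for the lowest RAM address and the top of RAM that
the kernel derives itself.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G	(1ULL << 30)
#define SZ_2T	(1ULL << 41)

#define ALIGN_DOWN_1G(x)	((x) & ~(SZ_1G - 1))
#define ALIGN_UP_1G(x)		(((x) + SZ_1G - 1) & ~(SZ_1G - 1))

struct scan_range {
	uint64_t start;	/* inclusive, 1GB aligned */
	uint64_t end;	/* exclusive, 1GB aligned */
};

/* Align the start down and the end up to 1GB, then cap the span at 2TB. */
static struct scan_range rmpopt_range(uint64_t lowest_ram, uint64_t top_of_ram)
{
	struct scan_range r = {
		.start = ALIGN_DOWN_1G(lowest_ram),
		.end = ALIGN_UP_1G(top_of_ram),
	};

	if (r.end - r.start > SZ_2T)
		r.end = r.start + SZ_2T;
	return r;
}

/* Number of 1GB regions one scan pass has to visit. */
static uint64_t nr_regions(struct scan_range r)
{
	return (r.end - r.start) / SZ_1G;
}
```

A machine whose RAM ends partway through the 1039th gigabyte yields 1039
regions per pass, while a 3TB machine is capped at 2048 regions and
memory above the cap is left unoptimized.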

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 230 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 227 insertions(+), 3 deletions(-)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 1b5c18408f0b..b63b639bfc30 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -19,6 +19,7 @@
 #include <linux/iommu.h>
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
+#include <linux/workqueue.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -125,9 +126,20 @@ static void *rmp_bookkeeping __ro_after_init;
 static u64 probed_rmp_base, probed_rmp_size;
 
 static cpumask_t rmpopt_cpumask;
-static phys_addr_t rmpopt_pa_start;
+static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
 static bool rmpopt_configured;
 
+enum rmpopt_function {
+	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
+	RMPOPT_FUNC_REPORT_STATUS
+};
+
+#define RMPOPT_WORK_TIMEOUT	10000
+
+static struct workqueue_struct *rmpopt_wq;
+static struct delayed_work rmpopt_delayed_work;
+static DEFINE_MUTEX(rmpopt_wq_mutex);
+
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
@@ -568,6 +580,14 @@ static void rmpopt_cleanup(void)
 {
 	int cpu;
 
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	if (!rmpopt_wq)
+		return;
+
+	cancel_delayed_work_sync(&rmpopt_delayed_work);
+	destroy_workqueue(rmpopt_wq);
+
 	cpus_read_lock();
 
 	for_each_cpu(cpu, &rmpopt_cpumask)
@@ -576,7 +596,8 @@ static void rmpopt_cleanup(void)
 	cpus_read_unlock();
 
 	cpumask_clear(&rmpopt_cpumask);
-	rmpopt_pa_start = 0;
+	rmpopt_pa_start = rmpopt_pa_end = 0;
+	rmpopt_wq = NULL;
 }
 
 void snp_shutdown(void)
@@ -599,6 +620,168 @@ void snp_clear_rmpopt_configured(void)
 	rmpopt_configured = false;
 }
 
+/*
+ * RMPOPT: F2 0F 01 FC
+ *   Input:  RAX = system physical address (1GB aligned)
+ *           RCX = operation type
+ *   Output: CF set if the range was optimized
+ */
+static inline bool __rmpopt(u64 pa_start, u64 op_type)
+{
+	bool optimized;
+
+	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
+		     : "=@ccc" (optimized)
+		     : "a" (pa_start), "c" (op_type)
+		     : "memory", "cc");
+
+	return optimized;
+}
+
+static void rmpopt(u64 pa)
+{
+	u64 pa_start = ALIGN_DOWN(pa, SZ_1G);
+	u64 op_type = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+	__rmpopt(pa_start, op_type);
+}
+
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_smp(void *val)
+{
+	rmpopt((u64)val);
+}
+
+/*
+ * Leader function for work_on_cpu(): runs the full RMPOPT scan in
+ * process context on a CPU that has RMPOPT_BASE MSR programmed.
+ */
+static long rmpopt_leader_fn(void *arg)
+{
+	phys_addr_t pa;
+
+	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+		rmpopt(pa);
+		cond_resched();
+	}
+	return 0;
+}
+
+/*
+ * RMPOPT optimizations skip RMP checks at 1GB granularity if this
+ * range of memory does not contain any SNP guest memory.
+ */
+static void rmpopt_work_handler(struct work_struct *work)
+{
+	cpumask_var_t follower_mask;
+	phys_addr_t pa;
+	int this_cpu;
+
+	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
+		rmpopt_pa_start, rmpopt_pa_end);
+
+	if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
+		return;
+
+	/*
+	 * RMPOPT scans the RMP table, stores the result of the scan in the
+	 * reserved processor memory. The RMP scan is the most expensive
+	 * part. If a second RMPOPT occurs, it can skip the expensive scan
+	 * if they can see a cached result in the reserved processor memory.
+	 *
+	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
+	 * on every other primary thread. Followers are "designed to"
+	 * skip the scan if they see the "cached" scan results.
+	 */
+	cpumask_copy(follower_mask, &rmpopt_cpumask);
+
+	/*
+	 * Pin the worker to the current CPU for the leader loop so that
+	 * this_cpu remains valid and the RMPOPT instruction executes on
+	 * the correct CPU.
+	 *
+	 * Use migrate_disable() rather than get_cpu() to prevent
+	 * migration while still allowing preemption.
+	 */
+	migrate_disable();
+	this_cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(this_cpu, follower_mask)) {
+		/*
+		 * Current CPU is a primary thread in rmpopt_cpumask.
+		 * Run leader locally and remove from follower mask.
+		 */
+		cpumask_clear_cpu(this_cpu, follower_mask);
+
+		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+			rmpopt(pa);
+			cond_resched();
+		}
+	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
+				      follower_mask)) {
+		/*
+		 * Current CPU is a sibling thread whose primary is in
+		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
+		 * is safe to run the leader locally.  Remove the sibling's
+		 * primary from the follower mask as this core is already
+		 * covered by the leader.
+		 */
+		cpumask_andnot(follower_mask, follower_mask,
+			       topology_sibling_cpumask(this_cpu));
+
+		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+			rmpopt(pa);
+			cond_resched();
+		}
+	} else {
+		/*
+		 * Current CPU does not have RMPOPT_BASE MSR programmed.
+		 * Pick an explicit leader from the cpumask to avoid #UD.
+		 * Use work_on_cpu() to run in process context on the leader,
+		 * avoiding IPI latency.
+		 */
+		int leader_cpu = cpumask_first(follower_mask);
+
+		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
+			migrate_enable();
+			goto out;
+		}
+
+		cpumask_clear_cpu(leader_cpu, follower_mask);
+
+		/* Release migration pin before work_on_cpu(). */
+		migrate_enable();
+
+		work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);
+
+		goto followers;
+	}
+
+	migrate_enable();
+
+followers:
+	/*
+	 * Followers: run RMPOPT on remaining cores.
+	 * CPU hotplug is disabled while SNP is active
+	 * (cpu_hotplug_disable() in __sev_snp_init_locked()),
+	 * so cpus_read_lock() is uncontended.
+	 */
+	cpus_read_lock();
+	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+		on_each_cpu_mask(follower_mask, rmpopt_smp,
+				 (void *)pa, true);
+
+		 /* Give a chance for other threads to run */
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+out:
+	free_cpumask_var(follower_mask);
+}
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
@@ -607,11 +790,37 @@ void snp_setup_rmpopt(void)
 	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
 		return;
 
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	/*
+	 * Guard against re-initialization.  When SNP_SHUTDOWN_EX is issued
+	 * with x86_snp_shutdown=0, snp_shutdown() is not called and
+	 * rmpopt_cleanup() is skipped, but snp_initialized is still cleared.
+	 * A subsequent __sev_snp_init_locked() would call snp_setup_rmpopt()
+	 * again, leaking the existing workqueue, delayed work, debugfs
+	 * entries, and cpumask state.
+	 */
+	if (rmpopt_wq)
+		return;
+
+	/*
+	 * Create an RMPOPT-specific workqueue to avoid scheduling
+	 * RMPOPT workitem on the global system workqueue.
+	 */
+	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
+	if (!rmpopt_wq) {
+		pr_err("Failed to allocate RMPOPT workqueue\n");
+		return;
+	}
+
+	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
+
 	cpus_read_lock();
 
 	/*
 	 * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
-	 * to set up the RMPOPT_BASE MSR.
+	 * to set up the RMPOPT_BASE MSR. Likewise, only one thread per core
+	 * needs to issue the RMPOPT instruction.
 	 *
 	 * Note: only online primary threads are included.  If a core's
 	 * primary thread is offline, that core is not covered.  CPU hotplug
@@ -635,6 +844,21 @@ void snp_setup_rmpopt(void)
 		WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base));
 
 	cpus_read_unlock();
+
+	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+
+	/* Limit memory scanning to 2TB of RAM */
+	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T) {
+		pr_info("RMPOPT coverage limited to 2TB; memory above 0x%llx not optimized\n",
+			rmpopt_pa_start + SZ_2T);
+		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
+	}
+
+	/*
+	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
+	 * optimizations on all physical memory.
+	 */
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: Ashish Kalra @ 2026-06-15 19:49 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The SEV firmware enumerates the CPUs at SNP initialization and is not
aware of the OS bringing CPUs online or offline afterwards, so OS CPU
hotplug can diverge from the firmware's expectations and break SNP.
Disable CPU hotplug while SNP is active.

SNP is fully torn down only on the SNP_SHUTDOWN_EX x86_snp_shutdown
path; the legacy path leaves SNP enabled in hardware while clearing
snp_initialized, so __sev_snp_init_locked() can run again.  Track the
disable with a flag so it is balanced by a matching enable rather than
stacked, and re-enable hotplug only on the x86_snp_shutdown path, after
snp_shutdown() has cleared the per-core RMPOPT_BASE MSRs with hotplug
still disabled.

This also keeps the CPU set stable for the asynchronous RMPOPT scan
added later in this series, and ensures cpus_read_lock() in the scan
is uncontended.
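
The flag-guarded pairing described above can be modeled in user space
roughly as follows. This is only a sketch: the counter stands in for the
kernel's hotplug-disable depth, every model_* name is hypothetical rather
than taken from the patch, and the panic case is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's hotplug-disable depth. */
int model_hotplug_disable_depth;

/* Plays the role of snp_cpu_hotplug_disabled in the patch. */
bool model_snp_hotplug_disabled;

/* Runs on every SNP init; disables hotplug at most once. */
void model_snp_init(void)
{
	if (!model_snp_hotplug_disabled) {
		model_hotplug_disable_depth++;
		model_snp_hotplug_disabled = true;
	}
}

/*
 * Runs on every SNP shutdown.  Only a full teardown (x86_snp_shutdown
 * set) re-enables hotplug; the legacy path leaves it disabled.
 */
void model_snp_shutdown(bool x86_snp_shutdown)
{
	if (x86_snp_shutdown && model_snp_hotplug_disabled) {
		model_hotplug_disable_depth--;
		model_snp_hotplug_disabled = false;
	}
}
```

In this model, an init, a legacy shutdown, and a second init leave the
depth at 1 rather than 2, so a single full shutdown brings it back to 0.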

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 drivers/crypto/ccp/sev-dev.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 217b6b19802e..c8c3c577463c 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -106,6 +106,9 @@ struct snp_hv_fixed_pages_entry {
 
 static LIST_HEAD(snp_hv_fixed_pages);
 
+/* Set while SNP has CPU hotplug disabled. */
+static bool snp_cpu_hotplug_disabled;
+
 /* Trusted Memory Region (TMR):
  *   The TMR is a 1MB area that must be 1MB aligned.  Use the page allocator
  *   to allocate the memory, which will return aligned memory for the specified
@@ -1479,6 +1482,17 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
 
 	snp_hv_fixed_pages_state_update(sev, HV_FIXED);
 
+	/*
+	 * Disable CPU hotplug while SNP is active.  Guard against stacking
+	 * the disable count: the legacy SNP_SHUTDOWN_EX path clears
+	 * snp_initialized without re-enabling hotplug, so this can run
+	 * again while hotplug is already disabled.
+	 */
+	if (!snp_cpu_hotplug_disabled) {
+		cpu_hotplug_disable();
+		snp_cpu_hotplug_disabled = true;
+	}
+
 	snp_setup_rmpopt();
 
 	sev->snp_initialized = true;
@@ -2083,8 +2097,21 @@ static int __sev_snp_shutdown_locked(int *error, bool panic)
 	}
 
 	if (data.x86_snp_shutdown) {
-		if (!panic)
+		if (!panic) {
 			snp_shutdown();
+			/*
+			 * snp_shutdown() fully tears SNP down (clear_rmp()) and
+			 * has already cleared the per-core RMPOPT_BASE MSRs via
+			 * rmpopt_cleanup() with hotplug still disabled.  Re-enable
+			 * CPU hotplug now.  On the legacy path SNP stays
+			 * enabled in hardware, so hotplug is correctly left
+			 * disabled.
+			 */
+			if (snp_cpu_hotplug_disabled) {
+				cpu_hotplug_enable();
+				snp_cpu_hotplug_disabled = false;
+			}
+		}
 		snp_hv_fixed_pages_state_update(sev, ALLOCATED);
 	} else {
 		/*
-- 
2.43.0


^ permalink raw reply related

* [PATCH v8 2/7] x86/sev: Initialize RMPOPT configuration MSRs
From: Ashish Kalra @ 2026-06-15 19:48 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The new RMPOPT instruction helps manage per-CPU RMP optimization
structures inside the CPU. It takes a 1GB-aligned physical address
and either returns the status of the optimizations or tries to enable
the optimizations.

Per-CPU RMPOPT tables support at most 2 TB of addressable memory for
RMP optimizations.

Initialize the per-CPU RMPOPT table base to the starting physical
address. This enables RMP optimization for up to 2 TB of system RAM on
all CPUs.
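
As a rough illustration of the address arithmetic, the sketch below
computes the RMPOPT_BASE value and the capped end of the covered range
with plain integers. The model_* helpers and MODEL_* constants are
hypothetical user-space stand-ins for ALIGN_DOWN(), ALIGN(), SZ_1G and
SZ_2T, and the enable bit is assumed to be bit 0 as defined in
msr-index.h by this patch.

```c
#include <stdint.h>

#define MODEL_SZ_1G		(1ULL << 30)
#define MODEL_SZ_2T		(1ULL << 41)
#define MODEL_RMPOPT_ENABLE	(1ULL << 0)

/* Round down to a 1GB boundary (1GB is a power of two). */
uint64_t model_align_down_1g(uint64_t pa)
{
	return pa & ~(MODEL_SZ_1G - 1);
}

/* Round up to a 1GB boundary. */
uint64_t model_align_up_1g(uint64_t pa)
{
	return (pa + MODEL_SZ_1G - 1) & ~(MODEL_SZ_1G - 1);
}

/* Value for RMPOPT_BASE: 1GB-aligned start address plus the enable bit. */
uint64_t model_rmpopt_base(uint64_t ram_start)
{
	return model_align_down_1g(ram_start) | MODEL_RMPOPT_ENABLE;
}

/* End of the covered range, capped at 2TB above the aligned start. */
uint64_t model_rmpopt_end(uint64_t ram_start, uint64_t ram_end)
{
	uint64_t start = model_align_down_1g(ram_start);
	uint64_t end = model_align_up_1g(ram_end);

	if (end - start > MODEL_SZ_2T)
		end = start + MODEL_SZ_2T;
	return end;
}
```

For example, with RAM starting at 0 and ending at 3TB, the covered range
in this model ends at 2TB, so the top 1TB is left unoptimized.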

Additionally, add support to set up and enable RMPOPT once SNP is
enabled and initialized.

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/coco/core.c             |  2 +
 arch/x86/include/asm/msr-index.h |  3 ++
 arch/x86/include/asm/sev.h       |  4 ++
 arch/x86/virt/svm/sev.c          | 70 ++++++++++++++++++++++++++++++++
 drivers/crypto/ccp/sev-dev.c     |  3 ++
 5 files changed, 82 insertions(+)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 989ca9f72ba3..8c1393ddc5df 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -16,6 +16,7 @@
 #include <asm/archrandom.h>
 #include <asm/coco.h>
 #include <asm/processor.h>
+#include <asm/sev.h>
 
 enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
 SYM_PIC_ALIAS(cc_vendor);
@@ -172,6 +173,7 @@ static void amd_cc_platform_clear(enum cc_attr attr)
 	switch (attr) {
 	case CC_ATTR_HOST_SEV_SNP:
 		cc_flags.host_sev_snp = 0;
+		snp_clear_rmpopt_configured();
 		break;
 	default:
 		break;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 86554de9a3f5..28540744f1eb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -761,6 +761,9 @@
 #define MSR_AMD64_SEG_RMP_ENABLED_BIT	0
 #define MSR_AMD64_SEG_RMP_ENABLED	BIT_ULL(MSR_AMD64_SEG_RMP_ENABLED_BIT)
 #define MSR_AMD64_RMP_SEGMENT_SHIFT(x)	(((x) & GENMASK_ULL(13, 8)) >> 8)
+#define MSR_AMD64_RMPOPT_BASE		0xc0010139
+#define MSR_AMD64_RMPOPT_ENABLE_BIT	0
+#define MSR_AMD64_RMPOPT_ENABLE		BIT_ULL(MSR_AMD64_RMPOPT_ENABLE_BIT)
 
 #define MSR_SVSM_CAA			0xc001f000
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..0d662221615a 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,8 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 int snp_prepare(void);
+void snp_setup_rmpopt(void);
+void snp_clear_rmpopt_configured(void);
 void snp_shutdown(void);
 #else
 static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -680,6 +682,8 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_setup_rmpopt(void) {}
+static inline void snp_clear_rmpopt_configured(void) {}
 static inline void snp_shutdown(void) {}
 #endif
 
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 8bcdce98f6dc..1b5c18408f0b 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -124,6 +124,10 @@ static void *rmp_bookkeeping __ro_after_init;
 
 static u64 probed_rmp_base, probed_rmp_size;
 
+static cpumask_t rmpopt_cpumask;
+static phys_addr_t rmpopt_pa_start;
+static bool rmpopt_configured;
+
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
@@ -490,7 +494,12 @@ static bool __init setup_rmptable(void)
 	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
 		if (!setup_segmented_rmptable())
 			return false;
+		rmpopt_configured = true;
 	} else {
+		/*
+		 * RMPOPT requires a segmented RMP table, so leave
+		 * rmpopt_configured clear on contiguous RMP systems.
+		 */
 		if (!setup_contiguous_rmptable())
 			return false;
 	}
@@ -555,6 +564,21 @@ int snp_prepare(void)
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
 
+static void rmpopt_cleanup(void)
+{
+	int cpu;
+
+	cpus_read_lock();
+
+	for_each_cpu(cpu, &rmpopt_cpumask)
+		WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0));
+
+	cpus_read_unlock();
+
+	cpumask_clear(&rmpopt_cpumask);
+	rmpopt_pa_start = 0;
+}
+
 void snp_shutdown(void)
 {
 	u64 syscfg;
@@ -563,11 +587,57 @@ void snp_shutdown(void)
 	if (syscfg & MSR_AMD64_SYSCFG_SNP_EN)
 		return;
 
+	rmpopt_cleanup();
+
 	clear_rmp();
 	on_each_cpu(mfd_reconfigure, NULL, 1);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
 
+void snp_clear_rmpopt_configured(void)
+{
+	rmpopt_configured = false;
+}
+
+void snp_setup_rmpopt(void)
+{
+	u64 rmpopt_base;
+	int cpu;
+
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT) || !rmpopt_configured)
+		return;
+
+	cpus_read_lock();
+
+	/*
+	 * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
+	 * to set up the RMPOPT_BASE MSR.
+	 *
+	 * Note: only online primary threads are included.  If a core's
+	 * primary thread is offline, that core is not covered.  CPU hotplug
+	 * is not currently supported with SNP enabled.
+	 */
+
+	for_each_online_cpu(cpu)
+		if (topology_is_primary_thread(cpu))
+			cpumask_set_cpu(cpu, &rmpopt_cpumask);
+
+	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
+
+	/*
+	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
+	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
+	 * to the starting physical address to enable RMP optimizations for
+	 * up to 2 TB of system RAM on all CPUs.
+	 */
+	for_each_cpu(cpu, &rmpopt_cpumask)
+		WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base));
+
+	cpus_read_unlock();
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
+
 /*
  * Do the necessary preparations which are verified by the firmware as
  * described in the SNP_INIT_EX firmware command description in the SNP
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 78f98aee7a66..217b6b19802e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1478,6 +1478,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
 	}
 
 	snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+
+	snp_setup_rmpopt();
+
 	sev->snp_initialized = true;
 	dev_dbg(sev->dev, "SEV-SNP firmware initialized, SEV-TIO is %s\n",
 		data.tio_en ? "enabled" : "disabled");
-- 
2.43.0


^ permalink raw reply related

* [PATCH v8 1/7] x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
From: Ashish Kalra @ 2026-06-15 19:48 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1781419998.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Add a flag indicating whether the RMPOPT instruction is supported.

RMPOPT is a new instruction that reduces the performance overhead of
RMP checks for the hypervisor and non-SNP guests by allowing those
checks to be skipped when 1-GB memory regions are known to contain no
SEV-SNP guest memory.

For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
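
For readers unfamiliar with the feature-flag encoding, here is a small
user-space sketch of how the (3*32+7) value splits into a capability
word and a bit, and of the CPUID test that scattered.c describes (leaf
0x80000025, EDX bit 0). The model_* helpers are hypothetical and do not
exist in the kernel.

```c
#include <stdint.h>

/* Same encoding as the cpufeatures.h entry: word 3, bit 7. */
#define MODEL_X86_FEATURE_RMPOPT	(3 * 32 + 7)

/* Index of the 32-bit capability word that holds the feature. */
unsigned int model_feature_word(unsigned int feature)
{
	return feature / 32;
}

/* Bit position of the feature inside its capability word. */
unsigned int model_feature_bit(unsigned int feature)
{
	return feature % 32;
}

/* Test EDX bit 0 of CPUID leaf 0x80000025; the caller supplies EDX. */
int model_cpuid_reports_rmpopt(uint32_t edx)
{
	return (edx >> 0) & 1;
}
```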

Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/cpufeatures.h       | 2 +-
 arch/x86/kernel/cpu/scattered.c          | 1 +
 tools/arch/x86/include/asm/cpufeatures.h | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 1d506e5d6f46..794cc96b8493 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
 #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
 #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
 #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free                                 ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
 #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
 #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
 #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 937129ce6a96..021c0bf22de2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -67,6 +67,7 @@ static const struct cpuid_bit cpuid_bits[] = {
 	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_PMC_FREEZE,	CPUID_EAX,  2, 0x80000022, 0 },
+	{ X86_FEATURE_RMPOPT,			CPUID_EDX,  0, 0x80000025, 0 },
 	{ X86_FEATURE_AMD_HTR_CORES,		CPUID_EAX, 30, 0x80000026, 0 },
 	{ 0, 0, 0, 0, 0 }
 };
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 86d17b195e79..7ce681af1dd7 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
 #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
 #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
 #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free                                 ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
 #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
 #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
 #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
-- 
2.43.0


^ permalink raw reply related

* [PATCH v8 0/7] Add RMPOPT support.
From: Ashish Kalra @ 2026-06-15 19:47 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco

From: Ashish Kalra <ashish.kalra@amd.com>

In the SEV-SNP architecture, the hypervisor and non-SNP guests are
subject to RMP checks on writes to protect the integrity of SEV-SNP
guest memory.

The RMPOPT architecture enables optimizations whereby the RMP checks
can be skipped if 1GB regions of memory are known to not contain any
SNP guest memory.

RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks for the hypervisor and non-SNP guests.

The RMPOPT instruction currently supports two functions. With the
verify-and-report-status function, the CPU reads the RMP contents and
checks that every RMP entry in the 1GB region starting at the provided
SPA is HV-owned (i.e., not in the assigned state). It then updates the
RMPOPT table to record whether the optimization is enabled for that
region and indicates to software whether the optimization succeeded.

With the report-status function, the CPU returns the optimization
status for the 1GB region.

The RMPOPT table is managed by a combination of software and hardware.
Software uses the RMPOPT instruction to set bits in the table,
indicating that regions of memory are entirely HV-owned.  Hardware
automatically clears bits in the RMPOPT table when the RMPUPDATE
instruction changes RMP contents.

For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
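
The software/hardware split described above can be pictured with a toy
user-space model: one flag per 1GB region, set by an RMPOPT-like verify
operation and cleared by an RMPUPDATE-like operation. This is a sketch
under stated assumptions, not the hardware's actual data structure. The
per-region count of assigned pages is a simplification of the RMP, and
all model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_SZ_1G		(1ULL << 30)
#define MODEL_NR_REGIONS	2048	/* 2TB of coverage / 1GB per region */

struct model_rmpopt_table {
	uint64_t base;				/* 1GB-aligned start of coverage */
	bool optimized[MODEL_NR_REGIONS];	/* set by RMPOPT, cleared by RMPUPDATE */
	unsigned int assigned[MODEL_NR_REGIONS];/* assigned pages per region */
};

void model_table_init(struct model_rmpopt_table *t, uint64_t base)
{
	memset(t, 0, sizeof(*t));
	t->base = base;
}

/* Map a physical address to its 1GB region, or -1 if it is not covered. */
long model_region(const struct model_rmpopt_table *t, uint64_t pa)
{
	uint64_t idx;

	if (pa < t->base)
		return -1;
	idx = (pa - t->base) / MODEL_SZ_1G;
	return idx < MODEL_NR_REGIONS ? (long)idx : -1;
}

/* Software side: "verify and report status". */
bool model_rmpopt_verify(struct model_rmpopt_table *t, uint64_t pa)
{
	long r = model_region(t, pa);

	if (r < 0)
		return false;
	t->optimized[r] = (t->assigned[r] == 0);
	return t->optimized[r];
}

/* Software side: "report status" only, without touching the table. */
bool model_rmpopt_report(const struct model_rmpopt_table *t, uint64_t pa)
{
	long r = model_region(t, pa);

	return r >= 0 && t->optimized[r];
}

/* Hardware side: any RMP change drops the optimization for the region. */
void model_rmpupdate(struct model_rmpopt_table *t, uint64_t pa, bool assign)
{
	long r = model_region(t, pa);

	if (r < 0)
		return;
	if (assign)
		t->assigned[r]++;
	else if (t->assigned[r] > 0)
		t->assigned[r]--;
	t->optimized[r] = false;
}
```

Note that in this model a region stays unoptimized after its pages are
returned to the hypervisor until software runs the verify operation
again, which is the gap the re-optimization work in this series fills.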

As SNP is enabled by default, the hypervisor and non-SNP guests are
subject to RMP write checks to provide integrity of SNP guest memory.

This patch series adds support to enable RMP optimizations for up to
2TB of system RAM across the system and allows RMPUPDATE to disable
those optimizations as SNP guests are launched.

Support for RAM larger than 2TB will be added in a follow-on series.

This series also adds support to disable CPU hotplug while SNP is
active, as the SEV firmware enumerates CPUs at SNP initialization and is
not aware of the OS bringing CPUs online or offline afterwards.  This
also keeps the set of CPUs stable for the asynchronous RMPOPT scan, so
the per-core RMPOPT_BASE MSRs programmed during setup remain valid.

This series also introduces support to re-enable RMP optimizations
during SNP guest termination, after guest pages have been converted
back to shared.

RMP optimizations are performed asynchronously by queuing work on a
dedicated workqueue after a 10 second delay.

Delaying work allows batching of multiple SNP guest terminations.
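
To illustrate why the choice of queueing primitive matters for
batching, the sketch below models the difference between
queue_delayed_work()-style and mod_delayed_work()-style behaviour with
plain timestamps. It is a user-space approximation: the struct, the
model_* functions and the millisecond clock are hypothetical, and only
the 10 second delay comes from the description above.

```c
#include <stdint.h>

#define MODEL_RMPOPT_DELAY_MS	10000ULL

struct model_delayed_work {
	int pending;		/* non-zero while the work is queued */
	uint64_t expires_ms;	/* time at which the handler would run */
};

/* queue_delayed_work()-like: a pending work item keeps its deadline. */
void model_queue(struct model_delayed_work *w, uint64_t now_ms)
{
	if (w->pending)
		return;
	w->pending = 1;
	w->expires_ms = now_ms + MODEL_RMPOPT_DELAY_MS;
}

/* mod_delayed_work()-like: every call restarts the delay. */
void model_mod(struct model_delayed_work *w, uint64_t now_ms)
{
	w->pending = 1;
	w->expires_ms = now_ms + MODEL_RMPOPT_DELAY_MS;
}
```

With guest terminations at t=0s and t=4s, the first variant would run
the scan at t=10s, while the second runs it at t=14s, so the delay
tracks the last termination.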

Once 1GB hugetlb guest_memfd support is merged, support for
re-enabling RMPOPT optimizations during 1GB page cleanup will be added
in a follow-on series.

Additionally, add a debugfs interface to report per-CPU RMPOPT status
across all system RAM.

v8:
- Add a new patch to disable CPU hotplug while SNP is active, keeping
  the CPU set stable for the RMPOPT work handler.
- Drop the setup_clear_cpu_cap(X86_FEATURE_RMPOPT) calls; the
  rmpopt_configured bool is the runtime guard.
- WARN_ON_ONCE() on the RMPOPT_BASE MSR writes that previously ignored
  their return value.
- Run the RMPOPT leader scan via work_on_cpu() instead of
  smp_call_function_single() so it executes in process context.  This
  fixes the AB-BA deadlock between migrate_disable() and cpus_read_lock()
  and avoids running the long RMP scan in IPI context with interrupts
  disabled.
- Use mod_delayed_work() in snp_rmpopt_all_physmem() so the batching
  delay tracks the last SNP guest termination.

  Sashiko AI code review identified several of the above issues.

v7:
- Sync tools/arch/x86/include/asm/cpufeatures.h to mirror the kernel
  header for X86_FEATURE_RMPOPT.
- Fix commit title to use X86_FEATURE_RMPOPT to match the code
  (was X86_FEATURE_AMD_RMPOPT).
- Add static bool rmpopt_configured, set only when segmented RMP setup
  succeeds in setup_rmptable().  Check rmpopt_configured alongside
  cpu_feature_enabled(X86_FEATURE_RMPOPT) in snp_setup_rmpopt() and
  snp_rmpopt_all_physmem(), because setup_clear_cpu_cap() is unreliable
  after alternatives are patched.  Add snp_clear_rmpopt_configured()
  called from amd_cc_platform_clear() when CC_ATTR_HOST_SEV_SNP is
  cleared.  Do not use __ro_after_init on rmpopt_configured since the
  writer snp_clear_rmpopt_configured() is not __init.
- Add cond_resched() to all three leader loops in rmpopt_work_handler()
  to prevent soft lockups on systems with up to 2TB of RAM.
- Add comment above __rmpopt() documenting the RMPOPT instruction
  encoding (F2 0F 01 FC) and register interface (RAX = system physical
  address input, RCX = operation type input, RFLAGS.CF = output).
  Note: RMPOPT does not modify RAX unlike PVALIDATE/RMPUPDATE, so
  the existing "a" (input-only) constraint is correct.

  Sashiko AI code review identified several of the above issues.

v6:
- Drop wrmsrq_on_cpus() helper; use for_each_cpu() with wrmsrq_on_cpu()
  instead, as RMPOPT_BASE MSR programming is not performance-critical.
- Rewrite rmpopt_work_handler() leader selection to use a local
  follower_mask copy instead of modifying the global rmpopt_cpumask.
  This eliminates the current_cpu_cleared tracking and the restore at
  the end, and removes the need for synchronization comments about
  transient cpumask inconsistency.
- Add three-way leader selection in rmpopt_work_handler():
  1. Current CPU is a primary thread in cpumask: run leader locally.
  2. Current CPU is a sibling thread whose primary is in cpumask:
     run leader locally (RMPOPT_BASE MSR is per-core), remove the
     primary from followers via cpumask_andnot(topology_sibling_cpumask).
  3. Current CPU's core has no RMPOPT_BASE MSR programmed: pick an
     explicit leader via cpumask_first() + smp_call_function_single()
     to avoid #UD, with cpus_read_lock() around the IPI loop.
- Add WARN_ON_ONCE guard for empty cpumask in the explicit leader
  fallback path, with migrate_enable() before goto out.
- Add .llseek = seq_lseek to rmpopt_table_fops for consistency with
  other seq_file-based debugfs files and to support tools like "less".
- Change debugfs file permissions from 0444 to 0400 to restrict access
  to root only.
- Add comment in rmpopt_table_seq_show() explaining why cpu_online_mask
  is safe: RMPOPT_BASE MSR is per-core and snp_prepare() ensures all
  CPUs are online when the MSR is programmed.

  Sashiko AI code review identified several of the above issues.

v5:
- Introduce rmpopt_cleanup() to tear down workqueue, debugfs, cpumask,
  and MSR state, called from snp_shutdown().
- Introduce rmpopt_wq_mutex to serialize snp_setup_rmpopt(),
  snp_rmpopt_all_physmem(), and rmpopt_cleanup().
- Introduce rmpopt_show_mutex to serialize debugfs reporting of
  rmpopt_report_cpumask.
- Move snp_rmpopt_all_physmem() call after SNP DECOMMISSION during
  guest shutdown.
- Use migrate_disable()/migrate_enable() for CPU pinning in the
  rmpopt_work_handler() leader loop to maintain CPU affinity without
  disabling preemption for the entire RMPOPT scan.
- Add cpus_read_lock()/cpus_read_unlock() around the follower
  on_each_cpu_mask() loop in rmpopt_work_handler().
- Guard snp_setup_rmpopt() against re-initialization when
  SNP_SHUTDOWN_EX with x86_snp_shutdown=0 skips rmpopt_cleanup()
  but clears snp_initialized, preventing workqueue and resource
  leaks on repeated init/shutdown cycles.
- Replace setup_clear_cpu_cap() with pr_err() on alloc_workqueue()
  failure in snp_setup_rmpopt(), as setup_clear_cpu_cap() cannot be
  used after alternatives are patched; callers check rmpopt_wq != NULL
  as the runtime guard instead.
- Add pr_info() when RMPOPT coverage is capped at 2TB.
- Add comments noting CPU hotplug is not supported with SNP enabled
  and only online primary threads are covered by rmpopt_cpumask.
- Add comment in setup_rmptable() noting Segmented RMP must be
  enabled to enable RMPOPT.
- Simplify cpumask setup loop to set if primary thread rather than
  skip if not primary.
- Improve grammar and clarity in snp_setup_rmpopt() comments.
- Added Reviewed-by's.

  Sashiko AI code review identified several of the above issues.

v4:
- Add new wrmsrq_on_cpus() helper to write the same u64 value to a
  per-CPU MSR across a cpumask without per-cpu struct allocation
  overhead.
- Rename configure_and_enable_rmpopt() to snp_setup_rmpopt().
- Use wrmsrq_on_cpus() instead of wrmsrq_on_cpu() loop for
  programming RMPOPT_BASE MSRs.
- Add setup_clear_cpu_cap(X86_FEATURE_RMPOPT) if segmented RMP
  setup fails or workqueue allocation fails.
- Add X86_FEATURE_RMPOPT feature clear logic in amd_cc_platform_clear()
  for CC_ATTR_HOST_SEV_SNP.
- All of the above allow checking for only X86_FEATURE_RMPOPT for both
  RMPOPT setup/enable and RMP re-optimizations.
- Rename snp_perform_rmp_optimization() to snp_rmpopt_all_physmem().
- Split rmpopt() into rmpopt() and rmpopt_smp() for SMP callback use.
- Introduce separate rmpopt_report_cpumask for debugfs reporting,
  distinct from rmpopt_cpumask used for primary thread tracking.
- Remove snp_perform_rmp_optimization() call from __sev_snp_init_locked()
  and instead setup and enable RMPOPT after SNP is enabled and
  initialized.

v3:
- Drop all RMPOPT kthread support and introduce a custom, dedicated
  workqueue to schedule delayed, asynchronous RMPOPT work.
- Drop the guest_memfd inode cleanup interface and add support to
  re-enable RMP optimizations during guest shutdown using the
  asynchronous and delayed workqueue interface.
- Introduce new __rmpopt() helper and rmpopt() and
  rmpopt_report_status() wrappers on top which use rax and rcx
  parameters to closely match RMPOPT specs.
- Use new optimized RMPOPT loop to issue RMPOPT instructions on all
  system RAM up to 2TB and all CPUs, by optimizing each range on one CPU
  first, then letting other CPUs execute RMPOPT in parallel so they can
  skip most work as the range has already been optimized.
- Also add support for running the optimized RMPOPT loop only on
  one thread per core.
- Replace all PUD_SIZE references with SZ_1G to conform to 1GB regions
  as specified by RMPOPT specifications and not be dependent on PUD_SIZE,
  which makes the RMPOPT patch-set independent of x86 page table sizes.
- Use wrmsrq_on_cpu() to program the RMPOPT_BASE MSR registers on
  all CPUs, which removes the ugly casting needed to use
  on_each_cpu_mask().
- Fix inline comments and patch commit messages.


v2:
- Drop all NUMA and Socket configuration and enablement support and
  enable RMPOPT support for up to 2TB of system RAM.
- Drop get_cpumask_of_primary_threads() and enable per-core RMPOPT
  base MSRs and issue RMPOPT instruction on all CPUs.
- Drop the configfs interface to manually re-enable RMP optimizations.
- Add new guest_memfd cleanup interface to automatically re-enable
  RMP optimizations during guest shutdown.
- Include references to the public RMPOPT documentation.
- Move debugfs directory for RMPOPT under architecture-specific
  parent directory.

Ashish Kalra (7):
  x86/cpufeatures: Add X86_FEATURE_RMPOPT feature flag
  x86/sev: Initialize RMPOPT configuration MSRs
  crypto/ccp: Disable CPU hotplug while SNP is active
  x86/sev: Add support to perform RMP optimizations asynchronously
  x86/sev: Add interface to re-enable RMP optimizations.
  KVM: SEV: Perform RMP optimizations on SNP guest shutdown
  x86/sev: Add debugfs support for RMPOPT

 arch/x86/coco/core.c                     |   2 +
 arch/x86/include/asm/cpufeatures.h       |   2 +-
 arch/x86/include/asm/msr-index.h         |   3 +
 arch/x86/include/asm/sev.h               |   6 +
 arch/x86/kernel/cpu/scattered.c          |   1 +
 arch/x86/kvm/svm/sev.c                   |   2 +
 arch/x86/virt/svm/sev.c                  | 437 +++++++++++++++++++++++
 drivers/crypto/ccp/sev-dev.c             |  32 +-
 tools/arch/x86/include/asm/cpufeatures.h |   2 +-
 9 files changed, 484 insertions(+), 3 deletions(-)

-- 
2.43.0


^ permalink raw reply

* Re: [PATCH RFC 2/3] KVM: guest_memfd: support folio migration for non-confidential VMs
From: David Hildenbrand (Arm) @ 2026-06-15 18:35 UTC (permalink / raw)
  To: Shivank Garg, Matthew Wilcox (Oracle), Jan Kara, Andrew Morton,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Paolo Bonzini, Shuah Khan, Chao Peng,
	Nikunj A Dadhania, Ira Weiny, Michael Roth, Pankaj Gupta,
	Ackerley Tng, Fuad Tabba, Sean Christopherson, Vishal Annapurve,
	Nikita Kalyazin, Patrick Roy, Pratik Sampat, Ashish Kalra
  Cc: linux-fsdevel, linux-coco, linux-mm, linux-kernel, kvm,
	linux-kselftest
In-Reply-To: <20260611-shivank-gmem-migrate-v1-2-2d266bfc6f95@amd.com>

On 6/11/26 15:05, Shivank Garg wrote:
> guest_memfd folios are currently marked unmovable, so the kernel
> cannot perform NUMA-balancing, memory compaction, etc.
> This is unavoidable for confidential VMs (SEV-SNP, TDX),
> since memory is encrypted and copying it needs firmware assistance.
> However, for non-confidential VMs (like Firecracker), we can migrate
> the folios.
> 
> Mark non-confidential VMs as movable and implement
> kvm_gmem_migrate_folio() using filemap_migrate_folio().
> 
> This lays the groundwork for migrating confidential guest_memfd
> later. Once the firmware-assisted copying support is available,
> those VMs can be made movable. The confidential folio content can
> be copied separately, and the destination folio can be marked with
> FOLIO_CONTENT_COPIED so __migrate_folio() skips the host-side
> folio_mc_copy().
> 
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>  virt/kvm/guest_memfd.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 806a42f0e031a1c7729f53c786316d2502532553..e4470106fc7792f328bce5275419683328c8b4ab 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -487,13 +487,45 @@ static struct file_operations kvm_gmem_fops = {
>  	.fallocate	= kvm_gmem_fallocate,
>  };
>  
> +#ifdef CONFIG_MIGRATION
>  static int kvm_gmem_migrate_folio(struct address_space *mapping,
>  				  struct folio *dst, struct folio *src,
>  				  enum migrate_mode mode)
>  {
> -	WARN_ON_ONCE(1);
> -	return -EINVAL;
> +	struct inode *inode = mapping->host;
> +	pgoff_t start, end;
> +	int ret;
> +
> +	if (!filemap_invalidate_trylock_shared(mapping))
> +		return -EAGAIN;
> +
> +	start = src->index;
> +	end = start + folio_nr_pages(src);
> +
> +	kvm_gmem_invalidate_begin(inode, start, end);
> +
> +	/*
> +	 * For non-confidential guest_memfd the folio is host-readable,
> +	 * so filemap_migrate_folio() can copy the contents itself via
> +	 * folio_mc_copy().
> +	 *
> +	 * This is also the hook point for confidential VMs (SEV-SNP, TDX) once
> +	 * they are made movable: the host cannot copy encrypted/private memory,
> +	 * so a firmware-assisted copy would run here.
> +	 * Idea: https://lore.kernel.org/r/20260428155043.39251-8-shivankg@amd.com
> +	 * Mark the @dst->migrate_info field with FOLIO_CONTENT_COPIED, so
> +	 * __migrate_folio() skip folio_mc_copy() for confidential VMs.
> +	 */
> +	ret = filemap_migrate_folio(mapping, dst, src, mode);
> +
> +	kvm_gmem_invalidate_end(inode, start, end);
> +
> +	filemap_invalidate_unlock_shared(mapping);
> +	return ret;
>  }
> +#else
> +#define kvm_gmem_migrate_folio NULL
> +#endif
>  
>  static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *folio)
>  {
> @@ -592,9 +624,17 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  	inode->i_size = size;
>  	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>  	mapping_set_inaccessible(inode->i_mapping);
> -	mapping_set_unmovable(inode->i_mapping);
> -	/* Unmovable mappings are supposed to be marked unevictable as well. */
> -	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +
> +	/*
> +	 * Confidential VMs (SEV-SNP, TDX) bind encryption to the physical
> +	 * address and require firmware assisted copy, so their folios cannot
> +	 * be migrated yet.
> +	 */
> +	if (kvm_arch_has_private_mem(kvm)) {
> +		mapping_set_unmovable(inode->i_mapping);
> +		/* Unmovable mappings are supposed to be marked unevictable as well. */
> +		WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));

We would still want our movable mappings to be flagged unevictable.

> +	}
>

As discussed, for guest_memfd instances that support page migration, we would
also want to allocate the pages for guest_memfd with GFP_HIGHUSER_MOVABLE.

That is, handle the mapping_set_gfp_mask() call as well.

It will unlock access to areas reserved for movable allocations (CMA/
ZONE_MOVABLE) and properly let the page allocator group pages by mobility
(MOVABLE vs. UNMOVABLE vs. RECLAIMABLE).

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH RFC 0/3] KVM: guest_memfd: folio migration for non-confidential VMs
From: David Hildenbrand (Arm) @ 2026-06-15 18:30 UTC (permalink / raw)
  To: Alexandru Elisei, Shivank Garg
  Cc: Matthew Wilcox (Oracle), Jan Kara, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	Paolo Bonzini, Shuah Khan, Chao Peng, Nikunj A Dadhania,
	Ira Weiny, Michael Roth, Pankaj Gupta, Ackerley Tng, Fuad Tabba,
	Sean Christopherson, Vishal Annapurve, Nikita Kalyazin,
	Patrick Roy, Pratik Sampat, Ashish Kalra, linux-fsdevel,
	linux-coco, linux-mm, linux-kernel, kvm, linux-kselftest
In-Reply-To: <ai_XK__RTXMCEcCG@raptor>

On 6/15/26 12:43, Alexandru Elisei wrote:
> Hi,
> 
> On Thu, Jun 11, 2026 at 01:05:07PM +0000, Shivank Garg wrote:
>> guest_memfd folios are currently marked unmovable, so the kernel cannot
>> perform NUMA-balancing, memory compaction, etc. This is unavoidable for
>> confidential VMs (SEV-SNP, TDX), since memory is encrypted and copying it
>> needs firmware assistance. However, for non-confidential VMs (like
>> Firecracker), we can migrate the folios.
>>
>> This series enables folio migration for non-confidential guest_memfd and
>> also lays the groundwork for migrating confidential guest_memfd later.
>> Once firmware-assisted copying support is available, those VMs can be
>> made movable, the confidential folio content can be copied separately,
>> and the destination folio marked with FOLIO_CONTENT_COPIED so
>> __migrate_folio() skips the host-side folio_mc_copy().
> 
> I always thought that one of the nice things about using guest_memfd as a
> memory backend, as opposed to host userspace mappings, is that the host
> cannot unmap VM memory because of KSM, automatic NUMA balancing, hugepage
> collapse, compaction, etc, acting on the host userspace mapping of the
> VM memory, and outside of the VMM's or KVM's control.

Yeah, but it doesn't play nice with THPs / large folios. So if you want to run
something other than just confidential VMs on a hypervisor, you definitely want
guest_memfd to be as nice to the system as possible.

That is, support page migration if nothing speaks against it.

Now, if something speaks against it, for sure we can just leave the pages
unmovable.

Fortunately, the patch is rather trivial.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH RFC 0/3] KVM: guest_memfd: folio migration for non-confidential VMs
From: David Hildenbrand (Arm) @ 2026-06-15 18:24 UTC (permalink / raw)
  To: Sean Christopherson, Alexandru Elisei
  Cc: Shivank Garg, Matthew Wilcox (Oracle), Jan Kara, Andrew Morton,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Paolo Bonzini, Shuah Khan, Chao Peng,
	Nikunj A Dadhania, Ira Weiny, Michael Roth, Pankaj Gupta,
	Ackerley Tng, Fuad Tabba, Vishal Annapurve, Nikita Kalyazin,
	Patrick Roy, Pratik Sampat, Ashish Kalra, linux-fsdevel,
	linux-coco, linux-mm, linux-kernel, kvm, linux-kselftest
In-Reply-To: <ajA4z_Wkb93cTW4m@google.com>

On 6/15/26 19:39, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, Alexandru Elisei wrote:
>> Hi,
>>
>> On Mon, Jun 15, 2026 at 11:43:14AM +0100, Alexandru Elisei wrote:
>>> Hi,
>>>
>>>
>>> I always thought that one of the nice things about using guest_memfd as a
>>> memory backend, as opposed to host userspace mappings, is that the host
>>> cannot unmap VM memory because of KSM, automatic NUMA balancing, hugepage
>>> collapse, compaction, etc, acting on the host userspace mapping of the
>>> VM memory, and outside of the VMM's or KVM's control.
> 
> +1000.  It's not just "nice to have", it's a core design principle of guest_memfd.

Right, and in the guest_memfd call I also raised the rough idea behind
Alexandru's use case: having non-movable guest_memfd pages so that we can
support use cases where we can hopefully guarantee that a stage-2 mapping will
not just randomly go away.

> 
>>> I think it would be useful to preserve this behaviour, even in the absence
>>> of confidential VMs (i.e, guest_memfd file descriptor created with
>>> GUEST_MEMFD_FLAG_MMAP).
>>
>> Just to be clear, I was thinking that it might be useful for both
>> behaviours to exist (migratable and non-migratable) for non-confidential
>> VMs, and allow KVM or userspace to decide which they prefer for a
>> guest_memfd.
> 
> For the purposes of this discussion, we should separate the physical act of
> migrating pages from the features that trigger migration.  As I said in last week's
> guest-memfd call, I am a-ok with supporting page migration as a mechanism, but I
> am dead set against supporting NUMA balancing, KSM, LRU-based swap/reclaim, and
> anything else that goes against the goal of guest-first memory.

Right. Page migration for supporting ZONE_MOVABLE/CMA, compaction, memory
offlining, virtio-mem and possibly some collapse mechanism (if we were to
support THP of some sort in guest_memfd) would all be reasonable.

As soon as we mix in access/LRU semantics, we're going in the wrong direction.

Fortunately KSM is anon-only and not even worth a rant here :)
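
To make the split concrete, here is a rough userspace sketch of "migration as
a mechanism" versus the features that trigger it. The enum and the predicate
are hypothetical names invented for illustration, not an existing kernel
interface; the classification simply follows the lists in this thread. KSM is
left out because it only applies to anonymous memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical trigger names for illustration; not a kernel enum. */
enum gmem_migrate_trigger {
	GMEM_TRIG_ZONE_MOVABLE_CMA,	/* movable-zone / CMA allocations */
	GMEM_TRIG_COMPACTION,
	GMEM_TRIG_MEMORY_OFFLINE,
	GMEM_TRIG_VIRTIO_MEM,
	GMEM_TRIG_COLLAPSE,		/* only if THP is ever supported */
	GMEM_TRIG_NUMA_BALANCING,	/* access-driven */
	GMEM_TRIG_LRU_RECLAIM,		/* LRU-based swap/reclaim */
	GMEM_TRIG_COUNT,
};

/* Would this trigger be acceptable for a migratable guest_memfd? */
static bool gmem_trigger_allowed(enum gmem_migrate_trigger t)
{
	assert((int)t >= 0 && (int)t < GMEM_TRIG_COUNT);

	switch (t) {
	case GMEM_TRIG_ZONE_MOVABLE_CMA:
	case GMEM_TRIG_COMPACTION:
	case GMEM_TRIG_MEMORY_OFFLINE:
	case GMEM_TRIG_VIRTIO_MEM:
	case GMEM_TRIG_COLLAPSE:
		/* Physical placement only, no access/LRU semantics. */
		return true;
	case GMEM_TRIG_NUMA_BALANCING:
	case GMEM_TRIG_LRU_RECLAIM:
	default:
		/* Goes against guest-first memory. */
		return false;
	}
}
```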



-- 
Cheers,

David

^ permalink raw reply

