Linux Confidential Computing Development

Linux Confidential Computing Development
 help / color / mirror / Atom feed

* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish @ 2026-06-16 19:56 UTC (permalink / raw)
  To: K Prateek Nayak, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <0fa0bc95-ff31-40c5-b083-3c885d09d0ab@amd.com>

Hello Prateek,

On 6/16/2026 2:27 AM, K Prateek Nayak wrote:
> Hello Ashish,
> 
> On 6/16/2026 1:19 AM, Ashish Kalra wrote:
>> +	/*
>> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
>> +	 * reserved processor memory. The RMP scan is the most expensive
>> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
>> +	 * if they can see a cached result in the reserved processor memory.
>> +	 *
>> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> +	 * on every other primary thread. Followers are "designed to"
>> +	 * skip the scan if they see the "cached" scan results.
>> +	 */
>> +	cpumask_copy(follower_mask, &rmpopt_cpumask);
> 
> rmpopt_cpumask is constructed after hotplug is disabled but ...
> 
>> +
>> +	/*
>> +	 * Pin the worker to the current CPU for the leader loop so that
>> +	 * this_cpu remains valid and the RMPOPT instruction executes on
>> +	 * the correct CPU.
>> +	 *
>> +	 * Use migrate_disable() rather than get_cpu() to prevent
>> +	 * migration while still allowing preemption.
>> +	 */
>> +	migrate_disable();
>> +	this_cpu = smp_processor_id();
>> +
>> +	if (cpumask_test_cpu(this_cpu, follower_mask)) {
>> +		/*
>> +		 * Current CPU is a primary thread in rmpopt_cpumask.
>> +		 * Run leader locally and remove from follower mask.
>> +		 */
>> +		cpumask_clear_cpu(this_cpu, follower_mask);
>> +
>> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +			rmpopt(pa);
>> +			cond_resched();
>> +		}
>> +	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
>> +				      follower_mask)) {
>> +		/*
>> +		 * Current CPU is a sibling thread whose primary is in
>> +		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
>> +		 * is safe to run the leader locally.  Remove the sibling's
>> +		 * primary from the follower mask as this core is already
>> +		 * covered by the leader.
>> +		 */
>> +		cpumask_andnot(follower_mask, follower_mask,
>> +			       topology_sibling_cpumask(this_cpu));
>> +
>> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +			rmpopt(pa);
>> +			cond_resched();
>> +		}
>> +	} else {
>> +		/*
>> +		 * Current CPU does not have RMPOPT_BASE MSR programmed.
>> +		 * Pick an explicit leader from the cpumask to avoid #UD.
>> +		 * Use work_on_cpu() to run in process context on the leader,
>> +		 * avoiding IPI latency.
>> +		 */
> 
> ... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
> siblings on "rmpopt_cpumask".
> 
> How does that happen?

Actually, this was the implementation before the CPU hotplug disable enforcement code was implemented and added in v8,
and i should have fixed this rmpopt_work_handler() accordingly for v8.

With the enforced cpu hotplug disable support, case #3 here (above) is now dead code, and removing it lets
cases #1 and #2 collapse too.

snp_prepare() requires cpu_online_mask == cpu_present_mask before SNP init — so when snp_setup_rmpopt() programs the MSRs, every
core's primary is online -> every core is in rmpopt_cpumask.
  
So now the work handler always runs on a CPU whose core is programmed. topology_sibling_cpumask(this_cpu) therefore always intersects
rmpopt_cpumask -> case #1 or #2 always matches.

So i should actually drop case #3 here - which is: "this_cpu is neither in the "rmpopt_cpumask", nor is any of its
siblings on rmpopt_cpumask"


> 
>> +		int leader_cpu = cpumask_first(follower_mask);
>> +
>> +		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
>> +			migrate_enable();
>> +			goto out;
>> +		}
>> +
>> +		cpumask_clear_cpu(leader_cpu, follower_mask);
>> +
>> +		/* Release migration pin before work_on_cpu(). */
>> +		migrate_enable();
>> +
>> +		work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);
> 
> This creates a delayed work and also waits for it to finish execution
> which will add more latency than a simple IPI if the comment about IPI
> latency above is accurate.
> 
> I think there is some corner case in construction of the
> "rmpopt_cpumask" that requires this not-so-pretty else block. Can you
> elaborate why this is required?
> 
> Perhaps the "rmpopt_cpumask" construction needs:
> 
>     for_each_online_cpu(cpu) {
>         /* Nominate the first CPU on the sibling mask for RMPOPT */
>         if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
>             continue;
>         cpumask_set_cpu(cpu, &rmpopt_cpumask);
>     }
> 
> 
> and all you need here is:
> 
>     /* Do RMPOPt for local core */
>     for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>         rmpopt(pa);
> 
>     /* Skip this core from concurrent RMPOPT */
>     cpumask_and_not(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));
> 
> No?
> 

Yes, a simpler implementation will be like this: 
...

 	if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
                return;

 	cpumask_copy(follower_mask, &rmpopt_cpumask);

        /*
         * The current CPU's core always has RMPOPT_BASE programmed
         * (snp_prepare() required all CPUs online at setup and CPU hotplug
         * is disabled while SNP is active), so it can always be the leader.
         * RMPOPT_BASE is per-core; exclude this core from the followers.
         */
        migrate_disable();
        cpumask_andnot(follower_mask, follower_mask,
                       topology_sibling_cpumask(smp_processor_id()));

        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                rmpopt(pa);
                cond_resched();
        }
        migrate_enable();

        cpus_read_lock();
        for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
                on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
                cond_resched();
        }
        cpus_read_unlock();

        free_cpumask_var(follower_mask);


 Here, the leader exclusion must use the sibling mask, not clear_cpu(this_cpu). That's why my collapsed version uses:

        cpumask_andnot(follower_mask, follower_mask,
                       topology_sibling_cpumask(smp_processor_id()));

  - If this_cpu is a primary: its sibling mask contains itself (the primary) -> andnot removes this core's primary from the followers.
  
  - If this_cpu is a secondary: it isn't in follower_mask at all, but its sibling mask contains its primary, which is in
  follower_mask -> andnot still removes this core's primary. 

  So either way the current core is dropped from the followers. (The old code needed two branches because case #1 used
  clear_cpu(this_cpu) — only correct when this_cpu is the primary — while case #2 used the sibling andnot. The single andnot works for
  both cases).

Thanks,
Ashish

>> +		goto followers;
>> +	}
>> +
>> +	migrate_enable();
>> +

^ permalink raw reply

* Re: [PATCH v13 00/22] TDX KVM selftests
From: Sean Christopherson @ 2026-06-16 18:48 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Kiryl Shutsemau, linux-kselftest,
	Paolo Bonzini, Pratik R. Sampat, Reinette Chatre, Rick Edgecombe,
	Roger Wang, Ryan Afranji, Sagi Shahar, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86,
	Adrian Hunter
In-Reply-To: <CAEvNRgH7Lk=z9NqcY4OZXv=y5SeCZHnDNcB0=kHfarjCA4ZPTw@mail.gmail.com>

On Tue, Jun 16, 2026, Ackerley Tng wrote:
> Lisa Wang <wyihan@google.com> writes:
> 
> > This patch series focuses on setting up a TDX VM and adding all code
> > necessary to run a basic lifecycle test.
> >
> > Unlike standard KVM selftests can set up the VM through guest registers,
> > TDX module protects TDs' register state from the host. This feature of
> > TDX causes problems on VM boot state initialization and the ucall
> > implementation.
> >
> > In standard KVM selftests, the host directly initializes the guest state
> > by manipulating Special Registers (SREGs) and General Purpose Registers
> > (GPRs) via IOCTLs (KVM_SET_SREGS, etc.) before the first KVM_RUN.
> >
> > To bypass direct register initialization by the host, we utilize the
> > standard x86 reset vector as the default entry point.
> >
> > The mechanism works as follows:
> > 1. The host places register values into a specific memory region and
> >    inserts boot code at the VM's default starting point.
> > 2. When the VM starts, it executes this boot code to "pull" values from
> >    memory and manually set up its own SREGs and GPRs.
> > 3. Once the environment is ready, the boot code jumps to the guest code.
> >
> > The standard x86 ucall() implementation uses PIO, but it does not
> > actually transmit data through the 4-byte PIO data. Instead, it relies
> > on the host reading the ucall address directly from the guest's RDI
> > register.
> >
> > TDX selftests cannot utilize the standard x86 ucall implementation,
> > because the host is unable to access the guest's RDI register. Based on
> > this restriction, we considered these potential solutions for the TDX
> > ucall implementation.
> >
> > 1. TDCALL PIO with RCX-bits Passthrough
> > We first considered passing the RDI value through RCX bits to bypass the
> > hardware's register protection, which could be the closest approach to
> > the non-TDX implementation as per Sean's suggestion[1]. However, this
> > approach is blocked by the software-side implementation: KVM_GET_REGS
> > currently does not support TDX VMs and returns -EINVAL. To make this
> > work, the KVM ioctl would need a test-only hack.
> >
> > 2. TDCALL PIO with buffer indexing
> > To keep a PIO-based approach and unify the get_ucall implementation for
> > both TDX and non-TDX VMs, we considered TDCALL PIO with buffer indexing.
> > Since the ucall buffer is initialized prior to execution, the VM could
> > just pass a buffer index rather than an 8-byte ucall address to fit
> > within the 4-byte PIO data limit. The host, already knowing the ucall
> > buffer's base address, could then resolve the ucall content via this
> > index. We abandoned this solution because it would require changes to
> > the common ucall structure and impact other non-x86 architectures.
> >
> > 3. TDCALL MMIO (Selected solution)
> > We ultimately selected TDCALL with an 8-byte MMIO data. This method only
> > requires initializing an MMIO GPA and adding TDCALL MMIO implementation
> > for TDX under the original x86 ucall path. While this diverges from the
> > non-TDX PIO, it provides the cleanest implementation with minimal
> > disruption to the overall ucall architecture.
> >
> 
> Sean, Lisa evaluated your suggestion [1] (summarized as 1. above) but we
> think TDCALL MMIO is better, what do you think?

I think y'all should have responded to that thread with "that doesn't work
because host userspace can't access the registers".  Reviews are multi-way
discussions, not one-way streams of "do this".  And the expectation is that
either review feedback is addressed in the next version, or the dicussion is
closed/resolved *before* posting the next version.

Remaining silent and then writing a thesis in the cover letter of a future version
of the series is very inefficient for everyone involved.  I obviously don't read
cover letters all that closely at v13 and I gotta imagine a *lot* of effort went
into the above (which I greatly appreciate!).  The paper trail also becomes
impossible to follow, because anyone reading my response would probably make the
same assumption as me: it was a viable idea and that's what we implemented.

I'm a-ok with using MMIO, because yeah, there doesn't seem to be a better option.

^ permalink raw reply

* Re: [PATCH  v13 21/22] KVM: selftests: Add ucall support for TDX
From: Sean Christopherson @ 2026-06-16 18:47 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-21-6983ae4c3a4d@google.com>

On Thu, May 21, 2026, Lisa Wang wrote:
> diff --git a/tools/testing/selftests/kvm/lib/x86/ucall.c b/tools/testing/selftests/kvm/lib/x86/ucall.c
> index e7dd5791959b..c8e3418d53af 100644
> --- a/tools/testing/selftests/kvm/lib/x86/ucall.c
> +++ b/tools/testing/selftests/kvm/lib/x86/ucall.c
> @@ -5,11 +5,34 @@
>   * Copyright (C) 2018, Red Hat, Inc.
>   */
>  #include "kvm_util.h"
> +#include "tdx/tdx.h"
> +#include "tdx/tdx_util.h"
>  
>  #define UCALL_PIO_PORT ((u16)0x1000)
>  
> +static u8 vm_type;
> +static gpa_t host_ucall_mmio_gpa;
> +static gpa_t ucall_mmio_gpa;
> +
> +void ucall_arch_init(struct kvm_vm *vm, gpa_t mmio_gpa)

I think we should use an x86-specific GPA, not the first address past memslot0.
Unlike other architectures, x86 has a nice swath of addresses that are pretty
much guaranteed to be unused, thanks to selftests creating a local APIC by default.
On the other hand, the chances of a collision with a memslot just after memslot0
are decidedly non-zero.

Note, because CoCo VMS don't support read-only memslots, the TODO in __vm_create()
can't be resolved for TDX using the suggested shenanigans.

I vote for either the I/O APIC (0xfec00000) or HPET(0xfed00000).  We *know* TDX
doesn't support an in-kernel I/O APIC, and the odds of KVM selftests ever
implementing an I/O APIC are basically nil.  Ditto for the HPET.

> +{
> +	vm_type = vm->type;
> +	sync_global_to_guest(vm, vm_type);
> +
> +	if (is_tdx_vm(vm)) {
> +		host_ucall_mmio_gpa = ucall_mmio_gpa = mmio_gpa;

Drop host_ucall_mmio_gpa entirely.  "host GPA" is rather nonsensical, and KVM is
responsible for stripping the shared bit.  You can actually drop ucall_mmio_gpa
as well if we go with a hardcoded magic address.

> +		ucall_mmio_gpa |= vm->arch.s_bit;
> +		sync_global_to_guest(vm, ucall_mmio_gpa);
> +	}
> +}
> +
>  void ucall_arch_do_ucall(gva_t uc)
>  {
> +	if (vm_type == KVM_X86_TDX_VM) {
> +		tdx_mmio_write(ucall_mmio_gpa, MMIO_SIZE_8B, uc);

s/MMIO_SIZE_8B/sizeof(hva_t), because what you're writing is the address of a
pointer in the host virtual address space.

> +		return;
> +	}
> +
>  	/*
>  	 * FIXME: Revert this hack (the entire commit that added it) once nVMX
>  	 * preserves L2 GPRs across a nested VM-Exit.  If a ucall from L2, e.g.
> @@ -46,6 +69,13 @@ void *ucall_arch_get_ucall(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_run *run = vcpu->run;
>  
> +	if (vm_type == KVM_X86_TDX_VM) {
> +		if (run->exit_reason == KVM_EXIT_MMIO &&
> +		    run->mmio.phys_addr == host_ucall_mmio_gpa &&
> +		    run->mmio.len == MMIO_SIZE_8B && run->mmio.is_write)
> +			return (void *)(*((u64 *)run->mmio.data));

This needs to return NULL.  Either that or make this an if-elif.  Falling
through to the normal KVM_EXIT_IO check is not what we want.

^ permalink raw reply

* Re: [RFC PATCH] mm/vmalloc: add vmalloc_decrypted() and vzalloc_decrypted()
From: Jason Gunthorpe @ 2026-06-16 18:45 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Christoph Hellwig, Kameron Carr, akpm, urezki, linux-mm,
	linux-kernel, rppt, mhklinux, linux-coco, Suzuki K Poulose
In-Reply-To: <ajGTPelKLhgFqh7F@arm.com>

On Tue, Jun 16, 2026 at 07:17:33PM +0100, Catalin Marinas wrote:

> > The entry point is dma_alloc_noncontiguous() and you get a scatterlist
> > back.
> 
> Yes but not scattered pages unless there's an iommu behind. Anyway,
> that's an implementation detail, something like
> dma_alloc_noncontiguous_vmap() could allocate scattered pages as a
> fallback.

Oh I never noticed it deliberately returns only a single dma entry. I
think that could be optionally weakened without alot of trouble

There is also dma_vmap_noncontiguous() already, so I think the main
framework is there, though it seems like it needs a a bit mmore features.

> Currently, something like netvsc_init_buf() just does a vzalloc() and
> passes it down to vmbus_establish_gpadl() which knows how to interpret
> the channel encryption status. I assume with the vzalloc_decrypted()
> API, that info needs to be interpreted at the netvsc_init_buf() level to
> know which allocation to call.

But it doesn't end at alloc does it? hyperv will still have to reach
into the vmap and convert it into an appropriate IPA to pass to the
hypervisor. That really needs to use the arch helpers the DMA API has
and those should not be called by any sort of driver environment like
hyperv.

> If we move towards a dma_alloc_noncontiguous_vmap() API we need vmbus to
> encode the encryption requirement in the hv_device::device somehow so
> that force_dma_unencrypted() knows what do return. 

Yes, they would have to act like PCI and mark in-TEE and out of-TEE
struct devices properly so the DMA API knows what to do instead of
open coding a copy of all this logic in hyperv.

> We have the DMA_ATTR_CC_SHARED but that's not interpreted on the DMA
> alloc path, 

It is to describe memory that was deliberately allocated as decrypted,
not to control allocation choices.

> so there's a bit more work needed on the DMA API I think (not sure
> whether Aneesh's series covers any of this).

I don't think it does directly. It largely sets the stage to properly
allow a struct device to opt out of force_dma_unencrypted() so we get
support a T=1 PCI device.

Jason

^ permalink raw reply

* Re: [PATCH  v13 20/22] KVM: selftests: Implement MMIO WRITE for the TDX VM
From: Sean Christopherson @ 2026-06-16 18:20 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-20-6983ae4c3a4d@google.com>

On Thu, May 21, 2026, Lisa Wang wrote:
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/tdx.h b/tools/testing/selftests/kvm/include/x86/tdx/tdx.h
> new file mode 100644
> index 000000000000..810ca7423c84
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef SELFTESTS_TDX_TDX_H
> +#define SELFTESTS_TDX_TDX_H
> +
> +#include <linux/types.h>
> +
> +enum mmio_size {
> +	MMIO_SIZE_1B = 1,
> +	MMIO_SIZE_2B = 2,
> +	MMIO_SIZE_4B = 4,
> +	MMIO_SIZE_8B = 8

This is absurd.  Either open code the literals or use sizeof() where appropriate.

> +};
> +
> +u64 tdx_mmio_write(u64 address, enum mmio_size size, u64 data_in);
> +
> +#endif // SELFTESTS_TDX_TDX_H
> diff --git a/tools/testing/selftests/kvm/lib/x86/tdx/tdx.c b/tools/testing/selftests/kvm/lib/x86/tdx/tdx.c
> new file mode 100644
> index 000000000000..f19be79fe11f
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/lib/x86/tdx/tdx.c
> @@ -0,0 +1,30 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "tdx/tdx.h"
> +
> +#define TDG_VP_VMCALL 0
> +#define TDG_VP_VMCALL_VE_REQUEST_MMIO    48
> +#define TDVMCALL_MMIO_WRITE		  1
> +#define TDVMCALL_EXPOSE_REGS_MASK    0xFC00
> +
> +u64 tdx_mmio_write(u64 address, enum mmio_size size, u64 data_in)
> +{
> +	register u64 r10_reg asm("r10") = TDG_VP_VMCALL;
> +	register u64 r11_reg asm("r11") = TDG_VP_VMCALL_VE_REQUEST_MMIO;
> +	register u64 r12_reg asm("r12") = size;
> +	register u64 r13_reg asm("r13") = TDVMCALL_MMIO_WRITE;
> +	register u64 r14_reg asm("r14") = address;
> +	register u64 r15_reg asm("r15") = data_in;
> +	register u64 rax_reg asm("rax") = TDG_VP_VMCALL;
> +	register u64 rcx_reg asm("rcx") = TDVMCALL_EXPOSE_REGS_MASK;

This needs to be proper assembly, i.e. in a .S file.  Using register like this
is *extremely* dangerous, because the compiler is (stupidly) allowed to clobber
registers between their declarations/initialization and their consumption in
the asm() blob.

> +
> +	asm volatile(
> +	 ".byte 0x66,0x0f,0x01,0xcc" /* tdcall */
> +	 : "+r" (r10_reg), "+r" (r11_reg)
> +	 : "r" (r12_reg), "r" (r13_reg), "r" (r14_reg), "r" (r15_reg),
> +	   "r" (rax_reg), "r" (rcx_reg)
> +	 : "cc", "memory"
> +	);
> +
> +	return r10_reg;
> +}
> 
> -- 
> 2.54.0.746.g67dd491aae-goog
> 

^ permalink raw reply

* Re: [RFC PATCH] mm/vmalloc: add vmalloc_decrypted() and vzalloc_decrypted()
From: Catalin Marinas @ 2026-06-16 18:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Kameron Carr, akpm, urezki, linux-mm,
	linux-kernel, rppt, mhklinux, linux-coco, Suzuki K Poulose
In-Reply-To: <20260612181807.GP1066031@ziepe.ca>

On Fri, Jun 12, 2026 at 03:18:07PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 12, 2026 at 06:49:28PM +0100, Catalin Marinas wrote:
> > On Thu, Jun 11, 2026 at 08:49:54AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 08, 2026 at 04:37:02PM +0100, Catalin Marinas wrote:
> > > > > +/**
> > > > > + * vzalloc_decrypted - allocate zeroed virtually contiguous decrypted memory
> > > > > + * @size:    allocation size
> > > > > + *
> > > > > + * Like vmalloc_decrypted(), but the memory is set to zero.
> > > > > + *
> > > > > + * Return: pointer to the allocated memory or %NULL on error
> > > > > + */
> > > > > +void *vzalloc_decrypted_noprof(unsigned long size)
> > > > > +{
> > > > > +	void *addr;
> > > > > +
> > > > > +	addr = __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
> > > > > +					   GFP_KERNEL,
> > > > > +					   pgprot_decrypted(PAGE_KERNEL),
> > > > > +					   VM_DECRYPTED, NUMA_NO_NODE,
> > > > > +					   __builtin_return_address(0));
> > > > > +	if (addr)
> > > > > +		memset(addr, 0, size);
[...]
> > > But what is the purpose of this? I guess some hyperv thing - but
> > > shouldn't we have a more structured way to "DMA map" things for the
> > > hypervisor instead of stuff like this? Why can't you use
> > > dma_alloc_coherent() which actually gives you an address that is
> > > sensible to pass to the hypervisor?
> > 
> > IIRC netvsc_init_buf() uses vzalloc() to allocate some memory and that
> > buffer ends up in set_memory_decrypted() via vmbus_establish_gpadl().
> > arm64 does not support changing the decrypted/shared attributed of
> > vmalloc mappings and I don't think we should add it. Better to just
> > allocate it properly upfront.
> 
> Sure
>  
> > We might be able to use the DMA API but we won't get something like
> > vmalloc() - physically non-contiguous. 
> 
> The entry point is dma_alloc_noncontiguous() and you get a scatterlist
> back.

Yes but not scattered pages unless there's an iommu behind. Anyway,
that's an implementation detail, something like
dma_alloc_noncontiguous_vmap() could allocate scattered pages as a
fallback.

> > I think dma_alloc_noncontiguous() just falls back to
> > dma_direct_alloc_pages() in the absence of an iommu.
> 
> In all cases you get a scatterlist with a CPU list and a DMA
> list. iommu gives a smaller DMA list.
> 
> If you want a vmap then you can feed that CPU page list from the sgl
> into vmap().
> 
> A dma_alloc_noncontiguous_vmap() helper would not be hard to make, and
> IMHO, would make alot more sense for hyperv to treat the memory access
> from the hypervisor as "DMA" instead of trying to re-invent the DMA
> API.. :\
> 
> HCH was already saying we should not be allowing drivers to use
> set_memory_decrypted() at all, and hyperv is the biggest non-core user
> right now...

That's a good aim longer term. I'm not familiar with hyper-v but I think
it needs a mix of private or shared allocations depending on whether a
paravisor is present. That's handled by the vmbus code and the
information is encoded in the vmbus_channel objects.

Currently, something like netvsc_init_buf() just does a vzalloc() and
passes it down to vmbus_establish_gpadl() which knows how to interpret
the channel encryption status. I assume with the vzalloc_decrypted()
API, that info needs to be interpreted at the netvsc_init_buf() level to
know which allocation to call.

If we move towards a dma_alloc_noncontiguous_vmap() API we need vmbus to
encode the encryption requirement in the hv_device::device somehow so
that force_dma_unencrypted() knows what do return. We have the
DMA_ATTR_CC_SHARED but that's not interpreted on the DMA alloc path, so
there's a bit more work needed on the DMA API I think (not sure whether
Aneesh's series covers any of this).

-- 
Catalin

^ permalink raw reply

* Re: [PATCH RFC 0/3] KVM: guest_memfd: folio migration for non-confidential VMs
From: Ackerley Tng @ 2026-06-16 18:09 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Sean Christopherson, Alexandru Elisei
  Cc: Shivank Garg, Matthew Wilcox (Oracle), Jan Kara, Andrew Morton,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Paolo Bonzini, Shuah Khan, Chao Peng,
	Nikunj A Dadhania, Ira Weiny, Michael Roth, Pankaj Gupta,
	Fuad Tabba, Vishal Annapurve, Nikita Kalyazin, Patrick Roy,
	Pratik Sampat, Ashish Kalra, linux-fsdevel, linux-coco, linux-mm,
	linux-kernel, kvm, linux-kselftest
In-Reply-To: <1e77f24d-315a-4f68-a109-2f4520343c0c@kernel.org>

"David Hildenbrand (Arm)" <david@kernel.org> writes:

> On 6/15/26 19:39, Sean Christopherson wrote:
>> On Mon, Jun 15, 2026, Alexandru Elisei wrote:
>>> Hi,
>>>
>>> On Mon, Jun 15, 2026 at 11:43:14AM +0100, Alexandru Elisei wrote:
>>>> Hi,
>>>>
>>>>
>>>> I always thought that one of the nice things about using guest_memfd as a
>>>> memory backend, as opposed to host userspace mappings, is that the host
>>>> cannot unmap VM memory because of KSM, automatic NUMA balancing, hugepage
>>>> collapse, compaction, etc, acting on the host userspace mapping of the
>>>> VM memory, and outside of the VMM's or KVM's control.
>>
>> +1000.  It's not just "nice to have", it's a core design principle of guest_memfd.
>
> Right, and I raised in the guest_memfd call also the rough idea of Alexandru's
> use case of having non-movable guest_memfd pages such that we can support use
> cases where we can hopefully guarantee that a stage-2 mapping will not just
> randomly go away.
>
>>
>>>> I think it would be useful to preserve this behaviour, even in the absence
>>>> of confidential VMs (i.e, guest_memfd file descriptor created with
>>>> GUEST_MEMFD_FLAG_MMAP).
>>>
>>> Just to be clear, I was thinking that it might be useful for both
>>> behaviours to exist (migratable and non-migratable) for non-confidential
>>> VMs, and allow KVM or userspace to decide which they prefer for a
>>> guest_memfd.

More concretely, are y'all pointing towards a
GUEST_MEMFD_FLAG_MIGRATABLE, which will set .migrate =
kvm_gmem_migrate_folio, and for now, error out for CoCo VMs?

>>
>> For the purposes of this discussion, we should separate the physical act of
>> migrating pages from the features that trigger migration.  As I said in last week's
>> guest-memfd call, I am a-ok with supporting page migration as a mechanism, but I
>> am dead set against supporting NUMA balancing, KSM, LRU-based swap/reclaim, and
>> anything else that goes against the goal of guest-first memory.
>
> Right. Page migration for supporting ZONE_MOVABLE/CMA, compaction, memory
> offlining, virtio-mem and possibly some collapse mechanism if we were to support
> THP of some sorts in guest_memfd would are all reasonable.
>

Background question: how would virtio-mem use migration in the host/guest_memfd?

> As soon as we mix in access/lru semantics, we're going into the wrong direction.
>
> Fortunately KSM is anon-only and not even worth a rant here :)
>
>
>
> --
> Cheers,
>
> David

^ permalink raw reply

* Re: [PATCH v13 00/22] TDX KVM selftests
From: Ackerley Tng @ 2026-06-16 17:51 UTC (permalink / raw)
  To: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Kiryl Shutsemau, linux-kselftest,
	Paolo Bonzini, Pratik R. Sampat, Reinette Chatre, Rick Edgecombe,
	Roger Wang, Ryan Afranji, Sagi Shahar, Sean Christopherson,
	Shuah Khan, Oliver Upton
  Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86,
	Adrian Hunter
In-Reply-To: <20260521-tdx-selftests-v13-v13-0-6983ae4c3a4d@google.com>

Lisa Wang <wyihan@google.com> writes:

> This patch series focuses on setting up a TDX VM and adding all code
> necessary to run a basic lifecycle test.
>
> Unlike standard KVM selftests can set up the VM through guest registers,
> TDX module protects TDs' register state from the host. This feature of
> TDX causes problems on VM boot state initialization and the ucall
> implementation.
>
> In standard KVM selftests, the host directly initializes the guest state
> by manipulating Special Registers (SREGs) and General Purpose Registers
> (GPRs) via IOCTLs (KVM_SET_SREGS, etc.) before the first KVM_RUN.
>
> To bypass direct register initialization by the host, we utilize the
> standard x86 reset vector as the default entry point.
>
> The mechanism works as follows:
> 1. The host places register values into a specific memory region and
>    inserts boot code at the VM's default starting point.
> 2. When the VM starts, it executes this boot code to "pull" values from
>    memory and manually set up its own SREGs and GPRs.
> 3. Once the environment is ready, the boot code jumps to the guest code.
>
> The standard x86 ucall() implementation uses PIO, but it does not
> actually transmit data through the 4-byte PIO data. Instead, it relies
> on the host reading the ucall address directly from the guest's RDI
> register.
>
> TDX selftests cannot utilize the standard x86 ucall implementation,
> because the host is unable to access the guest's RDI register. Based on
> this restriction, we considered these potential solutions for the TDX
> ucall implementation.
>
> 1. TDCALL PIO with RCX-bits Passthrough
> We first considered passing the RDI value through RCX bits to bypass the
> hardware's register protection, which could be the closest approach to
> the non-TDX implementation as per Sean's suggestion[1]. However, this
> approach is blocked by the software-side implementation: KVM_GET_REGS
> currently does not support TDX VMs and returns -EINVAL. To make this
> work, the KVM ioctl would need a test-only hack.
>
> 2. TDCALL PIO with buffer indexing
> To keep a PIO-based approach and unify the get_ucall implementation for
> both TDX and non-TDX VMs, we considered TDCALL PIO with buffer indexing.
> Since the ucall buffer is initialized prior to execution, the VM could
> just pass a buffer index rather than an 8-byte ucall address to fit
> within the 4-byte PIO data limit. The host, already knowing the ucall
> buffer's base address, could then resolve the ucall content via this
> index. We abandoned this solution because it would require changes to
> the common ucall structure and impact other non-x86 architectures.
>
> 3. TDCALL MMIO (Selected solution)
> We ultimately selected TDCALL with an 8-byte MMIO data. This method only
> requires initializing an MMIO GPA and adding TDCALL MMIO implementation
> for TDX under the original x86 ucall path. While this diverges from the
> non-TDX PIO, it provides the cleanest implementation with minimal
> disruption to the overall ucall architecture.
>

Sean, Lisa evaluated your suggestion [1] (summarized as 1. above) but we
think TDCALL MMIO is better, what do you think?

+ Jump directly to where the mmio is used: [2]
+ And here's [3] how tdx_mmio_write() is implemented, with no more
  throwing everything in a structure. It's also not macroed/prototyped
  like you suggested in [4], but I think those prototypes can evolve out
  of future tdx functions?

Let us know so Lisa can try another option (if necessary) while we
collect more reviews :)

[1] https://lore.kernel.org/all/aQTcDH9LRezI30dm@google.com/
[2] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-21-6983ae4c3a4d@google.com/
[3] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-20-6983ae4c3a4d@google.com/
[4] https://lore.kernel.org/all/aQTdTkMIukzt-YlA@google.com/

> 4. A note on #VE and x86 ucall simplification
> It is worth noting that the use of a Virtualization Exception (#VE)
> is orthogonal to the PIO vs. MMIO discussion; rather, it is a question
> of how much we want to simplify the x86 ucall implementation. A #VE
> handler is one option to allow VMs use PIO/MMIO identical to the
> non-TDX case. Alternatively, having an MMIO_WRITE wrapper macro, as Sean
> suggested[2], is another option. Either way, discussion for this is
> likely a premature optimization right now, since the PIO/MMIO call is
> only used under ucall_arch_do_ucall(), and standard and TDX VMs use
> different ones now. We should optimize this in the future, but for now,
> invoking TDCALL directly is more robust and concise.
>
>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH] PCI/TSM: Resume device to D0 for CMA-SPDM operation
From: Dan Williams (nvidia) @ 2026-06-16 17:34 UTC (permalink / raw)
  To: Lukas Wunner, Dan Williams, Ashish Kalra, Tom Lendacky
  Cc: Vivaik Balasubrawmanian, John Allen, Bjorn Helgaas, linux-coco,
	linux-pci, Jonathan Cameron, Aneesh Kumar K.V, Yilun Xu,
	Zhenzhong Duan, Alexey Kardashevskiy
In-Reply-To: <7bdfaf14d7e5a466f3f650150c688a60e947a7a9.1781527060.git.lukas@wunner.de>

Lukas Wunner wrote:
> Per PCIe r7.0 sec 6.31.3, CMA-SPDM operation in non-D0 states is optional.
> The spec does not define a way to determine if it's supported, so resume
> to D0 unconditionally for the duration of a CMA-SPDM exchange.  Vivaik has
> talked to Windows engineers and they said that Windows does the same.
> 
> Note that for plain DOE operation, it is sufficient for the device to be
> in D3hot and its parents in D0 because config space remains accessible in
> D3hot.  So CMA-SPDM goes beyond the requirements of plain DOE and hence
> resuming to D0 needs to (only) be done in code paths which use DOE
> specifically for CMA-SPDM.
> 
> The pattern used herein for runtime resume is the best practice introduced
> by commit ef8057b07c72 ("PM: runtime: Wrapper macros for ACQUIRE()/
> ACQUIRE_ERR()").
> 
> Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v6.19+
> Cc: Vivaik Balasubrawmanian <vivaik.balasubrawmanian@intel.com>
> ---
> We're in the merge window for v7.2 and this isn't super urgent,
> so it's targeting v7.3 via tsm.git/next.
> 
> Technically I'd have permission to apply myself,
> but I wouldn't want to without acks from Dan and AMD!
> Thanks for taking a look!

Thanks, Lukas. A few questions:

This says Fixes, but I assume it is based on inspection and not a
report?

There are no upstream usages of pci_tsm_doe_transfer() yet, but the ones
in flight would suffer from the "D0 -> D3hot -> D0 -> D3hot" bounce that
you described to sashiko. I.e. the runtime acquire should be done at a
higher level.

I think the natural place to add PM_RUNTIME_ACQUIRE() that covers all
cases is withing pci_tsm_connect() and pci_tsm_disconnect().

I also think failure to power manage the device in the disconnect path
should not be fatal to performing the rest of the cleanup.

^ permalink raw reply

* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-16 17:06 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar, Shuah Khan,
	Oliver Upton, Jeremiah McReynolds, kvm, linux-coco, linux-kernel,
	x86
In-Reply-To: <CAEvNRgFPKC2uOMaams7SS9B7LxvfU4h8DrPM5vXFb=pmXsgPbA@mail.gmail.com>

On Tue, Jun 16, 2026, Ackerley Tng wrote:
> >> 1. What do you think of a kvm_arch_vm_finalize() that calls
> >>    vm_sev_launch() and tdx_vm_finalize()? My key issue is that
> >>    kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
> >>    rather than the whole VM.
> >
> > Key word "seems".  I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
> > the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
> > 'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
> > finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.
> >
> > We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
> > we'll need to decide between automagic vs. manual finalization, and it definitely
> > should be a separate discussion.
> >
> 
> This definitely should not block this series.
> 
> It's coming together for me now with your explanation:
> kvm_arch_vm_finalize_vcpus() actually means finalizing vCPUs! vGIC ==
> Virtual Generic Interrupt Controller, which has to be done after all the
> vCPUs are set up. Since the name is describing where in the VM
> creation/setup flow the hook is called (after creating VM and after
> creating vCPUs), maybe something like kvm_arch_vm_post_vcpu_create()?

No, because I would expect post_vcpu_create() to run after creating each vCPU,
not after creating all vCPUs.  E.g. see KVM's kvm_arch_vcpu_{pre,post}create().

> Renaming this to kvm_arch_vm_finalize() makes it sound like it is
> finalizing the VM, but this function shouldn't finalize the VM since for
> CoCo finalizing the VM also loads the guest image into the guest - deals
> with memory, not just vCPUs.
> 
> 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for 'default'
> VMs") also includes a test_disable_default_vgic() function, we could
> also use something like that to skip CoCo VM finalization for some
> tests? Maybe that's a good middle ground.

That probably won't work well, and in practice it's just shuffling deck chairs
on the Titanic.  For vGIC, and pre-create hook works because the tests that opt
out of automatic vGIC instantiation want that behavior to apply to all VMs that
the test creates.  That's not the case for sev_smoke_test though, because some
testcases need deferred launch (test_sync_vmsa()), whereas others can use
automatic launch (test_sev()).

The other wrinkle is that SEV at least needs to provide the policy, which again
varies per VM within a single test.

^ permalink raw reply

* Re: [PATCH v13 01/22] KVM: selftests: Add macros to simplify creating VM shapes for non-default types
From: Sean Christopherson @ 2026-06-16 16:51 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
	Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <e0b99e9a-c20f-4def-ac4b-0070996c10ef@intel.com>

On Tue, Jun 16, 2026, Xiaoyao Li wrote:
> On 5/22/2026 7:16 AM, Lisa Wang wrote:
> > diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> > index dc70c6da63fa..041bdbfb93f7 100644
> > --- a/tools/testing/selftests/kvm/include/kvm_util.h
> > +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> > @@ -233,6 +233,19 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
> >   	shape;					\
> >   })
> > +#define __VM_TYPE(__mode, __type)		\
> 
> It seems the name "__VM_SHAPE" fits better?
> 
> > +({						\
> > +	struct vm_shape shape = {		\
> > +		.mode = (__mode),		\
> > +		.type = (__type)		\
> > +	};					\
> > +						\
> > +	shape;					\
> > +})
> > +
> > +#define VM_TYPE(__type)				\
> > +	__VM_TYPE(VM_MODE_DEFAULT, __type)
> 
> and I think making it one line would be OK?
> 
> So something on top:
> 
> ---8<---
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h
> b/tools/testing/selftests/kvm/include/kvm_util.h
> index 041bdbfb93f7..a1b5d2029d05 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -223,17 +223,7 @@ kvm_static_assert(sizeof(struct vm_shape) ==
> sizeof(u64));
> 
>  #define VM_TYPE_DEFAULT                        0
> 
> -#define VM_SHAPE(__mode)                       \
> -({                                             \
> -       struct vm_shape shape = {               \
> -               .mode = (__mode),               \
> -               .type = VM_TYPE_DEFAULT         \
> -       };                                      \
> -                                               \
> -       shape;                                  \
> -})
> -
> -#define __VM_TYPE(__mode, __type)              \
> +#define __VM_SHAPE(__mode, __type)             \
>  ({                                             \
>         struct vm_shape shape = {               \
>                 .mode = (__mode),               \
> @@ -243,8 +233,8 @@ kvm_static_assert(sizeof(struct vm_shape) ==
> sizeof(u64));
>         shape;                                  \
>  })
> 
> -#define VM_TYPE(__type)                                \
> -       __VM_TYPE(VM_MODE_DEFAULT, __type)
> +#define VM_SHAPE(__mode)       __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
> +#define VM_TYPE(__type)                __VM_SHAPE(VM_MODE_DEFAULT, __type)

Oh, that's way better!  I say we go straight there:

--
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 28 Oct 2025 21:20:27 +0000
Subject: [PATCH] KVM: selftests: Add macros to simplify creating VM shapes for
 non-default types

Add VM_TYPE() and __VM_SHAPE() macros to create a vm_shape structure given
a type (and mode), and use the macros to define VM_SHAPE_{SEV,SEV_ES,SNP}
shapes for x86's SEV family of VM shapes.  Providing common infrastructure
will avoid having to copy+paste vm_sev_create_with_one_vcpu() for TDX.

Use the new SEV+ shapes and drop vm_sev_create_with_one_vcpu().

Opportunistically move the existing VM_SHAPE() (now __VM_SHAPE()) macro
below the definitions of VM_MODE_DEFAULT so that all of the SHAPE/TYPE
macros are bundled together.

No functional change intended.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../testing/selftests/kvm/include/kvm_util.h  | 28 +++++++------
 .../selftests/kvm/include/x86/processor.h     |  4 ++
 tools/testing/selftests/kvm/include/x86/sev.h |  2 -
 tools/testing/selftests/kvm/lib/x86/sev.c     | 16 --------
 .../selftests/kvm/x86/sev_smoke_test.c        | 40 +++++++++----------
 5 files changed, 40 insertions(+), 50 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index dc70c6da63fa..46bae183d7fc 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -221,18 +221,6 @@ struct vm_shape {
 
 kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
 
-#define VM_TYPE_DEFAULT			0
-
-#define VM_SHAPE(__mode)			\
-({						\
-	struct vm_shape shape = {		\
-		.mode = (__mode),		\
-		.type = VM_TYPE_DEFAULT		\
-	};					\
-						\
-	shape;					\
-})
-
 extern enum vm_guest_mode vm_mode_default;
 
 #if defined(__aarch64__)
@@ -270,8 +258,24 @@ extern enum vm_guest_mode vm_mode_default;
 
 #endif
 
+#define VM_TYPE_DEFAULT			0
+
+#define __VM_SHAPE(__mode, __type)			\
+({						\
+	struct vm_shape shape = {		\
+		.mode = (__mode),		\
+		.type = (__type),		\
+	};					\
+						\
+	shape;					\
+})
+
+
+#define VM_SHAPE(__mode)	__VM_SHAPE(__mode, VM_TYPE_DEFAULT)
 #define VM_SHAPE_DEFAULT	VM_SHAPE(VM_MODE_DEFAULT)
 
+#define VM_TYPE(__type)		__VM_SHAPE(VM_MODE_DEFAULT, __type)
+
 #define MIN_PAGE_SIZE		(1U << MIN_PAGE_SHIFT)
 #define PTES_PER_MIN_PAGE	ptes_per_page(MIN_PAGE_SIZE)
 
diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index 77f576ee7789..0aa6eecfcbde 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -365,6 +365,10 @@ static inline unsigned int x86_model(unsigned int eax)
 	return ((eax >> 12) & 0xf0) | ((eax >> 4) & 0x0f);
 }
 
+#define VM_SHAPE_SEV		VM_TYPE(KVM_X86_SEV_VM)
+#define VM_SHAPE_SEV_ES		VM_TYPE(KVM_X86_SEV_ES_VM)
+#define VM_SHAPE_SNP		VM_TYPE(KVM_X86_SNP_VM)
+
 #define PHYSICAL_PAGE_MASK      GENMASK_ULL(51, 12)
 
 #define PAGE_SHIFT		12
diff --git a/tools/testing/selftests/kvm/include/x86/sev.h b/tools/testing/selftests/kvm/include/x86/sev.h
index 1af44c151d60..944c59dbe510 100644
--- a/tools/testing/selftests/kvm/include/x86/sev.h
+++ b/tools/testing/selftests/kvm/include/x86/sev.h
@@ -53,8 +53,6 @@ void snp_vm_launch_start(struct kvm_vm *vm, u64 policy);
 void snp_vm_launch_update(struct kvm_vm *vm);
 void snp_vm_launch_finish(struct kvm_vm *vm);
 
-struct kvm_vm *vm_sev_create_with_one_vcpu(u32 type, void *guest_code,
-					   struct kvm_vcpu **cpu);
 void vm_sev_launch(struct kvm_vm *vm, u64 policy, u8 *measurement);
 
 kvm_static_assert(SEV_RET_SUCCESS == 0);
diff --git a/tools/testing/selftests/kvm/lib/x86/sev.c b/tools/testing/selftests/kvm/lib/x86/sev.c
index 93f916903461..95d8520eea34 100644
--- a/tools/testing/selftests/kvm/lib/x86/sev.c
+++ b/tools/testing/selftests/kvm/lib/x86/sev.c
@@ -158,22 +158,6 @@ void snp_vm_launch_finish(struct kvm_vm *vm)
 	vm_sev_ioctl(vm, KVM_SEV_SNP_LAUNCH_FINISH, &launch_finish);
 }
 
-struct kvm_vm *vm_sev_create_with_one_vcpu(u32 type, void *guest_code,
-					   struct kvm_vcpu **cpu)
-{
-	struct vm_shape shape = {
-		.mode = VM_MODE_DEFAULT,
-		.type = type,
-	};
-	struct kvm_vm *vm;
-	struct kvm_vcpu *cpus[1];
-
-	vm = __vm_create_with_vcpus(shape, 1, 0, guest_code, cpus);
-	*cpu = cpus[0];
-
-	return vm;
-}
-
 void vm_sev_launch(struct kvm_vm *vm, u64 policy, u8 *measurement)
 {
 	if (is_sev_snp_vm(vm)) {
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 1a49ee391586..fe2c438882ae 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -104,7 +104,7 @@ static void compare_xsave(u8 *from_host, u8 *from_guest)
 		abort();
 }
 
-static void test_sync_vmsa(u32 type, u64 policy)
+static void test_sync_vmsa(struct vm_shape shape, u64 policy)
 {
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
@@ -114,7 +114,7 @@ static void test_sync_vmsa(u32 type, u64 policy)
 	double x87val = M_PI;
 	struct kvm_xsave __attribute__((aligned(64))) xsave = { 0 };
 
-	vm = vm_sev_create_with_one_vcpu(type, guest_code_xsave, &vcpu);
+	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code_xsave);
 	gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
 			      MEM_REGION_TEST_DATA);
 	hva = addr_gva2hva(vm, gva);
@@ -150,13 +150,13 @@ static void test_sync_vmsa(u32 type, u64 policy)
 	kvm_vm_free(vm);
 }
 
-static void test_sev(void *guest_code, u32 type, u64 policy)
+static void test_sev(void *guest_code, struct vm_shape shape, u64 policy)
 {
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
 	struct ucall uc;
 
-	vm = vm_sev_create_with_one_vcpu(type, guest_code, &vcpu);
+	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
 
 	/* TODO: Validate the measurement is as expected. */
 	vm_sev_launch(vm, policy, NULL);
@@ -201,12 +201,12 @@ static void guest_shutdown_code(void)
 	__asm__ __volatile__("ud2");
 }
 
-static void test_sev_shutdown(u32 type, u64 policy)
+static void test_sev_shutdown(struct vm_shape shape, u64 policy)
 {
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
 
-	vm = vm_sev_create_with_one_vcpu(type, guest_shutdown_code, &vcpu);
+	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_shutdown_code);
 
 	vm_sev_launch(vm, policy, NULL);
 
@@ -218,28 +218,28 @@ static void test_sev_shutdown(u32 type, u64 policy)
 	kvm_vm_free(vm);
 }
 
-static void test_sev_smoke(void *guest, u32 type, u64 policy)
+static void test_sev_smoke(void *guest, struct vm_shape shape, u64 policy)
 {
 	const u64 xf_mask = XFEATURE_MASK_X87_AVX;
 
-	if (type == KVM_X86_SNP_VM)
-		test_sev(guest, type, policy | SNP_POLICY_DBG);
+	if (shape.type == KVM_X86_SNP_VM)
+		test_sev(guest, shape, policy | SNP_POLICY_DBG);
 	else
-		test_sev(guest, type, policy | SEV_POLICY_NO_DBG);
-	test_sev(guest, type, policy);
+		test_sev(guest, shape, policy | SEV_POLICY_NO_DBG);
+	test_sev(guest, shape, policy);
 
-	if (type == KVM_X86_SEV_VM)
+	if (shape.type == KVM_X86_SEV_VM)
 		return;
 
-	test_sev_shutdown(type, policy);
+	test_sev_shutdown(shape, policy);
 
 	if (kvm_has_cap(KVM_CAP_XCRS) &&
 	    (xgetbv(0) & kvm_cpu_supported_xcr0() & xf_mask) == xf_mask) {
-		test_sync_vmsa(type, policy);
-		if (type == KVM_X86_SNP_VM)
-			test_sync_vmsa(type, policy | SNP_POLICY_DBG);
+		test_sync_vmsa(shape, policy);
+		if (shape.type == KVM_X86_SNP_VM)
+			test_sync_vmsa(shape, policy | SNP_POLICY_DBG);
 		else
-			test_sync_vmsa(type, policy | SEV_POLICY_NO_DBG);
+			test_sync_vmsa(shape, policy | SEV_POLICY_NO_DBG);
 	}
 }
 
@@ -247,13 +247,13 @@ int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV));
 
-	test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+	test_sev_smoke(guest_sev_code, VM_SHAPE_SEV, 0);
 
 	if (kvm_cpu_has(X86_FEATURE_SEV_ES))
-		test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
+		test_sev_smoke(guest_sev_es_code, VM_SHAPE_SEV_ES, SEV_POLICY_ES);
 
 	if (kvm_cpu_has(X86_FEATURE_SEV_SNP))
-		test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+		test_sev_smoke(guest_snp_code, VM_SHAPE_SNP, snp_default_policy());
 
 	return 0;
 }

base-commit: e49bb0b5e1e3a8d7783bc7222c02cc6ff90fa2aa
--

^ permalink raw reply related

* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-16 16:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar, Shuah Khan,
	Oliver Upton, Jeremiah McReynolds, kvm, linux-coco, linux-kernel,
	x86
In-Reply-To: <ajFfb9u6dU47Nj3v@google.com>

>
> [...snip...]
>
>>
>> I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
>> finalizing the VM.
>
> That's literally why the function exists though.  The one and only existing
> implementation (on arm64) uses it to initialize the vGIC.
>
>   void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
>   {
> 	if (vm->arch.has_gic)
> 		__vgic_v3_init(vm->arch.gic_fd);
>   }
>
> That's *very* similar to the proposed TDX usage, where some per-VM asset(s) can
> be initialized/frozen only after all vCPUs have been added.  In other words, the
> name is describing where in the VM creation/setup flow the hook is called, and
> perhaps more importantly, the impact of the call: vCPUs are finalized (obviously
> with a different definition of "finalized" based on the VM properties).
>
>> I would prefer to not have to explicitly call some function like
>> kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
>> calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
>> for test setup flexibility, without too much magic.
>
> We can't have our cake and eat it too.  Either we launch/finalize SEV/TDX VMs as
> part of the common VM creation flows (as proposed for TDX), or we force tests to
> manually launch/finalize the VM after additional setup.  The only way to have it
> both ways is to utilize crazy macro shenanigans to execute arbitrary code between
> creating the VM and launching/finalizing the VM, and even I would balk at that.
>
> I honestly don't see any reason to try to figure out which of the two approaches
> is optimal at this time.  Whatever decision we make isn't set in stone, and in
> fact should be relative easy to change.  And without being able to see all the
> TDX/SEV tests that are waiting in the wings, it's more or less impossible to make
> an informed decision.
>
> I definitely want to have SEV and TDX use the same core approach in the end, but
> I don't think we should force the issue right now, because the cost of reworking
> the SEV and/or TDX infrastructure when there are only a bare handful of tests is
> so low.  It will cost more to try to predict the future of a 50/50 outcome than
> it will to make a completely wild guess between the two options and rework things
> if we guess wrong.
>

Makes sense. I'm good with merging this as it is done in this
patch. Thanks :)

>> For now, I can't think of many uses of __shared. ucall shared memory is
>> allocated dynamically, and we can also make it shared cleanly within
>> ucall code.
>
> Uh, every selftest that uses global variables to communicate between guest and
> host?
>
>> The global variables (sync_global_to_guest()) will appear in the guest
>> as long as sync_global_to_guest() is called before
>> kvm_arch_vm_finalize(), which I think makes sense to people writing
>> tests for CoCo.
>
> Yes, but that's completely orthogonal to all of this.
>
>> So 2 questions to push this along:
>>
>> 1. What do you think of a kvm_arch_vm_finalize() that calls
>>    vm_sev_launch() and tdx_vm_finalize()? My key issue is that
>>    kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
>>    rather than the whole VM.
>
> Key word "seems".  I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
> the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
> 'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
> finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.
>
> We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
> we'll need to decide between automagic vs. manual finalization, and it definitely
> should be a separate discussion.
>

This definitely should not block this series.

It's coming together for me now with your explanation:
kvm_arch_vm_finalize_vcpus() actually means finalizing vCPUs! vGIC ==
Virtual Generic Interrupt Controller, which has to be done after all the
vCPUs are set up. Since the name is describing where in the VM
creation/setup flow the hook is called (after creating VM and after
creating vCPUs), maybe something like kvm_arch_vm_post_vcpu_create()?

Renaming this to kvm_arch_vm_finalize() makes it sound like it is
finalizing the VM, but this function shouldn't finalize the VM since for
CoCo finalizing the VM also loads the guest image into the guest - deals
with memory, not just vCPUs.

8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for 'default'
VMs") also includes a test_disable_default_vgic() function, we could
also use something like that to skip CoCo VM finalization for some
tests? Maybe that's a good middle ground.

>> 3. Would you like __shared implemented together with this series, as a
>>    prerequisite, or later?
>
> Only if __shared is a hard requirement for base TDX support, which I assume is
> not the case.

Yup!

^ permalink raw reply

* SVSM Development Call June 17th, 2026
From: Jörg Rödel @ 2026-06-16 16:10 UTC (permalink / raw)
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this weeks SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

We will use the LF Zoom instance. Details of the meeting  can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg

^ permalink raw reply

* Re: [PATCH 00/15] Enable TDX Module Extensions and DICE-based TDX Quoting
From: Xu Yilun @ 2026-06-16 15:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dan Williams (nvidia), kas, rick.p.edgecombe, x86, peter.fang,
	linux-coco, linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li, aneesh.kumar, aik
In-Reply-To: <1fc76b45-a25b-41a7-b7fa-06650c8052fb@intel.com>

On Mon, Jun 15, 2026 at 08:57:09AM -0700, Dave Hansen wrote:
> On 6/15/26 08:22, Xu Yilun wrote:
> >> The TDX "Extension SEAMCALL" capability is akin to ARM CCA's "Stateful
> >> RMI Operations (SRO)", and achieves similar externalized complexity
> >> relief as a dedicated hardware coprocessor like AMD SEV-SNP. The
> > I may not include the ARM/AMD examples, not sure I can explain them
> > well.
> 
> I actually think they're pretty important proof points. One of the big

OK, I can include this section that Dan provides.

> challenges as a maintainer evaluating these things is judging the
> solution itself.
> 
> Is this architecture a good one? Is it overly complex? Are the avenues
> for simplification?
> 
> If five vendors pop up all with similar problems and solutions, then
> it's a pretty good bet that they're all on the right track. But, if
> there are four going one direction and one going off by itself, it's a
> sign that the errant one might need a course correction.
> 
> It would honestly be worth your time to go *talk* to the AMD and ARM
> folks and ensure that you are all on the same page. Last I checked, they

Yes, I queried ARM/AMD TDISP folks offline and CCed them in this thread.
Correct me if anything wrong:

AFAIK, AMD firmware run on an external physical core (PSP), firmware call
execution won't occupy host CPU, and the two partners communicate
asynchronously, so no worry about interruptibility and preemptibility.

From Alexey:

  "The AMD CPU puts a request in a queue, writes to doorbell, and wait for
   an interrupt. The PSP (a separate physical core) will see this, handle,
   put the data in the CPU memory (if needed), trigger the interrupt. Done.
   The host CPU can be rescheduled while waiting"

ARM SRO is something I don't familiar with. ARM has no co-processor for
CC, host invokes RMI and trap into RMM for secure execution, stateless
RMI blocks interrupt so should be short lived. This is very similar to
Intel SEAMCALL.

Stateful RMI, however, from their RMM 2.0bet1 SPEC [1] B4.3.2 Stateful
RMI operations, could be used "When an RMI operation cannot be completed
within an IMPLEMENTATION DEFINED time limit". It is "guaranteed to yield
within an IMPLEMENTATION DEFINED time limit from the point at which an
interrupt becomes pending." I see it tries to solve the same problem as
extension SEAMCALLs.

I see SRO is WIP in [2], and is used for TDISP [3].

[1] https://developer.arm.com/documentation/den0137/2-0bet1/
[2] https://lore.kernel.org/all/20260318155413.793430-49-steven.price@arm.com/
[3] https://lore.kernel.org/all/20260427065121.916615-3-aneesh.kumar@kernel.org/

^ permalink raw reply

* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-16 14:36 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <CAEvNRgHjno=XDnMpTvZcNCR6D5RAi1rbynrL4CnnEa7Y-M3VLg@mail.gmail.com>

On Mon, Jun 15, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Fri, Jun 05, 2026, Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@google.com> writes:
> >> Many tests use vm_userspace_mem_region_add(), CoCo tests that require
> >> finalizing shouldn't be disallowed that option.
> >
> > What does that have to do with finalizing the VM?
> 
> I could add more memslots after finalizing the VM, but then I'd have to
> make the guest accept more memory. Hence, I'd rather set up all the
> memslots and then finalize the VM.

Why?  It's not like the performance of this code matters terribly.

Regardless, nothing *requires* a test to go down that route.  As I said before,
my goal with the selftests infrastructure is to effectively optimize the entire
experience, i.e. to provide default behavior that works well for the majority of
tests.  Attempting to provide default behavior that perfectly fits *every* test
is simply impossible.

> sev_smoke_test currently has a separate vm_sev_launch(), which is the
> equivalent of tdx_init_mem_region(), and that's called in the tests
> themselves.
> 
> sev_smoke_test also uses vcpu_args_set() after creating the VM with
> vm_create_shape_with_one_vcpu(). Would that work if vm_sev_launch() got
> moved into kvm_arch_vm_finalize_vcpus()?

No, because vCPU state would be encrypted at that point.  Moving vm_sev_launch()
would also break test_sev_dbg(), which sets a global buffer to all 0xaa prior to
encrypting that data.

> >> >> It's also possible to have some kvm_vm_finalize() call that can be
> >> >> explicitly and manually invoked from selftests just for CoCo selftests.
> >> >
> >> > Why bother?  It's obviously possible to all kvm_arch_vm_finalize_vcpus() directly.
> >>
> >> Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
> >> is the right function where the TD is finalized?
> >>
> >> For tests that need to do more setup after creating a vm, is the only
> >> way out to call __vm_create() then vm_vcpu_add() to avoid premature
> >> finalization in __vm_create_with_vcpus() when
> >> kvm_arch_vm_finalize_vcpus() is called?
> >
> > Depends on what you're doing.  Sometimes, the answer will be yes.  That's why
> > there are "low level" APIs, so that some tests can do fancy things, while most
> > tests can leave the details to the infrastructure.
> >
> > If there's a recurring problem, or we anticipate one, then we can and should
> > figure out how to minimize the pain so that tests don't have to deal with the
> > same boilerplate issues over and over.  Hence the __shared idea.
> 
> I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
> finalizing the VM.

That's literally why the function exists though.  The one and only existing
implementation (on arm64) uses it to initialize the vGIC.

  void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
  {
	if (vm->arch.has_gic)
		__vgic_v3_init(vm->arch.gic_fd);
  }

That's *very* similar to the proposed TDX usage, where some per-VM asset(s) can
be initialized/frozen only after all vCPUs have been added.  In other words, the
name is describing where in the VM creation/setup flow the hook is called, and
perhaps more importantly, the impact of the call: vCPUs are finalized (obviously
with a different definition of "finalized" based on the VM properties).

> I would prefer to not have to explicitly call some function like
> kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
> calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
> for test setup flexibility, without too much magic.

We can't have our cake and eat it too.  Either we launch/finalize SEV/TDX VMs as
part of the common VM creation flows (as proposed for TDX), or we force tests to
manually launch/finalize the VM after additional setup.  The only way to have it
both ways is to utilize crazy macro shenanigans to execute arbitrary code between
creating the VM and launching/finalizing the VM, and even I would balk at that.

I honestly don't see any reason to try to figure out which of the two approaches
is optimal at this time.  Whatever decision we make isn't set in stone, and in
fact should be relative easy to change.  And without being able to see all the
TDX/SEV tests that are waiting in the wings, it's more or less impossible to make
an informed decision.

I definitely want to have SEV and TDX use the same core approach in the end, but
I don't think we should force the issue right now, because the cost of reworking
the SEV and/or TDX infrastructure when there are only a bare handful of tests is
so low.  It will cost more to try to predict the future of a 50/50 outcome than
it will to make a completely wild guess between the two options and rework things
if we guess wrong.

> For now, I can't think of many uses of __shared. ucall shared memory is
> allocated dynamically, and we can also make it shared cleanly within
> ucall code.

Uh, every selftest that uses global variables to communicate between guest and
host?

> The global variables (sync_global_to_guest()) will appear in the guest
> as long as sync_global_to_guest() is called before
> kvm_arch_vm_finalize(), which I think makes sense to people writing
> tests for CoCo.

Yes, but that's completely orthogonal to all of this.

> So 2 questions to push this along:
> 
> 1. What do you think of a kvm_arch_vm_finalize() that calls
>    vm_sev_launch() and tdx_vm_finalize()? My key issue is that
>    kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
>    rather than the whole VM.

Key word "seems".  I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.

We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
we'll need to decide between automagic vs. manual finalization, and it definitely
should be a separate discussion.

> 3. Would you like __shared implemented together with this series, as a
>    prerequisite, or later?

Only if __shared is a hard requirement for base TDX support, which I assume is
not the case.

^ permalink raw reply

* Re: [PATCH] PCI/TSM: Resume device to D0 for CMA-SPDM operation
From: Jonathan Cameron @ 2026-06-16 12:57 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Dan Williams, Ashish Kalra, Tom Lendacky, Vivaik Balasubrawmanian,
	John Allen, Bjorn Helgaas, linux-coco, linux-pci,
	Aneesh Kumar K.V, Yilun Xu, Zhenzhong Duan, Alexey Kardashevskiy
In-Reply-To: <7bdfaf14d7e5a466f3f650150c688a60e947a7a9.1781527060.git.lukas@wunner.de>

On Mon, 15 Jun 2026 15:19:30 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> Per PCIe r7.0 sec 6.31.3, CMA-SPDM operation in non-D0 states is optional.
> The spec does not define a way to determine if it's supported, so resume
> to D0 unconditionally for the duration of a CMA-SPDM exchange.  Vivaik has
> talked to Windows engineers and they said that Windows does the same.
> 
> Note that for plain DOE operation, it is sufficient for the device to be
> in D3hot and its parents in D0 because config space remains accessible in
> D3hot.  So CMA-SPDM goes beyond the requirements of plain DOE and hence
> resuming to D0 needs to (only) be done in code paths which use DOE
> specifically for CMA-SPDM.
> 
> The pattern used herein for runtime resume is the best practice introduced
> by commit ef8057b07c72 ("PM: runtime: Wrapper macros for ACQUIRE()/
> ACQUIRE_ERR()").
> 
> Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v6.19+
> Cc: Vivaik Balasubrawmanian <vivaik.balasubrawmanian@intel.com>
Seems reasonable to me and your replies to sashiko stuff seem to have that well
covered.  So FWIW

Reviewed-by: Jonathan Cameron <jic23@kernel.org>

^ permalink raw reply

* Re: [PATCH v13 08/22] KVM: selftests: Add TDX boot code
From: Chenyi Qiang @ 2026-06-16  9:21 UTC (permalink / raw)
  To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
  Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-8-6983ae4c3a4d@google.com>



On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Erdem Aktas <erdemaktas@google.com>
> 
> Add code to boot a TDX test VM. Since TDX registers are inaccessible to
> KVM, the boot code loads the relevant values from memory into the
> registers before jumping to the guest code.
> 
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Signed-off-by: Erdem Aktas <erdemaktas@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
>  tools/testing/selftests/kvm/Makefile.kvm           |  1 +
>  .../selftests/kvm/include/x86/tdx/td_boot.h        |  5 ++
>  .../selftests/kvm/include/x86/tdx/td_boot_asm.h    | 16 ++++++
>  tools/testing/selftests/kvm/lib/x86/tdx/td_boot.S  | 60 ++++++++++++++++++++++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> index 02fad7b35eac..929965ca4b75 100644
> --- a/tools/testing/selftests/kvm/Makefile.kvm
> +++ b/tools/testing/selftests/kvm/Makefile.kvm
> @@ -31,6 +31,7 @@ LIBKVM_x86 += lib/x86/sev.c
>  LIBKVM_x86 += lib/x86/svm.c
>  LIBKVM_x86 += lib/x86/ucall.c
>  LIBKVM_x86 += lib/x86/vmx.c
> +LIBKVM_x86 += lib/x86/tdx/td_boot.S
>  
>  LIBKVM_arm64 += lib/arm64/gic.c
>  LIBKVM_arm64 += lib/arm64/gic_v3.c
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h b/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> index af4474dee387..e5d54a20ed72 100644
> --- a/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
> @@ -66,4 +66,9 @@ struct td_boot_parameters {
>  	struct td_per_vcpu_parameters per_vcpu[];
>  };
>  
> +void td_boot(void);
> +void td_boot_code_end(void);
> +
> +#define TD_BOOT_CODE_SIZE (td_boot_code_end - td_boot)
> +
>  #endif /* SELFTEST_TDX_TD_BOOT_H */
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h b/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
> new file mode 100644
> index 000000000000..10b4b527595c
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef SELFTEST_TDX_TD_BOOT_ASM_H
> +#define SELFTEST_TDX_TD_BOOT_ASM_H
> +
> +/*
> + * GPA where TD boot parameters will be loaded.
> + *
> + * TD_BOOT_PARAMETERS_GPA is arbitrarily chosen to
> + *
> + * + be within the 4GB address space
> + * + provide enough contiguous memory for the struct td_boot_parameters such
> + *   that there is one struct td_per_vcpu_parameters for KVM_MAX_VCPUS
> + */
> +#define TD_BOOT_PARAMETERS_GPA 0xffff0000
> +
> +#endif  // SELFTEST_TDX_TD_BOOT_ASM_H

It would be better to maintain consistency by using /* ... */ for single-line comments in this series, such as the SELFTESTS_TDX_TDX_H in patch 20 and other license Identifier.


^ permalink raw reply

* Re: [PATCH v13 01/22] KVM: selftests: Add macros to simplify creating VM shapes for non-default types
From: Xiaoyao Li @ 2026-06-16  8:57 UTC (permalink / raw)
  To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
	Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
  Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-1-6983ae4c3a4d@google.com>

On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Sean Christopherson<seanjc@google.com>
> 
> Add VM_TYPE() and __VM_TYPE() macros to create a vm_shape structure given
> a type (and mode), and use the macros to define VM_SHAPE_{SEV,SEV_ES,SNP}
> shapes for x86's SEV family of VM shapes.  Providing common infrastructure
> will avoid having to copy+paste vm_sev_create_with_one_vcpu() for TDX.
> 
> Use the new SEV+ shapes and drop vm_sev_create_with_one_vcpu().
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson<seanjc@google.com>
> Signed-off-by: Sagi Shahar<sagis@google.com>
> Reviewed-by: Binbin Wu<binbin.wu@linux.intel.com>
> Reviewed-by: Ira Weiny<ira.weiny@intel.com>
> Signed-off-by: Lisa Wang<wyihan@google.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

> ---
>   tools/testing/selftests/kvm/include/kvm_util.h     | 13 +++++++
>   .../testing/selftests/kvm/include/x86/processor.h  |  4 +++
>   tools/testing/selftests/kvm/include/x86/sev.h      |  2 --
>   tools/testing/selftests/kvm/lib/x86/sev.c          | 16 ---------
>   tools/testing/selftests/kvm/x86/sev_smoke_test.c   | 40 +++++++++++-----------
>   5 files changed, 37 insertions(+), 38 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index dc70c6da63fa..041bdbfb93f7 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -233,6 +233,19 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
>   	shape;					\
>   })
>   
> +#define __VM_TYPE(__mode, __type)		\

It seems the name "__VM_SHAPE" fits better?

> +({						\
> +	struct vm_shape shape = {		\
> +		.mode = (__mode),		\
> +		.type = (__type)		\
> +	};					\
> +						\
> +	shape;					\
> +})
> +
> +#define VM_TYPE(__type)				\
> +	__VM_TYPE(VM_MODE_DEFAULT, __type)

and I think making it one line would be OK?

So something on top:

---8<---
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h 
b/tools/testing/selftests/kvm/include/kvm_util.h
index 041bdbfb93f7..a1b5d2029d05 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -223,17 +223,7 @@ kvm_static_assert(sizeof(struct vm_shape) == 
sizeof(u64));

  #define VM_TYPE_DEFAULT                        0

-#define VM_SHAPE(__mode)                       \
-({                                             \
-       struct vm_shape shape = {               \
-               .mode = (__mode),               \
-               .type = VM_TYPE_DEFAULT         \
-       };                                      \
-                                               \
-       shape;                                  \
-})
-
-#define __VM_TYPE(__mode, __type)              \
+#define __VM_SHAPE(__mode, __type)             \
  ({                                             \
         struct vm_shape shape = {               \
                 .mode = (__mode),               \
@@ -243,8 +233,8 @@ kvm_static_assert(sizeof(struct vm_shape) == 
sizeof(u64));
         shape;                                  \
  })

-#define VM_TYPE(__type)                                \
-       __VM_TYPE(VM_MODE_DEFAULT, __type)
+#define VM_SHAPE(__mode)       __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
+#define VM_TYPE(__type)                __VM_SHAPE(VM_MODE_DEFAULT, __type)

  extern enum vm_guest_mode vm_mode_default;



^ permalink raw reply related

* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: K Prateek Nayak @ 2026-06-16  7:27 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <de274c2fb3f794ff1f19f0c96184ee50d04d1282.1781419998.git.ashish.kalra@amd.com>

Hello Ashish,

On 6/16/2026 1:19 AM, Ashish Kalra wrote:
> +	/*
> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
> +	 * reserved processor memory. The RMP scan is the most expensive
> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
> +	 * if they can see a cached result in the reserved processor memory.
> +	 *
> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
> +	 * on every other primary thread. Followers are "designed to"
> +	 * skip the scan if they see the "cached" scan results.
> +	 */
> +	cpumask_copy(follower_mask, &rmpopt_cpumask);

rmpopt_cpumask is constructed after hotplug is disabled but ...

> +
> +	/*
> +	 * Pin the worker to the current CPU for the leader loop so that
> +	 * this_cpu remains valid and the RMPOPT instruction executes on
> +	 * the correct CPU.
> +	 *
> +	 * Use migrate_disable() rather than get_cpu() to prevent
> +	 * migration while still allowing preemption.
> +	 */
> +	migrate_disable();
> +	this_cpu = smp_processor_id();
> +
> +	if (cpumask_test_cpu(this_cpu, follower_mask)) {
> +		/*
> +		 * Current CPU is a primary thread in rmpopt_cpumask.
> +		 * Run leader locally and remove from follower mask.
> +		 */
> +		cpumask_clear_cpu(this_cpu, follower_mask);
> +
> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +			rmpopt(pa);
> +			cond_resched();
> +		}
> +	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
> +				      follower_mask)) {
> +		/*
> +		 * Current CPU is a sibling thread whose primary is in
> +		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
> +		 * is safe to run the leader locally.  Remove the sibling's
> +		 * primary from the follower mask as this core is already
> +		 * covered by the leader.
> +		 */
> +		cpumask_andnot(follower_mask, follower_mask,
> +			       topology_sibling_cpumask(this_cpu));
> +
> +		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +			rmpopt(pa);
> +			cond_resched();
> +		}
> +	} else {
> +		/*
> +		 * Current CPU does not have RMPOPT_BASE MSR programmed.
> +		 * Pick an explicit leader from the cpumask to avoid #UD.
> +		 * Use work_on_cpu() to run in process context on the leader,
> +		 * avoiding IPI latency.
> +		 */

... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
siblings on "rmpopt_cpumask".

How does that happen?

> +		int leader_cpu = cpumask_first(follower_mask);
> +
> +		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
> +			migrate_enable();
> +			goto out;
> +		}
> +
> +		cpumask_clear_cpu(leader_cpu, follower_mask);
> +
> +		/* Release migration pin before work_on_cpu(). */
> +		migrate_enable();
> +
> +		work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);

This creates a delayed work and also waits for it to finish execution
which will add more latency than a simple IPI if the comment about IPI
latency above is accurate.

I think there is some corner case in construction of the
"rmpopt_cpumask" that requires this not-so-pretty else block. Can you
elaborate why this is required?

Perhaps the "rmpopt_cpumask" construction needs:

    for_each_online_cpu(cpu) {
        /* Nominate the first CPU on the sibling mask for RMPOPT */
        if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
            continue;
        cpumask_set_cpu(cpu, &rmpopt_cpumask);
    }


and all you need here is:

    /* Do RMPOPt for local core */
    for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
        rmpopt(pa);

    /* Skip this core from concurrent RMPOPT */
    cpumask_and_not(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));

No?

> +
> +		goto followers;
> +	}
> +
> +	migrate_enable();
> +
-- 
Thanks and Regards,
Prateek


^ permalink raw reply

* Re: [PATCH v8 2/7] x86/sev: Initialize RMPOPT configuration MSRs
From: K Prateek Nayak @ 2026-06-16  6:03 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <6a4d0ec9e037d91c0fdba9c525942ca288e1b1b2.1781419998.git.ashish.kalra@amd.com>

Hello Ashish,

On 6/16/2026 1:18 AM, Ashish Kalra wrote:
> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
> index 8bcdce98f6dc..1b5c18408f0b 100644
> --- a/arch/x86/virt/svm/sev.c
> +++ b/arch/x86/virt/svm/sev.c
> @@ -124,6 +124,10 @@ static void *rmp_bookkeeping __ro_after_init;
>  
>  static u64 probed_rmp_base, probed_rmp_size;
>  
> +static cpumask_t rmpopt_cpumask;

nit.

I believe you can use cpumask_var_t here and do a zalloc_cpumask_var()
during snp_setup_rmpopt(). That way !X86_FEATURE_RMPOPT configs don't
have to needlessly waste space to keep a redundant cpumask around.

Same comment for rmpopt_report_cpumask in Patch 7 which can be
allocated dynamically during rmpopt_debugfs_setup().

> +static phys_addr_t rmpopt_pa_start;
> +static bool rmpopt_configured;
> +
>  static LIST_HEAD(snp_leaked_pages_list);
>  static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
>  
> @@ -490,7 +494,12 @@ static bool __init setup_rmptable(void)
>  	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
>  		if (!setup_segmented_rmptable())
>  			return false;
> +		rmpopt_configured = true;
>  	} else {
> +		/*
> +		 * RMPOPT requires a segmented RMP table, so leave
> +		 * rmpopt_configured clear on contiguous RMP systems.
> +		 */
>  		if (!setup_contiguous_rmptable())
>  			return false;
>  	}
> @@ -555,6 +564,21 @@ int snp_prepare(void)
>  }
>  EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
>  
> +static void rmpopt_cleanup(void)
> +{
> +	int cpu;
> +
> +	cpus_read_lock();

nit.

You can use guard(cpus_read_lock)() unless there is a complicated
locking pattern where you need to drop and re-acquire the read lock.

> +
> +	for_each_cpu(cpu, &rmpopt_cpumask)
> +		WARN_ON_ONCE(wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0));
> +
> +	cpus_read_unlock();
> +
> +	cpumask_clear(&rmpopt_cpumask);
> +	rmpopt_pa_start = 0;
> +}
> +
>  void snp_shutdown(void)
>  {
>  	u64 syscfg;

-- 
Thanks and Regards,
Prateek


^ permalink raw reply

* Re: [PATCH] PCI/TSM: fix use-after-free in find_dsm_dev()
From: Lukas Wunner @ 2026-06-16  3:16 UTC (permalink / raw)
  To: Wentao Liang; +Cc: djbw, bhelgaas, linux-coco, linux-pci, linux-kernel, stable
In-Reply-To: <20260616030243.1661791-1-vulab@iscas.ac.cn>

On Tue, Jun 16, 2026 at 03:02:43AM +0000, Wentao Liang wrote:
> In find_dsm_dev(), pf0 is obtained via pf0_dev_get() which returns a
> reference-counted pointer.  It is declared with __free(pci_dev_put),
> so pci_dev_put() will be called when the variable goes out of scope.
> Returning 'pf0' directly while it still has __free cleanup causes the
> reference to be dropped before the caller can use the pointer, leading
> to a use-after-free.

No, the code comment preceding find_dsm_dev() explicitly states:

   "Note that no additional reference is held for the resulting device
    because that resulting object always has a registered lifetime
    greater-than-or-equal to that of the @pdev argument."

Your patch looks like it may be an LLM-generated hallucination.
Did you use an LLM to come up with the patch?  If so, please use
an Assisted-by tag per Documentation/process/coding-assistants.rst
so that we know to expect hallucinations.

Thanks,

Lukas

^ permalink raw reply

* [PATCH] PCI/TSM: fix use-after-free in find_dsm_dev()
From: Wentao Liang @ 2026-06-16  3:02 UTC (permalink / raw)
  To: djbw, bhelgaas; +Cc: linux-coco, linux-pci, linux-kernel, Wentao Liang, stable

In find_dsm_dev(), pf0 is obtained via pf0_dev_get() which returns a
reference-counted pointer.  It is declared with __free(pci_dev_put),
so pci_dev_put() will be called when the variable goes out of scope.
Returning 'pf0' directly while it still has __free cleanup causes the
reference to be dropped before the caller can use the pointer, leading
to a use-after-free.

Fix by using return no_free_ptr(pf0) to suppress the automatic
cleanup and properly transfer ownership to the caller.

Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 drivers/pci/tsm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/tsm.c b/drivers/pci/tsm.c
index 5fdcd7f2e820..dd4e0cb0c6aa 100644
--- a/drivers/pci/tsm.c
+++ b/drivers/pci/tsm.c
@@ -670,7 +670,7 @@ static struct pci_dev *find_dsm_dev(struct pci_dev *pdev)
 		return NULL;

 	if (is_dsm(pf0))
-		return pf0;
+		return no_free_ptr(pf0);

 	/*
 	 * For cases where a switch may be hosting TDISP services on behalf of
-- 
2.34.1

^ permalink raw reply related

* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-16  0:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
	linux-coco, linux-kernel, x86
In-Reply-To: <aiM2Nm9Eh68mIVX3@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Fri, Jun 05, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> >> Was kvm_arch_vm_finalize_vcpus() supposed to be for finalizing vCPUs
>> >> instead?
>> >>
>> >> The awkward part is that kvm_arch_vm_finalize_vcpus() is called from
>> >> __vm_create_with_vcpus().
>> >>
>> >> While building this POC to test conversions [1] I only wanted to create
>> >> the vm and vcpus and didn't want to finalize yet, since I still needed
>> >> to do more mappings in the guest (and I needed the vm pointer to do
>> >> mappings in the guest).
>> >
>> > Hmm, I would argue this is a flaw in the selftests infrastructure.  IMO, as a
>> > developer, it's quite surprising that the current value of a global variable
>> > doesn't show up in the VM automagically.  I totally understand why selftests
>> > work that way, but it's certainly odd and annoying.  If _that_ were solved, then
>> > the kludginess of what you're doing goes away.
>> >
>> > The other way this could be solved is by adding support for annotating globals
>> > with a __shared flag, a la the kernel's __bss_decrypted, so that loading memory
>> > into the VM can automatically mark the associated globals' pages as shared.
>> >
>>
>> More generally, is your opinion that tests should not have to add extra
>> memslots?
>
> I don't care?  What I care about is making it as easy and intuitive as possible
> for people to write tests, and to minimize maintenance costs.
>
>> If I wanted a shared page, would I have to do
>>
>>   static __shared test_page[4096] = {0};
>>
>> and then rely on ELF loading to put that in the guest for me? Are there
>> some compiler flags/how will I require that test_page be page aligned?
>
> Compilere and linker shenanigans.
>
>> If I mark 10 globals as __shared, would the compiler automatically
>> consolidate the shared memory together?
>
> Yes, follow the __bss_decrypted breadcrumbs.
>
>   #define __bss_decrypted __section(".bss..decrypted")
>
>> I think it's a bit constraining to require that all guest memory be set
>> up statically. It's nice to have but I'd like another option...
>
> You do have options, they just require more work.
>
>> Many tests use vm_userspace_mem_region_add(), CoCo tests that require
>> finalizing shouldn't be disallowed that option.
>
> What does that have to do with finalizing the VM?
>

I could add more memslots after finalizing the VM, but then I'd have to
make the guest accept more memory. Hence, I'd rather set up all the
memslots and then finalize the VM.

sev_smoke_test currently has a separate vm_sev_launch(), which is the
equivalent of tdx_init_mem_region(), and that's called in the tests
themselves.

sev_smoke_test also uses vcpu_args_set() after creating the VM with
vm_create_shape_with_one_vcpu(). Would that work if vm_sev_launch() got
moved into kvm_arch_vm_finalize_vcpus()?

>> >> It's also possible to have some kvm_vm_finalize() call that can be
>> >> explicitly and manually invoked from selftests just for CoCo selftests.
>> >
>> > Why bother?  It's obviously possible to all kvm_arch_vm_finalize_vcpus() directly.
>>
>> Works for me to call directly. Do you mean kvm_arch_vm_finalize_vcpus()
>> is the right function where the TD is finalized?
>>
>> For tests that need to do more setup after creating a vm, is the only
>> way out to call __vm_create() then vm_vcpu_add() to avoid premature
>> finalization in __vm_create_with_vcpus() when
>> kvm_arch_vm_finalize_vcpus() is called?
>
> Depends on what you're doing.  Sometimes, the answer will be yes.  That's why
> there are "low level" APIs, so that some tests can do fancy things, while most
> tests can leave the details to the infrastructure.
>
> If there's a recurring problem, or we anticipate one, then we can and should
> figure out how to minimize the pain so that tests don't have to deal with the
> same boilerplate issues over and over.  Hence the __shared idea.

I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
finalizing the VM.

I would prefer to not have to explicitly call some function like
kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
for test setup flexibility, without too much magic.

For now, I can't think of many uses of __shared. ucall shared memory is
allocated dynamically, and we can also make it shared cleanly within
ucall code.

The global variables (sync_global_to_guest()) will appear in the guest
as long as sync_global_to_guest() is called before
kvm_arch_vm_finalize(), which I think makes sense to people writing
tests for CoCo.


So 2 questions to push this along:

1. What do you think of a kvm_arch_vm_finalize() that calls
   vm_sev_launch() and tdx_vm_finalize()? My key issue is that
   kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
   rather than the whole VM.

3. Would you like __shared implemented together with this series, as a
   prerequisite, or later?

^ permalink raw reply

* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Lisa Wang @ 2026-06-16  0:04 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <eab3711d-c5b1-4132-92e4-7f7040fcd17b@linux.intel.com>

On Mon, Jun 08, 2026 at 04:03:23PM +0800, Binbin Wu wrote:
> On 5/22/2026 7:16 AM, Lisa Wang wrote:
> > +	 * exactly like other memory providers.
> >  	 */
> > -	flags = 0;
> > -	if (is_guest_memfd_required(shape))
> > +	if (is_guest_memfd_required(shape)) {
> >  		flags |= KVM_MEM_GUEST_MEMFD;
> > +		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> > +	}
> >  
> > -	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> > +	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
> 
> The build failed due to this:
> 
> lib/kvm_util.c: In function ‘__vm_create’:
> lib/kvm_util.c:507:9: error: too many arguments to function ‘vm_mem_add’
>   507 |         vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
>       |         ^~~~~~~~~~
> In file included from lib/kvm_util.c:9:
> include/kvm_util.h:714:6: note: declared here
>   714 | void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
>       |      ^~~~~~~~~~
> lib/kvm_util.c: At top level:
> cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics
> 
> It seems the patch set doesn't wire gmem_flags parameter to vm_mem_add().

This series need to based on "KVM: selftests: Add support for mmap() on
guest_memfd in core library" for the gmem_flags parameter.

Lisa

> >  	for (i = 0; i < NR_MEM_REGIONS; i++)
> >  		vm->memslots[i] = 0;
> >  
> > 
> 

^ permalink raw reply

* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Ackerley Tng @ 2026-06-15 23:46 UTC (permalink / raw)
  To: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
	Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
	Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
	Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
	Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
  Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-13-6983ae4c3a4d@google.com>

Lisa Wang <wyihan@google.com> writes:

> Set the initial state of the first memory region as shared if it is
> backed by guest_memfd, so that the KVM selftest framework functions can
> populate mmap()-ed guest_memfd memory the same way memory from other
> memory providers are populated.
>
> For CoCo VMs, pages that need to be private are explicitly set to
> private before executing the VM.
>
>
> [...snip...]
>
> @@ -495,14 +497,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
>  	vm = ____vm_create(shape);
>
>  	/*
> -	 * Force GUEST_MEMFD for the primary memory region if necessary, e.g.
> -	 * for CoCo VMs that require GUEST_MEMFD backed private memory.
> +	 * Force GUEST_MEMFD for the primary memory region if necessary, and
> +	 * initialize it as shared so the selftest framework can populate it
> +	 * exactly like other memory providers.
>  	 */
> -	flags = 0;
> -	if (is_guest_memfd_required(shape))
> +	if (is_guest_memfd_required(shape)) {
>  		flags |= KVM_MEM_GUEST_MEMFD;
> +		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> +	}
>

Just noticed this while hacking some SNP tests.

> -	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> +	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
>  	for (i = 0; i < NR_MEM_REGIONS; i++)
>  		vm->memslots[i] = 0;
>
>
> --
> 2.54.0.746.g67dd491aae-goog

I think this patch should fully buy into in-place conversions, so we
need to also set GUEST_MEMFD_FLAG_MMAP:

@@ -483,6 +483,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape,
u32 nr_runnable_vcpus,
 {
 	u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
 						 nr_extra_pages);
+	enum vm_mem_backing_src_type src_type = VM_MEM_SRC_ANONYMOUS;
 	struct userspace_mem_region *slot0;
 	u64 gmem_flags = 0;
 	struct kvm_vm *vm;
@@ -503,10 +504,16 @@ struct kvm_vm *__vm_create(struct vm_shape
shape, u32 nr_runnable_vcpus,
 	 */
 	if (is_guest_memfd_required(shape)) {
 		flags |= KVM_MEM_GUEST_MEMFD;
-		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
+		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED | GUEST_MEMFD_FLAG_MMAP;
+		/*
+		 * TODO: Clean this up together with vm_mem_add(). Use
+		 * VM_MEM_SRC_SHMEM to tell vm_mem_add() to mmap
+		 * guest_memfd with MAP_SHARED.
+		 */
+		src_type = VM_MEM_SRC_SHMEM;
 	}

-	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0,
gmem_flags);
+	vm_mem_add(vm, src_type, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
 	for (i = 0; i < NR_MEM_REGIONS; i++)
 		vm->memslots[i] = 0;

The VM_MEM_SRC_SHMEM is a bit of a hack but imo that refactor should go
with another series to clean up vm_mem_add().

The above code was working for TDX because vm_mem_add() without the
guest_memfd MMAP flag would leave region->fd as -1. Without
setting src_type to VM_MEM_SRC_ANONYMOUS, the assertion in vm_mem_add()

  region->fd == -1 || backing_src_is_shared(src_type)

would pass.

tdx_init_mem_region() was called with the anonymously mmap()-ed address,
which turns out fine.

With the above change, we would also need to call tdx_init_mem_region()
with NULL for source_address.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox