* Re: [PATCH v6 03/20] dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
From: Aneesh Kumar K.V @ 2026-06-17 14:46 UTC (permalink / raw)
To: Alexey Kardashevskiy, iommu, linux-arm-kernel, linux-kernel,
linux-coco
Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Jason Gunthorpe, Mostafa Saleh, Petr Tesarik, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
Michael Kelley, Cheloha, Scott
In-Reply-To: <845d0c8a-6d51-47aa-8e0b-8381e733444a@amd.com>
Alexey Kardashevskiy <aik@amd.com> writes:
> On 4/6/26 18:39, Aneesh Kumar K.V (Arm) wrote:
>> Propagate force_dma_unencrypted() into DMA_ATTR_CC_SHARED in the
>> dma-direct allocation path and use the attribute to drive the related
>> decisions.
>>
>> This updates dma_direct_alloc(), dma_direct_free(), and
>> dma_direct_alloc_pages() to fold the forced unencrypted case into attrs.
>>
>> Tested-by: Jiri Pirko <jiri@nvidia.com>
>> Tested-by: Michael Kelley <mhklinux@outlook.com>
>> Tested-by: Mostafa Saleh <smostafa@google.com>
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>> kernel/dma/direct.c | 53 +++++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 44 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index a741c8a2ee66..90dc5057a0c0 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -193,16 +193,31 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>> dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>> {
>> bool remap = false, set_uncached = false;
>> - bool mark_mem_decrypt = true;
>> + bool mark_mem_decrypt = false;
>> struct page *page;
>> void *ret;
>>
>> + /*
>> + * DMA_ATTR_CC_SHARED is not a caller-visible dma_alloc_*()
>> + * attribute. The direct allocator uses it internally after it has
>> + * decided that the backing pages must be shared/decrypted, so the
>> + * rest of the allocation path can consistently select DMA addresses,
>> + * choose compatible pools and restore encryption on free.
>
> Why this limit?
>
> Context: I am looking for a memory pool for a few shared pages (to do
> some guest<->host communication), SWIOTLB seems like the right fit but
> swiotlb_alloc() is not exported and
> dma_direct_alloc(DMA_ATTR_CC_SHARED) is not allowed. Thanks,
>
swiotlb is not the right pool to use for that, right?
CCA had a similar requirement for ITS pages and ended up creating a genpool:
b08e2f42e86b ("irqchip/gic-v3-its: Share ITS tables with a non-trusted hypervisor")
-aneesh
^ permalink raw reply
* Re: [PATCH RFC 0/3] KVM: guest_memfd: folio migration for non-confidential VMs
From: David Hildenbrand (Arm) @ 2026-06-17 10:34 UTC (permalink / raw)
To: Ackerley Tng, Sean Christopherson, Alexandru Elisei
Cc: Shivank Garg, Matthew Wilcox (Oracle), Jan Kara, Andrew Morton,
Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Paolo Bonzini, Shuah Khan, Chao Peng,
Nikunj A Dadhania, Ira Weiny, Michael Roth, Pankaj Gupta,
Fuad Tabba, Vishal Annapurve, Nikita Kalyazin, Patrick Roy,
Pratik Sampat, Ashish Kalra, linux-fsdevel, linux-coco, linux-mm,
linux-kernel, kvm, linux-kselftest
In-Reply-To: <CAEvNRgFQLEsKanKrj=ePHoShiY2cgQgxtGs_2CJcZHP=JOjidg@mail.gmail.com>
On 6/16/26 20:09, Ackerley Tng wrote:
> "David Hildenbrand (Arm)" <david@kernel.org> writes:
>
>> On 6/15/26 19:39, Sean Christopherson wrote:
>>>
>>> +1000. It's not just "nice to have", it's a core design principle of guest_memfd.
>>
>> Right, and I raised in the guest_memfd call also the rough idea of Alexandru's
>> use case of having non-movable guest_memfd pages such that we can support use
>> cases where we can hopefully guarantee that a stage-2 mapping will not just
>> randomly go away.
>>
>>>
>
> More concretely, are y'all pointing towards a
> GUEST_MEMFD_FLAG_MIGRATABLE, which will set .migrate =
> kvm_gmem_migrate_folio, and for now, error out for CoCo VMs?
>
>>>
>>> For the purposes of this discussion, we should separate the physical act of
>>> migrating pages from the features that trigger migration. As I said in last week's
>>> guest-memfd call, I am a-ok with supporting page migration as a mechanism, but I
>>> am dead set against supporting NUMA balancing, KSM, LRU-based swap/reclaim, and
>>> anything else that goes against the goal of guest-first memory.
>>
>> Right. Page migration for supporting ZONE_MOVABLE/CMA, compaction, memory
>> offlining, virtio-mem and possibly some collapse mechanism if we were to support
>> THP of some sorts in guest_memfd would are all reasonable.
>>
>
> Background question: how would virtio-mem use migration in the host/guest_memfd?
Good question! As long as there is no nested-virt support (and virtio-mem
support for coco still being in the making) that wouldn't apply, only ordinary
memory hot(un)plug (incl CXL).
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH RFC 0/3] KVM: guest_memfd: folio migration for non-confidential VMs
From: Garg, Shivank @ 2026-06-17 10:17 UTC (permalink / raw)
To: Sean Christopherson, Alexandru Elisei
Cc: Matthew Wilcox (Oracle), Jan Kara, Andrew Morton, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, David Hildenbrand, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Paolo Bonzini, Shuah Khan, Chao Peng,
Nikunj A Dadhania, Ira Weiny, Michael Roth, Pankaj Gupta,
Ackerley Tng, Fuad Tabba, Vishal Annapurve, Nikita Kalyazin,
Patrick Roy, Pratik Sampat, Ashish Kalra, linux-fsdevel,
linux-coco, linux-mm, linux-kernel, kvm, linux-kselftest
In-Reply-To: <ajA4z_Wkb93cTW4m@google.com>
On 6/15/2026 11:09 PM, Sean Christopherson wrote:
> On Mon, Jun 15, 2026, Alexandru Elisei wrote:
>> Hi,
>>
>> On Mon, Jun 15, 2026 at 11:43:14AM +0100, Alexandru Elisei wrote:
>>> Hi,
>>>
>>> On Thu, Jun 11, 2026 at 01:05:07PM +0000, Shivank Garg wrote:
>>>> guest_memfd folios are currently marked unmovable, so the kernel cannot
>>>> perform NUMA-balancing, memory compaction, etc. This is unavoidable for
>>>> confidential VMs (SEV-SNP, TDX), since memory is encrypted and copying it
>>>> needs firmware assistance. However, for non-confidential VMs (like
>>>> Firecracker), we can migrate the folios.
>>>>
>>>> This series enables folio migration for non-confidential guest_memfd and
>>>> also lays the groundwork for migrating confidential guest_memfd later.
>>>> Once firmware-assisted copying support is available, those VMs can be
>>>> made movable, the confidential folio content can be copied separately,
>>>> and the destination folio marked with FOLIO_CONTENT_COPIED so
>>>> __migrate_folio() skips the host-side folio_mc_copy().
>>>
>>> I always thought that one of the nice things about using guest_memfd as a
>>> memory backend, as opposed to host userspace mappings, is that the host
>>> cannot unmap VM memory because of KSM, automatic NUMA balancing, hugepage
>>> collapse, compaction, etc, acting on the host userspace mapping of the
>>> VM memory, and outside of the VMM's or KVM's control.
>
> +1000. It's not just "nice to have", it's a core design principle of guest_memfd.
>
>>> I think it would be useful to preserve this behaviour, even in the absence
>>> of confidential VMs (i.e, guest_memfd file descriptor created with
>>> GUEST_MEMFD_FLAG_MMAP).
>>
>> Just to be clear, I was thinking that it might be useful for both
>> behaviours to exist (migratable and non-migratable) for non-confidential
>> VMs, and allow KVM or userspace to decide which they prefer for a
>> guest_memfd.
>
> For the purposes of this discussion, we should separate the physical act of
> migrating pages from the features that trigger migration. As I said in last week's
> guest-memfd call, I am a-ok with supporting page migration as a mechanism, but I
> am dead set against supporting NUMA balancing, KSM, LRU-based swap/reclaim, and
> anything else that goes against the goal of guest-first memory.
>
> If userspace wants mm/ functionality, then use anon, memfd, hugetlb, shmem, etc.
>
> Shivank, what's the immediate motivation for this series?
Hi Sean,
This makes sense!
Tbh, my main motivation was to start a dialogue on this, since the
implementation+testing itself was easy.
Compaction and memory failure handling were the cases I initially
had in mind. And as David noted, ZONE_MOVABLE/CMA, compaction, memory
offlining, virtio-mem cases would be useful too.
I fully agree that NUMA balancing, LRU/reclaim and etc. features
should stay out, and keeping the migration as mechanism only for
guest_memfd.
Thanks,
Shivank
^ permalink raw reply
* Re: [PATCH v13 03/22] KVM: selftests: Initialize the TDX VM
From: Xiaoyao Li @ 2026-06-17 9:50 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-3-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
> index b68ad1dc7e02..8d06e7186df1 100644
> --- a/tools/testing/selftests/kvm/lib/x86/processor.c
> +++ b/tools/testing/selftests/kvm/lib/x86/processor.c
> @@ -802,6 +802,9 @@ void kvm_arch_vm_post_create(struct kvm_vm *vm, unsigned int nr_vcpus)
> vm_sev_ioctl(vm, KVM_SEV_INIT2, &init);
> }
>
> + if (is_tdx_vm(vm))
> + tdx_init_vm(vm, 0);
> +
It fails compilation:
kvm/tools/testing/selftests/kvm/lib/x86/processor.c:806:(.text+0x212c):
undefined reference to `tdx_init_vm'
We need grab the change on Makefile.kvm from Patch 10 to this patch.
diff --git a/tools/testing/selftests/kvm/Makefile.kvm
b/tools/testing/selftests/kvm/Makefile.kvm
index e5769268936a..0107ba02b01c 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -29,6 +29,7 @@ LIBKVM_x86 += lib/x86/sev.c
LIBKVM_x86 += lib/x86/svm.c
LIBKVM_x86 += lib/x86/ucall.c
LIBKVM_x86 += lib/x86/vmx.c
+LIBKVM_x86 += lib/x86/tdx/tdx_util.c
LIBKVM_arm64 += lib/arm64/gic.c
LIBKVM_arm64 += lib/arm64/gic_v3.c
^ permalink raw reply related
* Re: [PATCH v8 3/7] crypto/ccp: Disable CPU hotplug while SNP is active
From: K Prateek Nayak @ 2026-06-17 4:33 UTC (permalink / raw)
To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <1feccf6e2a56d949b30f403c0ca7949f580e5982.1781419998.git.ashish.kalra@amd.com>
Hello Ashish,
On 6/16/2026 1:19 AM, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> The SEV firmware enumerates the CPUs at SNP initialization and is not
> aware of the OS bringing CPUs online or offline afterwards, so OS CPU
> hotplug can diverge from the firmware's expectations and break SNP.
> Disable CPU hotplug while SNP is active.
Dumb question: Is this specific to RMPOPT? Otherwise ...
>
> SNP is fully torn down only on the SNP_SHUTDOWN_EX x86_snp_shutdown
> path; the legacy path leaves SNP enabled in hardware while clearing
> snp_initialized, so __sev_snp_init_locked() can run again. Track the
> disable with a flag so it is balanced by a matching enable rather than
> stacked, and re-enable hotplug only on the x86_snp_shutdown path, after
> snp_shutdown() has cleared the per-core RMPOPT_BASE MSRs with hotplug
> still disabled.
>
> This also keeps the CPU set stable for the asynchronous RMPOPT scan
> added later in this series, and ensures cpus_read_lock() in the scan
> is uncontended.
>
> Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> ---
> drivers/crypto/ccp/sev-dev.c | 29 ++++++++++++++++++++++++++++-
> 1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 217b6b19802e..c8c3c577463c 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -106,6 +106,9 @@ struct snp_hv_fixed_pages_entry {
>
> static LIST_HEAD(snp_hv_fixed_pages);
>
> +/* Set while SNP has CPU hotplug disabled. */
> +static bool snp_cpu_hotplug_disabled;
> +
> /* Trusted Memory Region (TMR):
> * The TMR is a 1MB area that must be 1MB aligned. Use the page allocator
> * to allocate the memory, which will return aligned memory for the specified
> @@ -1479,6 +1482,17 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
>
> snp_hv_fixed_pages_state_update(sev, HV_FIXED);
>
> + /*
> + * Disable CPU hotplug while SNP is active. Guard against stacking
> + * the disable count: the legacy SNP_SHUTDOWN_EX path clears
> + * snp_initialized without re-enabling hotplug, so this can run
> + * again while hotplug is already disabled.
> + */
> + if (!snp_cpu_hotplug_disabled) {
> + cpu_hotplug_disable();
> + snp_cpu_hotplug_disabled = true;
> + }
> +
... should this be done before __sev_do_cmd_locked(SEV_CMD_SNP_INIT_EX)
is issued?
I'm assuming that is when the firmware enumerates the CPUs during SNP
initialization and any hotplug after that should be disallowed?
> snp_setup_rmpopt();
>
> sev->snp_initialized = true;
--
Thanks and Regards,
Prateek
^ permalink raw reply
* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: K Prateek Nayak @ 2026-06-17 4:20 UTC (permalink / raw)
To: Kalra, Ashish, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <8c5f4082-e3a5-4f65-b058-33938a7ee324@amd.com>
Hello Ashish,
On 6/17/2026 1:26 AM, Kalra, Ashish wrote:
> Hello Prateek,
>
> On 6/16/2026 2:27 AM, K Prateek Nayak wrote:
>> Hello Ashish,
>>
>> On 6/16/2026 1:19 AM, Ashish Kalra wrote:
>>> + /*
>>> + * RMPOPT scans the RMP table, stores the result of the scan in the
>>> + * reserved processor memory. The RMP scan is the most expensive
>>> + * part. If a second RMPOPT occurs, it can skip the expensive scan
>>> + * if they can see a cached result in the reserved processor memory.
>>> + *
>>> + * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>>> + * on every other primary thread. Followers are "designed to"
>>> + * skip the scan if they see the "cached" scan results.
>>> + */
>>> + cpumask_copy(follower_mask, &rmpopt_cpumask);
>>
>> rmpopt_cpumask is constructed after hotplug is disabled but ...
>>
>>> +
>>> + /*
>>> + * Pin the worker to the current CPU for the leader loop so that
>>> + * this_cpu remains valid and the RMPOPT instruction executes on
>>> + * the correct CPU.
>>> + *
>>> + * Use migrate_disable() rather than get_cpu() to prevent
>>> + * migration while still allowing preemption.
>>> + */
>>> + migrate_disable();
>>> + this_cpu = smp_processor_id();
>>> +
>>> + if (cpumask_test_cpu(this_cpu, follower_mask)) {
>>> + /*
>>> + * Current CPU is a primary thread in rmpopt_cpumask.
>>> + * Run leader locally and remove from follower mask.
>>> + */
>>> + cpumask_clear_cpu(this_cpu, follower_mask);
>>> +
>>> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>>> + rmpopt(pa);
>>> + cond_resched();
>>> + }
>>> + } else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
>>> + follower_mask)) {
>>> + /*
>>> + * Current CPU is a sibling thread whose primary is in
>>> + * rmpopt_cpumask. RMPOPT_BASE MSR is per-core, so it
>>> + * is safe to run the leader locally. Remove the sibling's
>>> + * primary from the follower mask as this core is already
>>> + * covered by the leader.
>>> + */
>>> + cpumask_andnot(follower_mask, follower_mask,
>>> + topology_sibling_cpumask(this_cpu));
>>> +
>>> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>>> + rmpopt(pa);
>>> + cond_resched();
>>> + }
>>> + } else {
>>> + /*
>>> + * Current CPU does not have RMPOPT_BASE MSR programmed.
>>> + * Pick an explicit leader from the cpumask to avoid #UD.
>>> + * Use work_on_cpu() to run in process context on the leader,
>>> + * avoiding IPI latency.
>>> + */
>>
>> ... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
>> siblings on "rmpopt_cpumask".
>>
>> How does that happen?
>
> Actually, this was the implementation before the CPU hotplug disable enforcement code was implemented and added in v8,
> and i should have fixed this rmpopt_work_handler() accordingly for v8.
>
> With the enforced cpu hotplug disable support, case #3 here (above) is now dead code, and removing it lets
> cases #1 and #2 collapse too.
>
> snp_prepare() requires cpu_online_mask == cpu_present_mask before SNP init — so when snp_setup_rmpopt() programs the MSRs, every
> core's primary is online -> every core is in rmpopt_cpumask.
>
> So now the work handler always runs on a CPU whose core is programmed. topology_sibling_cpumask(this_cpu) therefore always intersects
> rmpopt_cpumask -> case #1 or #2 always matches.
>
> So i should actually drop case #3 here - which is: "this_cpu is neither in the "rmpopt_cpumask", nor is any of its
> siblings on rmpopt_cpumask"
Ack.
Also the fact that cpu_mark_primary_thread() uses LSBs of APICID and if
you have some insanely weird configuration - like boot with maxcpus=1,
online all the secondary threads (CPUs 256-511 on a 256C/512T system),
launch an SNP guest - it can actually leave everything except CORE0 out
of the "rmpopt_cpumask".
>
>
>>
>>> + int leader_cpu = cpumask_first(follower_mask);
>>> +
>>> + if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
>>> + migrate_enable();
>>> + goto out;
>>> + }
>>> +
>>> + cpumask_clear_cpu(leader_cpu, follower_mask);
>>> +
>>> + /* Release migration pin before work_on_cpu(). */
>>> + migrate_enable();
>>> +
>>> + work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);
>>
>> This creates a delayed work and also waits for it to finish execution
>> which will add more latency than a simple IPI if the comment about IPI
>> latency above is accurate.
>>
>> I think there is some corner case in construction of the
>> "rmpopt_cpumask" that requires this not-so-pretty else block. Can you
>> elaborate why this is required?
>>
>> Perhaps the "rmpopt_cpumask" construction needs:
>>
>> for_each_online_cpu(cpu) {
>> /* Nominate the first CPU on the sibling mask for RMPOPT */
>> if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
>> continue;
>> cpumask_set_cpu(cpu, &rmpopt_cpumask);
>> }
>>
>>
>> and all you need here is:
>>
>> /* Do RMPOPt for local core */
>> for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>> rmpopt(pa);
>>
>> /* Skip this core from concurrent RMPOPT */
>> cpumask_and_not(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));
>>
>> No?
>>
>
> Yes, a simpler implementation will be like this:
> ...
>
> if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
> return;
>
If you move the migrate_disable() here, you can simply do an andnot
without needing to copy the rmpopt_cpumask beforehand and save on one
cpumask iteration.
> cpumask_copy(follower_mask, &rmpopt_cpumask);
>
> /*
> * The current CPU's core always has RMPOPT_BASE programmed
> * (snp_prepare() required all CPUs online at setup and CPU hotplug
> * is disabled while SNP is active), so it can always be the leader.
> * RMPOPT_BASE is per-core; exclude this core from the followers.
> */
> migrate_disable();
> cpumask_andnot(follower_mask, follower_mask,
> topology_sibling_cpumask(smp_processor_id()));
>
> for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> rmpopt(pa);
> cond_resched();
> }
> migrate_enable();
>
> cpus_read_lock();
I think you can even skip the cpus_read_lock() since we know for a
fact that hotplug is disabled when we are here.
Perhaps we can have a lockdep_assert_cpu_hotplug_disabled() which
ensures we'll get a splat if that assumption ever changes when
running with LOCKDEP?
I'll let others comment if that is a good idea or not.
> for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
> cond_resched();
> }
> cpus_read_unlock();
>
> free_cpumask_var(follower_mask);
>
>
> Here, the leader exclusion must use the sibling mask, not clear_cpu(this_cpu). That's why my collapsed version uses:
>
> cpumask_andnot(follower_mask, follower_mask,
> topology_sibling_cpumask(smp_processor_id()));
>
> - If this_cpu is a primary: its sibling mask contains itself (the primary) -> andnot removes this core's primary from the followers.
>
> - If this_cpu is a secondary: it isn't in follower_mask at all, but its sibling mask contains its primary, which is in
> follower_mask -> andnot still removes this core's primary.
>
> So either way the current core is dropped from the followers. (The old code needed two branches because case #1 used
> clear_cpu(this_cpu) — only correct when this_cpu is the primary — while case #2 used the sibling andnot. The single andnot works for
> both cases).
Ack! And I think this looks much cleaner (to my eyes at least ;-)
--
Thanks and Regards,
Prateek
^ permalink raw reply
* Re: [PATCH v13 03/22] KVM: selftests: Initialize the TDX VM
From: Xiaoyao Li @ 2026-06-17 3:54 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-3-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> +/*
> + * Filter CPUID based on TDX supported capabilities
> + *
> + * Input Args:
> + * vm - Virtual Machine
> + * cpuid_data - CPUID fields to filter
> + *
> + * Output Args: None
> + *
> + * Return: None
> + *
> + * For each CPUID leaf, filter out non-supported bits based on the capabilities reported
> + * by the TDX module
> + */
s/non-supported/unsupported/
and break the line to <80 chars
^ permalink raw reply
* Re: [PATCH v13 04/22] KVM: selftests: TDX: Use KVM_TDX_CAPABILITIES to validate TDs' attribute configuration
From: Xiaoyao Li @ 2026-06-17 3:51 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-4-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Make sure that all the attributes enabled by the test are reported as
> supported by both the TDX module and KVM. KVM filters out the attributes
> not supported by itself.
>
> This also exercises the KVM_TDX_CAPABILITIES ioctl.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c b/tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c
> index 868ff62e22f2..e5c998874a0d 100644
> --- a/tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c
> +++ b/tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c
> @@ -110,6 +110,18 @@ static void tdx_filter_cpuid(struct kvm_vm *vm,
> free(tdx_cap);
> }
>
> +static void tdx_check_attributes(struct kvm_vm *vm, u64 attributes)
> +{
> + struct kvm_tdx_capabilities *tdx_cap;
> +
> + tdx_cap = tdx_read_capabilities(vm);
well, this is another caller of tdx_read_capabilities().
As I commented in the previous patch, it's worth caching the result in
tdx_read_capabilities() like what kvm_get_supported_cpuid() does for
kvm_supported_cpuid.
And it can help only print the debug once.
> + /* Make sure all the attributes are reported as supported */
> + TEST_ASSERT_EQ(attributes & tdx_cap->supported_attrs, attributes);
> +
> + free(tdx_cap);
> +}
> +
> void tdx_init_vm(struct kvm_vm *vm, u64 attributes)
> {
> struct kvm_tdx_init_vm *init_vm;
> @@ -129,6 +141,8 @@ void tdx_init_vm(struct kvm_vm *vm, u64 attributes)
> memcpy(&init_vm->cpuid, cpuid, kvm_cpuid2_size(cpuid->nent));
> free(cpuid);
>
> + tdx_check_attributes(vm, attributes);
> +
> init_vm->attributes = attributes;
>
> tdx_vm_ioctl(vm, KVM_TDX_INIT_VM, 0, init_vm);
>
^ permalink raw reply
* Re: [PATCH v13 03/22] KVM: selftests: Initialize the TDX VM
From: Xiaoyao Li @ 2026-06-17 3:21 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-3-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Sagi Shahar <sagis@google.com>
>
> Add tdx_init_vm() to handle the mandatory VM-level initialization
> sequence required for Intel TDX.
>
> For TDX, the guest's CPUID configuration must be "sealed" during
> KVM_TDX_INIT_VM before any vCPUs are created. This is necessary because
> the TDX hardware directly virtualizes CPUID and includes the
> configuration in the guest's initial security measurement.
>
> The helper calculates the required CPUID values by filtering the host-
> supported bits (kvm_get_supported_cpuid) against the "directly
> configurable" bits reported by KVM_TDX_CAPABILITIES, ensuring
> compliance with the strict requirements of the TDH.MNG.INIT SEAMCALL.
>
> Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
> .../selftests/kvm/include/x86/tdx/tdx_util.h | 30 +++++
> tools/testing/selftests/kvm/lib/x86/processor.c | 3 +
> tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c | 137 +++++++++++++++++++++
> 3 files changed, 170 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> index f647e6ca6b34..48d4bd36c35b 100644
> --- a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> @@ -11,4 +11,34 @@ static inline bool is_tdx_vm(struct kvm_vm *vm)
> return vm->type == KVM_X86_TDX_VM;
> }
>
> +/*
> + * TDX ioctls
> + * Use underscores to avoid collisions with struct member names.
> + */
> +#define __tdx_vm_ioctl(vm, cmd, _flags, arg) \
> +({ \
> + int r; \
> + \
> + union { \
> + struct kvm_tdx_cmd c; \
> + unsigned long raw; \
> + } tdx_cmd = { .c = { \
> + .id = (cmd), \
> + .flags = (u32)(_flags), \
> + .data = (u64)(arg), \
> + } }; \
> + \
> + r = __vm_ioctl(vm, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd.raw); \
> + r ?: tdx_cmd.c.hw_error; \
> +})
It looks __tdx_vm_ioctl() can be implemented as the static inline function.
Given all the existing xxx_ioctl() are implmeneted as MACRO, I'm OK with it.
> +
> +#define tdx_vm_ioctl(vm, cmd, flags, arg) \
> +({ \
> + int ret = __tdx_vm_ioctl(vm, cmd, flags, arg); \
> + \
> + __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, vm); \
> +})
> +
> +void tdx_init_vm(struct kvm_vm *vm, u64 attributes);
> +
> #endif /* SELFTESTS_TDX_TDX_UTIL_H */
> diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
> index b68ad1dc7e02..8d06e7186df1 100644
> --- a/tools/testing/selftests/kvm/lib/x86/processor.c
> +++ b/tools/testing/selftests/kvm/lib/x86/processor.c
> @@ -802,6 +802,9 @@ void kvm_arch_vm_post_create(struct kvm_vm *vm, unsigned int nr_vcpus)
> vm_sev_ioctl(vm, KVM_SEV_INIT2, &init);
> }
>
> + if (is_tdx_vm(vm))
> + tdx_init_vm(vm, 0);
> +
> r = __vm_ioctl(vm, KVM_GET_TSC_KHZ, NULL);
> TEST_ASSERT(r > 0, "KVM_GET_TSC_KHZ did not provide a valid TSC frequency.");
> guest_tsc_khz = r;
> diff --git a/tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c b/tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c
> new file mode 100644
> index 000000000000..868ff62e22f2
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c
> @@ -0,0 +1,137 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "kvm_util.h"
> +#include "processor.h"
> +#include "tdx/tdx_util.h"
> +
> +static struct kvm_tdx_capabilities *tdx_read_capabilities(struct kvm_vm *vm)
> +{
> + struct kvm_tdx_capabilities *tdx_cap = NULL;
> + int nr_cpuid_configs = 4;
> + int rc = -1;
> + int i;
> +
> + do {
> + nr_cpuid_configs *= 2;
> +
> + tdx_cap = realloc(tdx_cap, sizeof(*tdx_cap) +
> + sizeof(tdx_cap->cpuid) +
No need to add sizeof(tdx_cap->cpuid). It's included by sizeof(*tdx_cap)
> + (sizeof(struct kvm_cpuid_entry2) * nr_cpuid_configs));
> + TEST_ASSERT(tdx_cap,
> + "Could not allocate memory for tdx capability nr_cpuid_configs %d\n",
> + nr_cpuid_configs);
> +
> + tdx_cap->cpuid.nent = nr_cpuid_configs;
> + rc = __tdx_vm_ioctl(vm, KVM_TDX_CAPABILITIES, 0, tdx_cap);
> + } while (rc < 0 && errno == E2BIG);
> +
> + TEST_ASSERT(rc == 0, "KVM_TDX_CAPABILITIES failed: %d %d",
> + rc, errno);
> +
> + pr_debug("tdx_cap: supported_attrs: 0x%016llx\n"
> + "tdx_cap: supported_xfam 0x%016llx\n",
> + tdx_cap->supported_attrs, tdx_cap->supported_xfam);
> +
> + for (i = 0; i < tdx_cap->cpuid.nent; i++) {
> + const struct kvm_cpuid_entry2 *config = &tdx_cap->cpuid.entries[i];
> +
> + pr_debug("cpuid config[%d]: leaf 0x%x sub_leaf 0x%x eax 0x%08x ebx 0x%08x ecx 0x%08x edx 0x%08x\n",
> + i, config->function, config->index,
> + config->eax, config->ebx, config->ecx, config->edx);
> + }
The debug info will be printed everytime the function is called, which
is unnecessary.
Ideally, the kvm_tdx_capabilities can be cached like what is done for
kvm_supported_cpuid.
> + return tdx_cap;
> +}
> +
> +static struct kvm_cpuid_entry2 *tdx_find_cpuid_config(struct kvm_tdx_capabilities *cap,
> + u32 leaf, u32 sub_leaf)
> +{
> + struct kvm_cpuid_entry2 *config;
> + u32 i;
> +
> + for (i = 0; i < cap->cpuid.nent; i++) {
> + config = &cap->cpuid.entries[i];
> +
> + if (config->function == leaf && config->index == sub_leaf)
> + return config;
> + }
> +
> + return NULL;
> +}
No need to introduce a new fucntin. We can use get_cpuid_entry().
^ permalink raw reply
* Re: [PATCH v13 01/22] KVM: selftests: Add macros to simplify creating VM shapes for non-default types
From: Xiaoyao Li @ 2026-06-17 3:04 UTC (permalink / raw)
To: Sean Christopherson
Cc: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <ajF-9isiWxPyzxci@google.com>
On 6/17/2026 12:51 AM, Sean Christopherson wrote:
> From: Sean Christopherson<seanjc@google.com>
> Date: Tue, 28 Oct 2025 21:20:27 +0000
> Subject: [PATCH] KVM: selftests: Add macros to simplify creating VM shapes for
> non-default types
>
> Add VM_TYPE() and __VM_SHAPE() macros to create a vm_shape structure given
> a type (and mode), and use the macros to define VM_SHAPE_{SEV,SEV_ES,SNP}
> shapes for x86's SEV family of VM shapes. Providing common infrastructure
> will avoid having to copy+paste vm_sev_create_with_one_vcpu() for TDX.
>
> Use the new SEV+ shapes and drop vm_sev_create_with_one_vcpu().
>
> Opportunistically move the existing VM_SHAPE() (now __VM_SHAPE()) macro
> below the definitions of VM_MODE_DEFAULT so that all of the SHAPE/TYPE
> macros are bundled together.
>
> No functional change intended.
>
> Reviewed-by: Binbin Wu<binbin.wu@linux.intel.com>
> Reviewed-by: Ira Weiny<ira.weiny@intel.com>
> Signed-off-by: Sean Christopherson<seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
some nits below
> ---
> .../testing/selftests/kvm/include/kvm_util.h | 28 +++++++------
> .../selftests/kvm/include/x86/processor.h | 4 ++
> tools/testing/selftests/kvm/include/x86/sev.h | 2 -
> tools/testing/selftests/kvm/lib/x86/sev.c | 16 --------
> .../selftests/kvm/x86/sev_smoke_test.c | 40 +++++++++----------
> 5 files changed, 40 insertions(+), 50 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> index dc70c6da63fa..46bae183d7fc 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -221,18 +221,6 @@ struct vm_shape {
>
> kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
>
> -#define VM_TYPE_DEFAULT 0
> -
> -#define VM_SHAPE(__mode) \
> -({ \
> - struct vm_shape shape = { \
> - .mode = (__mode), \
> - .type = VM_TYPE_DEFAULT \
> - }; \
> - \
> - shape; \
> -})
> -
> extern enum vm_guest_mode vm_mode_default;
>
> #if defined(__aarch64__)
> @@ -270,8 +258,24 @@ extern enum vm_guest_mode vm_mode_default;
>
> #endif
>
> +#define VM_TYPE_DEFAULT 0
> +
> +#define __VM_SHAPE(__mode, __type) \
inconsistent indentation with below lines.
> +({ \
> + struct vm_shape shape = { \
> + .mode = (__mode), \
> + .type = (__type), \
> + }; \
> + \
> + shape; \
> +})
> +
> +
one extra new line.
> +#define VM_SHAPE(__mode) __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
> #define VM_SHAPE_DEFAULT VM_SHAPE(VM_MODE_DEFAULT)
>
> +#define VM_TYPE(__type) __VM_SHAPE(VM_MODE_DEFAULT, __type)
> +
> #define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT)
> #define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE)
^ permalink raw reply
* Re: [PATCH v13 02/22] KVM: selftests: Update kvm_init_vm_address_properties() for TDX
From: Xiaoyao Li @ 2026-06-17 2:37 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Sean Christopherson, Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86,
Adrian Hunter
In-Reply-To: <20260521-tdx-selftests-v13-v13-2-6983ae4c3a4d@google.com>
On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Initialize the TDX S-bit and the GPA tag mask in
> kvm_init_vm_address_properties() for TDX VMs, similar to how the C-bit
> is initialized for SEV VMs.
>
> The TDX S-bit is used to distinguish between shared and private guest
> physical addresses. Its position is determined by the guest physical
> address width, which is either 48 or 52 bits for current TDX
> implementations.
>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h | 14 ++++++++++++++
> tools/testing/selftests/kvm/lib/x86/processor.c | 12 ++++++++++--
> 2 files changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> new file mode 100644
> index 000000000000..f647e6ca6b34
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef SELFTESTS_TDX_TDX_UTIL_H
> +#define SELFTESTS_TDX_TDX_UTIL_H
> +
> +#include <stdbool.h>
> +
> +#include "kvm_util.h"
> +
> +static inline bool is_tdx_vm(struct kvm_vm *vm)
> +{
> + return vm->type == KVM_X86_TDX_VM;
> +}
> +
> +#endif /* SELFTESTS_TDX_TDX_UTIL_H */
> diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
> index b51467d70f6e..b68ad1dc7e02 100644
> --- a/tools/testing/selftests/kvm/lib/x86/processor.c
> +++ b/tools/testing/selftests/kvm/lib/x86/processor.c
> @@ -11,6 +11,7 @@
> #include "smm.h"
> #include "svm_util.h"
> #include "sev.h"
> +#include "tdx/tdx_util.h"
> #include "vmx.h"
>
> #ifndef NUM_INTERRUPTS
> @@ -1311,12 +1312,19 @@ void kvm_get_cpu_address_width(unsigned int *pa_bits, unsigned int *va_bits)
>
> void kvm_init_vm_address_properties(struct kvm_vm *vm)
> {
> + u32 gpa_bits = kvm_cpu_property(X86_PROPERTY_GUEST_MAX_PHY_ADDR);
> +
> + vm->arch.sev_fd = -1;
> +
> if (is_sev_vm(vm)) {
> vm->arch.sev_fd = open_sev_dev_path_or_exit();
> vm->arch.c_bit = BIT_ULL(this_cpu_property(X86_PROPERTY_SEV_C_BIT));
> vm->gpa_tag_mask = vm->arch.c_bit;
> - } else {
> - vm->arch.sev_fd = -1;
> + } else if (is_tdx_vm(vm)) {
> + TEST_ASSERT(gpa_bits == 48 || gpa_bits == 52,
> + "TDX: bad X86_PROPERTY_GUEST_MAX_PHY_ADDR value: %u", gpa_bits);
> + vm->arch.s_bit = BIT_ULL(gpa_bits - 1);
> + vm->gpa_tag_mask = vm->arch.s_bit;
> }
> }
>
>
^ permalink raw reply
* Re: [PATCH v6 03/20] dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
From: Alexey Kardashevskiy @ 2026-06-17 0:50 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco
Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Jason Gunthorpe, Mostafa Saleh, Petr Tesarik, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
Michael Kelley, Cheloha, Scott
In-Reply-To: <20260604083959.1265923-4-aneesh.kumar@kernel.org>
On 4/6/26 18:39, Aneesh Kumar K.V (Arm) wrote:
> Propagate force_dma_unencrypted() into DMA_ATTR_CC_SHARED in the
> dma-direct allocation path and use the attribute to drive the related
> decisions.
>
> This updates dma_direct_alloc(), dma_direct_free(), and
> dma_direct_alloc_pages() to fold the forced unencrypted case into attrs.
>
> Tested-by: Jiri Pirko <jiri@nvidia.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> kernel/dma/direct.c | 53 +++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 44 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index a741c8a2ee66..90dc5057a0c0 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -193,16 +193,31 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
> {
> bool remap = false, set_uncached = false;
> - bool mark_mem_decrypt = true;
> + bool mark_mem_decrypt = false;
> struct page *page;
> void *ret;
>
> + /*
> + * DMA_ATTR_CC_SHARED is not a caller-visible dma_alloc_*()
> + * attribute. The direct allocator uses it internally after it has
> + * decided that the backing pages must be shared/decrypted, so the
> + * rest of the allocation path can consistently select DMA addresses,
> + * choose compatible pools and restore encryption on free.
Why this limit?
Context: I am looking for a memory pool for a few shared pages (to do some guest<->host communication), SWIOTLB seems like the right fit but swiotlb_alloc() is not exported and dma_direct_alloc(DMA_ATTR_CC_SHARED) is not allowed. Thanks,
> + */
> + if (attrs & DMA_ATTR_CC_SHARED)
> + return NULL;
> +
> + if (force_dma_unencrypted(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> + mark_mem_decrypt = true;
> + }
> +
> size = PAGE_ALIGN(size);
> if (attrs & DMA_ATTR_NO_WARN)
> gfp |= __GFP_NOWARN;
>
> - if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> - !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev))
> + if (((attrs & (DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_CC_SHARED)) ==
> + DMA_ATTR_NO_KERNEL_MAPPING) && !is_swiotlb_for_alloc(dev))
> return dma_direct_alloc_no_mapping(dev, size, dma_handle, gfp);
>
> if (!dev_is_dma_coherent(dev)) {
> @@ -236,7 +251,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> * Remapping or decrypting memory may block, allocate the memory from
> * the atomic pools instead if we aren't allowed block.
> */
> - if ((remap || force_dma_unencrypted(dev)) &&
> + if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
> dma_direct_use_pool(dev, gfp))
> return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>
> @@ -312,12 +327,24 @@ void dma_direct_free(struct device *dev, size_t size,
> void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs)
> {
> phys_addr_t phys;
> - bool mark_mem_encrypted = true;
> + bool mark_mem_encrypted = false;
> struct io_tlb_pool *swiotlb_pool;
> unsigned int page_order = get_order(size);
>
> - if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> - !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev)) {
> + /* see dma_direct_alloc() for details */
> + WARN_ON(attrs & DMA_ATTR_CC_SHARED);
> +
> + /*
> + * if the device had requested for an unencrypted buffer,
> + * convert it to encrypted on free
> + */
> + if (force_dma_unencrypted(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> + mark_mem_encrypted = true;
> + }
> +
> + if (((attrs & (DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_CC_SHARED)) ==
> + DMA_ATTR_NO_KERNEL_MAPPING) && !is_swiotlb_for_alloc(dev)) {
> /* cpu_addr is a struct page cookie, not a kernel address */
> dma_free_contiguous(dev, cpu_addr, size);
> return;
> @@ -366,10 +393,14 @@ void dma_direct_free(struct device *dev, size_t size,
> struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
> {
> + unsigned long attrs = 0;
> struct page *page;
> void *ret;
>
> - if (force_dma_unencrypted(dev) && dma_direct_use_pool(dev, gfp))
> + if (force_dma_unencrypted(dev))
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> + if ((attrs & DMA_ATTR_CC_SHARED) && dma_direct_use_pool(dev, gfp))
> return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>
> if (is_swiotlb_for_alloc(dev)) {
> @@ -403,7 +434,11 @@ void dma_direct_free_pages(struct device *dev, size_t size,
> phys_addr_t phys;
> void *vaddr = page_address(page);
> struct io_tlb_pool *swiotlb_pool;
> - bool mark_mem_encrypted = true;
> + /*
> + * if the device had requested for an unencrypted buffer,
> + * convert it to encrypted on free
> + */
> + bool mark_mem_encrypted = force_dma_unencrypted(dev);
>
> /* If cpu_addr is not from an atomic pool, dma_free_from_pool() fails */
> if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
--
Alexey
^ permalink raw reply
* Re: [PATCH v8 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish @ 2026-06-16 19:56 UTC (permalink / raw)
To: K Prateek Nayak, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
peterz, thomas.lendacky, herbert, davem, ardb
Cc: pbonzini, aik, Michael.Roth, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <0fa0bc95-ff31-40c5-b083-3c885d09d0ab@amd.com>
Hello Prateek,
On 6/16/2026 2:27 AM, K Prateek Nayak wrote:
> Hello Ashish,
>
> On 6/16/2026 1:19 AM, Ashish Kalra wrote:
>> + /*
>> + * RMPOPT scans the RMP table, stores the result of the scan in the
>> + * reserved processor memory. The RMP scan is the most expensive
>> + * part. If a second RMPOPT occurs, it can skip the expensive scan
>> + * if they can see a cached result in the reserved processor memory.
>> + *
>> + * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> + * on every other primary thread. Followers are "designed to"
>> + * skip the scan if they see the "cached" scan results.
>> + */
>> + cpumask_copy(follower_mask, &rmpopt_cpumask);
>
> rmpopt_cpumask is constructed after hotplug is disabled but ...
>
>> +
>> + /*
>> + * Pin the worker to the current CPU for the leader loop so that
>> + * this_cpu remains valid and the RMPOPT instruction executes on
>> + * the correct CPU.
>> + *
>> + * Use migrate_disable() rather than get_cpu() to prevent
>> + * migration while still allowing preemption.
>> + */
>> + migrate_disable();
>> + this_cpu = smp_processor_id();
>> +
>> + if (cpumask_test_cpu(this_cpu, follower_mask)) {
>> + /*
>> + * Current CPU is a primary thread in rmpopt_cpumask.
>> + * Run leader locally and remove from follower mask.
>> + */
>> + cpumask_clear_cpu(this_cpu, follower_mask);
>> +
>> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> + rmpopt(pa);
>> + cond_resched();
>> + }
>> + } else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
>> + follower_mask)) {
>> + /*
>> + * Current CPU is a sibling thread whose primary is in
>> + * rmpopt_cpumask. RMPOPT_BASE MSR is per-core, so it
>> + * is safe to run the leader locally. Remove the sibling's
>> + * primary from the follower mask as this core is already
>> + * covered by the leader.
>> + */
>> + cpumask_andnot(follower_mask, follower_mask,
>> + topology_sibling_cpumask(this_cpu));
>> +
>> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> + rmpopt(pa);
>> + cond_resched();
>> + }
>> + } else {
>> + /*
>> + * Current CPU does not have RMPOPT_BASE MSR programmed.
>> + * Pick an explicit leader from the cpumask to avoid #UD.
>> + * Use work_on_cpu() to run in process context on the leader,
>> + * avoiding IPI latency.
>> + */
>
> ... this_cpu is neither in the "rmpopt_cpumask", nor is any of its
> siblings on "rmpopt_cpumask".
>
> How does that happen?
Actually, this was the implementation before the CPU hotplug disable enforcement code was implemented and added in v8,
and i should have fixed this rmpopt_work_handler() accordingly for v8.
With the enforced cpu hotplug disable support, case #3 here (above) is now dead code, and removing it lets
cases #1 and #2 collapse too.
snp_prepare() requires cpu_online_mask == cpu_present_mask before SNP init — so when snp_setup_rmpopt() programs the MSRs, every
core's primary is online -> every core is in rmpopt_cpumask.
So now the work handler always runs on a CPU whose core is programmed. topology_sibling_cpumask(this_cpu) therefore always intersects
rmpopt_cpumask -> case #1 or #2 always matches.
So i should actually drop case #3 here - which is: "this_cpu is neither in the "rmpopt_cpumask", nor is any of its
siblings on rmpopt_cpumask"
>
>> + int leader_cpu = cpumask_first(follower_mask);
>> +
>> + if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
>> + migrate_enable();
>> + goto out;
>> + }
>> +
>> + cpumask_clear_cpu(leader_cpu, follower_mask);
>> +
>> + /* Release migration pin before work_on_cpu(). */
>> + migrate_enable();
>> +
>> + work_on_cpu(leader_cpu, rmpopt_leader_fn, NULL);
>
> This creates a delayed work and also waits for it to finish execution
> which will add more latency than a simple IPI if the comment about IPI
> latency above is accurate.
>
> I think there is some corner case in construction of the
> "rmpopt_cpumask" that requires this not-so-pretty else block. Can you
> elaborate why this is required?
>
> Perhaps the "rmpopt_cpumask" construction needs:
>
> for_each_online_cpu(cpu) {
> /* Nominate the first CPU on the sibling mask for RMPOPT */
> if (cpu != cpumask_first(topology_sibling_cpumask(cpu)))
> continue;
> cpumask_set_cpu(cpu, &rmpopt_cpumask);
> }
>
>
> and all you need here is:
>
> /* Do RMPOPt for local core */
> for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
> rmpopt(pa);
>
> /* Skip this core from concurrent RMPOPT */
> cpumask_and_not(follower_mask, &rmpopt_cpumask, topology_sibling_cpumask(cpu));
>
> No?
>
Yes, a simpler implementation will be like this:
...
if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
return;
cpumask_copy(follower_mask, &rmpopt_cpumask);
/*
* The current CPU's core always has RMPOPT_BASE programmed
* (snp_prepare() required all CPUs online at setup and CPU hotplug
* is disabled while SNP is active), so it can always be the leader.
* RMPOPT_BASE is per-core; exclude this core from the followers.
*/
migrate_disable();
cpumask_andnot(follower_mask, follower_mask,
topology_sibling_cpumask(smp_processor_id()));
for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
rmpopt(pa);
cond_resched();
}
migrate_enable();
cpus_read_lock();
for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
on_each_cpu_mask(follower_mask, rmpopt_smp, (void *)pa, true);
cond_resched();
}
cpus_read_unlock();
free_cpumask_var(follower_mask);
Here, the leader exclusion must use the sibling mask, not clear_cpu(this_cpu). That's why my collapsed version uses:
cpumask_andnot(follower_mask, follower_mask,
topology_sibling_cpumask(smp_processor_id()));
- If this_cpu is a primary: its sibling mask contains itself (the primary) -> andnot removes this core's primary from the followers.
- If this_cpu is a secondary: it isn't in follower_mask at all, but its sibling mask contains its primary, which is in
follower_mask -> andnot still removes this core's primary.
So either way the current core is dropped from the followers. (The old code needed two branches because case #1 used
clear_cpu(this_cpu) — only correct when this_cpu is the primary — while case #2 used the sibling andnot. The single andnot works for
both cases).
Thanks,
Ashish
>> + goto followers;
>> + }
>> +
>> + migrate_enable();
>> +
^ permalink raw reply
* Re: [PATCH v13 00/22] TDX KVM selftests
From: Sean Christopherson @ 2026-06-16 18:48 UTC (permalink / raw)
To: Ackerley Tng
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Kiryl Shutsemau, linux-kselftest,
Paolo Bonzini, Pratik R. Sampat, Reinette Chatre, Rick Edgecombe,
Roger Wang, Ryan Afranji, Sagi Shahar, Shuah Khan, Oliver Upton,
Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86,
Adrian Hunter
In-Reply-To: <CAEvNRgH7Lk=z9NqcY4OZXv=y5SeCZHnDNcB0=kHfarjCA4ZPTw@mail.gmail.com>
On Tue, Jun 16, 2026, Ackerley Tng wrote:
> Lisa Wang <wyihan@google.com> writes:
>
> > This patch series focuses on setting up a TDX VM and adding all code
> > necessary to run a basic lifecycle test.
> >
> > Unlike standard KVM selftests can set up the VM through guest registers,
> > TDX module protects TDs' register state from the host. This feature of
> > TDX causes problems on VM boot state initialization and the ucall
> > implementation.
> >
> > In standard KVM selftests, the host directly initializes the guest state
> > by manipulating Special Registers (SREGs) and General Purpose Registers
> > (GPRs) via IOCTLs (KVM_SET_SREGS, etc.) before the first KVM_RUN.
> >
> > To bypass direct register initialization by the host, we utilize the
> > standard x86 reset vector as the default entry point.
> >
> > The mechanism works as follows:
> > 1. The host places register values into a specific memory region and
> > inserts boot code at the VM's default starting point.
> > 2. When the VM starts, it executes this boot code to "pull" values from
> > memory and manually set up its own SREGs and GPRs.
> > 3. Once the environment is ready, the boot code jumps to the guest code.
> >
> > The standard x86 ucall() implementation uses PIO, but it does not
> > actually transmit data through the 4-byte PIO data. Instead, it relies
> > on the host reading the ucall address directly from the guest's RDI
> > register.
> >
> > TDX selftests cannot utilize the standard x86 ucall implementation,
> > because the host is unable to access the guest's RDI register. Based on
> > this restriction, we considered these potential solutions for the TDX
> > ucall implementation.
> >
> > 1. TDCALL PIO with RCX-bits Passthrough
> > We first considered passing the RDI value through RCX bits to bypass the
> > hardware's register protection, which could be the closest approach to
> > the non-TDX implementation as per Sean's suggestion[1]. However, this
> > approach is blocked by the software-side implementation: KVM_GET_REGS
> > currently does not support TDX VMs and returns -EINVAL. To make this
> > work, the KVM ioctl would need a test-only hack.
> >
> > 2. TDCALL PIO with buffer indexing
> > To keep a PIO-based approach and unify the get_ucall implementation for
> > both TDX and non-TDX VMs, we considered TDCALL PIO with buffer indexing.
> > Since the ucall buffer is initialized prior to execution, the VM could
> > just pass a buffer index rather than an 8-byte ucall address to fit
> > within the 4-byte PIO data limit. The host, already knowing the ucall
> > buffer's base address, could then resolve the ucall content via this
> > index. We abandoned this solution because it would require changes to
> > the common ucall structure and impact other non-x86 architectures.
> >
> > 3. TDCALL MMIO (Selected solution)
> > We ultimately selected TDCALL with an 8-byte MMIO data. This method only
> > requires initializing an MMIO GPA and adding TDCALL MMIO implementation
> > for TDX under the original x86 ucall path. While this diverges from the
> > non-TDX PIO, it provides the cleanest implementation with minimal
> > disruption to the overall ucall architecture.
> >
>
> Sean, Lisa evaluated your suggestion [1] (summarized as 1. above) but we
> think TDCALL MMIO is better, what do you think?
I think y'all should have responded to that thread with "that doesn't work
because host userspace can't access the registers". Reviews are multi-way
discussions, not one-way streams of "do this". And the expectation is that
either review feedback is addressed in the next version, or the dicussion is
closed/resolved *before* posting the next version.
Remaining silent and then writing a thesis in the cover letter of a future version
of the series is very inefficient for everyone involved. I obviously don't read
cover letters all that closely at v13 and I gotta imagine a *lot* of effort went
into the above (which I greatly appreciate!). The paper trail also becomes
impossible to follow, because anyone reading my response would probably make the
same assumption as me: it was a viable idea and that's what we implemented.
I'm a-ok with using MMIO, because yeah, there doesn't seem to be a better option.
^ permalink raw reply
* Re: [PATCH v13 21/22] KVM: selftests: Add ucall support for TDX
From: Sean Christopherson @ 2026-06-16 18:47 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-21-6983ae4c3a4d@google.com>
On Thu, May 21, 2026, Lisa Wang wrote:
> diff --git a/tools/testing/selftests/kvm/lib/x86/ucall.c b/tools/testing/selftests/kvm/lib/x86/ucall.c
> index e7dd5791959b..c8e3418d53af 100644
> --- a/tools/testing/selftests/kvm/lib/x86/ucall.c
> +++ b/tools/testing/selftests/kvm/lib/x86/ucall.c
> @@ -5,11 +5,34 @@
> * Copyright (C) 2018, Red Hat, Inc.
> */
> #include "kvm_util.h"
> +#include "tdx/tdx.h"
> +#include "tdx/tdx_util.h"
>
> #define UCALL_PIO_PORT ((u16)0x1000)
>
> +static u8 vm_type;
> +static gpa_t host_ucall_mmio_gpa;
> +static gpa_t ucall_mmio_gpa;
> +
> +void ucall_arch_init(struct kvm_vm *vm, gpa_t mmio_gpa)
I think we should use an x86-specific GPA, not the first address past memslot0.
Unlike other architectures, x86 has a nice swath of addresses that are pretty
much guaranteed to be unused, thanks to selftests creating a local APIC by default.
On the other hand, the chances of a collision with a memslot just after memslot0
are decidedly non-zero.
Note, because CoCo VMS don't support read-only memslots, the TODO in __vm_create()
can't be resolved for TDX using the suggested shenanigans.
I vote for either the I/O APIC (0xfec00000) or HPET(0xfed00000). We *know* TDX
doesn't support an in-kernel I/O APIC, and the odds of KVM selftests ever
implementing an I/O APIC are basically nil. Ditto for the HPET.
> +{
> + vm_type = vm->type;
> + sync_global_to_guest(vm, vm_type);
> +
> + if (is_tdx_vm(vm)) {
> + host_ucall_mmio_gpa = ucall_mmio_gpa = mmio_gpa;
Drop host_ucall_mmio_gpa entirely. "host GPA" is rather nonsensical, and KVM is
responsible for stripping the shared bit. You can actually drop ucall_mmio_gpa
as well if we go with a hardcoded magic address.
> + ucall_mmio_gpa |= vm->arch.s_bit;
> + sync_global_to_guest(vm, ucall_mmio_gpa);
> + }
> +}
> +
> void ucall_arch_do_ucall(gva_t uc)
> {
> + if (vm_type == KVM_X86_TDX_VM) {
> + tdx_mmio_write(ucall_mmio_gpa, MMIO_SIZE_8B, uc);
s/MMIO_SIZE_8B/sizeof(hva_t), because what you're writing is the address of a
pointer in the host virtual address space.
> + return;
> + }
> +
> /*
> * FIXME: Revert this hack (the entire commit that added it) once nVMX
> * preserves L2 GPRs across a nested VM-Exit. If a ucall from L2, e.g.
> @@ -46,6 +69,13 @@ void *ucall_arch_get_ucall(struct kvm_vcpu *vcpu)
> {
> struct kvm_run *run = vcpu->run;
>
> + if (vm_type == KVM_X86_TDX_VM) {
> + if (run->exit_reason == KVM_EXIT_MMIO &&
> + run->mmio.phys_addr == host_ucall_mmio_gpa &&
> + run->mmio.len == MMIO_SIZE_8B && run->mmio.is_write)
> + return (void *)(*((u64 *)run->mmio.data));
This needs to return NULL. Either that or make this an if-elif. Falling
through to the normal KVM_EXIT_IO check is not what we want.
^ permalink raw reply
* Re: [RFC PATCH] mm/vmalloc: add vmalloc_decrypted() and vzalloc_decrypted()
From: Jason Gunthorpe @ 2026-06-16 18:45 UTC (permalink / raw)
To: Catalin Marinas
Cc: Christoph Hellwig, Kameron Carr, akpm, urezki, linux-mm,
linux-kernel, rppt, mhklinux, linux-coco, Suzuki K Poulose
In-Reply-To: <ajGTPelKLhgFqh7F@arm.com>
On Tue, Jun 16, 2026 at 07:17:33PM +0100, Catalin Marinas wrote:
> > The entry point is dma_alloc_noncontiguous() and you get a scatterlist
> > back.
>
> Yes but not scattered pages unless there's an iommu behind. Anyway,
> that's an implementation detail, something like
> dma_alloc_noncontiguous_vmap() could allocate scattered pages as a
> fallback.
Oh I never noticed it deliberately returns only a single dma entry. I
think that could be optionally weakened without alot of trouble
There is also dma_vmap_noncontiguous() already, so I think the main
framework is there, though it seems like it needs a a bit mmore features.
> Currently, something like netvsc_init_buf() just does a vzalloc() and
> passes it down to vmbus_establish_gpadl() which knows how to interpret
> the channel encryption status. I assume with the vzalloc_decrypted()
> API, that info needs to be interpreted at the netvsc_init_buf() level to
> know which allocation to call.
But it doesn't end at alloc does it? hyperv will still have to reach
into the vmap and convert it into an appropriate IPA to pass to the
hypervisor. That really needs to use the arch helpers the DMA API has
and those should not be called by any sort of driver environment like
hyperv.
> If we move towards a dma_alloc_noncontiguous_vmap() API we need vmbus to
> encode the encryption requirement in the hv_device::device somehow so
> that force_dma_unencrypted() knows what do return.
Yes, they would have to act like PCI and mark in-TEE and out of-TEE
struct devices properly so the DMA API knows what to do instead of
open coding a copy of all this logic in hyperv.
> We have the DMA_ATTR_CC_SHARED but that's not interpreted on the DMA
> alloc path,
It is to describe memory that was deliberately allocated as decrypted,
not to control allocation choices.
> so there's a bit more work needed on the DMA API I think (not sure
> whether Aneesh's series covers any of this).
I don't think it does directly. It largely sets the stage to properly
allow a struct device to opt out of force_dma_unencrypted() so we get
support a T=1 PCI device.
Jason
^ permalink raw reply
* Re: [PATCH v13 20/22] KVM: selftests: Implement MMIO WRITE for the TDX VM
From: Sean Christopherson @ 2026-06-16 18:20 UTC (permalink / raw)
To: Lisa Wang
Cc: Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-20-6983ae4c3a4d@google.com>
On Thu, May 21, 2026, Lisa Wang wrote:
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/tdx.h b/tools/testing/selftests/kvm/include/x86/tdx/tdx.h
> new file mode 100644
> index 000000000000..810ca7423c84
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef SELFTESTS_TDX_TDX_H
> +#define SELFTESTS_TDX_TDX_H
> +
> +#include <linux/types.h>
> +
> +enum mmio_size {
> + MMIO_SIZE_1B = 1,
> + MMIO_SIZE_2B = 2,
> + MMIO_SIZE_4B = 4,
> + MMIO_SIZE_8B = 8
This is absurd. Either open code the literals or use sizeof() where appropriate.
> +};
> +
> +u64 tdx_mmio_write(u64 address, enum mmio_size size, u64 data_in);
> +
> +#endif // SELFTESTS_TDX_TDX_H
> diff --git a/tools/testing/selftests/kvm/lib/x86/tdx/tdx.c b/tools/testing/selftests/kvm/lib/x86/tdx/tdx.c
> new file mode 100644
> index 000000000000..f19be79fe11f
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/lib/x86/tdx/tdx.c
> @@ -0,0 +1,30 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "tdx/tdx.h"
> +
> +#define TDG_VP_VMCALL 0
> +#define TDG_VP_VMCALL_VE_REQUEST_MMIO 48
> +#define TDVMCALL_MMIO_WRITE 1
> +#define TDVMCALL_EXPOSE_REGS_MASK 0xFC00
> +
> +u64 tdx_mmio_write(u64 address, enum mmio_size size, u64 data_in)
> +{
> + register u64 r10_reg asm("r10") = TDG_VP_VMCALL;
> + register u64 r11_reg asm("r11") = TDG_VP_VMCALL_VE_REQUEST_MMIO;
> + register u64 r12_reg asm("r12") = size;
> + register u64 r13_reg asm("r13") = TDVMCALL_MMIO_WRITE;
> + register u64 r14_reg asm("r14") = address;
> + register u64 r15_reg asm("r15") = data_in;
> + register u64 rax_reg asm("rax") = TDG_VP_VMCALL;
> + register u64 rcx_reg asm("rcx") = TDVMCALL_EXPOSE_REGS_MASK;
This needs to be proper assembly, i.e. in a .S file. Using register like this
is *extremely* dangerous, because the compiler is (stupidly) allowed to clobber
registers between their declarations/initialization and their consumption in
the asm() blob.
> +
> + asm volatile(
> + ".byte 0x66,0x0f,0x01,0xcc" /* tdcall */
> + : "+r" (r10_reg), "+r" (r11_reg)
> + : "r" (r12_reg), "r" (r13_reg), "r" (r14_reg), "r" (r15_reg),
> + "r" (rax_reg), "r" (rcx_reg)
> + : "cc", "memory"
> + );
> +
> + return r10_reg;
> +}
>
> --
> 2.54.0.746.g67dd491aae-goog
>
^ permalink raw reply
* Re: [RFC PATCH] mm/vmalloc: add vmalloc_decrypted() and vzalloc_decrypted()
From: Catalin Marinas @ 2026-06-16 18:17 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Kameron Carr, akpm, urezki, linux-mm,
linux-kernel, rppt, mhklinux, linux-coco, Suzuki K Poulose
In-Reply-To: <20260612181807.GP1066031@ziepe.ca>
On Fri, Jun 12, 2026 at 03:18:07PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 12, 2026 at 06:49:28PM +0100, Catalin Marinas wrote:
> > On Thu, Jun 11, 2026 at 08:49:54AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 08, 2026 at 04:37:02PM +0100, Catalin Marinas wrote:
> > > > > +/**
> > > > > + * vzalloc_decrypted - allocate zeroed virtually contiguous decrypted memory
> > > > > + * @size: allocation size
> > > > > + *
> > > > > + * Like vmalloc_decrypted(), but the memory is set to zero.
> > > > > + *
> > > > > + * Return: pointer to the allocated memory or %NULL on error
> > > > > + */
> > > > > +void *vzalloc_decrypted_noprof(unsigned long size)
> > > > > +{
> > > > > + void *addr;
> > > > > +
> > > > > + addr = __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
> > > > > + GFP_KERNEL,
> > > > > + pgprot_decrypted(PAGE_KERNEL),
> > > > > + VM_DECRYPTED, NUMA_NO_NODE,
> > > > > + __builtin_return_address(0));
> > > > > + if (addr)
> > > > > + memset(addr, 0, size);
[...]
> > > But what is the purpose of this? I guess some hyperv thing - but
> > > shouldn't we have a more structured way to "DMA map" things for the
> > > hypervisor instead of stuff like this? Why can't you use
> > > dma_alloc_coherent() which actually gives you an address that is
> > > sensible to pass to the hypervisor?
> >
> > IIRC netvsc_init_buf() uses vzalloc() to allocate some memory and that
> > buffer ends up in set_memory_decrypted() via vmbus_establish_gpadl().
> > arm64 does not support changing the decrypted/shared attributed of
> > vmalloc mappings and I don't think we should add it. Better to just
> > allocate it properly upfront.
>
> Sure
>
> > We might be able to use the DMA API but we won't get something like
> > vmalloc() - physically non-contiguous.
>
> The entry point is dma_alloc_noncontiguous() and you get a scatterlist
> back.
Yes but not scattered pages unless there's an iommu behind. Anyway,
that's an implementation detail, something like
dma_alloc_noncontiguous_vmap() could allocate scattered pages as a
fallback.
> > I think dma_alloc_noncontiguous() just falls back to
> > dma_direct_alloc_pages() in the absence of an iommu.
>
> In all cases you get a scatterlist with a CPU list and a DMA
> list. iommu gives a smaller DMA list.
>
> If you want a vmap then you can feed that CPU page list from the sgl
> into vmap().
>
> A dma_alloc_noncontiguous_vmap() helper would not be hard to make, and
> IMHO, would make alot more sense for hyperv to treat the memory access
> from the hypervisor as "DMA" instead of trying to re-invent the DMA
> API.. :\
>
> HCH was already saying we should not be allowing drivers to use
> set_memory_decrypted() at all, and hyperv is the biggest non-core user
> right now...
That's a good aim longer term. I'm not familiar with hyper-v but I think
it needs a mix of private or shared allocations depending on whether a
paravisor is present. That's handled by the vmbus code and the
information is encoded in the vmbus_channel objects.
Currently, something like netvsc_init_buf() just does a vzalloc() and
passes it down to vmbus_establish_gpadl() which knows how to interpret
the channel encryption status. I assume with the vzalloc_decrypted()
API, that info needs to be interpreted at the netvsc_init_buf() level to
know which allocation to call.
If we move towards a dma_alloc_noncontiguous_vmap() API we need vmbus to
encode the encryption requirement in the hv_device::device somehow so
that force_dma_unencrypted() knows what do return. We have the
DMA_ATTR_CC_SHARED but that's not interpreted on the DMA alloc path, so
there's a bit more work needed on the DMA API I think (not sure whether
Aneesh's series covers any of this).
--
Catalin
^ permalink raw reply
* Re: [PATCH RFC 0/3] KVM: guest_memfd: folio migration for non-confidential VMs
From: Ackerley Tng @ 2026-06-16 18:09 UTC (permalink / raw)
To: David Hildenbrand (Arm), Sean Christopherson, Alexandru Elisei
Cc: Shivank Garg, Matthew Wilcox (Oracle), Jan Kara, Andrew Morton,
Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Paolo Bonzini, Shuah Khan, Chao Peng,
Nikunj A Dadhania, Ira Weiny, Michael Roth, Pankaj Gupta,
Fuad Tabba, Vishal Annapurve, Nikita Kalyazin, Patrick Roy,
Pratik Sampat, Ashish Kalra, linux-fsdevel, linux-coco, linux-mm,
linux-kernel, kvm, linux-kselftest
In-Reply-To: <1e77f24d-315a-4f68-a109-2f4520343c0c@kernel.org>
"David Hildenbrand (Arm)" <david@kernel.org> writes:
> On 6/15/26 19:39, Sean Christopherson wrote:
>> On Mon, Jun 15, 2026, Alexandru Elisei wrote:
>>> Hi,
>>>
>>> On Mon, Jun 15, 2026 at 11:43:14AM +0100, Alexandru Elisei wrote:
>>>> Hi,
>>>>
>>>>
>>>> I always thought that one of the nice things about using guest_memfd as a
>>>> memory backend, as opposed to host userspace mappings, is that the host
>>>> cannot unmap VM memory because of KSM, automatic NUMA balancing, hugepage
>>>> collapse, compaction, etc, acting on the host userspace mapping of the
>>>> VM memory, and outside of the VMM's or KVM's control.
>>
>> +1000. It's not just "nice to have", it's a core design principle of guest_memfd.
>
> Right, and I raised in the guest_memfd call also the rough idea of Alexandru's
> use case of having non-movable guest_memfd pages such that we can support use
> cases where we can hopefully guarantee that a stage-2 mapping will not just
> randomly go away.
>
>>
>>>> I think it would be useful to preserve this behaviour, even in the absence
>>>> of confidential VMs (i.e, guest_memfd file descriptor created with
>>>> GUEST_MEMFD_FLAG_MMAP).
>>>
>>> Just to be clear, I was thinking that it might be useful for both
>>> behaviours to exist (migratable and non-migratable) for non-confidential
>>> VMs, and allow KVM or userspace to decide which they prefer for a
>>> guest_memfd.
More concretely, are y'all pointing towards a
GUEST_MEMFD_FLAG_MIGRATABLE, which will set .migrate =
kvm_gmem_migrate_folio, and for now, error out for CoCo VMs?
>>
>> For the purposes of this discussion, we should separate the physical act of
>> migrating pages from the features that trigger migration. As I said in last week's
>> guest-memfd call, I am a-ok with supporting page migration as a mechanism, but I
>> am dead set against supporting NUMA balancing, KSM, LRU-based swap/reclaim, and
>> anything else that goes against the goal of guest-first memory.
>
> Right. Page migration for supporting ZONE_MOVABLE/CMA, compaction, memory
> offlining, virtio-mem and possibly some collapse mechanism if we were to support
> THP of some sorts in guest_memfd would are all reasonable.
>
Background question: how would virtio-mem use migration in the host/guest_memfd?
> As soon as we mix in access/lru semantics, we're going into the wrong direction.
>
> Fortunately KSM is anon-only and not even worth a rant here :)
>
>
>
> --
> Cheers,
>
> David
^ permalink raw reply
* Re: [PATCH v13 00/22] TDX KVM selftests
From: Ackerley Tng @ 2026-06-16 17:51 UTC (permalink / raw)
To: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Kiryl Shutsemau, linux-kselftest,
Paolo Bonzini, Pratik R. Sampat, Reinette Chatre, Rick Edgecombe,
Roger Wang, Ryan Afranji, Sagi Shahar, Sean Christopherson,
Shuah Khan, Oliver Upton
Cc: Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86,
Adrian Hunter
In-Reply-To: <20260521-tdx-selftests-v13-v13-0-6983ae4c3a4d@google.com>
Lisa Wang <wyihan@google.com> writes:
> This patch series focuses on setting up a TDX VM and adding all code
> necessary to run a basic lifecycle test.
>
> Unlike standard KVM selftests can set up the VM through guest registers,
> TDX module protects TDs' register state from the host. This feature of
> TDX causes problems on VM boot state initialization and the ucall
> implementation.
>
> In standard KVM selftests, the host directly initializes the guest state
> by manipulating Special Registers (SREGs) and General Purpose Registers
> (GPRs) via IOCTLs (KVM_SET_SREGS, etc.) before the first KVM_RUN.
>
> To bypass direct register initialization by the host, we utilize the
> standard x86 reset vector as the default entry point.
>
> The mechanism works as follows:
> 1. The host places register values into a specific memory region and
> inserts boot code at the VM's default starting point.
> 2. When the VM starts, it executes this boot code to "pull" values from
> memory and manually set up its own SREGs and GPRs.
> 3. Once the environment is ready, the boot code jumps to the guest code.
>
> The standard x86 ucall() implementation uses PIO, but it does not
> actually transmit data through the 4-byte PIO data. Instead, it relies
> on the host reading the ucall address directly from the guest's RDI
> register.
>
> TDX selftests cannot utilize the standard x86 ucall implementation,
> because the host is unable to access the guest's RDI register. Based on
> this restriction, we considered these potential solutions for the TDX
> ucall implementation.
>
> 1. TDCALL PIO with RCX-bits Passthrough
> We first considered passing the RDI value through RCX bits to bypass the
> hardware's register protection, which could be the closest approach to
> the non-TDX implementation as per Sean's suggestion[1]. However, this
> approach is blocked by the software-side implementation: KVM_GET_REGS
> currently does not support TDX VMs and returns -EINVAL. To make this
> work, the KVM ioctl would need a test-only hack.
>
> 2. TDCALL PIO with buffer indexing
> To keep a PIO-based approach and unify the get_ucall implementation for
> both TDX and non-TDX VMs, we considered TDCALL PIO with buffer indexing.
> Since the ucall buffer is initialized prior to execution, the VM could
> just pass a buffer index rather than an 8-byte ucall address to fit
> within the 4-byte PIO data limit. The host, already knowing the ucall
> buffer's base address, could then resolve the ucall content via this
> index. We abandoned this solution because it would require changes to
> the common ucall structure and impact other non-x86 architectures.
>
> 3. TDCALL MMIO (Selected solution)
> We ultimately selected TDCALL with an 8-byte MMIO data. This method only
> requires initializing an MMIO GPA and adding TDCALL MMIO implementation
> for TDX under the original x86 ucall path. While this diverges from the
> non-TDX PIO, it provides the cleanest implementation with minimal
> disruption to the overall ucall architecture.
>
Sean, Lisa evaluated your suggestion [1] (summarized as 1. above) but we
think TDCALL MMIO is better, what do you think?
+ Jump directly to where the mmio is used: [2]
+ And here's [3] how tdx_mmio_write() is implemented, with no more
throwing everything in a structure. It's also not macroed/prototyped
like you suggested in [4], but I think those prototypes can evolve out
of future tdx functions?
Let us know so Lisa can try another option (if necessary) while we
collect more reviews :)
[1] https://lore.kernel.org/all/aQTcDH9LRezI30dm@google.com/
[2] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-21-6983ae4c3a4d@google.com/
[3] https://lore.kernel.org/all/20260521-tdx-selftests-v13-v13-20-6983ae4c3a4d@google.com/
[4] https://lore.kernel.org/all/aQTdTkMIukzt-YlA@google.com/
> 4. A note on #VE and x86 ucall simplification
> It is worth noting that the use of a Virtualization Exception (#VE)
> is orthogonal to the PIO vs. MMIO discussion; rather, it is a question
> of how much we want to simplify the x86 ucall implementation. A #VE
> handler is one option to allow VMs use PIO/MMIO identical to the
> non-TDX case. Alternatively, having an MMIO_WRITE wrapper macro, as Sean
> suggested[2], is another option. Either way, discussion for this is
> likely a premature optimization right now, since the PIO/MMIO call is
> only used under ucall_arch_do_ucall(), and standard and TDX VMs use
> different ones now. We should optimize this in the future, but for now,
> invoking TDCALL directly is more robust and concise.
>
>
> [...snip...]
>
^ permalink raw reply
* Re: [PATCH] PCI/TSM: Resume device to D0 for CMA-SPDM operation
From: Dan Williams (nvidia) @ 2026-06-16 17:34 UTC (permalink / raw)
To: Lukas Wunner, Dan Williams, Ashish Kalra, Tom Lendacky
Cc: Vivaik Balasubrawmanian, John Allen, Bjorn Helgaas, linux-coco,
linux-pci, Jonathan Cameron, Aneesh Kumar K.V, Yilun Xu,
Zhenzhong Duan, Alexey Kardashevskiy
In-Reply-To: <7bdfaf14d7e5a466f3f650150c688a60e947a7a9.1781527060.git.lukas@wunner.de>
Lukas Wunner wrote:
> Per PCIe r7.0 sec 6.31.3, CMA-SPDM operation in non-D0 states is optional.
> The spec does not define a way to determine if it's supported, so resume
> to D0 unconditionally for the duration of a CMA-SPDM exchange. Vivaik has
> talked to Windows engineers and they said that Windows does the same.
>
> Note that for plain DOE operation, it is sufficient for the device to be
> in D3hot and its parents in D0 because config space remains accessible in
> D3hot. So CMA-SPDM goes beyond the requirements of plain DOE and hence
> resuming to D0 needs to (only) be done in code paths which use DOE
> specifically for CMA-SPDM.
>
> The pattern used herein for runtime resume is the best practice introduced
> by commit ef8057b07c72 ("PM: runtime: Wrapper macros for ACQUIRE()/
> ACQUIRE_ERR()").
>
> Fixes: 3225f52cde56 ("PCI/TSM: Establish Secure Sessions and Link Encryption")
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org # v6.19+
> Cc: Vivaik Balasubrawmanian <vivaik.balasubrawmanian@intel.com>
> ---
> We're in the merge window for v7.2 and this isn't super urgent,
> so it's targeting v7.3 via tsm.git/next.
>
> Technically I'd have permission to apply myself,
> but I wouldn't want to without acks from Dan and AMD!
> Thanks for taking a look!
Thanks, Lukas. A few questions:
This says Fixes, but I assume it is based on inspection and not a
report?
There are no upstream usages of pci_tsm_doe_transfer() yet, but the ones
in flight would suffer from the "D0 -> D3hot -> D0 -> D3hot" bounce that
you described to sashiko. I.e. the runtime acquire should be done at a
higher level.
I think the natural place to add PM_RUNTIME_ACQUIRE() that covers all
cases is withing pci_tsm_connect() and pci_tsm_disconnect().
I also think failure to power manage the device in the disconnect path
should not be fatal to performing the rest of the cleanup.
^ permalink raw reply
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Sean Christopherson @ 2026-06-16 17:06 UTC (permalink / raw)
To: Ackerley Tng
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar, Shuah Khan,
Oliver Upton, Jeremiah McReynolds, kvm, linux-coco, linux-kernel,
x86
In-Reply-To: <CAEvNRgFPKC2uOMaams7SS9B7LxvfU4h8DrPM5vXFb=pmXsgPbA@mail.gmail.com>
On Tue, Jun 16, 2026, Ackerley Tng wrote:
> >> 1. What do you think of a kvm_arch_vm_finalize() that calls
> >> vm_sev_launch() and tdx_vm_finalize()? My key issue is that
> >> kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
> >> rather than the whole VM.
> >
> > Key word "seems". I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
> > the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
> > 'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
> > finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.
> >
> > We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
> > we'll need to decide between automagic vs. manual finalization, and it definitely
> > should be a separate discussion.
> >
>
> This definitely should not block this series.
>
> It's coming together for me now with your explanation:
> kvm_arch_vm_finalize_vcpus() actually means finalizing vCPUs! vGIC ==
> Virtual Generic Interrupt Controller, which has to be done after all the
> vCPUs are set up. Since the name is describing where in the VM
> creation/setup flow the hook is called (after creating VM and after
> creating vCPUs), maybe something like kvm_arch_vm_post_vcpu_create()?
No, because I would expect post_vcpu_create() to run after creating each vCPU,
not after creating all vCPUs. E.g. see KVM's kvm_arch_vcpu_{pre,post}create().
> Renaming this to kvm_arch_vm_finalize() makes it sound like it is
> finalizing the VM, but this function shouldn't finalize the VM since for
> CoCo finalizing the VM also loads the guest image into the guest - deals
> with memory, not just vCPUs.
>
> 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for 'default'
> VMs") also includes a test_disable_default_vgic() function, we could
> also use something like that to skip CoCo VM finalization for some
> tests? Maybe that's a good middle ground.
That probably won't work well, and in practice it's just shuffling deck chairs
on the Titanic. For vGIC, and pre-create hook works because the tests that opt
out of automatic vGIC instantiation want that behavior to apply to all VMs that
the test creates. That's not the case for sev_smoke_test though, because some
testcases need deferred launch (test_sync_vmsa()), whereas others can use
automatic launch (test_sev()).
The other wrinkle is that SEV at least needs to provide the policy, which again
varies per VM within a single test.
^ permalink raw reply
* Re: [PATCH v13 01/22] KVM: selftests: Add macros to simplify creating VM shapes for non-default types
From: Sean Christopherson @ 2026-06-16 16:51 UTC (permalink / raw)
To: Xiaoyao Li
Cc: Lisa Wang, Andrew Jones, Ackerley Tng, Binbin Wu, Chao Gao,
Chenyi Qiang, Dave Hansen, Erdem Aktas, Ira Weiny, Isaku Yamahata,
Kiryl Shutsemau, linux-kselftest, Paolo Bonzini, Pratik R. Sampat,
Reinette Chatre, Rick Edgecombe, Roger Wang, Ryan Afranji,
Sagi Shahar, Shuah Khan, Oliver Upton, Jeremiah McReynolds, kvm,
linux-coco, linux-kernel, x86
In-Reply-To: <e0b99e9a-c20f-4def-ac4b-0070996c10ef@intel.com>
On Tue, Jun 16, 2026, Xiaoyao Li wrote:
> On 5/22/2026 7:16 AM, Lisa Wang wrote:
> > diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
> > index dc70c6da63fa..041bdbfb93f7 100644
> > --- a/tools/testing/selftests/kvm/include/kvm_util.h
> > +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> > @@ -233,6 +233,19 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
> > shape; \
> > })
> > +#define __VM_TYPE(__mode, __type) \
>
> It seems the name "__VM_SHAPE" fits better?
>
> > +({ \
> > + struct vm_shape shape = { \
> > + .mode = (__mode), \
> > + .type = (__type) \
> > + }; \
> > + \
> > + shape; \
> > +})
> > +
> > +#define VM_TYPE(__type) \
> > + __VM_TYPE(VM_MODE_DEFAULT, __type)
>
> and I think making it one line would be OK?
>
> So something on top:
>
> ---8<---
> diff --git a/tools/testing/selftests/kvm/include/kvm_util.h
> b/tools/testing/selftests/kvm/include/kvm_util.h
> index 041bdbfb93f7..a1b5d2029d05 100644
> --- a/tools/testing/selftests/kvm/include/kvm_util.h
> +++ b/tools/testing/selftests/kvm/include/kvm_util.h
> @@ -223,17 +223,7 @@ kvm_static_assert(sizeof(struct vm_shape) ==
> sizeof(u64));
>
> #define VM_TYPE_DEFAULT 0
>
> -#define VM_SHAPE(__mode) \
> -({ \
> - struct vm_shape shape = { \
> - .mode = (__mode), \
> - .type = VM_TYPE_DEFAULT \
> - }; \
> - \
> - shape; \
> -})
> -
> -#define __VM_TYPE(__mode, __type) \
> +#define __VM_SHAPE(__mode, __type) \
> ({ \
> struct vm_shape shape = { \
> .mode = (__mode), \
> @@ -243,8 +233,8 @@ kvm_static_assert(sizeof(struct vm_shape) ==
> sizeof(u64));
> shape; \
> })
>
> -#define VM_TYPE(__type) \
> - __VM_TYPE(VM_MODE_DEFAULT, __type)
> +#define VM_SHAPE(__mode) __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
> +#define VM_TYPE(__type) __VM_SHAPE(VM_MODE_DEFAULT, __type)
Oh, that's way better! I say we go straight there:
--
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 28 Oct 2025 21:20:27 +0000
Subject: [PATCH] KVM: selftests: Add macros to simplify creating VM shapes for
non-default types
Add VM_TYPE() and __VM_SHAPE() macros to create a vm_shape structure given
a type (and mode), and use the macros to define VM_SHAPE_{SEV,SEV_ES,SNP}
shapes for x86's SEV family of VM shapes. Providing common infrastructure
will avoid having to copy+paste vm_sev_create_with_one_vcpu() for TDX.
Use the new SEV+ shapes and drop vm_sev_create_with_one_vcpu().
Opportunistically move the existing VM_SHAPE() (now __VM_SHAPE()) macro
below the definitions of VM_MODE_DEFAULT so that all of the SHAPE/TYPE
macros are bundled together.
No functional change intended.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
.../testing/selftests/kvm/include/kvm_util.h | 28 +++++++------
.../selftests/kvm/include/x86/processor.h | 4 ++
tools/testing/selftests/kvm/include/x86/sev.h | 2 -
tools/testing/selftests/kvm/lib/x86/sev.c | 16 --------
.../selftests/kvm/x86/sev_smoke_test.c | 40 +++++++++----------
5 files changed, 40 insertions(+), 50 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index dc70c6da63fa..46bae183d7fc 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -221,18 +221,6 @@ struct vm_shape {
kvm_static_assert(sizeof(struct vm_shape) == sizeof(u64));
-#define VM_TYPE_DEFAULT 0
-
-#define VM_SHAPE(__mode) \
-({ \
- struct vm_shape shape = { \
- .mode = (__mode), \
- .type = VM_TYPE_DEFAULT \
- }; \
- \
- shape; \
-})
-
extern enum vm_guest_mode vm_mode_default;
#if defined(__aarch64__)
@@ -270,8 +258,24 @@ extern enum vm_guest_mode vm_mode_default;
#endif
+#define VM_TYPE_DEFAULT 0
+
+#define __VM_SHAPE(__mode, __type) \
+({ \
+ struct vm_shape shape = { \
+ .mode = (__mode), \
+ .type = (__type), \
+ }; \
+ \
+ shape; \
+})
+
+
+#define VM_SHAPE(__mode) __VM_SHAPE(__mode, VM_TYPE_DEFAULT)
#define VM_SHAPE_DEFAULT VM_SHAPE(VM_MODE_DEFAULT)
+#define VM_TYPE(__type) __VM_SHAPE(VM_MODE_DEFAULT, __type)
+
#define MIN_PAGE_SIZE (1U << MIN_PAGE_SHIFT)
#define PTES_PER_MIN_PAGE ptes_per_page(MIN_PAGE_SIZE)
diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index 77f576ee7789..0aa6eecfcbde 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -365,6 +365,10 @@ static inline unsigned int x86_model(unsigned int eax)
return ((eax >> 12) & 0xf0) | ((eax >> 4) & 0x0f);
}
+#define VM_SHAPE_SEV VM_TYPE(KVM_X86_SEV_VM)
+#define VM_SHAPE_SEV_ES VM_TYPE(KVM_X86_SEV_ES_VM)
+#define VM_SHAPE_SNP VM_TYPE(KVM_X86_SNP_VM)
+
#define PHYSICAL_PAGE_MASK GENMASK_ULL(51, 12)
#define PAGE_SHIFT 12
diff --git a/tools/testing/selftests/kvm/include/x86/sev.h b/tools/testing/selftests/kvm/include/x86/sev.h
index 1af44c151d60..944c59dbe510 100644
--- a/tools/testing/selftests/kvm/include/x86/sev.h
+++ b/tools/testing/selftests/kvm/include/x86/sev.h
@@ -53,8 +53,6 @@ void snp_vm_launch_start(struct kvm_vm *vm, u64 policy);
void snp_vm_launch_update(struct kvm_vm *vm);
void snp_vm_launch_finish(struct kvm_vm *vm);
-struct kvm_vm *vm_sev_create_with_one_vcpu(u32 type, void *guest_code,
- struct kvm_vcpu **cpu);
void vm_sev_launch(struct kvm_vm *vm, u64 policy, u8 *measurement);
kvm_static_assert(SEV_RET_SUCCESS == 0);
diff --git a/tools/testing/selftests/kvm/lib/x86/sev.c b/tools/testing/selftests/kvm/lib/x86/sev.c
index 93f916903461..95d8520eea34 100644
--- a/tools/testing/selftests/kvm/lib/x86/sev.c
+++ b/tools/testing/selftests/kvm/lib/x86/sev.c
@@ -158,22 +158,6 @@ void snp_vm_launch_finish(struct kvm_vm *vm)
vm_sev_ioctl(vm, KVM_SEV_SNP_LAUNCH_FINISH, &launch_finish);
}
-struct kvm_vm *vm_sev_create_with_one_vcpu(u32 type, void *guest_code,
- struct kvm_vcpu **cpu)
-{
- struct vm_shape shape = {
- .mode = VM_MODE_DEFAULT,
- .type = type,
- };
- struct kvm_vm *vm;
- struct kvm_vcpu *cpus[1];
-
- vm = __vm_create_with_vcpus(shape, 1, 0, guest_code, cpus);
- *cpu = cpus[0];
-
- return vm;
-}
-
void vm_sev_launch(struct kvm_vm *vm, u64 policy, u8 *measurement)
{
if (is_sev_snp_vm(vm)) {
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 1a49ee391586..fe2c438882ae 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -104,7 +104,7 @@ static void compare_xsave(u8 *from_host, u8 *from_guest)
abort();
}
-static void test_sync_vmsa(u32 type, u64 policy)
+static void test_sync_vmsa(struct vm_shape shape, u64 policy)
{
struct kvm_vcpu *vcpu;
struct kvm_vm *vm;
@@ -114,7 +114,7 @@ static void test_sync_vmsa(u32 type, u64 policy)
double x87val = M_PI;
struct kvm_xsave __attribute__((aligned(64))) xsave = { 0 };
- vm = vm_sev_create_with_one_vcpu(type, guest_code_xsave, &vcpu);
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code_xsave);
gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
MEM_REGION_TEST_DATA);
hva = addr_gva2hva(vm, gva);
@@ -150,13 +150,13 @@ static void test_sync_vmsa(u32 type, u64 policy)
kvm_vm_free(vm);
}
-static void test_sev(void *guest_code, u32 type, u64 policy)
+static void test_sev(void *guest_code, struct vm_shape shape, u64 policy)
{
struct kvm_vcpu *vcpu;
struct kvm_vm *vm;
struct ucall uc;
- vm = vm_sev_create_with_one_vcpu(type, guest_code, &vcpu);
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
/* TODO: Validate the measurement is as expected. */
vm_sev_launch(vm, policy, NULL);
@@ -201,12 +201,12 @@ static void guest_shutdown_code(void)
__asm__ __volatile__("ud2");
}
-static void test_sev_shutdown(u32 type, u64 policy)
+static void test_sev_shutdown(struct vm_shape shape, u64 policy)
{
struct kvm_vcpu *vcpu;
struct kvm_vm *vm;
- vm = vm_sev_create_with_one_vcpu(type, guest_shutdown_code, &vcpu);
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_shutdown_code);
vm_sev_launch(vm, policy, NULL);
@@ -218,28 +218,28 @@ static void test_sev_shutdown(u32 type, u64 policy)
kvm_vm_free(vm);
}
-static void test_sev_smoke(void *guest, u32 type, u64 policy)
+static void test_sev_smoke(void *guest, struct vm_shape shape, u64 policy)
{
const u64 xf_mask = XFEATURE_MASK_X87_AVX;
- if (type == KVM_X86_SNP_VM)
- test_sev(guest, type, policy | SNP_POLICY_DBG);
+ if (shape.type == KVM_X86_SNP_VM)
+ test_sev(guest, shape, policy | SNP_POLICY_DBG);
else
- test_sev(guest, type, policy | SEV_POLICY_NO_DBG);
- test_sev(guest, type, policy);
+ test_sev(guest, shape, policy | SEV_POLICY_NO_DBG);
+ test_sev(guest, shape, policy);
- if (type == KVM_X86_SEV_VM)
+ if (shape.type == KVM_X86_SEV_VM)
return;
- test_sev_shutdown(type, policy);
+ test_sev_shutdown(shape, policy);
if (kvm_has_cap(KVM_CAP_XCRS) &&
(xgetbv(0) & kvm_cpu_supported_xcr0() & xf_mask) == xf_mask) {
- test_sync_vmsa(type, policy);
- if (type == KVM_X86_SNP_VM)
- test_sync_vmsa(type, policy | SNP_POLICY_DBG);
+ test_sync_vmsa(shape, policy);
+ if (shape.type == KVM_X86_SNP_VM)
+ test_sync_vmsa(shape, policy | SNP_POLICY_DBG);
else
- test_sync_vmsa(type, policy | SEV_POLICY_NO_DBG);
+ test_sync_vmsa(shape, policy | SEV_POLICY_NO_DBG);
}
}
@@ -247,13 +247,13 @@ int main(int argc, char *argv[])
{
TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV));
- test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+ test_sev_smoke(guest_sev_code, VM_SHAPE_SEV, 0);
if (kvm_cpu_has(X86_FEATURE_SEV_ES))
- test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
+ test_sev_smoke(guest_sev_es_code, VM_SHAPE_SEV_ES, SEV_POLICY_ES);
if (kvm_cpu_has(X86_FEATURE_SEV_SNP))
- test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+ test_sev_smoke(guest_snp_code, VM_SHAPE_SNP, snp_default_policy());
return 0;
}
base-commit: e49bb0b5e1e3a8d7783bc7222c02cc6ff90fa2aa
--
^ permalink raw reply related
* Re: [PATCH v13 19/22] KVM: selftests: Finalize TD memory as part of kvm_arch_vm_finalize_vcpus
From: Ackerley Tng @ 2026-06-16 16:13 UTC (permalink / raw)
To: Sean Christopherson
Cc: Lisa Wang, Andrew Jones, Binbin Wu, Chao Gao, Chenyi Qiang,
Dave Hansen, Erdem Aktas, Isaku Yamahata, Kiryl Shutsemau,
linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar, Shuah Khan,
Oliver Upton, Jeremiah McReynolds, kvm, linux-coco, linux-kernel,
x86
In-Reply-To: <ajFfb9u6dU47Nj3v@google.com>
>
> [...snip...]
>
>>
>> I still think kvm_arch_vm_finalize_vcpus() is an odd place to be
>> finalizing the VM.
>
> That's literally why the function exists though. The one and only existing
> implementation (on arm64) uses it to initialize the vGIC.
>
> void kvm_arch_vm_finalize_vcpus(struct kvm_vm *vm)
> {
> if (vm->arch.has_gic)
> __vgic_v3_init(vm->arch.gic_fd);
> }
>
> That's *very* similar to the proposed TDX usage, where some per-VM asset(s) can
> be initialized/frozen only after all vCPUs have been added. In other words, the
> name is describing where in the VM creation/setup flow the hook is called, and
> perhaps more importantly, the impact of the call: vCPUs are finalized (obviously
> with a different definition of "finalized" based on the VM properties).
>
>> I would prefer to not have to explicitly call some function like
>> kvm_arch_vm_finalize() (no vcpu in the name), but a common arch function
>> calling vm_sev_launch() and tdx_vm_finalize() is what I can think of
>> for test setup flexibility, without too much magic.
>
> We can't have our cake and eat it too. Either we launch/finalize SEV/TDX VMs as
> part of the common VM creation flows (as proposed for TDX), or we force tests to
> manually launch/finalize the VM after additional setup. The only way to have it
> both ways is to utilize crazy macro shenanigans to execute arbitrary code between
> creating the VM and launching/finalizing the VM, and even I would balk at that.
>
> I honestly don't see any reason to try to figure out which of the two approaches
> is optimal at this time. Whatever decision we make isn't set in stone, and in
> fact should be relative easy to change. And without being able to see all the
> TDX/SEV tests that are waiting in the wings, it's more or less impossible to make
> an informed decision.
>
> I definitely want to have SEV and TDX use the same core approach in the end, but
> I don't think we should force the issue right now, because the cost of reworking
> the SEV and/or TDX infrastructure when there are only a bare handful of tests is
> so low. It will cost more to try to predict the future of a 50/50 outcome than
> it will to make a completely wild guess between the two options and rework things
> if we guess wrong.
>
Makes sense. I'm good with merging this as it is done in this
patch. Thanks :)
>> For now, I can't think of many uses of __shared. ucall shared memory is
>> allocated dynamically, and we can also make it shared cleanly within
>> ucall code.
>
> Uh, every selftest that uses global variables to communicate between guest and
> host?
>
>> The global variables (sync_global_to_guest()) will appear in the guest
>> as long as sync_global_to_guest() is called before
>> kvm_arch_vm_finalize(), which I think makes sense to people writing
>> tests for CoCo.
>
> Yes, but that's completely orthogonal to all of this.
>
>> So 2 questions to push this along:
>>
>> 1. What do you think of a kvm_arch_vm_finalize() that calls
>> vm_sev_launch() and tdx_vm_finalize()? My key issue is that
>> kvm_arch_vm_finalize_*vcpus*() seems to be for finalizing vCPUs
>> rather than the whole VM.
>
> Key word "seems". I'm pretty sure Oliver picked kvm_arch_vm_finalize_vcpus() as
> the name in commit 8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for
> 'default' VMs") for the same reasons I think it's a good fit for coco VMs: like
> finalizing TDX VMs, initializing the vGIC effectively finalizes vCPUs.
>
> We could rename it to kvm_arch_vm_finalize(), but that won't change the fact that
> we'll need to decide between automagic vs. manual finalization, and it definitely
> should be a separate discussion.
>
This definitely should not block this series.
It's coming together for me now with your explanation:
kvm_arch_vm_finalize_vcpus() actually means finalizing vCPUs! vGIC ==
Virtual Generic Interrupt Controller, which has to be done after all the
vCPUs are set up. Since the name is describing where in the VM
creation/setup flow the hook is called (after creating VM and after
creating vCPUs), maybe something like kvm_arch_vm_post_vcpu_create()?
Renaming this to kvm_arch_vm_finalize() makes it sound like it is
finalizing the VM, but this function shouldn't finalize the VM since for
CoCo finalizing the VM also loads the guest image into the guest - deals
with memory, not just vCPUs.
8911c7dbc607 ("KVM: arm64: selftests: Create a VGICv3 for 'default'
VMs") also includes a test_disable_default_vgic() function, we could
also use something like that to skip CoCo VM finalization for some
tests? Maybe that's a good middle ground.
>> 3. Would you like __shared implemented together with this series, as a
>> prerequisite, or later?
>
> Only if __shared is a hard requirement for base TDX support, which I assume is
> not the case.
Yup!
^ permalink raw reply
* SVSM Development Call June 17th, 2026
From: Jörg Rödel @ 2026-06-16 16:10 UTC (permalink / raw)
To: coconut-svsm, linux-coco
Hi,
Here is the call for agenda items for this weeks SVSM development call. Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.
We will use the LF Zoom instance. Details of the meeting can be found in our
governance repository at:
https://github.com/coconut-svsm/governance
The link to the COCONUT-SVSM calendar is:
https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week
The meeting will be recorded and the recording eventually published.
Regards,
Jörg
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox