* [PATCH v4 00/13] dma-mapping: Use DMA_ATTR_CC_SHARED through direct, pool and swiotlb paths
From: Aneesh Kumar K.V (Arm) @ 2026-05-12 9:03 UTC (permalink / raw)
To: iommu, linux-arm-kernel, linux-kernel, linux-coco
Cc: Aneesh Kumar K.V (Arm), Robin Murphy, Marek Szyprowski,
Will Deacon, Marc Zyngier, Steven Price, Suzuki K Poulose,
Catalin Marinas, Jiri Pirko, Jason Gunthorpe, Mostafa Saleh,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
This series propagates DMA_ATTR_CC_SHARED through the dma-direct,
dma-pool, and swiotlb paths so that encrypted and decrypted DMA buffers
are handled consistently.
Today, the direct DMA path mostly relies on force_dma_unencrypted() for
shared/decrypted buffer handling. This series consolidates the
force_dma_unencrypted() checks in the top-level functions and ensures
that the remaining DMA interfaces use DMA attributes to make the correct
decisions.
The series:
- moves swiotlb-backed allocations out of __dma_direct_alloc_pages()
- propagates DMA_ATTR_CC_SHARED through the dma-direct alloc/free
paths
- teaches the atomic DMA pools to track encrypted versus decrypted
state
- tracks swiotlb pool encryption state and enforces strict pool
selection
- centralizes encrypted/decrypted pgprot handling in dma_pgprot() using
DMA attributes
- passes DMA attributes down to dma_capable() so capability checks can
validate whether the selected DMA address encoding matches
DMA_ATTR_CC_SHARED
- makes dma_direct_map_phys() choose the DMA address encoding from
DMA_ATTR_CC_SHARED and fall back to swiotlb when a shared DMA request
cannot use the direct mapping, which lets arm64 and x86 CC guests stop
relying on SWIOTLB_FORCE for DMA mappings
- uses the selected swiotlb pool state to derive the returned DMA
address.
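A minimal userspace sketch of the last two items, under stated assumptions: the attribute chooses the address encoding on the direct path, and on fallback the returned address is derived from the state of the selected swiotlb pool. Every name here (sketch_map_phys(), SKETCH_ATTR_CC_SHARED, SKETCH_SHARED_ADDR_BIT, struct sketch_pool) is hypothetical, the "shared" encoding is modelled as one extra address bit purely for illustration, and direct_ok stands in for whatever dma_capable() decides in the real series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these values come from the kernel. */
#define SKETCH_ATTR_CC_SHARED   (1UL << 0)
#define SKETCH_SHARED_ADDR_BIT  (1ULL << 51)

enum sketch_path { SKETCH_PATH_DIRECT, SKETCH_PATH_SWIOTLB };

/* A bounce pool that remembers whether its memory is shared. */
struct sketch_pool {
	uint64_t phys;
	bool shared;
};

struct sketch_mapping {
	enum sketch_path path;
	uint64_t dma_addr;
};

/* Model the two address encodings as "with" or "without" one bit. */
uint64_t sketch_phys_to_dma(uint64_t phys, bool shared)
{
	return shared ? (phys | SKETCH_SHARED_ADDR_BIT) : phys;
}

/*
 * direct_ok models the capability check: true when the encoding
 * requested through attrs can be used for a direct mapping.
 */
struct sketch_mapping sketch_map_phys(uint64_t phys, unsigned long attrs,
				      bool direct_ok,
				      const struct sketch_pool *pool)
{
	bool want_shared = (attrs & SKETCH_ATTR_CC_SHARED) != 0;
	struct sketch_mapping m;

	if (direct_ok) {
		/* Direct path: the attribute selects the encoding. */
		m.path = SKETCH_PATH_DIRECT;
		m.dma_addr = sketch_phys_to_dma(phys, want_shared);
	} else {
		/* Fallback: the pool state, not the caller, selects it. */
		m.path = SKETCH_PATH_SWIOTLB;
		m.dma_addr = sketch_phys_to_dma(pool->phys, pool->shared);
	}
	return m;
}
```

The point of the sketch is only that no global force-bounce flag is consulted: the decision is made per mapping from the attribute and the capability check.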
Changes from v3:
https://lore.kernel.org/all/20260427055509.898190-1-aneesh.kumar@kernel.org
* Handle DMA_ATTR_MMIO correctly in dma_direct_map_phys()
* Address most of the Sashiko review comments
* Rebase to latest kernel
* Drop SWIOTLB_FORCE for s390 and powerpc secure guests.
Changes from v2:
https://lore.kernel.org/all/20260420061415.3650870-1-aneesh.kumar@kernel.org
* pass attrs to dma_capable() and update direct, swiotlb, Xen swiotlb, and
x86 GART paths so the capability checks can see whether
DMA_ATTR_CC_SHARED is set.
* rework dma_direct_map_phys() so DMA_ATTR_CC_SHARED selects
phys_to_dma_unencrypted() while the default path uses
phys_to_dma_encrypted(), with swiotlb fallback when the requested
shared/private state cannot be satisfied by a direct DMA address.
* stop relying on SWIOTLB_FORCE for arm64 and x86 CC guest DMA mappings;
swiotlb is still enabled there, but shared mappings are now selected
through the generic dma_direct_map_phys()/dma_capable() decision instead
of a global force-bounce flag.
Changes from v1:
https://lore.kernel.org/all/20260417085900.3062416-1-aneesh.kumar@kernel.org
* rebased to latest kernel (change from DMA_ATTR_CC_DECRYPTED -> DMA_ATTR_CC_SHARED)
* update the alloc path so DMA_ATTR_CC_SHARED is not a caller-visible attribute.
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suzuki K Poulose <Suzuki.Poulose@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mostafa Saleh <smostafa@google.com>
Cc: Petr Tesarik <ptesarik@suse.com>
Cc: Alexey Kardashevskiy <aik@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: x86@kernel.org
Aneesh Kumar K.V (Arm) (13):
dma-direct: swiotlb: handle swiotlb alloc/free outside
__dma_direct_alloc_pages
dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
dma-pool: track decrypted atomic pools and select them via attrs
dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
dma-mapping: make dma_pgprot() honor DMA_ATTR_CC_SHARED
dma-direct: pass attrs to dma_capable() for DMA_ATTR_CC_SHARED checks
dma-direct: make dma_direct_map_phys() honor DMA_ATTR_CC_SHARED
dma-direct: set decrypted flag for remapped DMA allocations
dma-direct: select DMA address encoding from DMA_ATTR_CC_SHARED
dma-pool: fix page leak in atomic_pool_expand() cleanup
dma-direct: rename ret to cpu_addr in alloc helpers
dma-direct: return struct page from dma_direct_alloc_from_pool()
x86/amd-gart: preserve the direct DMA address until GART mapping
succeeds
arch/arm64/mm/init.c | 4 +-
arch/powerpc/platforms/pseries/svm.c | 2 +-
arch/s390/mm/init.c | 2 +-
arch/x86/kernel/amd_gart_64.c | 36 ++--
arch/x86/kernel/pci-dma.c | 4 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/xen/swiotlb-xen.c | 10 +-
include/linux/dma-direct.h | 19 ++-
include/linux/dma-map-ops.h | 2 +-
include/linux/swiotlb.h | 8 +-
kernel/dma/direct.c | 243 ++++++++++++++++++++-------
kernel/dma/direct.h | 40 ++---
kernel/dma/mapping.c | 16 +-
kernel/dma/pool.c | 165 +++++++++++-------
kernel/dma/swiotlb.c | 109 +++++++++---
15 files changed, 459 insertions(+), 203 deletions(-)
base-commit: 50897c955902c93ae71c38698abb910525ebdc89
--
2.43.0
* Re: [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf
From: Alexey Kardashevskiy @ 2026-05-12 5:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
In-Reply-To: <20260511235617.GG1116784@nvidia.com>
On 12/5/26 09:56, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 09:42:01AM +1000, Alexey Kardashevskiy wrote:
>
>>> true but either way dmabuf slicing will be directed by QEMU's msix-table
>>> emulation MR and this slicing needs to match the TDISP report so I'll
>>> have to teach QEMU these reports, right?
>>
>> Or TDISP devices are going to align MSIX BARs to 4K, and QEMU will
>> do the same and it should "just work", and if it does not - the host
>> won't crash. Can this work? Thanks,
>
> Host crashing stuff is a different issue, I think the plan was to
> revoke the entire MMIO space from userspace and remove it from the
> kernel mapping. Entire because we don't want to parse the TDISP report
> to figure out something more narrow.
>
> Therefore there is no way the host can crash.
Ah ok.
> When qemu constructs the VM memory map it already has a scheme to
> insert a hole for a SW emulated page for MSI. That will keep working
> exactly as it is.
>
> When the VM validates the MMIO the hole has to fall within a T=0 space
> of the TDISP report or the VM will reject it.
>
> This means devices need to have a T=0 hole around their MSI-X/etc
> suitable for a 64K page size OS.
Since we are ditching mappings, the entire MSIX-containing 64K block will be ioctl()ed instead of directly accessed from QEMU via mmap (which is slower than the VM's direct access but still)?
> This is already the case, if a device mixes MSIx with other things
> qemu will work but it becomes horribly slow and a little broken.
Really only when MSIX is not system page size aligned but yeah, I had enough of that with PPC. Thanks,
>
> Jason
--
Alexey
* Re: [BUG] x86/virt/tdx: tdx_offline_cpu() violates tdx_cpu_flush_cache() preemption assert
From: Huang, Kai @ 2026-05-12 1:00 UTC (permalink / raw)
To: Verma, Vishal L, devnexen@gmail.com, Edgecombe, Rick P,
dave.hansen@linux.intel.com
Cc: Gao, Chao, linux-kernel@vger.kernel.org, seanjc@google.com,
bp@alien8.de, kas@kernel.org, hpa@zytor.com, mingo@redhat.com,
Hunter, Adrian, x86@kernel.org, tglx@kernel.org,
pbonzini@redhat.com, linux-coco@lists.linux.dev,
kvm@vger.kernel.org
In-Reply-To: <CA+XhMqzcFRY=ogMhiCQeKqh-zz3RpP0nsUWYhP0jNhF9Uy+41A@mail.gmail.com>
On Mon, 2026-05-11 at 22:33 +0100, David CARLIER wrote:
> Hi,
>
> In commit 597bdf6e068e ("x86/virt/tdx: Pull kexec cache flush logic into
> arch/x86"), tdx_offline_cpu() gained a call to tdx_cpu_flush_cache(),
> which starts with lockdep_assert_preemption_disabled().
>
> tdx_offline_cpu() is registered at CPUHP_AP_ONLINE_DYN. ONLINE-section
> teardown callbacks run from the pinned per-CPU hotplug thread with
> preemption and interrupts enabled (Documentation/core-api/cpu_hotplug.rst,
> and cpuhp_thread_fun() only disables IRQs for atomic states).
>
> The other callers — tdx_shutdown_cpu() via on_each_cpu(), and the
> crash path — satisfy the assertion. Only the offline path doesn't, and
> the splat should fire on every offline once the TDX module is
> initialized and the done: path is taken.
>
> Wrapping the call with preempt_disable() / preempt_enable() at the
> offline site keeps the contract for the kexec/shutdown callers.
>
> Not yet reproduced on a debug kernel; reporting on inspection.
>
> Fixes: 597bdf6e068e ("x86/virt/tdx: Pull kexec cache flush logic
> into arch/x86")
Right, the lockdep_assert_preemption_disabled() is wrong when
tdx_cpu_flush_cache() is called from CPUHP context (there's no functional
issue, though; it's just that the lockdep assertion is wrong).
It was introduced when the TDX host kexec support was added, so the above commit
is not the right one to blame. Previously, tdx_cpu_flush_cache() was called
from KVM's module unload path, also via the CPUHP context. The commit above
only moved it to TDX core's CPU offline path.
The latest version of the fix is:
https://lore.kernel.org/lkml/20260407233333.1608820-1-kai.huang@intel.com/
but it needs rebasing now.
* Re: [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf
From: Jason Gunthorpe @ 2026-05-11 23:56 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
In-Reply-To: <c166f41e-d983-4a22-95d1-c485a82d1d06@amd.com>
On Tue, May 12, 2026 at 09:42:01AM +1000, Alexey Kardashevskiy wrote:
> > true but either way dmabuf slicing will be directed by QEMU's msix-table
> > emulation MR and this slicing needs to match the TDISP report so I'll
> > have to teach QEMU these reports, right?
>
> Or TDISP devices are going to align MSIX BARs to 4K, and QEMU will
> do the same and it should "just work", and if it does not - the host
> won't crash. Can this work? Thanks,
Host crashing stuff is a different issue, I think the plan was to
revoke the entire MMIO space from userspace and remove it from the
kernel mapping. Entire because we don't want to parse the TDISP report
to figure out something more narrow.
Therefore there is no way the host can crash.
When qemu constructs the VM memory map it already has a scheme to
insert a hole for a SW emulated page for MSI. That will keep working
exactly as it is.
When the VM validates the MMIO the hole has to fall within a T=0 space
of the TDISP report or the VM will reject it.
This means devices need to have a T=0 hole around their MSI-X/etc
suitable for a 64K page size OS.
This is already the case, if a device mixes MSIx with other things
qemu will work but it becomes horribly slow and a little broken.
Jason
* Re: [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf
From: Alexey Kardashevskiy @ 2026-05-11 23:42 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
In-Reply-To: <3128deea-95a3-4c36-902b-37f280913f2b@amd.com>
On 7/5/26 17:16, Alexey Kardashevskiy wrote:
> On 6/5/26 23:16, Jason Gunthorpe wrote:
>> On Wed, May 06, 2026 at 12:35:42PM +1000, Alexey Kardashevskiy wrote:
>>> Hi!
>>>
>>> Let's reignite this topic.
>>>
>>> I've been using these patches + QEMU side hacks for 6+ months. And it's been fine until I got a device where MSIX BAR is in the middle of another BAR marked as TEE in the TDISP interface report. And no trusted MSIX yet.
>>>
>>> Every time QEMU mmaps a BAR - I request a dmabuf fd from VFIO in QEMU. Since mapping of an entire MSIX BAR is allowed by default, VFIORegion::nr_mmaps==1 and it is an entire BAR.
>>>
>>> Problem: KVM memslot mismatches the dmabuf fd size
>>
>> Huh? kvm does not care about dmabuf at all? Are you running other
>> patches to hook kvm and dmabuf?
>
> yup, 06/12 of this patchset.
>
>> Putting a slice in a dmabuf is a well understood need for MSI, so I
>> expect whatever kvm dmabuf interface that gets merged to accommodate
>> this?
>
> good to know.
>
>>> Solution2: modify logic in VFIO dmabuf to allow multiple KVM memory
>>> slots per dmabuf. Now it is kvm_memory_slot::dmabuf_attach with no
>>> offset into the dmabuf and one kvm_vfio_dmabuf per dma_buf.
>>
>> Yes, when kvm learns to take in a dmabuf it needs to take in a slice,
>> not the whole buf. Or you need to create multiple dmabufs with the
>> necessary slices from the VFIO. The upstream vfio dmabuf creation
>> allows creating it with a slice.
>
> true but either way dmabuf slicing will be directed by QEMU's msix-table emulation MR and this slicing needs to match the TDISP report so I'll have to teach QEMU these reports, right?
Or TDISP devices are going to align MSIX BARs to 4K, and QEMU will do the same and it should "just work", and if it does not - the host won't crash. Can this work? Thanks,
> I am worried if I miss something obvious, again. Thanks,
>
>
> ps. I like nntp.lore.kernel.org very much for ability to dig out old stuff and then just reply to it :)
>
>>
>> Jason
>
--
Alexey
* [BUG] x86/virt/tdx: tdx_offline_cpu() violates tdx_cpu_flush_cache() preemption assert
From: David CARLIER @ 2026-05-11 21:33 UTC (permalink / raw)
To: Vishal Verma, Rick Edgecombe, Dave Hansen
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Kai Huang, Chao Gao, Kiryl Shutsemau, Sean Christopherson,
Paolo Bonzini, Adrian Hunter, x86, kvm, linux-coco,
open list:SCHEDULER
Hi,
In commit 597bdf6e068e ("x86/virt/tdx: Pull kexec cache flush logic into
arch/x86"), tdx_offline_cpu() gained a call to tdx_cpu_flush_cache(),
which starts with lockdep_assert_preemption_disabled().
tdx_offline_cpu() is registered at CPUHP_AP_ONLINE_DYN. ONLINE-section
teardown callbacks run from the pinned per-CPU hotplug thread with
preemption and interrupts enabled (Documentation/core-api/cpu_hotplug.rst,
and cpuhp_thread_fun() only disables IRQs for atomic states).
The other callers — tdx_shutdown_cpu() via on_each_cpu(), and the
crash path — satisfy the assertion. Only the offline path doesn't, and
the splat should fire on every offline once the TDX module is
initialized and the done: path is taken.
Wrapping the call with preempt_disable() / preempt_enable() at the
offline site keeps the contract for the kexec/shutdown callers.
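The wrapping proposed above can be sketched in userspace as follows. This is a mock, not kernel code: sketch_preempt_count, sketch_flush_cache() and sketch_offline_cpu() are hypothetical stand-ins for the preemption counter, tdx_cpu_flush_cache() and tdx_offline_cpu(), and the lockdep assertion is modelled as an error return instead of a splat.

```c
#include <assert.h>

/* Mock preemption depth; the kernel tracks this per CPU. */
int sketch_preempt_count;
int sketch_flush_calls;

void sketch_preempt_disable(void)
{
	sketch_preempt_count++;
}

void sketch_preempt_enable(void)
{
	assert(sketch_preempt_count > 0);
	sketch_preempt_count--;
}

/*
 * Stand-in for the flush helper: it refuses to run (returns -1)
 * when called with preemption enabled, which is where the real
 * helper's lockdep assertion would fire.
 */
int sketch_flush_cache(void)
{
	if (sketch_preempt_count == 0)
		return -1;
	sketch_flush_calls++;
	return 0;
}

/* Offline path: satisfy the helper's contract at the call site. */
int sketch_offline_cpu(void)
{
	int ret;

	sketch_preempt_disable();
	ret = sketch_flush_cache();
	sketch_preempt_enable();
	return ret;
}
```

Callers that already run with preemption disabled would keep calling the helper directly, so only the offline site changes.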
Not yet reproduced on a debug kernel; reporting on inspection.
Fixes: 597bdf6e068e ("x86/virt/tdx: Pull kexec cache flush logic
into arch/x86")
Cheers,
David
* [Invitation] bi-weekly guest_memfd upstream call on 2026-05-14
From: Ackerley Tng @ 2026-05-11 16:13 UTC (permalink / raw)
To: linux-coco, linux-mm, kvm; +Cc: david
Hi,
Our next guest_memfd upstream call is scheduled for Thursday, 2026-05-14
at 8:00 - 9:00am (GMT-07:00) Pacific Time - Vancouver.
We'll be using the following Google Meet:
http://meet.google.com/wxp-wtju-jzw
In this meeting, we'll have some quick check-ins about the progress of
guest_memfd direct map removal and related patch series.
Then, Frank would like to talk about pluggable guest_memfd backends!
The meeting notes can be found at [1], where we also link recordings and
collect current guest_memfd upstream proposals. If you want a Google
Calendar invitation that also covers all future meetings, just write
Ackerley or David a mail.
To put something to discuss onto the agenda, reply to this mail or add
it to the "Topics/questions for next meeting(s)" section in the
meeting notes as a comment.
[1] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?usp=sharing
Ackerley
* Re: [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf
From: Jason Gunthorpe @ 2026-05-11 12:01 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
zhenzhong.duan, tao1.su
In-Reply-To: <3128deea-95a3-4c36-902b-37f280913f2b@amd.com>
On Thu, May 07, 2026 at 05:16:56PM +1000, Alexey Kardashevskiy wrote:
> true but either way dmabuf slicing will be directed by QEMU's
> msix-table emulation MR and this slicing needs to match the TDISP
> report so I'll have to teach QEMU these reports, right? I am worried
> if I miss something obvious, again. Thanks,
I don't think so.. It just needs to slice it into the MSI page
blindly. When the VM goes to validate the TDISP report against the
mappings it will fail to accept the device if there is a mismatch.
The only thing qemu could do is fail sooner, but I don't know that is
worth the complexity as we do expect all devices to have their MSI
range unprotected.
Jason
* Re: [PATCH RFC v5 10/53] KVM: guest_memfd: Add basic support for KVM_SET_MEMORY_ATTRIBUTES2
From: Liam R. Howlett @ 2026-05-10 13:43 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgF9+Gr7UVEq-E2SQEb_XOQQMOXy9F_A2tA=DbNV_fJ0EQ@mail.gmail.com>
On 7 May 2026 12:56:11 GMT-04:00, Ackerley Tng <ackerleytng@google.com> wrote:
>"Liam R. Howlett" <liam@infradead.org> writes:
>
>> On 26/04/28 04:25PM, Ackerley Tng via B4 Relay wrote:
>>>
>>> [...snip...]
>>>
>>> +/*
>>> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
>>> + * by mas. Adjacent ranges with attributes identical to the new attributes
>>> + * will be merged. Also sets mas's bounds up for storing attributes.
>>> + *
>>> + * This maintains the invariant that ranges with the same attributes will
>>> + * always be merged.
>>> + */
>>> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
>>> + pgoff_t start, size_t nr_pages)
>>> +{
>>> + pgoff_t end = start + nr_pages;
>>> + pgoff_t last = end - 1;
>>> + void *entry;
>>> +
>>> + /* Try extending range. entry is NULL on overflow/wrap-around. */
>>> + mas_set_range(mas, end, end);
>>> + entry = mas_find(mas, end);
>
>Thank you for your reviews!
>
>>
>> Please read the documentation as I believe you have a bug here. What
>> happens if there is another range stored higher than end + 1?
>>
>
>The invariant in this maple tree is that contiguous ranges with the same
>attribute are stored as a single range.
>
>The goal of this first part is to get the entry at the index just after
>the requested range, and see what the attribute there is. If that
>attribute is what we're about to set, extend the requested range for
>storing to the end of that range.
>
>If there is another range higher than end + 1, with the invariant
>maintained, that attribute has to be different than the attribute stored
>at end. Hence, we only want to extend this requested range up till end.
>
mas_find() looks for an entry at the given address first and, if none is found, continues searching upwards. Since you limit the search to end, it will work as you want, and there isn't the bug I imagined in my sleep-deprived state.
Since you are searching for exactly one address (end), it might serve you better to walk there. Maybe walking is a better API for what you are doing here?
>> Do you have testing of these functions somewhere?
>>
>
>GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4) tests setting
>attributes in ranges. If test_page is 2,
>
>1. [0, 4) starts off shared (4 is the number of pages in the guest_memfd)
>2. [2, 3) is converted to private
> => so the ranges should now be [0, 2), [2, 3), [3, 4)
>3. [2, 3) is converted back to shared
> => so the ranges should now be [0, 4)
>
>I verified this by inserting some trace_printk()s and inspecting manually.
>
Thanks. I find the exclusive ranges a bit odd to think about in the maple tree context, but this test case makes sense. It is especially odd, at least for me, when looking at a single-index entry.
I generally keep a set of test cases and append any bug reproducers to that list so the bugs are unlikely to reoccur. My testing is certainly different from what you'll be doing, but this method has worked well, with code quality improving over time and limited (if any) regressions.
I actually insist that any fix comes with a test before I accept it. There are two reasons for this: 1. Avoiding the regression. 2. People really understand the bug if they can create a reproducer.
I hope this helps.
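As one possible shape for such a reproducer, here is a small userspace sketch of the merge invariant and the three-step scenario quoted above. It deliberately does not use the maple tree API: struct sketch_map is a hypothetical sorted array of inclusive ranges, and sketch_store() is an illustrative stand-in that overwrites a range and then merges neighbours carrying the same attributes.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_RANGES 16

/* Inclusive range [first, last] carrying one attribute value. */
struct sketch_range {
	unsigned long first, last, attrs;
};

/* Sorted, non-overlapping ranges. */
struct sketch_map {
	struct sketch_range r[SKETCH_MAX_RANGES];
	size_t n;
};

/* Store attrs over [first, last], then merge equal neighbours. */
int sketch_store(struct sketch_map *m, unsigned long first,
		 unsigned long last, unsigned long attrs)
{
	struct sketch_range out[SKETCH_MAX_RANGES + 2];
	struct sketch_range new_range = { first, last, attrs };
	size_t n = 0, k = 0, i;
	int placed = 0;

	for (i = 0; i < m->n; i++) {
		struct sketch_range cur = m->r[i];

		if (cur.first < first) {	/* keep the part left of the store */
			out[n] = cur;
			if (out[n].last >= first)
				out[n].last = first - 1;
			n++;
		}
		if (!placed && cur.last >= first) {
			out[n++] = new_range;
			placed = 1;
		}
		if (cur.last > last) {		/* keep the part right of the store */
			out[n] = cur;
			if (out[n].first <= last)
				out[n].first = last + 1;
			n++;
		}
	}
	if (!placed)
		out[n++] = new_range;

	/* Invariant: contiguous ranges with equal attrs become one range. */
	for (i = 0; i < n; i++) {
		if (k > 0 && out[k - 1].attrs == out[i].attrs &&
		    out[k - 1].last + 1 == out[i].first)
			out[k - 1].last = out[i].last;
		else
			out[k++] = out[i];
	}
	if (k > SKETCH_MAX_RANGES)
		return -1;
	for (i = 0; i < k; i++)
		m->r[i] = out[i];
	m->n = k;
	return 0;
}
```

With attrs 0 for shared and 1 for private, the scenario is: start from a single range covering indices 0 to 3, store 1 at index 2 and expect three ranges, then store 0 at index 2 and expect a single range again.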
>>> + if (entry && xa_to_value(entry) == attributes)
>>> + last = mas->last;
>>> +
>>> + if (start > 0) {
>>> + mas_set_range(mas, start - 1, start - 1);
>>> + entry = mas_find(mas, start - 1);
>>> + if (entry && xa_to_value(entry) == attributes)
>>> + start = mas->index;
>>> + }
>>> +
>>> + mas_set_range(mas, start, last);
>>> + return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
>>> +}
>>> +
>>>
>>> [...snip...]
>>>
* Re: [PATCH v6 01/43] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Ackerley Tng @ 2026-05-08 23:36 UTC (permalink / raw)
To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-1-91ab5a8b19a4@google.com>
Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:
>
> [...snip...]
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 69c9d6d546b28..5011d38820d0d 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
> #include <linux/falloc.h>
> #include <linux/fs.h>
> #include <linux/kvm_host.h>
> +#include <linux/maple_tree.h>
> #include <linux/mempolicy.h>
> #include <linux/pseudo_fs.h>
> #include <linux/pagemap.h>
> @@ -33,6 +34,13 @@ struct gmem_inode {
> struct list_head gmem_file_list;
>
> u64 flags;
> + /*
> + * Every index in this inode, whether memory is populated or
> + * not, is tracked in attributes. The entire range of indices,
> + * corresponding to the size of this inode, is represented in
> + * this maple tree.
Concretely, if the entire guest_memfd is 2M in size, indices [0, 511] are
represented with some value, either 0 (SHARED) or
KVM_MEMORY_ATTRIBUTE_PRIVATE. [512, ULONG_MAX] is also defined in the
tree, as NULL.
Since guest_memfd uses xa_mk_value(0) to store the value 0 ("SHARED"),
that makes 0 distinct from NULL, which works for guest_memfd.
(Liam and I discussed this off-list due to an email configuration issue)
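Why a stored 0 stays distinct from NULL can be shown with a userspace replica of the tagged-value encoding. This is a sketch, assuming the encoding shifts the integer left by one and sets the low bit, which is my understanding of what xa_mk_value() and xa_to_value() do; the sketch_* names are hypothetical and are not the kernel helpers themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Encode an integer as a tagged pointer: shift left, set bit 0. */
void *sketch_mk_value(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

/* Decode a tagged pointer back to the integer it carries. */
unsigned long sketch_to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}

/* A tagged value always has bit 0 set, so it can never be NULL. */
int sketch_is_value(const void *entry)
{
	return ((unsigned long)entry & 1) != 0;
}
```

Under this encoding the value 0 is stored as the pointer 1, so a lookup can tell "index present with attributes 0" apart from "index not present", which returns NULL.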
> + */
> + struct maple_tree attributes;
> };
>
>
> [...snip...]
>
* Re: [PATCH v2 0/2] x86/tdx: Port I/O emulation fixes
From: Dave Hansen @ 2026-05-08 22:53 UTC (permalink / raw)
To: Kiryl Shutsemau, Dave Hansen
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
Kai Huang, Borys Tsyrulnikov, linux-kernel, linux-coco, kvm,
stable
In-Reply-To: <af5orHTGMRfD5TxP@thinkstation>
On 5/8/26 15:52, Kiryl Shutsemau wrote:
> On Tue, Apr 28, 2026 at 01:56:30PM +0100, Kiryl Shutsemau (Meta) wrote:
>> Kiryl Shutsemau (Meta) (2):
>> x86/tdx: Fix off-by-one in port I/O handling
>> x86/tdx: Fix zero-extension for 32-bit port I/O
> Dave, could you get them applied?
I'll look on Monday. Thanks for the reminder.
* Re: [PATCH v2 0/2] x86/tdx: Port I/O emulation fixes
From: Kiryl Shutsemau @ 2026-05-08 22:52 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
H . Peter Anvin, Rick Edgecombe, Kuppuswamy Sathyanarayanan,
Kai Huang, Borys Tsyrulnikov, linux-kernel, linux-coco, kvm,
stable
In-Reply-To: <20260428125632.129770-1-kas@kernel.org>
On Tue, Apr 28, 2026 at 01:56:30PM +0100, Kiryl Shutsemau (Meta) wrote:
> Kiryl Shutsemau (Meta) (2):
> x86/tdx: Fix off-by-one in port I/O handling
> x86/tdx: Fix zero-extension for 32-bit port I/O
Dave, could you get them applied?
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v4 0/7] Add RMPOPT support.
From: Borislav Petkov @ 2026-05-08 21:07 UTC (permalink / raw)
To: Ashish Kalra
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1775874970.git.ashish.kalra@amd.com>
On Mon, Apr 13, 2026 at 07:42:03PM +0000, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> In the SEV-SNP architecture, hypervisor and non-SNP guests are subject
> to RMP checks on writes to provide integrity of SEV-SNP guest memory.
Sashiko has comments:
https://sashiko.dev/#/patchset/77153c889934972efcfc3d210251564f29abcf51.1775874970.git.ashish.kalra%40amd.com
Pls address them.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v3 0/6] KVM: SEV: Add support for IBPB-on-Entry and BTB Isolation
From: Borislav Petkov @ 2026-05-08 20:11 UTC (permalink / raw)
To: Kim Phillips
Cc: linux-kernel, kvm, linux-coco, x86, Sean Christopherson,
Paolo Bonzini, K Prateek Nayak, Nikunj A Dadhania, Tom Lendacky,
Michael Roth, Naveen Rao, David Kaplan, Pawan Gupta
In-Reply-To: <20260402202558.195005-1-kim.phillips@amd.com>
On Thu, Apr 02, 2026 at 03:25:52PM -0500, Kim Phillips wrote:
> IBPB-on-Entry and BTB Isolation are supplemental Spectre V2 mitigations
> available to SNP guests.
Sashiko has a bunch of comments, pls address them:
https://sashiko.dev/#/patchset/20260402202558.195005-1-kim.phillips%40amd.com
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v8 08/21] x86/virt/seamldr: Allocate and populate a module update request
From: Dave Hansen @ 2026-05-08 16:48 UTC (permalink / raw)
To: Chao Gao
Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <afyReleDP93DSgQa@intel.com>
On 5/7/26 06:19, Chao Gao wrote:
...
>>> + /*
>>> + * Don't care about user passing the wrong file, but protect
>>> + * kernel ABI by preventing accepting garbage.
>>> + */
>>> + if (memcmp(blob->signature, "TDX-BLOB", 8))
>>> + return ERR_PTR(-EINVAL);
>>
>> Is there really no helper in the kernel anywhere that can safely do the
>> 8-byte compare against two known-to-the-compiler 8-byte-wide fields
>> without hard-coding the 8?
>
> I couldn't find a helper that automatically derives the comparison
> length from the operands. 'strcmp()' is not suitable here because
> 'blob->signature' is not NUL-terminated.
>
> Do you mean just avoiding the hard-coded 8, e.g.
>
> if (memcmp(blob->signature, "TDX-BLOB", sizeof(blob->signature)))
> return ERR_PTR(-EINVAL);
>
> or define the 'u8 signature[8]' as a u64 and compare it with a constant, like
>
> /* Little-endian encoding of "TDX-BLOB" string */
> #define TDX_IMAGE_SIGNATURE 0x424f4c422d584454ULL
>
> if (blob->signature != TDX_IMAGE_SIGNATURE)
> return ERR_PTR(-EINVAL);
Either one of those is fine with me. I'd probably do the sizeof()
variant, but no strong preference.
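For comparison, both variants can be tried in a self-contained userspace sketch. struct sketch_blob and the sketch_* names are hypothetical; only the "TDX-BLOB" string and the 0x424f4c422d584454 constant come from the discussion above. The u64 variant assembles the value byte by byte so the check does not depend on host endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SIGNATURE     "TDX-BLOB"
/* Little-endian encoding of the "TDX-BLOB" string. */
#define SKETCH_SIGNATURE_LE  0x424f4c422d584454ULL

/* Hypothetical blob header; the field is not NUL-terminated. */
struct sketch_blob {
	uint8_t signature[8];
};

/* The string without its NUL must be exactly as wide as the field. */
_Static_assert(sizeof(((struct sketch_blob *)0)->signature) ==
	       sizeof(SKETCH_SIGNATURE) - 1,
	       "signature field and literal differ in size");

/* Variant 1: memcmp() with the length taken from the field itself. */
int sketch_sig_ok_memcmp(const struct sketch_blob *blob)
{
	return memcmp(blob->signature, SKETCH_SIGNATURE,
		      sizeof(blob->signature)) == 0;
}

/* Variant 2: read the field as a little-endian u64, compare once. */
int sketch_sig_ok_u64(const struct sketch_blob *blob)
{
	uint64_t v = 0;
	size_t i;

	for (i = 0; i < sizeof(blob->signature); i++)
		v |= (uint64_t)blob->signature[i] << (8 * i);
	return v == SKETCH_SIGNATURE_LE;
}
```

The sizeof() variant keeps the literal readable at the call site, while the u64 variant needs the comment to explain the constant; the static assertion guards either one against the field width drifting.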
>>> + struct seamldr_params *params;
>>> + int module_pg_cnt, sig_pg_cnt;
>>> + const u8 *sig, *module;
>>> + int i;
>>> +
>>> + params = (struct seamldr_params *)get_zeroed_page(GFP_KERNEL);
>>> + if (!params)
>>> + return ERR_PTR(-ENOMEM);
>>
>> kzalloc(PAGE_SIZE, GFP_KERNEL) will save you a cast.
>
> I noticed that 'kzalloc_obj()' can be used here, which avoids spelling out
> the size and GFP flags explicitly. So I ended up with:
>
> params = kzalloc_obj(*params);
That's fine too.
* Re: [PATCH v8 18/21] coco/tdx-host: Don't expose P-SEAMLDR features on CPUs with erratum
From: Chao Gao @ 2026-05-08 9:50 UTC (permalink / raw)
To: Dave Hansen
Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <abd48a30-8d51-4a86-b662-b09afb567dc5@intel.com>
On Thu, Apr 30, 2026 at 01:09:30PM -0700, Dave Hansen wrote:
>On 4/27/26 08:28, Chao Gao wrote:
>> Some TDX-capable CPUs have an erratum, as documented in Intel® Trust
>> Domain CPU Architectural Extensions (May 2021 edition) Chapter 2.3:
>>
>> SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
>> to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR using
>> SEAMCALL must reload the current-VMCS, if required, using the VMPTRLD
>> instruction.
>>
>> Clearing the current VMCS behind KVM's back will break KVM.
>>
>> This erratum is not present when IA32_VMX_BASIC[60] is set. Add a CPU
>> bug bit for this erratum and refuse to expose P-SEAMLDR features (e.g.,
>> TDX module updates) on affected CPUs.
>
>This seems totally random.
>
>Shouldn't this be way back when can_expose_seamldr() got defined in the
>first place?
I split this out because the erratum needs a longer changelog and some
discussion of alternatives. I also wanted the initial can_expose_seamldr()
patch to focus on introducing the gating mechanism, without bundling in
every detailed check from the start. The update do-while loop and the uABI
stuff are the core of this series, while this erratum check is not, so I
placed this patch later.
That said, I am perfectly fine with moving this patch to immediately follow
the patch that introduces can_expose_seamldr().
>> +#define X86_BUG_SEAMRET_INVD_VMCS X86_BUG( 1*32+11) /* "seamret_invd_vmcs" SEAMRET from P-SEAMLDR clears the current VMCS */
>
>I find myself wondering if this is worth a bug bit.
The bug bit was added in v5:
https://lore.kernel.org/all/d664ac9445b1c7cc864dead103086341c374b094.camel@intel.com/#t
Kai suggested this approach for two reasons:
1. It is consistent with how X86_BUG_TDX_PW_MCE is handled.
2. It gives userspace a clue as to why the module update feature is
unavailable.
That reasoning made sense to me, and I do not see a strong reason not to
use the "bug bit" infrastructure. If there is no objection to it, I will
add a short explanation to the changelog.
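As a rough illustration of the gating described in the quoted changelog (the erratum is absent when IA32_VMX_BASIC[60] is set), the check reduces to a single bit test on the MSR value. This is a userspace sketch only: the macro and function names below are invented for the example, and the real can_expose_seamldr() may well check more than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for IA32_VMX_BASIC bit 60. */
#define VMX_BASIC_SEAMRET_KEEPS_VMCS (1ULL << 60)

/* True if SEAMRET from the P-SEAMLDR clears the current VMCS. */
static bool seamret_clears_vmcs(uint64_t vmx_basic)
{
	return !(vmx_basic & VMX_BASIC_SEAMRET_KEEPS_VMCS);
}

/* Model of the gate: refuse P-SEAMLDR features on affected CPUs. */
static bool can_expose_seamldr_model(uint64_t vmx_basic)
{
	return !seamret_clears_vmcs(vmx_basic);
}
```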
* Re: [PATCH v8 17/21] x86/virt/seamldr: Abort updates on failure
From: Chao Gao @ 2026-05-08 9:16 UTC (permalink / raw)
To: Dave Hansen
Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <fc27ad0e-fceb-4eed-bb1c-dbfb5b913bf6@intel.com>
On Thu, Apr 30, 2026 at 01:06:38PM -0700, Dave Hansen wrote:
>I don't like how this is being done.
>
> 1. Introduce this do{}while() loop
> 2. Do 20 other patches
> 3. Introduce a thing that can make it change
> 4. Change the fundamental flow of the loop, to fix #3
>
>I'd much rather have:
>
> 1. Introduce this do{}while() loop
> 2. Tweak fundamental flow of the loop from the last patch when I can
> remember it. Allude to future failures.
> 3. Do 20 other patches
> 4. Introduce a thing that uses #2
OK, that makes sense. I'll reorder the series so this patch comes immediately
after the skeleton patch.
>
>
>> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
>> index c81b26c4bac1..9b8f571eb03f 100644
>> --- a/arch/x86/virt/vmx/tdx/seamldr.c
>> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
>> @@ -220,6 +220,7 @@ enum module_update_state {
>> static struct {
>> enum module_update_state state;
>> int thread_ack;
>> + bool failed;
>> /*
>> * Protect update_data. Raw spinlock as it will be acquired from
>> * interrupt-disabled contexts.
>> @@ -284,12 +285,15 @@ static int do_seamldr_install_module(void *seamldr_params)
>> break;
>> }
>>
>> - ack_state();
>> + if (ret)
>> + WRITE_ONCE(update_data.failed, true);
>> + else
>> + ack_state();
>> } else {
>> touch_nmi_watchdog();
>> rcu_momentary_eqs();
>> }
>
>I don't like how this is turning out either. I don't like all the nested
>conditions or ack_state() that hides its mucking with update data while
>its caller mucks with it directly. It's just all hacked together.
>
>Defer all of the acking, and *failed* acking to the ack_state() helper.
OK. I'll fold both normal and failed acking into ack_state().
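A minimal single-threaded model of what folding both cases into ack_state() could look like is sketched below. Everything in it is an assumption made for illustration: struct update_ctl stands in for the patch's update_data, the nr_cpus field is invented, and the locking and WRITE_ONCE() the real code needs are left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for update_data in the patch. */
struct update_ctl {
	int state;      /* current step of the update */
	int thread_ack; /* CPUs that still have to ack this step */
	int nr_cpus;    /* number of participating CPUs (assumed field) */
	bool failed;    /* set once any CPU reports an error */
};

/*
 * Single place that records the outcome of one step on one CPU.
 * A failure is recorded instead of acked, so the state never advances
 * past the failing step.
 */
static void ack_state(struct update_ctl *ctl, int ret)
{
	if (ret) {
		ctl->failed = true;
		return;
	}

	if (--ctl->thread_ack == 0) {
		/* Last CPU to ack: re-arm the counter and move on. */
		ctl->thread_ack = ctl->nr_cpus;
		ctl->state++;
	}
}
```

With this shape the caller only ever passes its return value to ack_state() and never touches the shared fields directly.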
>
>Also, I'm kinda peeved that you copied and pasted the
>touch_nmi_watchdog()/rcu_momentary_eqs() bits and none of the comments.
>This is a rather subtle use of both. If you want this to be a normal
>"spinning in stop machine" idiom, then create a helper and put the
>comments there.
Those two calls were added in stop_machine() to improve debuggability.
The issue they address is that a stop_machine() callback can hang on one
CPU. Without touch_nmi_watchdog() and rcu_momentary_eqs(), the other CPUs
that are merely spinning in the wait loop can also report hard lockup and
RCU stall warnings, which obscures the actual stuck CPU.
I agree that this behavior makes sense in stop_machine() as common
infrastructure. But this update path does not take an arbitrary callback
function, so that extra debuggability is not strictly necessary here. I'll
drop those calls from this path unless there is an objection.
>
>Also, this is a case where:
>
> do {
> cpu_relax();
> newstate = READ_ONCE(update_data.state);
>
> if (newstate == curstate) {
> // can cpu_relax() just go in here??
> touch_nmi_watchdog();
> rcu_momentary_eqs();
> continue;
> }
>
> switch() {
> // state changing here
> }
> } while (...);
>
>is a much more sane setup. You're not paying the if() indentation cost
>for the entire state transition block. You're also putting the "shut up
>the warnings" code out of the way where you can forget about it.
>
Agreed. Will do.
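The loop shape suggested above can be modelled in plain C as follows. This is a single-threaded sketch under stated assumptions: the enum values and poll_states() are invented, the "observed" array stands in for READ_ONCE(update_data.state), and cpu_relax() and the watchdog calls are omitted.

```c
/* Hypothetical states; the real enum module_update_state differs. */
enum step { STEP_IDLE, STEP_SHUTDOWN, STEP_INSTALL, STEP_DONE };

/*
 * Walk a sequence of observed states (n must be at least 1) and return
 * how many state transitions required work. The unchanged-state case is
 * handled first and skipped with "continue", so the switch below is not
 * nested inside an if() block.
 */
static int poll_states(const enum step *observed, int n)
{
	enum step cur = STEP_IDLE, next;
	int handled = 0, i = 0;

	do {
		next = observed[i++];

		if (next == cur)
			continue; /* nothing new, poll again */

		cur = next;
		switch (cur) {
		case STEP_SHUTDOWN:
		case STEP_INSTALL:
			handled++; /* per-state work would go here */
			break;
		default:
			break;
		}
	} while (cur != STEP_DONE && i < n);

	return handled;
}
```

Note that "continue" inside a do-while jumps to the loop condition, so the termination check still runs on every pass.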
* [POC PATCH 5/5] KVM: selftests: Test conversions for SNP
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/sev_smoke_test.c | 198 +++++++++++++++++-
1 file changed, 193 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 8b859adf4cf6f..8869cca748879 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -253,17 +253,205 @@ static void test_sev_smoke(void *guest, u32 type, u64 policy)
}
}
+#define GHCB_MSR_REG_GPA_REQ 0x012
+#define GHCB_MSR_REG_GPA_REQ_VAL(v) \
+ /* GHCBData[63:12] */ \
+ (((u64)((v) & GENMASK_ULL(51, 0)) << 12) | \
+ /* GHCBData[11:0] */ \
+ GHCB_MSR_REG_GPA_REQ)
+
+#define GHCB_MSR_REG_GPA_RESP 0x013
+#define GHCB_MSR_REG_GPA_RESP_VAL(v) \
+ /* GHCBData[63:12] */ \
+ (((u64)(v) & GENMASK_ULL(63, 12)) >> 12)
+
+#define GHCB_DATA_LOW 12
+#define GHCB_MSR_INFO_MASK (BIT_ULL(GHCB_DATA_LOW) - 1)
+#define GHCB_RESP_CODE(v) ((v) & GHCB_MSR_INFO_MASK)
+
+/*
+ * SNP Page State Change Operation
+ *
+ * GHCBData[55:52] - Page operation:
+ * 0x0001 Page assignment, Private
+ * 0x0002 Page assignment, Shared
+ */
+enum psc_op {
+ SNP_PAGE_STATE_PRIVATE = 1,
+ SNP_PAGE_STATE_SHARED,
+};
+
+#define GHCB_MSR_PSC_REQ 0x014
+#define GHCB_MSR_PSC_REQ_GFN(gfn, op) \
+ /* GHCBData[55:52] */ \
+ (((u64)((op) & 0xf) << 52) | \
+ /* GHCBData[51:12] */ \
+ ((u64)((gfn) & GENMASK_ULL(39, 0)) << 12) | \
+ /* GHCBData[11:0] */ \
+ GHCB_MSR_PSC_REQ)
+
+#define GHCB_MSR_PSC_RESP 0x015
+#define GHCB_MSR_PSC_RESP_VAL(val) \
+ /* GHCBData[63:32] */ \
+ (((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
+
+static u64 ghcb_gpa;
+static void snp_register_ghcb(void)
+{
+ u64 ghcb_pfn = ghcb_gpa >> PAGE_SHIFT;
+ u64 val;
+
+ GUEST_ASSERT(ghcb_gpa);
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_REG_GPA_REQ_VAL(ghcb_gpa >> PAGE_SHIFT));
+ vmgexit();
+
+ val = rdmsr(MSR_AMD64_SEV_ES_GHCB);
+ GUEST_ASSERT_EQ(GHCB_RESP_CODE(val), GHCB_MSR_REG_GPA_RESP);
+ GUEST_ASSERT_EQ(GHCB_MSR_REG_GPA_RESP_VAL(val), ghcb_pfn);
+}
+
+static void snp_page_state_change(u64 gpa, enum psc_op op)
+{
+ u64 val;
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_PSC_REQ_GFN(gpa >> PAGE_SHIFT, op));
+ vmgexit();
+
+ val = rdmsr(MSR_AMD64_SEV_ES_GHCB);
+ GUEST_ASSERT_EQ(GHCB_RESP_CODE(val), GHCB_MSR_PSC_RESP);
+ GUEST_ASSERT_EQ(GHCB_MSR_PSC_RESP_VAL(val), 0);
+}
+
+#define RMP_PG_SIZE_4K 0
+static inline void pvalidate(void *vaddr, bool validate)
+{
+ bool no_rmpupdate;
+ int rc;
+
+ /* "pvalidate" mnemonic support in binutils 2.36 and newer */
+ asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
+ : "=@ccc"(no_rmpupdate), "=a"(rc)
+ : "a"(vaddr), "c"(RMP_PG_SIZE_4K), "d"(validate)
+ : "memory", "cc");
+
+ GUEST_ASSERT(!no_rmpupdate);
+ GUEST_ASSERT_EQ(rc, 0);
+}
+
+#define CONVERSION_TEST_VALUE_SHARED_1 0xab
+#define CONVERSION_TEST_VALUE_SHARED_2 0xcd
+#define CONVERSION_TEST_VALUE_PRIVATE 0xef
+#define CONVERSION_TEST_VALUE_SHARED_3 0xbc
+#define CONVERSION_TEST_VALUE_SHARED_4 0xde
+static void guest_code_conversion(u8 *test_shared_gva, u8 *test_private_gva, u64 test_gpa)
+{
+ snp_register_ghcb();
+
+ GUEST_ASSERT_EQ(READ_ONCE(*test_shared_gva), CONVERSION_TEST_VALUE_SHARED_1);
+ WRITE_ONCE(*test_shared_gva, CONVERSION_TEST_VALUE_SHARED_2);
+
+ snp_page_state_change(test_gpa, SNP_PAGE_STATE_PRIVATE);
+ pvalidate(test_private_gva, true);
+
+ WRITE_ONCE(*test_private_gva, CONVERSION_TEST_VALUE_PRIVATE);
+ GUEST_ASSERT_EQ(READ_ONCE(*test_private_gva), CONVERSION_TEST_VALUE_PRIVATE);
+
+ pvalidate(test_private_gva, false);
+ snp_page_state_change(test_gpa, SNP_PAGE_STATE_SHARED);
+
+ GUEST_ASSERT_EQ(READ_ONCE(*test_shared_gva), CONVERSION_TEST_VALUE_SHARED_3);
+ WRITE_ONCE(*test_shared_gva, CONVERSION_TEST_VALUE_SHARED_4);
+
+ wrmsr(MSR_AMD64_SEV_ES_GHCB, GHCB_MSR_TERM_REQ);
+ vmgexit();
+}
+
+static void test_conversion(u64 policy)
+{
+ gva_t test_private_gva;
+ gva_t test_shared_gva;
+ struct kvm_vcpu *vcpu;
+ gva_t ghcb_gva;
+ gpa_t test_gpa;
+ struct kvm_vm *vm;
+ void *ghcb_hva;
+ void *test_hva;
+
+ vm = vm_sev_create_with_one_vcpu(KVM_X86_SNP_VM, guest_code_conversion, &vcpu);
+
+ ghcb_gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
+ MEM_REGION_TEST_DATA);
+ ghcb_hva = addr_gva2hva(vm, ghcb_gva);
+ ghcb_gpa = addr_gva2gpa(vm, ghcb_gva);
+ sync_global_to_guest(vm, ghcb_gpa);
+
+ test_shared_gva = vm_alloc_shared(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR,
+ MEM_REGION_TEST_DATA);
+ test_hva = addr_gva2hva(vm, test_shared_gva);
+ test_gpa = addr_gva2gpa(vm, test_shared_gva);
+
+ test_private_gva = vm_unused_gva_gap(vm, PAGE_SIZE, KVM_UTIL_MIN_VADDR);
+ ___virt_pg_map(vm, &vm->mmu, test_private_gva, test_gpa, PG_SIZE_4K, true);
+
+ vcpu_args_set(vcpu, 3, test_shared_gva, test_private_gva, test_gpa);
+
+ vm_sev_launch(vm, policy, NULL);
+
+ WRITE_ONCE(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_1);
+
+ fprintf(stderr, "ghcb_hva=%p ghcb_gpa=%lx ghcb_gva=%lx\n", ghcb_hva, ghcb_gpa, ghcb_gva);
+ fprintf(stderr, "test_hva=%p test_gpa=%lx test_private_gva=%lx test_shared_gva=%lx\n", test_hva, test_gpa, test_private_gva, test_shared_gva);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[0], test_gpa);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_ENCRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+
+ vm_mem_set_private(vm, test_gpa, PAGE_SIZE);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_HYPERCALL);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.nr, KVM_HC_MAP_GPA_RANGE);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[0], test_gpa);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[1], 1);
+ TEST_ASSERT_EQ(vcpu->run->hypercall.args[2], KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+
+ vm_mem_set_shared(vm, test_gpa, PAGE_SIZE);
+
+ fprintf(stderr, "test_hva contents = %x\n", READ_ONCE(*(u8 *)test_hva));
+
+ WRITE_ONCE(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_3);
+ TEST_ASSERT_EQ(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_3);
+
+ vcpu_run(vcpu);
+
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_SYSTEM_EVENT);
+ TEST_ASSERT_EQ(vcpu->run->system_event.type, KVM_SYSTEM_EVENT_SEV_TERM);
+ TEST_ASSERT_EQ(vcpu->run->system_event.ndata, 1);
+ TEST_ASSERT_EQ(vcpu->run->system_event.data[0], GHCB_MSR_TERM_REQ);
+
+ TEST_ASSERT_EQ(*(u8 *)test_hva, CONVERSION_TEST_VALUE_SHARED_4);
+}
+
int main(int argc, char *argv[])
{
TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SEV));
- test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+ // test_sev_smoke(guest_sev_code, KVM_X86_SEV_VM, 0);
+
+ // if (kvm_cpu_has(X86_FEATURE_SEV_ES))
+ // test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
- if (kvm_cpu_has(X86_FEATURE_SEV_ES))
- test_sev_smoke(guest_sev_es_code, KVM_X86_SEV_ES_VM, SEV_POLICY_ES);
+ if (kvm_cpu_has(X86_FEATURE_SEV_SNP)) {
+ test_conversion(snp_default_policy());
- if (kvm_cpu_has(X86_FEATURE_SEV_SNP))
- test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+ // test_sev_smoke(guest_snp_code, KVM_X86_SNP_VM, snp_default_policy());
+ }
return 0;
}
--
2.54.0.563.g4f69b47b94-goog
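The bit layout used by the page state change macros in the patch above can be checked with a small userspace sketch. The field positions (operation in GHCBData[55:52], GFN in GHCBData[51:12], request code in GHCBData[11:0], error code in GHCBData[63:32] of the response) are taken from the comments in the patch; the helper names are invented for the example and the masks are spelled out by hand instead of using GENMASK_ULL().

```c
#include <stdint.h>

#define GHCB_MSR_PSC_REQ  0x014ULL
#define GHCB_MSR_PSC_RESP 0x015ULL
#define GHCB_INFO_MASK    0xfffULL           /* GHCBData[11:0] */
#define PSC_GFN_MASK      ((1ULL << 40) - 1) /* 40-bit GFN field */

enum psc_op {
	SNP_PAGE_STATE_PRIVATE = 1,
	SNP_PAGE_STATE_SHARED,
};

/* Build the MSR value for a page state change request. */
static uint64_t psc_req(uint64_t gfn, enum psc_op op)
{
	return ((uint64_t)(op & 0xf) << 52) |
	       ((gfn & PSC_GFN_MASK) << 12) |
	       GHCB_MSR_PSC_REQ;
}

/* Extract the response code from GHCBData[11:0]. */
static uint64_t ghcb_resp_code(uint64_t val)
{
	return val & GHCB_INFO_MASK;
}

/* Extract the error code from GHCBData[63:32]; 0 means success. */
static uint64_t psc_resp_error(uint64_t val)
{
	return val >> 32;
}
```

For example, a request to make GFN 0x1234 private encodes as 0x0010000001234014.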
* [POC PATCH 4/5] KVM: selftests: Allow specifying CoCo-privateness while mapping a page
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/include/x86/processor.h | 2 ++
tools/testing/selftests/kvm/lib/x86/processor.c | 13 ++++++++++---
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index 77f576ee7789d..683f21452db58 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -1507,6 +1507,8 @@ enum pg_level {
void tdp_mmu_init(struct kvm_vm *vm, int pgtable_levels,
struct pte_masks *pte_masks);
+void ___virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level, bool private);
void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
gpa_t gpa, int level);
void virt_map_level(struct kvm_vm *vm, gva_t gva, gpa_t gpa,
diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
index b51467d70f6e7..02781194f51a2 100644
--- a/tools/testing/selftests/kvm/lib/x86/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86/processor.c
@@ -256,8 +256,8 @@ static u64 *virt_create_upper_pte(struct kvm_vm *vm,
return pte;
}
-void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
- gpa_t gpa, int level)
+void ___virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level, bool private)
{
const u64 pg_size = PG_LEVEL_SIZE(level);
u64 *pte = &mmu->pgd;
@@ -309,12 +309,19 @@ void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
* Neither SEV nor TDX supports shared page tables, so only the final
* leaf PTE needs manually set the C/S-bit.
*/
- if (vm_is_gpa_protected(vm, gpa))
+ if (private)
*pte |= PTE_C_BIT_MASK(mmu);
else
*pte |= PTE_S_BIT_MASK(mmu);
}
+void __virt_pg_map(struct kvm_vm *vm, struct kvm_mmu *mmu, gva_t gva,
+ gpa_t gpa, int level)
+{
+ ___virt_pg_map(vm, mmu, gva, gpa, level,
+ vm_is_gpa_protected(vm, gpa));
+}
+
void virt_arch_pg_map(struct kvm_vm *vm, gva_t gva, gpa_t gpa)
{
__virt_pg_map(vm, &vm->mmu, gva, gpa, PG_LEVEL_4K);
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 3/5] KVM: selftests: Make guest_code_xsave more friendly
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
The original implementation of guest_code_xsave makes a jmp to
guest_sev_es_code in inline assembly. When code that uses guest_sev_es_code
is removed, guest_sev_es_code will be optimized out, leading to a linking
error since guest_code_xsave still tries to jmp to guest_sev_es_code.
Rewrite guest_code_xsave() to instead make a call, in C, to
guest_sev_es_code(), so that usage of guest_sev_es_code() is made known to
the compiler.
This rewriting also gives a name to the xsave inline assembly, improving
readability.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/sev_smoke_test.c | 24 +++++++++++++------
1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/sev_smoke_test.c b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
index 1a49ee3915864..8b859adf4cf6f 100644
--- a/tools/testing/selftests/kvm/x86/sev_smoke_test.c
+++ b/tools/testing/selftests/kvm/x86/sev_smoke_test.c
@@ -80,13 +80,23 @@ static void guest_sev_code(void)
GUEST_DONE();
}
-/* Stash state passed via VMSA before any compiled code runs. */
-extern void guest_code_xsave(void);
-asm("guest_code_xsave:\n"
- "mov $" __stringify(XFEATURE_MASK_X87_AVX) ", %eax\n"
- "xor %edx, %edx\n"
- "xsave (%rdi)\n"
- "jmp guest_sev_es_code");
+static void xsave_all_registers(void *addr)
+{
+ __asm__ __volatile__(
+ "mov $" __stringify(XFEATURE_MASK_X87_AVX) ", %eax\n"
+ "xor %edx, %edx\n"
+ "xsave (%0)"
+ :
+ : "r"(addr)
+ : "eax", "edx", "memory"
+ );
+}
+
+static void guest_code_xsave(void *vmsa_gva)
+{
+ xsave_all_registers(vmsa_gva);
+ guest_sev_es_code();
+}
static void compare_xsave(u8 *from_host, u8 *from_guest)
{
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 2/5] KVM: selftests: Use guest_memfd memory contents in-place for SNP launch update
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
Update the SEV-SNP launch update flow to utilize guest_memfd in-place
conversion.
Include the KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE flag when setting memory
attributes to private. This is permitted before the SNP VM is finalized.
In snp_launch_update_data, pass 0 as the host virtual address. This
instructs the kernel to perform the launch update using the guest_memfd
backing the guest physical address rather than a userspace-provided
buffer.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/lib/x86/sev.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/lib/x86/sev.c b/tools/testing/selftests/kvm/lib/x86/sev.c
index 93f9169034617..074ab0eff1e27 100644
--- a/tools/testing/selftests/kvm/lib/x86/sev.c
+++ b/tools/testing/selftests/kvm/lib/x86/sev.c
@@ -37,8 +37,7 @@ static void encrypt_region(struct kvm_vm *vm, struct userspace_mem_region *regio
if (is_sev_snp_vm(vm))
snp_launch_update_data(vm, gpa_base + offset,
- (u64)addr_gpa2hva(vm, gpa_base + offset),
- size, page_type);
+ 0, size, page_type);
else
sev_launch_update_data(vm, gpa_base + offset, size);
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 1/5] KVM: selftests: Initialize guest_memfd with INIT_SHARED
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu, Sagi Shahar
In-Reply-To: <cover.1778185936.git.ackerleytng@google.com>
Initialize guest_memfd with INIT_SHARED for VM types that require
guest_memfd.
Memory in the first memslot is used by the selftest framework to load
code, page tables, interrupt descriptor tables, and basically everything
the selftest needs to run. The selftest framework sets all of these up
assuming that the memory in the memslot can be written to from the
host. Align with that behavior by initializing guest_memfd as shared so
that all the writes from the host are permitted.
guest_memfd memory can later be marked private if necessary by CoCo
platform-specific initialization functions.
Suggested-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d1befa3f4b305..a377e5f333116 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -483,8 +483,10 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
{
u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
nr_extra_pages);
+ enum vm_mem_backing_src_type src_type;
struct userspace_mem_region *slot0;
struct kvm_vm *vm;
+ u64 gmem_flags;
int i, flags;
kvm_set_files_rlimit(nr_runnable_vcpus);
@@ -502,7 +504,15 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
if (is_guest_memfd_required(shape))
flags |= KVM_MEM_GUEST_MEMFD;
- vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
+ gmem_flags = 0;
+ src_type = VM_MEM_SRC_ANONYMOUS;
+ if (is_guest_memfd_required(shape) && kvm_has_gmem_attributes) {
+ src_type = VM_MEM_SRC_SHMEM;
+ gmem_flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED;
+ }
+
+ vm_mem_add(vm, src_type, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
+
for (i = 0; i < NR_MEM_REGIONS; i++)
vm->memslots[i] = 0;
--
2.54.0.563.g4f69b47b94-goog
* [POC PATCH 0/5] guest_memfd in-place conversion selftests for SNP
From: Ackerley Tng @ 2026-05-07 20:34 UTC (permalink / raw)
To: devnull+ackerleytng.google.com
Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com>
With these POC patches, I was able to test the set memory
attributes/conversion ioctls with SNP.
After allowing src_addr to be NULL, SNP_LAUNCH_UPDATE can accept NULL
for source address and the SNP VM runs fine. :)
Ackerley Tng (5):
KVM: selftests: Initialize guest_memfd with INIT_SHARED
KVM: selftests: Use guest_memfd memory contents in-place for SNP
launch update
KVM: selftests: Make guest_code_xsave more friendly
KVM: selftests: Allow specifying CoCo-privateness while mapping a page
KVM: selftests: Test conversions for SNP
.../selftests/kvm/include/x86/processor.h | 2 +
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +-
.../testing/selftests/kvm/lib/x86/processor.c | 13 +-
tools/testing/selftests/kvm/lib/x86/sev.c | 3 +-
.../selftests/kvm/x86/sev_smoke_test.c | 222 +++++++++++++++++-
5 files changed, 234 insertions(+), 18 deletions(-)
--
2.54.0.563.g4f69b47b94-goog
* [PATCH v6 42/43] KVM: selftests: Add script to exercise private_mem_conversions_test
From: Ackerley Tng via B4 Relay @ 2026-05-07 20:23 UTC (permalink / raw)
To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com>
From: Ackerley Tng <ackerleytng@google.com>
Add a wrapper script to simplify running the private_mem_conversions_test
with a variety of configurations. Manually invoking the test for all
supported memory backing source types is tedious.
The script automatically detects the availability of 2MB and 1GB hugepages
and builds a list of source types to test. It then iterates through the
list, running the test for each type with both a single memslot and
multiple memslots.
This makes it easier to get comprehensive test coverage across different
memory configurations.
Add a small helper program in C, used by the script, that picks up
KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES from the header files and issues
the ioctl to query that KVM capability.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
tools/testing/selftests/kvm/Makefile.kvm | 4 +
.../selftests/kvm/kvm_has_gmem_attributes.c | 17 +++
.../kvm/x86/private_mem_conversions_test.sh | 128 +++++++++++++++++++++
3 files changed, 149 insertions(+)
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 6232881be500a..e5769268936a7 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -54,6 +54,7 @@ LIBKVM_loongarch += lib/loongarch/exception.S
# Non-compiled test targets
TEST_PROGS_x86 += x86/nx_huge_pages_test.sh
+TEST_PROGS_x86 += x86/private_mem_conversions_test.sh
# Compiled test targets valid on all architectures with libkvm support
TEST_GEN_PROGS_COMMON = demand_paging_test
@@ -67,6 +68,8 @@ TEST_GEN_PROGS_COMMON += set_memory_region_test
TEST_GEN_PROGS_COMMON += memslot_modification_stress_test
TEST_GEN_PROGS_COMMON += memslot_perf_test
+TEST_GEN_PROGS_EXTENDED_COMMON += kvm_has_gmem_attributes
+
# Compiled test targets
TEST_GEN_PROGS_x86 = $(TEST_GEN_PROGS_COMMON)
TEST_GEN_PROGS_x86 += x86/cpuid_test
@@ -245,6 +248,7 @@ SPLIT_TESTS += get-reg-list
TEST_PROGS += $(TEST_PROGS_$(ARCH))
TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(ARCH))
+TEST_GEN_PROGS_EXTENDED += $(TEST_GEN_PROGS_EXTENDED_COMMON)
TEST_GEN_PROGS_EXTENDED += $(TEST_GEN_PROGS_EXTENDED_$(ARCH))
LIBKVM += $(LIBKVM_$(ARCH))
diff --git a/tools/testing/selftests/kvm/kvm_has_gmem_attributes.c b/tools/testing/selftests/kvm/kvm_has_gmem_attributes.c
new file mode 100644
index 0000000000000..4f361349412fb
--- /dev/null
+++ b/tools/testing/selftests/kvm/kvm_has_gmem_attributes.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Utility to check if KVM supports guest_memfd attributes.
+ *
+ * Copyright (C) 2025, Google LLC.
+ */
+
+#include <stdio.h>
+
+#include "kvm_util.h"
+
+int main(void)
+{
+ printf("%u\n", kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) > 0);
+
+ return 0;
+}
diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
new file mode 100755
index 0000000000000..7179a4fcdd498
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
@@ -0,0 +1,128 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Wrapper script which runs different test setups of
+# private_mem_conversions_test.
+#
+# Copyright (C) 2025, Google LLC.
+
+NUM_VCPUS_TO_TEST=4
+NUM_MEMSLOTS_TO_TEST=$NUM_VCPUS_TO_TEST
+
+# Required pages are based on the test setup in the C code.
+REQUIRED_NUM_2M_HUGEPAGES=$((1024 * NUM_VCPUS_TO_TEST))
+REQUIRED_NUM_1G_HUGEPAGES=$((2 * NUM_VCPUS_TO_TEST))
+
+get_hugepage_count() {
+ local page_size_kb=$1
+ local path="/sys/kernel/mm/hugepages/hugepages-${page_size_kb}kB/nr_hugepages"
+ if [ -f "$path" ]; then
+ cat "$path"
+ else
+ echo 0
+ fi
+}
+
+get_default_hugepage_size_in_kb() {
+ local size=$(grep "Hugepagesize:" /proc/meminfo | awk '{print $2}')
+ echo "$size"
+}
+
+run_tests() {
+ local executable_path=$1
+ local src_type=$2
+ local num_memslots=$3
+ local num_vcpus=$4
+
+ echo "$executable_path -s $src_type -m $num_memslots -n $num_vcpus"
+ "$executable_path" -s "$src_type" -m "$num_memslots" -n "$num_vcpus"
+}
+
+script_dir=$(dirname "$(realpath "$0")")
+test_executable="${script_dir}/private_mem_conversions_test"
+kvm_has_gmem_attributes_tool="${script_dir}/../kvm_has_gmem_attributes"
+
+if [ ! -f "$test_executable" ]; then
+ echo "Error: Test executable not found at '$test_executable'" >&2
+ exit 1
+fi
+
+if [ ! -f "$kvm_has_gmem_attributes_tool" ]; then
+ echo "Error: kvm_has_gmem_attributes utility not found at '$kvm_has_gmem_attributes_tool'" >&2
+ exit 1
+fi
+
+kvm_has_gmem_attributes=$("$kvm_has_gmem_attributes_tool" | tail -n1)
+
+if [ "$kvm_has_gmem_attributes" -eq 1 ]; then
+ backing_src_types=("shmem")
+else
+ hugepage_2mb_count=$(get_hugepage_count 2048)
+ hugepage_2mb_enabled=$((hugepage_2mb_count >= REQUIRED_NUM_2M_HUGEPAGES))
+ hugepage_1gb_count=$(get_hugepage_count 1048576)
+ hugepage_1gb_enabled=$((hugepage_1gb_count >= REQUIRED_NUM_1G_HUGEPAGES))
+
+ default_hugepage_size_kb=$(get_default_hugepage_size_in_kb)
+ hugepage_default_enabled=0
+ if [ "$default_hugepage_size_kb" -eq 2048 ]; then
+ hugepage_default_enabled=$hugepage_2mb_enabled
+ elif [ "$default_hugepage_size_kb" -eq 1048576 ]; then
+ hugepage_default_enabled=$hugepage_1gb_enabled
+ fi
+
+ backing_src_types=("anonymous" "anonymous_thp")
+
+ if [ "$hugepage_default_enabled" -eq 1 ]; then
+ backing_src_types+=("anonymous_hugetlb")
+ else
+ echo "skipping anonymous_hugetlb backing source type"
+ fi
+
+ if [ "$hugepage_2mb_enabled" -eq 1 ]; then
+ backing_src_types+=("anonymous_hugetlb_2mb")
+ else
+ echo "skipping anonymous_hugetlb_2mb backing source type"
+ fi
+
+ if [ "$hugepage_1gb_enabled" -eq 1 ]; then
+ backing_src_types+=("anonymous_hugetlb_1gb")
+ else
+ echo "skipping anonymous_hugetlb_1gb backing source type"
+ fi
+
+ backing_src_types+=("shmem")
+
+ if [ "$hugepage_default_enabled" -eq 1 ]; then
+ backing_src_types+=("shared_hugetlb")
+ else
+ echo "skipping shared_hugetlb backing source type"
+ fi
+fi
+
+return_code=0
+for i in "${!backing_src_types[@]}"; do
+ src_type=${backing_src_types[$i]}
+ if [ "$i" -gt 0 ]; then
+ echo
+ fi
+
+ run_tests "$test_executable" "$src_type" 1 1 || return_code=$?
+ if [ "$return_code" -ne 0 ]; then
+ echo "Test failed for source type '$src_type'. Arguments: -s $src_type -m 1 -n 1" >&2
+ break
+ fi
+
+ run_tests "$test_executable" "$src_type" 1 "$NUM_VCPUS_TO_TEST" || return_code=$?
+ if [ "$return_code" -ne 0 ]; then
+ echo "Test failed for source type '$src_type'. Arguments: -s $src_type -m 1 -n $NUM_VCPUS_TO_TEST" >&2
+ break
+ fi
+
+ run_tests "$test_executable" "$src_type" "$NUM_MEMSLOTS_TO_TEST" "$NUM_VCPUS_TO_TEST" || return_code=$?
+ if [ "$return_code" -ne 0 ]; then
+ echo "Test failed for source type '$src_type'. Arguments: -s $src_type -m $NUM_MEMSLOTS_TO_TEST -n $NUM_VCPUS_TO_TEST" >&2
+ break
+ fi
+done
+
+exit "$return_code"
--
2.54.0.563.g4f69b47b94-goog
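For reference, a minimal sketch of how the shell sets `$?` after a negated command, which bears on how the loop above records a failing test's exit status for the final `exit`. The helper `fail_with_3` is a hypothetical stand-in for a failing test binary and is not part of the patch.

```shell
# Hypothetical stand-in for a failing test binary: always returns status 3.
fail_with_3() {
	return 3
}

# Pattern 1: "$?" is read after the negation, so it holds the status of
# "! fail_with_3" (which is 0), not the status of fail_with_3 itself.
negated_rc=99
if ! fail_with_3; then
	negated_rc=$?
fi

# Pattern 2: "||" runs its right-hand side only on failure, and "$?" there
# is still the status of the command that failed.
direct_rc=0
fail_with_3 || direct_rc=$?

echo "negated_rc=$negated_rc direct_rc=$direct_rc"
```

Run as written, this prints `negated_rc=0 direct_rc=3`: only the second pattern preserves the failing command's status.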
* [PATCH v6 43/43] KVM: selftests: Update private memory exits test to work with per-gmem attributes
From: Ackerley Tng via B4 Relay @ 2026-05-07 20:23 UTC (permalink / raw)
To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco, Ackerley Tng
In-Reply-To: <20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com>
From: Sean Christopherson <seanjc@google.com>
Skip setting memory to private in the private memory exits test when using
per-gmem memory attributes, as memory is initialized to private by default
for guest_memfd, and using vm_mem_set_private() on a guest_memfd instance
requires creating guest_memfd with GUEST_MEMFD_FLAG_MMAP (which is totally
doable, but would need to be conditional and is ultimately unnecessary).
Expect an emulated MMIO instead of a memory fault exit when attributes are
per-gmem, as deleting the memslot effectively drops the private status,
i.e. the GPA becomes shared and thus supports emulated MMIO.
Skip the "memslot not private" test entirely, as private vs. shared state
for x86 software-protected VMs comes from the memory attributes themselves,
and so when doing in-place conversions there can never be a disconnect
between the expected and actual states.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
.../selftests/kvm/x86/private_mem_kvm_exits_test.c | 36 ++++++++++++++++++----
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
index 10db9fe6d9063..70ed16066c63e 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
@@ -62,8 +62,9 @@ static void test_private_access_memslot_deleted(void)
virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
- /* Request to access page privately */
- vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+ /* Request to access page privately. */
+ if (!kvm_has_gmem_attributes)
+ vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
pthread_create(&vm_thread, NULL,
(void *(*)(void *))run_vcpu_get_exit_reason,
@@ -74,10 +75,26 @@ static void test_private_access_memslot_deleted(void)
pthread_join(vm_thread, &thread_return);
exit_reason = (u32)(u64)thread_return;
- TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
- TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
- TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
- TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+ /*
+ * If attributes are tracked per-gmem, deleting the memslot that points
+ * at the gmem instance effectively makes the memory shared, and so the
+ * read should trigger emulated MMIO.
+ *
+ * If attributes are tracked per-VM, deleting the memslot shouldn't
+ * affect the private attribute, and so KVM should generate a memory
+ * fault exit (emulated MMIO on private GPAs is disallowed).
+ */
+ if (kvm_has_gmem_attributes) {
+ TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MMIO);
+ TEST_ASSERT_EQ(vcpu->run->mmio.phys_addr, EXITS_TEST_GPA);
+ TEST_ASSERT_EQ(vcpu->run->mmio.len, sizeof(u64));
+ TEST_ASSERT_EQ(vcpu->run->mmio.is_write, false);
+ } else {
+ TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+ TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+ }
kvm_vm_free(vm);
}
@@ -88,6 +105,13 @@ static void test_private_access_memslot_not_private(void)
struct kvm_vcpu *vcpu;
u32 exit_reason;
+ /*
+ * Accessing non-private memory as private with a software-protected VM
+ * isn't possible when doing in-place conversions.
+ */
+ if (kvm_has_gmem_attributes)
+ return;
+
vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
guest_repeatedly_read);
--
2.54.0.563.g4f69b47b94-goog