* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 16:11 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Mostafa Saleh, iommu, linux-arm-kernel, linux-kernel, linux-coco,
Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5a8q9fs7ud.fsf@kernel.org>
On Tue, May 19, 2026 at 09:35:30PM +0530, Aneesh Kumar K.V wrote:
> Yes, that also resulted in simpler and cleaner code.
>
> swiotlb_tbl_map_single
> /*
> * If the physical address is encrypted but the device requires
> * decrypted DMA, use a decrypted io_tlb_mem and update the
> * attributes so the caller knows that a decrypted io_tlb_mem
> * was used.
> */
> if (!(*attrs & DMA_ATTR_CC_SHARED) && force_dma_unencrypted(dev))
> *attrs |= DMA_ATTR_CC_SHARED;
>
> if (mem->unencrypted != !!(*attrs & DMA_ATTR_CC_SHARED))
> return (phys_addr_t)DMA_MAPPING_ERROR;
Yeah, exactly that is so much clearer now that the mem->unecrypted is
tied directly.
That logic is reversed though, the incoming ATTR_CC doesn't matter for
swiotlb, that is just the source of the memcpy.
/* swiotlb pool is incorrect for this device */
if (mem->unencrypted != force_dma_unencrypted(dev))
return (phys_addr_t)DMA_MAPPING_ERROR;
/* Force attrs to match the kind of memory in the pool */
if (mem->unencrypted)
*attrs |= DMA_ATTR_CC_SHARED;
else
*attrs &= ~DMA_ATTR_CC_SHARED;
Attrs should be forced to whatever memory swiotlb selected.
Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 16:05 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mostafa Saleh, iommu, linux-arm-kernel, linux-kernel, linux-coco,
Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260519152741.GM7702@ziepe.ca>
Jason Gunthorpe <jgg@ziepe.ca> writes:
> On Tue, May 19, 2026 at 08:37:54PM +0530, Aneesh Kumar K.V wrote:
>
>> if we get force_dma_unencrypted(dev) correct, we won't need the above.
>>
>> for dma_direct_alloc and dma_direct_alloc_pages() we have
>>
>> if (force_dma_unencrypted(dev))
>> attrs |= DMA_ATTR_CC_SHARED;
>>
>>
>> for dma_direct_map_phys(), if we have swiotlb bouncing forced,
>>
>> swiotlb_tbl_map_single():
>>
>> if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
>> require_decrypted = true;
>
> IMHO I really do prefer the DMA_ATTR_CC_SHARED flows closer to the
> thing that did the decryption. While the above is possibly sound it is
> very obtuse to be guessing what kind of memory swiotlb decided to
> return..
>
> Can we pass a pointer to the attrs into the swiotlb stuff and it can
> update it based on the kind of memory it has allocated?
>
Yes, that also resulted in simpler and cleaner code.
swiotlb_tbl_map_single
/*
* If the physical address is encrypted but the device requires
* decrypted DMA, use a decrypted io_tlb_mem and update the
* attributes so the caller knows that a decrypted io_tlb_mem
* was used.
*/
if (!(*attrs & DMA_ATTR_CC_SHARED) && force_dma_unencrypted(dev))
*attrs |= DMA_ATTR_CC_SHARED;
if (mem->unencrypted != !!(*attrs & DMA_ATTR_CC_SHARED))
return (phys_addr_t)DMA_MAPPING_ERROR;
and
@@ -1640,19 +1654,14 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size);
- swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, attrs);
+ swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, &attrs);
if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
return DMA_MAPPING_ERROR;
- /*
- * Use the allocated io_tlb_mem encryption type to determine dma addr.
- */
- if (dev->dma_io_tlb_mem->unencrypted) {
+ if (attrs & DMA_ATTR_CC_SHARED)
dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
- attrs |= DMA_ATTR_CC_SHARED;
- } else {
+ else
dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
- }
if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs))) {
__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 15:27 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Mostafa Saleh, iommu, linux-arm-kernel, linux-kernel, linux-coco,
Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5abjebsaid.fsf@kernel.org>
On Tue, May 19, 2026 at 08:37:54PM +0530, Aneesh Kumar K.V wrote:
> if we get force_dma_unencrypted(dev) correct, we won't need the above.
>
> for dma_direct_alloc and dma_direct_alloc_pages() we have
>
> if (force_dma_unencrypted(dev))
> attrs |= DMA_ATTR_CC_SHARED;
>
>
> for dma_direct_map_phys(), if we have swiotlb bouncing forced,
>
> swiotlb_tbl_map_single():
>
> if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
> require_decrypted = true;
IMHO I really do prefer the DMA_ATTR_CC_SHARED flows closer to the
thing that did the decryption. While the above is possibly sound it is
very obtuse to be guessing what kind of memory swiotlb decided to
return..
Can we pass a pointer to the attrs into the swiotlb stuff and it can
update it based on the kind of memory it has allocated?
Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 15:07 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5amrxvshxg.fsf@kernel.org>
Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes:
> Mostafa Saleh <smostafa@google.com> writes:
>
>> On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
>>> >>
>>> >> What I meant was that we need a generic way to identify a pKVM guest, so
>>> >> that we can use it in the conditional above.
>>> >
>>> > I have this patch, with that I can boot with your series unmodified,
>>> > but I will need to do more testing.
>>> >
>>>
>>> Thanks, I can add this to the series once you complete the required testing.
>>>
>>
>> I am still running more tests, but looking more into it. Setting
>> force_dma_unencrypted() to true for pKVM guests is wrong, as the
>> guest shouldn’t try to decrypt arbitrary memory as it can include
>> sensitive information (for example in case of virtio sub-page
>> allocation) and should strictly rely on the restricted-dma-pool
>> for that.
>>
>> However, with my patch and setting force_dma_unencrypted() to false
>> on top of this series, it fails on pKVM due to a missing shared
>> attribute as Alexey mentioned, as now SWIOTLB rejects non shared
>> attrs, so, the DMA-API has to pass it. With that, I can boot again:
>>
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index 5103a04df99f..b19aeec03f27 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>> }
>>
>> if (is_swiotlb_for_alloc(dev)) {
>> + attrs |= DMA_ATTR_CC_SHARED;
>> +
>> page = dma_direct_alloc_swiotlb(dev, size, attrs);
>> if (page) {
>> /*
>> @@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>> &cpu_addr, gfp, attrs);
>>
>> if (is_swiotlb_for_alloc(dev)) {
>> + attrs |= DMA_ATTR_CC_SHARED;
>> +
>> page = dma_direct_alloc_swiotlb(dev, size, attrs);
>> if (!page)
>> return NULL;
>> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
>> index 4e35264ab6f8..8ee5bbf78cfb 100644
>> --- a/kernel/dma/direct.h
>> +++ b/kernel/dma/direct.h
>> @@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>> if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
>> return DMA_MAPPING_ERROR;
>>
>> + attrs |= DMA_ATTR_CC_SHARED;
>> return swiotlb_map(dev, phys, size, dir, attrs);
>> }
>>
>> --
>>
>
> How about the below?
>
> modified kernel/dma/direct.c
> @@ -278,6 +278,10 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> }
>
> if (is_swiotlb_for_alloc(dev)) {
> +
> + if (dev->dma_io_tlb_mem->unencrypted)
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (page) {
> /*
> @@ -451,6 +455,10 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> &cpu_addr, gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> +
> + if (dev->dma_io_tlb_mem->unencrypted)
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (!page)
> return NULL;
> modified kernel/dma/direct.h
> @@ -92,6 +92,9 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> return DMA_MAPPING_ERROR;
>
> + if (dev->dma_io_tlb_mem->unencrypted)
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> return swiotlb_map(dev, phys, size, dir, attrs);
> }
>
>
>
if we get force_dma_unencrypted(dev) correct, we won't need the above.
for dma_direct_alloc and dma_direct_alloc_pages() we have
if (force_dma_unencrypted(dev))
attrs |= DMA_ATTR_CC_SHARED;
for dma_direct_map_phys(), if we have swiotlb bouncing forced,
swiotlb_tbl_map_single():
if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
require_decrypted = true;
-aneesh
^ permalink raw reply
* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Sean Christopherson @ 2026-05-19 15:04 UTC (permalink / raw)
To: Kai Huang
Cc: pbonzini@redhat.com, kas@kernel.org, vkuznets@redhat.com,
dwmw2@infradead.org, paul@xen.org, Rick P Edgecombe,
x86@kernel.org, binbin.wu@linux.intel.com,
dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
yosry@kernel.org, kvm@vger.kernel.org, linux-coco@lists.linux.dev
In-Reply-To: <074e209fe4c78a735868099f13d76b9dde2c88e3.camel@intel.com>
On Tue, May 19, 2026, Kai Huang wrote:
> > @@ -12712,19 +11913,8 @@ static void store_regs(struct kvm_vcpu *vcpu)
> >
> > static int sync_regs(struct kvm_vcpu *vcpu)
> > {
> > - if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) {
> > - __set_regs(vcpu, &vcpu->run->s.regs.regs);
> > - vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS;
> > - }
> > -
> > - if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_SREGS) {
> > - struct kvm_sregs sregs = vcpu->run->s.regs.sregs;
> > -
> > - if (__set_sregs(vcpu, &sregs))
> > - return -EINVAL;
> > -
> > - vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_SREGS;
> > - }
> > + if (kvm_run_set_regs(vcpu))
> > + return -EINVAL;
>
> Nit:
>
> Do you think 'kvm_run_sync_regs()' is better than 'kvm_run_set_regs()'?
>
> Because I think "sync" reflects better that vcpu->run->kvm_dirty_regs is cleared
> after the "set" operation.
The problem I have with "sync" is that it doesn't communicate the direction of
the sync. What about kvm_run_sync_regs_{to,from}_user()?
> >
> > if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_EVENTS) {
> > struct kvm_vcpu_events events = vcpu->run->s.regs.events;
>
> Also, I wonder whether it's better to add a helper for events so sync_regs() and
> store_regs() can be simplified to:
>
> static int sync_regs(struct kvm_vcpu *vcpu)
> {
> if (kvm_run_sync_regs(vcpu))
> return -EINVAL;
> return kvm_run_sync_events(vcpu);
> }
>
> static void store_regs(struct kvm_vcpu *vcpu)
> {
> kvm_run_get_regs(vcpu);
> kvm_run_get_events(vcpu);
> }
>
> And maybe 'kvm_run_get_regs()' could be 'kvm_run_store_regs()' too , so that the
> store_regs() could be:
>
> static void store_regs(struct kvm_vcpu *vcpu)
> {
> kvm_run_store_regs(vcpu);
> kvm_run_store_events(vcpu);
> }
{store,sync}_regs() look pretty, but IMO the overall code is uglier because we
end up with super small helpers that have one caller, e.g.
static void kvm_run_sync_events_to_user(struct kvm_vcpu *vcpu)
{
if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_EVENTS)
kvm_vcpu_ioctl_x86_get_vcpu_events(vcpu, &vcpu->run->s.regs.events);
}
static void store_regs(struct kvm_vcpu *vcpu)
{
BUILD_BUG_ON(sizeof(struct kvm_sync_regs) > SYNC_REGS_SIZE_BYTES);
kvm_run_sync_regs_to_user(vcpu);
kvm_run_sync_events_to_user(vcpu);
}
For me, the extra "jump" is undesirable, but it allows burying __{g,s}et_{s,}regs()
in regs.c, and so is a net positive for registers. But for events, it's pure
overhead.
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index 185062a26924..fd55cd031b1c 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -414,6 +414,7 @@ int handle_ud(struct kvm_vcpu *vcpu);
> >
> > void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> > struct kvm_queued_exception *ex);
> > +void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu);
> >
> > int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
> > int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
> > @@ -604,6 +605,7 @@ static inline void kvm_machine_check(void)
> > int kvm_spec_ctrl_test_value(u64 value);
> > int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
> > struct x86_exception *e);
> > +void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid);
> >
>
> If I read correct, this is because "regs.c" calls kvm_invalidate_pcid() but you
> want to keep it in x86.c. But it seems the "x86.h" isn't included by "regs.c"
> directly but via other headers ("mmu.h" does include "x86.h").
>
> Should the "regs.c" include "x86.h" directly?
Oh, yeah, I just goofed that.
> Btw, I am a bit confused the relationship between "x86.h" and other headers like
> "mmu.h" and the new "regs.h". That is, headers like "mmu.h" include "x86.h",
> but headers like "regs.h" do not (instead, "x86.h" includes them).
Heh, don't look for a theme/plan, because there isn't one. Over the years, x86.h
and x86.c became dumping grounds for everything that didn't have an obvious home,
and so there aren't real "rules".
Hmm, though looking at all of this again, I think we're actually quite close to
having somewhat sane rules. Over the past few years, I've tried multiple times
to move what I felt should be KVM-internal structures from asm/kvm_host.h to x86.h,
and I've failed miserably every time because inevitably even the most innocuous
struct manages to have usage that leads to cyclical header dependencies and/or is
used by arch-neutral KVM code.
I think it's probably time to admit I've been looking at the asm/kvm_host.h vs.
x86.h split all wrong, i.e. finally give up on moving structures out of kvm_host.h,
and do the exact opposite: commit to using kvm_host.h to define and declare widely
used structures.
Because literally the only reason that x86.h doesn't include mmu.h is that mmu.h
references struct kvm_host, which is currently defined in x86.h. If we "fix"
that, then (a) we can make x86.h the "central" include everyone expects it to be,
and (b) it can be the start of a cleanup of asm/kvm_host.h and a big step towards
defining maintainable "rules" for what goes where. E.g. there are a pile of
functional declarations in asm/kvm_host.h that can live elsewhere; if we trim
those down, then the rules become:
- asm/kvm_host.h holds "common" structure definitions and associated key global
variables, and things that are referenced by arch-neutral KVM.
- <area>.{c,h} holds relevant declarations and definitions.
- x86.{c,h} is the kitchen sink for everything else.
E.g. this compiles for at least one config:
---
arch/x86/include/asm/kvm_host.h | 50 ++++++++++++++++++++++++----
arch/x86/kvm/mmu.h | 20 +++++++----
arch/x86/kvm/x86.h | 59 ++++-----------------------------
3 files changed, 64 insertions(+), 65 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5e24987b2a94..67ba8bf22469 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -313,6 +313,50 @@ enum x86_intercept_stage;
struct kvm_kernel_irqfd;
struct kvm_kernel_irq_routing_entry;
+struct kvm_caps {
+ /* control of guest tsc rate supported? */
+ bool has_tsc_control;
+ /* maximum supported tsc_khz for guests */
+ u32 max_guest_tsc_khz;
+ /* number of bits of the fractional part of the TSC scaling ratio */
+ u8 tsc_scaling_ratio_frac_bits;
+ /* maximum allowed value of TSC scaling ratio */
+ u64 max_tsc_scaling_ratio;
+ /* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
+ u64 default_tsc_scaling_ratio;
+ /* bus lock detection supported? */
+ bool has_bus_lock_exit;
+ /* notify VM exit supported? */
+ bool has_notify_vmexit;
+ /* bit mask of VM types */
+ u32 supported_vm_types;
+
+ u64 supported_mce_cap;
+ u64 supported_xcr0;
+ u64 supported_xss;
+ u64 supported_perf_cap;
+
+ u64 supported_quirks;
+ u64 inapplicable_quirks;
+};
+extern struct kvm_caps kvm_caps;
+
+struct kvm_host_values {
+ /*
+ * The host's raw MAXPHYADDR, i.e. the number of non-reserved physical
+ * address bits irrespective of features that repurpose legal bits,
+ * e.g. MKTME.
+ */
+ u8 maxphyaddr;
+
+ u64 efer;
+ u64 xcr0;
+ u64 xss;
+ u64 s_cet;
+ u64 arch_capabilities;
+};
+extern struct kvm_host_values kvm_host;
+
/*
* kvm_mmu_page_role tracks the properties of a shadow page (where shadow page
* also includes TDP pages) to determine whether or not a page can be used in
@@ -2056,10 +2100,6 @@ struct kvm_arch_async_pf {
u64 error_code;
};
-extern u32 __read_mostly kvm_nr_uret_msrs;
-extern bool __read_mostly allow_smaller_maxphyaddr;
-extern bool __read_mostly enable_apicv;
-extern bool __read_mostly enable_ipiv;
extern bool __read_mostly enable_device_posted_irqs;
extern struct kvm_x86_ops kvm_x86_ops;
@@ -2151,8 +2191,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
-extern bool tdp_enabled;
-
u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
/*
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e1bb663ebbd5..d841a4f486d1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -4,9 +4,14 @@
#include <linux/kvm_host.h>
#include "regs.h"
-#include "x86.h"
#include "cpuid.h"
+extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif
extern bool __read_mostly enable_mmio_caching;
#define PT_WRITABLE_SHIFT 1
@@ -261,14 +266,10 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
return smp_load_acquire(&kvm->arch.shadow_root_allocated);
}
-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
+bool kvm_mmu_is_mappable_memslot(const struct kvm_memory_slot *slot);
+
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
@@ -300,6 +301,11 @@ static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count)
atomic64_add(count, &kvm->stat.pages[level - 1]);
}
+static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
+}
+
static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu,
gpa_t gpa, u64 access,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index fd55cd031b1c..40c6f4c54f8e 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -6,53 +6,19 @@
#include <asm/fpu/xstate.h>
#include <asm/mce.h>
#include <asm/pvclock.h>
+#include "mmu.h"
#include "regs.h"
#include "kvm_emulate.h"
#include "cpuid.h"
#define KVM_MAX_MCE_BANKS 32
-struct kvm_caps {
- /* control of guest tsc rate supported? */
- bool has_tsc_control;
- /* maximum supported tsc_khz for guests */
- u32 max_guest_tsc_khz;
- /* number of bits of the fractional part of the TSC scaling ratio */
- u8 tsc_scaling_ratio_frac_bits;
- /* maximum allowed value of TSC scaling ratio */
- u64 max_tsc_scaling_ratio;
- /* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
- u64 default_tsc_scaling_ratio;
- /* bus lock detection supported? */
- bool has_bus_lock_exit;
- /* notify VM exit supported? */
- bool has_notify_vmexit;
- /* bit mask of VM types */
- u32 supported_vm_types;
-
- u64 supported_mce_cap;
- u64 supported_xcr0;
- u64 supported_xss;
- u64 supported_perf_cap;
-
- u64 supported_quirks;
- u64 inapplicable_quirks;
-};
-
-struct kvm_host_values {
- /*
- * The host's raw MAXPHYADDR, i.e. the number of non-reserved physical
- * address bits irrespective of features that repurpose legal bits,
- * e.g. MKTME.
- */
- u8 maxphyaddr;
-
- u64 efer;
- u64 xcr0;
- u64 xss;
- u64 s_cet;
- u64 arch_capabilities;
-};
+extern u32 __read_mostly kvm_nr_uret_msrs;
+extern bool __read_mostly allow_smaller_maxphyaddr;
+extern bool __read_mostly enable_apicv;
+extern bool __read_mostly enable_ipiv;
+extern bool enable_pmu;
+extern bool enable_mediated_pmu;
void kvm_spurious_fault(void);
@@ -252,11 +218,6 @@ static inline bool x86_exception_has_error_code(unsigned int vector)
return (1U << vector) & exception_has_error_code;
}
-static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
-{
- return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
-}
-
static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
{
return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48;
@@ -428,12 +389,6 @@ fastpath_t handle_fastpath_wrmsr_imm(struct kvm_vcpu *vcpu, u32 msr, int reg);
fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu);
fastpath_t handle_fastpath_invd(struct kvm_vcpu *vcpu);
-extern struct kvm_caps kvm_caps;
-extern struct kvm_host_values kvm_host;
-
-extern bool enable_pmu;
-extern bool enable_mediated_pmu;
-
void kvm_setup_xss_caps(void);
/*
base-commit: b99808a11a42edc2cecced7adf57c2ac231bdb68
--
^ permalink raw reply related
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 14:49 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agx3j6Oc8QivZ3RG@google.com>
On Tue, May 19, 2026 at 02:45:35PM +0000, Mostafa Saleh wrote:
> However, it should not alway use SWIOTLB? It can trigger decryption for
> any memory returned from __dma_direct_alloc_pages() which can come
> from alloc_pages_node().
The alloc coherent flow is seperate and different, these are not pages
'passed into the DMA API' but pages fully allocated internally and
owned by it.
Yes, it should cause decrypted *allocation*.
Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 14:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260519143529.GD7702@ziepe.ca>
On Tue, May 19, 2026 at 11:35:29AM -0300, Jason Gunthorpe wrote:
> On Tue, May 19, 2026 at 01:41:42PM +0000, Mostafa Saleh wrote:
> > On Tue, May 19, 2026 at 10:29:11AM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
> > > > On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> > > > > >>
> > > > > >> What I meant was that we need a generic way to identify a pKVM guest, so
> > > > > >> that we can use it in the conditional above.
> > > > > >
> > > > > > I have this patch, with that I can boot with your series unmodified,
> > > > > > but I will need to do more testing.
> > > > > >
> > > > >
> > > > > Thanks, I can add this to the series once you complete the required testing.
> > > > >
> > > >
> > > > I am still running more tests, but looking more into it. Setting
> > > > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> > > > guest shouldn’t try to decrypt arbitrary memory as it can include
> > > > sensitive information (for example in case of virtio sub-page
> > > > allocation) and should strictly rely on the restricted-dma-pool
> > > > for that.
> > >
> > > ??
> > >
> > > Where does force_dma_unencrypted() cause arbitary memory passed into
> > > the DMA API to be decrypted? That should never happen???
> >
> > Sorry, maybe arbitrary is not the right expression again :)
> > I mean that, with emulated devices that use the DMA-API under pKVM,
> > they will map memory coming from other layers (VFS, net) through
> > vitrio-block, virtio-net... These can be smaller than a page, and
> > using force_dma_unencrypted() will share the whole page.
>
> force_dma_unencrypted() should only trigger swiotlb and that never
> memcpy's more than necessary?
>
> Where does it do otherwise? That sounds like a bug?
Agh, I got confused and thought that it can be triggered from dma_map()
too. I need to figure out why that made pKVM guests boot with broken
restricted-dma-pool then.
However, it should not alway use SWIOTLB? It can trigger decryption for
any memory returned from __dma_direct_alloc_pages() which can come
from alloc_pages_node().
Thanks,
Mostafa
>
> Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 14:37 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxzanDBmIP54hUz@google.com>
On Tue, May 19, 2026 at 02:27:54PM +0000, Mostafa Saleh wrote:
> However, as I mentioned to Jason, I think with some tweaks to
> force_dma_unencrypted() we can make it work under pKVM for aligned
> memory which eliminates some of the bouncing.
> I am currently investigating that.
force_dma_unencrypted() literally means that memory passed into the
DMA API *without* DMA_ATTR_CC_SHARED cannot be DMA'd from.
It should not mean anything else. The DMA API should never decrypt
passed in memory. You always have to bounce.
Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 14:35 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxolksC_1nfO34X@google.com>
On Tue, May 19, 2026 at 01:41:42PM +0000, Mostafa Saleh wrote:
> On Tue, May 19, 2026 at 10:29:11AM -0300, Jason Gunthorpe wrote:
> > On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
> > > On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> > > > >>
> > > > >> What I meant was that we need a generic way to identify a pKVM guest, so
> > > > >> that we can use it in the conditional above.
> > > > >
> > > > > I have this patch, with that I can boot with your series unmodified,
> > > > > but I will need to do more testing.
> > > > >
> > > >
> > > > Thanks, I can add this to the series once you complete the required testing.
> > > >
> > >
> > > I am still running more tests, but looking more into it. Setting
> > > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> > > guest shouldn’t try to decrypt arbitrary memory as it can include
> > > sensitive information (for example in case of virtio sub-page
> > > allocation) and should strictly rely on the restricted-dma-pool
> > > for that.
> >
> > ??
> >
> > Where does force_dma_unencrypted() cause arbitary memory passed into
> > the DMA API to be decrypted? That should never happen???
>
> Sorry, maybe arbitrary is not the right expression again :)
> I mean that, with emulated devices that use the DMA-API under pKVM,
> they will map memory coming from other layers (VFS, net) through
> vitrio-block, virtio-net... These can be smaller than a page, and
> using force_dma_unencrypted() will share the whole page.
force_dma_unencrypted() should only trigger swiotlb and that never
memcpy's more than necessary?
Where does it do otherwise? That sounds like a bug?
Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 14:27 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Jason Gunthorpe, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5aecj7sctv.fsf@kernel.org>
On Tue, May 19, 2026 at 07:47:48PM +0530, Aneesh Kumar K.V wrote:
> Mostafa Saleh <smostafa@google.com> writes:
>
> > On Tue, May 19, 2026 at 07:30:16PM +0530, Aneesh Kumar K.V wrote:
> >> Mostafa Saleh <smostafa@google.com> writes:
> >>
> >> >> >
> >> >> > I am still running more tests, but looking more into it. Setting
> >> >> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> >> >> > guest shouldn’t try to decrypt arbitrary memory as it can include
> >> >> > sensitive information (for example in case of virtio sub-page
> >> >> > allocation) and should strictly rely on the restricted-dma-pool
> >> >> > for that.
> >> >>
> >> >> ??
> >> >>
> >> >> Where does force_dma_unencrypted() cause arbitary memory passed into
> >> >> the DMA API to be decrypted? That should never happen???
> >> >
> >> > Sorry, maybe arbitrary is not the right expression again :)
> >> > I mean that, with emulated devices that use the DMA-API under pKVM,
> >> > they will map memory coming from other layers (VFS, net) through
> >> > vitrio-block, virtio-net... These can be smaller than a page, and
> >> >
> >>
> >> Don't we PAGE_ALIGN these requests?
> >>
> >> dma_direct_alloc
> >> size = PAGE_ALIGN(size);
> >>
> >> iommu_dma_alloc_pages
> >> size_t alloc_size = PAGE_ALIGN(size);
> >>
> >>
> >
> > For allocation, yes, and that's fine because we bring memory from
> > the pool.
> > But not for mapping, as dma_direct_map_phys(), where the memory is
> > allocated from the driver or other parts in the kernel and the page
> > may be shared with other kernel components.
> >
>
> But if we are using restricted-dma-pool, we also have:
>
> mem->force_bounce = true;
> mem->for_alloc = true;
>
> So, will we use the swiotlb buffers for mapping and copy only the shared
> content into those swiotlb buffers?
True, that's why under pKVM, force_dma_unencrypted() should never
cause any memory to be decrypted and so we set it to false.
As in case of any bugs, the guest does not leak any information,
similar to what just happened initially here due to missing attrs.
However, as I mentioned to Jason, I think with some tweaks to
force_dma_unencrypted() we can make it work under pKVM for aligned
memory which eliminates some of the bouncing.
I am currently investigating that.
Thanks,
Mostafa
>
> -aneesh
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 14:17 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Jason Gunthorpe, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxt7SFGT7OLMIah@google.com>
Mostafa Saleh <smostafa@google.com> writes:
> On Tue, May 19, 2026 at 07:30:16PM +0530, Aneesh Kumar K.V wrote:
>> Mostafa Saleh <smostafa@google.com> writes:
>>
>> >> >
>> >> > I am still running more tests, but looking more into it. Setting
>> >> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
>> >> > guest shouldn’t try to decrypt arbitrary memory as it can include
>> >> > sensitive information (for example in case of virtio sub-page
>> >> > allocation) and should strictly rely on the restricted-dma-pool
>> >> > for that.
>> >>
>> >> ??
>> >>
>> >> Where does force_dma_unencrypted() cause arbitary memory passed into
>> >> the DMA API to be decrypted? That should never happen???
>> >
>> > Sorry, maybe arbitrary is not the right expression again :)
>> > I mean that, with emulated devices that use the DMA-API under pKVM,
>> > they will map memory coming from other layers (VFS, net) through
>> > vitrio-block, virtio-net... These can be smaller than a page, and
>> >
>>
>> Don't we PAGE_ALIGN these requests?
>>
>> dma_direct_alloc
>> size = PAGE_ALIGN(size);
>>
>> iommu_dma_alloc_pages
>> size_t alloc_size = PAGE_ALIGN(size);
>>
>>
>
> For allocation, yes, and that's fine because we bring memory from
> the pool.
> But not for mapping, as dma_direct_map_phys(), where the memory is
> allocated from the driver or other parts in the kernel and the page
> may be shared with other kernel components.
>
But if we are using restricted-dma-pool, we also have:
mem->force_bounce = true;
mem->for_alloc = true;
So, will we use the swiotlb buffers for mapping and copy only the shared
content into those swiotlb buffers?
-aneesh
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 14:04 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Jason Gunthorpe, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5ah5o3sdn3.fsf@kernel.org>
On Tue, May 19, 2026 at 07:30:16PM +0530, Aneesh Kumar K.V wrote:
> Mostafa Saleh <smostafa@google.com> writes:
>
> >> >
> >> > I am still running more tests, but looking more into it. Setting
> >> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> >> > guest shouldn’t try to decrypt arbitrary memory as it can include
> >> > sensitive information (for example in case of virtio sub-page
> >> > allocation) and should strictly rely on the restricted-dma-pool
> >> > for that.
> >>
> >> ??
> >>
> >> Where does force_dma_unencrypted() cause arbitary memory passed into
> >> the DMA API to be decrypted? That should never happen???
> >
> > Sorry, maybe arbitrary is not the right expression again :)
> > I mean that, with emulated devices that use the DMA-API under pKVM,
> > they will map memory coming from other layers (VFS, net) through
> > vitrio-block, virtio-net... These can be smaller than a page, and
> >
>
> Don't we PAGE_ALIGN these requests?
>
> dma_direct_alloc
> size = PAGE_ALIGN(size);
>
> iommu_dma_alloc_pages
> size_t alloc_size = PAGE_ALIGN(size);
>
>
For allocation, yes, and that's fine because we bring memory from
the pool.
But not for mapping, as dma_direct_map_phys(), where the memory is
allocated from the driver or other parts in the kernel and the page
may be shared with other kernel components.
Thanks,
Mostafa
>
> > using force_dma_unencrypted() will share the whole page.
> > And as discussed, that leaks sensitive information to the untrusted
> > host.
> > I am currently investigating passing iova/phys/size
> > to force_dma_unencrypted() and then we can share pages inplace only
> > if possible without leaking extra information.
> > I am trying to get some performance results first. But the tricky part
> > is to get the semantics right, I believe in that case those devices
> > shouldn’t use restricted-dma-pools as those should always force
> > bouncing. Instead bouncing happens through the default SWIOTLB pool,
> > if not possible to decrypt in place.
> >
>
> -aneesh
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 14:00 UTC (permalink / raw)
To: Mostafa Saleh, Jason Gunthorpe
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Petr Tesarik,
Alexey Kardashevskiy, Dan Williams, Xu Yilun, linuxppc-dev,
linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxolksC_1nfO34X@google.com>
Mostafa Saleh <smostafa@google.com> writes:
> On Tue, May 19, 2026 at 10:29:11AM -0300, Jason Gunthorpe wrote:
>> On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
>> > On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
>> > > >>
>> > > >> What I meant was that we need a generic way to identify a pKVM guest, so
>> > > >> that we can use it in the conditional above.
>> > > >
>> > > > I have this patch, with that I can boot with your series unmodified,
>> > > > but I will need to do more testing.
>> > > >
>> > >
>> > > Thanks, I can add this to the series once you complete the required testing.
>> > >
>> >
>> > I am still running more tests, but looking more into it. Setting
>> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
>> > guest shouldn’t try to decrypt arbitrary memory as it can include
>> > sensitive information (for example in case of virtio sub-page
>> > allocation) and should strictly rely on the restricted-dma-pool
>> > for that.
>>
>> ??
>>
>> Where does force_dma_unencrypted() cause arbitary memory passed into
>> the DMA API to be decrypted? That should never happen???
>
> Sorry, maybe arbitrary is not the right expression again :)
> I mean that, with emulated devices that use the DMA-API under pKVM,
> they will map memory coming from other layers (VFS, net) through
> vitrio-block, virtio-net... These can be smaller than a page, and
>
Don't we PAGE_ALIGN these requests?
dma_direct_alloc
size = PAGE_ALIGN(size);
iommu_dma_alloc_pages
size_t alloc_size = PAGE_ALIGN(size);
> using force_dma_unencrypted() will share the whole page.
> And as discussed, that leaks sensitive information to the untrusted
> host.
> I am currently investigating passing iova/phys/size
> to force_dma_unencrypted() and then we can share pages inplace only
> if possible without leaking extra information.
> I am trying to get some performance results first. But the tricky part
> is to get the semantics right, I believe in that case those devices
> shouldn’t use restricted-dma-pools as those should always force
> bouncing. Instead bouncing happens through the default SWIOTLB pool,
> if not possible to decrypt in place.
>
-aneesh
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 13:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260519132911.GA7702@ziepe.ca>
On Tue, May 19, 2026 at 10:29:11AM -0300, Jason Gunthorpe wrote:
> On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
> > On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> > > >>
> > > >> What I meant was that we need a generic way to identify a pKVM guest, so
> > > >> that we can use it in the conditional above.
> > > >
> > > > I have this patch, with that I can boot with your series unmodified,
> > > > but I will need to do more testing.
> > > >
> > >
> > > Thanks, I can add this to the series once you complete the required testing.
> > >
> >
> > I am still running more tests, but looking more into it. Setting
> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> > guest shouldn’t try to decrypt arbitrary memory as it can include
> > sensitive information (for example in case of virtio sub-page
> > allocation) and should strictly rely on the restricted-dma-pool
> > for that.
>
> ??
>
> Where does force_dma_unencrypted() cause arbitary memory passed into
> the DMA API to be decrypted? That should never happen???
Sorry, maybe arbitrary is not the right expression again :)
I mean that, with emulated devices that use the DMA-API under pKVM,
they will map memory coming from other layers (VFS, net) through
vitrio-block, virtio-net... These can be smaller than a page, and
using force_dma_unencrypted() will share the whole page.
And as discussed, that leaks sensitive information to the untrusted
host.
I am currently investigating passing iova/phys/size
to force_dma_unencrypted() and then we can share pages inplace only
if possible without leaking extra information.
I am trying to get some performance results first. But the tricky part
is to get the semantics right, I believe in that case those devices
shouldn’t use restricted-dma-pools as those should always force
bouncing. Instead bouncing happens through the default SWIOTLB pool,
if not possible to decrypt in place.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 13:39 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxETC1rXBUSkWYg@google.com>
On Tue, May 19, 2026 at 11:06:52AM +0000, Mostafa Saleh wrote:
> > > One other interesting case for device-passthrough is non-coherent
> > > devices which then require private pools for bouncing.
> >
> > Why does shared/private matter for bouncing? Why do you need to bounce
> > at all? Do cmo's not work in pkvm guests?
>
> At the moment, in iommu_dma_map_phys(), if a non coherent device
> tries to map an unaligned address or size it will be bounced.
Sure, that's fine.
> In pKVM, dma-iommu is used for assigned devices which operate on
> private memory, so bouncing that through the SWIOTLB would leak
> information from the guest as the SWIOTLB is decrypted.
Yes, a device that can do private access should not be using a shared
SWIOTLB, that should be part of the selection logic inside the SWIOTLB
stuff..
Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 13:29 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxDxdxynp4KEovA@google.com>
On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
> On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> > >>
> > >> What I meant was that we need a generic way to identify a pKVM guest, so
> > >> that we can use it in the conditional above.
> > >
> > > I have this patch, with that I can boot with your series unmodified,
> > > but I will need to do more testing.
> > >
> >
> > Thanks, I can add this to the series once you complete the required testing.
> >
>
> I am still running more tests, but looking more into it. Setting
> force_dma_unencrypted() to true for pKVM guests is wrong, as the
> guest shouldn’t try to decrypt arbitrary memory as it can include
> sensitive information (for example in case of virtio sub-page
> allocation) and should strictly rely on the restricted-dma-pool
> for that.
??
Where does force_dma_unencrypted() cause arbitary memory passed into
the DMA API to be decrypted? That should never happen???
Jason
^ permalink raw reply
* Re: [PATCH v14 27/44] arm64: RMI: Set RIPAS of initial memslots
From: Suzuki K Poulose @ 2026-05-19 13:06 UTC (permalink / raw)
To: Aneesh Kumar K.V, Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
Lorenzo.Pieralisi2
In-Reply-To: <yq5ajyszsgmf.fsf@kernel.org>
On 19/05/2026 13:55, Aneesh Kumar K.V wrote:
> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
>
>> On 19/05/2026 11:02, Aneesh Kumar K.V wrote:
>>> Steven Price <steven.price@arm.com> writes:
>>>
>>>> The memory which the realm guest accesses must be set to RIPAS_RAM.
>>>> Iterate over the memslots and set all gmem memslots to RIPAS_RAM.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>
>>> ...
>>>
>>>> +static int set_ripas_of_protected_regions(struct kvm *kvm)
>>>> +{
>>>> + struct kvm_memslots *slots;
>>>> + struct kvm_memory_slot *memslot;
>>>> + int idx, bkt;
>>>> + int ret = 0;
>>>> +
>>>> + idx = srcu_read_lock(&kvm->srcu);
>>>> +
>>>> + slots = kvm_memslots(kvm);
>>>> + kvm_for_each_memslot(memslot, bkt, slots) {
>>>> + if (!kvm_slot_has_gmem(memslot))
>>>> + continue;
>>>> +
>>>> + ret = realm_init_ipa_state(kvm, memslot->base_gfn,
>>>> + memslot->npages);
>>>> + if (ret)
>>>> + break;
>>>> + }
>>>> + srcu_read_unlock(&kvm->srcu, idx);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> int kvm_arm_rmi_populate(struct kvm *kvm,
>>>> struct kvm_arm_rmi_populate *args)
>>>> {
>>>> @@ -890,6 +922,10 @@ int kvm_activate_realm(struct kvm *kvm)
>>>> return ret;
>>>> }
>>>>
>>>> + ret = set_ripas_of_protected_regions(kvm);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> ret = rmi_realm_activate(virt_to_phys(realm->rd));
>>>> if (ret)
>>>> return -ENXIO;
>>>
>>> relam guest already does.
>>> for_each_mem_range(i, &start, &end) {
>>> if (rsi_set_memory_range_protected_safe(start, end)) {
>>> panic("Failed to set memory range to protected: %pa-%pa",
>>> &start, &end);
>>> }
>>> }
>>>
>>> if so why is host required to do this ?
>>
>> Ideally this should be a call from the VMM (i.e., user). Irrespective of
>> what the guest does (which the host has no knowledge about), the VMM/
>> user is better aware of what to do for a given guest. We have done this
>> implicitly in the KVM as a start, to keep the initial implementation
>> simple. This could be moved out to the VMM as UABI, if there is
>> sufficient demand for it.
>>
>> TL,DR: This should be a host/deployer decision, not the Guest. There
>> may other guest OS, which do not do RIPAS_RAM early enough.
>>
>
> Are we suggesting that when the guest is running out of DRAM initialized
> via rmi_rtt_data_map_init(), it may need to access memory outside that
> range before it gets a chance to set the RIPAS as RAM?
It may. This was one of the review comments we got when we published
the Linux Guest patches. In fact, this is in the Linux booting
requirements. See :
Documentation/arch/arm64/booting.rst: Section 1
>
> Does that mean the guest now has to trust the host for that?
No, this has been the case. We added the code in Linux to convert memory
as a back stop. The worse could happens is Guest crashing, without it
having any secrets receving from the Remote entity.
> rmi_rtt_init_ripas() is not added to the measurement details, right?
It is not (at least for now). It doesn't matter for security much.
Suzuki
>
> -aneesh
^ permalink raw reply
* Re: [PATCH v14 27/44] arm64: RMI: Set RIPAS of initial memslots
From: Aneesh Kumar K.V @ 2026-05-19 12:55 UTC (permalink / raw)
To: Suzuki K Poulose, Steven Price, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
Lorenzo.Pieralisi2
In-Reply-To: <6681f10b-0966-42e2-ae04-4e1aef47ec4d@arm.com>
Suzuki K Poulose <suzuki.poulose@arm.com> writes:
> On 19/05/2026 11:02, Aneesh Kumar K.V wrote:
>> Steven Price <steven.price@arm.com> writes:
>>
>>> The memory which the realm guest accesses must be set to RIPAS_RAM.
>>> Iterate over the memslots and set all gmem memslots to RIPAS_RAM.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>
>> ...
>>
>>> +static int set_ripas_of_protected_regions(struct kvm *kvm)
>>> +{
>>> + struct kvm_memslots *slots;
>>> + struct kvm_memory_slot *memslot;
>>> + int idx, bkt;
>>> + int ret = 0;
>>> +
>>> + idx = srcu_read_lock(&kvm->srcu);
>>> +
>>> + slots = kvm_memslots(kvm);
>>> + kvm_for_each_memslot(memslot, bkt, slots) {
>>> + if (!kvm_slot_has_gmem(memslot))
>>> + continue;
>>> +
>>> + ret = realm_init_ipa_state(kvm, memslot->base_gfn,
>>> + memslot->npages);
>>> + if (ret)
>>> + break;
>>> + }
>>> + srcu_read_unlock(&kvm->srcu, idx);
>>> +
>>> + return ret;
>>> +}
>>> +
>>> int kvm_arm_rmi_populate(struct kvm *kvm,
>>> struct kvm_arm_rmi_populate *args)
>>> {
>>> @@ -890,6 +922,10 @@ int kvm_activate_realm(struct kvm *kvm)
>>> return ret;
>>> }
>>>
>>> + ret = set_ripas_of_protected_regions(kvm);
>>> + if (ret)
>>> + return ret;
>>> +
>>> ret = rmi_realm_activate(virt_to_phys(realm->rd));
>>> if (ret)
>>> return -ENXIO;
>>
>> relam guest already does.
>> for_each_mem_range(i, &start, &end) {
>> if (rsi_set_memory_range_protected_safe(start, end)) {
>> panic("Failed to set memory range to protected: %pa-%pa",
>> &start, &end);
>> }
>> }
>>
>> if so why is host required to do this ?
>
> Ideally this should be a call from the VMM (i.e., user). Irrespective of
> what the guest does (which the host has no knowledge about), the VMM/
> user is better aware of what to do for a given guest. We have done this
> implicitly in the KVM as a start, to keep the initial implementation
> simple. This could be moved out to the VMM as UABI, if there is
> sufficient demand for it.
>
> TL,DR: This should be a host/deployer decision, not the Guest. There
> may other guest OS, which do not do RIPAS_RAM early enough.
>
Are we suggesting that when the guest is running out of DRAM initialized
via rmi_rtt_data_map_init(), it may need to access memory outside that
range before it gets a chance to set the RIPAS as RAM?
Does that mean the guest now has to trust the host for that?
rmi_rtt_init_ripas() is not added to the measurement details, right?
-aneesh
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 12:27 UTC (permalink / raw)
To: Mostafa Saleh
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxDxdxynp4KEovA@google.com>
Mostafa Saleh <smostafa@google.com> writes:
> On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
>> >>
>> >> What I meant was that we need a generic way to identify a pKVM guest, so
>> >> that we can use it in the conditional above.
>> >
>> > I have this patch, with that I can boot with your series unmodified,
>> > but I will need to do more testing.
>> >
>>
>> Thanks, I can add this to the series once you complete the required testing.
>>
>
> I am still running more tests, but looking more into it. Setting
> force_dma_unencrypted() to true for pKVM guests is wrong, as the
> guest shouldn’t try to decrypt arbitrary memory as it can include
> sensitive information (for example in case of virtio sub-page
> allocation) and should strictly rely on the restricted-dma-pool
> for that.
>
> However, with my patch and setting force_dma_unencrypted() to false
> on top of this series, it fails on pKVM due to a missing shared
> attribute as Alexey mentioned, as now SWIOTLB rejects non shared
> attrs, so, the DMA-API has to pass it. With that, I can boot again:
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 5103a04df99f..b19aeec03f27 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> }
>
> if (is_swiotlb_for_alloc(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (page) {
> /*
> @@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> &cpu_addr, gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (!page)
> return NULL;
> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
> index 4e35264ab6f8..8ee5bbf78cfb 100644
> --- a/kernel/dma/direct.h
> +++ b/kernel/dma/direct.h
> @@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> return DMA_MAPPING_ERROR;
>
> + attrs |= DMA_ATTR_CC_SHARED;
> return swiotlb_map(dev, phys, size, dir, attrs);
> }
>
> --
>
How about the below?
modified kernel/dma/direct.c
@@ -278,6 +278,10 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
if (is_swiotlb_for_alloc(dev)) {
+
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (page) {
/*
@@ -451,6 +455,10 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
&cpu_addr, gfp, attrs);
if (is_swiotlb_for_alloc(dev)) {
+
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (!page)
return NULL;
modified kernel/dma/direct.h
@@ -92,6 +92,9 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
return DMA_MAPPING_ERROR;
+ if (dev->dma_io_tlb_mem->unencrypted)
+ attrs |= DMA_ATTR_CC_SHARED;
+
return swiotlb_map(dev, phys, size, dir, attrs);
}
>
>
> I will keep testing and let you know how it goes. If there is nothing
> else required to convert pKVM guests to CC, I can just post the patch
> separately as it has no dependency on this series.
>
That would be useful. I can then carry the patch as a dependent change,
which can also be merged separately
>
> Re force_dma_unencrypted(), I am looking into a safe way to use it
> for pKVM as I beleive it will be useful to eliminate some bouncing.
> However, that’s not critical for this series and can be added later
> as I am still investigating it, if I reach something I can post it
> along the pKVM patch above.
>
> Thanks,
> Mostafa
>
>>
>>
>> -aneesh
^ permalink raw reply
* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Huang, Kai @ 2026-05-19 12:16 UTC (permalink / raw)
To: pbonzini@redhat.com, kas@kernel.org, seanjc@google.com,
vkuznets@redhat.com, dwmw2@infradead.org, paul@xen.org
Cc: Edgecombe, Rick P, x86@kernel.org, binbin.wu@linux.intel.com,
dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
yosry@kernel.org, kvm@vger.kernel.org, linux-coco@lists.linux.dev
In-Reply-To: <20260514215355.1648463-16-seanjc@google.com>
> +void kvm_run_get_regs(struct kvm_vcpu *vcpu)
> +{
> + BUILD_BUG_ON(sizeof(struct kvm_sync_regs) > SYNC_REGS_SIZE_BYTES);
> +
> + if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_REGS)
> + __get_regs(vcpu, &vcpu->run->s.regs.regs);
> +
> + if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_SREGS)
> + __get_sregs(vcpu, &vcpu->run->s.regs.sregs);
> +}
> +
> +int kvm_run_set_regs(struct kvm_vcpu *vcpu)
> +{
> + if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) {
> + __set_regs(vcpu, &vcpu->run->s.regs.regs);
> + vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS;
> + }
> +
> + if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_SREGS) {
> + struct kvm_sregs sregs = vcpu->run->s.regs.sregs;
> +
> + if (__set_sregs(vcpu, &sregs))
> + return -EINVAL;
> +
> + vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_SREGS;
> + }
> +
> + return 0;
> +}
>
[...]
> @@ -12699,11 +11904,7 @@ static void store_regs(struct kvm_vcpu *vcpu)
> {
> BUILD_BUG_ON(sizeof(struct kvm_sync_regs) > SYNC_REGS_SIZE_BYTES);
>
> - if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_REGS)
> - __get_regs(vcpu, &vcpu->run->s.regs.regs);
> -
> - if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_SREGS)
> - __get_sregs(vcpu, &vcpu->run->s.regs.sregs);
> + kvm_run_get_regs(vcpu);
>
> if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_EVENTS)
> kvm_vcpu_ioctl_x86_get_vcpu_events(
> @@ -12712,19 +11913,8 @@ static void store_regs(struct kvm_vcpu *vcpu)
>
> static int sync_regs(struct kvm_vcpu *vcpu)
> {
> - if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) {
> - __set_regs(vcpu, &vcpu->run->s.regs.regs);
> - vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS;
> - }
> -
> - if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_SREGS) {
> - struct kvm_sregs sregs = vcpu->run->s.regs.sregs;
> -
> - if (__set_sregs(vcpu, &sregs))
> - return -EINVAL;
> -
> - vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_SREGS;
> - }
> + if (kvm_run_set_regs(vcpu))
> + return -EINVAL;
Nit:
Do you think 'kvm_run_sync_regs()' is better than 'kvm_run_set_regs()'?
Because I think "sync" reflects better that vcpu->run->kvm_dirty_regs is cleared
after the "set" operation.
>
> if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_EVENTS) {
> struct kvm_vcpu_events events = vcpu->run->s.regs.events;
Also, I wonder whether it's better to add a helper for events so sync_regs() and
store_regs() can be simplified to:
static int sync_regs(struct kvm_vcpu *vcpu)
{
if (kvm_run_sync_regs(vcpu))
return -EINVAL;
return kvm_run_sync_events(vcpu);
}
static void store_regs(struct kvm_vcpu *vcpu)
{
kvm_run_get_regs(vcpu);
kvm_run_get_events(vcpu);
}
And maybe 'kvm_run_get_regs()' could be 'kvm_run_store_regs()' too , so that the
store_regs() could be:
static void store_regs(struct kvm_vcpu *vcpu)
{
kvm_run_store_regs(vcpu);
kvm_run_store_events(vcpu);
}
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 185062a26924..fd55cd031b1c 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -414,6 +414,7 @@ int handle_ud(struct kvm_vcpu *vcpu);
>
> void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> struct kvm_queued_exception *ex);
> +void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu);
>
> int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
> int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
> @@ -604,6 +605,7 @@ static inline void kvm_machine_check(void)
> int kvm_spec_ctrl_test_value(u64 value);
> int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
> struct x86_exception *e);
> +void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid);
>
If I read correct, this is because "regs.c" calls kvm_invalidate_pcid() but you
want to keep it in x86.c. But it seems the "x86.h" isn't included by "regs.c"
directly but via other headers ("mmu.h" does include "x86.h").
Should the "regs.c" include "x86.h" directly?
Btw, I am a bit confused the relationship between "x86.h" and other headers like
"mmu.h" and the new "regs.h". That is, headers like "mmu.h" include "x86.h",
but headers like "regs.h" do not (instead, "x86.h" includes them).
^ permalink raw reply
* Re: [PATCH v9 14/23] x86/virt/seamldr: Shut down the current TDX module
From: Chao Gao @ 2026-05-19 12:05 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
binbin.wu@linux.intel.com, Verma, Vishal L, nik.borisov@suse.com,
mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
Annapurve, Vishal, Shahar, Sagi, djbw@kernel.org, tglx@kernel.org,
paulmck@kernel.org, hpa@zytor.com, bp@alien8.de,
yilun.xu@linux.intel.com, x86@kernel.org
In-Reply-To: <329d3811e8acbfc2ecdb1c7ba443f23161329e2a.camel@intel.com>
On Tue, May 19, 2026 at 11:00:54AM +0800, Edgecombe, Rick P wrote:
>On Wed, 2026-05-13 at 08:09 -0700, Chao Gao wrote:
>> The first step of TDX module updates is shutting down the current TDX
>> module. This step also packs state information that needs to be
>> preserved across updates as handoff data,
>>
>
>kinda reads like handoff data is an existing term, but its the first reference
>in this series.
>
>Maybe packs state information that needs to be preserved across updates, called
>"handoff data". This handoff data is consumed...
Sure. Will do.
>
>> which will be consumed by the
>> updated module. The handoff data is stored internally in the SEAM range
>> and is hidden from the kernel.
>>
>> To ensure a successful update, the new module must be able to consume
>> the handoff data generated by the old module.
>>
>
>Is it too obvious thing to state? Above you already say it's needed.
Ok. Let me drop it.
>
>> Since handoff data layout
>> may change between modules, the handoff data is versioned. Each module
>> has a native handoff version and provides backward support for several
>> older versions.
>>
>> The complete handoff versioning protocol is complex as it supports both
>> module upgrades and downgrades. See details in Intel® Trust Domain
>> Extensions (Intel® TDX) Module Base Architecture Specification, Chapter
>> "Handoff Versioning".
>>
>> Ideally, the kernel needs to retrieve the handoff versions supported by
>> the current module and the new module and select a version supported by
>> both. But, since this implementation chooses to only support module
>> upgrades, simply request the current module to generate handoff data
>> using its highest supported version, expecting that the new module will
>> likely support it.
>
>Hmm, "likely"? Is this trying to justify the kernel's policy? Dunno, stands out
>as weird to me. Like "this will mostly work". Sounds incomplete, rather than a
>reason of "this policy is the optimal initial implementation" or something like
>that.
how about:
... But since this implementation only supports module upgrades, simply request
handoff data from the current module using its highest supported version.
That is sufficient for this upgrade-only implementation.
>
>>
>> Retrieve the module's handoff version from TDX global metadata and add an
>> update step to shut down the module.
>>
>
>This is small patch with both things, but it's almost two changes.
>
>> Module shutdown has global effect, so
>> it only needs to run on one CPU.
>
>I wouldn't think having some global effect would necessarily exclude having to
>run on multiple CPUs. Or at least I don't follow. Is it a TDX arch thing? I
>guess it's ok.
Yes. This comes from the TDX architecture. I will just say in the changelog
that module shutdown only needs to run on one CPU.
>
>>
>> Note that the handoff information isn't cached in tdx_sysinfo. It is used
>> only for module shutdown, and is present only when the TDX module supports
>> updates. Caching it in get_tdx_sys_info() would require extra update-support
>> guards and refreshing the cached value across module updates.
>
>Instead of being a "note", could this be just an imperative: Don't cache the
>handoff information in tdx_sysinfo...
Sure. Will do.
>
>> @@ -214,8 +216,16 @@ static void init_state(struct update_ctrl *ctrl)
>> static int do_seamldr_install_module(void *seamldr_params)
>> {
>> enum module_update_state newstate, curstate = MODULE_UPDATE_START;
>> + int cpu = smp_processor_id();
>> + bool primary;
>> int ret = 0;
>>
>> + /*
>> + * Use CPU 0 to execute update steps that must run exactly once.
>> + * Note CPU 0 is always online.
>> + */
>> + primary = cpu == 0;
>> +
>
>Where does the term 'primary' come from?
>I'm guessing that the global steps must
>each be run on the same CPU? Is that right? And we just pick the cpu that we
>know much be online? Or can the global steps be run on different CPUs? Or they
>*have* to be run on cpu 0? It might be worth some comments explaining, depending
>on the answers to those questions.
"primary" is just my name for the CPU that runs the global steps. There is
nothing special about CPU 0 beyond the fact that it is guaranteed to be
online, so it is a convenient choice.
I can rename it to something like 'is_primary_cpu' or 'is_global_step_cpu'
for clarity.
how about:
/*
* Some steps must be run on exactly one CPU. Pick CPU 0 to execute those
* steps because CPU 0 is always online.
*/
>
>> do {
>> newstate = READ_ONCE(update_ctrl.state);
>>
>> @@ -226,7 +236,10 @@ static int do_seamldr_install_module(void *seamldr_params)
>>
>> curstate = newstate;
>> switch (curstate) {
>> - /* TODO: add the update steps. */
>> + case MODULE_UPDATE_SHUTDOWN:
>> + if (primary)
>> + ret = tdx_module_shutdown();
>> + break;
>> default:
>> break;
>> }
>> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> index 1621695d7561..da3c1e857b26 100644
>> --- a/arch/x86/virt/vmx/tdx/tdx.c
>> +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> @@ -321,7 +321,7 @@ static __init int build_tdx_memlist(struct list_head *tmb_list)
>> return ret;
>> }
>>
>> -static __init int read_sys_metadata_field(u64 field_id, u64 *data)
>> +static int read_sys_metadata_field(u64 field_id, u64 *data)
>> {
>> struct tdx_module_args args = {};
>> int ret;
>> @@ -1267,6 +1267,23 @@ static __init int tdx_enable(void)
>> }
>> subsys_initcall(tdx_enable);
>>
>> +int tdx_module_shutdown(void)
>> +{
>> + struct tdx_sys_info_handoff handoff = {};
>> + struct tdx_module_args args = {};
>> + int ret;
>> +
>> + ret = get_tdx_sys_info_handoff(&handoff);
>> + WARN_ON_ONCE(ret);
>
>Take or leave it:
>
> Why not just WARN_ON_ONCE(get_tdx_sys_info_handoff(&handoff));
> And we can drop the ret var. Save 2 LOC.
Dave had a different preference here:
https://lore.kernel.org/kvm/8b9d7fa7-6534-48e7-a4fa-c21260b1c762@intel.com/
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 11:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260515225113.GN7702@ziepe.ca>
On Fri, May 15, 2026 at 07:51:13PM -0300, Jason Gunthorpe wrote:
> On Thu, May 14, 2026 at 02:43:39PM +0000, Mostafa Saleh wrote:
> > > That's a somewhat different problem, we have the dev->trusted stuff
> > > that is supposed to deal with this kind of security. We need it for
> > > IOMMU based systems too, eg hot plug thunderbolt should have it.
> >
> > I see that it is used only for dma-iommu and for PCI devices.
> > However, I think that should be a problem with other CCA solutions
> > with emulated devices as they are untrusted. As I'd expect they
> > would have virtio devices.
>
> Yes, any security solution with an out of TCB device should be using either
> memory encryption so the kernel already bounces or this trusted stuff
> and a force strict dma-iommu so the dma layer is careful.
>
> This is more policy from userspace what devices they want in or out of
> their TCB. Like you make accept the device into T=1 but then still
> want to keep it out of your TCB with the vIOMMU, I can see good
> arguments for something like that.
>
> > > > While we can debate the aesthetics of the setup , this is
> > > > the exisitng behaviour for Linux, which existed for years
> > > > and pKVM relies on and is used extensively.
> > > > And, this patch alters that long-standing logic and introduces
> > > > a functional regression.
> > >
> > > Yeah, Aneesh needs to do something here, I'm pointing out it is
> > > entirely seperate thing from the CC path we are working on which is
> > > decoupling CC from reylying on force swiotlb.
> >
> > I am looking into converting pKVM to use the CC stuff, I replied with
> > a patch to Aneesh in this thread. However, I need to do more testing
> > and make sure there are not any unwanted consequences.
>
> Yeah, it is a nice patch and I think it will help reduce the
> complexity if it aligns to CCA type stuff.
>
> > > In a pkvm world it should be the same, the S2 table for the SMMU will
> > > control what the device can access, and if the SMMU points to a
> > > "private" or "shared" page is not something the device needs to know
> > > or care about.
> >
> > I see that's because dma-iommu chooses the attrs for iommu_map().
>
> Long term the DMA API path through the dma-iommu will pass the
> ATTR_CC_SHARED through to iommu_map so when the arch requires a
> different IOPTE it can construct it.
>
> > In pKVM, dma_addr_t and IOPTE are the same for private and shared,
> > so nothing differs in that case.
>
> Yes, so you don't have to worry.
>
> > We don’t expect pass-through devices to interact with shared
> > memory (T=0) at the moment.
> > However, I can see use cases for that, where the host and the guest
> > collaborate with device passthrough and require zero copy.
>
> Once you add the CC patch it becomes immediately possible though
> because the user can allocate a CC shared DMA HEAP and feed that all
> over the place.
>
> > One other interesting case for device-passthrough is non-coherent
> > devices which then require private pools for bouncing.
>
> Why does shared/private matter for bouncing? Why do you need to bounce
> at all? Do cmo's not work in pkvm guests?
At the moment, in iommu_dma_map_phys(), if a non coherent device
tries to map an unaligned address or size it will be bounced.
In pKVM, dma-iommu is used for assigned devices which operate on
private memory, so bouncing that through the SWIOTLB would leak
information from the guest as the SWIOTLB is decrypted.
In that case, the device needs a pool which remains private.
Thanks,
Mostafa
>
> Jason
^ permalink raw reply
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 11:04 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5ah5oa59wy.fsf@kernel.org>
On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> >>
> >> What I meant was that we need a generic way to identify a pKVM guest, so
> >> that we can use it in the conditional above.
> >
> > I have this patch, with that I can boot with your series unmodified,
> > but I will need to do more testing.
> >
>
> Thanks, I can add this to the series once you complete the required testing.
>
I am still running more tests, but looking more into it. Setting
force_dma_unencrypted() to true for pKVM guests is wrong, as the
guest shouldn’t try to decrypt arbitrary memory as it can include
sensitive information (for example in case of virtio sub-page
allocation) and should strictly rely on the restricted-dma-pool
for that.
However, with my patch and setting force_dma_unencrypted() to false
on top of this series, it fails on pKVM due to a missing shared
attribute as Alexey mentioned, as now SWIOTLB rejects non shared
attrs, so, the DMA-API has to pass it. With that, I can boot again:
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5103a04df99f..b19aeec03f27 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
if (is_swiotlb_for_alloc(dev)) {
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (page) {
/*
@@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
&cpu_addr, gfp, attrs);
if (is_swiotlb_for_alloc(dev)) {
+ attrs |= DMA_ATTR_CC_SHARED;
+
page = dma_direct_alloc_swiotlb(dev, size, attrs);
if (!page)
return NULL;
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 4e35264ab6f8..8ee5bbf78cfb 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
return DMA_MAPPING_ERROR;
+ attrs |= DMA_ATTR_CC_SHARED;
return swiotlb_map(dev, phys, size, dir, attrs);
}
--
I will keep testing and let you know how it goes. If there is nothing
else required to convert pKVM guests to CC, I can just post the patch
separately as it has no dependency on this series.
Re force_dma_unencrypted(), I am looking into a safe way to use it
for pKVM as I beleive it will be useful to eliminate some bouncing.
However, that’s not critical for this series and can be added later
as I am still investigating it, if I reach something I can post it
along the pKVM patch above.
Thanks,
Mostafa
>
>
> -aneesh
^ permalink raw reply related
* Re: [PATCH v9 19/23] x86/virt/tdx: Refresh TDX module version after update
From: Chao Gao @ 2026-05-19 10:42 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
binbin.wu@linux.intel.com, Verma, Vishal L, nik.borisov@suse.com,
mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
Annapurve, Vishal, Shahar, Sagi, djbw@kernel.org, tglx@kernel.org,
paulmck@kernel.org, hpa@zytor.com, bp@alien8.de,
yilun.xu@linux.intel.com, x86@kernel.org
In-Reply-To: <dbfb8fa01ee6035576c6cb2f665372323117a63f.camel@intel.com>
On Tue, May 19, 2026 at 11:16:35AM +0800, Edgecombe, Rick P wrote:
>On Wed, 2026-05-13 at 08:10 -0700, Chao Gao wrote:
>> The kernel exposes the TDX module version through sysfs so userspace can
>> check update compatibility. That information needs to remain accurate
>> across runtime updates.
>>
>> A runtime update may change the module's update_version, so refresh the
>> cached version right after a successful update.
I will use version instead of update_version there. There is no need to
distinguish it from the major/minor version fields.
>>
>> Drop __ro_after_init from tdx_sysinfo because it is now updated at runtime.
>>
>> Do not refresh the rest of tdx_sysinfo, even if some values change across
>> updates. TDX module updates are backward compatible, so existing
>> tdx_sysinfo consumers, e.g. KVM, can continue to operate without seeing the
>> new values.
>>
>> Refreshing the full structure would be risky. A tdx_sysinfo consumer may
>> initialize its TDX support based on the features originally reported in
>> tdx_sysinfo. If a runtime update adds new features and the full structure
>> is refreshed, that consumer could observe and use the newly reported
>> features without having performed the setup required to use them safely.
>>
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> ---
>
>Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>
>The only thing I saw missing from Dave's last comments was:
>---
>> Note that major and minor versions are not refreshed because runtime updates
>> are supported only between releases with identical major and minor versions.
>
>I'd rather have this in code than a changelog comment.
>
>If they can't change then warn if they do.
>---
>
>But I think we discussed offline to not do this, is it right?
We didn't reach a firm conclusion on that.
But I think there is good reason not to do that, as I explained in my v8
reply:
: Maybe I can just drop the note as I don't want to add code to preemptively
: catch theoretical module bugs.
:
: I added it because Sashiko pointed out that assigning the whole version struct
: outside stop_machine() could allow sysfs readers to observe a partially updated
: version. As we don't need to print new module version, I will move that
: assignment into stop_machine(), which addresses that issue. After that, there
: is no need to mention that major/minor versions are identical across updates.
^ permalink raw reply
* Re: [PATCH v14 37/44] arm64: RMI: Prevent Device mappings for Realms
From: Aneesh Kumar K.V @ 2026-05-19 10:25 UTC (permalink / raw)
To: Steven Price, kvm, kvmarm
Cc: Steven Price, Catalin Marinas, Marc Zyngier, Will Deacon,
James Morse, Oliver Upton, Suzuki K Poulose, Zenghui Yu,
linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Gavin Shan, Shanker Donthineni, Alper Gun, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <20260513131757.116630-38-steven.price@arm.com>
Steven Price <steven.price@arm.com> writes:
> Physical device assignment is not yet supported. RMM v2.0 does add the
> relevant APIs, but device assignment is a big topic so will be handled
> in a future patch series. For now prevent device mappings when the guest
> is a realm.
>
> Signed-off-by: Steven Price <steven.price@arm.com>
> ---
> Changes from v6:
> * Fix the check in user_mem_abort() to prevent all pages that are not
> guest_memfd() from being mapped into the protected half of the IPA.
> Changes from v5:
> * Also prevent accesses in user_mem_abort()
> ---
> arch/arm64/kvm/mmu.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 776ffe56d17e..7678226ffd38 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1230,6 +1230,10 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> if (is_protected_kvm_enabled())
> return -EPERM;
>
> + /* We don't support mapping special pages into a Realm */
> + if (kvm_is_realm(kvm))
> + return -EPERM;
> +
> size += offset_in_page(guest_ipa);
> guest_ipa &= PAGE_MASK;
>
The commit message suggests that this will need to be updated to support
Device Assignment, but that is not true. IIUC, this is only used by
GICv2?. Can we update the commit message?
-aneesh
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox