Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v2 03/15] KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest
From: David Woodhouse @ 2026-05-20  8:32 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
	Kiryl Shutsemau, Paul Durrant
  Cc: Dave Hansen, Rick Edgecombe, kvm, x86, linux-coco, linux-kernel,
	Yosry Ahmed, Kai Huang, Binbin Wu
In-Reply-To: <20260514215355.1648463-4-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 2002 bytes --]

On Thu, 2026-05-14 at 14:53 -0700, Sean Christopherson wrote:
> Don't truncate RAX when handling a Xen hypercall for a guest with protected
> state, as KVM's ABI is to assume the guest is in 64-bit for such cases
> (the guest leaving garbage in 63:32 after a transition to 32-bit mode is
> far less likely than 63:32 being necessary to complete the hypercall).
> 

The latter isn't likely either. RAX is the system call number, and
those numbers are only in double digits; it'll be a while before we
need more than 8 bits for those, let alone 32 :)

But sure, as a cleanup it makes sense. Thanks.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> Fixes: b5aead0064f3 ("KVM: x86: Assume a 64-bit hypercall for guests with protected state")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/xen.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 6d9be74bb673..895095dc684e 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -1678,15 +1678,14 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
>  	bool handled = false;
>  	u8 cpl;
>  
> -	input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX);
> -
>  	/* Hyper-V hypercalls get bit 31 set in EAX */
> -	if ((input & 0x80000000) &&
> +	if ((kvm_rax_read(vcpu) & 0x80000000) &&
>  	    kvm_hv_hypercall_enabled(vcpu))
>  		return kvm_hv_hypercall(vcpu);
>  
>  	longmode = is_64_bit_hypercall(vcpu);
>  	if (!longmode) {
> +		input = (u32)kvm_rax_read(vcpu);
>  		params[0] = (u32)kvm_rbx_read(vcpu);
>  		params[1] = (u32)kvm_rcx_read(vcpu);
>  		params[2] = (u32)kvm_rdx_read(vcpu);
> @@ -1696,6 +1695,7 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
>  	}
>  	else {
>  #ifdef CONFIG_X86_64
> +		input = (u64)kvm_rax_read(vcpu);
>  		params[0] = (u64)kvm_rdi_read(vcpu);
>  		params[1] = (u64)kvm_rsi_read(vcpu);
>  		params[2] = (u64)kvm_rdx_read(vcpu);


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v2 03/15] KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest
From: David Woodhouse @ 2026-05-20  8:27 UTC (permalink / raw)
  To: Binbin Wu, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Kiryl Shutsemau, Paul Durrant,
	Dave Hansen, Rick Edgecombe, kvm, x86, linux-coco, linux-kernel,
	Yosry Ahmed, Kai Huang
In-Reply-To: <3beeaf04-e4f9-44cf-a3a3-04fa12912848@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 2980 bytes --]

On Wed, 2026-05-20 at 13:02 +0800, Binbin Wu wrote:
> > > > For this one, I think the current KVM code is consistent.
> > > > The format is determined by EFER.LMA, whether the guest is running in 64 bit or
> > > > compatible mode doesn't change the ABI.
> 
> I still have a point of confusion.
> 
> I noticed a behavioral mismatch between KVM and Xen regarding when they switch
> to the standard/compat shared info.
> - In Xen: The 32-bit shared info structure is latched if the current vCPU is
>   not in 64-bit mode:
>   hvm_latch_shinfo_size
>       d->arch.has_32bit_shinfo = hvm_guest_x86_mode(current) != X86_MODE_64BIT
> 
> - In KVM: It evaluates is_long_mode(vcpu) instead. E.g.,
>   kvm_xen_write_hypercall_page
>       bool lm = is_long_mode(vcpu);
>       ...
>       kvm->arch.xen.long_mode = lm;

Nice catch. That should probably use is_64_bit_hypercall() too, yes?

> In theory, these two checks could differ when the guest kernel is running in
> a 32-bit compatibility mode. However, I believe this mismatch is fine in
> practice for two reasons:
> - Mainstream 64-bit OSes don't run in compatibility mode for kernel code after
>   the early init.

Although... some of the early init does involve 32-bit mode and it
wouldn't be impossible for the guest to build the hypercall page from
32-bit mode during startup.

> - By default, HVM guests cannot issue hypercalls from userspace. The only one
>   exception HVMOP_guest_request_vm_event is not related to the share info.

A third reason: This is only one of *two* places where the guest mode
gets latched. The guest's mode is also latched when it sets the
HVM_PARAM_CALLBACK_IRQ parameter. When running under KVM, the VMM
handles this and tells the kernel via the KVM_XEN_ATTR_TYPE_LONG_MODE
attribute.

So even if it doesn't latch correctly when setting the hypercall page,
it will later.

And Linux at least will set KVM_PARAM_CALLBACK_IRQ even if it isn't
using it, in xen_set_upcall_vector() with a comment saying 'Trick
toolstack to think we are enlightened'.

> So the vCPU will never be in compatibility mode when a related hypercall occurs.
> In this specific operational context, evaluating is_long_mode() yields the
> exact same functional outcome as checking for 64-bit execution mode. Am I
> missing anything here?

I think you're right. We have thus far launched about 5 billion Xen
guests with this, and I've never heard any report about guests latching
the wrong mode. And there are a *lot* of random home-grown and network
appliance operating systems out there, and basic things like event
channels would fail to work if the shared_info was in the wrong mode.

That said, I would not be averse to fixing it. We could just use
is_64_bit_hypercall() in kvm_xen_write_hypercall_page(), right?

Or at the very least a comment saying that it *should* be doing so, but
that we're too nervous to change it now? 



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v5 1/3] firmware: smccc: coco: Manage arm-smccc platform device and CCA auxiliary drivers
From: Aneesh Kumar K.V @ 2026-05-20  8:11 UTC (permalink / raw)
  To: Greg KH
  Cc: Suzuki K Poulose, linux-coco, linux-arm-kernel, linux-kernel,
	Catalin Marinas, Jeremy Linton, Jonathan Cameron,
	Lorenzo Pieralisi, Mark Rutland, Sudeep Holla, Will Deacon,
	Steven Price
In-Reply-To: <yq5apl2txmav.fsf@kernel.org>


Hi Greg,

Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes:

> Greg KH <gregkh@linuxfoundation.org> writes:
>
>> On Thu, May 14, 2026 at 08:07:27PM +0530, Aneesh Kumar K.V wrote:
>>> Greg KH <gregkh@linuxfoundation.org> writes:
>>> 
>>> > On Thu, May 14, 2026 at 12:04:13PM +0100, Suzuki K Poulose wrote:
>>> >> Hi Aneesh
>>> >> 
>>> >> On 14/05/2026 10:40, Aneesh Kumar K.V (Arm) wrote:
>>> >> > Make the SMCCC driver responsible for registering the arm-smccc platform
>>> >> > device and after confirming the relevant SMCCC function IDs, create
>>> >> > the arm_cca_guest auxiliary device.
>>> >> > 
>>> >> 
>>> >> There are a few changes squashed in to this patch. Please could we
>>> >> split the patch in the following order ?
>>> >> 
>>> >> 1. Add platform device for arm-smccc
>>> >
>>> > Do not make any more "fake" platform devices please.
>>> >
>>> >> 2. Move TRNG to Auxilliary Device - (Even though it is a later patch, move
>>> >> it before the RSI changes)
>>> >
>>> > No, move it to the faux api please.
>>> >
>>> 
>>> 
>>> Maybe I was not complete in my previous reply. I did not want to repeat
>>> the entire thread, so I quoted the lore link for more details.
>>> 
>>> 1. We have platform firmware-provided SMCCC interfaces. Based on the
>>> support/availability of these function IDs, we want to load multiple
>>> drivers.
>>> 2. This patch series adds a platform device to represent the
>>> firmware-provided SMCCC resource.
>>> 3. Different SMCCC ranges are now represented as auxiliary devices.
>>> 4. Different subsystems, such as TSM, can autoload their backend drivers
>>> based on the availability of these SMCCC ranges, which are now
>>> represented as auxiliary devices.
>>> 
>>> You had agreed to all of this in the previous discussion here:
>>> https://lore.kernel.org/all/2025101516-handbook-hyphen-62ec@gregkh
>>
>> Then why did someone say "this is a fake platform device with no actual
>> resources"?  That's what I was triggering off of.
>>
>> Again, if you have actual platform resources, GREAT, use a platform
>> device and aux.  If you do not, then do NOT use a platform device.
>>
>> totally confused,
>>
>> greg k-h
>
> I have now rewritten the cover letter as below. Let me know if this
> helps.
>
> Switch Arm SMCCC firmware services to auxiliary devices
>
> As discussed here:
> https://lore.kernel.org/all/20250728135216.48084-12-aneesh.kumar@kernel.org
>
> The earlier CCA guest support used an arm-cca-dev platform device as a pure
> software anchor for the TSM class device. That platform device did not
> correspond to a DT/ACPI described device, MMIO range, interrupt, or other
> platform resource; it existed only to make the CCA guest driver bind and to
> place the resulting TSM device in the driver model. The same pattern also
> exists for smccc_trng. Creating separate platform devices for such
> SMCCC-discovered features is misleading, because those features are not
> independent platform devices.
>
> This series changes the model so that there is a single arm-smccc platform
> device representing the SMCCC firmware interface itself. The firmware
> interface, including its discoverable SMCCC function space, is the
> resource: after PSCI/SMCCC conduit discovery, the kernel can query SMCCC
> function IDs and determine whether optional firmware services are present.
> Services such as SMCCC TRNG and Realm Services Interface (RSI) are
> therefore represented as children of the arm-smccc device, and are created
> only when the required SMCCC function IDs and ABI checks succeed.
>
> The child devices use the auxiliary bus deliberately: they are intended to
> bind independent feature drivers, not just to provide a driverless object for
> sysfs or other class-device anchoring. They are firmware-provided functions
> of the parent SMCCC interface that are consumed by separate kernel drivers
> in different subsystems, such as hwrng and virt/coco/TSM. Those drivers
> need normal driver-core matching, probe/remove lifetime, and module
> autoloading based on the discovered firmware feature. The auxiliary bus
> provides a MODALIAS and id-table based binding model for that case, while
> keeping the feature drivers off the platform bus. A faux device was
> considered, but not used because it is suited for simple software objects
> that do not need independent bus/driver binding. The faux bus has no
> feature-driver id-table or MODALIAS matching, so it would not preserve the
> module-autoload flow that the current platform-device based users rely on.
>
> In other words, the parent arm-smccc device represents the firmware
> resource exposed through the SMCCC conduit, and each auxiliary child
> represents one discovered firmware service of that parent. This removes the
> unnecessary per-feature platform devices while retaining automatic loading
> and independent subsystem drivers for the SMCCC services.
>
> The TSM framework uses the device abstraction to provide cross-architecture
> TSM and TEE I/O functionality, including enumerating available platform TEE
> I/O capabilities and provisioning connections between the platform TSM and
> device DSMs.  For Arm CCA, the RSI auxiliary device continues to provide the
> device anchor used by the CCA guest TSM provider.
>
> For the CCA platform, the resulting device hierarchy appears as follows.
> Note that the auxiliary device is parented by the arm-smccc platform device,
> so the sysfs path remains under /devices/platform/arm-smccc/:
>
> $ cd /sys/class/tsm/
> $ ls -al
> total 0
> drwxr-xr-x    2 root     root             0 Jan  1 00:02 .
> drwxr-xr-x   23 root     root             0 Jan  1 00:00 ..
> lrwxrwxrwx    1 root     root             0 Jan  1 00:03 tsm0 -> ../../devices/platform/arm-smccc/arm_cca_guest.arm-rsi-dev.0/tsm/tsm0
> $
>
> The series also replaces the old arm-cca-dev userspace-visible dummy device
> with /sys/firmware/cca/realm_guest for detecting whether the kernel is
> running in a Realm.  This keeps the guest-state ABI under /sys/firmware and
> separates it from the internal driver-binding device used by the CCA guest
> TSM provider.
>
>
> -aneesh

Gentle ping, could you let me know if the updated cover letter helps
clarify the confusion regarding the platform-device usage here?

-aneesh

^ permalink raw reply

* Re: [PATCH v2 03/15] KVM: x86/xen: Don't truncate RAX when handling hypercall from protected guest
From: Binbin Wu @ 2026-05-20  5:02 UTC (permalink / raw)
  To: David Woodhouse, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Kiryl Shutsemau, Paul Durrant,
	Dave Hansen, Rick Edgecombe, kvm, x86, linux-coco, linux-kernel,
	Yosry Ahmed, Kai Huang
In-Reply-To: <dc62e58e-b6ee-41e1-84a5-0716822fefc8@linux.intel.com>



On 5/18/2026 5:55 PM, Binbin Wu wrote:
> 
> 
> On 5/18/2026 5:50 PM, David Woodhouse wrote:
>> On Mon, 2026-05-18 at 17:43 +0800, Binbin Wu wrote:
>>>
>>>
>>> On 5/18/2026 3:15 PM, David Woodhouse wrote:
>>>> On Mon, 2026-05-18 at 10:19 +0800, Binbin Wu wrote:
>>>>>  
>>>>>>>>   	longmode = is_64_bit_hypercall(vcpu);
>>>>>>>
>>>>>>> Is the variable name misleading?
>>>>>>
>>>>>> It most definitely is.  However, @longmode is passed around quite a few locations
>>>>>> in xen.c, and so I don't want to opportunistically fix this one variable.  Though
>>>>>> I'm definitely not opposed to a separate patch to rename them all to is_64bit or
>>>>>> something.
>>>>>
>>>>> OK, I can do it.
>>>>
>>>> This one (as shown above) is clearly indicating whether this particular
>>>> vCPU is in 64-bit mode for this particular hypercall. Changing that to
>>>> is_64bit makes sense.
>>>>
>>>> However, there is a separate overall mode for the VM, which is stored
>>>> in 'kvm->arch.xen.long_mode' and accessed by userspace using the
>>>> KVM_XEN_ATTR_TYPE_LONG_MODE attribute. It affects the datatypes used by
>>>> shared memory data structures, and is also latched by the kernel when
>>>> the guest writes the MSR for the hypercall page. That one should
>>>> probably keep its name.
>>>
>>> For this one, I think the current KVM code is consistent.
>>> The format is determined by EFER.LMA, whether the guest is running in 64 bit or
>>> compatible mode doesn't change the ABI.

I still have a point of confusion.

I noticed a behavioral mismatch between KVM and Xen regarding when they switch
to the standard/compat shared info.
- In Xen: The 32-bit shared info structure is latched if the current vCPU is
  not in 64-bit mode:
  hvm_latch_shinfo_size
      d->arch.has_32bit_shinfo = hvm_guest_x86_mode(current) != X86_MODE_64BIT

- In KVM: It evaluates is_long_mode(vcpu) instead. E.g.,
  kvm_xen_write_hypercall_page
      bool lm = is_long_mode(vcpu);
      ...
      kvm->arch.xen.long_mode = lm;

In theory, these two checks could differ when the guest kernel is running in
a 32-bit compatibility mode. However, I believe this mismatch is fine in
practice for two reasons:
- Mainstream 64-bit OSes don't run in compatibility mode for kernel code after
  the early init.
- By default, HVM guests cannot issue hypercalls from userspace. The only one
  exception HVMOP_guest_request_vm_event is not related to the share info.

So the vCPU will never be in compatibility mode when a related hypercall occurs.
In this specific operational context, evaluating is_long_mode() yields the
exact same functional outcome as checking for 64-bit execution mode. Am I
missing anything here?


>>
>> Agreed. For the hypercall case you're looking at, switching the name to
>> is_64bit makes sense.
>>
>>> struct compat_shared_info is used only when the guest is running natively in a
>>> 32-bit build.
>>
>> The struct compat_shared_info is also used in !kvm->arch.xen.long_mode
>> on a 64-bit host, as that's what means the guest is considered to be a
>> 32-bit guest.
>>
>> It's somewhat orthogonal from whether any given vCPU is making any
>> given hypercall while in 64-bit mode. The 'long_mode' is *latched* at
>> certain specific times which are defined by Xen's historical behaviour.
>>
>> I'm suggesting that you clean up longmode→is_64bit for the *hypercalls*
>> but leave 'long_mode' as is.
>>
> 
> Yes, will only do it for is_64_bit_hypercall().
> 
>>
> 


^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-20  3:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mostafa Saleh, iommu, linux-arm-kernel, linux-kernel, linux-coco,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260519161120.GO7702@ziepe.ca>

Jason Gunthorpe <jgg@ziepe.ca> writes:

> On Tue, May 19, 2026 at 09:35:30PM +0530, Aneesh Kumar K.V wrote:
>> Yes, that also resulted in simpler and cleaner code.
>> 
>> swiotlb_tbl_map_single
>> 	/*
>> 	 * If the physical address is encrypted but the device requires
>> 	 * decrypted DMA, use a decrypted io_tlb_mem and update the
>> 	 * attributes so the caller knows that a decrypted io_tlb_mem
>> 	 * was used.
>> 	 */
>> 	if (!(*attrs & DMA_ATTR_CC_SHARED) && force_dma_unencrypted(dev))
>> 		*attrs |= DMA_ATTR_CC_SHARED;
>> 
>> 	if (mem->unencrypted != !!(*attrs & DMA_ATTR_CC_SHARED))
>> 		return (phys_addr_t)DMA_MAPPING_ERROR;
>
> Yeah, exactly that is so much clearer now that the mem->unecrypted is
> tied directly.
>
> That logic is reversed though, the incoming ATTR_CC doesn't matter for
> swiotlb, that is just the source of the memcpy.
>
> /* swiotlb pool is incorrect for this device */
> if (mem->unencrypted != force_dma_unencrypted(dev))
>     return (phys_addr_t)DMA_MAPPING_ERROR;
>
> /* Force attrs to match the kind of memory in the pool */
> if (mem->unencrypted)
>      *attrs |= DMA_ATTR_CC_SHARED;
> else
>      *attrs &= ~DMA_ATTR_CC_SHARED;
>
>
> Attrs should be forced to whatever memory swiotlb selected.
>

But that will not handle a T=1 device that wants to use swiotlb to
bounce unencrypted memory. That is:

force_dma_unencrypted(dev) == 0  /* T=1 device */
attrs = DMA_ATTR_CC_SHARED;

In that case, it should use an unencrypted io_tlb_mem:
mem->unencrypted == 1

-aneesh

^ permalink raw reply

* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Huang, Kai @ 2026-05-20  2:29 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: dwmw2@infradead.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, binbin.wu@linux.intel.com,
	vkuznets@redhat.com, x86@kernel.org, kas@kernel.org, paul@xen.org,
	yosry@kernel.org, pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <ag0NlJ2FEIL7GJIj@google.com>

On Tue, 2026-05-19 at 18:25 -0700, Sean Christopherson wrote:
> On Wed, May 20, 2026, Kai Huang wrote:
> > On Tue, 2026-05-19 at 08:04 -0700, Sean Christopherson wrote:
> > > On Tue, May 19, 2026, Kai Huang wrote:
> > Just wondering is it possible we might want to move events handling to some
> > other C file since you are cleanup x86.c?  But we can deal with this when it
> > happens.
> 
> Events are a hard one.  There's a decent amount of code, but not _so_ much that
> it's a no-brainer to move them out of x86.c.  And there's no super clear cut
> boundary, e.g. events can mean exceptions, INIT+SIPI, IRQs, APIC stuff, etc.,
> several of which already have substantial amounts of code outside of x86.c.

Yes agreed.

> 
> > > Hmm, though looking at all of this again, I think we're actually quite close to
> > > having somewhat sane rules.  Over the past few years, I've tried multiple times
> > > to move what I felt should be KVM-internal structures from asm/kvm_host.h to x86.h,
> > > and I've failed miserably every time because inevitably even the most innocuous
> > > struct manages to have usage that leads to cyclical header dependencies and/or is
> > > used by arch-neutral KVM code.
> > 
> > The problem is some other kernel code includes <linux/kvm_host.h> (which in turn
> > includes <asm/kvm_host.h>) but the KVM internal structures have nothing to do
> > with them.
> > 
> > E.g., some drivers are using <linux/kvm_host.h>:
> > 
> > #$ grep kvm_host.h drivers/ -Rn
> > drivers/vfio/pci/vfio_pci_zdev.c:14:#include <linux/kvm_host.h>
> > drivers/vfio/vfio_main.c:20:#include <linux/kvm_host.h>
> > drivers/firmware/arm_sdei.c:19:#include <linux/kvm_host.h>
> > drivers/hwtracing/coresight/coresight-trbe.c:20:#include <linux/kvm_host.h>
> > drivers/hwtracing/coresight/coresight-etm4x-core.c:10:#include
> > <linux/kvm_host.h>
> > drivers/s390/crypto/vfio_ap_ops.c:17:#include <linux/kvm_host.h>
> > drivers/s390/crypto/vfio_ap_private.h:20:#include <linux/kvm_host.h>
> > 
> > But looking at them, AFAICT what they need is only some structure declarations
> > (e.g., 'struct kvm;') for type safety (plus some function declarations), but
> > don't actually need to see the actual structure.
> 
> Ya.
> 
> > For x86, AFAICT there's (only) "arch/x86/events/intel/core.c" actually uses the
> > 'struct kvm_pmu', though.
> 
> I have a patch to fix that :-)
> 
> https://lore.kernel.org/all/20260508231353.406465-7-seanjc@google.com

Oh great!

> 
> > I haven't checked other ARCHs whether there's cases actually need to use any
> > structure.
> 
> PPC, arm64, and IIRC s390 all have assets defined by KVM that are consumed by
> the kernel at-large.  E.g. because KVM for arm64 can't be built as a module, the
> kernel calls directly into KVM during boot.  IIRC, PPC has similar code.
> 
> A few years ago (wow, time flies), I was able to hide KVM internals, using #ifdef
> shenanigans to deal with cases where non-KVM really truly needed to get at things
> defined in kvm_host.h
> 
> https://lore.kernel.org/all/20230916003118.2540661-27-seanjc@google.com

Oh I never thought from this perspective (thanks for the info):

  --
  Hiding KVM details for all architectures will, in the very distant future, 
  allow loading a new (or old) KVM module without needing to rebuild and reboot 
  the entire kernel, or to even allow loading and running multiple versions of 
  KVM simultaneously on a single host.
  --

> 
> More recently, I tried to standardize KVM arch=>common includes[1], to help pave
> the way to splitting up kvm_host.h, but then s390's crazy arm64 support killed
> that (at least for now).
> 
> [1] https://lore.kernel.org/all/20250611001042.170501-1-seanjc@google.com
> [2] https://lore.kernel.org/all/20260428160527.1378085-1-seiden@linux.ibm.com

:-)

> 
> > > I think it's probably time to admit I've been looking at the asm/kvm_host.h vs.
> > > x86.h split all wrong, i.e. finally give up on moving structures out of kvm_host.h,
> > > and do the exact opposite: commit to using kvm_host.h to define and declare widely
> > > used structures.
> > 
> > If the structure(s) are only used within arch/x86/kvm/, it doesn't seem right to
> > define them in asm/kvm_host.h?
> 
> The problem is that anything that feeds into kvm_vcpu_arch needs to be visible
> to virt/kvm.  
> 

Yeah that's the problem.

> And burying kvm_x86_ops in arch/kvm/x86 would mean one-liners like
> kvm_arch_vcpu_blocking() couldn't be inlined.

Oh right, sad but acceptable tradeoff I guess.

> 
> I've looked at this far too many times :-)
> 
> > > Because literally the only reason that x86.h doesn't include mmu.h is that mmu.h
> > > references struct kvm_host, which is currently defined in x86.h.  
> > > 
> > 
> > Yes. But I wouldn't worry about this too much since it's a small thing we can
> > always find a way to fix.  E.g., we can move kvm_mmu_max_gfn() out of "mmu.h"
> > (with a renaming perhaps).
> 
> I hacked on moving more stuff out of x86.{c,h} and kvm_host.h.  The diff stats
> are quite promising :-)
> 
>  arch/x86/include/asm/kvm_host.h           |  444 ++-------------
>  arch/x86/kvm/x86.c                        | 3784 +++-----------------------------------------------------------------------------------------------------------------------
>  arch/x86/kvm/x86.h                        |  474 ++++++++--------
> 

Indeed!

> > > If we "fix"
> > > that, then (a) we can make x86.h the "central" include everyone expects it to be,
> > > and (b) it can be the start of a cleanup of asm/kvm_host.h and a big step towards
> > > defining maintainable "rules" for what goes where.  E.g. there are a pile of
> > > functional declarations in asm/kvm_host.h that can live elsewhere; if we trim
> > > those down, then the rules become:
> > > 
> > >   - asm/kvm_host.h holds "common" structure definitions and associated key global
> > >     variables, and things that are referenced by arch-neutral KVM.
> > 
> > It's a bit weird the arch-neutral KVM code needs to reference variables in
> > asm/kvm_host.h, and I am afraid the "common" structure definitions will
> > effectively be a lot of structures only used by arch/x86/kvm/.  
> > 
> > Which isn't necessarily a bad thing, from the perspective we might finally clean
> > this up by a giant move.
> > 
> > E.g., <linux/kvm_types.h> is already used by other kernel components where they
> > don't need <linux/kvm_host.h>.  Ideally, maybe eventually we can use
> > <linux/kvm_types.h> and <asm/kvm_types.h> for things needed by other kernel
> > components, or keep <linux/kvm_host.h> and <asm/kvm_host.h> minimal after moving
> > majority things to some KVM internal headers.
> > 
> > E.g., maybe:
> > 
> >   virt/kvm/include/kvm_host.h
> >   arch/x86/kvm/kvm_host.h (can even be merged to x86.h)
> > 
> > I think the problem is "struct kvm_arch" and "struct kvm_vcpu_arch", that they
> > are not a pointer but a fully embedded structure in "struct kvm" and "struct
> > kvm_vcpu" respectively.  That caused that you need to keep the actual structure
> > definition of "struct kvm_arch" and "kvm_vcpu_arch" in asm/kvm_host.h, which in
> > turns makes a lot of structures only used by arch/x86/kvm/ need to stay in
> > asm/kvm_host.h.
> > 
> > I am not sure whether there's a mandatory requirement that "struct kvm_arch" and
> > "struct kvm_vcpu_arch" must be fully embedded, and it would be kinda painful to
> > covert to a pointer (e.g., there's kvm_x86_ops::vm_size), but perhaps that is
> > also an option to consider?
> 
> The idea I had in the past, and where I was going with things before s390's love
> for arm64 came along, was to add a kvm_arch.h in arch/<arch>/kvm, and have virt/kvm
> include _that_ instead of kvm_host.h.  
> 

Not sure whether there's other code doing so? :-)

> That way we don't need to make any fundamental
> changes to structures, but we can still significantly cut down on what's exposed
> via kvm_host.h.  
> 

Yeah.

I saw below from you in [1]:

  --
  We've explore several alternatives to the #ifdef __KVM__ approach, and
  they all sucked, hard.  What I really wanted (and still want) to do, is to
  bury the bulk of kvm_host.h (and other KVM headers) in virt/kvm, but every
  attempt to do that ended in flames.  Even with the __KVM__ guards in place,
  each architecture's kvm_host.h is too intertwined with the common kvm_host.h,
  and trying to extract small-ish pieces just doesn't work (each patch
  inevitably snowballed into a gigantic beast).

  The other idea we considered (which I thought of, and feel dirty for even
  proposing it internally), is to move all headers under virt/kvm, add
  virt/kvm/include to the global header path, and then have KVM x86 omit
  virt/kvm/include when configured to hide KVM internals.  I hate this idea
  because it sets a bad precedent, and requires a lot of file movement
  without providing any benefit to other architectures.  E.g. I hope that
  guarding KVM internals with #ifdef __KVM__ will allow us to slowly clean
  things up so that some day KVM only exposes a handful of APIs to the rest
  of the kernel (probably a pipe dream).
  --

I haven't looked into details of your #ifdef __KVM__ approach yet, but seems you
don't quite like moving KVM internal staff to virt/kvm/include/ ?

But if we want to hide KVM internal structures, I don't see any other options
except virt/kvm/include/ is the place to go?

Btw, have you considered reverting the inclusion of "strut kvm" and "struct
kvm_arch" (and the vcpu structure), i.e., to make "struct kvm_arch" include
"struct kvm"?  I don't have any clue of whether it is feasible or how much
effort it needs, though -- it's just something came to mind when replying.

[1]: https://lore.kernel.org/all/20230916003118.2540661-1-seanjc@google.com/

> At some point I'll try to take another look; it's really the
> s390+arm64 combo that's problematic :-/

If you want, I can take a look.  I think I'll have bandwidth in near feature.

Given you have tried multiple times so I am not sure what I can achieve, though.

Anyway, seems "allow loading a new (or old) KVM module without needing to
rebuild and reboot the entire kernel" is a good reason to do this.

^ permalink raw reply

* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Sean Christopherson @ 2026-05-20  1:25 UTC (permalink / raw)
  To: Kai Huang
  Cc: dwmw2@infradead.org, Rick P Edgecombe, x86@kernel.org,
	kas@kernel.org, binbin.wu@linux.intel.com,
	dave.hansen@linux.intel.com, vkuznets@redhat.com, paul@xen.org,
	yosry@kernel.org, pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <729c4191d16e4c768c231ffb9bb8420306039210.camel@intel.com>

On Wed, May 20, 2026, Kai Huang wrote:
> On Tue, 2026-05-19 at 08:04 -0700, Sean Christopherson wrote:
> > On Tue, May 19, 2026, Kai Huang wrote:
> Just wondering is it possible we might want to move events handling to some
> other C file since you are cleanup x86.c?  But we can deal with this when it
> happens.

Events are a hard one.  There's a decent amount of code, but not _so_ much that
it's a no-brainer to move them out of x86.c.  And there's no super clear cut
boundary, e.g. events can mean exceptions, INIT+SIPI, IRQs, APIC stuff, etc.,
several of which already have substantial amounts of code outside of x86.c.

> > Hmm, though looking at all of this again, I think we're actually quite close to
> > having somewhat sane rules.  Over the past few years, I've tried multiple times
> > to move what I felt should be KVM-internal structures from asm/kvm_host.h to x86.h,
> > and I've failed miserably every time because inevitably even the most innocuous
> > struct manages to have usage that leads to cyclical header dependencies and/or is
> > used by arch-neutral KVM code.
> 
> The problem is some other kernel code includes <linux/kvm_host.h> (which in turn
> includes <asm/kvm_host.h>) but the KVM internal structures have nothing to do
> with them.
> 
> E.g., some drivers are using <linux/kvm_host.h>:
> 
> #$ grep kvm_host.h drivers/ -Rn
> drivers/vfio/pci/vfio_pci_zdev.c:14:#include <linux/kvm_host.h>
> drivers/vfio/vfio_main.c:20:#include <linux/kvm_host.h>
> drivers/firmware/arm_sdei.c:19:#include <linux/kvm_host.h>
> drivers/hwtracing/coresight/coresight-trbe.c:20:#include <linux/kvm_host.h>
> drivers/hwtracing/coresight/coresight-etm4x-core.c:10:#include
> <linux/kvm_host.h>
> drivers/s390/crypto/vfio_ap_ops.c:17:#include <linux/kvm_host.h>
> drivers/s390/crypto/vfio_ap_private.h:20:#include <linux/kvm_host.h>
> 
> But looking at them, AFAICT what they need is only some structure declarations
> (e.g., 'struct kvm;') for type safety (plus some function declarations), but
> don't actually need to see the actual structure.

Ya.

> For x86, AFAICT there's (only) "arch/x86/events/intel/core.c" actually uses the
> 'struct kvm_pmu', though.

I have a patch to fix that :-)

https://lore.kernel.org/all/20260508231353.406465-7-seanjc@google.com

> I haven't checked other ARCHs whether there's cases actually need to use any
> structure.

PPC, arm64, and IIRC s390 all have assets defined by KVM that are consumed by
the kernel at-large.  E.g. because KVM for arm64 can't be built as a module, the
kernel calls directly into KVM during boot.  IIRC, PPC has similar code.

A few years ago (wow, time flies), I was able to hide KVM internals, using #ifdef
shenanigans to deal with cases where non-KVM really truly needed to get at things
defined in kvm_host.h

https://lore.kernel.org/all/20230916003118.2540661-27-seanjc@google.com

More recently, I tried to standardize KVM arch=>common includes[1], to help pave
the way to splitting up kvm_host.h, but then s390's crazy arm64 support killed
that (at least for now).

[1] https://lore.kernel.org/all/20250611001042.170501-1-seanjc@google.com
[2] https://lore.kernel.org/all/20260428160527.1378085-1-seiden@linux.ibm.com

> > I think it's probably time to admit I've been looking at the asm/kvm_host.h vs.
> > x86.h split all wrong, i.e. finally give up on moving structures out of kvm_host.h,
> > and do the exact opposite: commit to using kvm_host.h to define and declare widely
> > used structures.
> 
> If the structure(s) are only used within arch/x86/kvm/, it doesn't seem right to
> define them in asm/kvm_host.h?

The problem is that anything that feeds into kvm_vcpu_arch needs to be visible
to virt/kvm.  And burying kvm_x86_ops in arch/kvm/x86 would mean one-liners like
kvm_arch_vcpu_blocking() couldn't be inlined.

I've looked at this far too many times :-)

> > Because literally the only reason that x86.h doesn't include mmu.h is that mmu.h
> > references struct kvm_host, which is currently defined in x86.h.  
> > 
> 
> Yes. But I wouldn't worry about this too much since it's a small thing we can
> always find a way to fix.  E.g., we can move kvm_mmu_max_gfn() out of "mmu.h"
> (with a renaming perhaps).

I hacked on moving more stuff out of x86.{c,h} and kvm_host.h.  The diff stats
are quite promising :-)

 arch/x86/include/asm/kvm_host.h           |  444 ++-------------
 arch/x86/kvm/x86.c                        | 3784 +++-----------------------------------------------------------------------------------------------------------------------
 arch/x86/kvm/x86.h                        |  474 ++++++++--------

> > If we "fix"
> > that, then (a) we can make x86.h the "central" include everyone expects it to be,
> > and (b) it can be the start of a cleanup of asm/kvm_host.h and a big step towards
> > defining maintainable "rules" for what goes where.  E.g. there are a pile of
> > functional declarations in asm/kvm_host.h that can live elsewhere; if we trim
> > those down, then the rules become:
> > 
> >   - asm/kvm_host.h holds "common" structure definitions and associated key global
> >     variables, and things that are referenced by arch-neutral KVM.
> 
> It's a bit weird the arch-neutral KVM code needs to reference variables in
> asm/kvm_host.h, and I am afraid the "common" structure definitions will
> effectively be a lot of structures only used by arch/x86/kvm/.  
> 
> Which isn't necessarily a bad thing, from the perspective we might finally clean
> this up by a giant move.
> 
> E.g., <linux/kvm_types.h> is already used by other kernel components where they
> don't need <linux/kvm_host.h>.  Ideally, maybe eventually we can use
> <linux/kvm_types.h> and <asm/kvm_types.h> for things needed by other kernel
> components, or keep <linux/kvm_host.h> and <asm/kvm_host.h> minimal after moving
> majority things to some KVM internal headers.
> 
> E.g., maybe:
> 
>   virt/kvm/include/kvm_host.h
>   arch/x86/kvm/kvm_host.h (can even be merged to x86.h)
> 
> I think the problem is "struct kvm_arch" and "struct kvm_vcpu_arch", that they
> are not a pointer but a fully embedded structure in "struct kvm" and "struct
> kvm_vcpu" respectively.  That caused that you need to keep the actual structure
> definition of "struct kvm_arch" and "kvm_vcpu_arch" in asm/kvm_host.h, which in
> turns makes a lot of structures only used by arch/x86/kvm/ need to stay in
> asm/kvm_host.h.
> 
> I am not sure whether there's a mandatory requirement that "struct kvm_arch" and
> "struct kvm_vcpu_arch" must be fully embedded, and it would be kinda painful to
> covert to a pointer (e.g., there's kvm_x86_ops::vm_size), but perhaps that is
> also an option to consider?

The idea I had in the past, and where I was going with things before s390's love
for arm64 came along, was to add a kvm_arch.h in arch/<arch>/kvm, and have virt/kvm
include _that_ instead of kvm_host.h.  That way we don't need to make any fundamental
changes to structures, but we can still significantly cut down on what's exposed
via kvm_host.h.  At some point I'll try to take another look; it's really the
s390+arm64 combo that's problematic :-/

^ permalink raw reply

* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Huang, Kai @ 2026-05-20  0:59 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: dwmw2@infradead.org, Edgecombe, Rick P, x86@kernel.org,
	kas@kernel.org, binbin.wu@linux.intel.com,
	dave.hansen@linux.intel.com, vkuznets@redhat.com, paul@xen.org,
	yosry@kernel.org, pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <agx77QB3UVmJr5xP@google.com>

On Tue, 2026-05-19 at 08:04 -0700, Sean Christopherson wrote:
> On Tue, May 19, 2026, Kai Huang wrote:
> > > @@ -12712,19 +11913,8 @@ static void store_regs(struct kvm_vcpu *vcpu)
> > >  
> > >  static int sync_regs(struct kvm_vcpu *vcpu)
> > >  {
> > > -	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) {
> > > -		__set_regs(vcpu, &vcpu->run->s.regs.regs);
> > > -		vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS;
> > > -	}
> > > -
> > > -	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_SREGS) {
> > > -		struct kvm_sregs sregs = vcpu->run->s.regs.sregs;
> > > -
> > > -		if (__set_sregs(vcpu, &sregs))
> > > -			return -EINVAL;
> > > -
> > > -		vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_SREGS;
> > > -	}
> > > +	if (kvm_run_set_regs(vcpu))
> > > +		return -EINVAL;
> > 
> > Nit:
> > 
> > Do you think 'kvm_run_sync_regs()' is better than 'kvm_run_set_regs()'?
> > 
> > Because I think "sync" reflects better that vcpu->run->kvm_dirty_regs is cleared
> > after the "set" operation.
> 
> The problem I have with "sync" is that it doesn't communicate the direction of
> the sync.  What about kvm_run_sync_regs_{to,from}_user()?

Yeah that's better to me too.

> 
> > >  
> > >  	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_EVENTS) {
> > >  		struct kvm_vcpu_events events = vcpu->run->s.regs.events;
> > 
> > Also, I wonder whether it's better to add a helper for events so sync_regs() and
> > store_regs() can be simplified to:
> > 
> > static int sync_regs(struct kvm_vcpu *vcpu)
> > {
> > 	if (kvm_run_sync_regs(vcpu))
> > 		return -EINVAL;
> > 	return kvm_run_sync_events(vcpu);
> > }
> > 
> > static void store_regs(struct kvm_vcpu *vcpu)
> > {
> > 	kvm_run_get_regs(vcpu);
> > 	kvm_run_get_events(vcpu);
> > }
> > 
> > And maybe 'kvm_run_get_regs()' could be 'kvm_run_store_regs()' too , so that the
> > store_regs() could be:
> > 
> > static void store_regs(struct kvm_vcpu *vcpu)
> > {
> > 	kvm_run_store_regs(vcpu);
> > 	kvm_run_store_events(vcpu);
> > }
> 
> {store,sync}_regs() look pretty, but IMO the overall code is uglier because we
> end up with super small helpers that have one caller, e.g.
> 
>   static void kvm_run_sync_events_to_user(struct kvm_vcpu *vcpu)
>   {
> 	if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_EVENTS)
> 		kvm_vcpu_ioctl_x86_get_vcpu_events(vcpu, &vcpu->run->s.regs.events);
>   }
> 
>   static void store_regs(struct kvm_vcpu *vcpu)
>   {
> 	BUILD_BUG_ON(sizeof(struct kvm_sync_regs) > SYNC_REGS_SIZE_BYTES);
> 
> 	kvm_run_sync_regs_to_user(vcpu);
> 	kvm_run_sync_events_to_user(vcpu);
>   }
> 
> For me, the extra "jump" is undesirable, but it allows burying __{g,s}et_{s,}regs()
> in regs.c, and so is a net positive for registers.  But for events, it's pure
> overhead.

Sure.

Just wondering is it possible we might want to move events handling to some
other C file since you are cleanup x86.c?  But we can deal with this when it
happens.

> 
> > > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > > index 185062a26924..fd55cd031b1c 100644
> > > --- a/arch/x86/kvm/x86.h
> > > +++ b/arch/x86/kvm/x86.h
> > > @@ -414,6 +414,7 @@ int handle_ud(struct kvm_vcpu *vcpu);
> > >  
> > >  void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> > >  				   struct kvm_queued_exception *ex);
> > > +void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu);
> > >  
> > >  int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
> > >  int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
> > > @@ -604,6 +605,7 @@ static inline void kvm_machine_check(void)
> > >  int kvm_spec_ctrl_test_value(u64 value);
> > >  int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
> > >  			      struct x86_exception *e);
> > > +void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid);
> > > 
> > 
> > If I read correct, this is because "regs.c" calls kvm_invalidate_pcid() but you
> > want to keep it in x86.c.  But it seems the "x86.h" isn't included by "regs.c"
> > directly but via other headers ("mmu.h" does include "x86.h").
> > 
> > Should the "regs.c" include "x86.h" directly?
> 
> Oh, yeah, I just goofed that.
> 
> > Btw, I am a bit confused the relationship between "x86.h" and other headers like
> > "mmu.h" and the new "regs.h".  That is, headers like "mmu.h" include "x86.h",
> > but headers like "regs.h" do not (instead, "x86.h" includes them).
> 
> Heh, don't look for a theme/plan, because there isn't one.  Over the years, x86.h
> and x86.c became dumping grounds for everything that didn't have an obvious home,
> and so there aren't real "rules".

My guess too.

> 
> Hmm, though looking at all of this again, I think we're actually quite close to
> having somewhat sane rules.  Over the past few years, I've tried multiple times
> to move what I felt should be KVM-internal structures from asm/kvm_host.h to x86.h,
> and I've failed miserably every time because inevitably even the most innocuous
> struct manages to have usage that leads to cyclical header dependencies and/or is
> used by arch-neutral KVM code.

The problem is some other kernel code includes <linux/kvm_host.h> (which in turn
includes <asm/kvm_host.h>) but the KVM internal structures have nothing to do
with them.

E.g., some drivers are using <linux/kvm_host.h>:

#$ grep kvm_host.h drivers/ -Rn
drivers/vfio/pci/vfio_pci_zdev.c:14:#include <linux/kvm_host.h>
drivers/vfio/vfio_main.c:20:#include <linux/kvm_host.h>
drivers/firmware/arm_sdei.c:19:#include <linux/kvm_host.h>
drivers/hwtracing/coresight/coresight-trbe.c:20:#include <linux/kvm_host.h>
drivers/hwtracing/coresight/coresight-etm4x-core.c:10:#include
<linux/kvm_host.h>
drivers/s390/crypto/vfio_ap_ops.c:17:#include <linux/kvm_host.h>
drivers/s390/crypto/vfio_ap_private.h:20:#include <linux/kvm_host.h>

But looking at them, AFAICT what they need is only some structure declarations
(e.g., 'struct kvm;') for type safety (plus some function declarations), but
don't actually need to see the actual structure.

For x86, AFAICT there's (only) "arch/x86/events/intel/core.c" actually uses the
'struct kvm_pmu', though.  I haven't checked other ARCHs whether there's cases
actually need to use any structure.

> 
> I think it's probably time to admit I've been looking at the asm/kvm_host.h vs.
> x86.h split all wrong, i.e. finally give up on moving structures out of kvm_host.h,
> and do the exact opposite: commit to using kvm_host.h to define and declare widely
> used structures.

If the structure(s) are only used within arch/x86/kvm/, it doesn't seem right to
define them in asm/kvm_host.h?

> 
> Because literally the only reason that x86.h doesn't include mmu.h is that mmu.h
> references struct kvm_host, which is currently defined in x86.h.  
> 

Yes. But I wouldn't worry about this too much since it's a small thing we can
always find a way to fix.  E.g., we can move kvm_mmu_max_gfn() out of "mmu.h"
(with a renaming perhaps).

> If we "fix"
> that, then (a) we can make x86.h the "central" include everyone expects it to be,
> and (b) it can be the start of a cleanup of asm/kvm_host.h and a big step towards
> defining maintainable "rules" for what goes where.  E.g. there are a pile of
> functional declarations in asm/kvm_host.h that can live elsewhere; if we trim
> those down, then the rules become:
> 
>   - asm/kvm_host.h holds "common" structure definitions and associated key global
>     variables, and things that are referenced by arch-neutral KVM.

It's a bit weird the arch-neutral KVM code needs to reference variables in
asm/kvm_host.h, and I am afraid the "common" structure definitions will
effectively be a lot of structures only used by arch/x86/kvm/.  

Which isn't necessarily a bad thing, from the perspective we might finally clean
this up by a giant move.

E.g., <linux/kvm_types.h> is already used by other kernel components where they
don't need <linux/kvm_host.h>.  Ideally, maybe eventually we can use
<linux/kvm_types.h> and <asm/kvm_types.h> for things needed by other kernel
components, or keep <linux/kvm_host.h> and <asm/kvm_host.h> minimal after moving
majority things to some KVM internal headers.

E.g., maybe:

  virt/kvm/include/kvm_host.h
  arch/x86/kvm/kvm_host.h (can even be merged to x86.h)

I think the problem is "struct kvm_arch" and "struct kvm_vcpu_arch", that they
are not a pointer but a fully embedded structure in "struct kvm" and "struct
kvm_vcpu" respectively.  That caused that you need to keep the actual structure
definition of "struct kvm_arch" and "kvm_vcpu_arch" in asm/kvm_host.h, which in
turns makes a lot of structures only used by arch/x86/kvm/ need to stay in
asm/kvm_host.h.

I am not sure whether there's a mandatory requirement that "struct kvm_arch" and
"struct kvm_vcpu_arch" must be fully embedded, and it would be kinda painful to
covert to a pointer (e.g., there's kvm_x86_ops::vm_size), but perhaps that is
also an option to consider?

>   - <area>.{c,h} holds relevant declarations and definitions.
>   - x86.{c,h} is the kitchen sink for everything else.

Yeah the two are reasonable to me.

^ permalink raw reply

* SVSM Development Call May 20th, 2026
From: Jörg Rödel @ 2026-05-19 18:05 UTC (permalink / raw)
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this weeks SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

We will use the LF Zoom instance. Details of the meeting  can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg

^ permalink raw reply

* Re: [PATCH v9 14/23] x86/virt/seamldr: Shut down the current TDX module
From: Dave Hansen @ 2026-05-19 16:24 UTC (permalink / raw)
  To: Chao Gao, Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Li, Xiaoyao, Huang, Kai,
	Zhao, Yan Y, dave.hansen@linux.intel.com, kas@kernel.org,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Verma, Vishal L, nik.borisov@suse.com,
	mingo@redhat.com, Weiny, Ira, tony.lindgren@linux.intel.com,
	Annapurve, Vishal, Shahar, Sagi, djbw@kernel.org, tglx@kernel.org,
	paulmck@kernel.org, hpa@zytor.com, bp@alien8.de,
	yilun.xu@linux.intel.com, x86@kernel.org
In-Reply-To: <agxSAsUvgcHj/Ywl@intel.com>

On 5/19/26 05:05, Chao Gao wrote:
>>  Why not just WARN_ON_ONCE(get_tdx_sys_info_handoff(&handoff));
>>  And we can drop the ret var. Save 2 LOC.
> Dave had a different preference here:
> 
> https://lore.kernel.org/kvm/8b9d7fa7-6534-48e7-a4fa-c21260b1c762@intel.com/

I almost never optimize for lines of code.

The _only_ reason to worry about it is when you have a chunk of logic
that's having issues fitting on a "screen". There, squishing a few lines
together can mean the difference between seeing a whole loop on one
screen or having to page around.

But, at the point you're doing *that*, you probably need to think about
refactoring anyway.

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 16:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Mostafa Saleh, iommu, linux-arm-kernel, linux-kernel, linux-coco,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5a8q9fs7ud.fsf@kernel.org>

On Tue, May 19, 2026 at 09:35:30PM +0530, Aneesh Kumar K.V wrote:
> Yes, that also resulted in simpler and cleaner code.
> 
> swiotlb_tbl_map_single
> 	/*
> 	 * If the physical address is encrypted but the device requires
> 	 * decrypted DMA, use a decrypted io_tlb_mem and update the
> 	 * attributes so the caller knows that a decrypted io_tlb_mem
> 	 * was used.
> 	 */
> 	if (!(*attrs & DMA_ATTR_CC_SHARED) && force_dma_unencrypted(dev))
> 		*attrs |= DMA_ATTR_CC_SHARED;
> 
> 	if (mem->unencrypted != !!(*attrs & DMA_ATTR_CC_SHARED))
> 		return (phys_addr_t)DMA_MAPPING_ERROR;

Yeah, exactly that is so much clearer now that the mem->unecrypted is
tied directly.

That logic is reversed though, the incoming ATTR_CC doesn't matter for
swiotlb, that is just the source of the memcpy.

/* swiotlb pool is incorrect for this device */
if (mem->unencrypted != force_dma_unencrypted(dev))
    return (phys_addr_t)DMA_MAPPING_ERROR;

/* Force attrs to match the kind of memory in the pool */
if (mem->unencrypted)
     *attrs |= DMA_ATTR_CC_SHARED;
else
     *attrs &= ~DMA_ATTR_CC_SHARED;


Attrs should be forced to whatever memory swiotlb selected.

Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 16:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mostafa Saleh, iommu, linux-arm-kernel, linux-kernel, linux-coco,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260519152741.GM7702@ziepe.ca>

Jason Gunthorpe <jgg@ziepe.ca> writes:

> On Tue, May 19, 2026 at 08:37:54PM +0530, Aneesh Kumar K.V wrote:
>
>> if we get force_dma_unencrypted(dev) correct, we won't need the above.
>> 
>> for dma_direct_alloc and dma_direct_alloc_pages() we have
>> 
>> 	if (force_dma_unencrypted(dev))
>> 		attrs |= DMA_ATTR_CC_SHARED;
>> 
>> 
>> for dma_direct_map_phys(), if we have swiotlb bouncing forced,
>> 
>> swiotlb_tbl_map_single():
>> 
>> 	if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
>> 		require_decrypted = true;
>
> IMHO I really do prefer the DMA_ATTR_CC_SHARED flows closer to the
> thing that did the decryption. While the above is possibly sound it is
> very obtuse to be guessing what kind of memory swiotlb decided to
> return..
>
> Can we pass a pointer to the attrs into the swiotlb stuff and it can
> update it based on the kind of memory it has allocated?
>

Yes, that also resulted in simpler and cleaner code.

swiotlb_tbl_map_single
	/*
	 * If the physical address is encrypted but the device requires
	 * decrypted DMA, use a decrypted io_tlb_mem and update the
	 * attributes so the caller knows that a decrypted io_tlb_mem
	 * was used.
	 */
	if (!(*attrs & DMA_ATTR_CC_SHARED) && force_dma_unencrypted(dev))
		*attrs |= DMA_ATTR_CC_SHARED;

	if (mem->unencrypted != !!(*attrs & DMA_ATTR_CC_SHARED))
		return (phys_addr_t)DMA_MAPPING_ERROR;

and

@@ -1640,19 +1654,14 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
 
 	trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size);
 
-	swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, attrs);
+	swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, &attrs);
 	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
 		return DMA_MAPPING_ERROR;
 
-	/*
-	 * Use the allocated io_tlb_mem encryption type to determine dma addr.
-	 */
-	if (dev->dma_io_tlb_mem->unencrypted) {
+	if (attrs & DMA_ATTR_CC_SHARED)
 		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
-		attrs |= DMA_ATTR_CC_SHARED;
-	} else {
+	else
 		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
-	}
 
 	if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs))) {
 		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 15:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Mostafa Saleh, iommu, linux-arm-kernel, linux-kernel, linux-coco,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5abjebsaid.fsf@kernel.org>

On Tue, May 19, 2026 at 08:37:54PM +0530, Aneesh Kumar K.V wrote:

> if we get force_dma_unencrypted(dev) correct, we won't need the above.
> 
> for dma_direct_alloc and dma_direct_alloc_pages() we have
> 
> 	if (force_dma_unencrypted(dev))
> 		attrs |= DMA_ATTR_CC_SHARED;
> 
> 
> for dma_direct_map_phys(), if we have swiotlb bouncing forced,
> 
> swiotlb_tbl_map_single():
> 
> 	if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
> 		require_decrypted = true;

IMHO I really do prefer the DMA_ATTR_CC_SHARED flows closer to the
thing that did the decryption. While the above is possibly sound it is
very obtuse to be guessing what kind of memory swiotlb decided to
return..

Can we pass a pointer to the attrs into the swiotlb stuff and it can
update it based on the kind of memory it has allocated?

Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 15:07 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5amrxvshxg.fsf@kernel.org>

Aneesh Kumar K.V <aneesh.kumar@kernel.org> writes:

> Mostafa Saleh <smostafa@google.com> writes:
>
>> On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
>>> >> 
>>> >> What I meant was that we need a generic way to identify a pKVM guest, so
>>> >> that we can use it in the conditional above.
>>> >
>>> > I have this patch, with that I can boot with your series unmodified,
>>> > but I will need to do more testing.
>>> >
>>> 
>>> Thanks, I can add this to the series once you complete the required testing.
>>> 
>>
>> I am still running more tests, but looking more into it. Setting
>> force_dma_unencrypted() to true for pKVM guests is wrong, as the
>> guest shouldn’t try to decrypt arbitrary memory as it can include
>> sensitive information (for example in case of virtio sub-page
>> allocation) and should strictly rely on the restricted-dma-pool
>> for that.
>>
>> However, with my patch and setting force_dma_unencrypted() to false
>> on top of this series, it fails on pKVM due to a missing shared
>> attribute as Alexey mentioned, as now SWIOTLB rejects non shared
>> attrs, so, the DMA-API has to pass it. With that, I can boot again:
>>
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index 5103a04df99f..b19aeec03f27 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -286,6 +286,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>  	}
>>  
>>  	if (is_swiotlb_for_alloc(dev)) {
>> +		attrs |= DMA_ATTR_CC_SHARED;
>> +
>>  		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>>  		if (page) {
>>  			/*
>> @@ -449,6 +451,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>>  						  &cpu_addr, gfp, attrs);
>>  
>>  	if (is_swiotlb_for_alloc(dev)) {
>> +		attrs |= DMA_ATTR_CC_SHARED;
>> +
>>  		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>>  		if (!page)
>>  			return NULL;
>> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
>> index 4e35264ab6f8..8ee5bbf78cfb 100644
>> --- a/kernel/dma/direct.h
>> +++ b/kernel/dma/direct.h
>> @@ -92,6 +92,7 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>>  		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
>>  			return DMA_MAPPING_ERROR;
>>  
>> +		attrs |= DMA_ATTR_CC_SHARED;
>>  		return swiotlb_map(dev, phys, size, dir, attrs);
>>  	}
>>  
>> --
>>
>
> How about the below?
>
> modified   kernel/dma/direct.c
> @@ -278,6 +278,10 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	}
>  
>  	if (is_swiotlb_for_alloc(dev)) {
> +
> +		if (dev->dma_io_tlb_mem->unencrypted)
> +			attrs |= DMA_ATTR_CC_SHARED;
> +
>  		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>  		if (page) {
>  			/*
> @@ -451,6 +455,10 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  						  &cpu_addr, gfp, attrs);
>  
>  	if (is_swiotlb_for_alloc(dev)) {
> +
> +		if (dev->dma_io_tlb_mem->unencrypted)
> +			attrs |= DMA_ATTR_CC_SHARED;
> +
>  		page = dma_direct_alloc_swiotlb(dev, size, attrs);
>  		if (!page)
>  			return NULL;
> modified   kernel/dma/direct.h
> @@ -92,6 +92,9 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>  		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
>  			return DMA_MAPPING_ERROR;
>  
> +		if (dev->dma_io_tlb_mem->unencrypted)
> +			attrs |= DMA_ATTR_CC_SHARED;
> +
>  		return swiotlb_map(dev, phys, size, dir, attrs);
>  	}
>  
>
>

if we get force_dma_unencrypted(dev) correct, we won't need the above.

for dma_direct_alloc and dma_direct_alloc_pages() we have

	if (force_dma_unencrypted(dev))
		attrs |= DMA_ATTR_CC_SHARED;


for dma_direct_map_phys(), if we have swiotlb bouncing forced,

swiotlb_tbl_map_single():

	if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
		require_decrypted = true;

-aneesh

^ permalink raw reply

* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Sean Christopherson @ 2026-05-19 15:04 UTC (permalink / raw)
  To: Kai Huang
  Cc: pbonzini@redhat.com, kas@kernel.org, vkuznets@redhat.com,
	dwmw2@infradead.org, paul@xen.org, Rick P Edgecombe,
	x86@kernel.org, binbin.wu@linux.intel.com,
	dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
	yosry@kernel.org, kvm@vger.kernel.org, linux-coco@lists.linux.dev
In-Reply-To: <074e209fe4c78a735868099f13d76b9dde2c88e3.camel@intel.com>

On Tue, May 19, 2026, Kai Huang wrote:
> > @@ -12712,19 +11913,8 @@ static void store_regs(struct kvm_vcpu *vcpu)
> >  
> >  static int sync_regs(struct kvm_vcpu *vcpu)
> >  {
> > -	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) {
> > -		__set_regs(vcpu, &vcpu->run->s.regs.regs);
> > -		vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS;
> > -	}
> > -
> > -	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_SREGS) {
> > -		struct kvm_sregs sregs = vcpu->run->s.regs.sregs;
> > -
> > -		if (__set_sregs(vcpu, &sregs))
> > -			return -EINVAL;
> > -
> > -		vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_SREGS;
> > -	}
> > +	if (kvm_run_set_regs(vcpu))
> > +		return -EINVAL;
> 
> Nit:
> 
> Do you think 'kvm_run_sync_regs()' is better than 'kvm_run_set_regs()'?
> 
> Because I think "sync" reflects better that vcpu->run->kvm_dirty_regs is cleared
> after the "set" operation.

The problem I have with "sync" is that it doesn't communicate the direction of
the sync.  What about kvm_run_sync_regs_{to,from}_user()?

> >  
> >  	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_EVENTS) {
> >  		struct kvm_vcpu_events events = vcpu->run->s.regs.events;
> 
> Also, I wonder whether it's better to add a helper for events so sync_regs() and
> store_regs() can be simplified to:
> 
> static int sync_regs(struct kvm_vcpu *vcpu)
> {
> 	if (kvm_run_sync_regs(vcpu))
> 		return -EINVAL;
> 	return kvm_run_sync_events(vcpu);
> }
> 
> static void store_regs(struct kvm_vcpu *vcpu)
> {
> 	kvm_run_get_regs(vcpu);
> 	kvm_run_get_events(vcpu);
> }
> 
> And maybe 'kvm_run_get_regs()' could be 'kvm_run_store_regs()' too , so that the
> store_regs() could be:
> 
> static void store_regs(struct kvm_vcpu *vcpu)
> {
> 	kvm_run_store_regs(vcpu);
> 	kvm_run_store_events(vcpu);
> }

{store,sync}_regs() look pretty, but IMO the overall code is uglier because we
end up with super small helpers that have one caller, e.g.

  static void kvm_run_sync_events_to_user(struct kvm_vcpu *vcpu)
  {
	if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_EVENTS)
		kvm_vcpu_ioctl_x86_get_vcpu_events(vcpu, &vcpu->run->s.regs.events);
  }

  static void store_regs(struct kvm_vcpu *vcpu)
  {
	BUILD_BUG_ON(sizeof(struct kvm_sync_regs) > SYNC_REGS_SIZE_BYTES);

	kvm_run_sync_regs_to_user(vcpu);
	kvm_run_sync_events_to_user(vcpu);
  }

For me, the extra "jump" is undesirable, but it allows burying __{g,s}et_{s,}regs()
in regs.c, and so is a net positive for registers.  But for events, it's pure
overhead.

> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index 185062a26924..fd55cd031b1c 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -414,6 +414,7 @@ int handle_ud(struct kvm_vcpu *vcpu);
> >  
> >  void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> >  				   struct kvm_queued_exception *ex);
> > +void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu);
> >  
> >  int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
> >  int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
> > @@ -604,6 +605,7 @@ static inline void kvm_machine_check(void)
> >  int kvm_spec_ctrl_test_value(u64 value);
> >  int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
> >  			      struct x86_exception *e);
> > +void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid);
> > 
> 
> If I read correct, this is because "regs.c" calls kvm_invalidate_pcid() but you
> want to keep it in x86.c.  But it seems the "x86.h" isn't included by "regs.c"
> directly but via other headers ("mmu.h" does include "x86.h").
> 
> Should the "regs.c" include "x86.h" directly?

Oh, yeah, I just goofed that.

> Btw, I am a bit confused the relationship between "x86.h" and other headers like
> "mmu.h" and the new "regs.h".  That is, headers like "mmu.h" include "x86.h",
> but headers like "regs.h" do not (instead, "x86.h" includes them).

Heh, don't look for a theme/plan, because there isn't one.  Over the years, x86.h
and x86.c became dumping grounds for everything that didn't have an obvious home,
and so there aren't real "rules".

Hmm, though looking at all of this again, I think we're actually quite close to
having somewhat sane rules.  Over the past few years, I've tried multiple times
to move what I felt should be KVM-internal structures from asm/kvm_host.h to x86.h,
and I've failed miserably every time because inevitably even the most innocuous
struct manages to have usage that leads to cyclical header dependencies and/or is
used by arch-neutral KVM code.

I think it's probably time to admit I've been looking at the asm/kvm_host.h vs.
x86.h split all wrong, i.e. finally give up on moving structures out of kvm_host.h,
and do the exact opposite: commit to using kvm_host.h to define and declare widely
used structures.

Because literally the only reason that x86.h doesn't include mmu.h is that mmu.h
references struct kvm_host, which is currently defined in x86.h.  If we "fix"
that, then (a) we can make x86.h the "central" include everyone expects it to be,
and (b) it can be the start of a cleanup of asm/kvm_host.h and a big step towards
defining maintainable "rules" for what goes where.  E.g. there are a pile of
functional declarations in asm/kvm_host.h that can live elsewhere; if we trim
those down, then the rules become:

  - asm/kvm_host.h holds "common" structure definitions and associated key global
    variables, and things that are referenced by arch-neutral KVM.
  - <area>.{c,h} holds relevant declarations and definitions.
  - x86.{c,h} is the kitchen sink for everything else.

E.g. this compiles for at least one config:

---
 arch/x86/include/asm/kvm_host.h | 50 ++++++++++++++++++++++++----
 arch/x86/kvm/mmu.h              | 20 +++++++----
 arch/x86/kvm/x86.h              | 59 ++++-----------------------------
 3 files changed, 64 insertions(+), 65 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5e24987b2a94..67ba8bf22469 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -313,6 +313,50 @@ enum x86_intercept_stage;
 struct kvm_kernel_irqfd;
 struct kvm_kernel_irq_routing_entry;
 
+struct kvm_caps {
+	/* control of guest tsc rate supported? */
+	bool has_tsc_control;
+	/* maximum supported tsc_khz for guests */
+	u32  max_guest_tsc_khz;
+	/* number of bits of the fractional part of the TSC scaling ratio */
+	u8   tsc_scaling_ratio_frac_bits;
+	/* maximum allowed value of TSC scaling ratio */
+	u64  max_tsc_scaling_ratio;
+	/* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
+	u64  default_tsc_scaling_ratio;
+	/* bus lock detection supported? */
+	bool has_bus_lock_exit;
+	/* notify VM exit supported? */
+	bool has_notify_vmexit;
+	/* bit mask of VM types */
+	u32 supported_vm_types;
+
+	u64 supported_mce_cap;
+	u64 supported_xcr0;
+	u64 supported_xss;
+	u64 supported_perf_cap;
+
+	u64 supported_quirks;
+	u64 inapplicable_quirks;
+};
+extern struct kvm_caps kvm_caps;
+
+struct kvm_host_values {
+	/*
+	 * The host's raw MAXPHYADDR, i.e. the number of non-reserved physical
+	 * address bits irrespective of features that repurpose legal bits,
+	 * e.g. MKTME.
+	 */
+	u8 maxphyaddr;
+
+	u64 efer;
+	u64 xcr0;
+	u64 xss;
+	u64 s_cet;
+	u64 arch_capabilities;
+};
+extern struct kvm_host_values kvm_host;
+
 /*
  * kvm_mmu_page_role tracks the properties of a shadow page (where shadow page
  * also includes TDP pages) to determine whether or not a page can be used in
@@ -2056,10 +2100,6 @@ struct kvm_arch_async_pf {
 	u64 error_code;
 };
 
-extern u32 __read_mostly kvm_nr_uret_msrs;
-extern bool __read_mostly allow_smaller_maxphyaddr;
-extern bool __read_mostly enable_apicv;
-extern bool __read_mostly enable_ipiv;
 extern bool __read_mostly enable_device_posted_irqs;
 extern struct kvm_x86_ops kvm_x86_ops;
 
@@ -2151,8 +2191,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);
 
-extern bool tdp_enabled;
-
 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 
 /*
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e1bb663ebbd5..d841a4f486d1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -4,9 +4,14 @@
 
 #include <linux/kvm_host.h>
 #include "regs.h"
-#include "x86.h"
 #include "cpuid.h"
 
+extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif
 extern bool __read_mostly enable_mmio_caching;
 
 #define PT_WRITABLE_SHIFT 1
@@ -261,14 +266,10 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
 	return smp_load_acquire(&kvm->arch.shadow_root_allocated);
 }
 
-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
 int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
 
+bool kvm_mmu_is_mappable_memslot(const struct kvm_memory_slot *slot);
+
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
 	return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
@@ -300,6 +301,11 @@ static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count)
 	atomic64_add(count, &kvm->stat.pages[level - 1]);
 }
 
+static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
+}
+
 static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
 				      struct kvm_mmu *mmu,
 				      gpa_t gpa, u64 access,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index fd55cd031b1c..40c6f4c54f8e 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -6,53 +6,19 @@
 #include <asm/fpu/xstate.h>
 #include <asm/mce.h>
 #include <asm/pvclock.h>
+#include "mmu.h"
 #include "regs.h"
 #include "kvm_emulate.h"
 #include "cpuid.h"
 
 #define KVM_MAX_MCE_BANKS 32
 
-struct kvm_caps {
-	/* control of guest tsc rate supported? */
-	bool has_tsc_control;
-	/* maximum supported tsc_khz for guests */
-	u32  max_guest_tsc_khz;
-	/* number of bits of the fractional part of the TSC scaling ratio */
-	u8   tsc_scaling_ratio_frac_bits;
-	/* maximum allowed value of TSC scaling ratio */
-	u64  max_tsc_scaling_ratio;
-	/* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
-	u64  default_tsc_scaling_ratio;
-	/* bus lock detection supported? */
-	bool has_bus_lock_exit;
-	/* notify VM exit supported? */
-	bool has_notify_vmexit;
-	/* bit mask of VM types */
-	u32 supported_vm_types;
-
-	u64 supported_mce_cap;
-	u64 supported_xcr0;
-	u64 supported_xss;
-	u64 supported_perf_cap;
-
-	u64 supported_quirks;
-	u64 inapplicable_quirks;
-};
-
-struct kvm_host_values {
-	/*
-	 * The host's raw MAXPHYADDR, i.e. the number of non-reserved physical
-	 * address bits irrespective of features that repurpose legal bits,
-	 * e.g. MKTME.
-	 */
-	u8 maxphyaddr;
-
-	u64 efer;
-	u64 xcr0;
-	u64 xss;
-	u64 s_cet;
-	u64 arch_capabilities;
-};
+extern u32 __read_mostly kvm_nr_uret_msrs;
+extern bool __read_mostly allow_smaller_maxphyaddr;
+extern bool __read_mostly enable_apicv;
+extern bool __read_mostly enable_ipiv;
+extern bool enable_pmu;
+extern bool enable_mediated_pmu;
 
 void kvm_spurious_fault(void);
 
@@ -252,11 +218,6 @@ static inline bool x86_exception_has_error_code(unsigned int vector)
 	return (1U << vector) & exception_has_error_code;
 }
 
-static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
-{
-	return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
-}
-
 static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
 {
 	return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48;
@@ -428,12 +389,6 @@ fastpath_t handle_fastpath_wrmsr_imm(struct kvm_vcpu *vcpu, u32 msr, int reg);
 fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu);
 fastpath_t handle_fastpath_invd(struct kvm_vcpu *vcpu);
 
-extern struct kvm_caps kvm_caps;
-extern struct kvm_host_values kvm_host;
-
-extern bool enable_pmu;
-extern bool enable_mediated_pmu;
-
 void kvm_setup_xss_caps(void);
 
 /*

base-commit: b99808a11a42edc2cecced7adf57c2ac231bdb68
-- 

^ permalink raw reply related

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 14:49 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agx3j6Oc8QivZ3RG@google.com>

On Tue, May 19, 2026 at 02:45:35PM +0000, Mostafa Saleh wrote:
> However, it should not alway use SWIOTLB? It can trigger decryption for
> any memory returned from __dma_direct_alloc_pages() which can come
> from alloc_pages_node().

The alloc coherent flow is seperate and different, these are not pages
'passed into the DMA API' but pages fully allocated internally and
owned by it.

Yes, it should cause decrypted *allocation*.

Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 14:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260519143529.GD7702@ziepe.ca>

On Tue, May 19, 2026 at 11:35:29AM -0300, Jason Gunthorpe wrote:
> On Tue, May 19, 2026 at 01:41:42PM +0000, Mostafa Saleh wrote:
> > On Tue, May 19, 2026 at 10:29:11AM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
> > > > On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> > > > > >> 
> > > > > >> What I meant was that we need a generic way to identify a pKVM guest, so
> > > > > >> that we can use it in the conditional above.
> > > > > >
> > > > > > I have this patch, with that I can boot with your series unmodified,
> > > > > > but I will need to do more testing.
> > > > > >
> > > > > 
> > > > > Thanks, I can add this to the series once you complete the required testing.
> > > > > 
> > > > 
> > > > I am still running more tests, but looking more into it. Setting
> > > > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> > > > guest shouldn’t try to decrypt arbitrary memory as it can include
> > > > sensitive information (for example in case of virtio sub-page
> > > > allocation) and should strictly rely on the restricted-dma-pool
> > > > for that.
> > > 
> > > ??
> > > 
> > > Where does force_dma_unencrypted() cause arbitary memory passed into
> > > the DMA API to be decrypted? That should never happen???
> > 
> > Sorry, maybe arbitrary is not the right expression again :)
> > I mean that, with emulated devices that use the DMA-API under pKVM,
> > they will map memory coming from other layers (VFS, net) through
> > vitrio-block, virtio-net... These can be smaller than a page, and
> > using force_dma_unencrypted() will share the whole page.
> 
> force_dma_unencrypted() should only trigger swiotlb and that never
> memcpy's more than necessary?
> 
> Where does it do otherwise? That sounds like a bug?

Agh, I got confused and thought that it can be triggered from dma_map()
too. I need to figure out why that made pKVM guests boot with broken
restricted-dma-pool then.

However, it should not alway use SWIOTLB? It can trigger decryption for
any memory returned from __dma_direct_alloc_pages() which can come
from alloc_pages_node().

Thanks,
Mostafa


> 
> Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 14:37 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxzanDBmIP54hUz@google.com>

On Tue, May 19, 2026 at 02:27:54PM +0000, Mostafa Saleh wrote:
> However, as I mentioned to Jason, I think with some tweaks to
> force_dma_unencrypted() we can make it work under pKVM for aligned
> memory which eliminates some of the bouncing.
> I am currently investigating that.

force_dma_unencrypted() literally means that memory passed into the
DMA API *without* DMA_ATTR_CC_SHARED cannot be DMA'd from.

It should not mean anything else. The DMA API should never decrypt
passed in memory. You always have to bounce.

Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 14:35 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxolksC_1nfO34X@google.com>

On Tue, May 19, 2026 at 01:41:42PM +0000, Mostafa Saleh wrote:
> On Tue, May 19, 2026 at 10:29:11AM -0300, Jason Gunthorpe wrote:
> > On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
> > > On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> > > > >> 
> > > > >> What I meant was that we need a generic way to identify a pKVM guest, so
> > > > >> that we can use it in the conditional above.
> > > > >
> > > > > I have this patch, with that I can boot with your series unmodified,
> > > > > but I will need to do more testing.
> > > > >
> > > > 
> > > > Thanks, I can add this to the series once you complete the required testing.
> > > > 
> > > 
> > > I am still running more tests, but looking more into it. Setting
> > > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> > > guest shouldn’t try to decrypt arbitrary memory as it can include
> > > sensitive information (for example in case of virtio sub-page
> > > allocation) and should strictly rely on the restricted-dma-pool
> > > for that.
> > 
> > ??
> > 
> > Where does force_dma_unencrypted() cause arbitary memory passed into
> > the DMA API to be decrypted? That should never happen???
> 
> Sorry, maybe arbitrary is not the right expression again :)
> I mean that, with emulated devices that use the DMA-API under pKVM,
> they will map memory coming from other layers (VFS, net) through
> vitrio-block, virtio-net... These can be smaller than a page, and
> using force_dma_unencrypted() will share the whole page.

force_dma_unencrypted() should only trigger swiotlb and that never
memcpy's more than necessary?

Where does it do otherwise? That sounds like a bug?

Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 14:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jason Gunthorpe, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5aecj7sctv.fsf@kernel.org>

On Tue, May 19, 2026 at 07:47:48PM +0530, Aneesh Kumar K.V wrote:
> Mostafa Saleh <smostafa@google.com> writes:
> 
> > On Tue, May 19, 2026 at 07:30:16PM +0530, Aneesh Kumar K.V wrote:
> >> Mostafa Saleh <smostafa@google.com> writes:
> >> 
> >> >> > 
> >> >> > I am still running more tests, but looking more into it. Setting
> >> >> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> >> >> > guest shouldn’t try to decrypt arbitrary memory as it can include
> >> >> > sensitive information (for example in case of virtio sub-page
> >> >> > allocation) and should strictly rely on the restricted-dma-pool
> >> >> > for that.
> >> >> 
> >> >> ??
> >> >> 
> >> >> Where does force_dma_unencrypted() cause arbitary memory passed into
> >> >> the DMA API to be decrypted? That should never happen???
> >> >
> >> > Sorry, maybe arbitrary is not the right expression again :)
> >> > I mean that, with emulated devices that use the DMA-API under pKVM,
> >> > they will map memory coming from other layers (VFS, net) through
> >> > vitrio-block, virtio-net... These can be smaller than a page, and
> >> >
> >> 
> >> Don't we PAGE_ALIGN these requests?
> >> 
> >> dma_direct_alloc
> >> 	size = PAGE_ALIGN(size);
> >> 
> >> iommu_dma_alloc_pages
> >> 	size_t alloc_size = PAGE_ALIGN(size);
> >> 
> >> 
> >
> > For allocation, yes, and that's fine because we bring memory from
> > the pool.
> > But not for mapping, as dma_direct_map_phys(), where the memory is
> > allocated from the driver or other parts in the kernel and the page
> > may be shared with other kernel components.
> >
> 
> But if we are using restricted-dma-pool, we also have:
> 
> mem->force_bounce = true;
> mem->for_alloc = true;
> 
> So, will we use the swiotlb buffers for mapping and copy only the shared
> content into those swiotlb buffers?

True, that's why under pKVM, force_dma_unencrypted() should never
cause any memory to be decrypted and so we set it to false.
As in case of any bugs, the guest does not leak any information,
similar to what just happened initially here due to missing attrs.

However, as I mentioned to Jason, I think with some tweaks to
force_dma_unencrypted() we can make it work under pKVM for aligned
memory which eliminates some of the bouncing.
I am currently investigating that.

Thanks,
Mostafa

> 
> -aneesh

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 14:17 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Jason Gunthorpe, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxt7SFGT7OLMIah@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Tue, May 19, 2026 at 07:30:16PM +0530, Aneesh Kumar K.V wrote:
>> Mostafa Saleh <smostafa@google.com> writes:
>> 
>> >> > 
>> >> > I am still running more tests, but looking more into it. Setting
>> >> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
>> >> > guest shouldn’t try to decrypt arbitrary memory as it can include
>> >> > sensitive information (for example in case of virtio sub-page
>> >> > allocation) and should strictly rely on the restricted-dma-pool
>> >> > for that.
>> >> 
>> >> ??
>> >> 
>> >> Where does force_dma_unencrypted() cause arbitary memory passed into
>> >> the DMA API to be decrypted? That should never happen???
>> >
>> > Sorry, maybe arbitrary is not the right expression again :)
>> > I mean that, with emulated devices that use the DMA-API under pKVM,
>> > they will map memory coming from other layers (VFS, net) through
>> > vitrio-block, virtio-net... These can be smaller than a page, and
>> >
>> 
>> Don't we PAGE_ALIGN these requests?
>> 
>> dma_direct_alloc
>> 	size = PAGE_ALIGN(size);
>> 
>> iommu_dma_alloc_pages
>> 	size_t alloc_size = PAGE_ALIGN(size);
>> 
>> 
>
> For allocation, yes, and that's fine because we bring memory from
> the pool.
> But not for mapping, as dma_direct_map_phys(), where the memory is
> allocated from the driver or other parts in the kernel and the page
> may be shared with other kernel components.
>

But if we are using restricted-dma-pool, we also have:

mem->force_bounce = true;
mem->for_alloc = true;

So, will we use the swiotlb buffers for mapping and copy only the shared
content into those swiotlb buffers?

-aneesh

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 14:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jason Gunthorpe, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <yq5ah5o3sdn3.fsf@kernel.org>

On Tue, May 19, 2026 at 07:30:16PM +0530, Aneesh Kumar K.V wrote:
> Mostafa Saleh <smostafa@google.com> writes:
> 
> >> > 
> >> > I am still running more tests, but looking more into it. Setting
> >> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> >> > guest shouldn’t try to decrypt arbitrary memory as it can include
> >> > sensitive information (for example in case of virtio sub-page
> >> > allocation) and should strictly rely on the restricted-dma-pool
> >> > for that.
> >> 
> >> ??
> >> 
> >> Where does force_dma_unencrypted() cause arbitary memory passed into
> >> the DMA API to be decrypted? That should never happen???
> >
> > Sorry, maybe arbitrary is not the right expression again :)
> > I mean that, with emulated devices that use the DMA-API under pKVM,
> > they will map memory coming from other layers (VFS, net) through
> > vitrio-block, virtio-net... These can be smaller than a page, and
> >
> 
> Don't we PAGE_ALIGN these requests?
> 
> dma_direct_alloc
> 	size = PAGE_ALIGN(size);
> 
> iommu_dma_alloc_pages
> 	size_t alloc_size = PAGE_ALIGN(size);
> 
> 

For allocation, yes, and that's fine because we bring memory from
the pool.
But not for mapping, as dma_direct_map_phys(), where the memory is
allocated from the driver or other parts in the kernel and the page
may be shared with other kernel components.

Thanks,
Mostafa

> 
> > using force_dma_unencrypted() will share the whole page.
> > And as discussed, that leaks sensitive information to the untrusted
> > host.
> > I am currently investigating passing iova/phys/size
> > to force_dma_unencrypted() and then we can share pages inplace only
> > if possible without leaking extra information.
> > I am trying to get some performance results first. But the tricky part
> > is to get the semantics right, I believe in that case those devices
> > shouldn’t use restricted-dma-pools as those should always force
> > bouncing. Instead bouncing happens through the default SWIOTLB pool,
> > if not possible to decrypt in place.
> >
> 
> -aneesh

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-19 14:00 UTC (permalink / raw)
  To: Mostafa Saleh, Jason Gunthorpe
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun, linuxppc-dev,
	linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxolksC_1nfO34X@google.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Tue, May 19, 2026 at 10:29:11AM -0300, Jason Gunthorpe wrote:
>> On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
>> > On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
>> > > >> 
>> > > >> What I meant was that we need a generic way to identify a pKVM guest, so
>> > > >> that we can use it in the conditional above.
>> > > >
>> > > > I have this patch, with that I can boot with your series unmodified,
>> > > > but I will need to do more testing.
>> > > >
>> > > 
>> > > Thanks, I can add this to the series once you complete the required testing.
>> > > 
>> > 
>> > I am still running more tests, but looking more into it. Setting
>> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
>> > guest shouldn’t try to decrypt arbitrary memory as it can include
>> > sensitive information (for example in case of virtio sub-page
>> > allocation) and should strictly rely on the restricted-dma-pool
>> > for that.
>> 
>> ??
>> 
>> Where does force_dma_unencrypted() cause arbitary memory passed into
>> the DMA API to be decrypted? That should never happen???
>
> Sorry, maybe arbitrary is not the right expression again :)
> I mean that, with emulated devices that use the DMA-API under pKVM,
> they will map memory coming from other layers (VFS, net) through
> vitrio-block, virtio-net... These can be smaller than a page, and
>

Don't we PAGE_ALIGN these requests?

dma_direct_alloc
	size = PAGE_ALIGN(size);

iommu_dma_alloc_pages
	size_t alloc_size = PAGE_ALIGN(size);



> using force_dma_unencrypted() will share the whole page.
> And as discussed, that leaks sensitive information to the untrusted
> host.
> I am currently investigating passing iova/phys/size
> to force_dma_unencrypted() and then we can share pages inplace only
> if possible without leaking extra information.
> I am trying to get some performance results first. But the tricky part
> is to get the semantics right, I believe in that case those devices
> shouldn’t use restricted-dma-pools as those should always force
> bouncing. Instead bouncing happens through the default SWIOTLB pool,
> if not possible to decrypt in place.
>

-aneesh

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-19 13:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Aneesh Kumar K.V, iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260519132911.GA7702@ziepe.ca>

On Tue, May 19, 2026 at 10:29:11AM -0300, Jason Gunthorpe wrote:
> On Tue, May 19, 2026 at 11:04:37AM +0000, Mostafa Saleh wrote:
> > On Thu, May 14, 2026 at 08:13:25PM +0530, Aneesh Kumar K.V wrote:
> > > >> 
> > > >> What I meant was that we need a generic way to identify a pKVM guest, so
> > > >> that we can use it in the conditional above.
> > > >
> > > > I have this patch, with that I can boot with your series unmodified,
> > > > but I will need to do more testing.
> > > >
> > > 
> > > Thanks, I can add this to the series once you complete the required testing.
> > > 
> > 
> > I am still running more tests, but looking more into it. Setting
> > force_dma_unencrypted() to true for pKVM guests is wrong, as the
> > guest shouldn’t try to decrypt arbitrary memory as it can include
> > sensitive information (for example in case of virtio sub-page
> > allocation) and should strictly rely on the restricted-dma-pool
> > for that.
> 
> ??
> 
> Where does force_dma_unencrypted() cause arbitary memory passed into
> the DMA API to be decrypted? That should never happen???

Sorry, maybe arbitrary is not the right expression again :)
I mean that, with emulated devices that use the DMA-API under pKVM,
they will map memory coming from other layers (VFS, net) through
vitrio-block, virtio-net... These can be smaller than a page, and
using force_dma_unencrypted() will share the whole page.
And as discussed, that leaks sensitive information to the untrusted
host.
I am currently investigating passing iova/phys/size
to force_dma_unencrypted() and then we can share pages inplace only
if possible without leaking extra information.
I am trying to get some performance results first. But the tricky part
is to get the semantics right, I believe in that case those devices
shouldn’t use restricted-dma-pools as those should always force
bouncing. Instead bouncing happens through the default SWIOTLB pool,
if not possible to decrypt in place.

Thanks,
Mostafa

> 
> Jason

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-19 13:39 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
	linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
	Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
	Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agxETC1rXBUSkWYg@google.com>

On Tue, May 19, 2026 at 11:06:52AM +0000, Mostafa Saleh wrote:

> > > One other interesting case for device-passthrough is non-coherent
> > > devices which then require private pools for bouncing.
> > 
> > Why does shared/private matter for bouncing? Why do you need to bounce
> > at all? Do cmo's not work in pkvm guests?
> 
> At the moment, in iommu_dma_map_phys(), if a non coherent device
> tries to map an unaligned address or size it will be bounced.

Sure, that's fine.

> In pKVM, dma-iommu is used for assigned devices which operate on
> private memory, so bouncing that through the SWIOTLB would leak
> information from the guest as the SWIOTLB is decrypted.

Yes, a device that can do private access should not be using a shared
SWIOTLB, that should be part of the selection logic inside the SWIOTLB
stuff..

Jason

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox