public inbox for linux-kernel@vger.kernel.org
From: Mostafa Saleh <smostafa@google.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: iommu@lists.linux.dev, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
	will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
	joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
	robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
	jean-philippe@linaro.org, nicolinc@nvidia.com,
	vdonnefort@google.com, qperret@google.com, tabba@google.com,
	danielmentz@google.com, tzukui@google.com
Subject: Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Date: Wed, 8 Jan 2025 12:09:53 +0000	[thread overview]
Message-ID: <Z35rEeSVwfZVA9IF@google.com> (raw)
In-Reply-To: <20250102201614.GA26854@ziepe.ca>

On Thu, Jan 02, 2025 at 04:16:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 13, 2024 at 07:39:04PM +0000, Mostafa Saleh wrote:
> > Thanks a lot for taking the time to review this, I tried to reply to all
> > points. However I think a main source of confusion was that this is only
> > for the host kernel not guests, with this series guests still have no
> > access to DMA under pKVM. I hope that clarifies some of the points.
> 
> I think I just used different words, I meant the direct guest of pkvm,
> including what you are calling the host kernel.
> 

KVM treats the host and guests very differently, so the distinction
between the two matters in this context: this driver is for the host
only; guests are another story.

> > > The cover letter doesn't explain why someone needs page tables in the
> > > guest at all?
> > 
> > This is not for guests but for the host, the hypervisor needs to
> > establish DMA isolation between the host and the hypervisor/guests.
> 
> Why isn't this done directly in pkvm by setting up IOMMU tables that
> identity map the host/guest's CPU mapping? Why does the host kernel or
> guest kernel need to have page tables?
> 

If we set up identity tables, that either means there is no translation
capability for the guest (or the host here), or that nesting must be
used, which is discussed later in this cover letter.
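To make that trade-off concrete, here is a minimal toy-C sketch (purely
illustrative, not the series' code) of what an identity map provides:
unmapped memory still faults, but a mapped IOVA is always just the PA,
so the OS gets no IOVA spaces of its own:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy identity stage-2: each page frame is either mapped 1:1 or
 * faults. NR_PFNS and the flat array are stand-ins for a real
 * page table.
 */
#define NR_PFNS 8
static int identity_valid[NR_PFNS];	/* 1 = mapped 1:1 */

static void setup_identity(void)
{
	for (int i = 0; i < NR_PFNS; i++)
		identity_valid[i] = 1;
}

/* Translation is the identity: IOVA == PA when mapped, fault otherwise. */
static int64_t iova_to_phys(int64_t iova)
{
	if (iova < 0 || iova >= NR_PFNS || !identity_valid[iova])
		return -1;	/* translation fault */
	return iova;
}
```

This gives DMA isolation (faults on anything outside the map) but no
remapping, which is exactly why translation for the host needs either
para-virtualization or nesting on top.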

> > However, guest DMA support is optional and only needed for device
> > passthrough, 
> 
> Why? The CC cases are having the pkvm layer control the translation,
> so when the host spawns a guest the pkvm will setup a contained IOMMU
> translation for that guest as well.
> 
> Don't you also want to protect the guests from the host in this model?
> 

We do protect the guests from the host: in the proposed approach, by
preventing the host from mapping guest (or hypervisor) memory in the
IOMMU, and from donating memory that is currently mapped in the IOMMU.
However, at the moment pKVM doesn't support device passthrough, so
guests don't need IOMMU page tables: they can't use any device or issue
DMA directly.
I have some patches to support device passthrough in guests plus guest
IOMMU page tables, which are not part of this series. As mentioned, host
DMA isolation is critical for the pKVM model, while guest device passthrough
is an optional feature (but we plan to upstream it later).
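As a rough illustration of the two rules above, here is a toy C model
(hypothetical names, not the actual pKVM EL2 API) of the checks: refuse
to map non-host-owned memory into the IOMMU, and refuse to donate a page
while it is still reachable by DMA:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Each page tracks an owner and whether it is mapped in an IOMMU. */
enum pkvm_page_owner { OWNER_HOST, OWNER_HYP, OWNER_GUEST };

struct page_meta {
	enum pkvm_page_owner owner;
	bool iommu_mapped;
};

#define NR_PAGES 16
static struct page_meta pages[NR_PAGES];	/* all host-owned initially */

/* Host asks the hypervisor to map a page into an IOMMU domain. */
static int hyp_iommu_map(size_t pfn)
{
	if (pfn >= NR_PAGES)
		return -1;
	/* Refuse to expose hypervisor or guest memory to host DMA. */
	if (pages[pfn].owner != OWNER_HOST)
		return -1;
	pages[pfn].iommu_mapped = true;
	return 0;
}

/* Host donates a page to a guest (or the hypervisor). */
static int hyp_donate_page(size_t pfn, enum pkvm_page_owner to)
{
	if (pfn >= NR_PAGES || pages[pfn].owner != OWNER_HOST)
		return -1;
	/* Refuse donation while the page is still DMA-reachable. */
	if (pages[pfn].iommu_mapped)
		return -1;
	pages[pfn].owner = to;
	return 0;
}
```

The real hypervisor does this against its page ownership tracking; the
point of the sketch is only the ordering: a page can be guest-owned or
IOMMU-mapped, never both.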

> > We can do that for the host also, which is discussed in the v1 cover
> > letter. However, we try to keep feature parity with the normal (VHE)
> > KVM arm64 support, so constraining KVM support to not have IOVA spaces
> > for devices seems too much and impractical on modern systems (phones for
> > example).
> 
> But why? Do you have current use cases on phone where you need to have
> device-specific iommu_domains? What are they? Answering this goes a
> long way to understanding the real performance of a para virt approach.
> 

I don't think having one domain for all devices fits most cases: SoCs can
have heterogeneous SMMUs, different address sizes, coherency properties...
There is also the basic idea of isolation between devices, where some can
be controlled from userspace or influenced by external entities and should
be isolated (we wouldn't want USB/network devices to have access to other
devices' memory, for example).
Another example would be accelerators that only operate on contiguous
memory; finding such large physically contiguous buffers on phones is
almost impossible.

I don't think having a single domain is practical (nor does it help in this case).

> > There is no hacking for the arm-smmu-v3 driver, but mostly splitting
> > the driver so it can be re-used + introduction for a separate
> > hypervisor
> 
> I understood splitting some of it so you could share code with the
> pkvm side, but I don't see that it should be connected to the
> host/guest driver. Surely that should be a generic pkvm-iommu driver
> that is arch neutral, like virtio-iommu.
> 

The host driver follows the KVM (nVHE/hVHE) model, where at boot the
kernel (EL1) does a lot of the initialization; after that it becomes
untrusted and the hypervisor manages everything.

Similarly, the driver first probes at EL1 and does much of the complicated
work that is not supported at the hypervisor (EL2), such as parsing firmware
tables, and ends up populating a simplified description of the SMMU topology.

The KVM <-> SMMU interface is not arch specific; you can check that
in hyp_main.c or nvhe/iommu.c, where there is no reference to the SMMU and
all hypercalls are abstracted so other IOMMUs can be supported under pKVM
(that's the case in Android).

Maybe the driver at EL1 could be further split into a standard part for
the hypercall interface and an init part that is SMMUv3 specific,
but I'd rather not complicate things until we have other users upstream.

For guest VMs (not part of this series), the interface and the kernel driver
are completely arch agnostic, similarly to virtio-iommu.
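For illustration, a "simplified description of the SMMU topology" handed
from EL1 to EL2 could look something like the toy C sketch below; the
struct and field names are purely hypothetical, not the series' actual
structures. The idea is that EL1 digests firmware tables into a flat
description, and EL2 only has to sanity-check it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pre-digested per-SMMU description, filled by the EL1 driver. */
struct hyp_smmu_desc {
	uint64_t mmio_base;	/* register frame, taken from firmware tables */
	uint64_t mmio_size;
	uint32_t ias;		/* input address size, in bits */
	uint32_t oas;		/* output address size, in bits */
	bool coherent;		/* coherent page-table walks */
};

/* EL2 never parses ACPI/DT; it only validates what EL1 produced. */
static bool hyp_smmu_desc_valid(const struct hyp_smmu_desc *d)
{
	return d->mmio_base && d->mmio_size &&
	       d->ias >= 32 && d->ias <= 52 &&
	       d->oas >= 32 && d->oas <= 52;
}
```

Keeping EL2 to validation rather than discovery is what keeps the
hypervisor side small and auditable.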

> > With pKVM, the host kernel is not trusted, and if compromised it can
> > instrument such attacks to corrupt hypervisor memory, so the hypervisor
> > would lock io-pgtable-arm operations in EL2 to avoid that.
> 
> io-pgtable-arm has a particular set of locking assumptions, the caller
> has to follow it. When pkvm converts the hypercalls for the
> para-virtualization into io-pgtable-arm calls it has to also ensure it
> follows io-pgtable-arm's locking model if it is going to use that as
> its code base. This has nothing to do with the guest or trust, it is
> just implementing concurrency correctly in pkvm..
> 

AFAICT, io-pgtable-arm has a set of assumptions about how it's called;
that's why it's lockless, as the DMA API is expected to follow those
assumptions. For example, you can't unmap a table and an entry inside that
table concurrently; that can lead to UAF/memory corruption, and this never
happens at the moment because the kernel has no bugs :)

However, pKVM always assumes that the kernel can be malicious, so a bad
kernel can issue such calls, breaking those assumptions and leading to
UAF/memory corruption inside the hypervisor. That is not acceptable, so the
solution is to use a lock to protect against such concurrent requests.
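The scheme could be sketched like this in toy C (hypothetical names;
a pthread mutex stands in for the hypervisor spinlock): every host-issued
map/unmap hypercall is serialized by one lock before it touches the page
table, so the hypervisor never relies on the caller's good behaviour:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_SLOTS 8
static long table[NR_SLOTS];	/* toy stand-in for an io-pgtable */
static pthread_mutex_t hyp_pgt_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypercall handler: map 'pa' at 'iova', failing on double-map. */
static int hyp_handle_map(size_t iova, long pa)
{
	int ret = -1;

	pthread_mutex_lock(&hyp_pgt_lock);
	if (iova < NR_SLOTS && !table[iova]) {
		table[iova] = pa;
		ret = 0;
	}
	pthread_mutex_unlock(&hyp_pgt_lock);
	return ret;
}

/* Hypercall handler: unmap 'iova', failing if nothing is mapped. */
static int hyp_handle_unmap(size_t iova)
{
	int ret = -1;

	pthread_mutex_lock(&hyp_pgt_lock);
	if (iova < NR_SLOTS && table[iova]) {
		table[iova] = 0;
		ret = 0;
	}
	pthread_mutex_unlock(&hyp_pgt_lock);
	return ret;
}
```

Because every mutation holds the same lock, a racing (or malicious)
sequence of hypercalls can at worst get an error back; it can never
observe or create a half-updated table.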

> > Yeah, SVA is tricky, I guess for that we would have to use nesting,
> > but tbh, I don’t think it’s a deal breaker for now.
> 
> Again, it depends what your actual use case for translation is inside
> the host/guest environments. It would be good to clearly spell this out..
> There are few drivers that directly manipulate the iommu_domains of a
> device: a few GPUs, ath1x wireless, some Tegra stuff, "venus". Which
> of those are you targeting?
> 

I'm not sure I understand this point about manipulating domains.
AFAIK, SVA is not that common, including in the mobile space, but I could
be wrong; that's why it's not a priority here.

> > > Lots of people have now done this, it is not really so bad. In
> > > exchange you get a full architected feature set, better performance,
> > > and are ready for HW optimizations.
> > 
> > It’s not impossible, it’s just more complicated doing it in the
> > hypervisor which has limited features compared to the kernel + I haven’t
> > seen any open source implementation for that except for Qemu which is in
> > userspace.
> 
> People are doing it in their CC stuff, which is about the same as
> pkvm. I'm not sure if it will be open source, I hope so since it needs
> security auditing..
> 

Yes, as mentioned later, I also have a WIP implementation for KVM (which is
open source[1] :)) that I plan to send to the list (maybe in 3-4 months when
ready) as an alternative approach.

> > > > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> > > >   didn’t want to complicate this series, I can send them separately.
> > > 
> > > This seems kind of pointless to me. If you can tolerate identity (ie
> > > pin all memory) then do nested, and maybe don't even bother with a
> > > guest iommu.
> > 
> > As mentioned, the choice for para-virt was not only to avoid pinning,
> > as this is the host, for IDENTITY_DOMAIN we either share the page table,
> > then we have to deal with lazy mapping (SMMU features, BBM...) or mirror
> > the table in a shadow SMMU only identity page table.
> 
> AFAIK you always have to mirror unless you significantly change how
> the KVM S1 page table stuff is working. The CC people have made those
> changes and won't mirror, so it is doable..
> 

Yes, I agree. AFAIK, the current KVM pgtable code is not ready for sharing
page tables with the IOMMU.

> > > My advice for merging would be to start with the pkvm side setting up
> > > a fully pinned S2 and do not have a guest driver. Nesting without
> > > emulating smmuv3. Basically you get protected identity DMA support. I
> > > think that would be a much less sprawling patch series. From there it
> > > would be well positioned to add both smmuv3 emulation and a paravirt
> > > iommu flow.
> > 
> > I am open to any suggestions, but I believe any solution considered for
> > merge, should have enough features to be usable on actual systems (translating
> > IOMMU can be used for example) so either para-virt as this series or full
> > nesting as the PoC above (or maybe both?), which IMO comes down to the
> > trade-off mentioned above.
> 
> IMHO no, you can have a completely usable solution without host/guest
> controlled translation. This is equivalent to a bare metal system with
> no IOMMU HW. This exists and is still broadly useful. The majority of
> cloud VMs out there are in this configuration.
> 
> That is the simplest/smallest thing to start with. Adding host/guest
> controlled translation is a build-on-top excercise that seems to have
> a lot of options and people may end up wanting to do all of them.
> 
> I don't think you need to show that host/guest controlled translation
> is possible to make progress, of course it is possible. Just getting
> to the point where pkvm can own the SMMU HW and provide DMA isolation
> between all of it's direct host/guest is a good step.

My plan was basically:
1) Finish and send nested SMMUv3 as RFC, with more insights about
   performance and complexity trade-offs of both approaches.

2) Discuss next steps for the upstream solution in an upcoming conference
   (like LPC or earlier if possible) and work on upstreaming it.

3) Work on guest device passthrough and IOMMU support.

I am open to gradually upstreaming this as you mentioned, where as a first
step pKVM would establish DMA isolation without translation for the host;
that should be enough to have a functional pKVM and run protected workloads.

But although that might be usable on some systems, I don't think it's
practical in the long term, as it limits the amount of HW that can run pKVM.

[1] https://android-kvm.googlesource.com/linux/+/refs/heads/smostafa/android15-6.6-smmu-nesting-wip

Thanks,
Mostafa

> 
> Jason

