From: Mostafa Saleh <smostafa@google.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: iommu@lists.linux.dev, kvmarm@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, catalin.marinas@arm.com,
will@kernel.org, maz@kernel.org, oliver.upton@linux.dev,
joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
robdclark@gmail.com, joro@8bytes.org, robin.murphy@arm.com,
jean-philippe@linaro.org, nicolinc@nvidia.com,
vdonnefort@google.com, qperret@google.com, tabba@google.com,
danielmentz@google.com, tzukui@google.com
Subject: Re: [RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Date: Fri, 13 Dec 2024 19:39:04 +0000
Message-ID: <Z1yNWI9lBNIZg6Le@google.com>
In-Reply-To: <20241212194119.GA4679@ziepe.ca>
Hi Jason,
Thanks a lot for taking the time to review this; I tried to reply to all
points. However, I think the main source of confusion was that this series
is only for the host kernel, not guests: with this series, guests still
have no access to DMA under pKVM. I hope that clarifies some of the points.
On Thu, Dec 12, 2024 at 03:41:19PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 12, 2024 at 06:03:24PM +0000, Mostafa Saleh wrote:
>
> > This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
> > hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
> > host SMMU driver into nVHE isn't really an option. It is too large and
> > complex and requires infrastructure from all over the kernel. We add a
> > reduced nVHE driver that deals with populating the SMMU tables and the
> > command queue, and the host driver still deals with probing and some
> > initialization.
>
> The cover letter doesn't explain why someone needs page tables in the
> guest at all?
This is not for guests but for the host: the hypervisor needs to
establish DMA isolation between the host and the hypervisor/guests.
Before these patches, as mentioned, the host can program a DMA device
to read/write any memory (that has nothing to do with whether the
guest has DMA access or not).
So it's mandatory for pKVM to establish DMA isolation, otherwise
it can be easily defeated.
However, guest DMA support is optional and only needed for device
passthrough. I have some patches to support that in pKVM as well (only with
vfio-platform), but they are unlikely to be posted upstream before a
host DMA isolation solution is merged, as that part is mandatory.
>
> If you are able to implement nested support then you can boot the
> guest with no-iommu and an effective identity translation through a
> hypervisor controlled S2. ie no guest map/unmap. Great DMA
> performance.
We can do that for the host also, which is discussed in the v1 cover
letter. However, we try to keep feature parity with the normal (VHE)
KVM arm64 support, so constraining KVM to not have IOVA spaces for
devices seems too restrictive and impractical on modern systems (phones,
for example).
>
> I thought the point of doing the paravirt here was to allow dynamic
> pinning of the guest memory? This is the primary downside with nested.
> The entire guest memory has to be pinned down at guest boot.
As this is for the host, memory pinning is not really an issue (however,
with nesting and a shared CPU stage-2 there are other challenges, as
mentioned).
>
> > 1. Paravirtual I/O page tables
> > This is the solution implemented in this series. The host creates
> > IOVA->HPA mappings with two hypercalls map_pages() and unmap_pages(), and
> > the hypervisor populates the page tables. Page tables are abstracted into
> > IOMMU domains, which allow multiple devices to share the same address
> > space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
> > and free_domain(), manage the domains, the semantics of those hypercalls
> > are almost identical to the IOMMU ops which make the kernel driver part
> > simpler.
>
> That is re-inventing virtio-iommu. I don't really understand why this
> series is hacking up arm-smmuv3 so much, that is not, and should not,
> be a paravirt driver. Why not create a clean new pkvm specific driver
> for the paravirt?? Or find a way to re-use parts of virtio-iommu?
>
> Shouldn't other arch versions of pkvm be able to re-use the same guest
> iommu driver?
As mentioned, this is for the host kernel, not the guest. However, the
hypervisor/kernel interface is not IOMMU-specific, and it can be extended
to other IOMMUs/archs.
There is no hacking of the arm-smmu-v3 driver; it is mostly a matter of
splitting the driver so it can be re-used, plus introducing a separate
hypervisor driver. It's similar to how SVA re-uses part of the driver,
just on a bigger scale.
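For illustration, the semantics of the host-facing interface quoted above
can be modeled roughly as follows (a hypothetical sketch: names and error
codes are simplified, and the real calls are EL2 hypercalls whose arguments
are validated against an untrusted host, not plain C functions):

```c
/*
 * Toy model of the host<->hypervisor IOMMU interface described above.
 * Hypothetical: simplified names/error handling, in-memory state only.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DOMAINS 4

struct pkvm_iommu_domain {
	int used;
	int nr_devs;	/* devices attached to this address space */
	size_t mapped;	/* bytes mapped; stands in for the page table */
};

static struct pkvm_iommu_domain domains[MAX_DOMAINS];

int kvm_iommu_alloc_domain(unsigned int id)
{
	if (id >= MAX_DOMAINS || domains[id].used)
		return -1;
	domains[id] = (struct pkvm_iommu_domain){ .used = 1 };
	return 0;
}

int kvm_iommu_attach_dev(unsigned int id)
{
	if (id >= MAX_DOMAINS || !domains[id].used)
		return -1;
	domains[id].nr_devs++;
	return 0;
}

int kvm_iommu_detach_dev(unsigned int id)
{
	if (id >= MAX_DOMAINS || !domains[id].used || !domains[id].nr_devs)
		return -1;
	domains[id].nr_devs--;
	return 0;
}

int kvm_iommu_map_pages(unsigned int id, uint64_t iova, size_t size)
{
	(void)iova;	/* a real implementation walks the page table */
	if (id >= MAX_DOMAINS || !domains[id].used)
		return -1;
	domains[id].mapped += size;
	return 0;
}

int kvm_iommu_unmap_pages(unsigned int id, uint64_t iova, size_t size)
{
	(void)iova;
	if (id >= MAX_DOMAINS || !domains[id].used ||
	    domains[id].mapped < size)
		return -1;
	domains[id].mapped -= size;
	return 0;
}

int kvm_iommu_free_domain(unsigned int id)
{
	if (id >= MAX_DOMAINS || !domains[id].used || domains[id].nr_devs)
		return -1;	/* refuse while devices are attached */
	domains[id].used = 0;
	return 0;
}
```

The one semantic worth highlighting is that free_domain() must fail while
devices are still attached: the hypervisor cannot trust the host to order
these calls correctly.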
>
> > b- Locking: The io-pgtable-arm is lockless under some guarantees of how
> > the IOMMU code behaves. However with pKVM, the kernel is not trusted
> > and a malicious kernel can issue concurrent requests causing memory
> > corruption or UAF, so that it has to be locked in the hypervisor.
>
> ? I don't get it, the hypervisor page table has to be private to the
> hypervisor. It is not that io-pgtable-arm is lockless, it is that it
> relies on a particular kind of caller supplied locking. pkvm's calls
> into its private io-pgtable-arm would need pkvm specific locking that
> makes sense for it. Where does a malicious guest kernel get into this?
At the moment, when the kernel driver uses io-pgtable-arm, it doesn't
protect it with any locks, relying on some assumptions about its callers.
For example, concurrently unmapping a block-size entry and a page-size
leaf inside it can cause a UAF, but the DMA API never does that.
With pKVM, the host kernel is not trusted, and if compromised it could
mount such attacks to corrupt hypervisor memory, so the hypervisor has to
lock io-pgtable-arm operations at EL2 to avoid that.
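As a minimal sketch of what "locked at EL2" means here (hypothetical code,
not the series' actual EL2 implementation, which uses hypervisor spinlocks
rather than pthreads), a single lock taken around every page-table
operation is enough to turn the block-unmap vs. leaf-unmap race into a
safe serialization:

```c
/*
 * Hypothetical sketch: every io-pgtable operation from the (untrusted)
 * host is serialized under one lock, so a concurrent block-unmap and
 * leaf-unmap cannot race and use a freed next-level table.
 */
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t pgd_lock = PTHREAD_MUTEX_INITIALIZER;
static int *l2_table;		/* stand-in for a next-level table page */

static void unmap_block(void)	/* unmap a whole block: frees the table */
{
	pthread_mutex_lock(&pgd_lock);
	free(l2_table);
	l2_table = NULL;
	pthread_mutex_unlock(&pgd_lock);
}

static void unmap_leaf(int idx)	/* unmap one page inside the table */
{
	pthread_mutex_lock(&pgd_lock);
	if (l2_table)		/* table may already be gone: no UAF */
		l2_table[idx] = 0;
	pthread_mutex_unlock(&pgd_lock);
}

static void *leaf_thread(void *arg)
{
	(void)arg;
	unmap_leaf(3);
	return NULL;
}

int run_race(void)
{
	pthread_t t;

	l2_table = calloc(512, sizeof(*l2_table));
	pthread_create(&t, NULL, leaf_thread, NULL);
	unmap_block();		/* races with the leaf unmap, safely */
	pthread_join(t, NULL);
	return l2_table == NULL;	/* 1: table freed exactly once */
}
```

Without the lock, the leaf unmap could dereference the table after the
block unmap freed it, which is exactly the UAF described above.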
>
> > 2. Nested SMMUv3 translation (with emulation)
> > Another approach is to rely on nested translation support which is
> > optional in SMMUv3, that requires an architecturally accurate emulation
> > of SMMUv3 which can be complicated including cmdq emulation.
>
> The confidential compute folks are going in this direction.
I see, but one key advantage of pKVM is that it requires minimal hardware:
with the paravirtual approach we can support single-stage SMMUv3 or even
non-architected IOMMUs. That, plus the DMA performance, might give it a
slight edge, but as I mentioned I plan to do a more thorough comparison
with nesting and maybe discuss it at a conference this year.
>
> > The trade off between the 2 approaches can be roughly summarised as:
> > Paravirtualization:
> > - Compatible with more HW (and IOMMUs).
> > - Better DMA performance due to shorter table walks/less TLB pressure
> > - Needs extra complexity to squeeze the last bit of optimization (around
> > unmap, and map_sg).
>
> It has better straight line DMA performance if the DMAs are all
> static. Generally much, much worse performance if the DMAs are
> dynamically mapped as you have to trap so much stuff.
I agree it's not that clear-cut; I will finish the nested implementation
and run some standard I/O benchmarks.
>
> The other negative is there is no way to get SVA support with
> para-virtualization.
>
Yeah, SVA is tricky, I guess for that we would have to use nesting,
but tbh, I don’t think it’s a deal breaker for now.
> The positive is you don't have to pin the VM's memory.
>
> > Nested Emulation
> > - Faster map_pages (not sure about unmap because it requires cmdq
> > emulation for TLB invalidation if DVM not used).
>
> If you can do nested then you can run in identity mode and then you
> don't have any performance down side. It is a complete win.
Unfortunately, as mentioned above, it's not that practical; many devices
in the mobile space expect IO translation capability.
>
> Even if you do non-idenity nested is still likely faster for changing
> translation than paravirt approaches. A single cmdq range invalidate
> should be about the same broad overhead as a single paravirt call to
> unmap except they can be batched under load.
>
> Things like vCMDQ eliminate this overhead entirely, to my mind that is
> the future direction of this HW as you obviously need to HW optimize
> invalidation...
>
> > - Needs extra complexity for architecturally emulating SMMUv3.
>
> Lots of people have now done this, it is not really so bad. In
> exchange you get a full architected feature set, better performance,
> and are ready for HW optimizations.
It's not impossible, it's just more complicated to do in the hypervisor,
which has limited features compared to the kernel. Also, I haven't seen
any open-source implementation of that except for QEMU's, which is in
userspace.
>
> > - Add IDENTITY_DOMAIN support, I already have some patches for that, but
> > didn’t want to complicate this series, I can send them separately.
>
> This seems kind of pointless to me. If you can tolerate identity (ie
> pin all memory) then do nested, and maybe don't even bother with a
> guest iommu.
As mentioned, the choice of para-virt was not only to avoid pinning, as
this is the host. For IDENTITY_DOMAIN we either share the CPU page table,
in which case we have to deal with lazy mapping (SMMU features, BBM...),
or mirror it in a shadow, SMMU-only identity page table.
>
> If you want most of the guest memory to be swappable/movable/whatever
> then paravirt is the only choice, and you really don't want the guest
> to have any identiy support at all.
>
> Really, I think you'd want to have both options, there is no "best"
> here. It depends what people want to use the VM for.
>
> My advice for merging would be to start with the pkvm side setting up
> a fully pinned S2 and do not have a guest driver. Nesting without
> emulating smmuv3. Basically you get protected identity DMA support. I
> think that would be a much less sprawling patch series. From there it
> would be well positioned to add both smmuv3 emulation and a paravirt
> iommu flow.
>
I am open to any suggestions, but I believe any solution considered for
merging should have enough features to be usable on actual systems (a
translating IOMMU can be used, for example): so either para-virt as in
this series, or full nesting as in the PoC above (or maybe both?), which
IMO comes down to the trade-off mentioned above.
Thanks,
Mostafa
> Jason