From: Eric Auger <eric.auger@redhat.com>
To: Zhenzhong Duan <zhenzhong.duan@intel.com>, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com
Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
Date: Thu, 20 Feb 2025 20:03:53 +0100
Message-ID: <242f875d-a47c-4732-9f17-1a121e89b053@redhat.com>
In-Reply-To: <20250219082228.3303163-1-zhenzhong.duan@intel.com>

Hi Zhenzhong,

On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Hi,
>
> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
> "Enable stage-1 translation for emulated device" series and
> "Enable stage-1 translation for passthrough device" series.
>
> This series is 2nd part focusing on passthrough device. We don't do
> shadowing of guest page table for passthrough device but pass stage-1
> page table to host side to construct a nested domain. There was some
> effort to enable this feature in old days, see [2] for details.
>
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in host IOMMU.
> As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform
s/be/is
> the stage-1 address translation. Along with it, modifications to
> present mappings in the guest I/O page table should be followed
> with an IOTLB invalidation.
>
> .-------------. .---------------------------.
> | vIOMMU | | Guest I/O page table |
> | | '---------------------------'
> .----------------/
> | PASID Entry |--- PASID cache flush --+
> '-------------' |
> | | V
> | | I/O page table pointer in GPA
> '-------------'
> Guest
> ------| Shadow |---------------------------|--------
> v v v
> Host
> .-------------. .------------------------.
> | pIOMMU | | FS for GIOVA->GPA |
> | | '------------------------'
> .----------------/ |
> | PASID Entry | V (Nested xlate)
> '----------------\.----------------------------------.
> | | | SS for GPA->HPA, unmanaged domain|
> | | '----------------------------------'
> '-------------'
> Where:
> - FS = First stage page tables
> - SS = Second stage page tables
> <Intel VT-d Nested translation>
>
> There are some interactions between VFIO and vIOMMU
> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
> subsystem. VFIO calls them to register/unregister HostIOMMUDevice
> instance to vIOMMU at vfio device realize stage.
> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
> to bind/unbind device to IOMMUFD backed domains, either nested
> domain or not.
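
To make the first interaction concrete, here is a minimal sketch in C of a
vIOMMU exposing set/unset callbacks that a VFIO-like caller invokes to
register and unregister a host device. It is not the code from this series:
VIOMMU, HostDev, IOMMUOps and the viommu_* functions are hypothetical,
simplified stand-ins, and the real PCIIOMMUOps callbacks take different
arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVFN 256

/* Hypothetical stand-in for HostIOMMUDevice. */
typedef struct HostDev {
    int devid;
} HostDev;

/* Hypothetical vIOMMU state: one host device slot per devfn. */
typedef struct VIOMMU {
    HostDev *devs[MAX_DEVFN];
} VIOMMU;

/* Hypothetical, reduced version of an ops table with the two callbacks. */
typedef struct IOMMUOps {
    bool (*set_iommu_device)(VIOMMU *s, int devfn, HostDev *dev);
    void (*unset_iommu_device)(VIOMMU *s, int devfn);
} IOMMUOps;

/* Register a host device; fail if the slot is invalid or already taken. */
bool viommu_set_dev(VIOMMU *s, int devfn, HostDev *dev)
{
    if (devfn < 0 || devfn >= MAX_DEVFN || s->devs[devfn]) {
        return false;
    }
    s->devs[devfn] = dev;
    return true;
}

/* Unregister the host device for this devfn, if any. */
void viommu_unset_dev(VIOMMU *s, int devfn)
{
    assert(devfn >= 0 && devfn < MAX_DEVFN);
    s->devs[devfn] = NULL;
}

/* The table the vIOMMU would hand to the PCI layer in this sketch. */
const IOMMUOps viommu_ops = {
    .set_iommu_device = viommu_set_dev,
    .unset_iommu_device = viommu_unset_dev,
};
```

In this sketch the caller only ever goes through viommu_ops, which mirrors
the point that VFIO does not reach into vIOMMU state directly.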
>
> See below diagram:
>
> VFIO Device Intel IOMMU
> .-----------------. .-------------------.
> | | | |
> | .---------|PCIIOMMUOps |.-------------. |
> | | IOMMUFD |(set_iommu_device) || Host IOMMU | |
> | | Device |------------------------>|| Device list | |
> | .---------|(unset_iommu_device) |.-------------. |
> | | | | |
> | | | V |
> | .---------| HostIOMMUDeviceIOMMUFD | .-------------. |
> | | IOMMUFD | (attach_hwpt)| | Host IOMMU | |
> | | link |<------------------------| | Device | |
> | .---------| (detach_hwpt)| .-------------. |
> | | | | |
> | | | ... |
> .-----------------. .-------------------.
>
> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
> whenever possible and create new one on demand, also supports multiple
> iommufd objects and ERRATA_772415.
>
> E.g., Stage-2 page table could be shared by different devices if there
> is no conflict and devices link to same iommufd object, i.e. devices
> under same host IOMMU can share same stage-2 page table. If there is
> conflict, i.e. there is one device under non cache coherency mode
> which is different from others, it requires a separate stage-2 page
> table in non-CC mode.
>
> SPR platform has ERRATA_772415 which requires no readonly mappings
> in stage-2 page table. This series supports creating VTDIOASContainer
> with no readonly mappings. If there is a rare case that some IOMMUs
> on a multiple IOMMU host have ERRATA_772415 and others not, this
> design can still survive.
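
As a rough illustration of the sharing rule described in the two paragraphs
above (again, not the code from this series), the lookup could be sketched in
C as follows. All type and function names are hypothetical simplifications of
VTDIOASContainer and VTDS2Hwpt. Keying the container on an exact match of the
errata flag is an assumption of this sketch, and cleanup is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stage-2 page table object, shared by compatible devices. */
typedef struct S2Hwpt {
    bool cc;                 /* cache-coherent mode of this table */
    unsigned users;          /* number of devices attached */
    struct S2Hwpt *next;
} S2Hwpt;

/* Hypothetical IOAS container, one per (iommufd, mapping policy) pair. */
typedef struct IOASContainer {
    int iommufd;             /* which iommufd object backs it */
    bool rw_only;            /* true: no read-only mappings (errata case) */
    S2Hwpt *hwpts;
    struct IOASContainer *next;
} IOASContainer;

/* Hypothetical view of a passthrough device. */
typedef struct Device {
    int iommufd;
    bool cc;
    bool errata;             /* host IOMMU has ERRATA_772415 */
} Device;

/* Reuse a container with the same iommufd and policy, else create one. */
IOASContainer *get_container(IOASContainer **head, const Device *d)
{
    IOASContainer *c;

    for (c = *head; c; c = c->next) {
        if (c->iommufd == d->iommufd && c->rw_only == d->errata) {
            return c;
        }
    }
    c = calloc(1, sizeof(*c));
    assert(c);
    c->iommufd = d->iommufd;
    c->rw_only = d->errata;
    c->next = *head;
    *head = c;
    return c;
}

/* Share a stage-2 table with a matching CC mode, else create a new one. */
S2Hwpt *attach_device(IOASContainer **head, const Device *d)
{
    IOASContainer *c = get_container(head, d);
    S2Hwpt *h;

    for (h = c->hwpts; h; h = h->next) {
        if (h->cc == d->cc) {
            h->users++;
            return h;
        }
    }
    h = calloc(1, sizeof(*h));
    assert(h);
    h->cc = d->cc;
    h->users = 1;
    h->next = c->hwpts;
    c->hwpts = h;
    return h;
}
```

With this sketch, two CC devices on the same iommufd end up on one stage-2
table, while a non-CC device, an errata device, or a device on another
iommufd each get their own.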
>
> See below example diagram for a full view:
>
> IntelIOMMUState
> |
> V
> .------------------. .------------------. .-------------------.
> | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer |-->...
> | (iommufd0,RW&RO) | | (iommufd1,RW&RO) | | (iommufd0,RW only)|
> .------------------. .------------------. .-------------------.
> | | |
> | .-->... |
> V V
> .-------------------. .-------------------. .---------------.
> | VTDS2Hwpt(CC) |--->| VTDS2Hwpt(non-CC) |-->... | VTDS2Hwpt(CC) |-->...
> .-------------------. .-------------------. .---------------.
> | | | |
> | | | |
> .-----------. .-----------. .------------. .------------.
> | IOMMUFD | | IOMMUFD | | IOMMUFD | | IOMMUFD |
> | Device(CC)| | Device(CC)| | Device | | Device(CC) |
> | (iommufd0)| | (iommufd0)| | (non-CC) | | (errata) |
> | | | | | (iommufd0) | | (iommufd0) |
> .-----------. .-----------. .------------. .------------.
>
> This series is also a prerequisite work for vSVA, i.e. Sharing
> guest application address space with passthrough devices.
>
> To enable stage-1 translation, only need to add "x-scalable-mode=on,x-flts=on".
> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>
> Passthrough device should use iommufd backend to work with stage-1 translation.
> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
> If host doesn't support nested translation, qemu will fail with an unsupported
> report.
You're not mentioning the lack of error reporting from HW S1 faults to
guests. Are there other deps missing?

Eric
>
> Test done:
> - VFIO devices hotplug/unplug
> - different VFIO devices linked to different iommufds
> - vhost net device ping test
>
> PATCH1-8: Add HWPT-based nesting infrastructure support
> PATCH9-10: Some cleanup work
> PATCH11: cap/ecap related compatibility check between vIOMMU and Host IOMMU
> PATCH12-19:Implement stage-1 page table for passthrough device
> PATCH20: Enable stage-1 translation for passthrough device
>
> Qemu code can be found at [3]
>
> TODO:
> - RAM discard
> - dirty tracking on stage-2 page table
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
>
> Thanks
> Zhenzhong
>
> Changelog:
> rfcv2:
> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
> - add two cleanup patches(patch9-10)
> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
> iommu pasid, this is important for dropping VTDPASIDAddressSpace
>
> Yi Liu (3):
> intel_iommu: Replay pasid binds after context cache invalidation
> intel_iommu: Propagate PASID-based iotlb invalidation to host
> intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
>
> Zhenzhong Duan (17):
> backends/iommufd: Add helpers for invalidating user-managed HWPT
> vfio/iommufd: Add properties and handlers to
> TYPE_HOST_IOMMU_DEVICE_IOMMUFD
> HostIOMMUDevice: Introduce realize_late callback
> vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
> vfio/iommufd: Implement [at|de]tach_hwpt handlers
> host_iommu_device: Define two new capabilities
> HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
> iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
> iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
> intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
> vtd_ce_get_pasid_entry
> intel_iommu: Optimize context entry cache utilization
> intel_iommu: Check for compatibility with IOMMUFD backed device when
> x-flts=on
> intel_iommu: Introduce a new structure VTDHostIOMMUDevice
> intel_iommu: Add PASID cache management infrastructure
> intel_iommu: Bind/unbind guest page table to host
> intel_iommu: ERRATA_772415 workaround
> intel_iommu: Bypass replay in stage-1 page table mode
> intel_iommu: Enable host device when x-flts=on in scalable mode
>
> hw/i386/intel_iommu_internal.h | 56 +
> include/hw/i386/intel_iommu.h | 33 +-
> include/system/host_iommu_device.h | 40 +
> include/system/iommufd.h | 53 +
> backends/iommufd.c | 58 +
> hw/i386/intel_iommu.c | 1660 ++++++++++++++++++++++++----
> hw/vfio/common.c | 17 +-
> hw/vfio/iommufd.c | 48 +
> backends/trace-events | 1 +
> hw/i386/trace-events | 13 +
> 10 files changed, 1776 insertions(+), 203 deletions(-)
>