* [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device
@ 2025-06-06 10:04 Zhenzhong Duan
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Hi,

Now that the VFIO/IOMMUFD prerequisite patchset has been accepted, this series
focuses on stage-1 translation for passthrough devices in intel_iommu. I think
it is time to move from rfcv3 to v1.

rfcv3 cover-letter:

Per Jason Wang's suggestion, iommufd nesting series[1] is split into
"Enable stage-1 translation for emulated device" series and
"Enable stage-1 translation for passthrough device" series.

This series is the 2nd part, focusing on passthrough devices. We don't shadow
the guest page table for a passthrough device; instead we pass the stage-1 page
table to the host side to construct a nested domain. There was an earlier effort
to enable this feature, see [2] for details.

The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability of the host IOMMU. As the diagram below
shows, the guest I/O page table pointer, which is a GPA (guest physical
address), is passed to the host and used to perform the stage-1 address
translation. In addition, modifications to present mappings in the guest I/O
page table must be followed by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.--------------------------------------.
        |             |   | Stage2 for GPA->HPA, unmanaged domain|
        |             |   '--------------------------------------'
        '-------------'
For historical reasons, different revisions of the VT-d spec use different
names, where:
 - Stage1 = First stage = First level = flts
 - Stage2 = Second stage = Second level = slts
<Intel VT-d Nested translation>

There are some interactions between VFIO and vIOMMU:
* vIOMMU registers the PCIIOMMUOps [set|unset]_iommu_device callbacks with
  the PCI subsystem. VFIO calls them to register/unregister a
  HostIOMMUDevice instance with the vIOMMU at the vfio device realize stage.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
  to bind/unbind a device to IOMMUFD backed domains, either nested
  or not.

See the diagram below:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.
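
The registration half of this interaction can be pictured with a small,
self-contained sketch. It is only an illustration: HostDev, IommuOps and
ToyIommu are hypothetical stand-ins, and the callback signatures are
simplified; they are not QEMU's real PCIIOMMUOps or HostIOMMUDevice
definitions, and the attach_hwpt/detach_hwpt direction is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVFN 256

/* Hypothetical stand-in for a host IOMMU device handed over by VFIO. */
typedef struct HostDev {
    int devid;    /* assumed: device id inside its iommufd object */
    int iommufd;  /* assumed: which iommufd object backs the device */
} HostDev;

/* Simplified callback table, loosely modelled on the idea of PCIIOMMUOps. */
typedef struct IommuOps {
    bool (*set_iommu_device)(void *opaque, int devfn, HostDev *dev);
    void (*unset_iommu_device)(void *opaque, int devfn);
} IommuOps;

/* Toy vIOMMU state: one registration slot per devfn. */
typedef struct ToyIommu {
    HostDev *host_devs[MAX_DEVFN];
} ToyIommu;

/* Called by the device side at realize time to register its host device. */
static bool toy_set_iommu_device(void *opaque, int devfn, HostDev *dev)
{
    ToyIommu *s = opaque;

    if (devfn < 0 || devfn >= MAX_DEVFN || dev == NULL) {
        return false;
    }
    if (s->host_devs[devfn] != NULL) {
        return false; /* slot already taken: refuse a second registration */
    }
    s->host_devs[devfn] = dev;
    return true;
}

/* Called by the device side at unrealize time to drop the registration. */
static void toy_unset_iommu_device(void *opaque, int devfn)
{
    ToyIommu *s = opaque;

    if (devfn >= 0 && devfn < MAX_DEVFN) {
        s->host_devs[devfn] = NULL;
    }
}

/* The table the toy vIOMMU would hand to the bus layer. */
static const IommuOps toy_ops = {
    .set_iommu_device = toy_set_iommu_device,
    .unset_iommu_device = toy_unset_iommu_device,
};
```

In this sketch the device side only ever goes through toy_ops, so the vIOMMU
keeps sole ownership of its host device list, which mirrors the direction of
the arrows in the diagram above.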

Based on Yi's suggestion, this design shares ioas/hwpt whenever possible and
creates new ones on demand. It also supports multiple iommufd objects and
ERRATA_772415.

E.g., within one guest's scope, a stage-2 page table can be shared by different
devices if there is no conflict and the devices link to the same iommufd object,
i.e. devices under the same host IOMMU can share the same stage-2 page table.
If there is a conflict, e.g. one device is in non cache coherency (non-CC) mode
while the others are not, that device requires a separate stage-2 page table in
non-CC mode.

The SPR platform has ERRATA_772415, which requires that there be no readonly
mappings in the stage-2 page table. This series supports creating a
VTDIOASContainer with no readonly mappings. In the rare case that some IOMMUs
on a multi-IOMMU host have ERRATA_772415 and others do not, this design still
works.

See the example diagram below for a full view:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.
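
The share-or-create policy in the diagram above can be sketched as two
lookups. This is a minimal illustration under stated assumptions, not the
code in this series: Container, S2Hwpt and Dev are hypothetical stand-ins for
VTDIOASContainer, VTDS2Hwpt and the IOMMUFD device, and compatibility is
reduced to comparing a CC flag and an errata flag, whereas a real
implementation would have to ask the host whether an attach succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a stage-2 hwpt shared by compatible devices. */
typedef struct S2Hwpt {
    bool cc;               /* cache coherency mode of this page table */
    int users;             /* number of devices attached to it */
    struct S2Hwpt *next;
} S2Hwpt;

/* Hypothetical stand-in for an IOAS container. */
typedef struct Container {
    int iommufd;           /* iommufd object the container belongs to */
    bool allow_ro;         /* false: no readonly mappings (errata case) */
    S2Hwpt *hwpts;
    struct Container *next;
} Container;

/* Hypothetical description of a passthrough device. */
typedef struct Dev {
    int iommufd;
    bool cc;
    bool errata;           /* host IOMMU has the readonly-mapping erratum */
} Dev;

/* Find a container matching (iommufd, readonly policy) or create one. */
static Container *container_get(Container **head, const Dev *d)
{
    bool allow_ro = !d->errata;
    Container *c;

    for (c = *head; c != NULL; c = c->next) {
        if (c->iommufd == d->iommufd && c->allow_ro == allow_ro) {
            return c;
        }
    }
    c = calloc(1, sizeof(*c));
    if (c == NULL) {
        return NULL;
    }
    c->iommufd = d->iommufd;
    c->allow_ro = allow_ro;
    c->next = *head;
    *head = c;
    return c;
}

/* Reuse a stage-2 hwpt with the same CC mode, or create one on demand. */
static S2Hwpt *hwpt_get(Container *c, const Dev *d)
{
    S2Hwpt *h;

    for (h = c->hwpts; h != NULL; h = h->next) {
        if (h->cc == d->cc) {
            h->users++;
            return h;
        }
    }
    h = calloc(1, sizeof(*h));
    if (h == NULL) {
        return NULL;
    }
    h->cc = d->cc;
    h->users = 1;
    h->next = c->hwpts;
    c->hwpts = h;
    return h;
}

/* Attach a device: pick or create its container, then its stage-2 hwpt. */
static S2Hwpt *dev_attach(Container **head, const Dev *d)
{
    Container *c;

    assert(head != NULL && d != NULL);
    c = container_get(head, d);
    return c != NULL ? hwpt_get(c, d) : NULL;
}

/* Count containers, handy for checking how much was shared. */
static int container_count(const Container *head)
{
    int n = 0;

    for (; head != NULL; head = head->next) {
        n++;
    }
    return n;
}
```

With the devices from the diagram, two CC devices on iommufd0 end up on one
shared hwpt, the non-CC device gets its own hwpt in the same container, and
the errata device lands in a separate container without readonly mappings.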

This series is also prerequisite work for vSVA, i.e. sharing a guest
application's address space with passthrough devices.

To enable stage-1 translation, you only need to add
"x-scalable-mode=on,x-flts=on",
i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...

A passthrough device must use the iommufd backend to work with stage-1
translation,
i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

If the host doesn't support nested translation, QEMU will fail and report that
it is unsupported.

Tests done:
- VFIO device hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test
- build on Windows

Fault reporting isn't supported in this series; we presume the guest kernel
always constructs a correct S1 page table for a passthrough device. For emulated
devices, the emulation code already provides S1 fault injection.

PATCH1-2:  Some cleanup work
PATCH3:    cap/ecap related compatibility check between vIOMMU and Host IOMMU
PATCH4-14: Implement stage-1 page table for passthrough device
PATCH15:   Enable stage-1 translation for passthrough device

QEMU code can be found at [3].

TODO:
- RAM discard
- dirty tracking on stage-2 page table
- Fault report to guest when HW S1 faults

[1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
[2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v1

Thanks
Zhenzhong

Changelog:
v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master

rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)

rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
  iommu pasid, this is important for dropping VTDPASIDAddressSpace


Yi Liu (3):
  intel_iommu: Replay pasid binds after context cache invalidation
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed

Zhenzhong Duan (12):
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  intel_iommu: Optimize context entry cache utilization
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  intel_iommu: Handle PASID entry removing and updating
  intel_iommu: Handle PASID entry adding
  intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  intel_iommu: Bind/unbind guest page table to host
  intel_iommu: ERRATA_772415 workaround
  intel_iommu: Bypass replay in stage-1 page table mode
  intel_iommu: Enable host device when x-flts=on in scalable mode

 hw/i386/intel_iommu_internal.h |   56 ++
 include/hw/i386/intel_iommu.h  |   33 +-
 hw/i386/intel_iommu.c          | 1659 ++++++++++++++++++++++++++++----
 hw/i386/trace-events           |   13 +
 4 files changed, 1563 insertions(+), 198 deletions(-)

-- 
2.34.1




Thread overview: 36+ messages
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-06-11  7:20   ` Yi Liu
2025-06-17 17:16   ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
2025-06-11  7:48   ` Yi Liu
2025-06-11 10:06     ` Duan, Zhenzhong
2025-06-17 10:57       ` Yi Liu
2025-06-18  1:58         ` Duan, Zhenzhong
2025-06-17 17:24   ` Eric Auger
2025-06-18  2:10     ` Duan, Zhenzhong
2025-06-18  7:08       ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-06-17 17:49   ` Eric Auger
2025-06-18  2:14     ` Duan, Zhenzhong
2025-06-18  7:08       ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-06-12 16:04   ` CLEMENT MATHIEU--DRIF
2025-06-13  9:08     ` Duan, Zhenzhong
2025-06-20  7:08       ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
2025-06-11  9:54   ` Yi Liu
2025-06-11 10:46     ` Duan, Zhenzhong
2025-06-17 10:58       ` Yi Liu
2025-06-06 10:04 ` [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
2025-06-17 12:29   ` Yi Liu
2025-06-18  6:03     ` Duan, Zhenzhong
2025-06-06 10:04 ` [PATCH v1 07/15] intel_iommu: Handle PASID entry adding Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 08/15] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 09/15] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 10/15] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 11/15] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 12/15] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 13/15] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 14/15] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 15/15] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
