From: Nicolin Chen <nicolinc@nvidia.com>
To: Yi Liu <yi.l.liu@intel.com>
Cc: Peter Xu <peterx@redhat.com>,
Zhenzhong Duan <zhenzhong.duan@intel.com>,
<qemu-devel@nongnu.org>, <alex.williamson@redhat.com>,
<clg@redhat.com>, <eric.auger@redhat.com>, <mst@redhat.com>,
<jasowang@redhat.com>, <ddutile@redhat.com>, <jgg@nvidia.com>,
<shameerali.kolothum.thodi@huawei.com>,
<joao.m.martins@oracle.com>, <clement.mathieu--drif@eviden.com>,
<kevin.tian@intel.com>, <chao.p.peng@intel.com>,
Yi Sun <yi.y.sun@linux.intel.com>,
Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
Paolo Bonzini <pbonzini@redhat.com>,
"Richard Henderson" <richard.henderson@linaro.org>,
Eduardo Habkost <eduardo@habkost.net>
Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
Date: Mon, 26 May 2025 10:35:46 -0700 [thread overview]
Message-ID: <aDSmcvZ08jNOSr05@Asurada-Nvidia> (raw)
In-Reply-To: <29f5f434-1fe3-4b5e-91d1-f153e1e98602@intel.com>
OK. Let me clarify this at the top as I see the gap here now:
First, the vSMMU model is based on Zhenzhong's older series that
keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, which
now it only keeps an hwpt_id in this RFCv3 series. This ioas_id
is allocated when a passthrough cdev attaches to a VFIO container.
Second, the vSMMU model reuses the default IOAS via that ioas_id.
Since the VFIO container doesn't allocate a nesting parent S2 HWPT
(maybe it could?), so the vSMMU allocates another S2 HWPT in the
vIOMMU code.
Third, the vSMMU model, for invalidation efficiency and HW Queue
support, isolates all emulated devices out of the nesting-enabled
vSMMU instance, suggested by Jason. So, only passthrough devices
would use the nesting-enabled vSMMU instance, meaning there is no
need of IOMMU_NOTIFIER_IOTLB_EVENTS:
- MAP is not needed as there is no shadow page table. QEMU only
traps the page table pointer and forwards it to host kernel.
- UNMAP is not needed as QEMU only traps invalidation requests
and forwards them to host kernel.
(let's forget about the "address space switch" for MSI for now.)
So, in the vSMMU model, there is actually no need for the iommu
AS. And there is only one IOAS in the VM instance allocated by the
VFIO container. And this IOAS manages the GPA->PA mappings. So,
get_address_space() returns the system AS for passthrough devices.
On the other hand, the VT-d model is a bit different. It's a giant
vIOMMU for all devices (either passthrough or emualted). For all
emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
iommu address space returned via get_address_space().
That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
for passthrough devices, right?
IIUIC, in the VT-d model, a passthrough device also gets attached
to the VFIO container via iommufd_cdev_attach, allocating an IOAS.
But it returns the iommu address space, treating them like those
emulated devices, although the underlying MR of the returned IOMMU
AS is backed by a nodmar MR (that is essentially a system AS).
This seems to completely ignore the default IOAS owned by the VFIO
container, because it needs to bypass those RO mappings(?)
Then for passthrough devices, the VT-d model allocates an internal
IOAS that further requires an internal S2 listener, which seems an
large duplication of what the VFIO container already does..
So, here are things that I want us to conclude:
1) Since the VFIO container already has an IOAS for a passthrough
device, and IOMMU_NOTIFIER_IOTLB_EVENTS isn't seemingly needed,
why not setup this default IOAS to manage gPA=>PA mappings by
returning the system AS via get_address_space() for passthrough
devices?
I got that the VT-d model might have some concern against this,
as the default listener would map those RO regions. Yet, maybe
the right approach is to figure out a way to bypass RO regions
in the core v.s. duplicating another ioas_alloc()/map() and S2
listener?
2) If (1) makes sense, I think we can further simplify the routine
by allocating a nesting parent HWPT in iommufd_cdev_attach(),
as long as the attaching device is identified as "passthrough"
and there is "iommufd" in its "-device" string?
After all, IOMMU_HWPT_ALLOC_NEST_PARENT is a common flag.
On Mon, May 26, 2025 at 03:24:50PM +0800, Yi Liu wrote:
> vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size:
> 40000, vaddr: 7fb314200000, RO
> vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size:
> 20000, vaddr: 7fb206c00000, RO
..
> vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size:
> 1a000, vaddr: 7fb207ece000, RO
OK. They look like memory carveouts for FWs. "iova" is gPA right?
And they can be in the range of a guest RAM..
Mind elaborating why they shouldn't be mapped onto nesting parent
S2?
> IMHO. At least for vfio devices, I can see only one get_address_space()
> call. So even there are two ASs, how should the vfio be notified when the
> AS changed? Since vIOMMU is the source of map/umap requests, it looks fine
> to always return iommu AS and handle the AS switch by switching the enabled
> subregions according to the guest vIOMMU translation types.
No, VFIO doesn't get notified when the AS changes.
The vSMMU model wants VFIO to stay in the system AS since the VFIO
container manages the S2 mappings for guest PA.
The "switch" in vSMMU model is only needed by KVM for MSI doorbell
translation. By thinking it carefully, maybe it shouldn't switch AS
because VFIO might be confused if it somehow does get_address_space
again in the future..
Thanks
Nic
next prev parent reply other threads:[~2025-05-26 17:41 UTC|newest]
Thread overview: 63+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 01/21] backends/iommufd: Add a helper to invalidate user-managed HWPT Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 02/21] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 03/21] vfio/iommufd: Initialize iommufd specific members in HostIOMMUDeviceIOMMUFD Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 04/21] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info Zhenzhong Duan
2025-05-21 21:57 ` Nicolin Chen
2025-05-22 9:21 ` Duan, Zhenzhong
2025-05-22 19:35 ` Nicolin Chen
2025-05-26 12:15 ` Cédric Le Goater
2025-05-27 2:12 ` Duan, Zhenzhong
2025-05-21 11:14 ` [PATCH rfcv3 06/21] iommufd: Implement query of host VTD IOMMU's capability Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 07/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 08/21] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 09/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 10/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 11/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 12/21] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 13/21] intel_iommu: Handle PASID entry adding Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 14/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-05-21 22:49 ` Nicolin Chen
2025-05-22 6:50 ` Duan, Zhenzhong
2025-05-22 19:29 ` Nicolin Chen
2025-05-23 6:26 ` Yi Liu
2025-05-26 3:34 ` Duan, Zhenzhong
2025-05-23 6:22 ` Yi Liu
2025-05-23 6:52 ` Duan, Zhenzhong
2025-05-23 21:12 ` Nicolin Chen
2025-05-26 3:46 ` Duan, Zhenzhong
2025-05-26 7:24 ` Yi Liu
2025-05-26 17:35 ` Nicolin Chen [this message]
2025-05-28 7:12 ` Duan, Zhenzhong
2025-06-12 12:53 ` Yi Liu
2025-06-12 14:06 ` Shameerali Kolothum Thodi via
2025-06-16 6:04 ` Nicolin Chen
2025-06-16 3:24 ` Duan, Zhenzhong
2025-06-16 6:34 ` Nicolin Chen
2025-06-16 8:54 ` Duan, Zhenzhong
2025-06-16 9:36 ` Yi Liu
2025-06-16 10:16 ` Duan, Zhenzhong
2025-06-17 7:04 ` Yi Liu
2025-06-16 5:59 ` Nicolin Chen
2025-06-16 7:38 ` Yi Liu
2025-06-17 3:22 ` Nicolin Chen
2025-06-17 6:48 ` Yi Liu
2025-06-16 5:47 ` Nicolin Chen
2025-06-16 8:15 ` Duan, Zhenzhong
2025-06-17 3:14 ` Nicolin Chen
2025-06-17 12:37 ` Jason Gunthorpe
2025-06-17 13:03 ` Yi Liu
2025-06-17 13:11 ` Jason Gunthorpe
2025-06-18 2:51 ` Duan, Zhenzhong
2025-06-18 3:40 ` Yi Liu
2025-06-18 11:43 ` Jason Gunthorpe
2025-05-21 11:14 ` [PATCH rfcv3 16/21] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 17/21] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 18/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 19/21] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 20/21] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-05-26 12:19 ` [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Cédric Le Goater
2025-05-27 2:16 ` Duan, Zhenzhong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aDSmcvZ08jNOSr05@Asurada-Nvidia \
--to=nicolinc@nvidia.com \
--cc=alex.williamson@redhat.com \
--cc=chao.p.peng@intel.com \
--cc=clement.mathieu--drif@eviden.com \
--cc=clg@redhat.com \
--cc=ddutile@redhat.com \
--cc=eduardo@habkost.net \
--cc=eric.auger@redhat.com \
--cc=jasowang@redhat.com \
--cc=jgg@nvidia.com \
--cc=joao.m.martins@oracle.com \
--cc=kevin.tian@intel.com \
--cc=marcel.apfelbaum@gmail.com \
--cc=mst@redhat.com \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=richard.henderson@linaro.org \
--cc=shameerali.kolothum.thodi@huawei.com \
--cc=yi.l.liu@intel.com \
--cc=yi.y.sun@linux.intel.com \
--cc=zhenzhong.duan@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).