From: Yi Liu <yi.l.liu@intel.com>
To: "Duan, Zhenzhong" <zhenzhong.duan@intel.com>,
Nicolin Chen <nicolinc@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
"clg@redhat.com" <clg@redhat.com>,
"eric.auger@redhat.com" <eric.auger@redhat.com>,
"mst@redhat.com" <mst@redhat.com>,
"jasowang@redhat.com" <jasowang@redhat.com>,
"ddutile@redhat.com" <ddutile@redhat.com>,
"jgg@nvidia.com" <jgg@nvidia.com>,
"shameerali.kolothum.thodi@huawei.com"
<shameerali.kolothum.thodi@huawei.com>,
"joao.m.martins@oracle.com" <joao.m.martins@oracle.com>,
"clement.mathieu--drif@eviden.com"
<clement.mathieu--drif@eviden.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
"Peng, Chao P" <chao.p.peng@intel.com>,
Yi Sun <yi.y.sun@linux.intel.com>,
Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Richard Henderson <richard.henderson@linaro.org>,
Eduardo Habkost <eduardo@habkost.net>
Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
Date: Thu, 12 Jun 2025 20:53:40 +0800
Message-ID: <f6baaea1-a60c-41dc-a9a8-d2389ed14679@intel.com>
In-Reply-To: <SJ0PR11MB6744340B889FF65D3BD5B8459267A@SJ0PR11MB6744.namprd11.prod.outlook.com>
On 2025/5/28 15:12, Duan, Zhenzhong wrote:
>
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>> host
>>
>> OK. Let me clarify this at the top as I see the gap here now:
>>
>> First, the vSMMU model is based on Zhenzhong's older series that
>> keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, whereas
>> this RFCv3 series now keeps only an hwpt_id. This ioas_id
>> is allocated when a passthrough cdev attaches to a VFIO container.
>>
>> Second, the vSMMU model reuses the default IOAS via that ioas_id.
>> Since the VFIO container doesn't allocate a nesting parent S2 HWPT
>> (maybe it could?), the vSMMU allocates another S2 HWPT in the
>> vIOMMU code.
>>
>> Third, the vSMMU model, for invalidation efficiency and HW Queue
>> support, isolates all emulated devices out of the nesting-enabled
>> vSMMU instance, suggested by Jason. So, only passthrough devices
>> would use the nesting-enabled vSMMU instance, meaning there is no
>> need of IOMMU_NOTIFIER_IOTLB_EVENTS:
>
> I see, then you need to check whether there is an emulated device under the
> nesting-enabled vSMMU and fail if there is.
>
>> - MAP is not needed as there is no shadow page table. QEMU only
>> traps the page table pointer and forwards it to host kernel.
>> - UNMAP is not needed as QEMU only traps invalidation requests
>> and forwards them to host kernel.
>>
>> (let's forget about the "address space switch" for MSI for now.)
>>
>> So, in the vSMMU model, there is actually no need for the iommu
>> AS. And there is only one IOAS in the VM instance allocated by the
>> VFIO container. And this IOAS manages the GPA->PA mappings. So,
>> get_address_space() returns the system AS for passthrough devices.
>>
>> On the other hand, the VT-d model is a bit different. It's a giant
>> vIOMMU for all devices (either passthrough or emulated). For all
>> emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
>> iommu address space returned via get_address_space().
>>
>> That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
>> for passthrough devices, right?
>
> No, even if x-flts=on is configured on the QEMU cmdline, that only means the
> virtual VT-d supports stage-1 translation; the guest can still choose to run
> in legacy mode (stage-2), e.g., with kernel cmdline intel_iommu=on,sm_off
>
> So before the guest runs, we don't know which kind of page table, stage-1 or
> stage-2, the guest will use for this VFIO device. So we have to use the iommu
> AS to catch stage-2's MAP events if the guest chooses stage-2.
@Zhenzhong, if the guest decides to use legacy mode, then the vIOMMU should
switch the MRs of the device's AS. The IOAS created by the VFIO container
would then start using IOMMU_NOTIFIER_IOTLB_EVENTS, since the MR has been
switched to the IOMMU MR. So it should be able to support shadowing the guest
IO page table. Hence, this should not be a problem.
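To illustrate the switch described above, here is a minimal sketch. It is
not the intel_iommu code: SketchDevAS and sketch_switch_as() are hypothetical
names that only model the resulting state, and treating every non-legacy mode
as "system alias enabled" is a simplifying assumption. In QEMU the toggling
would be done on real MemoryRegion objects, e.g. with
memory_region_set_enabled().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a per-device address space that contains two
 * subregions, of which exactly one is enabled at a time.
 */
typedef struct SketchDevAS {
    bool sys_alias_enabled; /* alias of system memory: plain GPA view */
    bool iommu_mr_enabled;  /* IOMMU MR: MAP/UNMAP events reach notifiers */
} SketchDevAS;

/*
 * Switch the enabled subregion according to the guest's choice.
 * Returns true when the VFIO side has to rely on IOTLB MAP/UNMAP events,
 * i.e. when the guest IO page table is shadowed.
 */
bool sketch_switch_as(SketchDevAS *as, bool guest_legacy_mode)
{
    assert(as != NULL);
    as->iommu_mr_enabled = guest_legacy_mode;
    as->sys_alias_enabled = !guest_legacy_mode;
    return as->iommu_mr_enabled;
}
```

The point of the sketch is that the AS object handed to VFIO never changes;
only the enabled subregion inside it does.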
@Nicolin, I think your major point is to make the VFIO container IOAS a
GPA IOAS (always return the system AS in the get_address_space op) and to
reuse it when setting up nested translation. Is that right? I think it
should work if:
1) we can let the vfio memory listener filter out the RO pages per the
vIOMMU's request. But I don't want the get_address_space op to always
return the system AS, for the reason mentioned by Zhenzhong above.
2) we can disallow emulated and passthru devices behind the same pcie-pci
bridge[1]. For emulated devices, the AS should switch to the iommu MR,
while for passthru devices, the AS needs to stick with the system MR so
that the VFIO container IOAS can be kept as a GPA IOAS. To support that
mixed topology, the AS has to switch to the iommu MR and a separate GPA
IOAS is needed. This separate GPA IOAS can be shared by all the passthru
devices.
[1]
https://lore.kernel.org/all/SJ0PR11MB6744E2BA00BBE677B2B49BE99265A@SJ0PR11MB6744.namprd11.prod.outlook.com/#t
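The two conditions above can be sketched as two small checks. This is only
an illustration under stated assumptions: SketchSection,
SketchListenerPolicy, sketch_should_map() and sketch_bridge_topology_ok()
are hypothetical names, not existing QEMU or VFIO interfaces, and how the
vIOMMU would actually pass the "skip RO" request to the listener is left
open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory section seen by the listener. */
typedef struct SketchSection {
    const char *name;
    uint64_t iova;     /* guest physical address of the section */
    uint64_t size;
    bool readonly;
} SketchSection;

/* Hypothetical per-container policy, set on the vIOMMU's request. */
typedef struct SketchListenerPolicy {
    bool skip_readonly;
} SketchListenerPolicy;

/* Condition 1): decide whether a section is mapped into the GPA IOAS. */
bool sketch_should_map(const SketchListenerPolicy *policy,
                       const SketchSection *sec)
{
    assert(policy != NULL && sec != NULL);
    if (sec->size == 0) {
        return false;               /* nothing to map */
    }
    if (policy->skip_readonly && sec->readonly) {
        return false;               /* RO section filtered out */
    }
    return true;
}

/*
 * Condition 2): devices behind one pcie-pci bridge must be all passthru
 * or all emulated. An empty bridge is accepted.
 */
bool sketch_bridge_topology_ok(const bool *is_passthru, size_t ndevs)
{
    for (size_t i = 1; i < ndevs; i++) {
        if (is_passthru[i] != is_passthru[0]) {
            return false;           /* mixed topology: disallow */
        }
    }
    return true;
}
```

With skip_readonly set, a section like the pc.bios one from the log quoted
further down (iova fffc0000, size 40000, RO) would be skipped, while a
writable RAM section would still be mapped.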
So basically, we are OK with your idea. But we should decide whether it is
necessary to support the topology in 2). I think this is a general
question. TBH, I don't have enough information to judge whether it is
valuable. Perhaps let's hear from more people.
>>
>> IIUC, in the VT-d model, a passthrough device also gets attached
>> to the VFIO container via iommufd_cdev_attach, allocating an IOAS.
>> But it returns the iommu address space, treating them like those
>> emulated devices, although the underlying MR of the returned IOMMU
>> AS is backed by a nodmar MR (that is essentially a system AS).
>>
>> This seems to completely ignore the default IOAS owned by the VFIO
>> container, because it needs to bypass those RO mappings(?)
>>
>> Then for passthrough devices, the VT-d model allocates an internal
>> IOAS that further requires an internal S2 listener, which seems a
>> large duplication of what the VFIO container already does.
>>
>> So, here are things that I want us to conclude:
>> 1) Since the VFIO container already has an IOAS for a passthrough
>> device, and IOMMU_NOTIFIER_IOTLB_EVENTS isn't seemingly needed,
>> why not setup this default IOAS to manage gPA=>PA mappings by
>> returning the system AS via get_address_space() for passthrough
>> devices?
>>
>> I got that the VT-d model might have some concern against this,
>> as the default listener would map those RO regions. Yet, maybe
>> the right approach is to figure out a way to bypass RO regions
>> in the core v.s. duplicating another ioas_alloc()/map() and S2
>> listener?
>>
>> 2) If (1) makes sense, I think we can further simplify the routine
>> by allocating a nesting parent HWPT in iommufd_cdev_attach(),
>> as long as the attaching device is identified as "passthrough"
>> and there is "iommufd" in its "-device" string?
>>
>> After all, IOMMU_HWPT_ALLOC_NEST_PARENT is a common flag.
>>
>> On Mon, May 26, 2025 at 03:24:50PM +0800, Yi Liu wrote:
>>> vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size:
>>> 40000, vaddr: 7fb314200000, RO
>>> vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size:
>>> 20000, vaddr: 7fb206c00000, RO
>> ..
>>> vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size:
>>> 1a000, vaddr: 7fb207ece000, RO
>>
>> OK. They look like memory carveouts for FWs. "iova" is gPA right?
>>
>> And they can be in the range of a guest RAM..
>>
>> Mind elaborating why they shouldn't be mapped onto nesting parent
>> S2?
@Nicolin, it's due to ERRATA_772415.
>>> IMHO, at least for vfio devices, I can see only one get_address_space()
>>> call. So even if there are two ASs, how should vfio be notified when the
>>> AS changes? Since the vIOMMU is the source of map/unmap requests, it looks
>>> fine to always return the iommu AS and handle the AS switch by switching
>>> the enabled subregions according to the guest vIOMMU translation types.
>>
>> No, VFIO doesn't get notified when the AS changes.
>>
>> The vSMMU model wants VFIO to stay in the system AS since the VFIO
>> container manages the S2 mappings for guest PA.
>>
>> The "switch" in the vSMMU model is only needed by KVM for MSI doorbell
>> translation. Thinking about it carefully, maybe it shouldn't switch AS
>> because VFIO might be confused if it somehow does get_address_space
>> again in the future.
@Nicolin, I don't quite get the detailed logic of the MSI handling on SMMU.
But I agree with the last sentence: get_address_space should return a
consistent AS.
--
Regards,
Yi Liu