qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Nicolin Chen <nicolinc@nvidia.com>
To: Yi Liu <yi.l.liu@intel.com>
Cc: Zhenzhong Duan <zhenzhong.duan@intel.com>,
	<qemu-devel@nongnu.org>, <alex.williamson@redhat.com>,
	<clg@redhat.com>, <eric.auger@redhat.com>, <mst@redhat.com>,
	<jasowang@redhat.com>, <peterx@redhat.com>, <ddutile@redhat.com>,
	<jgg@nvidia.com>, <shameerali.kolothum.thodi@huawei.com>,
	<joao.m.martins@oracle.com>, <clement.mathieu--drif@eviden.com>,
	<kevin.tian@intel.com>, <chao.p.peng@intel.com>,
	Yi Sun <yi.y.sun@linux.intel.com>,
	Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Richard Henderson" <richard.henderson@linaro.org>,
	Eduardo Habkost <eduardo@habkost.net>
Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
Date: Fri, 23 May 2025 14:12:52 -0700	[thread overview]
Message-ID: <aDDk1NYwJXaAdUQI@Asurada-Nvidia> (raw)
In-Reply-To: <0f8087f4-0c97-440d-84d2-f3f017f81041@intel.com>

Hey,

Thanks for the reply.

Just want to say that I am asking a lot to understand why VT-d is
different than ARM, so as to decide whether ARM should follow VT-d
implementing a separate listener or just use the VFIO listener.

On Fri, May 23, 2025 at 02:22:15PM +0800, Yi Liu wrote:
> Hey Nic,
> 
> On 2025/5/22 06:49, Nicolin Chen wrote:
> > On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
> > > +static const MemoryListener iommufd_s2domain_memory_listener = {
> > > +    .name = "iommufd_s2domain",
> > > +    .priority = 1000,
> > > +    .region_add = iommufd_listener_region_add_s2domain,
> > > +    .region_del = iommufd_listener_region_del_s2domain,
> > > +};
> > 
> > Would you mind elaborating When and how vtd does all S2 mappings?
> > 
> > On ARM, the default vfio_memory_listener could capture the entire
> > guest RAM and add to the address space. So what we do is basically
> > reusing the vfio_memory_listener:
> > https://lore.kernel.org/qemu-devel/20250311141045.66620-13-shameerali.kolothum.thodi@huawei.com/
> 
> in concept yes, all the guest ram. but due to an errata, we need
> to skip the RO mappings.

Mind elaborating what are RO mappings? Can those be possible within
the range of the RAM?

> > The thing is that when a VFIO device is attached to the container
> > upon a nesting configuration, the ->get_address_space op should
> > return the system address space as S1 nested HWPT isn't allocated
> > yet. Then all the iommu as routines in vfio_listener_region_add()
> > would be skipped, ending up with mapping the guest RAM in S2 HWPT
> > correctly. Not until the S1 nested HWPT is allocated by the guest
> > OS (after guest boots), can the ->get_address_space op return the
> > iommu address space.
> 
> This seems a bit different between ARM and VT-d emulation. The VT-d
> emulation code returns the iommu address space regardless of what
> translation mode guest configured. But the MR of the address space
> has two overlapped subregions, one is nodmar, another one is iommu.
> As the naming shows, the nodmar is aliased to the system MR.

OK. But why two overlapped subregions v.s. two separate two ASs?

> And before
> the guest enables iommu and set PGTT to a non-PT mode (e.g. S1 or S2),
> the effective MR alias is the nodmar, hence the mapping this address
> space holds are the GPA mappings in the beginning.

I think this is same on ARM, where get_address_space() may return
system address space. And for VT-d, it actually returns the range
of the system address space (just though a sub MR of an iommu AS),
right? 

> If guest set PGTT to S2,
> then the iommu MR is enabled, hence the mapping is gIOVA mappings
> accordingly. So in VT-d emulation, the address space switch is more the MR
> alias switching.

Zhenzhong said that there is no shadow page table for the nesting
setup, i.e. gIOVA=>gPA mappings are entirely done by the guest OS.

Then, why does VT-d need to switch to the iommu MR here?

> In this series, we mainly want to support S1 translation type for guest.
> And it is based on nested translation, which needs a S2 domain that holds
> the GPA mappings. Besides S1 translation type, PT is also supported. Both
> the two types need a S2 domain which already holds GPA mappings. So we have
> this internal listener.

Hmm, the reasoning to the last "so" doesn't sound enough. The VFIO
listener could do the same...

> Also, we want to skip RO mappings on S2, so that's
> another reason for it.  @Zhenzhong, perhaps, it can be described in the
> commit message why an internal listener is introduced.

OK. I think that can be a good reason to have an internal listener,
only if VFIO can't skip the RO mappings.

> > So the second question is:
> > Does vtd have to own this iommufd_s2domain_memory_listener? IOW,
> 
> yes based on the current design. when guest GPTT==PT, attach device
> to S2 hwpt, when it goes to S1, then attach it to a S1 hwpt whose
> parent is the aforementioned S2 hwpt. This S2 hwpt is always there
> for use.

ARM is doing the same thing. And the exact point "this S2 hwpt is
always there for use" has been telling me that the device can just
stay at the S2 address space (system), since the guest kernel will
take care of the S1 address space (iommu).

Overall, the questions here have been two-fold:

1.Why does VT-d need an internal listener?

  I can see the (only) reason is for the RO mappings.

  Yet, Is there anything that we can do to the VFIO listener to
  bypass these RO mappings?

2.Why not return the system AS all the time when nesting is on?
  Why switch to the iommu AS when device attaches to S1 HWPT?

  For ARM, MSI requires a translation so it has to; but MSI on
  VT-d doesn't. So, I couldn't see why VT-d needs to return the
  iommu AS via get_address_space().

  However, combining the question-1, my gut feeling is that, VT-d
  needs to skip RO mappings for errata, while the VFIO listener
  can't do that. So, VT-d has to create its own listener. And to
  avoid duplicated mappings on the same address space, it has to
  bypass the VFIO listener by working it around with an IOMMU AS.
  Is this correct?

Thanks
Nic


  parent reply	other threads:[~2025-05-23 21:18 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 01/21] backends/iommufd: Add a helper to invalidate user-managed HWPT Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 02/21] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 03/21] vfio/iommufd: Initialize iommufd specific members in HostIOMMUDeviceIOMMUFD Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 04/21] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info Zhenzhong Duan
2025-05-21 21:57   ` Nicolin Chen
2025-05-22  9:21     ` Duan, Zhenzhong
2025-05-22 19:35       ` Nicolin Chen
2025-05-26 12:15   ` Cédric Le Goater
2025-05-27  2:12     ` Duan, Zhenzhong
2025-05-21 11:14 ` [PATCH rfcv3 06/21] iommufd: Implement query of host VTD IOMMU's capability Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 07/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 08/21] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 09/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 10/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 11/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 12/21] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 13/21] intel_iommu: Handle PASID entry adding Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 14/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-05-21 22:49   ` Nicolin Chen
2025-05-22  6:50     ` Duan, Zhenzhong
2025-05-22 19:29       ` Nicolin Chen
2025-05-23  6:26         ` Yi Liu
2025-05-26  3:34         ` Duan, Zhenzhong
2025-05-23  6:22     ` Yi Liu
2025-05-23  6:52       ` Duan, Zhenzhong
2025-05-23 21:12       ` Nicolin Chen [this message]
2025-05-26  3:46         ` Duan, Zhenzhong
2025-05-26  7:24         ` Yi Liu
2025-05-26 17:35           ` Nicolin Chen
2025-05-28  7:12             ` Duan, Zhenzhong
2025-06-12 12:53               ` Yi Liu
2025-06-12 14:06                 ` Shameerali Kolothum Thodi via
2025-06-16  6:04                   ` Nicolin Chen
2025-06-16  3:24                 ` Duan, Zhenzhong
2025-06-16  6:34                   ` Nicolin Chen
2025-06-16  8:54                     ` Duan, Zhenzhong
2025-06-16  9:36                       ` Yi Liu
2025-06-16 10:16                         ` Duan, Zhenzhong
2025-06-17  7:04                           ` Yi Liu
2025-06-16  5:59                 ` Nicolin Chen
2025-06-16  7:38                   ` Yi Liu
2025-06-17  3:22                     ` Nicolin Chen
2025-06-17  6:48                       ` Yi Liu
2025-06-16  5:47               ` Nicolin Chen
2025-06-16  8:15                 ` Duan, Zhenzhong
2025-06-17  3:14                   ` Nicolin Chen
2025-06-17 12:37                     ` Jason Gunthorpe
2025-06-17 13:03                       ` Yi Liu
2025-06-17 13:11                         ` Jason Gunthorpe
2025-06-18  2:51                           ` Duan, Zhenzhong
2025-06-18  3:40                           ` Yi Liu
2025-06-18 11:43                             ` Jason Gunthorpe
2025-05-21 11:14 ` [PATCH rfcv3 16/21] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 17/21] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 18/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 19/21] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 20/21] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-05-26 12:19 ` [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Cédric Le Goater
2025-05-27  2:16   ` Duan, Zhenzhong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aDDk1NYwJXaAdUQI@Asurada-Nvidia \
    --to=nicolinc@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=chao.p.peng@intel.com \
    --cc=clement.mathieu--drif@eviden.com \
    --cc=clg@redhat.com \
    --cc=ddutile@redhat.com \
    --cc=eduardo@habkost.net \
    --cc=eric.auger@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=joao.m.martins@oracle.com \
    --cc=kevin.tian@intel.com \
    --cc=marcel.apfelbaum@gmail.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=yi.l.liu@intel.com \
    --cc=yi.y.sun@linux.intel.com \
    --cc=zhenzhong.duan@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).