* Multiple vIOMMU instance support in QEMU?
@ 2023-04-24 23:42 Nicolin Chen
  2023-05-18  3:22 ` Nicolin Chen
  0 siblings, 1 reply; 10+ messages in thread
From: Nicolin Chen @ 2023-04-24 23:42 UTC (permalink / raw)
  To: eric.auger, peter.maydell, qemu-devel, qemu-arm; +Cc: jgg, yi.l.liu, kevin.tian

Hi all,

(Please feel free to include related folks into this thread.)

In light of the ongoing nested-IOMMU support effort via IOMMUFD, we
will likely need multi-vIOMMU support in QEMU, or more specifically
multi-vSMMU support for underlying HW that has multiple physical
SMMUs. This would be used in the following use cases:
 1) Multiple physical SMMUs with different feature bits, which a
    single vSMMU enabling a nesting configuration cannot reflect
    properly.
 2) The NVIDIA Grace CPU has a VCMDQ HW extension for the SMMU CMDQ.
    Every VCMDQ HW has an MMIO region (CONS and PROD indexes) that
    should be exposed to a VM, so that a hypervisor can avoid trapping
    by using this HW accelerator for performance. However, a single
    vSMMU cannot mmap multiple MMIO regions from multiple pSMMUs.
 3) With the latest iommufd design, a single vIOMMU model shares the
    same stage-2 HW pagetable across all physical SMMUs with a shared
    VMID. A stage-1 pagetable invalidation (for one device) at the
    vSMMU would then have to be broadcast to all the SMMU instances,
    which would hurt the overall performance (see the sketch below).
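
To make 3) more concrete, here is a rough C sketch of the invalidation
fan-out. The types and the psmmu_issue_tlbi() helper are hypothetical
placeholders, not actual QEMU or kernel interfaces:

/* Hypothetical sketch only; types and helpers are illustrative. */
#include <inttypes.h>
#include <stdio.h>

typedef struct PhysSMMU {
    int id;                 /* index of the physical SMMU instance */
} PhysSMMU;

/* Placeholder for whatever forwards a guest TLBI to one physical SMMU
 * (e.g. an iommufd cache-invalidation request). */
static void psmmu_issue_tlbi(PhysSMMU *s, uint32_t asid, uint64_t iova)
{
    printf("TLBI asid=%" PRIu32 " iova=0x%" PRIx64 " -> pSMMU%d\n",
           asid, iova, s->id);
}

/* With a single vSMMU, the VMM cannot tell which physical SMMU backs
 * the device a guest stage-1 invalidation targets, so it has to fan the
 * command out to every instance. */
static void single_vsmmu_invalidate(PhysSMMU **psmmu, int n,
                                    uint32_t asid, uint64_t iova)
{
    for (int i = 0; i < n; i++) {
        psmmu_issue_tlbi(psmmu[i], asid, iova);
    }
}

/* With one vSMMU per physical SMMU, each guest invalidation already
 * arrives at the right instance and no broadcast is needed. */
static void per_psmmu_invalidate(PhysSMMU *psmmu, uint32_t asid, uint64_t iova)
{
    psmmu_issue_tlbi(psmmu, asid, iova);
}

int main(void)
{
    PhysSMMU s0 = { 0 }, s1 = { 1 };
    PhysSMMU *all[] = { &s0, &s1 };

    single_vsmmu_invalidate(all, 2, 5, 0x1000);     /* broadcast to both */
    per_psmmu_invalidate(&s1, 5, 0x1000);           /* targeted */
    return 0;
}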

I previously discussed this topic with Eric in a private email. Eric
noted the difficulty of implementing this in the current QEMU system,
as it would touch different subsystems such as IORT and platform
devices, since the passthrough devices would be attached to different
vIOMMUs.

Yet, given the situations above, it is likely best to duplicate the
vIOMMU instance to match the number of physical SMMU instances.

So, I am sending this email to collect opinions on this and see what
a potential TODO list would be if we decide to go down this path.

Thanks
Nicolin



* Re: Multiple vIOMMU instance support in QEMU?
  2023-04-24 23:42 Multiple vIOMMU instance support in QEMU? Nicolin Chen
@ 2023-05-18  3:22 ` Nicolin Chen
  2023-05-18  9:06   ` Eric Auger
  0 siblings, 1 reply; 10+ messages in thread
From: Nicolin Chen @ 2023-05-18  3:22 UTC (permalink / raw)
  To: eric.auger, peter.maydell, qemu-devel, qemu-arm; +Cc: jgg, yi.l.liu, kevin.tian

Hi Peter,

Eric previously mentioned that you might not like the idea.
Before we start this big effort, would it be possible for you
to comment a word or two on this topic?

Thanks!

On Mon, Apr 24, 2023 at 04:42:57PM -0700, Nicolin Chen wrote:
> Hi all,
> 
> (Please feel free to include related folks into this thread.)
> 
> In light of an ongoing nested-IOMMU support effort via IOMMUFD, we
> would likely have a need of a multi-vIOMMU support in QEMU, or more
> specificly a multi-vSMMU support for an underlying HW that has multi
> physical SMMUs. This would be used in the following use cases.
>  1) Multiple physical SMMUs with different feature bits so that one
>     vSMMU enabling a nesting configuration cannot reflect properly.
>  2) NVIDIA Grace CPU has a VCMDQ HW extension for SMMU CMDQ. Every
>     VCMDQ HW has an MMIO region (CONS and PROD indexes) that should
>     be exposed to a VM, so that a hypervisor can avoid trappings by
>     using this HW accelerator for performance. However, one single
>     vSMMU cannot mmap multiple MMIO regions from multiple pSMMUs.
>  3) With the latest iommufd design, a single vIOMMU model shares the
>     same stage-2 HW pagetable across all physical SMMUs with a shared
>     VMID. Then a stage-1 pagetable invalidation (for one device) at
>     the vSMMU would have to be broadcasted to all the SMMU instances,
>     which would hurt the overall performance.
> 
> I previously discussed with Eric this topic in a private email. Eric
> felt the difficulty of implementing this in the current QEMU system,
> as it would touch different subsystems like IORT and platform device,
> since the passthrough devices would be attached to different vIOMMUs.
> 
> Yet, given the situations above, it's likely the best by duplicating
> the vIOMMU instance corresponding to the number of the physical SMMU
> instances.
> 
> So, I am sending this email to collect opinions on this and see what
> would be a potential TODO list if we decide to go on this path.
> 
> Thanks
> Nicolin



* Re: Multiple vIOMMU instance support in QEMU?
  2023-05-18  3:22 ` Nicolin Chen
@ 2023-05-18  9:06   ` Eric Auger
  2023-05-18 14:16     ` Peter Xu
  2023-05-18 17:39     ` Nicolin Chen
  0 siblings, 2 replies; 10+ messages in thread
From: Eric Auger @ 2023-05-18  9:06 UTC (permalink / raw)
  To: Nicolin Chen, peter.maydell, qemu-devel, qemu-arm
  Cc: jgg, yi.l.liu, kevin.tian, Peter Xu

Hi Nicolin,

On 5/18/23 05:22, Nicolin Chen wrote:
> Hi Peter,
>
> Eric previously mentioned that you might not like the idea.
> Before we start this big effort, would it possible for you
> to comment a word or two on this topic?
>
> Thanks!
>
> On Mon, Apr 24, 2023 at 04:42:57PM -0700, Nicolin Chen wrote:
>> Hi all,
>>
>> (Please feel free to include related folks into this thread.)
>>
>> In light of an ongoing nested-IOMMU support effort via IOMMUFD, we
>> would likely have a need of a multi-vIOMMU support in QEMU, or more
>> specificly a multi-vSMMU support for an underlying HW that has multi
>> physical SMMUs. This would be used in the following use cases.
>>  1) Multiple physical SMMUs with different feature bits so that one
>>     vSMMU enabling a nesting configuration cannot reflect properly.
>>  2) NVIDIA Grace CPU has a VCMDQ HW extension for SMMU CMDQ. Every
>>     VCMDQ HW has an MMIO region (CONS and PROD indexes) that should
>>     be exposed to a VM, so that a hypervisor can avoid trappings by
>>     using this HW accelerator for performance. However, one single
>>     vSMMU cannot mmap multiple MMIO regions from multiple pSMMUs.
>>  3) With the latest iommufd design, a single vIOMMU model shares the
>>     same stage-2 HW pagetable across all physical SMMUs with a shared
>>     VMID. Then a stage-1 pagetable invalidation (for one device) at
>>     the vSMMU would have to be broadcasted to all the SMMU instances,
>>     which would hurt the overall performance.
Well, if there is a real production use case behind the requirement of
having multiple vSMMUs (and more generally vIOMMUs), sure, you can go
ahead. I just wanted to warn you that, as far as I know, multiple
vIOMMUs are not supported even on Intel IOMMU and virtio-iommu. Let's
add Peter Xu in CC. I foresee added complexity with regard to how you
define the RID scope of each vIOMMU, ACPI table generation, the impact
on arm-virt machine options, how you pass the features associated with
each instance, and the impact on notifier propagation. And that is
without even mentioning the VCMDQ feature addition. We are still far
from having a single QEMU nested-stage SMMU implementation at the
moment, but I understand you may want to feed the pipeline to pave the
way for enhanced use cases.

Thanks

Eric
>>
>> I previously discussed with Eric this topic in a private email. Eric
>> felt the difficulty of implementing this in the current QEMU system,
>> as it would touch different subsystems like IORT and platform device,
>> since the passthrough devices would be attached to different vIOMMUs.
>>
>> Yet, given the situations above, it's likely the best by duplicating
>> the vIOMMU instance corresponding to the number of the physical SMMU
>> instances.
>>
>> So, I am sending this email to collect opinions on this and see what
>> would be a potential TODO list if we decide to go on this path.
>>
>> Thanks
>> Nicolin




* Re: Multiple vIOMMU instance support in QEMU?
  2023-05-18  9:06   ` Eric Auger
@ 2023-05-18 14:16     ` Peter Xu
  2023-05-18 14:56       ` Jason Gunthorpe
  2023-05-18 17:39     ` Nicolin Chen
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Xu @ 2023-05-18 14:16 UTC (permalink / raw)
  To: Eric Auger
  Cc: Nicolin Chen, peter.maydell, qemu-devel, qemu-arm, jgg, yi.l.liu,
	kevin.tian

On Thu, May 18, 2023 at 11:06:50AM +0200, Eric Auger wrote:
> Hi Nicolin,
> 
> On 5/18/23 05:22, Nicolin Chen wrote:
> > Hi Peter,
> >
> > Eric previously mentioned that you might not like the idea.
> > Before we start this big effort, would it possible for you
> > to comment a word or two on this topic?
> >
> > Thanks!
> >
> > On Mon, Apr 24, 2023 at 04:42:57PM -0700, Nicolin Chen wrote:
> >> Hi all,
> >>
> >> (Please feel free to include related folks into this thread.)
> >>
> >> In light of an ongoing nested-IOMMU support effort via IOMMUFD, we
> >> would likely have a need of a multi-vIOMMU support in QEMU, or more
> >> specificly a multi-vSMMU support for an underlying HW that has multi
> >> physical SMMUs. This would be used in the following use cases.
> >>  1) Multiple physical SMMUs with different feature bits so that one
> >>     vSMMU enabling a nesting configuration cannot reflect properly.
> >>  2) NVIDIA Grace CPU has a VCMDQ HW extension for SMMU CMDQ. Every
> >>     VCMDQ HW has an MMIO region (CONS and PROD indexes) that should
> >>     be exposed to a VM, so that a hypervisor can avoid trappings by
> >>     using this HW accelerator for performance. However, one single
> >>     vSMMU cannot mmap multiple MMIO regions from multiple pSMMUs.
> >>  3) With the latest iommufd design, a single vIOMMU model shares the
> >>     same stage-2 HW pagetable across all physical SMMUs with a shared
> >>     VMID. Then a stage-1 pagetable invalidation (for one device) at
> >>     the vSMMU would have to be broadcasted to all the SMMU instances,
> >>     which would hurt the overall performance.
> Well if there is a real production use case behind the requirement of
> having mutliple vSMMUs (and more generally vIOMMUs) sure you can go
> ahead. I just wanted to warn you that as far as I know multiple vIOMMUS
> are not supported even on Intel iommu and virtio-iommu. Let's add Peter
> Xu in CC. I foresee added complexicity with regard to how you define the
> RID scope of each vIOMMU, ACPI table generation, impact on arm-virt
> machine options, how you pass the feature associated to each instance,
> notifier propagation impact? And I don't evoke the VCMDQ feat addition.
> We are still far from having a singleton QEMU nested stage SMMU
> implementation at the moment but I understand you may want to feed the
> pipeline to pave the way for enhanced use cases.

I agree with Eric that we're still lacking quite a few things for >1
vIOMMU support, AFAIK.

What you mentioned above makes sense to me from the POV that 1 vIOMMU
may not suffice, but that's at least a totally new area to me, because
I have never used more than one IOMMU even on bare metal (excluding the
case I'm aware of where, e.g., a GPU could have its own IOMMU-like DMA
translator).

What's the system layout of your multi-vIOMMU world?  Is there still a
central vIOMMU, or can multiple vIOMMUs run fully in parallel, so that
e.g. we can have DEV1,DEV2 under vIOMMU1 and DEV3,DEV4 under vIOMMU2?
Can a vIOMMU get involved in any plug/unplug dynamically, in any form?
What else could be different in that regard?

Is it a common hardware layout, or is it NVIDIA-specific?

Thanks,

> 
> Thanks
> 
> Eric
> >>
> >> I previously discussed with Eric this topic in a private email. Eric
> >> felt the difficulty of implementing this in the current QEMU system,
> >> as it would touch different subsystems like IORT and platform device,
> >> since the passthrough devices would be attached to different vIOMMUs.
> >>
> >> Yet, given the situations above, it's likely the best by duplicating
> >> the vIOMMU instance corresponding to the number of the physical SMMU
> >> instances.
> >>
> >> So, I am sending this email to collect opinions on this and see what
> >> would be a potential TODO list if we decide to go on this path.

-- 
Peter Xu




* Re: Multiple vIOMMU instance support in QEMU?
  2023-05-18 14:16     ` Peter Xu
@ 2023-05-18 14:56       ` Jason Gunthorpe
  2023-05-18 19:45         ` Peter Xu
  2023-05-18 22:56         ` Tian, Kevin
  0 siblings, 2 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2023-05-18 14:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eric Auger, Nicolin Chen, peter.maydell, qemu-devel, qemu-arm,
	yi.l.liu, kevin.tian

On Thu, May 18, 2023 at 10:16:24AM -0400, Peter Xu wrote:

> What you mentioned above makes sense to me from the POV that 1 vIOMMU may
> not suffice, but that's at least totally new area to me because I never
> used >1 IOMMUs even bare metal (excluding the case where I'm aware that
> e.g. a GPU could have its own IOMMU-like dma translator).

Even x86 systems are multi-IOMMU, with one IOMMU per physical CPU socket.

I'm not sure how they model this, though - Kevin, do you know? Do we
get multiple IOMMU instances in Linux, or is all the broadcasting of
invalidates and sharing of tables hidden?

> What's the system layout of your multi-vIOMMU world?  Is there still a
> centric vIOMMU, or multi-vIOMMUs can run fully in parallel, so that e.g. we
> can have DEV1,DEV2 under vIOMMU1 and DEV3,DEV4 under vIOMMU2?

Just like the physical ones, each vIOMMU is parallel and independent.
Each has its own caches, ASIDs, DIDs/etc and thus its own invalidation
domains.

The separate caches are the motivating reason to do this, as something
like vCMDQ is a direct command channel for invalidations to only the
caches of a single IOMMU block.

> Is it a common hardware layout or nVidia specific?

I think it is pretty normal: you have multiple copies of the IOMMU and
its caches for physical reasons.

The only choice is whether the platform HW somehow routes invalidations
to all IOMMUs or requires SW to route/replicate the invalidates.

ARM's IP seems to be designed toward the latter, so I expect it is
going to be common on ARM.

Jason



* Re: Multiple vIOMMU instance support in QEMU?
  2023-05-18  9:06   ` Eric Auger
  2023-05-18 14:16     ` Peter Xu
@ 2023-05-18 17:39     ` Nicolin Chen
  1 sibling, 0 replies; 10+ messages in thread
From: Nicolin Chen @ 2023-05-18 17:39 UTC (permalink / raw)
  To: Eric Auger
  Cc: peter.maydell, qemu-devel, qemu-arm, jgg, yi.l.liu, kevin.tian,
	Peter Xu

Hi Eric,

On Thu, May 18, 2023 at 11:06:50AM +0200, Eric Auger wrote:
> > On Mon, Apr 24, 2023 at 04:42:57PM -0700, Nicolin Chen wrote:
> >> Hi all,
> >>
> >> (Please feel free to include related folks into this thread.)
> >>
> >> In light of an ongoing nested-IOMMU support effort via IOMMUFD, we
> >> would likely have a need of a multi-vIOMMU support in QEMU, or more
> >> specificly a multi-vSMMU support for an underlying HW that has multi
> >> physical SMMUs. This would be used in the following use cases.
> >>  1) Multiple physical SMMUs with different feature bits so that one
> >>     vSMMU enabling a nesting configuration cannot reflect properly.
> >>  2) NVIDIA Grace CPU has a VCMDQ HW extension for SMMU CMDQ. Every
> >>     VCMDQ HW has an MMIO region (CONS and PROD indexes) that should
> >>     be exposed to a VM, so that a hypervisor can avoid trappings by
> >>     using this HW accelerator for performance. However, one single
> >>     vSMMU cannot mmap multiple MMIO regions from multiple pSMMUs.
> >>  3) With the latest iommufd design, a single vIOMMU model shares the
> >>     same stage-2 HW pagetable across all physical SMMUs with a shared
> >>     VMID. Then a stage-1 pagetable invalidation (for one device) at
> >>     the vSMMU would have to be broadcasted to all the SMMU instances,
> >>     which would hurt the overall performance.

> Well if there is a real production use case behind the requirement of
> having mutliple vSMMUs (and more generally vIOMMUs) sure you can go
> ahead. I just wanted to warn you that as far as I know multiple vIOMMUS
> are not supported even on Intel iommu and virtio-iommu. Let's add Peter
> Xu in CC. I foresee added complexicity with regard to how you define the
> RID scope of each vIOMMU, ACPI table generation, impact on arm-virt
> machine options, how you pass the feature associated to each instance,
> notifier propagation impact? And I don't evoke the VCMDQ feat addition.

Yea. A solution could be having multiple PCI buses/bridges behind
multiple vSMMUs, each taking different IORT ID mappings. This will
touch a few parts, as you foresee here.
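
Just to illustrate the IORT side of it, a simplified sketch of an ID
mapping entry -- this is not the exact table layout and the values are
made up, but it shows how two root complexes can route their RIDs to
two different SMMUv3 nodes:

/* Simplified illustration of an IORT ID mapping entry; not the exact
 * ACPI layout, and all values below are made up. */
#include <stdint.h>

typedef struct IortIdMapping {
    uint32_t input_base;   /* first RID (BDF) produced by this root complex */
    uint32_t id_count;     /* size of the RID range */
    uint32_t output_base;  /* first StreamID at the referenced SMMU node */
    uint32_t output_ref;   /* offset of the target SMMUv3 node in the IORT */
    uint32_t flags;
} IortIdMapping;

/* Two PCIe root complexes, each routed to its own vSMMU node, so the
 * devices behind bus hierarchy 0 use vSMMU0 and those behind bus
 * hierarchy 1 use vSMMU1. */
static const IortIdMapping rc0_to_smmu0 = { 0x0000, 0x100, 0x0000, 0x0090, 0 };
static const IortIdMapping rc1_to_smmu1 = { 0x0000, 0x100, 0x0000, 0x0150, 0 };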

W.r.t. the arm-virt machine options, I am thinking of a simple
flag, let's say "iommu=nested-smmuv3", for QEMU to add multiple
vSMMU instances automatically (and enable nesting mode too),
depending on the hw_info ioctl return values of the passthrough
devices. If there is only one passthrough device, or if all of
the passthrough devices are behind the same pSMMU, there would
be no need to add multiple vSMMU instances.
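
A rough sketch of that grouping logic is below. phys_smmu_id_of() is a
hypothetical stand-in for whatever the per-device hw_info query ends up
reporting; none of this is actual QEMU code:

/* Hypothetical sketch: group passthrough devices by the physical SMMU
 * behind them, creating one vSMMU instance per group.
 * Bounds checks are omitted for brevity. */
#include <stdint.h>

#define MAX_VSMMU   8
#define MAX_DEVICES 32

typedef struct VSMMUGroup {
    uint64_t phys_id;           /* identity of the backing physical SMMU */
    int ndevs;
    int devs[MAX_DEVICES];      /* passthrough devices assigned to this vSMMU */
} VSMMUGroup;

/* Placeholder for the per-device hw_info query; here we just pretend
 * that every two devices sit behind the same physical SMMU. */
static uint64_t phys_smmu_id_of(int dev)
{
    return (uint64_t)(dev / 2);
}

/* Returns the number of vSMMU instances to create. If all devices share
 * one pSMMU (or there is a single device), this naturally returns 1. */
static int group_devices_by_psmmu(int ndevs, VSMMUGroup groups[MAX_VSMMU])
{
    int ngroups = 0;

    for (int d = 0; d < ndevs; d++) {
        uint64_t id = phys_smmu_id_of(d);
        int g;

        for (g = 0; g < ngroups; g++) {
            if (groups[g].phys_id == id) {
                break;
            }
        }
        if (g == ngroups) {             /* first device behind this pSMMU */
            groups[ngroups].phys_id = id;
            groups[ngroups].ndevs = 0;
            ngroups++;
        }
        groups[g].devs[groups[g].ndevs++] = d;
    }
    return ngroups;
}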

> We are still far from having a singleton QEMU nested stage SMMU
> implementation at the moment but I understand you may want to feed the
> pipeline to pave the way for enhanced use cases.

Yea. It's for planning -- I wanted to gather some opinions
before doing anything complicated, as you warned here :) And
this would also affect a bit how we add nested SMMU support in QEMU.

Thanks
Nicolin



* Re: Multiple vIOMMU instance support in QEMU?
  2023-05-18 14:56       ` Jason Gunthorpe
@ 2023-05-18 19:45         ` Peter Xu
  2023-05-18 20:19           ` Jason Gunthorpe
  2023-05-18 22:56         ` Tian, Kevin
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Xu @ 2023-05-18 19:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric Auger, Nicolin Chen, peter.maydell, qemu-devel, qemu-arm,
	yi.l.liu, kevin.tian, Jason Wang, Michael S. Tsirkin

On Thu, May 18, 2023 at 11:56:46AM -0300, Jason Gunthorpe wrote:
> On Thu, May 18, 2023 at 10:16:24AM -0400, Peter Xu wrote:
> 
> > What you mentioned above makes sense to me from the POV that 1 vIOMMU may
> > not suffice, but that's at least totally new area to me because I never
> > used >1 IOMMUs even bare metal (excluding the case where I'm aware that
> > e.g. a GPU could have its own IOMMU-like dma translator).
> 
> Even x86 systems are multi-iommu, one iommu per physical CPU socket.

I tried to look at a 2-node system on hand and I indeed got two dmars:

[    4.444788] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    4.459673] DMAR: dmar1: reg_base_addr c7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df

Though they do not seem to be fully parallel in how devices are attached.
E.g., most of the devices on this host are attached to dmar1, while only
two devices are attached to dmar0:

80:05.2 System peripheral: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO RAS/Control Status/Global Errors (rev 01)
80:05.0 System peripheral: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management (rev 01)

> 
> I'm not sure how they model this though - Kevin do you know? Do we get
> multiple iommu instances in Linux or is all the broadcasting of
> invalidates and sharing of tables hidden?
> 
> > What's the system layout of your multi-vIOMMU world?  Is there still a
> > centric vIOMMU, or multi-vIOMMUs can run fully in parallel, so that e.g. we
> > can have DEV1,DEV2 under vIOMMU1 and DEV3,DEV4 under vIOMMU2?
> 
> Just like physical, each viommu is parallel and independent. Each has
> its own caches, ASIDs, DIDs/etc and thus invalidation domains.
> 
> The seperated caches is the motivating reason to do this as something
> like vCMDQ is a direct command channel for invalidations to only the
> caches of a single IOMMU block.

From a cache invalidation POV, shouldn't the best granularity be
per-device (like dev-iotlb in VT-d? No idea about ARM)?

But those are two different angles, I assume - currently dev-iotlb is
still emulated, at least in QEMU.  Having a hardware-accelerated queue
is definitely another thing.

> 
> > Is it a common hardware layout or nVidia specific?
> 
> I think it is pretty normal, you have multiple copies of the IOMMU and
> its caches for physical reasons.
> 
> The only choice is if the platform HW somehow routes invalidations to
> all IOMMUs or requires SW to route/replicate invalidates.
> 
> ARM's IP seems to be designed toward the latter so I expect it is
> going to be common on ARM.

Thanks for the information, Jason.

I see that Intel is already copied here (at least Yi and Kevin), so I
assume there is already some kind of synchronization between the
multi-vIOMMU work and the recent work on the Intel side, which is
definitely nice and can avoid work conflicts.

We should probably also copy Jason Wang and mst when there's any formal
proposal.  I've got them all copied here too.

-- 
Peter Xu




* Re: Multiple vIOMMU instance support in QEMU?
  2023-05-18 19:45         ` Peter Xu
@ 2023-05-18 20:19           ` Jason Gunthorpe
  2023-05-19  0:38             ` Tian, Kevin
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2023-05-18 20:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eric Auger, Nicolin Chen, peter.maydell, qemu-devel, qemu-arm,
	yi.l.liu, kevin.tian, Jason Wang, Michael S. Tsirkin

On Thu, May 18, 2023 at 03:45:24PM -0400, Peter Xu wrote:
> On Thu, May 18, 2023 at 11:56:46AM -0300, Jason Gunthorpe wrote:
> > On Thu, May 18, 2023 at 10:16:24AM -0400, Peter Xu wrote:
> > 
> > > What you mentioned above makes sense to me from the POV that 1 vIOMMU may
> > > not suffice, but that's at least totally new area to me because I never
> > > used >1 IOMMUs even bare metal (excluding the case where I'm aware that
> > > e.g. a GPU could have its own IOMMU-like dma translator).
> > 
> > Even x86 systems are multi-iommu, one iommu per physical CPU socket.
> 
> I tried to look at a 2-node system on hand and I indeed got two dmars:
> 
> [    4.444788] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
> [    4.459673] DMAR: dmar1: reg_base_addr c7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
> 
> Though they do not seem to be all parallel on attaching devices.  E.g.,
> most of the devices on this host are attached to dmar1, while there're only
> two devices attached to dmar0:

Yeah, I expect it has to do with physical topology. PCIe devices
physically connected to each socket should use the socket-local IOMMU
and the socket-local caches.

I.e., it would be foolish to take an IO in socket A, forward it to
socket B to perform the IOMMU translation, and then forward it back to
socket A to land in memory.

> > I'm not sure how they model this though - Kevin do you know? Do we get
> > multiple iommu instances in Linux or is all the broadcasting of
> > invalidates and sharing of tables hidden?
> > 
> > > What's the system layout of your multi-vIOMMU world?  Is there still a
> > > centric vIOMMU, or multi-vIOMMUs can run fully in parallel, so that e.g. we
> > > can have DEV1,DEV2 under vIOMMU1 and DEV3,DEV4 under vIOMMU2?
> > 
> > Just like physical, each viommu is parallel and independent. Each has
> > its own caches, ASIDs, DIDs/etc and thus invalidation domains.
> > 
> > The seperated caches is the motivating reason to do this as something
> > like vCMDQ is a direct command channel for invalidations to only the
> > caches of a single IOMMU block.
> 
> From cache invalidation pov, shouldn't the best be per-device granule (like
> dev-iotlb in VT-d? No idea for ARM)?

There are many caches and different cache tag schemes in an iommu. All
of them are local to the IOMMU block.

Consider a case where we might have a single vDID but the devices
using that DID are spread across two physical IOMMUs. When the VM asks
to invalidate the vDID, the system has to generate two physical pDID
invalidations.

This can't be done without a software mediation layer in the VMM.

The better solution is to make the pDID and vDID 1:1 so that the VM
itself replicates the invalidations. The VM has better knowledge of
when replication is needed, so it is overall more efficient.
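
A hypothetical sketch of what that VMM mediation layer would have to
track and do (illustrative names only, not the real VT-d or SMMU data
structures):

/* Hypothetical VMM-side mediation: one virtual domain ID (vDID) may map
 * to a physical domain ID (pDID) on each physical IOMMU it spans. */
#include <stdint.h>
#include <stdio.h>

#define MAX_PIOMMU 4

typedef struct {
    int piommu;             /* which physical IOMMU instance */
    uint32_t pdid;          /* the physical DID allocated on that instance */
} PDidMap;

typedef struct {
    uint32_t vdid;
    int nmaps;              /* with a 1:1 vIOMMU:pIOMMU layout this stays 1 */
    PDidMap maps[MAX_PIOMMU];
} VDid;

/* Placeholder for issuing one domain-scoped invalidation on one IOMMU. */
static void piommu_invalidate_did(int piommu, uint32_t pdid)
{
    printf("pIOMMU%d: invalidate pDID %u\n", piommu, pdid);
}

/* One guest invalidation of a vDID has to be replicated into one
 * physical invalidation per IOMMU that the vDID is spread across. */
static void vmm_invalidate_vdid(const VDid *v)
{
    for (int i = 0; i < v->nmaps; i++) {
        piommu_invalidate_did(v->maps[i].piommu, v->maps[i].pdid);
    }
}

With the 1:1 layout, nmaps stays at 1 and the guest itself decides when
replication across instances is needed.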

> I see that Intel is already copied here (at least Yi and Kevin) so I assume
> there're already some kind of synchronizations on multi-vIOMMU vs recent
> works on Intel side, which is definitely nice and can avoid work conflicts.

I actually don't know that... Intel sees multiple DMAR blocks in SW
and has kernel-level replication of invalidations. Intel doesn't have a
HW fast path yet, so they can rely on mediation to fix it. Thus I
expect there is no HW replication of invalidations here. Kevin?

Remember, the VFIO API hides all of this: when you change the VFIO
container, it automatically generates all required invalidations in the
kernel.

I also heard AMD has a HW fast path and is also multi-IOMMU, but I
don't really know the details.

Jason



* RE: Multiple vIOMMU instance support in QEMU?
  2023-05-18 14:56       ` Jason Gunthorpe
  2023-05-18 19:45         ` Peter Xu
@ 2023-05-18 22:56         ` Tian, Kevin
  1 sibling, 0 replies; 10+ messages in thread
From: Tian, Kevin @ 2023-05-18 22:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: Eric Auger, Nicolin Chen, peter.maydell@linaro.org,
	qemu-devel@nongnu.org, qemu-arm@nongnu.org, Liu, Yi L

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, May 18, 2023 10:57 PM
> 
> On Thu, May 18, 2023 at 10:16:24AM -0400, Peter Xu wrote:
> 
> > What you mentioned above makes sense to me from the POV that 1 vIOMMU may
> > not suffice, but that's at least totally new area to me because I never
> > used >1 IOMMUs even bare metal (excluding the case where I'm aware that
> > e.g. a GPU could have its own IOMMU-like dma translator).
> 
> Even x86 systems are multi-iommu, one iommu per physical CPU socket.
> 
> I'm not sure how they model this though - Kevin do you know? Do we get
> multiple iommu instances in Linux or is all the broadcasting of
> invalidates and sharing of tables hidden?
> 

Yes, Linux supports multiple IOMMU instances on x86 systems.

Each IOMMU has its own configuration structures/caches and attached
devices. There is no broadcasting.

An ACPI table is used to describe the topology between IOMMUs and
devices.

If an IOMMU domain is attached to two devices behind two different
IOMMUs, separate IOTLB invalidation commands are required when the
domain mapping is changed.
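
For anyone less familiar with the x86 side, the DMAR/DRHD description
referred to above roughly amounts to the shape below -- a simplified
illustration, not the exact ACPI binary layout:

/* Simplified illustration of how the ACPI DMAR table's DRHD entries tie
 * devices to a specific IOMMU unit on x86; not the exact layout. */
#include <stdint.h>

typedef struct {
    uint8_t type;               /* e.g. PCI endpoint or PCI sub-hierarchy */
    uint8_t start_bus;
    uint8_t path_dev;           /* device/function path under that bus */
    uint8_t path_fn;
} ExampleDevScope;

typedef struct {
    uint64_t reg_base;          /* MMIO base of the unit (cf. dmar0/dmar1 above) */
    uint16_t segment;           /* PCI segment the unit serves */
    uint8_t include_pci_all;    /* claims every device not scoped elsewhere */
    int nscopes;
    ExampleDevScope scope[16];  /* explicit device scope entries */
} ExampleDrhd;

The dmar0/dmar1 split Peter saw earlier typically comes from one DRHD
listing a couple of explicit device scopes while the other sets the
include-all flag and picks up the rest.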



* RE: Multiple vIOMMU instance support in QEMU?
  2023-05-18 20:19           ` Jason Gunthorpe
@ 2023-05-19  0:38             ` Tian, Kevin
  0 siblings, 0 replies; 10+ messages in thread
From: Tian, Kevin @ 2023-05-19  0:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: Eric Auger, Nicolin Chen, peter.maydell@linaro.org,
	qemu-devel@nongnu.org, qemu-arm@nongnu.org, Liu, Yi L, Jason Wang,
	Michael S. Tsirkin

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 19, 2023 4:19 AM
> 
> On Thu, May 18, 2023 at 03:45:24PM -0400, Peter Xu wrote:
> 
> > I see that Intel is already copied here (at least Yi and Kevin) so I assume
> > there're already some kind of synchronizations on multi-vIOMMU vs recent
> > works on Intel side, which is definitely nice and can avoid work conflicts.
> 
> I actually don't know that.. Intel sees multiple DMAR blocks in SW and
> they have kernel level replication of invalidation.. Intel doesn't
> have a HW fast path yet so they can rely on mediation to fix it. Thus
> I expect there is no HW replication of invalidations here. Kevin?
> 

No HW fast path, so a single vIOMMU instance is sufficient on Intel now.



Thread overview: 10 messages
2023-04-24 23:42 Multiple vIOMMU instance support in QEMU? Nicolin Chen
2023-05-18  3:22 ` Nicolin Chen
2023-05-18  9:06   ` Eric Auger
2023-05-18 14:16     ` Peter Xu
2023-05-18 14:56       ` Jason Gunthorpe
2023-05-18 19:45         ` Peter Xu
2023-05-18 20:19           ` Jason Gunthorpe
2023-05-19  0:38             ` Tian, Kevin
2023-05-18 22:56         ` Tian, Kevin
2023-05-18 17:39     ` Nicolin Chen
