Query on ARM SMMUv3 nested support

All of lore.kernel.org
 help / color / mirror / Atom feed

* Query on ARM SMMUv3 nested support
@ 2024-03-13 10:13 Shameerali Kolothum Thodi
  2024-03-13 23:50 ` Nicolin Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-03-13 10:13 UTC (permalink / raw)
  To: Nicolin Chen, Jason Gunthorpe
  Cc: iommu@lists.linux.dev, Linuxarm, Zhangfei Gao, Michael Shavit,
	Eric Auger, Moritz Fischer

Hi Nicolin,

Thanks for the latest repos with basic SMMUv3 nested support enabled[1].
I did some  basic sanity runs on a HiSilicon platform and they seems to work as
expected. The only problem being we can't assign two devices to the VM if
they are on different physical SMMUs.

qemu-system-aarch64: -device
vfio-pci,host=0000:75:00.1,iommufd=iommufd0: [iommufd=29] error attach
0000:75:00.1 (36) to id=4: Invalid argument
qemu-system-aarch64: -device
vfio-pci,host=0000:75:00.1,iommufd=iommufd0: Unable to attach dev to
stage-2 HW pagetable: -1
Segmentation fault (core dumped)
...

I see that on the Qemu side we now allocate a single s2 hwpt and attach that into
all the devices . But this will only work currently if all the assigned devices
are under the same physical SMMUv3 as we have a check in kernel
whether domains are having the same SMMUv3. I remember Jason mentioning
that he is planning to relax that. So are the Qemu side changes based on that
assumption? And any idea how we are planning to relax that restriction? We
do a check for compatibility of the phys SMMUv3s and then allow/restrict in kernel?
Do Qemu can then try allocating a separate s2 hwpt for those and attach again?
Sorry if this was already discussed elsewhere and I missed that.

Thanks,
Shameer
[1]:
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-03112024/
https://github.com/nicolinc/qemu/commits/wip/iommufd_vsmmu-02292024/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Query on ARM SMMUv3 nested support
  2024-03-13 10:13 Query on ARM SMMUv3 nested support Shameerali Kolothum Thodi
@ 2024-03-13 23:50 ` Nicolin Chen
  2024-03-18 16:22   ` Jason Gunthorpe
  2024-03-22 15:04   ` Shameerali Kolothum Thodi
  0 siblings, 2 replies; 9+ messages in thread
From: Nicolin Chen @ 2024-03-13 23:50 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, iommu@lists.linux.dev, Linuxarm, Zhangfei Gao,
	Michael Shavit, Eric Auger, Moritz Fischer

On Wed, Mar 13, 2024 at 10:13:58AM +0000, Shameerali Kolothum Thodi wrote:
> Hi Nicolin,
> 
> Thanks for the latest repos with basic SMMUv3 nested support enabled[1].
> I did some  basic sanity runs on a HiSilicon platform and they seems to work as
> expected. The only problem being we can't assign two devices to the VM if
> they are on different physical SMMUs.
> 
> qemu-system-aarch64: -device
> vfio-pci,host=0000:75:00.1,iommufd=iommufd0: [iommufd=29] error attach
> 0000:75:00.1 (36) to id=4: Invalid argument
> qemu-system-aarch64: -device
> vfio-pci,host=0000:75:00.1,iommufd=iommufd0: Unable to attach dev to
> stage-2 HW pagetable: -1
> Segmentation fault (core dumped)
> ...
> 
> I see that on the Qemu side we now allocate a single s2 hwpt and attach that into
> all the devices . But this will only work currently if all the assigned devices
> are under the same physical SMMUv3 as we have a check in kernel
> whether domains are having the same SMMUv3. I remember Jason mentioning
> that he is planning to relax that. So are the Qemu side changes based on that
> assumption? And any idea how we are planning to relax that restriction? We
> do a check for compatibility of the phys SMMUv3s and then allow/restrict in kernel?
> Do Qemu can then try allocating a separate s2 hwpt for those and attach again?
> Sorry if this was already discussed elsewhere and I missed that.

You are very right about this. I haven't thought about supporting
that case yet. Likely I need to spare some time to refine the QEMU
part in the coming weeks or so (have been waiting for Zhenzhong's
full nesting patches).

Also, NVIDIA has an interest to support CMDQ-V accelerator, where
we need more vSMMU instances, than just one. I haven't decided how
to handle this for both single-vSMMU and multi-vSMMU versions.

Yet, some of the ongoing work might help:
https://github.com/nicolinc/iommufd/commit/b7520901184fd9fa127abb88c1f0be16b9967cff
https://github.com/nicolinc/iommufd/commit/d969ffeac899491d3a6e54557bf6a46d52b95865
So, each device should poll hw_info to get its SMMU ID, if it is
not in the S2 list of the vSMMU's, add it and allocate a new S2.

Jason is on vacation. I believe we'll have a more solid discussion
later.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Query on ARM SMMUv3 nested support
  2024-03-13 23:50 ` Nicolin Chen
@ 2024-03-18 16:22   ` Jason Gunthorpe
  2024-03-22 15:04   ` Shameerali Kolothum Thodi
  1 sibling, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2024-03-18 16:22 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Shameerali Kolothum Thodi, iommu@lists.linux.dev, Linuxarm,
	Zhangfei Gao, Michael Shavit, Eric Auger, Moritz Fischer

On Wed, Mar 13, 2024 at 04:50:58PM -0700, Nicolin Chen wrote:
> On Wed, Mar 13, 2024 at 10:13:58AM +0000, Shameerali Kolothum Thodi wrote:
> > Hi Nicolin,
> > 
> > Thanks for the latest repos with basic SMMUv3 nested support enabled[1].
> > I did some  basic sanity runs on a HiSilicon platform and they seems to work as
> > expected. The only problem being we can't assign two devices to the VM if
> > they are on different physical SMMUs.
> > 
> > qemu-system-aarch64: -device
> > vfio-pci,host=0000:75:00.1,iommufd=iommufd0: [iommufd=29] error attach
> > 0000:75:00.1 (36) to id=4: Invalid argument
> > qemu-system-aarch64: -device
> > vfio-pci,host=0000:75:00.1,iommufd=iommufd0: Unable to attach dev to
> > stage-2 HW pagetable: -1
> > Segmentation fault (core dumped)
> > ...
> > 
> > I see that on the Qemu side we now allocate a single s2 hwpt and attach that into
> > all the devices . But this will only work currently if all the assigned devices
> > are under the same physical SMMUv3 as we have a check in kernel
> > whether domains are having the same SMMUv3. I remember Jason mentioning
> > that he is planning to relax that. 

Yes, I intend this to be the direction as it makes the most sense


> > So are the Qemu side changes based on that assumption? And any
> > idea how we are planning to relax that restriction? We do a check
> > for compatibility of the phys SMMUv3s and then allow/restrict in
> > kernel?  

Right, so qemu should still handle the case where the HWPT is not
compatible and create a new HWPT. Regardless of the kernel side
optimization qemu is still missing this logic apparently.

> > Do Qemu can then try allocating a separate s2 hwpt for
> > those and attach again?  Sorry if this was already discussed
> > elsewhere and I missed that.

Yes, this is the right thing.

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Query on ARM SMMUv3 nested support
  2024-03-13 23:50 ` Nicolin Chen
  2024-03-18 16:22   ` Jason Gunthorpe
@ 2024-03-22 15:04   ` Shameerali Kolothum Thodi
  2024-03-29  7:13     ` Nicolin Chen
  1 sibling, 1 reply; 9+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-03-22 15:04 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, iommu@lists.linux.dev, Linuxarm, Zhangfei Gao,
	Michael Shavit, Eric Auger, Moritz Fischer,
	baolu.lu@linux.intel.com



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, March 13, 2024 11:51 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; iommu@lists.linux.dev; Linuxarm
> <linuxarm@huawei.com>; Zhangfei Gao <zhangfei.gao@linaro.org>; Michael
> Shavit <mshavit@google.com>; Eric Auger <eric.auger@redhat.com>; Moritz
> Fischer <mdf@kernel.org>
> Subject: Re: Query on ARM SMMUv3 nested support
> 
> On Wed, Mar 13, 2024 at 10:13:58AM +0000, Shameerali Kolothum Thodi
> wrote:
> > Hi Nicolin,
> >
> > Thanks for the latest repos with basic SMMUv3 nested support enabled[1].
> > I did some  basic sanity runs on a HiSilicon platform and they seems to
> work as
> > expected. The only problem being we can't assign two devices to the VM if
> > they are on different physical SMMUs.
> >
> > qemu-system-aarch64: -device
> > vfio-pci,host=0000:75:00.1,iommufd=iommufd0: [iommufd=29] error
> attach
> > 0000:75:00.1 (36) to id=4: Invalid argument
> > qemu-system-aarch64: -device
> > vfio-pci,host=0000:75:00.1,iommufd=iommufd0: Unable to attach dev to
> > stage-2 HW pagetable: -1
> > Segmentation fault (core dumped)
> > ...
> >
> > I see that on the Qemu side we now allocate a single s2 hwpt and attach
> that into
> > all the devices . But this will only work currently if all the assigned devices
> > are under the same physical SMMUv3 as we have a check in kernel
> > whether domains are having the same SMMUv3. I remember Jason
> mentioning
> > that he is planning to relax that. So are the Qemu side changes based on
> that
> > assumption? And any idea how we are planning to relax that restriction?
> We
> > do a check for compatibility of the phys SMMUv3s and then allow/restrict
> in kernel?
> > Do Qemu can then try allocating a separate s2 hwpt for those and attach
> again?
> > Sorry if this was already discussed elsewhere and I missed that.
> 
> You are very right about this. I haven't thought about supporting
> that case yet. Likely I need to spare some time to refine the QEMU
> part in the coming weeks or so (have been waiting for Zhenzhong's
> full nesting patches).
> 
> Also, NVIDIA has an interest to support CMDQ-V accelerator, where
> we need more vSMMU instances, than just one. I haven't decided how
> to handle this for both single-vSMMU and multi-vSMMU versions.
> 
> Yet, some of the ongoing work might help:
> https://github.com/nicolinc/iommufd/commit/b7520901184fd9fa127abb88c
> 1f0be16b9967cff
> https://github.com/nicolinc/iommufd/commit/d969ffeac899491d3a6e54557
> bf6a46d52b95865
> So, each device should poll hw_info to get its SMMU ID, if it is
> not in the S2 list of the vSMMU's, add it and allocate a new S2.

Thanks for the reference to above CMDQ-V. Looks interesting.
I think we can have a go with retry logic first to allocate a new hwpt
if the attach fails. For now, I have a temp fix where I allocate a new
hwpt every time.

I also noticed another problem with the commit below,
6691a2f("hw/arm/smmu-common: Use sysmem for get_address _space until !!s2_hwpti")

This actually breaks a virtio-pci dev assignment when that is the only 
assigned device to the Guest without any pass-through dev in nested
scenario.

I have a temp fix for that as well here,
https://github.com/hisilicon/qemu/commit/ff0230e9593ba1a172aa7ec9162967d6ae6875b0

(I know it is ugly!). I will revisit that again. Please let me know if you have
any ideas.

FWIW, I have a working branch for vSVA here,
https://github.com/hisilicon/qemu/tree/iommufd_vsmmu-02292024-vsva-wip-v2

I still have that io_uring write issue mentioned in the other thread[1],
but for now I use io_uring API only for read and make use of normal write()
for updating kernel with page response.

I have done some basic sanity runs with our ACC devices and those seems
to work(of course needs more testing). 

Please have a look and let me know if you spot anything.

Thanks,
Shameer
1. https://lore.kernel.org/all/ad4575588dd247fa8beae60963f36404@huawei.com/



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Query on ARM SMMUv3 nested support
  2024-03-22 15:04   ` Shameerali Kolothum Thodi
@ 2024-03-29  7:13     ` Nicolin Chen
  2024-04-02  7:25       ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 9+ messages in thread
From: Nicolin Chen @ 2024-03-29  7:13 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, iommu@lists.linux.dev, Linuxarm, Zhangfei Gao,
	Michael Shavit, Eric Auger, Moritz Fischer,
	baolu.lu@linux.intel.com

Hi Shameer,

Sorry for the late reply. Having been occupied by multiple tasks
lately.

On Fri, Mar 22, 2024 at 03:04:42PM +0000, Shameerali Kolothum Thodi wrote:

> Thanks for the reference to above CMDQ-V. Looks interesting.
> I think we can have a go with retry logic first to allocate a new hwpt
> if the attach fails. For now, I have a temp fix where I allocate a new
> hwpt every time.
> 
> I also noticed another problem with the commit below,
> 6691a2f("hw/arm/smmu-common: Use sysmem for get_address _space until !!s2_hwpti")
> 
> This actually breaks a virtio-pci dev assignment when that is the only
> assigned device to the Guest without any pass-through dev in nested
> scenario.
>
> I have a temp fix for that as well here,
> (I know it is ugly!). I will revisit that again. Please let me know if you have
> any ideas.

I am not quite getting the case here. Is that virtio-pci device
behind a nested-SMMU while there is no S2 hwpt? In that case, why
does this virtio-pci device require a "nested-smmuv3"?

> I still have that io_uring write issue mentioned in the other thread[1],
> but for now I use io_uring API only for read and make use of normal write()
> for updating kernel with page response.
> 
> I have done some basic sanity runs with our ACC devices and those seems
> to work(of course needs more testing).
> 
> Please have a look and let me know if you spot anything.

I'm glad that you include the fault handling support. I'll give
that a try.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Query on ARM SMMUv3 nested support
  2024-03-29  7:13     ` Nicolin Chen
@ 2024-04-02  7:25       ` Shameerali Kolothum Thodi
  2024-04-02 11:28         ` Jason Gunthorpe
  2024-04-09  6:12         ` Nicolin Chen
  0 siblings, 2 replies; 9+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-04-02  7:25 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, iommu@lists.linux.dev, Linuxarm, Zhangfei Gao,
	Michael Shavit, Eric Auger, Moritz Fischer,
	baolu.lu@linux.intel.com



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, March 29, 2024 7:14 AM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; iommu@lists.linux.dev; Linuxarm
> <linuxarm@huawei.com>; Zhangfei Gao <zhangfei.gao@linaro.org>; Michael
> Shavit <mshavit@google.com>; Eric Auger <eric.auger@redhat.com>; Moritz
> Fischer <mdf@kernel.org>; baolu.lu@linux.intel.com
> Subject: Re: Query on ARM SMMUv3 nested support
> 
> Hi Shameer,
> 
> Sorry for the late reply. Having been occupied by multiple tasks
> lately.
> 
> On Fri, Mar 22, 2024 at 03:04:42PM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> > Thanks for the reference to above CMDQ-V. Looks interesting.
> > I think we can have a go with retry logic first to allocate a new hwpt
> > if the attach fails. For now, I have a temp fix where I allocate a new
> > hwpt every time.
> >
> > I also noticed another problem with the commit below,
> > 6691a2f("hw/arm/smmu-common: Use sysmem for get_address _space
> until !!s2_hwpti")
> >
> > This actually breaks a virtio-pci dev assignment when that is the only
> > assigned device to the Guest without any pass-through dev in nested
> > scenario.
> >
> > I have a temp fix for that as well here,
> > (I know it is ugly!). I will revisit that again. Please let me know if you have
> > any ideas.
> 
> I am not quite getting the case here. Is that virtio-pci device
> behind a nested-SMMU while there is no S2 hwpt? In that case, why
> does this virtio-pci device require a "nested-smmuv3"?

Hi Nicolin,

Let me try to explain with a use case,

Consider a guest boot with below options,

./qemu-system-aarch64 virt,gic-version=3,iommu=nested-smmuv3,iommufd=iommufd0 \
-enable-kvm -cpu host -m 4G -smp cpus=8,maxcpus=8 \
-object iommufd,id=iommufd0 \
-bios QEMU_EFI.fd \
-kernel Image \
-device virtio-blk-device,drive=fs \
-drive if=none,file=ubuntu.img,id=fs \
-device ioh3420,id=rp1 \
-device virtio-9p-pci,fsdev=p9fs,mount_tag=p9 \
-fsdev local,id=p9fs,path=p9root,security_model=mapped \
--append "rdinit=init console=ttyAMA0 root=/dev/vda rw earlycon=pl011,0x9000000" \
-net none \
-nographic

Here we have specified nested-smmuv3 and have  a PCIe root port(ioh3420) and
a virtio-pci dev(for virtfs). Once this VM is up and running, both the ioh3420
and virtio-9p-pci will not work as expected as we return sysmem address space
for them.

And then later when you try to hotplug a vfio dev to the PCIe root port,
device_add vfio-pci,host=0000:7d:02.1,bus=rp1,id=net1,iommufd=iommufd0

it won't work either because the address space for ioh3420 is wrong.

Taking another look at this, I think we can replace my earlier attempt to fix
this by detecting if the dev is vfio or not and add that into the check below,

--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -619,8 +619,15 @@ static AddressSpace *smmu_find_add_as(PCIBus
*bus, void *opaque, int devfn)
     SMMUState *s = opaque;
     SMMUPciBus *sbus = smmu_get_sbus(s, bus);
     SMMUDevice *sdev = smmu_get_sdev(s, sbus, bus, devfn);
+    bool is_vfio = false;
+    PCIDevice *pdev;

-    if (s->nested && !s->s2_hwpt) {
+    pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
+    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
+        is_vfio = true;
+    }
+
+    if (is_vfio && s->nested && !s->s2_hwpt) {
         return &sdev->as_sysmem;
     } else {
         return &sdev->as;

Please let me know your thoughts on this.

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Query on ARM SMMUv3 nested support
  2024-04-02  7:25       ` Shameerali Kolothum Thodi
@ 2024-04-02 11:28         ` Jason Gunthorpe
  2024-04-09  6:12         ` Nicolin Chen
  1 sibling, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2024-04-02 11:28 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Nicolin Chen, iommu@lists.linux.dev, Linuxarm, Zhangfei Gao,
	Michael Shavit, Eric Auger, Moritz Fischer,
	baolu.lu@linux.intel.com

On Tue, Apr 02, 2024 at 07:25:56AM +0000, Shameerali Kolothum Thodi wrote:

> Consider a guest boot with below options,
> 
> ./qemu-system-aarch64 virt,gic-version=3,iommu=nested-smmuv3,iommufd=iommufd0 \
> -enable-kvm -cpu host -m 4G -smp cpus=8,maxcpus=8 \
> -object iommufd,id=iommufd0 \
> -bios QEMU_EFI.fd \
> -kernel Image \
> -device virtio-blk-device,drive=fs \
> -drive if=none,file=ubuntu.img,id=fs \
> -device ioh3420,id=rp1 \
> -device virtio-9p-pci,fsdev=p9fs,mount_tag=p9 \
> -fsdev local,id=p9fs,path=p9root,security_model=mapped \
> --append "rdinit=init console=ttyAMA0 root=/dev/vda rw earlycon=pl011,0x9000000" \
> -net none \
> -nographic
> 
> Here we have specified nested-smmuv3 and have  a PCIe root port(ioh3420) and
> a virtio-pci dev(for virtfs). Once this VM is up and running, both the ioh3420
> and virtio-9p-pci will not work as expected as we return sysmem address space
> for them.

A virtio device that does not use the HW SMMU for translation should
not be affiliated with the nested smmuv3 in the ACPI.

IOW only virtual devices that can be associated with the nested HWPT
can be linked to the smmuv3 instance in the ACPI. QEMU needs to track
this and build the correct IORT.

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Query on ARM SMMUv3 nested support
  2024-04-02  7:25       ` Shameerali Kolothum Thodi
  2024-04-02 11:28         ` Jason Gunthorpe
@ 2024-04-09  6:12         ` Nicolin Chen
  2024-04-09 19:47           ` Nicolin Chen
  1 sibling, 1 reply; 9+ messages in thread
From: Nicolin Chen @ 2024-04-09  6:12 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, iommu@lists.linux.dev, Linuxarm, Zhangfei Gao,
	Michael Shavit, Eric Auger, Moritz Fischer,
	baolu.lu@linux.intel.com

Sorry for my belated reply.

On Tue, Apr 02, 2024 at 07:25:56AM +0000, Shameerali Kolothum Thodi wrote:

> Here we have specified nested-smmuv3 and have  a PCIe root port(ioh3420) and
> a virtio-pci dev(for virtfs). Once this VM is up and running, both the ioh3420
> and virtio-9p-pci will not work as expected as we return sysmem address space
> for them.
> 
> And then later when you try to hotplug a vfio dev to the PCIe root port,
> device_add vfio-pci,host=0000:7d:02.1,bus=rp1,id=net1,iommufd=iommufd0
> 
> it won't work either because the address space for ioh3420 is wrong.
> 
> Taking another look at this, I think we can replace my earlier attempt to fix
> this by detecting if the dev is vfio or not and add that into the check below,
> 
> --- a/hw/arm/smmu-common.c
> +++ b/hw/arm/smmu-common.c
> @@ -619,8 +619,15 @@ static AddressSpace *smmu_find_add_as(PCIBus
> *bus, void *opaque, int devfn)
>      SMMUState *s = opaque;
>      SMMUPciBus *sbus = smmu_get_sbus(s, bus);
>      SMMUDevice *sdev = smmu_get_sdev(s, sbus, bus, devfn);
> +    bool is_vfio = false;
> +    PCIDevice *pdev;
> 
> -    if (s->nested && !s->s2_hwpt) {
> +    pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> +    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> +        is_vfio = true;
> +    }
> +
> +    if (is_vfio && s->nested && !s->s2_hwpt) {
>          return &sdev->as_sysmem;
>      } else {
>          return &sdev->as;
> 
> Please let me know your thoughts on this.

Hmm, checking "vfio-pci" here feels a bit of layer violation..
But I understand how you got there...

I think that the current way of enabling the "nested" flag for
a device is a bit messy:
    if -machine has "iommu=nested-smmuv3" &&
       -device is "vfio_pci" &&
       -device has "iommufd"

Any other device that doesn't meet the latter two conditions
is ambiguously added to a soft smmuv3, which feels wrong to
me now.

Also, the callbacks from hw/vfio/pci.c to iommu are tricky as
well:
    vfio_realize() {
        pci_device_iommu_address_space() {
            smmu_find_add_as();   // Can only assume vfio-pci is behind
                                  // a nested SMMU. So returns system AS
        }
        vfio_attach_device() {
            ioctl(VFIO_DEVICE_BIND_IOMMUFD);
            ioctl(VFIO_DEVICE_ATTACH_IOMMUFD_PT);
            listener_add_address_space() {
                if (memory_region_is_iommu(section->mr)) { // Skip
                    iommu_notifier_init(IOMMU_NOTIFIER_IOTLB_EVENTS);
                    memory_region_iommu_replay();
                }
            }
        }
    }

In my view, we should sort this out with a cleaner enabling
command first. And then, as Jason pointed out, other devices
(non vfio-pci) should probably not be added to the SMMU, i.e.
they might need to be added to a separate PCI bus/bridge with
out iommu. In this case, the vSMMU module would not need an
iommu AS at all when "iommu=nested-smmuv3".

Thoughts?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Query on ARM SMMUv3 nested support
  2024-04-09  6:12         ` Nicolin Chen
@ 2024-04-09 19:47           ` Nicolin Chen
  0 siblings, 0 replies; 9+ messages in thread
From: Nicolin Chen @ 2024-04-09 19:47 UTC (permalink / raw)
  To: Eric Auger, Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, iommu@lists.linux.dev, Linuxarm, Zhangfei Gao,
	Michael Shavit, Moritz Fischer, baolu.lu@linux.intel.com

On Mon, Apr 08, 2024 at 11:13:02PM -0700, Nicolin Chen wrote:
> Sorry for my belated reply.
> 
> On Tue, Apr 02, 2024 at 07:25:56AM +0000, Shameerali Kolothum Thodi wrote:
> 
> > Here we have specified nested-smmuv3 and have  a PCIe root port(ioh3420) and
> > a virtio-pci dev(for virtfs). Once this VM is up and running, both the ioh3420
> > and virtio-9p-pci will not work as expected as we return sysmem address space
> > for them.
> > 
> > And then later when you try to hotplug a vfio dev to the PCIe root port,
> > device_add vfio-pci,host=0000:7d:02.1,bus=rp1,id=net1,iommufd=iommufd0
> > 
> > it won't work either because the address space for ioh3420 is wrong.
> > 
> > Taking another look at this, I think we can replace my earlier attempt to fix
> > this by detecting if the dev is vfio or not and add that into the check below,
> > 
> > --- a/hw/arm/smmu-common.c
> > +++ b/hw/arm/smmu-common.c
> > @@ -619,8 +619,15 @@ static AddressSpace *smmu_find_add_as(PCIBus
> > *bus, void *opaque, int devfn)
> >      SMMUState *s = opaque;
> >      SMMUPciBus *sbus = smmu_get_sbus(s, bus);
> >      SMMUDevice *sdev = smmu_get_sdev(s, sbus, bus, devfn);
> > +    bool is_vfio = false;
> > +    PCIDevice *pdev;
> > 
> > -    if (s->nested && !s->s2_hwpt) {
> > +    pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> > +    if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> > +        is_vfio = true;
> > +    }
> > +
> > +    if (is_vfio && s->nested && !s->s2_hwpt) {
> >          return &sdev->as_sysmem;
> >      } else {
> >          return &sdev->as;
> > 
> > Please let me know your thoughts on this.
> 
> Hmm, checking "vfio-pci" here feels a bit of layer violation..
> But I understand how you got there...
> 
> I think that the current way of enabling the "nested" flag for
> a device is a bit messy:
>     if -machine has "iommu=nested-smmuv3" &&
>        -device is "vfio_pci" &&
>        -device has "iommufd"
> 
> Any other device that doesn't meet the latter two conditions
> is ambiguously added to a soft smmuv3, which feels wrong to
> me now.
> 
> Also, the callbacks from hw/vfio/pci.c to iommu are tricky as
> well:
>     vfio_realize() {
>         pci_device_iommu_address_space() {
>             smmu_find_add_as();   // Can only assume vfio-pci is behind
>                                   // a nested SMMU. So returns system AS
>         }
>         vfio_attach_device() {
>             ioctl(VFIO_DEVICE_BIND_IOMMUFD);
>             ioctl(VFIO_DEVICE_ATTACH_IOMMUFD_PT);
>             listener_add_address_space() {
>                 if (memory_region_is_iommu(section->mr)) { // Skip
>                     iommu_notifier_init(IOMMU_NOTIFIER_IOTLB_EVENTS);
>                     memory_region_iommu_replay();
>                 }
>             }
>         }
>     }
> 
> In my view, we should sort this out with a cleaner enabling
> command first. And then, as Jason pointed out, other devices
> (non vfio-pci) should probably not be added to the SMMU, i.e.
> they might need to be added to a separate PCI bus/bridge with
> out iommu. In this case, the vSMMU module would not need an
> iommu AS at all when "iommu=nested-smmuv3".

Just recalled that kvm_arch_fixup_msi_route still requires a
"vfio-pci" device to return an IOMMUMemoryRegion v.s. system
memory. And that's why there's a "!s2_hwpt" condition (hack)
when returning the system AS, so that (at the beginning) the
vfio attach could bypass the iotlb registeration using system
AS and (after s2_hwpt is allocated) MSI could translate the
MSI address using iommu AS.

Otherwise, such a device could (likely "should" in my view)
stay with the system AS, since it doesn't need the emulated 
iotlb/cache at all. Yet in that case, we would need another
pci_device_msi_address_space for MSI..

@Eric,
Would it be possible for you shed some light also?

In a summary, we have the following opens:
1. How should we enable "nesting" for a passthrough device?
   - Currently is a bit ambiguous:
     if -machine has "iommu=nested-smmuv3" &&
        -device is "vfio_pci" &&
        -device has "iommufd"
   - Actually "-device has 'iommufd'" only guarantees using
     the IOMMUFD cdev APIs v.s. the legacy one, i.e. it's not
     necessary to enable nesting supposedly?

2. How should we handle a non passthrough device (no nesting)?
   - Current is a bit ambiguous again by adding it to the soft
     SMMUv3. But, is it necessary? It doesn't seem to have any
     positive performance gain.
   - Yet not adding it behind the SMMUv3 would require another
     noiommu PCI bus to dock such a device, I assume?

3. Should we keep a nesting-enabled passthrough device in the
   system AS v.s. iommu AS?
   - The reasoning is that it doesn't need an emulated iotlb
     or cache since it uses the HW one.
   - Also, with a HW-accelerated VCMDQ on NVIDIA Grace, the
     emulated iotlb should be turned off because there will be
     no TLBI/ATC command trapped by QEMU, as all those commands
     will be issued to the VCMDQ HW directly.
   - The only usage of an iommu AS is the MSI code, as it's a
     bit unique requiring for a gIOVA to gPA translation. So,
     keeping such a device in the system AS would need another
     pci_device_msi_address_space, which feels reasonable since
     we need something similar in the kernel too:
     https://lore.kernel.org/lkml/f65a3d7e3e0e0c7b2f5cfc03ada9618aa8b523bb.1683688960.git.nicolinc@nvidia.com/

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-04-09 19:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-13 10:13 Query on ARM SMMUv3 nested support Shameerali Kolothum Thodi
2024-03-13 23:50 ` Nicolin Chen
2024-03-18 16:22   ` Jason Gunthorpe
2024-03-22 15:04   ` Shameerali Kolothum Thodi
2024-03-29  7:13     ` Nicolin Chen
2024-04-02  7:25       ` Shameerali Kolothum Thodi
2024-04-02 11:28         ` Jason Gunthorpe
2024-04-09  6:12         ` Nicolin Chen
2024-04-09 19:47           ` Nicolin Chen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.