From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jason Gunthorpe Date: Thu, 07 Jul 2022 15:10:02 +0000 Subject: Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains Message-Id: <20220707151002.GB1705032@nvidia.com> List-Id: References: <20220707135552.3688927-1-aik@ozlabs.ru> In-Reply-To: <20220707135552.3688927-1-aik@ozlabs.ru> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: Alexey Kardashevskiy Cc: linuxppc-dev@lists.ozlabs.org, Robin Murphy , Michael Ellerman , Joerg Roedel , Joel Stanley , Alex Williamson , Oliver O'Halloran , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Daniel Henrique Barboza , Fabiano Rosas , Murilo Opsfelder Araujo , Nicholas Piggin On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote: > Historically PPC64 managed to avoid using iommu_ops. The VFIO driver > uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in > the Type1 VFIO driver. Recent development though has added a coherency > capability check to the generic part of VFIO and essentially disabled > VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed(). > > This adds an iommu_ops stub which reports support for cache > coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, > this provides minimum code for the probing to not crash. > > Because now we have to set iommu_ops to the system (bus_set_iommu() or > iommu_device_register()), this requires the POWERNV PCI setup to happen > after bus_register(&pci_bus_type) which is postcore_initcall > TODO: check if it still works, read sha1, for more details: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?idU37fcb319d016ce387 > > Because setting the ops triggers probing, this does not work well with > iommu_group_add_device(), hence the move to iommu_probe_device(). > > Because iommu_probe_device() does not take the group (which is why > we had the helper in the first place), this adds > pci_controller_ops::device_group. > > So, basically there is one iommu_device per PHB and devices are added to > groups indirectly via series of calls inside the IOMMU code. > > pSeries is out of scope here (a minor fix needed for barely supported > platform in regard to VFIO). > > The previous discussion is here: > https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-aik@ozlabs.ru/ I think this is basically OK, for what it is. It looks like there is more some-day opportunity to make use of the core infrastructure though. > does it make sense to have this many callbacks, or > the generic IOMMU code can safely operate without some > (given I add some more checks for !NULL)? thanks, I wouldn't worry about it.. > @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) > pr_debug("%s: Adding %s to iommu group %d\n", > __func__, dev_name(dev), iommu_group_id(table_group->group)); > > - return iommu_group_add_device(table_group->group, dev); > + ret = iommu_probe_device(dev); > + dev_info(dev, "probed with %d\n", ret); For another day, but it seems a bit strange to call iommu_probe_device() like this? Shouldn't one of the existing call sites cover this? The one in of_iommu.c perhaps? > +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev) > +{ > + return false; > +} I think you can NULL this op: static bool iommu_is_attach_deferred(struct device *dev) { const struct iommu_ops *ops = dev_iommu_ops(dev); if (ops->is_attach_deferred) return ops->is_attach_deferred(dev); return false; } > +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) > +{ > + struct pci_controller *hose; > + struct pci_dev *pdev; > + > + /* Weirdly iommu_device_register() assigns the same ops to all buses */ > + if (!dev_is_pci(dev)) > + return ERR_PTR(-EPERM); > + > + pdev = to_pci_dev(dev); > + hose = pdev->bus->sysdata; > + > + if (!hose->controller_ops.device_group) > + return ERR_PTR(-ENOENT); > + > + return hose->controller_ops.device_group(hose, pdev); > +} Is this missing a refcount get on the group? > + > +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, > + struct device *dev) > +{ > + return 0; > +} It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. It will turn into a security problem to allow kernel mappings to leak past this point. Jason