* [PATCH v6 1/5] iommu: Lock group->mutex in iommu_deferred_attach()
2025-11-19 0:52 [PATCH v6 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
@ 2025-11-19 0:52 ` Nicolin Chen
2025-11-19 0:52 ` [PATCH v6 2/5] iommu: Tidy domain for iommu_setup_dma_ops() Nicolin Chen
` (3 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2025-11-19 0:52 UTC (permalink / raw)
To: robin.murphy, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
The iommu_deferred_attach() function invokes __iommu_attach_device(), but
doesn't hold the group->mutex like other __iommu_attach_device() callers.
Though no practical bug has been triggered so far, it would be better
to apply the same locking to this __iommu_attach_device(), since the IOMMU
drivers nowadays are more aware of the group->mutex -- some of them use the
iommu_group_mutex_assert() function, which could potentially be in the path
of an attach_dev callback function invoked by __iommu_attach_device().
It is worth mentioning that iommu_deferred_attach() will soon need to check
group->resetting_domain, which must also be accessed under the lock.
Thus, grab the mutex to guard __iommu_attach_device() like other callers.
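
For illustration only (not part of the patch), the locking shape described
above can be sketched in userspace C. Everything prefixed with model_ is a
hypothetical stand-in, and the "mutex" is a single-threaded toy that only
records ownership, so this checks the locking discipline and nothing else.
The sketch also assumes that the attach step clears the deferred flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for a mutex: it only records ownership. */
struct model_mutex {
	bool held;
};

static void model_lock(struct model_mutex *m)
{
	assert(!m->held);	/* no recursive locking in this model */
	m->held = true;
}

static void model_unlock(struct model_mutex *m)
{
	assert(m->held);
	m->held = false;
}

struct model_group {
	struct model_mutex mutex;
	int attach_calls;	/* how many times the attach step ran */
};

struct model_dev {
	struct model_group *group;
	bool attach_deferred;
};

/* Stand-in for __iommu_attach_device(): the caller must hold group->mutex. */
static int model_attach_locked(struct model_dev *dev)
{
	assert(dev->group->mutex.held);
	dev->attach_deferred = false;	/* assumption of this model */
	dev->group->attach_calls++;
	return 0;
}

/* Mirrors the shape of the patched iommu_deferred_attach(). */
int model_deferred_attach(struct model_dev *dev)
{
	int ret;

	/* Unlocked fast-path check: nothing to do if no attach is pending. */
	if (!dev->attach_deferred)
		return 0;

	model_lock(&dev->group->mutex);
	ret = model_attach_locked(dev);
	model_unlock(&dev->group->mutex);
	return ret;
}
```

The first call takes the lock and attaches; later calls return from the
unlocked check without touching the mutex.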
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2ca990dfbb884..170e522b5bda4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2185,10 +2185,17 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
{
- if (dev->iommu && dev->iommu->attach_deferred)
- return __iommu_attach_device(domain, dev, NULL);
+ /*
+ * This is called on the dma mapping fast path so avoid locking. This is
+ * racy, but we have an expectation that the driver will setup its DMAs
+ * inside probe while being single threaded to avoid racing.
+ */
+ if (!dev->iommu || !dev->iommu->attach_deferred)
+ return 0;
- return 0;
+ guard(mutex)(&dev->iommu_group->mutex);
+
+ return __iommu_attach_device(domain, dev, NULL);
}
void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
--
2.43.0
* [PATCH v6 2/5] iommu: Tidy domain for iommu_setup_dma_ops()
2025-11-19 0:52 [PATCH v6 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-11-19 0:52 ` [PATCH v6 1/5] iommu: Lock group->mutex in iommu_deferred_attach() Nicolin Chen
@ 2025-11-19 0:52 ` Nicolin Chen
2025-11-19 0:52 ` [PATCH v6 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper Nicolin Chen
` (2 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2025-11-19 0:52 UTC (permalink / raw)
To: robin.murphy, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
This function can only be called on the default_domain. Trivially pass it
in. In all three existing cases, the default domain was just attached to
the device.
This avoids iommu_setup_dma_ops() calling iommu_get_domain_for_dev(), which
will be reserved for external callers.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/dma-iommu.h | 5 +++--
drivers/iommu/dma-iommu.c | 4 +---
drivers/iommu/iommu.c | 6 +++---
3 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
index eca201c1f9639..040d002525632 100644
--- a/drivers/iommu/dma-iommu.h
+++ b/drivers/iommu/dma-iommu.h
@@ -9,7 +9,7 @@
#ifdef CONFIG_IOMMU_DMA
-void iommu_setup_dma_ops(struct device *dev);
+void iommu_setup_dma_ops(struct device *dev, struct iommu_domain *domain);
int iommu_get_dma_cookie(struct iommu_domain *domain);
void iommu_put_dma_cookie(struct iommu_domain *domain);
@@ -26,7 +26,8 @@ extern bool iommu_dma_forcedac;
#else /* CONFIG_IOMMU_DMA */
-static inline void iommu_setup_dma_ops(struct device *dev)
+static inline void iommu_setup_dma_ops(struct device *dev,
+ struct iommu_domain *domain)
{
}
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7944a3af4545e..e8ffb50c66dbf 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -2096,10 +2096,8 @@ void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
}
EXPORT_SYMBOL_GPL(dma_iova_destroy);
-void iommu_setup_dma_ops(struct device *dev)
+void iommu_setup_dma_ops(struct device *dev, struct iommu_domain *domain)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-
if (dev_is_pci(dev))
dev->iommu->pci_32bit_workaround = !iommu_dma_forcedac;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 170e522b5bda4..1e322f87b1710 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -661,7 +661,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
}
if (group->default_domain)
- iommu_setup_dma_ops(dev);
+ iommu_setup_dma_ops(dev, group->default_domain);
mutex_unlock(&group->mutex);
@@ -1949,7 +1949,7 @@ static int bus_iommu_probe(const struct bus_type *bus)
return ret;
}
for_each_group_device(group, gdev)
- iommu_setup_dma_ops(gdev->dev);
+ iommu_setup_dma_ops(gdev->dev, group->default_domain);
mutex_unlock(&group->mutex);
/*
@@ -3155,7 +3155,7 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
/* Make sure dma_ops is appropriatley set */
for_each_group_device(group, gdev)
- iommu_setup_dma_ops(gdev->dev);
+ iommu_setup_dma_ops(gdev->dev, group->default_domain);
out_unlock:
mutex_unlock(&group->mutex);
--
2.43.0
* [PATCH v6 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-19 0:52 [PATCH v6 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-11-19 0:52 ` [PATCH v6 1/5] iommu: Lock group->mutex in iommu_deferred_attach() Nicolin Chen
2025-11-19 0:52 ` [PATCH v6 2/5] iommu: Tidy domain for iommu_setup_dma_ops() Nicolin Chen
@ 2025-11-19 0:52 ` Nicolin Chen
2025-11-19 0:52 ` [PATCH v6 4/5] iommu: Introduce pci_dev_reset_iommu_prepare/done() Nicolin Chen
2025-11-19 0:52 ` [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device Nicolin Chen
4 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2025-11-19 0:52 UTC (permalink / raw)
To: robin.murphy, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
There is a need to temporarily stage a resetting PCI device at the blocked
domain and then attach it back to its previously attached domain after reset.
This can be simply done by keeping the "previously attached domain" in the
iommu_group->domain pointer while adding an iommu_group->resetting_domain,
which causes trouble for IOMMU drivers that use iommu_get_domain_for_dev()
to get a device's physical domain in order to program IOMMU hardware.
In such for-driver use cases, the iommu_group->mutex must be held, so it
doesn't fit external callers that don't hold the iommu_group->mutex.
Introduce a new iommu_driver_get_domain_for_dev() helper, exclusively for
driver use cases that hold the iommu_group->mutex, to separate them from
those external use cases.
Add a lockdep_assert_not_held to the existing iommu_get_domain_for_dev()
and highlight that in a kdoc.
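
For illustration only (not part of the patch), the split between the two
accessors can be sketched in userspace C. The model_ names are hypothetical,
the mutex is reduced to a boolean ownership flag, and assert() stands in for
the lockdep assertions, so this is a sketch of the calling contract rather
than of the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_domain {
	const char *name;
};

struct model_group {
	bool mutex_held;	/* toy stand-in for group->mutex ownership */
	struct model_domain *domain;
};

/* External-caller accessor: must NOT be called with the group mutex held. */
struct model_domain *model_get_domain(struct model_group *group)
{
	if (!group)
		return NULL;

	assert(!group->mutex_held);	/* models lockdep_assert_not_held() */
	return group->domain;
}

/* Driver accessor: only valid in a callback that runs under the mutex. */
struct model_domain *model_driver_get_domain(struct model_group *group)
{
	assert(group->mutex_held);	/* models lockdep_assert_held() */
	return group->domain;
}
```

Calling either accessor in the wrong locking context trips the assertion,
which is the property the two lockdep checks are meant to enforce.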
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 ++--
drivers/iommu/iommu.c | 28 +++++++++++++++++++++
3 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 801b2bd9e8d49..a42a2d1d7a0b7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -910,6 +910,7 @@ extern int iommu_attach_device(struct iommu_domain *domain,
extern void iommu_detach_device(struct iommu_domain *domain,
struct device *dev);
extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
+struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev);
extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a33fbd12a0dd9..412d1a9b31275 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3125,7 +3125,8 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
struct arm_smmu_cd *cd, struct iommu_domain *old)
{
- struct iommu_domain *sid_domain = iommu_get_domain_for_dev(master->dev);
+ struct iommu_domain *sid_domain =
+ iommu_driver_get_domain_for_dev(master->dev);
struct arm_smmu_attach_state state = {
.master = master,
.ssid = pasid,
@@ -3191,7 +3192,7 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
*/
if (!arm_smmu_ssids_in_use(&master->cd_table)) {
struct iommu_domain *sid_domain =
- iommu_get_domain_for_dev(master->dev);
+ iommu_driver_get_domain_for_dev(master->dev);
if (sid_domain->type == IOMMU_DOMAIN_IDENTITY ||
sid_domain->type == IOMMU_DOMAIN_BLOCKED)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1e322f87b1710..672597100e9a0 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2217,6 +2217,15 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
}
EXPORT_SYMBOL_GPL(iommu_detach_device);
+/**
+ * iommu_get_domain_for_dev() - Return the DMA API domain pointer
+ * @dev: Device to query
+ *
+ * This function can be called within a driver bound to dev. The returned
+ * pointer is valid for the lifetime of the bound driver.
+ *
+ * It should not be called by drivers with driver_managed_dma = true.
+ */
struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
{
/* Caller must be a probed driver on dev */
@@ -2225,10 +2234,29 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
if (!group)
return NULL;
+ lockdep_assert_not_held(&group->mutex);
+
return group->domain;
}
EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
+/**
+ * iommu_driver_get_domain_for_dev() - Return the driver-level domain pointer
+ * @dev: Device to query
+ *
+ * This function can be called by an iommu driver that wants to get the physical
+ * domain within an iommu callback function where group->mutex is held.
+ */
+struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+
+ lockdep_assert_held(&group->mutex);
+
+ return group->domain;
+}
+EXPORT_SYMBOL_GPL(iommu_driver_get_domain_for_dev);
+
/*
* For IOMMU_DOMAIN_DMA implementations which already provide their own
* guarantees that the group and its default domain are valid and correct.
--
2.43.0
* [PATCH v6 4/5] iommu: Introduce pci_dev_reset_iommu_prepare/done()
2025-11-19 0:52 [PATCH v6 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
` (2 preceding siblings ...)
2025-11-19 0:52 ` [PATCH v6 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper Nicolin Chen
@ 2025-11-19 0:52 ` Nicolin Chen
2025-11-19 2:56 ` Nicolin Chen
2025-11-21 7:59 ` Tian, Kevin
2025-11-19 0:52 ` [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device Nicolin Chen
4 siblings, 2 replies; 11+ messages in thread
From: Nicolin Chen @ 2025-11-19 0:52 UTC (permalink / raw)
To: robin.murphy, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS where an ATS invalidation
command will time out. E.g. an SVA domain will have no coordination with a
reset event and can racily issue ATS invalidations to a resetting device.
The OS should do something to mitigate this as we do not want production
systems to be reporting critical ATS failures, especially in a hypervisor
environment. Broadly, the OS could arrange to ignore the timeouts, block
page table mutations to prevent invalidations, or disable and block ATS.
The PCIe r6.0, sec 10.3.1 IMPLEMENTATION NOTE recommends that SW disable and
block ATS before initiating a Function Level Reset. It also mentions that
other reset methods could have the same vulnerability as well.
Provide a callback from the PCI subsystem that will enclose the reset and
have the iommu core temporarily change all the attached RID/PASID domains
to group->blocking_domain so that the IOMMU hardware would fence any
incoming ATS queries. IOMMU drivers should also synchronously stop issuing
new ATS invalidations and wait for all ATS invalidations to complete. This
can avoid any ATS invalidation timeouts.
However, if there is a domain attachment/replacement happening during an
ongoing reset, ATS routines may be re-activated between the two function
calls. So, introduce a new resetting_domain in the iommu_group structure
to reject any concurrent attach_dev/set_dev_pasid call during a reset, out
of concern for compatibility failures. Since this changes the behavior of
an attach operation, update the uAPI accordingly.
Note that there are two corner cases:
1. Devices in the same iommu_group
   Since an attachment is always per iommu_group, this means that any
   sibling devices in the iommu_group cannot change domain, to prevent
   race conditions.
2. An SR-IOV PF that is being reset while its VF is not
   In such a case, the VF itself is already broken. So, there is no point
   in preventing the PF from going through the iommu reset.
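
For illustration only (not part of the patch), the resetting_domain
bookkeeping described above can be sketched as a small userspace state
machine. All model_ names are hypothetical, hw_domain is an invented field
that represents what the hardware is currently programmed with, and locking
and PASID handling are left out on purpose.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_domain {
	const char *name;
};

struct model_group {
	struct model_domain *domain;		/* attached domain, kept across reset */
	struct model_domain *blocking_domain;
	struct model_domain *resetting_domain;	/* non-NULL only during a reset */
	struct model_domain *hw_domain;		/* what the "hardware" currently uses */
};

/* Stage the device at the blocking domain while retaining group->domain. */
int model_reset_prepare(struct model_group *g)
{
	assert(g->blocking_domain);	/* this model assumes it already exists */

	if (g->resetting_domain)
		return -EBUSY;		/* re-entry is not allowed */

	g->hw_domain = g->blocking_domain;
	g->resetting_domain = g->blocking_domain;
	return 0;
}

/* A concurrent attach is rejected for as long as the reset is in flight. */
int model_attach(struct model_group *g, struct model_domain *new_domain)
{
	if (g->resetting_domain)
		return -EBUSY;

	g->hw_domain = new_domain;
	g->domain = new_domain;
	return 0;
}

/* Driver view: during a reset the physical domain is the blocking one. */
struct model_domain *model_driver_get_domain(struct model_group *g)
{
	return g->resetting_domain ? g->resetting_domain : g->domain;
}

/* Re-attach the retained domain and leave the resetting state. */
void model_reset_done(struct model_group *g)
{
	if (!g->resetting_domain)
		return;			/* prepare was bypassed */

	g->hw_domain = g->domain;
	g->resetting_domain = NULL;
}
```

Between prepare and done, model_attach() fails with -EBUSY and the retained
domain pointer stays untouched, which is the behavior the uAPI note refers to.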
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 13 +++
include/uapi/linux/vfio.h | 4 +
drivers/iommu/iommu.c | 173 ++++++++++++++++++++++++++++++++++++++
3 files changed, 190 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a42a2d1d7a0b7..364989107aca7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1186,6 +1186,10 @@ void iommu_detach_device_pasid(struct iommu_domain *domain,
struct device *dev, ioasid_t pasid);
ioasid_t iommu_alloc_global_pasid(struct device *dev);
void iommu_free_global_pasid(ioasid_t pasid);
+
+/* PCI device reset functions */
+int pci_dev_reset_iommu_prepare(struct pci_dev *pdev);
+void pci_dev_reset_iommu_done(struct pci_dev *pdev);
#else /* CONFIG_IOMMU_API */
struct iommu_ops {};
@@ -1509,6 +1513,15 @@ static inline ioasid_t iommu_alloc_global_pasid(struct device *dev)
}
static inline void iommu_free_global_pasid(ioasid_t pasid) {}
+
+static inline int pci_dev_reset_iommu_prepare(struct device *dev)
+{
+ return 0;
+}
+
+static inline void pci_dev_reset_iommu_done(struct device *dev)
+{
+}
#endif /* CONFIG_IOMMU_API */
#ifdef CONFIG_IRQ_MSI_IOMMU
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 75100bf009baf..4aee2af1b6cbe 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -963,6 +963,10 @@ struct vfio_device_bind_iommufd {
* hwpt corresponding to the given pt_id.
*
* Return: 0 on success, -errno on failure.
+ *
+ * When a device is resetting, -EBUSY will be returned to reject any concurrent
+ * attachment to the resetting device itself or any sibling device in the IOMMU
+ * group having the resetting device.
*/
struct vfio_device_attach_iommufd_pt {
__u32 argsz;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 672597100e9a0..0665dedd91b2d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -61,6 +61,11 @@ struct iommu_group {
int id;
struct iommu_domain *default_domain;
struct iommu_domain *blocking_domain;
+ /*
+ * During a group device reset, @resetting_domain points to the physical
+ * domain, while @domain points to the attached domain before the reset.
+ */
+ struct iommu_domain *resetting_domain;
struct iommu_domain *domain;
struct list_head entry;
unsigned int owner_cnt;
@@ -2195,6 +2200,15 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
guard(mutex)(&dev->iommu_group->mutex);
+ /*
+ * This is a concurrent attach during a device reset. Reject it until
+ * pci_dev_reset_iommu_done() attaches the device to group->domain.
+ *
+ * Note that this might fail the iommu_dma_map(). But there's nothing
+ * more we can do here.
+ */
+ if (dev->iommu_group->resetting_domain)
+ return -EBUSY;
return __iommu_attach_device(domain, dev, NULL);
}
@@ -2253,6 +2267,17 @@ struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
lockdep_assert_held(&group->mutex);
+ /*
+ * Driver handles the low-level __iommu_attach_device(), including the
+ * one invoked by pci_dev_reset_iommu_done() re-attaching the device to
+ * the cached group->domain. In this case, the driver must get the old
+ * domain from group->resetting_domain rather than group->domain. This
+ * prevents it from re-attaching the device from group->domain (old) to
+ * group->domain (new).
+ */
+ if (group->resetting_domain)
+ return group->resetting_domain;
+
return group->domain;
}
EXPORT_SYMBOL_GPL(iommu_driver_get_domain_for_dev);
@@ -2409,6 +2434,13 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
if (WARN_ON(!new_domain))
return -EINVAL;
+ /*
+ * This is a concurrent attach during a device reset. Reject it until
+ * pci_dev_reset_iommu_done() attaches the device to group->domain.
+ */
+ if (group->resetting_domain)
+ return -EBUSY;
+
/*
* Changing the domain is done by calling attach_dev() on the new
* domain. This switch does not have to be atomic and DMA can be
@@ -3527,6 +3559,16 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
return -EINVAL;
mutex_lock(&group->mutex);
+
+ /*
+ * This is a concurrent attach during a device reset. Reject it until
+ * pci_dev_reset_iommu_done() attaches the device to group->domain.
+ */
+ if (group->resetting_domain) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
for_each_group_device(group, device) {
/*
* Skip PASID validation for devices without PASID support
@@ -3610,6 +3652,16 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
return -EINVAL;
mutex_lock(&group->mutex);
+
+ /*
+ * This is a concurrent attach during a device reset. Reject it until
+ * pci_dev_reset_iommu_done() attaches the device to group->domain.
+ */
+ if (group->resetting_domain) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
entry = iommu_make_pasid_array_entry(domain, handle);
curr = xa_cmpxchg(&group->pasid_array, pasid, NULL,
XA_ZERO_ENTRY, GFP_KERNEL);
@@ -3867,6 +3919,127 @@ int iommu_replace_group_handle(struct iommu_group *group,
}
EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+/**
+ * pci_dev_reset_iommu_prepare() - Block IOMMU to prepare for a PCI device reset
+ * @pdev: PCI device that is going to enter a reset routine
+ *
+ * The PCIe r6.0, sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and block
+ * ATS before initiating a reset. This means that a PCIe device during the reset
+ * routine wants to block any IOMMU activity: translation and ATS invalidation.
+ *
+ * This function attaches the device's RID/PASID(s) to group->blocking_domain,
+ * setting the group->resetting_domain. This lets the IOMMU driver pause any
+ * IOMMU activity while leaving the group->domain pointer intact. Later when the
+ * reset is finished, pci_dev_reset_iommu_done() can restore everything.
+ *
+ * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
+ * before/after the core-level reset routine, to unset the resetting_domain.
+ *
+ * Return: 0 on success or negative error code if the preparation failed.
+ *
+ * These two functions are designed to be used by PCI reset functions that would
+ * not invoke any racy iommu_release_device(), since PCI sysfs node gets removed
+ * before it notifies with a BUS_NOTIFY_REMOVED_DEVICE. When using them in other
+ * cases, callers must ensure there will be no racy iommu_release_device() call,
+ * which otherwise would UAF the dev->iommu_group pointer.
+ */
+int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
+{
+ struct iommu_group *group = pdev->dev.iommu_group;
+ unsigned long pasid;
+ void *entry;
+ int ret;
+
+ if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
+ return 0;
+
+ guard(mutex)(&group->mutex);
+
+ /* Re-entry is not allowed */
+ if (WARN_ON(group->resetting_domain))
+ return -EBUSY;
+
+ ret = __iommu_group_alloc_blocking_domain(group);
+ if (ret)
+ return ret;
+
+ /* Stage RID domain at blocking_domain while retaining group->domain */
+ if (group->domain != group->blocking_domain) {
+ ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
+ group->domain);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Stage PASID domains at blocking_domain while retaining pasid_array.
+ *
+ * The pasid_array is mostly fenced by group->mutex, except one reader
+ * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+ */
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+ iommu_remove_dev_pasid(&pdev->dev, pasid,
+ pasid_array_entry_to_domain(entry));
+
+ group->resetting_domain = group->blocking_domain;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
+
+/**
+ * pci_dev_reset_iommu_done() - Restore IOMMU after a PCI device reset is done
+ * @pdev: PCI device that has finished a reset routine
+ *
+ * After a PCIe device finishes a reset routine, it wants to restore its IOMMU
+ * activity, including new translation as well as cache invalidation, by
+ * re-attaching all of the device's RID/PASIDs back to the domains retained in
+ * the core-level structure.
+ *
+ * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ *
+ * Note that, although unlikely, there is a risk that re-attaching domains might
+ * fail due to some unexpected happening like OOM.
+ */
+void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+{
+ struct iommu_group *group = pdev->dev.iommu_group;
+ unsigned long pasid;
+ void *entry;
+
+ if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
+ return;
+
+ guard(mutex)(&group->mutex);
+
+ /* pci_dev_reset_iommu_prepare() was bypassed for the device */
+ if (!group->resetting_domain)
+ return;
+
+ /* pci_dev_reset_iommu_prepare() was not successfully called */
+ if (WARN_ON(!group->blocking_domain))
+ return;
+
+ /* Re-attach RID domain back to group->domain */
+ if (group->domain != group->blocking_domain) {
+ WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
+ group->blocking_domain));
+ }
+
+ /*
+ * Re-attach PASID domains back to the domains retained in pasid_array.
+ *
+ * The pasid_array is mostly fenced by group->mutex, except one reader
+ * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+ */
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+ WARN_ON(__iommu_set_group_pasid(
+ pasid_array_entry_to_domain(entry), group, pasid,
+ group->blocking_domain));
+
+ group->resetting_domain = NULL;
+}
+EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
+
#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
/**
* iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
--
2.43.0
* Re: [PATCH v6 4/5] iommu: Introduce pci_dev_reset_iommu_prepare/done()
2025-11-19 0:52 ` [PATCH v6 4/5] iommu: Introduce pci_dev_reset_iommu_prepare/done() Nicolin Chen
@ 2025-11-19 2:56 ` Nicolin Chen
2025-11-21 7:59 ` Tian, Kevin
1 sibling, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2025-11-19 2:56 UTC (permalink / raw)
To: robin.murphy, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
On Tue, Nov 18, 2025 at 04:52:10PM -0800, Nicolin Chen wrote:
> +/* PCI device reset functions */
> +int pci_dev_reset_iommu_prepare(struct pci_dev *pdev);
> +void pci_dev_reset_iommu_done(struct pci_dev *pdev);
> #else /* CONFIG_IOMMU_API */
>
> struct iommu_ops {};
> @@ -1509,6 +1513,15 @@ static inline ioasid_t iommu_alloc_global_pasid(struct device *dev)
> }
>
> static inline void iommu_free_global_pasid(ioasid_t pasid) {}
> +
> +static inline int pci_dev_reset_iommu_prepare(struct device *dev)
> +{
> + return 0;
> +}
> +
> +static inline void pci_dev_reset_iommu_done(struct device *dev)
Ah, I forgot to update these two using struct pci_dev..
Will fix this in v7.
Nicolin
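
For illustration only (not the actual v7 change), the mismatch noted above
can be sketched in userspace C: the stubs used when the feature is compiled
out need the same parameter type as the real prototypes, otherwise callers
passing a struct pci_dev pointer stop compiling cleanly in that
configuration. MODEL_CONFIG_IOMMU_API and the model_ names are hypothetical.

```c
#include <assert.h>

struct pci_dev;	/* opaque here; only pointers to it are passed around */

#ifdef MODEL_CONFIG_IOMMU_API
int model_reset_iommu_prepare(struct pci_dev *pdev);
void model_reset_iommu_done(struct pci_dev *pdev);
#else
/* Stubs take the same parameter type as the real prototypes above. */
static inline int model_reset_iommu_prepare(struct pci_dev *pdev)
{
	(void)pdev;
	return 0;
}

static inline void model_reset_iommu_done(struct pci_dev *pdev)
{
	(void)pdev;
}
#endif

/* A caller that compiles the same way with or without the option. */
int model_caller(struct pci_dev *pdev)
{
	int ret = model_reset_iommu_prepare(pdev);

	if (ret)
		return ret;

	model_reset_iommu_done(pdev);
	return 0;
}
```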
* RE: [PATCH v6 4/5] iommu: Introduce pci_dev_reset_iommu_prepare/done()
2025-11-19 0:52 ` [PATCH v6 4/5] iommu: Introduce pci_dev_reset_iommu_prepare/done() Nicolin Chen
2025-11-19 2:56 ` Nicolin Chen
@ 2025-11-21 7:59 ` Tian, Kevin
1 sibling, 0 replies; 11+ messages in thread
From: Tian, Kevin @ 2025-11-21 7:59 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
afael@kernel.org, bhelgaas@google.com, alex@shazbot.org,
jgg@nvidia.com
Cc: will@kernel.org, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, November 19, 2025 8:52 AM
>
> PCIe permits a device to ignore ATS invalidation TLPs while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out. E.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.
>
> The OS should do something to mitigate this as we do not want production
> systems to be reporting critical ATS failures, especially in a hypervisor
> environment. Broadly, OS could arrange to ignore the timeouts, block page
> table mutations to prevent invalidations, or disable and block ATS.
>
> The PCIe r6.0, sec 10.3.1 IMPLEMENTATION NOTE recommends SW to
> disable and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
>
> Provide a callback from the PCI subsystem that will enclose the reset and
> have the iommu core temporarily change all the attached RID/PASID
> domains
> group->blocking_domain so that the IOMMU hardware would fence any
> incoming
> ATS queries. And IOMMU drivers should also synchronously stop issuing new
> ATS invalidations and wait for all ATS invalidations to complete. This can
> avoid any ATS invalidation timeouts.
>
> However, if there is a domain attachment/replacement happening during an
> ongoing reset, ATS routines may be re-activated between the two function
> calls. So, introduce a new resetting_domain in the iommu_group structure
> to reject any concurrent attach_dev/set_dev_pasid call during a reset for
> a concern of compatibility failure. Since this changes the behavior of an
> attach operation, update the uAPI accordingly.
>
> Note that there are two corner cases:
> 1. Devices in the same iommu_group
> Since an attachment is always per iommu_group, this means that any
> sibling devices in the iommu_group cannot change domain, to prevent
> race conditions.
> 2. An SR-IOV PF that is being reset while its VF is not
> In such case, the VF itself is already broken. So, there is no point
> in preventing PF from going through the iommu reset.
>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device
2025-11-19 0:52 [PATCH v6 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
` (3 preceding siblings ...)
2025-11-19 0:52 ` [PATCH v6 4/5] iommu: Introduce pci_dev_reset_iommu_prepare/done() Nicolin Chen
@ 2025-11-19 0:52 ` Nicolin Chen
2025-11-19 2:29 ` kernel test robot
` (2 more replies)
4 siblings, 3 replies; 11+ messages in thread
From: Nicolin Chen @ 2025-11-19 0:52 UTC (permalink / raw)
To: robin.murphy, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS where an ATS invalidation
command will time out: e.g. an SVA domain will have no coordination with a
reset event and can racily issue ATS invalidations to a resetting device.
The PCIe r6.0, sec 10.3.1 IMPLEMENTATION NOTE recommends SW to disable and
block ATS before initiating a Function Level Reset. It also mentions that
other reset methods could have the same vulnerability as well.
The IOMMU subsystem provides the pci_dev_reset_iommu_prepare/done() callback
helpers for this purpose. Use them in all the existing reset functions.
This will attach the device to its iommu_group->blocking_domain during the
device reset, so as to allow the IOMMU driver to:
- invoke pci_disable_ats() and pci_enable_ats(), if necessary
- wait for all ATS invalidations to complete
- stop issuing new ATS invalidations
- fence any incoming ATS queries
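
For illustration only (not part of the patch), the caller-side pattern used
in the reset functions can be sketched in userspace C: once the prepare step
has succeeded, every exit path goes through the done step. The model_ names
and the fields of struct model_pdev are hypothetical; prepare_ret and
wait_ret simply let a test choose what the modeled steps return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_pdev {
	bool imm_ready;		/* device is ready right after the reset */
	int prepare_ret;	/* value the modeled prepare step returns */
	int wait_ret;		/* value the modeled readiness wait returns */
	int prepared;		/* prepare calls not yet balanced by done */
	int resets;		/* number of resets actually triggered */
};

static int model_iommu_prepare(struct model_pdev *p)
{
	if (p->prepare_ret)
		return p->prepare_ret;

	p->prepared++;
	return 0;
}

static void model_iommu_done(struct model_pdev *p)
{
	assert(p->prepared > 0);	/* must pair with a successful prepare */
	p->prepared--;
}

/* Mirrors the control flow added to the FLR-style reset functions. */
int model_flr(struct model_pdev *p)
{
	int ret;

	ret = model_iommu_prepare(p);
	if (ret)
		return ret;	/* nothing to undo, the reset is not attempted */

	p->resets++;		/* stand-in for triggering the reset */

	if (p->imm_ready)
		goto done;	/* ret is still 0 here */

	ret = p->wait_ret;	/* stand-in for waiting until the device is ready */
done:
	model_iommu_done(p);
	return ret;
}
```

The single done label keeps the prepare/done pairing balanced on the early
imm_ready exit as well as on the wait path, whatever the wait returns.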
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/pci/pci-acpi.c | 13 +++++++--
drivers/pci/pci.c | 65 +++++++++++++++++++++++++++++++++++++-----
drivers/pci/quirks.c | 19 +++++++++++-
3 files changed, 87 insertions(+), 10 deletions(-)
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 9369377725fa0..651d9b5561fff 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -9,6 +9,7 @@
#include <linux/delay.h>
#include <linux/init.h>
+#include <linux/iommu.h>
#include <linux/irqdomain.h>
#include <linux/pci.h>
#include <linux/msi.h>
@@ -971,6 +972,7 @@ void pci_set_acpi_fwnode(struct pci_dev *dev)
int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
{
acpi_handle handle = ACPI_HANDLE(&dev->dev);
+ int ret;
if (!handle || !acpi_has_method(handle, "_RST"))
return -ENOTTY;
@@ -978,12 +980,19 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
if (probe)
return 0;
+ ret = pci_dev_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
+ return ret;
+ }
+
if (ACPI_FAILURE(acpi_evaluate_object(handle, "_RST", NULL, NULL))) {
pci_warn(dev, "ACPI _RST failed\n");
- return -ENOTTY;
+ ret = -ENOTTY;
}
- return 0;
+ pci_dev_reset_iommu_done(dev);
+ return ret;
}
bool acpi_pci_power_manageable(struct pci_dev *dev)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b14dd064006cc..da0cf0f041516 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -13,6 +13,7 @@
#include <linux/delay.h>
#include <linux/dmi.h>
#include <linux/init.h>
+#include <linux/iommu.h>
#include <linux/msi.h>
#include <linux/of.h>
#include <linux/pci.h>
@@ -25,6 +26,7 @@
#include <linux/logic_pio.h>
#include <linux/device.h>
#include <linux/pm_runtime.h>
+#include <linux/pci-ats.h>
#include <linux/pci_hotplug.h>
#include <linux/vmalloc.h>
#include <asm/dma.h>
@@ -4478,13 +4480,22 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
*/
int pcie_flr(struct pci_dev *dev)
{
+ int ret;
+
if (!pci_wait_for_pending_transaction(dev))
pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
+ /* Have to call it after waiting for pending DMA transaction */
+ ret = pci_dev_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
+ return ret;
+ }
+
pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
if (dev->imm_ready)
- return 0;
+ goto done;
/*
* Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within
@@ -4493,7 +4504,10 @@ int pcie_flr(struct pci_dev *dev)
*/
msleep(100);
- return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+done:
+ pci_dev_reset_iommu_done(dev);
+ return ret;
}
EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4521,6 +4535,7 @@ EXPORT_SYMBOL_GPL(pcie_reset_flr);
static int pci_af_flr(struct pci_dev *dev, bool probe)
{
+ int ret;
int pos;
u8 cap;
@@ -4547,10 +4562,17 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
PCI_AF_STATUS_TP << 8))
pci_err(dev, "timed out waiting for pending transaction; performing AF function level reset anyway\n");
+ /* Have to call it after waiting for pending DMA transaction */
+ ret = pci_dev_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
+ return ret;
+ }
+
pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
if (dev->imm_ready)
- return 0;
+ goto done;
/*
* Per Advanced Capabilities for Conventional PCI ECN, 13 April 2006,
@@ -4560,7 +4582,10 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
*/
msleep(100);
- return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+done:
+ pci_dev_reset_iommu_done(dev);
+ return ret;
}
/**
@@ -4581,6 +4606,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
static int pci_pm_reset(struct pci_dev *dev, bool probe)
{
u16 csr;
+ int ret;
if (!dev->pm_cap || dev->dev_flags & PCI_DEV_FLAGS_NO_PM_RESET)
return -ENOTTY;
@@ -4595,6 +4621,12 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
if (dev->current_state != PCI_D0)
return -EINVAL;
+ ret = pci_dev_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
+ return ret;
+ }
+
csr &= ~PCI_PM_CTRL_STATE_MASK;
csr |= PCI_D3hot;
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
@@ -4605,7 +4637,9 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
pci_dev_d3_sleep(dev);
- return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
+ pci_dev_reset_iommu_done(dev);
+ return ret;
}
/**
@@ -5033,10 +5067,20 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
return -ENOTTY;
}
+ rc = pci_dev_reset_iommu_prepare(dev);
+ if (rc) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", rc);
+ return rc;
+ }
+
rc = pci_dev_reset_slot_function(dev, probe);
if (rc != -ENOTTY)
- return rc;
- return pci_parent_bus_reset(dev, probe);
+ goto done;
+
+ rc = pci_parent_bus_reset(dev, probe);
+done:
+ pci_dev_reset_iommu_done(dev);
+ return rc;
}
static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
@@ -5060,6 +5104,12 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
if (rc)
return -ENOTTY;
+ rc = pci_dev_reset_iommu_prepare(dev);
+ if (rc) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", rc);
+ return rc;
+ }
+
if (reg & PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR) {
val = reg;
} else {
@@ -5074,6 +5124,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
reg);
+ pci_dev_reset_iommu_done(dev);
return rc;
}
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 214ed060ca1b3..75b6786af01b8 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -21,6 +21,7 @@
#include <linux/pci.h>
#include <linux/isa-dma.h> /* isa_dma_bridge_buggy */
#include <linux/init.h>
+#include <linux/iommu.h>
#include <linux/delay.h>
#include <linux/acpi.h>
#include <linux/dmi.h>
@@ -4226,6 +4227,22 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ 0 }
};
+static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
+ const struct pci_dev_reset_methods *i)
+{
+ int ret;
+
+ ret = pci_dev_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
+ return ret;
+ }
+
+ ret = i->reset(dev, probe);
+ pci_dev_reset_iommu_done(dev);
+ return ret;
+}
+
/*
* These device-specific reset methods are here rather than in a driver
* because when a host assigns a device to a guest VM, the host may need
@@ -4240,7 +4257,7 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
i->vendor == (u16)PCI_ANY_ID) &&
(i->device == dev->device ||
i->device == (u16)PCI_ANY_ID))
- return i->reset(dev, probe);
+ return __pci_dev_specific_reset(dev, probe, i);
}
return -ENOTTY;
--
2.43.0
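
The pattern this patch applies to each reset method can be sketched in userspace as below. This is only an illustration under stated assumptions: demo_dev, demo_prepare(), demo_done() and demo_reset_bracketed() are hypothetical stand-ins, not kernel APIs, and the "blocked" flag merely models the device being attached to its blocking domain. The point is the ordering: bail out if prepare fails, otherwise always call done, whether the reset itself succeeded or not.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a device; not a kernel structure. */
struct demo_dev {
	bool blocked;     /* models "attached to the blocking domain" */
	int prepare_err;  /* injected error returned by demo_prepare() */
	int resets;       /* number of reset attempts actually issued */
};

/* Stand-in for pci_dev_reset_iommu_prepare(): may fail, else blocks. */
static int demo_prepare(struct demo_dev *d)
{
	if (d->prepare_err)
		return d->prepare_err;
	d->blocked = true;
	return 0;
}

/* Stand-in for pci_dev_reset_iommu_done(): restores the old state. */
static void demo_done(struct demo_dev *d)
{
	d->blocked = false;
}

/* Two toy reset methods; both expect the device to be blocked. */
static int demo_reset_ok(struct demo_dev *d)
{
	assert(d->blocked);
	d->resets++;
	return 0;
}

static int demo_reset_fail(struct demo_dev *d)
{
	assert(d->blocked);
	d->resets++;
	return -ENOTTY;
}

/* Same shape as __pci_dev_specific_reset() in the patch above. */
static int demo_reset_bracketed(struct demo_dev *d,
				int (*reset)(struct demo_dev *))
{
	int ret;

	ret = demo_prepare(d);
	if (ret)
		return ret;	/* prepare failed: no reset, nothing to undo */

	ret = reset(d);
	demo_done(d);		/* runs on both success and failure */
	return ret;
}
```

In this model, a failing prepare returns before any reset is issued, and the device is never left blocked once demo_reset_bracketed() returns.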
* Re: [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device
2025-11-19 0:52 ` [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device Nicolin Chen
@ 2025-11-19 2:29 ` kernel test robot
2025-11-19 3:03 ` kernel test robot
2025-11-21 7:59 ` Tian, Kevin
2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2025-11-19 2:29 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy, joro, afael, bhelgaas, alex, jgg,
kevin.tian
Cc: llvm, oe-kbuild-all, will, lenb, baolu.lu, linux-arm-kernel,
iommu, linux-kernel, linux-acpi, linux-pci, kvm, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
Hi Nicolin,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20251118]
[cannot apply to pci/next pci/for-linus awilliam-vfio/next awilliam-vfio/for-linus rafael-pm/linux-next rafael-pm/bleeding-edge linus/master v6.18-rc6 v6.18-rc5 v6.18-rc4 v6.18-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nicolin-Chen/iommu-Lock-group-mutex-in-iommu_deferred_attach/20251119-085721
base: next-20251118
patch link: https://lore.kernel.org/r/9f6caaedc278fe057aacb813d94f44a93d8cab3c.1763512374.git.nicolinc%40nvidia.com
patch subject: [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device
config: loongarch-allnoconfig (https://download.01.org/0day-ci/archive/20251119/202511191020.OczvlCww-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 0bba1e76581bad04e7d7f09f5115ae5e2989e0d9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251119/202511191020.OczvlCww-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511191020.OczvlCww-lkp@intel.com/
All errors (new ones prefixed by >>):
>> drivers/pci/pci.c:4341:36: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4341 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
include/linux/iommu.h:1519:62: note: passing argument to parameter 'dev' here
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ^
drivers/pci/pci.c:4361:27: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4361 | pci_dev_reset_iommu_done(dev);
| ^~~
include/linux/iommu.h:1524:60: note: passing argument to parameter 'dev' here
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ^
drivers/pci/pci.c:4418:36: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4418 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
include/linux/iommu.h:1519:62: note: passing argument to parameter 'dev' here
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ^
drivers/pci/pci.c:4439:27: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4439 | pci_dev_reset_iommu_done(dev);
| ^~~
include/linux/iommu.h:1524:60: note: passing argument to parameter 'dev' here
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ^
drivers/pci/pci.c:4476:36: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4476 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
include/linux/iommu.h:1519:62: note: passing argument to parameter 'dev' here
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ^
drivers/pci/pci.c:4493:27: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4493 | pci_dev_reset_iommu_done(dev);
| ^~~
include/linux/iommu.h:1524:60: note: passing argument to parameter 'dev' here
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ^
drivers/pci/pci.c:4922:35: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4922 | rc = pci_dev_reset_iommu_prepare(dev);
| ^~~
include/linux/iommu.h:1519:62: note: passing argument to parameter 'dev' here
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ^
drivers/pci/pci.c:4934:27: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4934 | pci_dev_reset_iommu_done(dev);
| ^~~
include/linux/iommu.h:1524:60: note: passing argument to parameter 'dev' here
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ^
drivers/pci/pci.c:4959:35: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4959 | rc = pci_dev_reset_iommu_prepare(dev);
| ^~~
include/linux/iommu.h:1519:62: note: passing argument to parameter 'dev' here
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ^
drivers/pci/pci.c:4979:27: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4979 | pci_dev_reset_iommu_done(dev);
| ^~~
include/linux/iommu.h:1524:60: note: passing argument to parameter 'dev' here
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ^
10 errors generated.
--
>> drivers/pci/pci-acpi.c:983:36: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
983 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
include/linux/iommu.h:1519:62: note: passing argument to parameter 'dev' here
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ^
drivers/pci/pci-acpi.c:994:27: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
994 | pci_dev_reset_iommu_done(dev);
| ^~~
include/linux/iommu.h:1524:60: note: passing argument to parameter 'dev' here
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ^
2 errors generated.
--
>> drivers/pci/quirks.c:4237:36: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4237 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
include/linux/iommu.h:1519:62: note: passing argument to parameter 'dev' here
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ^
drivers/pci/quirks.c:4244:27: error: incompatible pointer types passing 'struct pci_dev *' to parameter of type 'struct device *' [-Wincompatible-pointer-types]
4244 | pci_dev_reset_iommu_done(dev);
| ^~~
include/linux/iommu.h:1524:60: note: passing argument to parameter 'dev' here
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ^
2 errors generated.
vim +4341 drivers/pci/pci.c
4325
4326 /**
4327 * pcie_flr - initiate a PCIe function level reset
4328 * @dev: device to reset
4329 *
4330 * Initiate a function level reset unconditionally on @dev without
4331 * checking any flags and DEVCAP
4332 */
4333 int pcie_flr(struct pci_dev *dev)
4334 {
4335 int ret;
4336
4337 if (!pci_wait_for_pending_transaction(dev))
4338 pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
4339
4340 /* Have to call it after waiting for pending DMA transaction */
> 4341 ret = pci_dev_reset_iommu_prepare(dev);
4342 if (ret) {
4343 pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
4344 return ret;
4345 }
4346
4347 pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
4348
4349 if (dev->imm_ready)
4350 goto done;
4351
4352 /*
4353 * Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within
4354 * 100ms, but may silently discard requests while the FLR is in
4355 * progress. Wait 100ms before trying to access the device.
4356 */
4357 msleep(100);
4358
4359 ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
4360 done:
4361 pci_dev_reset_iommu_done(dev);
4362 return ret;
4363 }
4364 EXPORT_SYMBOL_GPL(pcie_flr);
4365
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device
2025-11-19 0:52 ` [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device Nicolin Chen
2025-11-19 2:29 ` kernel test robot
@ 2025-11-19 3:03 ` kernel test robot
2025-11-21 7:59 ` Tian, Kevin
2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2025-11-19 3:03 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy, joro, afael, bhelgaas, alex, jgg,
kevin.tian
Cc: oe-kbuild-all, will, lenb, baolu.lu, linux-arm-kernel, iommu,
linux-kernel, linux-acpi, linux-pci, kvm, patches, pjaroszynski,
vsethi, helgaas, etzhao1900
Hi Nicolin,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20251118]
[cannot apply to pci/next pci/for-linus awilliam-vfio/next awilliam-vfio/for-linus rafael-pm/linux-next rafael-pm/bleeding-edge linus/master v6.18-rc6 v6.18-rc5 v6.18-rc4 v6.18-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nicolin-Chen/iommu-Lock-group-mutex-in-iommu_deferred_attach/20251119-085721
base: next-20251118
patch link: https://lore.kernel.org/r/9f6caaedc278fe057aacb813d94f44a93d8cab3c.1763512374.git.nicolinc%40nvidia.com
patch subject: [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20251119/202511191219.qIkZ1n2P-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251119/202511191219.qIkZ1n2P-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511191219.qIkZ1n2P-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/pci/pci.c: In function 'pcie_flr':
>> drivers/pci/pci.c:4341:43: error: passing argument 1 of 'pci_dev_reset_iommu_prepare' from incompatible pointer type [-Wincompatible-pointer-types]
4341 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
| |
| struct pci_dev *
In file included from drivers/pci/pci.c:16:
include/linux/iommu.h:1519:62: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
>> drivers/pci/pci.c:4361:34: error: passing argument 1 of 'pci_dev_reset_iommu_done' from incompatible pointer type [-Wincompatible-pointer-types]
4361 | pci_dev_reset_iommu_done(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1524:60: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
drivers/pci/pci.c: In function 'pci_af_flr':
drivers/pci/pci.c:4418:43: error: passing argument 1 of 'pci_dev_reset_iommu_prepare' from incompatible pointer type [-Wincompatible-pointer-types]
4418 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1519:62: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
drivers/pci/pci.c:4439:34: error: passing argument 1 of 'pci_dev_reset_iommu_done' from incompatible pointer type [-Wincompatible-pointer-types]
4439 | pci_dev_reset_iommu_done(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1524:60: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
drivers/pci/pci.c: In function 'pci_pm_reset':
drivers/pci/pci.c:4476:43: error: passing argument 1 of 'pci_dev_reset_iommu_prepare' from incompatible pointer type [-Wincompatible-pointer-types]
4476 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1519:62: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
drivers/pci/pci.c:4493:34: error: passing argument 1 of 'pci_dev_reset_iommu_done' from incompatible pointer type [-Wincompatible-pointer-types]
4493 | pci_dev_reset_iommu_done(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1524:60: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
drivers/pci/pci.c: In function 'pci_reset_bus_function':
drivers/pci/pci.c:4922:42: error: passing argument 1 of 'pci_dev_reset_iommu_prepare' from incompatible pointer type [-Wincompatible-pointer-types]
4922 | rc = pci_dev_reset_iommu_prepare(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1519:62: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
drivers/pci/pci.c:4934:34: error: passing argument 1 of 'pci_dev_reset_iommu_done' from incompatible pointer type [-Wincompatible-pointer-types]
4934 | pci_dev_reset_iommu_done(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1524:60: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
drivers/pci/pci.c: In function 'cxl_reset_bus_function':
drivers/pci/pci.c:4959:42: error: passing argument 1 of 'pci_dev_reset_iommu_prepare' from incompatible pointer type [-Wincompatible-pointer-types]
4959 | rc = pci_dev_reset_iommu_prepare(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1519:62: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
drivers/pci/pci.c:4979:34: error: passing argument 1 of 'pci_dev_reset_iommu_done' from incompatible pointer type [-Wincompatible-pointer-types]
4979 | pci_dev_reset_iommu_done(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1524:60: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
--
drivers/pci/quirks.c: In function '__pci_dev_specific_reset':
>> drivers/pci/quirks.c:4237:43: error: passing argument 1 of 'pci_dev_reset_iommu_prepare' from incompatible pointer type [-Wincompatible-pointer-types]
4237 | ret = pci_dev_reset_iommu_prepare(dev);
| ^~~
| |
| struct pci_dev *
In file included from drivers/pci/quirks.c:24:
include/linux/iommu.h:1519:62: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1519 | static inline int pci_dev_reset_iommu_prepare(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
>> drivers/pci/quirks.c:4244:34: error: passing argument 1 of 'pci_dev_reset_iommu_done' from incompatible pointer type [-Wincompatible-pointer-types]
4244 | pci_dev_reset_iommu_done(dev);
| ^~~
| |
| struct pci_dev *
include/linux/iommu.h:1524:60: note: expected 'struct device *' but argument is of type 'struct pci_dev *'
1524 | static inline void pci_dev_reset_iommu_done(struct device *dev)
| ~~~~~~~~~~~~~~~^~~
vim +/pci_dev_reset_iommu_prepare +4341 drivers/pci/pci.c
4325
4326 /**
4327 * pcie_flr - initiate a PCIe function level reset
4328 * @dev: device to reset
4329 *
4330 * Initiate a function level reset unconditionally on @dev without
4331 * checking any flags and DEVCAP
4332 */
4333 int pcie_flr(struct pci_dev *dev)
4334 {
4335 int ret;
4336
4337 if (!pci_wait_for_pending_transaction(dev))
4338 pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
4339
4340 /* Have to call it after waiting for pending DMA transaction */
> 4341 ret = pci_dev_reset_iommu_prepare(dev);
4342 if (ret) {
4343 pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
4344 return ret;
4345 }
4346
4347 pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
4348
4349 if (dev->imm_ready)
4350 goto done;
4351
4352 /*
4353 * Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within
4354 * 100ms, but may silently discard requests while the FLR is in
4355 * progress. Wait 100ms before trying to access the device.
4356 */
4357 msleep(100);
4358
4359 ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
4360 done:
> 4361 pci_dev_reset_iommu_done(dev);
4362 return ret;
4363 }
4364 EXPORT_SYMBOL_GPL(pcie_flr);
4365
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
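
Both reports above show the same mismatch: the callers in drivers/pci/ pass a struct pci_dev *, while the static inline definitions at include/linux/iommu.h:1519 and :1524 take a struct device *. Since both configs are allnoconfig, these are presumably the stubs used when IOMMU support is compiled out. A minimal sketch of one possible fix follows, assuming the stubs should simply mirror the struct pci_dev * signature that the callers use. The struct definitions are simplified userspace stand-ins, demo_reset() is hypothetical, and returning 0 from the prepare stub is an assumption about the intended no-IOMMU behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel types (assumption). */
struct device { int unused; };
struct pci_dev { struct device dev; };

/*
 * Hypothetical corrected stubs for the IOMMU-disabled build. The parameter
 * type matches the callers in drivers/pci/, which pass a struct pci_dev *.
 */
static inline int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
{
	(void)pdev;
	return 0;	/* assumed: nothing to block without IOMMU support */
}

static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev)
{
	(void)pdev;
}

/* Hypothetical caller shaped like the reset paths flagged by the robot. */
static int demo_reset(struct pci_dev *dev)
{
	int ret;

	assert(dev != NULL);
	ret = pci_dev_reset_iommu_prepare(dev);	/* types now agree */
	if (ret)
		return ret;
	pci_dev_reset_iommu_done(dev);
	return 0;
}
```

With matching parameter types, a call such as pci_dev_reset_iommu_prepare(dev) from a function taking struct pci_dev *dev compiles without -Wincompatible-pointer-types.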
* RE: [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device
2025-11-19 0:52 ` [PATCH v6 5/5] PCI: Suspend iommu function prior to resetting a device Nicolin Chen
2025-11-19 2:29 ` kernel test robot
2025-11-19 3:03 ` kernel test robot
@ 2025-11-21 7:59 ` Tian, Kevin
2 siblings, 0 replies; 11+ messages in thread
From: Tian, Kevin @ 2025-11-21 7:59 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
afael@kernel.org, bhelgaas@google.com, alex@shazbot.org,
jgg@nvidia.com
Cc: will@kernel.org, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, November 19, 2025 8:52 AM
>
> PCIe permits a device to ignore ATS invalidation TLPs while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out: e.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.
>
> The PCIe r6.0, sec 10.3.1 IMPLEMENTATION NOTE recommends SW to disable and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
>
> The IOMMU subsystem provides pci_dev_reset_iommu_prepare/done() callback
> helpers for this matter. Use them in all the existing reset functions.
>
> This will attach the device to its iommu_group->blocking_domain during the
> device reset, so as to allow IOMMU driver to:
> - invoke pci_disable_ats() and pci_enable_ats(), if necessary
> - wait for all ATS invalidations to complete
> - stop issuing new ATS invalidations
> - fence any incoming ATS queries
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>