public inbox for linux-kernel@vger.kernel.org
* [PATCH rc v7 0/6] iommu: Fix pci_dev_reset_iommu_prepare/done()
@ 2026-04-18 23:41 Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 1/6] iommu: Fix kdocs of pci_dev_reset_iommu_done() Nicolin Chen
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Nicolin Chen @ 2026-04-18 23:41 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

Shuai and Kevin found a few bugs in the pci_dev_reset_iommu_prepare/done()
helpers when handling some corner cases:
 - Nested callbacks
 - Multi-device groups
 - UAF due to concurrent detach

Fixing these requires some substantial rework, tracking device reset states
on a per-gdev basis. This series includes a few patches addressing them.
Most of the patches were previously reviewed as a single patch in v6. As we
found more bugs during the reviews, I split that v6 into smaller patches so
each of them is cleaner. IOW, the combined git-diff of PATCH 1-4 is
identical to v6.

This is on Github:
https://github.com/nicolinc/iommufd/commits/fix_iommu_reset-v7

Changelog
 v7:
  * Add Reviewed-by tags
  * Split v6 into smaller patches
  * Add one patch to fix UAF during detach()
  * Add one patch to fix unnecessary ATS invalidation
 v6:
  https://lore.kernel.org/all/20260407194644.171304-1-nicolinc@nvidia.com/
  * Update inline comments and commit message
  * Add "max_pasids > 0" condition in both helpers
 v5:
  https://lore.kernel.org/all/20260404050243.141366-1-nicolinc@nvidia.com/
  * Add 'blocked' to fix iommu_driver_get_domain_for_dev() return.
 v4:
  https://lore.kernel.org/all/20260324014056.36103-1-nicolinc@nvidia.com/
  * Rename 'reset_cnt' to 'recovery_cnt'
 v3:
  https://lore.kernel.org/all/20260321223930.10836-1-nicolinc@nvidia.com/
  * Turn prepare()/done() to be per-gdev
  * Use reset_depth to track nested re-entries
  * Replace group->resetting_domain with a reset_cnt
 v2:
  https://lore.kernel.org/all/20260319043135.1153534-1-nicolinc@nvidia.com/
  * Fix in the helpers by allowing re-entry
 v1:
  https://lore.kernel.org/all/20260318220028.1146905-1-nicolinc@nvidia.com/

Nicolin Chen (6):
  iommu: Fix kdocs of pci_dev_reset_iommu_done()
  iommu: Replace per-group resetting_domain with per-gdev blocked flag
  iommu: Fix pasid attach in pci_dev_reset_iommu_prepare/done()
  iommu: Fix nested pci_dev_reset_iommu_prepare/done()
  iommu: Fix ATS invalidation timeouts during
    __iommu_remove_group_pasid()
  iommu: Fix UAF in pci_dev_reset_iommu_done() due to concurrent detach

 drivers/iommu/iommu.c | 163 ++++++++++++++++++++++++++++++++----------
 1 file changed, 124 insertions(+), 39 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH rc v7 1/6] iommu: Fix kdocs of pci_dev_reset_iommu_done()
  2026-04-18 23:41 [PATCH rc v7 0/6] iommu: Fix pci_dev_reset_iommu_prepare/done() Nicolin Chen
@ 2026-04-18 23:41 ` Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 2/6] iommu: Replace per-group resetting_domain with per-gdev blocked flag Nicolin Chen
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2026-04-18 23:41 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

Remove the duplicated word. No functional change.

Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db517809540..b950e7be1a3ed 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3997,9 +3997,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
  * @pdev: PCI device that has finished a reset routine
  *
  * After a PCIe device finishes a reset routine, it wants to restore its IOMMU
- * IOMMU activity, including new translation as well as cache invalidation, by
- * re-attaching all RID/PASID of the device's back to the domains retained in
- * the core-level structure.
+ * activity, including new translation and cache invalidation, by re-attaching
+ * all RID/PASID of the device back to the domains retained in the core-level
+ * structure.
  *
  * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
  *
-- 
2.43.0



* [PATCH rc v7 2/6] iommu: Replace per-group resetting_domain with per-gdev blocked flag
  2026-04-18 23:41 [PATCH rc v7 0/6] iommu: Fix pci_dev_reset_iommu_prepare/done() Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 1/6] iommu: Fix kdocs of pci_dev_reset_iommu_done() Nicolin Chen
@ 2026-04-18 23:41 ` Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 3/6] iommu: Fix pasid attach in pci_dev_reset_iommu_prepare/done() Nicolin Chen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2026-04-18 23:41 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

The core tracks device reset state with a per-group resetting_domain, while
a reset is actually per group-device. Such a mismatch can cause confusion
and makes it difficult to untangle the per-gdev handling requirements.

Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally, while both call pci_dev_reset_iommu_prepare/done(). The
solution requires the core to track reset state at the group_device level
as well.

Introduce a 'blocked' flag in struct group_device, allowing a multi-device
group to isolate concurrent device resets independently.

As the reset routine is per gdev, it cannot clear group->resetting_domain
without iterating over the device list to ensure no other device is being
reset. Simplify this by replacing the resetting_domain with a
'recovery_cnt' in struct iommu_group.

No functional change, but this is essential groundwork for the following
bug fixes.

Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@vger.kernel.org
Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 104 ++++++++++++++++++++++++++++++++----------
 1 file changed, 79 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b950e7be1a3ed..d5d9102b9d750 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -61,14 +61,14 @@ struct iommu_group {
 	int id;
 	struct iommu_domain *default_domain;
 	struct iommu_domain *blocking_domain;
-	/*
-	 * During a group device reset, @resetting_domain points to the physical
-	 * domain, while @domain points to the attached domain before the reset.
-	 */
-	struct iommu_domain *resetting_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
 	unsigned int owner_cnt;
+	/*
+	 * Number of devices in the group undergoing or awaiting recovery.
+	 * If non-zero, concurrent domain attachments are rejected.
+	 */
+	unsigned int recovery_cnt;
 	void *owner;
 };
 
@@ -76,12 +76,32 @@ struct group_device {
 	struct list_head list;
 	struct device *dev;
 	char *name;
+	/*
+	 * Device is blocked for a pending recovery while its group->domain is
+	 * retained. This can happen when:
+	 *  - Device is undergoing a reset
+	 */
+	bool blocked;
 };
 
 /* Iterate over each struct group_device in a struct iommu_group */
 #define for_each_group_device(group, pos) \
 	list_for_each_entry(pos, &(group)->devices, list)
 
+static struct group_device *__dev_to_gdev(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct group_device *gdev;
+
+	lockdep_assert_held(&group->mutex);
+
+	for_each_group_device(group, gdev) {
+		if (gdev->dev == dev)
+			return gdev;
+	}
+	return NULL;
+}
+
 struct iommu_group_attribute {
 	struct attribute attr;
 	ssize_t (*show)(struct iommu_group *group, char *buf);
@@ -2191,6 +2211,8 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
 
 int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
 {
+	struct group_device *gdev;
+
 	/*
 	 * This is called on the dma mapping fast path so avoid locking. This is
 	 * racy, but we have an expectation that the driver will setup its DMAs
@@ -2201,14 +2223,18 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
 
 	guard(mutex)(&dev->iommu_group->mutex);
 
+	gdev = __dev_to_gdev(dev);
+	if (WARN_ON(!gdev))
+		return -ENODEV;
+
 	/*
-	 * This is a concurrent attach during a device reset. Reject it until
+	 * This is a concurrent attach during device recovery. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
 	 *
 	 * Note that this might fail the iommu_dma_map(). But there's nothing
 	 * more we can do here.
 	 */
-	if (dev->iommu_group->resetting_domain)
+	if (gdev->blocked)
 		return -EBUSY;
 	return __iommu_attach_device(domain, dev, NULL);
 }
@@ -2265,19 +2291,24 @@ EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
 struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
 {
 	struct iommu_group *group = dev->iommu_group;
+	struct group_device *gdev;
 
 	lockdep_assert_held(&group->mutex);
 
+	gdev = __dev_to_gdev(dev);
+	if (WARN_ON(!gdev))
+		return NULL;
+
 	/*
 	 * Driver handles the low-level __iommu_attach_device(), including the
 	 * one invoked by pci_dev_reset_iommu_done() re-attaching the device to
 	 * the cached group->domain. In this case, the driver must get the old
-	 * domain from group->resetting_domain rather than group->domain. This
+	 * domain from group->blocking_domain rather than group->domain. This
 	 * prevents it from re-attaching the device from group->domain (old) to
 	 * group->domain (new).
 	 */
-	if (group->resetting_domain)
-		return group->resetting_domain;
+	if (gdev->blocked)
+		return group->blocking_domain;
 
 	return group->domain;
 }
@@ -2436,10 +2467,10 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
 		return -EINVAL;
 
 	/*
-	 * This is a concurrent attach during a device reset. Reject it until
+	 * This is a concurrent attach during device recovery. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
 	 */
-	if (group->resetting_domain)
+	if (group->recovery_cnt)
 		return -EBUSY;
 
 	/*
@@ -3567,10 +3598,10 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
 	mutex_lock(&group->mutex);
 
 	/*
-	 * This is a concurrent attach during a device reset. Reject it until
+	 * This is a concurrent attach during device recovery. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
 	 */
-	if (group->resetting_domain) {
+	if (group->recovery_cnt) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -3660,10 +3691,10 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
 	mutex_lock(&group->mutex);
 
 	/*
-	 * This is a concurrent attach during a device reset. Reject it until
+	 * This is a concurrent attach during device recovery. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
 	 */
-	if (group->resetting_domain) {
+	if (group->recovery_cnt) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -3934,12 +3965,12 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
  * routine wants to block any IOMMU activity: translation and ATS invalidation.
  *
  * This function attaches the device's RID/PASID(s) the group->blocking_domain,
- * setting the group->resetting_domain. This allows the IOMMU driver pausing any
+ * incrementing the group->recovery_cnt, to allow the IOMMU driver pausing any
  * IOMMU activity while leaving the group->domain pointer intact. Later when the
  * reset is finished, pci_dev_reset_iommu_done() can restore everything.
  *
  * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to unset the resetting_domain.
+ * before/after the core-level reset routine, to decrement the recovery_cnt.
  *
  * Return: 0 on success or negative error code if the preparation failed.
  *
@@ -3952,6 +3983,7 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
 int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 {
 	struct iommu_group *group = pdev->dev.iommu_group;
+	struct group_device *gdev;
 	unsigned long pasid;
 	void *entry;
 	int ret;
@@ -3961,8 +3993,12 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 
 	guard(mutex)(&group->mutex);
 
-	/* Re-entry is not allowed */
-	if (WARN_ON(group->resetting_domain))
+	gdev = __dev_to_gdev(&pdev->dev);
+	if (WARN_ON(!gdev))
+		return -ENODEV;
+
+	/* Re-entry is not allowed (will be fixed in a following patch) */
+	if (WARN_ON(gdev->blocked))
 		return -EBUSY;
 
 	ret = __iommu_group_alloc_blocking_domain(group);
@@ -3977,6 +4013,13 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 			return ret;
 	}
 
+	/*
+	 * Update gdev->blocked upon the domain change, as it is used to return
+	 * the correct domain in iommu_driver_get_domain_for_dev() that might be
+	 * called in a set_dev_pasid callback function.
+	 */
+	gdev->blocked = true;
+
 	/*
 	 * Stage PASID domains at blocking_domain while retaining pasid_array.
 	 *
@@ -3987,7 +4030,7 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 		iommu_remove_dev_pasid(&pdev->dev, pasid,
 				       pasid_array_entry_to_domain(entry));
 
-	group->resetting_domain = group->blocking_domain;
+	group->recovery_cnt++;
 	return ret;
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
@@ -4009,6 +4052,7 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
 void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 {
 	struct iommu_group *group = pdev->dev.iommu_group;
+	struct group_device *gdev;
 	unsigned long pasid;
 	void *entry;
 
@@ -4017,11 +4061,13 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 
 	guard(mutex)(&group->mutex);
 
-	/* pci_dev_reset_iommu_prepare() was bypassed for the device */
-	if (!group->resetting_domain)
+	gdev = __dev_to_gdev(&pdev->dev);
+	if (WARN_ON(!gdev))
+		return;
+
+	if (!gdev->blocked)
 		return;
 
-	/* pci_dev_reset_iommu_prepare() was not successfully called */
 	if (WARN_ON(!group->blocking_domain))
 		return;
 
@@ -4031,6 +4077,13 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 					      group->blocking_domain));
 	}
 
+	/*
+	 * Update gdev->blocked upon the domain change, as it is used to return
+	 * the correct domain in iommu_driver_get_domain_for_dev() that might be
+	 * called in a set_dev_pasid callback function.
+	 */
+	gdev->blocked = false;
+
 	/*
 	 * Re-attach PASID domains back to the domains retained in pasid_array.
 	 *
@@ -4042,7 +4095,8 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 			pasid_array_entry_to_domain(entry), group, pasid,
 			group->blocking_domain));
 
-	group->resetting_domain = NULL;
+	if (!WARN_ON(group->recovery_cnt == 0))
+		group->recovery_cnt--;
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
 
-- 
2.43.0



* [PATCH rc v7 3/6] iommu: Fix pasid attach in pci_dev_reset_iommu_prepare/done()
  2026-04-18 23:41 [PATCH rc v7 0/6] iommu: Fix pci_dev_reset_iommu_prepare/done() Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 1/6] iommu: Fix kdocs of pci_dev_reset_iommu_done() Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 2/6] iommu: Replace per-group resetting_domain with per-gdev blocked flag Nicolin Chen
@ 2026-04-18 23:41 ` Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 4/6] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2026-04-18 23:41 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

Now that the helpers handle per-gdev resets, use the per-device API
properly to attach/detach PASIDs. Also add a max_pasids check, as other
callers do.

Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@vger.kernel.org
Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
Closes: https://lore.kernel.org/all/ad858513-09fc-455e-bbc5-fe38a225cc78@linux.alibaba.com/
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d5d9102b9d750..e9ffa562b614f 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -4026,9 +4026,14 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 	 * The pasid_array is mostly fenced by group->mutex, except one reader
 	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
 	 */
-	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
-		iommu_remove_dev_pasid(&pdev->dev, pasid,
-				       pasid_array_entry_to_domain(entry));
+	if (pdev->dev.iommu->max_pasids > 0) {
+		xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+			struct iommu_domain *pasid_dom =
+				pasid_array_entry_to_domain(entry);
+
+			iommu_remove_dev_pasid(&pdev->dev, pasid, pasid_dom);
+		}
+	}
 
 	group->recovery_cnt++;
 	return ret;
@@ -4090,10 +4095,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 	 * The pasid_array is mostly fenced by group->mutex, except one reader
 	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
 	 */
-	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
-		WARN_ON(__iommu_set_group_pasid(
-			pasid_array_entry_to_domain(entry), group, pasid,
-			group->blocking_domain));
+	if (pdev->dev.iommu->max_pasids > 0) {
+		xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+			struct iommu_domain *pasid_dom =
+				pasid_array_entry_to_domain(entry);
+
+			WARN_ON(pasid_dom->ops->set_dev_pasid(
+				pasid_dom, &pdev->dev, pasid,
+				group->blocking_domain));
+		}
+	}
 
 	if (!WARN_ON(group->recovery_cnt == 0))
 		group->recovery_cnt--;
-- 
2.43.0



* [PATCH rc v7 4/6] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
  2026-04-18 23:41 [PATCH rc v7 0/6] iommu: Fix pci_dev_reset_iommu_prepare/done() Nicolin Chen
                   ` (2 preceding siblings ...)
  2026-04-18 23:41 ` [PATCH rc v7 3/6] iommu: Fix pasid attach in pci_dev_reset_iommu_prepare/done() Nicolin Chen
@ 2026-04-18 23:41 ` Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 5/6] iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid() Nicolin Chen
  2026-04-18 23:41 ` [PATCH rc v7 6/6] iommu: Fix UAF in pci_dev_reset_iommu_done() due to concurrent detach Nicolin Chen
  5 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2026-04-18 23:41 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done().

As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
will trigger a WARN_ON and return -EBUSY, failing the entire device reset.

On the other hand, removing the outer calls in the PCI callers is unsafe.
As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
execute custom firmware waits after their inner pcie_flr() completes. If
the IOMMU protection relies solely on the inner reset, the IOMMU will be
unblocked prematurely while the device is still resetting.

Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.

Introduce gdev->reset_depth to handle the re-entries on the same device.

Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@vger.kernel.org
Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e9ffa562b614f..ff181db687bbf 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -82,6 +82,7 @@ struct group_device {
 	 *  - Device is undergoing a reset
 	 */
 	bool blocked;
+	unsigned int reset_depth;
 };
 
 /* Iterate over each struct group_device in a struct iommu_group */
@@ -3997,20 +3998,19 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 	if (WARN_ON(!gdev))
 		return -ENODEV;
 
-	/* Re-entry is not allowed (will be fixed in a following patch) */
-	if (WARN_ON(gdev->blocked))
-		return -EBUSY;
+	if (gdev->reset_depth++)
+		return 0;
 
 	ret = __iommu_group_alloc_blocking_domain(group);
 	if (ret)
-		return ret;
+		goto err_depth;
 
 	/* Stage RID domain at blocking_domain while retaining group->domain */
 	if (group->domain != group->blocking_domain) {
 		ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
 					    group->domain);
 		if (ret)
-			return ret;
+			goto err_depth;
 	}
 
 	/*
@@ -4037,6 +4037,10 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 
 	group->recovery_cnt++;
 	return ret;
+
+err_depth:
+	gdev->reset_depth--;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
 
@@ -4070,7 +4074,10 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 	if (WARN_ON(!gdev))
 		return;
 
-	if (!gdev->blocked)
+	/* Unbalanced done() calls would underflow the counter */
+	if (WARN_ON(gdev->reset_depth == 0))
+		return;
+	if (--gdev->reset_depth)
 		return;
 
 	if (WARN_ON(!group->blocking_domain))
-- 
2.43.0



* [PATCH rc v7 5/6] iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid()
  2026-04-18 23:41 [PATCH rc v7 0/6] iommu: Fix pci_dev_reset_iommu_prepare/done() Nicolin Chen
                   ` (3 preceding siblings ...)
  2026-04-18 23:41 ` [PATCH rc v7 4/6] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
@ 2026-04-18 23:41 ` Nicolin Chen
  2026-04-21  7:15   ` Tian, Kevin
  2026-04-18 23:41 ` [PATCH rc v7 6/6] iommu: Fix UAF in pci_dev_reset_iommu_done() due to concurrent detach Nicolin Chen
  5 siblings, 1 reply; 11+ messages in thread
From: Nicolin Chen @ 2026-04-18 23:41 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

If a device is blocked, its PASID domains are already detached. Repeating
iommu_remove_dev_pasid() is unnecessary and might trigger ATS invalidation
timeouts.

Skip the iommu_remove_dev_pasid() call upon the blocked flag.

Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@vger.kernel.org
Reported-by: Kevin Tian <kevin.tian@intel.com>
Closes: https://lore.kernel.org/all/BN9PR11MB5276D60096EBF15C5753C4BB8C202@BN9PR11MB5276.namprd11.prod.outlook.com/
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ff181db687bbf..30ba18e613faa 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3554,7 +3554,8 @@ static void __iommu_remove_group_pasid(struct iommu_group *group,
 	struct group_device *device;
 
 	for_each_group_device(group, device) {
-		if (device->dev->iommu->max_pasids > 0)
+		/* Device might be already detached for a device recovery */
+		if (!device->blocked && device->dev->iommu->max_pasids > 0)
 			iommu_remove_dev_pasid(device->dev, pasid, domain);
 	}
 }
-- 
2.43.0



* [PATCH rc v7 6/6] iommu: Fix UAF in pci_dev_reset_iommu_done() due to concurrent detach
  2026-04-18 23:41 [PATCH rc v7 0/6] iommu: Fix pci_dev_reset_iommu_prepare/done() Nicolin Chen
                   ` (4 preceding siblings ...)
  2026-04-18 23:41 ` [PATCH rc v7 5/6] iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid() Nicolin Chen
@ 2026-04-18 23:41 ` Nicolin Chen
  2026-04-21  7:41   ` Tian, Kevin
  5 siblings, 1 reply; 11+ messages in thread
From: Nicolin Chen @ 2026-04-18 23:41 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

In __iommu_group_set_domain_internal(), concurrent domain attachments are
rejected when any device in the group is recovering. This is necessary to
fence concurrent attachments to a multi-device group where devices might
share the same RID due to PCI DMA alias quirks.

However, IOMMU_SET_DOMAIN_MUST_SUCCEED callers (detach/teardown paths such
as __iommu_group_set_core_domain and __iommu_release_dma_ownership) should
not be rejected: the domain would be freed anyway in this nofail path while
group->domain still points to it, so pci_dev_reset_iommu_done() could
trigger a UAF when re-attaching group->domain.

Honor the IOMMU_SET_DOMAIN_MUST_SUCCEED flag by allowing those callers
through the group->recovery_cnt fence, so that the group->domain pointer
gets updated. Instead, add gdev->blocked checks in the device iteration
loops to avoid any concurrent per-device detachment.

Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@vger.kernel.org
Reported-by: Kevin Tian <kevin.tian@intel.com>
Closes: https://lore.kernel.org/all/BN9PR11MB5276D60096EBF15C5753C4BB8C202@BN9PR11MB5276.namprd11.prod.outlook.com/
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 30ba18e613faa..fa7b04a77e469 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2470,8 +2470,11 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
 	/*
 	 * This is a concurrent attach during device recovery. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
+	 *
+	 * Note: still allow MUST_SUCCEED callers (detach/teardown) through to
+	 * avoid UAF on domain release paths.
 	 */
-	if (group->recovery_cnt)
+	if (group->recovery_cnt && !(flags & IOMMU_SET_DOMAIN_MUST_SUCCEED))
 		return -EBUSY;
 
 	/*
@@ -2482,6 +2485,13 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
 	 */
 	result = 0;
 	for_each_group_device(group, gdev) {
+		/*
+		 * Device under recovery is attached to group->blocking_domain.
+		 * Don't change that. pci_dev_reset_iommu_done() will re-attach
+		 * its domain to the updated group->domain, after the recovery.
+		 */
+		if (gdev->blocked)
+			continue;
 		ret = __iommu_device_set_domain(group, gdev->dev, new_domain,
 						group->domain, flags);
 		if (ret) {
@@ -2511,6 +2521,8 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
 		/* No need to revert the last gdev that failed to set domain */
 		if (gdev == last_gdev)
 			break;
+		if (gdev->blocked)
+			continue;
 		/*
 		 * A NULL domain can happen only for first probe, in which case
 		 * we leave group->domain as NULL and let release clean
-- 
2.43.0



* RE: [PATCH rc v7 5/6] iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid()
  2026-04-18 23:41 ` [PATCH rc v7 5/6] iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid() Nicolin Chen
@ 2026-04-21  7:15   ` Tian, Kevin
  2026-04-21 17:57     ` Nicolin Chen
  0 siblings, 1 reply; 11+ messages in thread
From: Tian, Kevin @ 2026-04-21  7:15 UTC (permalink / raw)
  To: Nicolin Chen, joro@8bytes.org, jgg@nvidia.com
  Cc: will@kernel.org, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	xueshuai@linux.alibaba.com

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Sunday, April 19, 2026 7:42 AM
> 
> If a device is blocked, its PASID domains are already detached. Repeating
> iommu_remove_dev_pasid() is unnecessary and might trigger ATS invalidation
> timeouts.
> 
> Skip the iommu_remove_dev_pasid() call upon the blocked flag.
> 
> Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
> Cc: stable@vger.kernel.org
> Reported-by: Kevin Tian <kevin.tian@intel.com>

It wasn't reported by me; I just tried to copy the words from Sashiko (same
in the next patch).

with that fixed:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

btw not sure whether Joerg plans to enable Sashiko auto-reply to the iommu
list. I feel most suggestions there are positive. 

If it won't happen near term, I suggest you find a way to check Sashiko
manually for future patches.


* RE: [PATCH rc v7 6/6] iommu: Fix UAF in pci_dev_reset_iommu_done() due to concurrent detach
  2026-04-18 23:41 ` [PATCH rc v7 6/6] iommu: Fix UAF in pci_dev_reset_iommu_done() due to concurrent detach Nicolin Chen
@ 2026-04-21  7:41   ` Tian, Kevin
  2026-04-21 18:10     ` Nicolin Chen
  0 siblings, 1 reply; 11+ messages in thread
From: Tian, Kevin @ 2026-04-21  7:41 UTC (permalink / raw)
  To: Nicolin Chen, joro@8bytes.org, jgg@nvidia.com
  Cc: will@kernel.org, robin.murphy@arm.com, baolu.lu@linux.intel.com,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	xueshuai@linux.alibaba.com

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Sunday, April 19, 2026 7:42 AM
> 
> In __iommu_group_set_domain_internal(), concurrent domain attachments are
> rejected when any device in the group is recovering. This is necessary to
> fence concurrent attachments to a multi-device group where devices might
> share the same RID due to PCI DMA alias quirks.
> 
> However, IOMMU_SET_DOMAIN_MUST_SUCCEED callers (detach/teardown paths such
> as __iommu_group_set_core_domain and __iommu_release_dma_ownership) should
> not be rejected, as the domain would be free-ed anyway in this nofail path
> while group->domain is still pointing to it. So pci_dev_reset_iommu_done()
> could trigger a UAF when re-attaching group->domain.

As I noted in my reply to v6, a WARN_ON will be triggered before any UAF:

static void __iommu_group_set_domain_nofail(struct iommu_group *group,
                                            struct iommu_domain *new_domain)
{
        WARN_ON(__iommu_group_set_domain_internal(
                group, new_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED));
}

> __iommu_group_set_domain_internal(struct iommu_group *group,
>  	/*
>  	 * This is a concurrent attach during device recovery. Reject it until
>  	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
> +	 *
> +	 * Note: still allow MUST_SUCCEED callers (detach/teardown) through to
> +	 * avoid UAF on domain release paths.
>  	 */
> -	if (group->recovery_cnt)
> +	if (group->recovery_cnt && !(flags & IOMMU_SET_DOMAIN_MUST_SUCCEED))
>  		return -EBUSY;

MUST_SUCCEED are used in other places too:

- iommu_setup_default_domain()
- error handling 

I'd just remove "(detach/teardown)" here.

> 
>  	/*
> @@ -2482,6 +2485,13 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
>  	 */
>  	result = 0;
>  	for_each_group_device(group, gdev) {
> +		/*
> +		 * Device under recovery is attached to group->blocking_domain.
> +		 * Don't change that. pci_dev_reset_iommu_done() will re-attach
> +		 * its domain to the updated group->domain, after the recovery.
> +		 */
> +		if (gdev->blocked)
> +			continue;

This reminds me of one thing. Ideally the blocked device really doesn't care
whether group->domain is the one before resetting or a different one changed
in the middle. While it's blocked it doesn't refer to any non-blocking
domains, and after reset is done it's re-attached to whatever group->domain
is at that time.

Then it sounds like there is no reason to block attach/replace either. Just
skip the blocked device and update group->domain, and it will be picked up
later at reset done, just like done here for detach.

Did I overlook anything? btw even if it makes sense, I'm not suggesting
changing that logic in this series. It's more an enhancement than a bug fix.
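To illustrate what I mean, a rough userspace sketch of the skip-then-reattach
idea (all names here are hypothetical, just modeling the state machine):

```c
#include <assert.h>
#include <stdbool.h>

#define NDEV 2
#define BLOCKING_DOMAIN (-1)	/* stand-in id for group->blocking_domain */

struct gdev_model {
	bool blocked;
	int domain;		/* domain id the device is attached to */
};

struct group_model {
	int domain;		/* group->domain, as an id */
	struct gdev_model devs[NDEV];
};

/* Attach loop: update group->domain but skip devices under recovery. */
static void group_set_domain(struct group_model *g, int new_domain)
{
	g->domain = new_domain;
	for (int i = 0; i < NDEV; i++)
		if (!g->devs[i].blocked)
			g->devs[i].domain = new_domain;
}

static void reset_prepare(struct group_model *g, int i)
{
	g->devs[i].blocked = true;
	g->devs[i].domain = BLOCKING_DOMAIN;
}

/* Reset done: re-attach to whatever group->domain is *now*. */
static void reset_done(struct group_model *g, int i)
{
	g->devs[i].blocked = false;
	g->devs[i].domain = g->domain;
}
```

The blocked device parks on the blocking domain, attach/replace proceeds for
the rest of the group, and reset_done() picks up the latest group->domain.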

Sashiko [1] gave another interesting comment about DMA aliasing caused
by a PCIe to PCI/PCI-X bridge: devices behind the bridge share the same
RID (and thus the same device/context entry in the IOMMU). In this case
unblocking devA could prematurely unblock devB which is actively
undergoing a reset.
[1] https://sashiko.dev/#/patchset/cover.1776551790.git.nicolinc%40nvidia.com

Honestly speaking I don't know whether such a configuration exists today
(an ATS-capable device on a PCI/PCI-X bus). Probably for robustness
pci_enable_ats() could add a check to prevent it. But given it's niche and
this series already fixes several real issues, I'm OK with it as is.

So:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH rc v7 5/6] iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid()
  2026-04-21  7:15   ` Tian, Kevin
@ 2026-04-21 17:57     ` Nicolin Chen
  0 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2026-04-21 17:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: joro@8bytes.org, jgg@nvidia.com, will@kernel.org,
	robin.murphy@arm.com, baolu.lu@linux.intel.com,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	xueshuai@linux.alibaba.com

On Tue, Apr 21, 2026 at 07:15:34AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Sunday, April 19, 2026 7:42 AM
> > 
> > If a device is blocked, its PASID domains are already detached. Repeating
> > iommu_remove_dev_pasid() is unnecessary and might trigger ATS invalidation
> > timeouts.
> > 
> > Skip the iommu_remove_dev_pasid() call upon the blocked flag.
> > 
> > Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
> > Cc: stable@vger.kernel.org
> > Reported-by: Kevin Tian <kevin.tian@intel.com>
> 
> It wasn't reported by me. Instead, I just tried to copy the words from
> Sashiko (same in the next patch).

I think you should take the credit, though maybe we could add:

Assisted-by: Gemini:gemini-3.1-pro-preview
 
> btw not sure whether Joerg plans to enable Sashiko auto-reply to the iommu
> list. I feel most suggestions there are positive. 

That sounds helpful to me.

> If it won't happen near term, I suggest you find a way to check Sashiko
> manually for future patches.

I use AI reviewer too, but it didn't report these corner cases.
I will add Sashiko (gemini-3.1-pro-preview) to my routine.

Thanks
Nicolin


* Re: [PATCH rc v7 6/6] iommu: Fix UAF in pci_dev_reset_iommu_done() due to concurrent detach
  2026-04-21  7:41   ` Tian, Kevin
@ 2026-04-21 18:10     ` Nicolin Chen
  0 siblings, 0 replies; 11+ messages in thread
From: Nicolin Chen @ 2026-04-21 18:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: joro@8bytes.org, jgg@nvidia.com, will@kernel.org,
	robin.murphy@arm.com, baolu.lu@linux.intel.com,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	xueshuai@linux.alibaba.com

On Tue, Apr 21, 2026 at 07:41:03AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Sunday, April 19, 2026 7:42 AM
> > 
> > In __iommu_group_set_domain_internal(), concurrent domain attachments are
> > rejected when any device in the group is recovering. This is necessary to
> > fence concurrent attachments to a multi-device group where devices might
> > share the same RID due to PCI DMA alias quirks.
> > 
> > However, IOMMU_SET_DOMAIN_MUST_SUCCEED callers (detach/teardown paths such
> > as __iommu_group_set_core_domain and __iommu_release_dma_ownership) should
> > not be rejected, as the domain would be freed anyway in this nofail path
> > while group->domain is still pointing to it. So pci_dev_reset_iommu_done()
> > could trigger a UAF when re-attaching group->domain.
> 
> As I noted in my reply to v6, a WARN_ON will be triggered before any UAF:
> 
> static void __iommu_group_set_domain_nofail(struct iommu_group *group,
>                                             struct iommu_domain *new_domain)
> {
>         WARN_ON(__iommu_group_set_domain_internal(
>                 group, new_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED));
> }

OK. I think this fix should be just "do not fail MUST_SUCCEED" or so.

> > @@ -2482,6 +2485,13 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
> >  	 */
> >  	result = 0;
> >  	for_each_group_device(group, gdev) {
> > +		/*
> > +		 * Device under recovery is attached to group->blocking_domain.
> > +		 * Don't change that. pci_dev_reset_iommu_done() will re-attach
> > +		 * its domain to the updated group->domain, after the recovery.
> > +		 */
> > +		if (gdev->blocked)
> > +			continue;
> 
> This reminds me of one thing. Ideally the blocked device really doesn't care
> whether group->domain is the one before resetting or a different one changed
> in the middle. While it's blocked it doesn't refer to any non-blocking
> domains, and after reset is done it's re-attached to whatever group->domain
> is at that time.
> 
> Then it sounds like there is no reason to block attach/replace either. Just
> skip the blocked device and update group->domain, and it will be picked up
> later at reset done, just like done here for detach.

There might be devices in the same group sharing the same RID, though?

> Sashiko [1] gave another interesting comment about DMA aliasing caused
> by a PCIe to PCI/PCI-X bridge: devices behind the bridge share the same
> RID (and thus the same device/context entry in the IOMMU). In this case
> unblocking devA could prematurely unblock devB which is actively
> undergoing a reset.

Exactly. I recall we talked about it when introducing this entire
reset piece: there was a condition in the reset helpers that skipped
aliasing groups, but we dropped it to focus on singleton groups for
the first version. Maybe we can resume that discussion, but I doubt
we could exclude those RID-sharing cases...

Nicolin


end of thread, other threads:[~2026-04-21 18:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-18 23:41 [PATCH rc v7 0/6] iommu: Fix pci_dev_reset_iommu_prepare/done() Nicolin Chen
2026-04-18 23:41 ` [PATCH rc v7 1/6] iommu: Fix kdocs of pci_dev_reset_iommu_done() Nicolin Chen
2026-04-18 23:41 ` [PATCH rc v7 2/6] iommu: Replace per-group resetting_domain with per-gdev blocked flag Nicolin Chen
2026-04-18 23:41 ` [PATCH rc v7 3/6] iommu: Fix pasid attach in pci_dev_reset_iommu_prepare/done() Nicolin Chen
2026-04-18 23:41 ` [PATCH rc v7 4/6] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
2026-04-18 23:41 ` [PATCH rc v7 5/6] iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid() Nicolin Chen
2026-04-21  7:15   ` Tian, Kevin
2026-04-21 17:57     ` Nicolin Chen
2026-04-18 23:41 ` [PATCH rc v7 6/6] iommu: Fix UAF in pci_dev_reset_iommu_done() due to concurrent detach Nicolin Chen
2026-04-21  7:41   ` Tian, Kevin
2026-04-21 18:10     ` Nicolin Chen
