* [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
@ 2026-04-07 19:46 Nicolin Chen
2026-04-14 14:20 ` Jason Gunthorpe
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Nicolin Chen @ 2026-04-07 19:46 UTC (permalink / raw)
To: joro, kevin.tian, jgg
Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai
Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done().
As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
triggers a WARN_ON and returns -EBUSY, failing the entire device reset.
On the other hand, removing the outer calls in the PCI callers is unsafe.
As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
execute custom firmware waits after their inner pcie_flr() completes. If
the IOMMU protection relies solely on the inner reset, the IOMMU will be
unblocked prematurely while the device is still resetting.
Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.
Given that the IOMMU core tracks the resetting state per iommu_group while
the reset is per device, the state has to be tracked at the group_device
level as well.
Introduce a 'reset_depth' counter and a 'blocked' flag in struct
group_device to handle re-entries on the same device. This also allows
multi-device groups to handle concurrent per-device resets independently.
Note that iommu_deferred_attach() and iommu_driver_get_domain_for_dev()
both now check the per-device 'gdev->blocked' flag instead of a per-group
flag like 'group->resetting_domain'. This is actually more precise. Also,
'gdev->blocked' will be useful in future work, to flag a device blocked by
an ongoing/failed reset or quarantine.
As the reset routine is per gdev, it cannot clear group->resetting_domain
without iterating over the device list to ensure no other device is being
reset. Simplify it by replacing the resetting_domain with a 'recovery_cnt'
in the struct iommu_group.
Since both helpers are now per gdev, call the per-device set_dev_pasid op
to recover PASID domains. And add 'max_pasids > 0' checks in both helpers.
Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@vger.kernel.org
Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
Changelog
v6:
* Update inline comments and commit message
* Add "max_pasids > 0" condition in both helpers
v5:
https://lore.kernel.org/all/20260404050243.141366-1-nicolinc@nvidia.com/
* Add 'blocked' to fix iommu_driver_get_domain_for_dev() return.
v4:
https://lore.kernel.org/all/20260324014056.36103-1-nicolinc@nvidia.com/
* Rename 'reset_cnt' to 'recovery_cnt'
v3:
https://lore.kernel.org/all/20260321223930.10836-1-nicolinc@nvidia.com/
* Turn prepare()/done() to be per-gdev
* Use reset_depth to track nested re-entries
* Replace group->resetting_domain with a reset_cnt
v2:
https://lore.kernel.org/all/20260319043135.1153534-1-nicolinc@nvidia.com/
* Fix in the helpers by allowing re-entry
v1:
https://lore.kernel.org/all/20260318220028.1146905-1-nicolinc@nvidia.com/
drivers/iommu/iommu.c | 148 +++++++++++++++++++++++++++++++-----------
1 file changed, 110 insertions(+), 38 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db517809540..ff181db687bbf 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -61,14 +61,14 @@ struct iommu_group {
int id;
struct iommu_domain *default_domain;
struct iommu_domain *blocking_domain;
- /*
- * During a group device reset, @resetting_domain points to the physical
- * domain, while @domain points to the attached domain before the reset.
- */
- struct iommu_domain *resetting_domain;
struct iommu_domain *domain;
struct list_head entry;
unsigned int owner_cnt;
+ /*
+ * Number of devices in the group undergoing or awaiting recovery.
+ * If non-zero, concurrent domain attachments are rejected.
+ */
+ unsigned int recovery_cnt;
void *owner;
};
@@ -76,12 +76,33 @@ struct group_device {
struct list_head list;
struct device *dev;
char *name;
+ /*
+ * Device is blocked for a pending recovery while its group->domain is
+ * retained. This can happen when:
+ * - Device is undergoing a reset
+ */
+ bool blocked;
+ unsigned int reset_depth;
};
/* Iterate over each struct group_device in a struct iommu_group */
#define for_each_group_device(group, pos) \
list_for_each_entry(pos, &(group)->devices, list)
+static struct group_device *__dev_to_gdev(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ struct group_device *gdev;
+
+ lockdep_assert_held(&group->mutex);
+
+ for_each_group_device(group, gdev) {
+ if (gdev->dev == dev)
+ return gdev;
+ }
+ return NULL;
+}
+
struct iommu_group_attribute {
struct attribute attr;
ssize_t (*show)(struct iommu_group *group, char *buf);
@@ -2191,6 +2212,8 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
{
+ struct group_device *gdev;
+
/*
* This is called on the dma mapping fast path so avoid locking. This is
* racy, but we have an expectation that the driver will setup its DMAs
@@ -2201,14 +2224,18 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
guard(mutex)(&dev->iommu_group->mutex);
+ gdev = __dev_to_gdev(dev);
+ if (WARN_ON(!gdev))
+ return -ENODEV;
+
/*
- * This is a concurrent attach during a device reset. Reject it until
+ * This is a concurrent attach during device recovery. Reject it until
* pci_dev_reset_iommu_done() attaches the device to group->domain.
*
* Note that this might fail the iommu_dma_map(). But there's nothing
* more we can do here.
*/
- if (dev->iommu_group->resetting_domain)
+ if (gdev->blocked)
return -EBUSY;
return __iommu_attach_device(domain, dev, NULL);
}
@@ -2265,19 +2292,24 @@ EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
{
struct iommu_group *group = dev->iommu_group;
+ struct group_device *gdev;
lockdep_assert_held(&group->mutex);
+ gdev = __dev_to_gdev(dev);
+ if (WARN_ON(!gdev))
+ return NULL;
+
/*
* Driver handles the low-level __iommu_attach_device(), including the
* one invoked by pci_dev_reset_iommu_done() re-attaching the device to
* the cached group->domain. In this case, the driver must get the old
- * domain from group->resetting_domain rather than group->domain. This
+ * domain from group->blocking_domain rather than group->domain. This
* prevents it from re-attaching the device from group->domain (old) to
* group->domain (new).
*/
- if (group->resetting_domain)
- return group->resetting_domain;
+ if (gdev->blocked)
+ return group->blocking_domain;
return group->domain;
}
@@ -2436,10 +2468,10 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
return -EINVAL;
/*
- * This is a concurrent attach during a device reset. Reject it until
+ * This is a concurrent attach during device recovery. Reject it until
* pci_dev_reset_iommu_done() attaches the device to group->domain.
*/
- if (group->resetting_domain)
+ if (group->recovery_cnt)
return -EBUSY;
/*
@@ -3567,10 +3599,10 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
mutex_lock(&group->mutex);
/*
- * This is a concurrent attach during a device reset. Reject it until
+ * This is a concurrent attach during device recovery. Reject it until
* pci_dev_reset_iommu_done() attaches the device to group->domain.
*/
- if (group->resetting_domain) {
+ if (group->recovery_cnt) {
ret = -EBUSY;
goto out_unlock;
}
@@ -3660,10 +3692,10 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
mutex_lock(&group->mutex);
/*
- * This is a concurrent attach during a device reset. Reject it until
+ * This is a concurrent attach during device recovery. Reject it until
* pci_dev_reset_iommu_done() attaches the device to group->domain.
*/
- if (group->resetting_domain) {
+ if (group->recovery_cnt) {
ret = -EBUSY;
goto out_unlock;
}
@@ -3934,12 +3966,12 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
* routine wants to block any IOMMU activity: translation and ATS invalidation.
*
* This function attaches the device's RID/PASID(s) the group->blocking_domain,
- * setting the group->resetting_domain. This allows the IOMMU driver pausing any
+ * incrementing the group->recovery_cnt, to allow the IOMMU driver pausing any
* IOMMU activity while leaving the group->domain pointer intact. Later when the
* reset is finished, pci_dev_reset_iommu_done() can restore everything.
*
* Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to unset the resetting_domain.
+ * before/after the core-level reset routine, to decrement the recovery_cnt.
*
* Return: 0 on success or negative error code if the preparation failed.
*
@@ -3952,6 +3984,7 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
{
struct iommu_group *group = pdev->dev.iommu_group;
+ struct group_device *gdev;
unsigned long pasid;
void *entry;
int ret;
@@ -3961,33 +3994,52 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
guard(mutex)(&group->mutex);
- /* Re-entry is not allowed */
- if (WARN_ON(group->resetting_domain))
- return -EBUSY;
+ gdev = __dev_to_gdev(&pdev->dev);
+ if (WARN_ON(!gdev))
+ return -ENODEV;
+
+ if (gdev->reset_depth++)
+ return 0;
ret = __iommu_group_alloc_blocking_domain(group);
if (ret)
- return ret;
+ goto err_depth;
/* Stage RID domain at blocking_domain while retaining group->domain */
if (group->domain != group->blocking_domain) {
ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
group->domain);
if (ret)
- return ret;
+ goto err_depth;
}
+ /*
+ * Update gdev->blocked upon the domain change, as it is used to return
+ * the correct domain in iommu_driver_get_domain_for_dev() that might be
+ * called in a set_dev_pasid callback function.
+ */
+ gdev->blocked = true;
+
/*
* Stage PASID domains at blocking_domain while retaining pasid_array.
*
* The pasid_array is mostly fenced by group->mutex, except one reader
* in iommu_attach_handle_get(), so it's safe to read without xa_lock.
*/
- xa_for_each_start(&group->pasid_array, pasid, entry, 1)
- iommu_remove_dev_pasid(&pdev->dev, pasid,
- pasid_array_entry_to_domain(entry));
+ if (pdev->dev.iommu->max_pasids > 0) {
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+ struct iommu_domain *pasid_dom =
+ pasid_array_entry_to_domain(entry);
+
+ iommu_remove_dev_pasid(&pdev->dev, pasid, pasid_dom);
+ }
+ }
+
+ group->recovery_cnt++;
+ return ret;
- group->resetting_domain = group->blocking_domain;
+err_depth:
+ gdev->reset_depth--;
return ret;
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
@@ -3997,9 +4049,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
* @pdev: PCI device that has finished a reset routine
*
* After a PCIe device finishes a reset routine, it wants to restore its IOMMU
- * IOMMU activity, including new translation as well as cache invalidation, by
- * re-attaching all RID/PASID of the device's back to the domains retained in
- * the core-level structure.
+ * activity, including new translation and cache invalidation, by re-attaching
+ * all RID/PASID of the device back to the domains retained in the core-level
+ * structure.
*
* Caller must pair it with a successful pci_dev_reset_iommu_prepare().
*
@@ -4009,6 +4061,7 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
void pci_dev_reset_iommu_done(struct pci_dev *pdev)
{
struct iommu_group *group = pdev->dev.iommu_group;
+ struct group_device *gdev;
unsigned long pasid;
void *entry;
@@ -4017,11 +4070,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
guard(mutex)(&group->mutex);
- /* pci_dev_reset_iommu_prepare() was bypassed for the device */
- if (!group->resetting_domain)
+ gdev = __dev_to_gdev(&pdev->dev);
+ if (WARN_ON(!gdev))
+ return;
+
+ /* Unbalanced done() calls would underflow the counter */
+ if (WARN_ON(gdev->reset_depth == 0))
+ return;
+ if (--gdev->reset_depth)
return;
- /* pci_dev_reset_iommu_prepare() was not successfully called */
if (WARN_ON(!group->blocking_domain))
return;
@@ -4031,18 +4089,32 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
group->blocking_domain));
}
+ /*
+ * Update gdev->blocked upon the domain change, as it is used to return
+ * the correct domain in iommu_driver_get_domain_for_dev() that might be
+ * called in a set_dev_pasid callback function.
+ */
+ gdev->blocked = false;
+
/*
* Re-attach PASID domains back to the domains retained in pasid_array.
*
* The pasid_array is mostly fenced by group->mutex, except one reader
* in iommu_attach_handle_get(), so it's safe to read without xa_lock.
*/
- xa_for_each_start(&group->pasid_array, pasid, entry, 1)
- WARN_ON(__iommu_set_group_pasid(
- pasid_array_entry_to_domain(entry), group, pasid,
- group->blocking_domain));
+ if (pdev->dev.iommu->max_pasids > 0) {
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+ struct iommu_domain *pasid_dom =
+ pasid_array_entry_to_domain(entry);
+
+ WARN_ON(pasid_dom->ops->set_dev_pasid(
+ pasid_dom, &pdev->dev, pasid,
+ group->blocking_domain));
+ }
+ }
- group->resetting_domain = NULL;
+ if (!WARN_ON(group->recovery_cnt == 0))
+ group->recovery_cnt--;
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
2026-04-07 19:46 [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
@ 2026-04-14 14:20 ` Jason Gunthorpe
2026-04-16 7:48 ` Shuai Xue
2026-04-17 8:24 ` Tian, Kevin
2 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-04-14 14:20 UTC (permalink / raw)
To: Nicolin Chen
Cc: joro, kevin.tian, will, robin.murphy, baolu.lu, iommu,
linux-kernel, xueshuai
On Tue, Apr 07, 2026 at 12:46:44PM -0700, Nicolin Chen wrote:
> [...]
This looks reasonable to me.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* Re: [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
2026-04-07 19:46 [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
2026-04-14 14:20 ` Jason Gunthorpe
@ 2026-04-16 7:48 ` Shuai Xue
2026-04-17 8:24 ` Tian, Kevin
2 siblings, 0 replies; 6+ messages in thread
From: Shuai Xue @ 2026-04-16 7:48 UTC (permalink / raw)
To: Nicolin Chen, joro, kevin.tian, jgg
Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel
On 4/8/26 3:46 AM, Nicolin Chen wrote:
> [...]
LGTM. Feel free to add:
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Thanks.
Shuai
* RE: [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
2026-04-07 19:46 [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
2026-04-14 14:20 ` Jason Gunthorpe
2026-04-16 7:48 ` Shuai Xue
@ 2026-04-17 8:24 ` Tian, Kevin
2026-04-17 21:44 ` Nicolin Chen
2 siblings, 1 reply; 6+ messages in thread
From: Tian, Kevin @ 2026-04-17 8:24 UTC (permalink / raw)
To: Nicolin Chen, joro@8bytes.org, jgg@nvidia.com
Cc: will@kernel.org, robin.murphy@arm.com, baolu.lu@linux.intel.com,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
xueshuai@linux.alibaba.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, April 8, 2026 3:47 AM
>
> [...]
I still have a question about whether iommu_driver_get_domain_for_dev() is
actually required, but that's orthogonal to what this fixes.
Btw, Sashiko [1] gave several comments.
One is that iommu_detach_device_pasid() is not blocked, which can trigger a
devtlb invalidation in the middle of a reset. But it cannot fail, so the
right fix is to skip the blocked device in __iommu_remove_group_pasid().
Another is a use-after-free concern upon iommu_detach_device() in the
middle of a reset. In my thinking, it will trigger a WARN_ON before any UAF:
static void __iommu_group_set_domain_nofail(struct iommu_group *group,
struct iommu_domain *new_domain)
{
WARN_ON(__iommu_group_set_domain_internal(
group, new_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED));
}
but I haven't got time to think about the fix carefully.
The last one is trivial: goto and guard() shouldn't be mixed in one
function, according to the cleanup guidelines.
The former two are existing issues which could be fixed in a follow-up
patch, if you want to fix this nesting issue first. If that's the case
(with the 3rd issue fixed):
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[1] https://sashiko.dev/#/patchset/20260407194644.171304-1-nicolinc%40nvidia.com
* Re: [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
2026-04-17 8:24 ` Tian, Kevin
@ 2026-04-17 21:44 ` Nicolin Chen
2026-04-18 4:56 ` Nicolin Chen
0 siblings, 1 reply; 6+ messages in thread
From: Nicolin Chen @ 2026-04-17 21:44 UTC (permalink / raw)
To: Tian, Kevin
Cc: joro@8bytes.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, baolu.lu@linux.intel.com,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
xueshuai@linux.alibaba.com
On Fri, Apr 17, 2026 at 08:24:27AM +0000, Tian, Kevin wrote:
> one is that iommu_detach_device_pasid() is not blocked which can trigger
> devtlb invalidation in middle of reset. but it cannot fail. so the right fix is
> to skip the blocked device in __iommu_remove_group_pasid().
Yea, squashing this:
@@ -3556,3 +3559,4 @@ static void __iommu_remove_group_pasid(struct iommu_group *group,
for_each_group_device(group, device) {
- if (device->dev->iommu->max_pasids > 0)
+ /* Device might be already detached for a device recovery */
+ if (!device->blocked && device->dev->iommu->max_pasids > 0)
iommu_remove_dev_pasid(device->dev, pasid, domain);
> another is a use-after-free concern upon iommu_detach_device() in
> middle of reset. In my thinking it will trigger WARN_ON before any UAF:
>
> static void __iommu_group_set_domain_nofail(struct iommu_group *group,
> struct iommu_domain *new_domain)
> {
> WARN_ON(__iommu_group_set_domain_internal(
> group, new_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED));
> }
Yes.
> but I haven't got time to think about the fix carefully.
I think we could squash this:
@@ -2469,9 +2469,2 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
- /*
- * This is a concurrent attach during device recovery. Reject it until
- * pci_dev_reset_iommu_done() attaches the device to group->domain.
- */
- if (group->recovery_cnt)
- return -EBUSY;
-
/*
@@ -2484,2 +2477,10 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
for_each_group_device(group, gdev) {
+ /*
+ * Skip devices under recovery: they are already attached to
+ * group->blocking_domain at the hardware level. When their
+ * reset completes, pci_dev_reset_iommu_done() will re-attach
+ * them to the updated group->domain.
+ */
+ if (gdev->blocked)
+ continue;
ret = __iommu_device_set_domain(group, gdev->dev, new_domain,
@@ -2513,2 +2514,4 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
break;
+ if (gdev->blocked)
+ continue;
/*
> the last one is trivial that goto and guard() shouldn't be mixed in one
> function according to the cleanup guidelines.
I don't think this is mixing. The guard protects the entire routine,
including those goto paths, so there isn't any goto path outside the mutex.
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Thanks!
Nicolin
* Re: [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
2026-04-17 21:44 ` Nicolin Chen
@ 2026-04-18 4:56 ` Nicolin Chen
0 siblings, 0 replies; 6+ messages in thread
From: Nicolin Chen @ 2026-04-18 4:56 UTC (permalink / raw)
To: Tian, Kevin
Cc: joro@8bytes.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, baolu.lu@linux.intel.com,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
xueshuai@linux.alibaba.com
On Fri, Apr 17, 2026 at 02:44:41PM -0700, Nicolin Chen wrote:
> [...]
>
> I think we could squash this:
>
> I think we could squash this:
>
> @@ -2469,9 +2469,2 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
>
> - /*
> - * This is a concurrent attach during device recovery. Reject it until
> - * pci_dev_reset_iommu_done() attaches the device to group->domain.
> - */
> - if (group->recovery_cnt)
> - return -EBUSY;
> -
On second thought, we may not be able to simply drop this -- IIRC, we
added it specifically to fence a case where gdevs share the same RID, or
some corner case like that?
To be conservative, we can still reject a concurrent attach while allowing
the detach case:
+ /*
+ * This is a concurrent attach during device recovery. Reject it until
+ * pci_dev_reset_iommu_done() attaches the device to group->domain.
+ *
+ * Note: still allow MUST_SUCCEED callers (detach/teardown) through to
+ * avoid UAF on domain release paths.
+ */
+ if (group->recovery_cnt && !(flags & IOMMU_SET_DOMAIN_MUST_SUCCEED))
+ return -EBUSY;
+
In the detach path, it'll move forward, skip devices per gdev->blocked
inside for_each_group_device(), and defer the re-attach to done().
Thanks
Nicolin
> [...]