[PATCH rc v5] iommu: Fix nested pci_dev_reset_iommu

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH rc v5] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
@ 2026-04-04  5:02 Nicolin Chen
  2026-04-05  0:55 ` Nicolin Chen
  2026-04-07  6:36 ` Shuai Xue
  0 siblings, 2 replies; 3+ messages in thread
From: Nicolin Chen @ 2026-04-04  5:02 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done().

As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
will trigger a WARN_ON and return -EBUSY, resulting in failing the entire
device reset.

On the other hand, removing the outer calls in the PCI callers is unsafe.
As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
execute custom firmware waits after their inner pcie_flr() completes. If
the IOMMU protection relies solely on the inner reset, the IOMMU will be
unblocked prematurely while the device is still resetting.

Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.

Given the IOMMU core tracks the resetting state per iommu_group while the
reset is per device, this has to track at the group_device level as well.

Introduce a 'reset_depth' to struct group_device to handle the re-entries
on the same device. This allows multi-device groups to isolate concurrent
device resets independently.

Note that iommu_deferred_attach() and iommu_driver_get_domain_for_dev()
both now check gdev->reset_depth (per-device) instead of a per-group flag
like "group->resetting_domain". This is actually more precise.

As the reset routine is per gdev, it cannot clear group->resetting_domain
without iterating over the device list to ensure no other device is being
reset. Simplify it by replacing the resetting_domain with a 'recovery_cnt'
in the struct iommu_group.

Since both helpers are now per gdev, call the per-device set_dev_pasid op
to recover PASID domains.

While fixing the bug, also fix the kdoc for pci_dev_reset_iommu_done().

Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@vger.kernel.org
Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
Changelog
 v5:
  * Add 'blocked' to fix iommu_driver_get_domain_for_dev() return.
 v4:
  https://lore.kernel.org/all/20260324014056.36103-1-nicolinc@nvidia.com/
  * Rename 'reset_cnt' to 'recovery_cnt'
 v3:
  https://lore.kernel.org/all/20260321223930.10836-1-nicolinc@nvidia.com/
  * Turn prepare()/done() to be per-gdev
  * Use reset_depth to track nested re-entries
  * Replace group->resetting_domain with a reset_cnt
 v2:
  https://lore.kernel.org/all/20260319043135.1153534-1-nicolinc@nvidia.com/
  * Fix in the helpers by allowing re-entry
 v1:
  https://lore.kernel.org/all/20260318220028.1146905-1-nicolinc@nvidia.com/

 drivers/iommu/iommu.c | 116 +++++++++++++++++++++++++++++-------------
 1 file changed, 82 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db517809540..4bd7a4afc9881 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -61,14 +61,14 @@ struct iommu_group {
 	int id;
 	struct iommu_domain *default_domain;
 	struct iommu_domain *blocking_domain;
-	/*
-	 * During a group device reset, @resetting_domain points to the physical
-	 * domain, while @domain points to the attached domain before the reset.
-	 */
-	struct iommu_domain *resetting_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
 	unsigned int owner_cnt;
+	/*
+	 * Number of devices in the group undergoing or awaiting recovery.
+	 * If non-zero, concurrent domain attachments are rejected.
+	 */
+	unsigned int recovery_cnt;
 	void *owner;
 };
 
@@ -76,12 +76,28 @@ struct group_device {
 	struct list_head list;
 	struct device *dev;
 	char *name;
+	bool blocked;
+	unsigned int reset_depth;
 };
 
 /* Iterate over each struct group_device in a struct iommu_group */
 #define for_each_group_device(group, pos) \
 	list_for_each_entry(pos, &(group)->devices, list)
 
+static struct group_device *__dev_to_gdev(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct group_device *gdev;
+
+	lockdep_assert_held(&group->mutex);
+
+	for_each_group_device(group, gdev) {
+		if (gdev->dev == dev)
+			return gdev;
+	}
+	return NULL;
+}
+
 struct iommu_group_attribute {
 	struct attribute attr;
 	ssize_t (*show)(struct iommu_group *group, char *buf);
@@ -2191,6 +2207,8 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
 
 int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
 {
+	struct group_device *gdev;
+
 	/*
 	 * This is called on the dma mapping fast path so avoid locking. This is
 	 * racy, but we have an expectation that the driver will setup its DMAs
@@ -2201,6 +2219,9 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
 
 	guard(mutex)(&dev->iommu_group->mutex);
 
+	gdev = __dev_to_gdev(dev);
+	if (WARN_ON(!gdev))
+		return -ENODEV;
 	/*
 	 * This is a concurrent attach during a device reset. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
@@ -2208,7 +2229,7 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
 	 * Note that this might fail the iommu_dma_map(). But there's nothing
 	 * more we can do here.
 	 */
-	if (dev->iommu_group->resetting_domain)
+	if (gdev->blocked)
 		return -EBUSY;
 	return __iommu_attach_device(domain, dev, NULL);
 }
@@ -2265,19 +2286,23 @@ EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
 struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
 {
 	struct iommu_group *group = dev->iommu_group;
+	struct group_device *gdev;
 
 	lockdep_assert_held(&group->mutex);
+	gdev = __dev_to_gdev(dev);
+	if (WARN_ON(!gdev))
+		return NULL;
 
 	/*
 	 * Driver handles the low-level __iommu_attach_device(), including the
 	 * one invoked by pci_dev_reset_iommu_done() re-attaching the device to
 	 * the cached group->domain. In this case, the driver must get the old
-	 * domain from group->resetting_domain rather than group->domain. This
+	 * domain from group->blocking_domain rather than group->domain. This
 	 * prevents it from re-attaching the device from group->domain (old) to
 	 * group->domain (new).
 	 */
-	if (group->resetting_domain)
-		return group->resetting_domain;
+	if (gdev->blocked)
+		return group->blocking_domain;
 
 	return group->domain;
 }
@@ -2436,10 +2461,10 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
 		return -EINVAL;
 
 	/*
-	 * This is a concurrent attach during a device reset. Reject it until
+	 * This is a concurrent attach during device recovery. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
 	 */
-	if (group->resetting_domain)
+	if (group->recovery_cnt)
 		return -EBUSY;
 
 	/*
@@ -3567,10 +3592,10 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
 	mutex_lock(&group->mutex);
 
 	/*
-	 * This is a concurrent attach during a device reset. Reject it until
+	 * This is a concurrent attach during device recovery. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
 	 */
-	if (group->resetting_domain) {
+	if (group->recovery_cnt) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -3660,10 +3685,10 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
 	mutex_lock(&group->mutex);
 
 	/*
-	 * This is a concurrent attach during a device reset. Reject it until
+	 * This is a concurrent attach during device recovery. Reject it until
 	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
 	 */
-	if (group->resetting_domain) {
+	if (group->recovery_cnt) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -3934,12 +3959,12 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
  * routine wants to block any IOMMU activity: translation and ATS invalidation.
  *
  * This function attaches the device's RID/PASID(s) the group->blocking_domain,
- * setting the group->resetting_domain. This allows the IOMMU driver pausing any
+ * incrementing the group->recovery_cnt, to allow the IOMMU driver pausing any
  * IOMMU activity while leaving the group->domain pointer intact. Later when the
  * reset is finished, pci_dev_reset_iommu_done() can restore everything.
  *
  * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to unset the resetting_domain.
+ * before/after the core-level reset routine, to decrement the recovery_cnt.
  *
  * Return: 0 on success or negative error code if the preparation failed.
  *
@@ -3952,6 +3977,7 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
 int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 {
 	struct iommu_group *group = pdev->dev.iommu_group;
+	struct group_device *gdev;
 	unsigned long pasid;
 	void *entry;
 	int ret;
@@ -3961,21 +3987,25 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 
 	guard(mutex)(&group->mutex);
 
-	/* Re-entry is not allowed */
-	if (WARN_ON(group->resetting_domain))
-		return -EBUSY;
+	gdev = __dev_to_gdev(&pdev->dev);
+	if (WARN_ON(!gdev))
+		return -ENODEV;
+
+	if (gdev->reset_depth++)
+		return 0;
 
 	ret = __iommu_group_alloc_blocking_domain(group);
 	if (ret)
-		return ret;
+		goto err_depth;
 
 	/* Stage RID domain at blocking_domain while retaining group->domain */
 	if (group->domain != group->blocking_domain) {
 		ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
 					    group->domain);
 		if (ret)
-			return ret;
+			goto err_depth;
 	}
+	gdev->blocked = true;
 
 	/*
 	 * Stage PASID domains at blocking_domain while retaining pasid_array.
@@ -3987,7 +4017,11 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 		iommu_remove_dev_pasid(&pdev->dev, pasid,
 				       pasid_array_entry_to_domain(entry));
 
-	group->resetting_domain = group->blocking_domain;
+	group->recovery_cnt++;
+	return ret;
+
+err_depth:
+	gdev->reset_depth--;
 	return ret;
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
@@ -3997,9 +4031,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
  * @pdev: PCI device that has finished a reset routine
  *
  * After a PCIe device finishes a reset routine, it wants to restore its IOMMU
- * IOMMU activity, including new translation as well as cache invalidation, by
- * re-attaching all RID/PASID of the device's back to the domains retained in
- * the core-level structure.
+ * activity, including new translation and cache invalidation, by re-attaching
+ * all RID/PASID of the device back to the domains retained in the core-level
+ * structure.
  *
  * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
  *
@@ -4009,6 +4043,7 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
 void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 {
 	struct iommu_group *group = pdev->dev.iommu_group;
+	struct group_device *gdev;
 	unsigned long pasid;
 	void *entry;
 
@@ -4017,11 +4052,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 
 	guard(mutex)(&group->mutex);
 
-	/* pci_dev_reset_iommu_prepare() was bypassed for the device */
-	if (!group->resetting_domain)
+	gdev = __dev_to_gdev(&pdev->dev);
+	if (WARN_ON(!gdev))
+		return;
+
+	/* Unbalanced done() calls would underflow the counter */
+	if (WARN_ON(gdev->reset_depth == 0))
+		return;
+	if (--gdev->reset_depth)
 		return;
 
-	/* pci_dev_reset_iommu_prepare() was not successfully called */
 	if (WARN_ON(!group->blocking_domain))
 		return;
 
@@ -4030,6 +4070,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 		WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
 					      group->blocking_domain));
 	}
+	gdev->blocked = false;
 
 	/*
 	 * Re-attach PASID domains back to the domains retained in pasid_array.
@@ -4037,12 +4078,19 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 	 * The pasid_array is mostly fenced by group->mutex, except one reader
 	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
 	 */
-	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
-		WARN_ON(__iommu_set_group_pasid(
-			pasid_array_entry_to_domain(entry), group, pasid,
-			group->blocking_domain));
+	if (pdev->dev.iommu->max_pasids > 0) {
+		xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+			struct iommu_domain *pasid_dom =
+				pasid_array_entry_to_domain(entry);
+
+			WARN_ON(pasid_dom->ops->set_dev_pasid(
+				pasid_dom, &pdev->dev, pasid,
+				group->blocking_domain));
+		}
+	}
 
-	group->resetting_domain = NULL;
+	if (!WARN_ON(group->recovery_cnt == 0))
+		group->recovery_cnt--;
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH rc v5] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
  2026-04-04  5:02 [PATCH rc v5] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
@ 2026-04-05  0:55 ` Nicolin Chen
  2026-04-07  6:36 ` Shuai Xue
  1 sibling, 0 replies; 3+ messages in thread
From: Nicolin Chen @ 2026-04-05  0:55 UTC (permalink / raw)
  To: joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel, xueshuai

On Fri, Apr 03, 2026 at 10:02:43PM -0700, Nicolin Chen wrote:
> @@ -76,12 +76,28 @@ struct group_device {
>  	struct list_head list;
>  	struct device *dev;
>  	char *name;
> +	bool blocked;
> +	unsigned int reset_depth;

Perhaps I should have mentioned that this "blocked" will be useful
when shared with the followup series to flag the device blocked by
an ongoing/failed reset or quarantine.

Nicolin

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH rc v5] iommu: Fix nested pci_dev_reset_iommu_prepare/done()
  2026-04-04  5:02 [PATCH rc v5] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
  2026-04-05  0:55 ` Nicolin Chen
@ 2026-04-07  6:36 ` Shuai Xue
  1 sibling, 0 replies; 3+ messages in thread
From: Shuai Xue @ 2026-04-07  6:36 UTC (permalink / raw)
  To: Nicolin Chen, joro, kevin.tian, jgg
  Cc: will, robin.murphy, baolu.lu, iommu, linux-kernel



On 4/4/26 1:02 PM, Nicolin Chen wrote:
> Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
> internally while both are calling pci_dev_reset_iommu_prepare/done().
> 
> As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
> will trigger a WARN_ON and return -EBUSY, resulting in failing the entire
> device reset.
> 
> On the other hand, removing the outer calls in the PCI callers is unsafe.
> As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
> execute custom firmware waits after their inner pcie_flr() completes. If
> the IOMMU protection relies solely on the inner reset, the IOMMU will be
> unblocked prematurely while the device is still resetting.
> 
> Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.
> 
> Given the IOMMU core tracks the resetting state per iommu_group while the
> reset is per device, this has to track at the group_device level as well.
> 
> Introduce a 'reset_depth' to struct group_device to handle the re-entries
> on the same device. This allows multi-device groups to isolate concurrent
> device resets independently.
> 
> Note that iommu_deferred_attach() and iommu_driver_get_domain_for_dev()
> both now check gdev->reset_depth (per-device) instead of a per-group flag
> like "group->resetting_domain". This is actually more precise.
> 
> As the reset routine is per gdev, it cannot clear group->resetting_domain
> without iterating over the device list to ensure no other device is being
> reset. Simplify it by replacing the resetting_domain with a 'recovery_cnt'
> in the struct iommu_group.
> 
> Since both helpers are now per gdev, call the per-device set_dev_pasid op
> to recover PASID domains.
> 
> While fixing the bug, also fix the kdoc for pci_dev_reset_iommu_done().
> 
> Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
> Cc: stable@vger.kernel.org
> Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> Changelog
>   v5:
>    * Add 'blocked' to fix iommu_driver_get_domain_for_dev() return.
>   v4:
>    https://lore.kernel.org/all/20260324014056.36103-1-nicolinc@nvidia.com/
>    * Rename 'reset_cnt' to 'recovery_cnt'
>   v3:
>    https://lore.kernel.org/all/20260321223930.10836-1-nicolinc@nvidia.com/
>    * Turn prepare()/done() to be per-gdev
>    * Use reset_depth to track nested re-entries
>    * Replace group->resetting_domain with a reset_cnt
>   v2:
>    https://lore.kernel.org/all/20260319043135.1153534-1-nicolinc@nvidia.com/
>    * Fix in the helpers by allowing re-entry
>   v1:
>    https://lore.kernel.org/all/20260318220028.1146905-1-nicolinc@nvidia.com/
> 
>   drivers/iommu/iommu.c | 116 +++++++++++++++++++++++++++++-------------
>   1 file changed, 82 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 35db517809540..4bd7a4afc9881 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -61,14 +61,14 @@ struct iommu_group {
>   	int id;
>   	struct iommu_domain *default_domain;
>   	struct iommu_domain *blocking_domain;
> -	/*
> -	 * During a group device reset, @resetting_domain points to the physical
> -	 * domain, while @domain points to the attached domain before the reset.
> -	 */
> -	struct iommu_domain *resetting_domain;
>   	struct iommu_domain *domain;
>   	struct list_head entry;
>   	unsigned int owner_cnt;
> +	/*
> +	 * Number of devices in the group undergoing or awaiting recovery.
> +	 * If non-zero, concurrent domain attachments are rejected.
> +	 */
> +	unsigned int recovery_cnt;
>   	void *owner;
>   };
>   
> @@ -76,12 +76,28 @@ struct group_device {
>   	struct list_head list;
>   	struct device *dev;
>   	char *name;
> +	bool blocked;
> +	unsigned int reset_depth;
>   };
>   
>   /* Iterate over each struct group_device in a struct iommu_group */
>   #define for_each_group_device(group, pos) \
>   	list_for_each_entry(pos, &(group)->devices, list)
>   
> +static struct group_device *__dev_to_gdev(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	struct group_device *gdev;
> +
> +	lockdep_assert_held(&group->mutex);
> +
> +	for_each_group_device(group, gdev) {
> +		if (gdev->dev == dev)
> +			return gdev;
> +	}
> +	return NULL;
> +}
> +
>   struct iommu_group_attribute {
>   	struct attribute attr;
>   	ssize_t (*show)(struct iommu_group *group, char *buf);
> @@ -2191,6 +2207,8 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
>   
>   int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
>   {
> +	struct group_device *gdev;
> +
>   	/*
>   	 * This is called on the dma mapping fast path so avoid locking. This is
>   	 * racy, but we have an expectation that the driver will setup its DMAs
> @@ -2201,6 +2219,9 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
>   
>   	guard(mutex)(&dev->iommu_group->mutex);
>   
> +	gdev = __dev_to_gdev(dev);
> +	if (WARN_ON(!gdev))
> +		return -ENODEV;
>   	/*
>   	 * This is a concurrent attach during a device reset. Reject it until
>   	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
> @@ -2208,7 +2229,7 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
>   	 * Note that this might fail the iommu_dma_map(). But there's nothing
>   	 * more we can do here.
>   	 */
> -	if (dev->iommu_group->resetting_domain)
> +	if (gdev->blocked)
>   		return -EBUSY;
>   	return __iommu_attach_device(domain, dev, NULL);
>   }
> @@ -2265,19 +2286,23 @@ EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
>   struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
>   {
>   	struct iommu_group *group = dev->iommu_group;
> +	struct group_device *gdev;
>   
>   	lockdep_assert_held(&group->mutex);
> +	gdev = __dev_to_gdev(dev);
> +	if (WARN_ON(!gdev))
> +		return NULL;
>   
>   	/*
>   	 * Driver handles the low-level __iommu_attach_device(), including the
>   	 * one invoked by pci_dev_reset_iommu_done() re-attaching the device to
>   	 * the cached group->domain. In this case, the driver must get the old
> -	 * domain from group->resetting_domain rather than group->domain. This
> +	 * domain from group->blocking_domain rather than group->domain. This
>   	 * prevents it from re-attaching the device from group->domain (old) to
>   	 * group->domain (new).
>   	 */
> -	if (group->resetting_domain)
> -		return group->resetting_domain;
> +	if (gdev->blocked)
> +		return group->blocking_domain;
>   
>   	return group->domain;
>   }
> @@ -2436,10 +2461,10 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
>   		return -EINVAL;
>   
>   	/*
> -	 * This is a concurrent attach during a device reset. Reject it until
> +	 * This is a concurrent attach during device recovery. Reject it until
>   	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
>   	 */
> -	if (group->resetting_domain)
> +	if (group->recovery_cnt)
>   		return -EBUSY;
>   
>   	/*
> @@ -3567,10 +3592,10 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
>   	mutex_lock(&group->mutex);
>   
>   	/*
> -	 * This is a concurrent attach during a device reset. Reject it until
> +	 * This is a concurrent attach during device recovery. Reject it until
>   	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
>   	 */
> -	if (group->resetting_domain) {
> +	if (group->recovery_cnt) {
>   		ret = -EBUSY;
>   		goto out_unlock;
>   	}
> @@ -3660,10 +3685,10 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
>   	mutex_lock(&group->mutex);
>   
>   	/*
> -	 * This is a concurrent attach during a device reset. Reject it until
> +	 * This is a concurrent attach during device recovery. Reject it until
>   	 * pci_dev_reset_iommu_done() attaches the device to group->domain.
>   	 */
> -	if (group->resetting_domain) {
> +	if (group->recovery_cnt) {
>   		ret = -EBUSY;
>   		goto out_unlock;
>   	}
> @@ -3934,12 +3959,12 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
>    * routine wants to block any IOMMU activity: translation and ATS invalidation.
>    *
>    * This function attaches the device's RID/PASID(s) the group->blocking_domain,
> - * setting the group->resetting_domain. This allows the IOMMU driver pausing any
> + * incrementing the group->recovery_cnt, to allow the IOMMU driver pausing any
>    * IOMMU activity while leaving the group->domain pointer intact. Later when the
>    * reset is finished, pci_dev_reset_iommu_done() can restore everything.
>    *
>    * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
> - * before/after the core-level reset routine, to unset the resetting_domain.
> + * before/after the core-level reset routine, to decrement the recovery_cnt.
>    *
>    * Return: 0 on success or negative error code if the preparation failed.
>    *
> @@ -3952,6 +3977,7 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
>   int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
>   {
>   	struct iommu_group *group = pdev->dev.iommu_group;
> +	struct group_device *gdev;
>   	unsigned long pasid;
>   	void *entry;
>   	int ret;
> @@ -3961,21 +3987,25 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
>   
>   	guard(mutex)(&group->mutex);
>   
> -	/* Re-entry is not allowed */
> -	if (WARN_ON(group->resetting_domain))
> -		return -EBUSY;
> +	gdev = __dev_to_gdev(&pdev->dev);
> +	if (WARN_ON(!gdev))
> +		return -ENODEV;
> +
> +	if (gdev->reset_depth++)
> +		return 0;
>   
>   	ret = __iommu_group_alloc_blocking_domain(group);
>   	if (ret)
> -		return ret;
> +		goto err_depth;
>   
>   	/* Stage RID domain at blocking_domain while retaining group->domain */
>   	if (group->domain != group->blocking_domain) {
>   		ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
>   					    group->domain);
>   		if (ret)
> -			return ret;
> +			goto err_depth;
>   	}
> +	gdev->blocked = true;
>   
>   	/*
>   	 * Stage PASID domains at blocking_domain while retaining pasid_array.
> @@ -3987,7 +4017,11 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
>   		iommu_remove_dev_pasid(&pdev->dev, pasid,
>   				       pasid_array_entry_to_domain(entry));

Hi Nicolin,

The max_pasids check is indeed missing in
pci_dev_reset_iommu_prepare(). All other callers of
iommu_remove_dev_pasid() guard it with a max_pasids > 0 check:

- __iommu_remove_group_pasid() checks device->dev->iommu->max_pasids > 0
- __iommu_set_group_pasid() error rollback path also checks it

But pci_dev_reset_iommu_prepare() calls iommu_remove_dev_pasid()
unconditionally while iterating pasid_array. Since pasid_array is
per-group, in a multi-device group where only some devices support
PASIDs, this could call the driver's set_dev_pasid on a device that
doesn't support PASIDs.

Note that this is a pre-existing issue in the
original code before this patch, not a regression introduced by this
fix.


>   
> -	group->resetting_domain = group->blocking_domain;
> +	group->recovery_cnt++;
> +	return ret;
> +
> +err_depth:
> +	gdev->reset_depth--;
>   	return ret;
>   }
>   EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
> @@ -3997,9 +4031,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
>    * @pdev: PCI device that has finished a reset routine
>    *
>    * After a PCIe device finishes a reset routine, it wants to restore its IOMMU
> - * IOMMU activity, including new translation as well as cache invalidation, by
> - * re-attaching all RID/PASID of the device's back to the domains retained in
> - * the core-level structure.
> + * activity, including new translation and cache invalidation, by re-attaching
> + * all RID/PASID of the device back to the domains retained in the core-level
> + * structure.
>    *
>    * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
>    *
> @@ -4009,6 +4043,7 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
>   void pci_dev_reset_iommu_done(struct pci_dev *pdev)
>   {
>   	struct iommu_group *group = pdev->dev.iommu_group;
> +	struct group_device *gdev;
>   	unsigned long pasid;
>   	void *entry;
>   
> @@ -4017,11 +4052,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
>   
>   	guard(mutex)(&group->mutex);
>   
> -	/* pci_dev_reset_iommu_prepare() was bypassed for the device */
> -	if (!group->resetting_domain)
> +	gdev = __dev_to_gdev(&pdev->dev);
> +	if (WARN_ON(!gdev))
> +		return;
> +
> +	/* Unbalanced done() calls would underflow the counter */
> +	if (WARN_ON(gdev->reset_depth == 0))
> +		return;
> +	if (--gdev->reset_depth)
>   		return;
>   
> -	/* pci_dev_reset_iommu_prepare() was not successfully called */
>   	if (WARN_ON(!group->blocking_domain))
>   		return;
>   
> @@ -4030,6 +4070,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
>   		WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
>   					      group->blocking_domain));
>   	}
> +	gdev->blocked = false;


One question about the placement of gdev->blocked = false in done(). In
the original code, group->resetting_domain is cleared at the very end of
done(), after both RID re-attach and PASID restore:

__iommu_attach_device(group->domain, ...);   // RID re-attach
__iommu_set_group_pasid(...);                 // PASID restore
group->resetting_domain = NULL;               // cleared last

So iommu_driver_get_domain_for_dev() returns blocking_domain throughout
the entire restore sequence.In v5, gdev->blocked is cleared between RID
re-attach and PASID restore:

__iommu_attach_device(group->domain, ...);   // blocked=true
gdev->blocked = false;                        // cleared here
set_dev_pasid(...);                           // blocked=false

This means during PASID restore, iommu_driver_get_domain_for_dev() now
returns group->domain instead of blocking_domain — a behavioral change
from the original code.

I traced through the ARM SMMUv3 path: arm_smmu_set_pasid() →
arm_smmu_update_ste(), and the early return at:

if (master->cd_table.in_ste && master->ste_ats_enabled == ats_enabled)
     return;

is always taken during PASID restore (since RID re-attach already
installed the CD table and ATS state is consistent), so sid_domain is
effectively unused. No functional impact on SMMUv3.

Just want to confirm: is this behavioral change deliberate? Or should
blocked stay true until after PASID restore to preserve the original
behavior?

Thanks.
Shuai

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-04-07  6:36 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-04  5:02 [PATCH rc v5] iommu: Fix nested pci_dev_reset_iommu_prepare/done() Nicolin Chen
2026-04-05  0:55 ` Nicolin Chen
2026-04-07  6:36 ` Shuai Xue

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox