linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets
@ 2025-06-09 18:45 Nicolin Chen
  2025-06-09 18:45 ` [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Nicolin Chen @ 2025-06-09 18:45 UTC (permalink / raw)
  To: jgg, joro, will, robin.murphy, bhelgaas
  Cc: iommu, linux-kernel, linux-pci, patches, pjaroszynski, vsethi

Hi all,

Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
before initiating a Function Level Reset, and then ensure no invalidation
requests being issued to a device when its ATS capability is disabled.

Both pci_enable_ats() and pci_disable_ats() are called by an IOMMU driver,
but an unsolicited FLR can happen at any time in the PCI layer. This might
result in a race between them, breaking the rules given by the PCIe Spec.

Therefore, there needs to be a sync between IOMMU and PCI subsystems, to
ensure that ATS will be disabled and never gets re-enabled until the FLR
finishes. Add a pair of new IOMMU helpers for PCI reset functions to call
before and after the reset routines. These two helpers will temporally
attach the device's RID/PASID to IOMMU_DOMAIN_BLOCKED, which should allow
its IOMMU driver to pause any DMA traffic and disable ATS feature until
the FLR is done.

This is on Github:
https://github.com/nicolinc/iommufd/commits/iommu_dev_reset-rfcv1

Thanks
Nicolin

Nicolin Chen (2):
  iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  pci: Suspend ATS before doing FLR

 include/linux/iommu.h |  12 +++++
 drivers/iommu/iommu.c | 106 ++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.c     |  42 +++++++++++++++--
 3 files changed, 156 insertions(+), 4 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-09 18:45 [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Nicolin Chen
@ 2025-06-09 18:45 ` Nicolin Chen
  2025-06-10  4:26   ` Baolu Lu
  2025-06-09 18:45 ` [PATCH RFC v1 2/2] pci: Suspend ATS before doing FLR Nicolin Chen
  2025-06-10 15:37 ` [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Robin Murphy
  2 siblings, 1 reply; 21+ messages in thread
From: Nicolin Chen @ 2025-06-09 18:45 UTC (permalink / raw)
  To: jgg, joro, will, robin.murphy, bhelgaas
  Cc: iommu, linux-kernel, linux-pci, patches, pjaroszynski, vsethi

Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
before initiating a Function Level Reset, and then ensure no invalidation
requests being issued to a device when its ATS capability is disabled.

Since pci_enable_ats() and pci_disable_ats() are called by an IOMMU driver
while an unsolicited FLR can happen at any time in the PCI layer, PCI code
has to notify the IOMMU subsystem about the ongoing FLR. Add a pair of new
IOMMU APIs that will be called by the PCI reset functions before&after the
reset routines.

However, if there is a domain attachment/replacement happening during an
ongoing reset, the ATS might be re-enabled between the two function calls.
So the iommu_dev_reset_prepare() has to hold the group mutex to avoid this
race condition, unitl iommu_dev_reset_done() is finished. Thus, these two
functions are a strong pair that must be used together.

Inside the mutex, these two functions will dock all RID and PASID domains
to an IOMMU_DOMAIN_BLOCKED. This would further disable ATS by two-fold: an
IOMMU driver should disable ATS in its control bits (e.g. SMMU's STE.EATS)
and an IOMMU driver should call pci_disable_ats() as well.

Notes:
 - This only works for IOMMU drivers that implemented ops->blocked_domain
   correctly with pci_disable_ats().
 - This only works for IOMMU drivers that will not issue ATS invalidation
   requests to the device, after it's docked at ops->blocked_domain.
Driver should fix itself to align with the aforementioned notes.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommu.h |  12 +++++
 drivers/iommu/iommu.c | 106 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 156732807994..a17161b8625a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1123,6 +1123,9 @@ void dev_iommu_priv_set(struct device *dev, void *priv);
 extern struct mutex iommu_probe_device_lock;
 int iommu_probe_device(struct device *dev);
 
+int iommu_dev_reset_prepare(struct device *dev);
+void iommu_dev_reset_done(struct device *dev);
+
 int iommu_device_use_default_domain(struct device *dev);
 void iommu_device_unuse_default_domain(struct device *dev);
 
@@ -1407,6 +1410,15 @@ static inline int iommu_fwspec_add_ids(struct device *dev, u32 *ids,
 	return -ENODEV;
 }
 
+static inline int iommu_dev_reset_prepare(struct device *dev)
+{
+	return 0;
+}
+
+static inline void iommu_dev_reset_done(struct device *dev)
+{
+}
+
 static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
 {
 	return NULL;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a4b606c591da..3c1854c5e55e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3795,6 +3795,112 @@ int iommu_replace_group_handle(struct iommu_group *group,
 }
 EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
 
+/*
+ * Deadlock Alert
+ *
+ * Caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done() together
+ * before/after the core-level reset routine, as iommu_dev_reset_prepare() holds
+ * the group->mutex that will be only released in iommu_dev_reset_done().
+ */
+int iommu_dev_reset_prepare(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	const struct iommu_ops *ops;
+	unsigned long pasid;
+	void *entry;
+	int ret;
+
+	/* Before locking */
+	if (!dev_has_iommu(dev))
+		return 0;
+
+	if (dev->iommu->require_direct) {
+		dev_warn(dev,
+			 "Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.\n");
+		return -EINVAL;
+	}
+
+	ops = dev_iommu_ops(dev);
+	if (!ops->blocked_domain) {
+		dev_warn(dev,
+			 "IOMMU driver doesn't support IOMMU_DOMAIN_BLOCKED\n");
+		return -EOPNOTSUPP;
+	}
+
+	/*
+	 * group->mutex starts
+	 *
+	 * This has to hold the group mutex until the reset is done, to prevent
+	 * any RID or PASID domain attachment/replacement, which otherwise might
+	 * re-enable the ATS during the reset cycle.
+	 */
+	mutex_lock(&group->mutex);
+
+	/* Device is already attached to the blocked_domain. Nothing to do */
+	if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
+		return 0;
+
+	/* Dock RID domain to blocked_domain while retaining group->domain */
+	ret = __iommu_attach_device(ops->blocked_domain, dev);
+	if (ret)
+		return ret;
+
+	/* Dock PASID domains to blocked_domain while retaining pasid_array */
+	xa_lock(&group->pasid_array);
+	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+		iommu_remove_dev_pasid(dev, pasid,
+				       pasid_array_entry_to_domain(entry));
+	xa_unlock(&group->pasid_array);
+
+	/* group->mutex is held. Caller must invoke iommu_dev_reset_done() */
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
+
+/*
+ * This is the resume routine of iommu_dev_reset_prepare(). It unlocks the group
+ * mutex at end, after all RID/PASID domains are re-attached.
+ *
+ * Note that, although unlikely, there is a risk that re-attaching domains might
+ * fail due to some unexpected happening like OOM.
+ */
+void iommu_dev_reset_done(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	const struct iommu_ops *ops;
+	unsigned long pasid;
+	void *entry;
+
+	/* Previously unlocked */
+	if (!dev_has_iommu(dev))
+		return;
+	ops = dev_iommu_ops(dev);
+	if (!ops->blocked_domain)
+		return;
+
+	/* group->mutex held in iommu_dev_reset_prepare() continues from here */
+	WARN_ON(!lockdep_is_held(&group->mutex));
+
+	if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
+		goto unlock;
+
+	/* Shift RID domain back to group->domain */
+	WARN_ON(__iommu_attach_device(group->domain, dev));
+
+	/* Shift PASID domains back to domains retained in pasid_array */
+	xa_lock(&group->pasid_array);
+	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+		WARN_ON(__iommu_set_group_pasid(
+			pasid_array_entry_to_domain(entry), group, pasid,
+			ops->blocked_domain));
+	xa_unlock(&group->pasid_array);
+
+unlock:
+	mutex_unlock(&group->mutex);
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
+
 #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
 /**
  * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC v1 2/2] pci: Suspend ATS before doing FLR
  2025-06-09 18:45 [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Nicolin Chen
  2025-06-09 18:45 ` [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
@ 2025-06-09 18:45 ` Nicolin Chen
  2025-06-10  4:27   ` Baolu Lu
  2025-06-10 15:37 ` [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Robin Murphy
  2 siblings, 1 reply; 21+ messages in thread
From: Nicolin Chen @ 2025-06-09 18:45 UTC (permalink / raw)
  To: jgg, joro, will, robin.murphy, bhelgaas
  Cc: iommu, linux-kernel, linux-pci, patches, pjaroszynski, vsethi

Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
before initiating a Function Level Reset.

Call iommu_dev_reset_prepare() before FLR and iommu_dev_reset_done() after,
in the two FLR Functions. This will dock the device at IOMMU_DOMAIN_BLOCKED
during the FLR function, which should allow the IOMMU driver to pause DMA
traffic and invode pci_disable_ats() and pci_enable_ats() respectively.

Add a warning if ATS isn't disabled, in which case IOMMU driver should fix
itself to disable ATS following the design in iommu_dev_reset_prepare().

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/pci/pci.c | 42 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e9448d55113b..61535435bde1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -13,6 +13,7 @@
 #include <linux/delay.h>
 #include <linux/dmi.h>
 #include <linux/init.h>
+#include <linux/iommu.h>
 #include <linux/msi.h>
 #include <linux/of.h>
 #include <linux/pci.h>
@@ -4518,13 +4519,26 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
  */
 int pcie_flr(struct pci_dev *dev)
 {
+	int ret = 0;
+
 	if (!pci_wait_for_pending_transaction(dev))
 		pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
 
+	/*
+	 * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
+	 * before initiating a Function Level Reset. So notify the iommu driver
+	 * that actually enabled ATS. Have to call it after waiting for pending
+	 * DMA transaction.
+	 */
+	if (iommu_dev_reset_prepare(&dev->dev))
+		pci_err(dev, "failed to stop IOMMU\n");
+	if (dev->ats_enabled)
+		pci_err(dev, "failed to stop ATS\n");
+
 	pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
 
 	if (dev->imm_ready)
-		return 0;
+		goto done;
 
 	/*
 	 * Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within
@@ -4533,7 +4547,11 @@ int pcie_flr(struct pci_dev *dev)
 	 */
 	msleep(100);
 
-	return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+	ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+
+done:
+	iommu_dev_reset_done(&dev->dev);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(pcie_flr);
 
@@ -4561,6 +4579,7 @@ EXPORT_SYMBOL_GPL(pcie_reset_flr);
 
 static int pci_af_flr(struct pci_dev *dev, bool probe)
 {
+	int ret = 0;
 	int pos;
 	u8 cap;
 
@@ -4587,10 +4606,21 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
 				 PCI_AF_STATUS_TP << 8))
 		pci_err(dev, "timed out waiting for pending transaction; performing AF function level reset anyway\n");
 
+	/*
+	 * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
+	 * before initiating a Function Level Reset. So notify the iommu driver
+	 * that actually enabled ATS. Have to call it after waiting for pending
+	 * DMA transaction.
+	 */
+	if (iommu_dev_reset_prepare(&dev->dev))
+		pci_err(dev, "failed to stop IOMMU\n");
+	if (dev->ats_enabled)
+		pci_err(dev, "failed to stop ATS\n");
+
 	pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
 
 	if (dev->imm_ready)
-		return 0;
+		goto done;
 
 	/*
 	 * Per Advanced Capabilities for Conventional PCI ECN, 13 April 2006,
@@ -4600,7 +4630,11 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
 	 */
 	msleep(100);
 
-	return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+	ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+
+done:
+	iommu_dev_reset_done(&dev->dev);
+	return ret;
 }
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-09 18:45 ` [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
@ 2025-06-10  4:26   ` Baolu Lu
  2025-06-10  7:07     ` Nicolin Chen
  0 siblings, 1 reply; 21+ messages in thread
From: Baolu Lu @ 2025-06-10  4:26 UTC (permalink / raw)
  To: Nicolin Chen, jgg, joro, will, robin.murphy, bhelgaas
  Cc: iommu, linux-kernel, linux-pci, patches, pjaroszynski, vsethi

On 6/10/25 02:45, Nicolin Chen wrote:
> Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> before initiating a Function Level Reset, and then ensure no invalidation
> requests being issued to a device when its ATS capability is disabled.
> 
> Since pci_enable_ats() and pci_disable_ats() are called by an IOMMU driver
> while an unsolicited FLR can happen at any time in the PCI layer, PCI code
> has to notify the IOMMU subsystem about the ongoing FLR. Add a pair of new
> IOMMU APIs that will be called by the PCI reset functions before&after the
> reset routines.
> 
> However, if there is a domain attachment/replacement happening during an
> ongoing reset, the ATS might be re-enabled between the two function calls.
> So the iommu_dev_reset_prepare() has to hold the group mutex to avoid this
> race condition, unitl iommu_dev_reset_done() is finished. Thus, these two
> functions are a strong pair that must be used together.
> 
> Inside the mutex, these two functions will dock all RID and PASID domains
> to an IOMMU_DOMAIN_BLOCKED. This would further disable ATS by two-fold: an
> IOMMU driver should disable ATS in its control bits (e.g. SMMU's STE.EATS)
> and an IOMMU driver should call pci_disable_ats() as well.
> 
> Notes:
>   - This only works for IOMMU drivers that implemented ops->blocked_domain
>     correctly with pci_disable_ats().
>   - This only works for IOMMU drivers that will not issue ATS invalidation
>     requests to the device, after it's docked at ops->blocked_domain.
> Driver should fix itself to align with the aforementioned notes.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   include/linux/iommu.h |  12 +++++
>   drivers/iommu/iommu.c | 106 ++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 118 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 156732807994..a17161b8625a 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -1123,6 +1123,9 @@ void dev_iommu_priv_set(struct device *dev, void *priv);
>   extern struct mutex iommu_probe_device_lock;
>   int iommu_probe_device(struct device *dev);
>   
> +int iommu_dev_reset_prepare(struct device *dev);
> +void iommu_dev_reset_done(struct device *dev);
> +
>   int iommu_device_use_default_domain(struct device *dev);
>   void iommu_device_unuse_default_domain(struct device *dev);
>   
> @@ -1407,6 +1410,15 @@ static inline int iommu_fwspec_add_ids(struct device *dev, u32 *ids,
>   	return -ENODEV;
>   }
>   
> +static inline int iommu_dev_reset_prepare(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static inline void iommu_dev_reset_done(struct device *dev)
> +{
> +}
> +
>   static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
>   {
>   	return NULL;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index a4b606c591da..3c1854c5e55e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -3795,6 +3795,112 @@ int iommu_replace_group_handle(struct iommu_group *group,
>   }
>   EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
>   
> +/*
> + * Deadlock Alert
> + *
> + * Caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done() together
> + * before/after the core-level reset routine, as iommu_dev_reset_prepare() holds
> + * the group->mutex that will be only released in iommu_dev_reset_done().
> + */
> +int iommu_dev_reset_prepare(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	const struct iommu_ops *ops;
> +	unsigned long pasid;
> +	void *entry;
> +	int ret;
> +
> +	/* Before locking */
> +	if (!dev_has_iommu(dev))
> +		return 0;
> +
> +	if (dev->iommu->require_direct) {
> +		dev_warn(dev,
> +			 "Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.\n");
> +		return -EINVAL;
> +	}
> +
> +	ops = dev_iommu_ops(dev);

Should this be protected by group->mutext?

> +	if (!ops->blocked_domain) {
> +		dev_warn(dev,
> +			 "IOMMU driver doesn't support IOMMU_DOMAIN_BLOCKED\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/*
> +	 * group->mutex starts
> +	 *
> +	 * This has to hold the group mutex until the reset is done, to prevent
> +	 * any RID or PASID domain attachment/replacement, which otherwise might
> +	 * re-enable the ATS during the reset cycle.
> +	 */
> +	mutex_lock(&group->mutex);

Is it possible that group has been freed when it reaches here?

> +
> +	/* Device is already attached to the blocked_domain. Nothing to do */
> +	if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
> +		return 0;
> +
> +	/* Dock RID domain to blocked_domain while retaining group->domain */
> +	ret = __iommu_attach_device(ops->blocked_domain, dev);
> +	if (ret)
> +		return ret;
> +
> +	/* Dock PASID domains to blocked_domain while retaining pasid_array */
> +	xa_lock(&group->pasid_array);
> +	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
> +		iommu_remove_dev_pasid(dev, pasid,
> +				       pasid_array_entry_to_domain(entry));
> +	xa_unlock(&group->pasid_array);
> +
> +	/* group->mutex is held. Caller must invoke iommu_dev_reset_done() */
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
> +
> +/*
> + * This is the resume routine of iommu_dev_reset_prepare(). It unlocks the group
> + * mutex at end, after all RID/PASID domains are re-attached.
> + *
> + * Note that, although unlikely, there is a risk that re-attaching domains might
> + * fail due to some unexpected happening like OOM.
> + */
> +void iommu_dev_reset_done(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	const struct iommu_ops *ops;
> +	unsigned long pasid;
> +	void *entry;
> +
> +	/* Previously unlocked */
> +	if (!dev_has_iommu(dev))
> +		return;
> +	ops = dev_iommu_ops(dev);
> +	if (!ops->blocked_domain)
> +		return;

Should it be a WARN_ON()? As proposed, reset_prepare and reset_done must
be paired. So if reset_prepare returns failure, reset_done should not be
called. Or not?

> +
> +	/* group->mutex held in iommu_dev_reset_prepare() continues from here */
> +	WARN_ON(!lockdep_is_held(&group->mutex));

Probably iommu_group_mutex_assert() and move it up?

> +
> +	if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
> +		goto unlock;
> +
> +	/* Shift RID domain back to group->domain */
> +	WARN_ON(__iommu_attach_device(group->domain, dev));
> +
> +	/* Shift PASID domains back to domains retained in pasid_array */
> +	xa_lock(&group->pasid_array);
> +	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
> +		WARN_ON(__iommu_set_group_pasid(
> +			pasid_array_entry_to_domain(entry), group, pasid,
> +			ops->blocked_domain));
> +	xa_unlock(&group->pasid_array);
> +
> +unlock:
> +	mutex_unlock(&group->mutex);
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
> +
>   #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
>   /**
>    * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain

How about combining these two helpers? Something like,

int iommu_dev_block_dma_and_action(struct device *dev,
		int (*action)(struct pci_dev *dev))
{
	prepare();
	action();
	done();
}

?

Thanks,
baolu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 2/2] pci: Suspend ATS before doing FLR
  2025-06-09 18:45 ` [PATCH RFC v1 2/2] pci: Suspend ATS before doing FLR Nicolin Chen
@ 2025-06-10  4:27   ` Baolu Lu
  2025-06-10  6:55     ` Nicolin Chen
  0 siblings, 1 reply; 21+ messages in thread
From: Baolu Lu @ 2025-06-10  4:27 UTC (permalink / raw)
  To: Nicolin Chen, jgg, joro, will, robin.murphy, bhelgaas
  Cc: iommu, linux-kernel, linux-pci, patches, pjaroszynski, vsethi

On 6/10/25 02:45, Nicolin Chen wrote:
> Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> before initiating a Function Level Reset.
> 
> Call iommu_dev_reset_prepare() before FLR and iommu_dev_reset_done() after,
> in the two FLR Functions. This will dock the device at IOMMU_DOMAIN_BLOCKED
> during the FLR function, which should allow the IOMMU driver to pause DMA
> traffic and invode pci_disable_ats() and pci_enable_ats() respectively.
> 
> Add a warning if ATS isn't disabled, in which case IOMMU driver should fix
> itself to disable ATS following the design in iommu_dev_reset_prepare().
> 
> Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
> ---
>   drivers/pci/pci.c | 42 ++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 38 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e9448d55113b..61535435bde1 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -13,6 +13,7 @@
>   #include <linux/delay.h>
>   #include <linux/dmi.h>
>   #include <linux/init.h>
> +#include <linux/iommu.h>
>   #include <linux/msi.h>
>   #include <linux/of.h>
>   #include <linux/pci.h>
> @@ -4518,13 +4519,26 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
>    */
>   int pcie_flr(struct pci_dev *dev)
>   {
> +	int ret = 0;
> +
>   	if (!pci_wait_for_pending_transaction(dev))
>   		pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
>   
> +	/*
> +	 * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> +	 * before initiating a Function Level Reset. So notify the iommu driver
> +	 * that actually enabled ATS. Have to call it after waiting for pending
> +	 * DMA transaction.
> +	 */
> +	if (iommu_dev_reset_prepare(&dev->dev))
> +		pci_err(dev, "failed to stop IOMMU\n");

Need to abort here?

Thanks,
baolu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 2/2] pci: Suspend ATS before doing FLR
  2025-06-10  4:27   ` Baolu Lu
@ 2025-06-10  6:55     ` Nicolin Chen
  0 siblings, 0 replies; 21+ messages in thread
From: Nicolin Chen @ 2025-06-10  6:55 UTC (permalink / raw)
  To: Baolu Lu
  Cc: jgg, joro, will, robin.murphy, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 12:27:04PM +0800, Baolu Lu wrote:
> On 6/10/25 02:45, Nicolin Chen wrote:
> >   int pcie_flr(struct pci_dev *dev)
> >   {
> > +	int ret = 0;
> > +
> >   	if (!pci_wait_for_pending_transaction(dev))
> >   		pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
> > +	/*
> > +	 * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> > +	 * before initiating a Function Level Reset. So notify the iommu driver
> > +	 * that actually enabled ATS. Have to call it after waiting for pending
> > +	 * DMA transaction.
> > +	 */
> > +	if (iommu_dev_reset_prepare(&dev->dev))
> > +		pci_err(dev, "failed to stop IOMMU\n");
> 
> Need to abort here?

Yea. I think it should abort.

Thanks!
Nicolin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-10  4:26   ` Baolu Lu
@ 2025-06-10  7:07     ` Nicolin Chen
  2025-06-10 13:04       ` Jason Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Nicolin Chen @ 2025-06-10  7:07 UTC (permalink / raw)
  To: Baolu Lu
  Cc: jgg, joro, will, robin.murphy, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 12:26:07PM +0800, Baolu Lu wrote:
> On 6/10/25 02:45, Nicolin Chen wrote:
> > +	ops = dev_iommu_ops(dev);
> 
> Should this be protected by group->mutext?

Not seemingly, but should require the iommu_probe_device_lock I
think.

> > +	/*
> > +	 * group->mutex starts
> > +	 *
> > +	 * This has to hold the group mutex until the reset is done, to prevent
> > +	 * any RID or PASID domain attachment/replacement, which otherwise might
> > +	 * re-enable the ATS during the reset cycle.
> > +	 */
> > +	mutex_lock(&group->mutex);
> 
> Is it possible that group has been freed when it reaches here?

It doesn't look very obvious to me which lock we need here. But,
it's a good point that dev->iommu_group is unsafe here. Will dig
a bit later.

> > +void iommu_dev_reset_done(struct device *dev)
> > +{
> > +	struct iommu_group *group = dev->iommu_group;
> > +	const struct iommu_ops *ops;
> > +	unsigned long pasid;
> > +	void *entry;
> > +
> > +	/* Previously unlocked */
> > +	if (!dev_has_iommu(dev))
> > +		return;
> > +	ops = dev_iommu_ops(dev);
> > +	if (!ops->blocked_domain)
> > +		return;
> 
> Should it be a WARN_ON()? As proposed, reset_prepare and reset_done must
> be paired. So if reset_prepare returns failure, reset_done should not be
> called. Or not?

Yea, I agree. Should work like that.

> > +	/* group->mutex held in iommu_dev_reset_prepare() continues from here */
> > +	WARN_ON(!lockdep_is_held(&group->mutex));
> 
> Probably iommu_group_mutex_assert() and move it up?

Yes and not sure (will take another look).

> How about combining these two helpers? Something like,
> 
> int iommu_dev_block_dma_and_action(struct device *dev,
> 		int (*action)(struct pci_dev *dev))
> {
> 	prepare();
> 	action();
> 	done();
> }

That's an interesting idea! So, we wouldn't need to worry about
the pairing.

Thanks!
Nicolin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-10  7:07     ` Nicolin Chen
@ 2025-06-10 13:04       ` Jason Gunthorpe
  2025-06-10 14:40         ` Robin Murphy
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-06-10 13:04 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Baolu Lu, joro, will, robin.murphy, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 12:07:00AM -0700, Nicolin Chen wrote:
> On Tue, Jun 10, 2025 at 12:26:07PM +0800, Baolu Lu wrote:
> > On 6/10/25 02:45, Nicolin Chen wrote:
> > > +	ops = dev_iommu_ops(dev);
> > 
> > Should this be protected by group->mutext?
> 
> Not seemingly, but should require the iommu_probe_device_lock I
> think.

group and ops are not permitted to change while a driver is attached..

IIRC the FLR code in PCI doesn't always ensure that (due to the sysfs
paths), so yeah, this looks troubled. iommu_probe_device_lock perhaps
would fix it.

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-10 13:04       ` Jason Gunthorpe
@ 2025-06-10 14:40         ` Robin Murphy
  2025-06-10 15:36           ` Jason Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Murphy @ 2025-06-10 14:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Baolu Lu, joro, will, bhelgaas, iommu, linux-kernel, linux-pci,
	patches, pjaroszynski, vsethi

On 2025-06-10 2:04 pm, Jason Gunthorpe wrote:
> On Tue, Jun 10, 2025 at 12:07:00AM -0700, Nicolin Chen wrote:
>> On Tue, Jun 10, 2025 at 12:26:07PM +0800, Baolu Lu wrote:
>>> On 6/10/25 02:45, Nicolin Chen wrote:
>>>> +	ops = dev_iommu_ops(dev);
>>>
>>> Should this be protected by group->mutext?
>>
>> Not seemingly, but should require the iommu_probe_device_lock I
>> think.
> 
> group and ops are not permitted to change while a driver is attached..
> 
> IIRC the FLR code in PCI doesn't always ensure that (due to the sysfs
> paths), so yeah, this looks troubled. iommu_probe_device_lock perhaps
> would fix it.

No, iommu_probe_device_lock is for protecting access to dev->iommu in 
the probe path until the device is definitively assigned to a group (or 
not). Fundamentally it defends against multiple sources triggering a 
probe of the same device in parallel - once the device *is* probed it is 
no longer relevant, and the group mutex is the right thing to protect 
all subsequent operations.

Also I'm still working towards getting rid of iommu_probe_device_lock as 
soon as I can because it's horrid... I now have most of a plan for 
making it safe to rely on device_lock() for probe, which should nicely 
solve the dev->driver races as well.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-10 14:40         ` Robin Murphy
@ 2025-06-10 15:36           ` Jason Gunthorpe
  2025-06-10 16:31             ` Robin Murphy
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-06-10 15:36 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, Baolu Lu, joro, will, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 03:40:40PM +0100, Robin Murphy wrote:
> On 2025-06-10 2:04 pm, Jason Gunthorpe wrote:
> > On Tue, Jun 10, 2025 at 12:07:00AM -0700, Nicolin Chen wrote:
> > > On Tue, Jun 10, 2025 at 12:26:07PM +0800, Baolu Lu wrote:
> > > > On 6/10/25 02:45, Nicolin Chen wrote:
> > > > > +	ops = dev_iommu_ops(dev);
> > > > 
> > > > Should this be protected by group->mutext?
> > > 
> > > Not seemingly, but should require the iommu_probe_device_lock I
> > > think.
> > 
> > group and ops are not permitted to change while a driver is attached..
> > 
> > IIRC the FLR code in PCI doesn't always ensure that (due to the sysfs
> > paths), so yeah, this looks troubled. iommu_probe_device_lock perhaps
> > would fix it.
> 
> No, iommu_probe_device_lock is for protecting access to dev->iommu in the
> probe path until the device is definitively assigned to a group (or not).
> Fundamentally it defends against multiple sources triggering a probe of the
> same device in parallel - once the device *is* probed it is no longer
> relevant, and the group mutex is the right thing to protect all subsequent
> operations.

Yes, adding iommu_probe_device_lock to iommu_deinit_device() would be
gross.

but something is required to protect the load of
dev->iommu_group.. dev->iommu_group->mutex can't protect itself
against a race UAF on deinit.

READ_ONCE is good enough to protect from races with the probe path, no
need for iommu_probe_device_lock there.

In this case need to look at the PCI sysfs for races against the
iommu_release_device()/etc that is freeing the dev->iommu_group.

Maybe the sysfs is always removed before we get to release. Or maybe
the PCI FLR sysfs code should hold the device_lock..

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets
  2025-06-09 18:45 [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Nicolin Chen
  2025-06-09 18:45 ` [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
  2025-06-09 18:45 ` [PATCH RFC v1 2/2] pci: Suspend ATS before doing FLR Nicolin Chen
@ 2025-06-10 15:37 ` Robin Murphy
  2025-06-10 16:30   ` Jason Gunthorpe
  2 siblings, 1 reply; 21+ messages in thread
From: Robin Murphy @ 2025-06-10 15:37 UTC (permalink / raw)
  To: Nicolin Chen, jgg, joro, will, bhelgaas
  Cc: iommu, linux-kernel, linux-pci, patches, pjaroszynski, vsethi

On 2025-06-09 7:45 pm, Nicolin Chen wrote:
> Hi all,
> 
> Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> before initiating a Function Level Reset, and then ensure no invalidation
> requests being issued to a device when its ATS capability is disabled.

Not really - what it says is that software should not expect to receive 
invalidate completions from a function which is in the process of being 
reset or powered off, and if software doesn't want to be confused by 
that then it should take care to wait for completion or timeout of all 
outstanding requests, and avoid issuing new requests, before initiating 
such a reset or power transition.

> Both pci_enable_ats() and pci_disable_ats() are called by an IOMMU driver,
> but an unsolicited FLR can happen at any time in the PCI layer. This might
> result in a race between them, breaking the rules given by the PCIe Spec.

Can you clarify which rules? The Implementation Note itself is just an 
example of a possible software policy, and explicitly not normative.

> Therefore, there needs to be a sync between IOMMU and PCI subsystems, to
> ensure that ATS will be disabled and never gets re-enabled until the FLR
> finishes. Add a pair of new IOMMU helpers for PCI reset functions to call
> before and after the reset routines. These two helpers will temporally
> attach the device's RID/PASID to IOMMU_DOMAIN_BLOCKED, which should allow
> its IOMMU driver to pause any DMA traffic and disable ATS feature until
> the FLR is done.

FLR must inherently stop the function from issuing any kind of requests 
anyway (see 6.6.2), so "pausing" traffic at the IOMMU end seems like a 
non-issue. I guess I can see how messing with the domain attachment 
underneath the rest of the group manages to prevent new invalidate 
requests from group->domain being issued to the given function, but it's 
pretty horrid - leaving the mutex blocked might be just about tolerable 
for an FLR that's supposed to take no longer than 100ms, but what if we 
do want to do this for suspend/resume as well?

Thanks,
Robin.

> 
> This is on Github:
> https://github.com/nicolinc/iommufd/commits/iommu_dev_reset-rfcv1
> 
> Thanks
> Nicolin
> 
> Nicolin Chen (2):
>    iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
>    pci: Suspend ATS before doing FLR
> 
>   include/linux/iommu.h |  12 +++++
>   drivers/iommu/iommu.c | 106 ++++++++++++++++++++++++++++++++++++++++++
>   drivers/pci/pci.c     |  42 +++++++++++++++--
>   3 files changed, 156 insertions(+), 4 deletions(-)
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets
  2025-06-10 15:37 ` [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Robin Murphy
@ 2025-06-10 16:30   ` Jason Gunthorpe
  2025-06-10 20:36     ` Nicolin Chen
  2025-06-13 19:27     ` Bjorn Helgaas
  0 siblings, 2 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2025-06-10 16:30 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, joro, will, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 04:37:58PM +0100, Robin Murphy wrote:
> On 2025-06-09 7:45 pm, Nicolin Chen wrote:
> > Hi all,
> > 
> > Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> > before initiating a Function Level Reset, and then ensure no invalidation
> > requests being issued to a device when its ATS capability is disabled.
> 
> Not really - what it says is that software should not expect to receive
> invalidate completions from a function which is in the process of being
> reset or powered off, and if software doesn't want to be confused by that
> then it should take care to wait for completion or timeout of all
> outstanding requests, and avoid issuing new requests, before initiating such
> a reset or power transition.

The commit message can be more precise, but I agree with the
conclusion that the right direction for Linux is to disable and block
ATS, instead of trying to ignore completion time out events, or trying
to block page table mutations. Ie do what the implementation note
says..

Maybe:

PCIe permits a device to ignore ATS invalidation TLPs while it is
processing FLR. This creates a problem visible to the OS where ATS
invalidation commands will time out. For instance a SVA domain will
have no coordination with a FLR event and can racily issue ATC
invalidations into a resetting device.

The OS should do something to mitigate this as we do not want
production systems to be reporting critical ATS failures, especially
in a hypervisor environment. Broadly the OS could arrange to ignore
the timeout, block page table mutations to prevent invalidations, or
disable and block ATS.

The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable
and block ATS, and we already have iommu driver support to implement
something like this. Implement this approach in the iommu core.

Provide a callback from the PCI subsystem that will enclose the FLR
and have the iommu core temporarily change all the domain attachments
into BLOCKED. When attaching a BLOCKED domain IOMMU drivers should
fence any incoming ATS queries, synchronously stop issuing new ATS
invalidations, and synchronously wait for all ATS invalidations to
complete. This will avoid any ATS invaliation time outs.

IOMMU drivers may also disable ATS in PCI config space, but it is not
required to solve the completion timeout problem. The PCI FLR logic
will put all the iommu owned config space bits back before completing.

During this period holding the group mutex will not allow new domains
to be attached to prevent any new ATS invalidations.

> > Therefore, there needs to be a sync between IOMMU and PCI subsystems, to
> > ensure that ATS will be disabled and never gets re-enabled until the FLR
> > finishes. Add a pair of new IOMMU helpers for PCI reset functions to call
> > before and after the reset routines. These two helpers will temporally
> > attach the device's RID/PASID to IOMMU_DOMAIN_BLOCKED, which should allow
> > its IOMMU driver to pause any DMA traffic and disable ATS feature until
> > the FLR is done.
> 
> FLR must inherently stop the function from issuing any kind of requests
> anyway (see 6.6.2), so "pausing" traffic at the IOMMU end seems like a
> non-issue. 

Yeah..  I haven't liked this blocking domain approach too much, but I
was convinced..

It has the strong advantage of being simple to implement and not
requiring any special cases or code in the driver. We don't actually
need the driver to disable ATS in PCIe config space, it only needs to
stop sending invalidations and fail all new ATS requests. All existing
drivers inherently do this already for blocked domains, so this should
solve the problem for everyone.

> I guess I can see how messing with the domain attachment
> underneath the rest of the group manages to prevent new invalidate requests
> from group->domain being issued to the given function, but it's pretty
> horrid - leaving the mutex blocked might be just about tolerable for an FLR
> that's supposed to take no longer than 100ms, but what if we do want to do
> this for suspend/resume as well?

I don't view this a problem for FLR, we can hold a mutex for a long
time. It principally delays domain changes which are kind of nonsense
to be doing concurrently with FLR in the first place..

However, for suspend, we probably want to leave a marker in the group
that the group is force-blocked and all domain attach/detach logic
will only update the group tracking structures and not call into the
iommu driver. When the resume happens the core will set the current
group domain list to the iommu driver. No need for a long lived lock
this way.

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-10 15:36           ` Jason Gunthorpe
@ 2025-06-10 16:31             ` Robin Murphy
  2025-06-10 16:43               ` Jason Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Murphy @ 2025-06-10 16:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, Baolu Lu, joro, will, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On 2025-06-10 4:36 pm, Jason Gunthorpe wrote:
> On Tue, Jun 10, 2025 at 03:40:40PM +0100, Robin Murphy wrote:
>> On 2025-06-10 2:04 pm, Jason Gunthorpe wrote:
>>> On Tue, Jun 10, 2025 at 12:07:00AM -0700, Nicolin Chen wrote:
>>>> On Tue, Jun 10, 2025 at 12:26:07PM +0800, Baolu Lu wrote:
>>>>> On 6/10/25 02:45, Nicolin Chen wrote:
>>>>>> +	ops = dev_iommu_ops(dev);
>>>>>
>>>>> Should this be protected by group->mutext?
>>>>
>>>> Not seemingly, but should require the iommu_probe_device_lock I
>>>> think.
>>>
>>> group and ops are not permitted to change while a driver is attached..
>>>
>>> IIRC the FLR code in PCI doesn't always ensure that (due to the sysfs
>>> paths), so yeah, this looks troubled. iommu_probe_device_lock perhaps
>>> would fix it.
>>
>> No, iommu_probe_device_lock is for protecting access to dev->iommu in the
>> probe path until the device is definitively assigned to a group (or not).
>> Fundamentally it defends against multiple sources triggering a probe of the
>> same device in parallel - once the device *is* probed it is no longer
>> relevant, and the group mutex is the right thing to protect all subsequent
>> operations.
> 
> Yes, adding iommu_probe_device_lock to iommu_deinit_device() would be
> gross.
> 
> but something is required to protect the load of
> dev->iommu_group.. dev->iommu_group->mutex can't protect itself
> against a race UAF on deinit.

Then you do iommu_group_get/put() around it as well. And I suppose 
technically you also ought to check that the device is still actually in 
the group once you have acquired the mutex (although perhaps that case 
ends up crashing anyway once we proceed to attempting to reset the 
already-disappeared device itself...)

> READ_ONCE is good enough to protect from races with the probe path, no
> need for iommu_probe_device_lock there.
> 
> In this case need to look at the PCI sysfs for races against the
> iommu_release_device()/etc that is freeing the dev->iommu_group.
> 
> Maybe the sysfs is always removed before we get to release. Or maybe
> the PCI FLR sysfs code should hold the device_lock..

 From a quick skim I suspect it's probably OK - at least device_del() 
gets to bus_remove_device()->device_remove_groups() well enough before 
the BUS_NOTIFY_REMOVED_DEVICE call that triggers iommu_release_device().

And on an unrelated thought that's just come to mind, if we ever did FLR 
with PASIDs enabled, presumably that's going to wipe out the PASID 
configuration, so will the caller who requested the reset actually 
expect the attachments at the IOMMU end to be preserved, or would they 
assume to start over from scratch? Seems like there's not necessarily 
one right answer there :/

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-10 16:31             ` Robin Murphy
@ 2025-06-10 16:43               ` Jason Gunthorpe
  2025-06-10 20:19                 ` Nicolin Chen
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-06-10 16:43 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, Baolu Lu, joro, will, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 05:31:09PM +0100, Robin Murphy wrote:
> On 2025-06-10 4:36 pm, Jason Gunthorpe wrote:
> > On Tue, Jun 10, 2025 at 03:40:40PM +0100, Robin Murphy wrote:
> > > On 2025-06-10 2:04 pm, Jason Gunthorpe wrote:
> > > > On Tue, Jun 10, 2025 at 12:07:00AM -0700, Nicolin Chen wrote:
> > > > > On Tue, Jun 10, 2025 at 12:26:07PM +0800, Baolu Lu wrote:
> > > > > > On 6/10/25 02:45, Nicolin Chen wrote:
> > > > > > > +	ops = dev_iommu_ops(dev);
> > > > > > 
> > > > > > Should this be protected by group->mutext?
> > > > > 
> > > > > Not seemingly, but should require the iommu_probe_device_lock I
> > > > > think.
> > > > 
> > > > group and ops are not permitted to change while a driver is attached..
> > > > 
> > > > IIRC the FLR code in PCI doesn't always ensure that (due to the sysfs
> > > > paths), so yeah, this looks troubled. iommu_probe_device_lock perhaps
> > > > would fix it.
> > > 
> > > No, iommu_probe_device_lock is for protecting access to dev->iommu in the
> > > probe path until the device is definitively assigned to a group (or not).
> > > Fundamentally it defends against multiple sources triggering a probe of the
> > > same device in parallel - once the device *is* probed it is no longer
> > > relevant, and the group mutex is the right thing to protect all subsequent
> > > operations.
> > 
> > Yes, adding iommu_probe_device_lock to iommu_deinit_device() would be
> > gross.
> > 
> > but something is required to protect the load of
> > dev->iommu_group.. dev->iommu_group->mutex can't protect itself
> > against a race UAF on deinit.
> 
> Then you do iommu_group_get/put() around it as well. 

Same issue - can't use dev->iommu_group.kobj.kref to protect against
UAF. By the time you do a try_get you've already UAF'd the memory
holding the kref. It always needs some other enclosing protection.

> From a quick skim I suspect it's probably OK - at least device_del() gets to
> bus_remove_device()->device_remove_groups() well enough before the
> BUS_NOTIFY_REMOVED_DEVICE call that triggers iommu_release_device().

Make sense, Nicolin, a well placed comment explaining this would be
good
 
> And on an unrelated thought that's just come to mind, if we ever did FLR
> with PASIDs enabled, presumably that's going to wipe out the PASID
> configuration,

I've always been expecting the PCI FLR code to preserve the config
space that belongs the iommu subsystem. PASID, ATS, PRI, etc. Never
looked into it... Nicolin??

Otherwise we need a post-FLR callback to have the iommu driver reload
the right config values for its current config.. That's an existing
nasty bug :)

> so will the caller who requested the reset actually expect
> the attachments at the IOMMU end to be preserved, or would they assume to
> start over from scratch? Seems like there's not necessarily one right answer
> there :/

IMHO we have to preserve everything, and I think we should things back
to working normally overall, if that isn't happening already.

For something like VFIO preserving is desired. For DMA-API that is the
right thing too.

Something like mlx5, that has a robust RAS system, will unregister
itself from the rdma subsystem and that triggers a natural destruction
of the SVA/etc domains that might be there. We want the attachments to
be left undisturbed so there is no issue cleaning them up later.

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-10 16:43               ` Jason Gunthorpe
@ 2025-06-10 20:19                 ` Nicolin Chen
  2025-06-10 23:41                   ` Jason Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Nicolin Chen @ 2025-06-10 20:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Baolu Lu, joro, will, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 01:43:06PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 10, 2025 at 05:31:09PM +0100, Robin Murphy wrote:
> > From a quick skim I suspect it's probably OK - at least device_del() gets to
> > bus_remove_device()->device_remove_groups() well enough before the
> > BUS_NOTIFY_REMOVED_DEVICE call that triggers iommu_release_device().
> 
> Make sense, Nicolin, a well placed comment explaining this would be
> good

Ack.

> > And on an unrelated thought that's just come to mind, if we ever did FLR
> > with PASIDs enabled, presumably that's going to wipe out the PASID
> > configuration,
> 
> I've always been expecting the PCI FLR code to preserve the config
> space that belongs the iommu subsystem. PASID, ATS, PRI, etc. Never
> looked into it... Nicolin??

Those are the flags in struct pci_dev right?
unsigned int    ats_enabled:1;          /* Address Translation Svc */
unsigned int    pasid_enabled:1;        /* Process Address Space ID */
unsigned int    pri_enabled:1;          /* Page Request Interface */

And pci_restore_state() does the recovery of these bits.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets
  2025-06-10 16:30   ` Jason Gunthorpe
@ 2025-06-10 20:36     ` Nicolin Chen
  2025-06-10 23:43       ` Jason Gunthorpe
  2025-06-13 19:27     ` Bjorn Helgaas
  1 sibling, 1 reply; 21+ messages in thread
From: Nicolin Chen @ 2025-06-10 20:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, joro, will, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 01:30:45PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 10, 2025 at 04:37:58PM +0100, Robin Murphy wrote:
> > On 2025-06-09 7:45 pm, Nicolin Chen wrote:
> > > Hi all,
> > > 
> > > Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> > > before initiating a Function Level Reset, and then ensure no invalidation
> > > requests being issued to a device when its ATS capability is disabled.
> > 
> > Not really - what it says is that software should not expect to receive
> > invalidate completions from a function which is in the process of being
> > reset or powered off, and if software doesn't want to be confused by that
> > then it should take care to wait for completion or timeout of all
> > outstanding requests, and avoid issuing new requests, before initiating such
> > a reset or power transition.
> 
> The commit message can be more precise, but I agree with the
> conclusion that the right direction for Linux is to disable and block
> ATS, instead of trying to ignore completion time out events, or trying
> to block page table mutations. Ie do what the implementation note
> says..
> 
> Maybe:
> 
> PCIe permits a device to ignore ATS invalidation TLPs while it is
> processing FLR. This creates a problem visible to the OS where ATS
> invalidation commands will time out. For instance a SVA domain will
> have no coordination with a FLR event and can racily issue ATC
> invalidations into a resetting device.
> 
> The OS should do something to mitigate this as we do not want
> production systems to be reporting critical ATS failures, especially
> in a hypervisor environment. Broadly the OS could arrange to ignore
> the timeout, block page table mutations to prevent invalidations, or
> disable and block ATS.
> 
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable
> and block ATS, and we already have iommu driver support to implement
> something like this. Implement this approach in the iommu core.
> 
> Provide a callback from the PCI subsystem that will enclose the FLR
> and have the iommu core temporarily change all the domain attachments
> into BLOCKED. When attaching a BLOCKED domain IOMMU drivers should
> fence any incoming ATS queries, synchronously stop issuing new ATS
> invalidations, and synchronously wait for all ATS invalidations to
> complete. This will avoid any ATS invaliation time outs.
> 
> IOMMU drivers may also disable ATS in PCI config space, but it is not
> required to solve the completion timeout problem. The PCI FLR logic
> will put all the iommu owned config space bits back before completing.
> 
> During this period holding the group mutex will not allow new domains
> to be attached to prevent any new ATS invalidations.

Will pick this writing.

> > I guess I can see how messing with the domain attachment
> > underneath the rest of the group manages to prevent new invalidate requests
> > from group->domain being issued to the given function, but it's pretty
> > horrid - leaving the mutex blocked might be just about tolerable for an FLR
> > that's supposed to take no longer than 100ms, but what if we do want to do
> > this for suspend/resume as well?
> 
> I don't view this a problem for FLR, we can hold a mutex for a long
> time. It principally delays domain changes which are kind of nonsense
> to be doing concurrently with FLR in the first place..
> 
> However, for suspend, we probably want to leave a marker in the group

IIUIC, the thing for suspend/resume is that it would result in a
long hold of the mutex, which can be a problem?

> that the group is force-blocked and all domain attach/detach logic
> will only update the group tracking structures and not call into the
> iommu driver. When the resume happens the core will set the current
> group domain list to the iommu driver. No need for a long lived lock
> this way.

Yea, what we don't want is driver re-enabling ATS. So, bypassing
it at the core level should work. Then, iommu_dev_reset_prepare
and iommu_dev_reset_done will only mutex the flag.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
  2025-06-10 20:19                 ` Nicolin Chen
@ 2025-06-10 23:41                   ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2025-06-10 23:41 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, Baolu Lu, joro, will, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 01:19:22PM -0700, Nicolin Chen wrote:

> > > And on an unrelated thought that's just come to mind, if we ever did FLR
> > > with PASIDs enabled, presumably that's going to wipe out the PASID
> > > configuration,
> > 
> > I've always been expecting the PCI FLR code to preserve the config
> > space that belongs the iommu subsystem. PASID, ATS, PRI, etc. Never
> > looked into it... Nicolin??
> 
> Those are the flags in struct pci_dev right?
> unsigned int    ats_enabled:1;          /* Address Translation Svc */
> unsigned int    pasid_enabled:1;        /* Process Address Space ID */
> unsigned int    pri_enabled:1;          /* Page Request Interface */
> 
> And pci_restore_state() does the recovery of these bits.

Yes, that all looks like the right stuff to me.

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets
  2025-06-10 20:36     ` Nicolin Chen
@ 2025-06-10 23:43       ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2025-06-10 23:43 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, joro, will, bhelgaas, iommu, linux-kernel,
	linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 01:36:45PM -0700, Nicolin Chen wrote:
> > I don't view this a problem for FLR, we can hold a mutex for a long
> > time. It principally delays domain changes which are kind of nonsense
> > to be doing concurrently with FLR in the first place..
> > 
> > However, for suspend, we probably want to leave a marker in the group
> 
> IIUIC, the thing for suspend/resume is that it would result in a
> long hold of the mutex, which can be a problem?

Yes, it would be a confusing mess to do that..

> > that the group is force-blocked and all domain attach/detach logic
> > will only update the group tracking structures and not call into the
> > iommu driver. When the resume happens the core will set the current
> > group domain list to the iommu driver. No need for a long lived lock
> > this way.
> 
> Yea, what we don't want is driver re-enabling ATS. So, bypassing
> it at the core level should work. Then, iommu_dev_reset_prepare
> and iommu_dev_reset_done will only mutex the flag.

If it is easy it seems worth including.

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets
  2025-06-10 16:30   ` Jason Gunthorpe
  2025-06-10 20:36     ` Nicolin Chen
@ 2025-06-13 19:27     ` Bjorn Helgaas
  2025-06-13 21:10       ` Nicolin Chen
  2025-06-16 13:09       ` Jason Gunthorpe
  1 sibling, 2 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2025-06-13 19:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Nicolin Chen, joro, will, bhelgaas, iommu,
	linux-kernel, linux-pci, patches, pjaroszynski, vsethi

On Tue, Jun 10, 2025 at 01:30:45PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 10, 2025 at 04:37:58PM +0100, Robin Murphy wrote:
> > On 2025-06-09 7:45 pm, Nicolin Chen wrote:
> > > Hi all,
> > > 
> > > Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> > > before initiating a Function Level Reset, and then ensure no invalidation
> > > requests being issued to a device when its ATS capability is disabled.
> > 
> > Not really - what it says is that software should not expect to receive
> > invalidate completions from a function which is in the process of being
> > reset or powered off, and if software doesn't want to be confused by that
> > then it should take care to wait for completion or timeout of all
> > outstanding requests, and avoid issuing new requests, before initiating such
> > a reset or power transition.
> 
> The commit message can be more precise, but I agree with the
> conclusion that the right direction for Linux is to disable and block
> ATS, instead of trying to ignore completion time out events, or trying
> to block page table mutations. Ie do what the implementation note
> says..
> 
> Maybe:
> 
> PCIe permits a device to ignore ATS invalidation TLPs while it is
> processing FLR. This creates a problem visible to the OS where ATS
> invalidation commands will time out. For instance a SVA domain will
> have no coordination with a FLR event and can racily issue ATC
> invalidations into a resetting device.

The sec 10.3.1 implementation note mentions FLR specifically, but it
seems like *any* kind of reset would be vulnerable, e.g., SBR,
external PERST# assert, etc?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets
  2025-06-13 19:27     ` Bjorn Helgaas
@ 2025-06-13 21:10       ` Nicolin Chen
  2025-06-16 13:09       ` Jason Gunthorpe
  1 sibling, 0 replies; 21+ messages in thread
From: Nicolin Chen @ 2025-06-13 21:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jason Gunthorpe, Robin Murphy, joro, will, bhelgaas, iommu,
	linux-kernel, linux-pci, patches, pjaroszynski, vsethi

On Fri, Jun 13, 2025 at 02:27:09PM -0500, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2025 at 01:30:45PM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 10, 2025 at 04:37:58PM +0100, Robin Murphy wrote:
> > > On 2025-06-09 7:45 pm, Nicolin Chen wrote:
> > > > Hi all,
> > > > 
> > > > Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> > > > before initiating a Function Level Reset, and then ensure no invalidation
> > > > requests being issued to a device when its ATS capability is disabled.
> > > 
> > > Not really - what it says is that software should not expect to receive
> > > invalidate completions from a function which is in the process of being
> > > reset or powered off, and if software doesn't want to be confused by that
> > > then it should take care to wait for completion or timeout of all
> > > outstanding requests, and avoid issuing new requests, before initiating such
> > > a reset or power transition.
> > 
> > The commit message can be more precise, but I agree with the
> > conclusion that the right direction for Linux is to disable and block
> > ATS, instead of trying to ignore completion time out events, or trying
> > to block page table mutations. Ie do what the implementation note
> > says..
> > 
> > Maybe:
> > 
> > PCIe permits a device to ignore ATS invalidation TLPs while it is
> > processing FLR. This creates a problem visible to the OS where ATS
> > invalidation commands will time out. For instance a SVA domain will
> > have no coordination with a FLR event and can racily issue ATC
> > invalidations into a resetting device.
> 
> The sec 10.3.1 implementation note mentions FLR specifically, but it
> seems like *any* kind of reset would be vulnerable, e.g., SBR,
> external PERST# assert, etc?

Yes. I forgot to put a question mark in the cover-letter, asking
whether other reset routines would or not need the same trick.

So, let's apply this to all the pci_reset_fn_methods.reset_fns?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets
  2025-06-13 19:27     ` Bjorn Helgaas
  2025-06-13 21:10       ` Nicolin Chen
@ 2025-06-16 13:09       ` Jason Gunthorpe
  1 sibling, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2025-06-16 13:09 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Robin Murphy, Nicolin Chen, joro, will, bhelgaas, iommu,
	linux-kernel, linux-pci, patches, pjaroszynski, vsethi

On Fri, Jun 13, 2025 at 02:27:09PM -0500, Bjorn Helgaas wrote:
> On Tue, Jun 10, 2025 at 01:30:45PM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 10, 2025 at 04:37:58PM +0100, Robin Murphy wrote:
> > > On 2025-06-09 7:45 pm, Nicolin Chen wrote:
> > > > Hi all,
> > > > 
> > > > Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
> > > > before initiating a Function Level Reset, and then ensure no invalidation
> > > > requests being issued to a device when its ATS capability is disabled.
> > > 
> > > Not really - what it says is that software should not expect to receive
> > > invalidate completions from a function which is in the process of being
> > > reset or powered off, and if software doesn't want to be confused by that
> > > then it should take care to wait for completion or timeout of all
> > > outstanding requests, and avoid issuing new requests, before initiating such
> > > a reset or power transition.
> > 
> > The commit message can be more precise, but I agree with the
> > conclusion that the right direction for Linux is to disable and block
> > ATS, instead of trying to ignore completion time out events, or trying
> > to block page table mutations. Ie do what the implementation note
> > says..
> > 
> > Maybe:
> > 
> > PCIe permits a device to ignore ATS invalidation TLPs while it is
> > processing FLR. This creates a problem visible to the OS where ATS
> > invalidation commands will time out. For instance a SVA domain will
> > have no coordination with a FLR event and can racily issue ATC
> > invalidations into a resetting device.
> 
> The sec 10.3.1 implementation note mentions FLR specifically, but it
> seems like *any* kind of reset would be vulnerable, e.g., SBR,
> external PERST# assert, etc?

Yes, there are a bunch of states where a PCI devices is permitted to
ignore ATC invalidation TLPs.. Aside from all the resets power
management is also a problem.

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-06-16 13:09 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-09 18:45 [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Nicolin Chen
2025-06-09 18:45 ` [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-06-10  4:26   ` Baolu Lu
2025-06-10  7:07     ` Nicolin Chen
2025-06-10 13:04       ` Jason Gunthorpe
2025-06-10 14:40         ` Robin Murphy
2025-06-10 15:36           ` Jason Gunthorpe
2025-06-10 16:31             ` Robin Murphy
2025-06-10 16:43               ` Jason Gunthorpe
2025-06-10 20:19                 ` Nicolin Chen
2025-06-10 23:41                   ` Jason Gunthorpe
2025-06-09 18:45 ` [PATCH RFC v1 2/2] pci: Suspend ATS before doing FLR Nicolin Chen
2025-06-10  4:27   ` Baolu Lu
2025-06-10  6:55     ` Nicolin Chen
2025-06-10 15:37 ` [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Robin Murphy
2025-06-10 16:30   ` Jason Gunthorpe
2025-06-10 20:36     ` Nicolin Chen
2025-06-10 23:43       ` Jason Gunthorpe
2025-06-13 19:27     ` Bjorn Helgaas
2025-06-13 21:10       ` Nicolin Chen
2025-06-16 13:09       ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).