[PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
@ 2026-03-17 19:15 Nicolin Chen
  2026-03-17 19:15 ` [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
                   ` (7 more replies)
  0 siblings, 8 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-17 19:15 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, xueshuai, kevin.tian,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

Hi all,

This series addresses a critical vulnerability and stability issue where an
unresponsive PCIe device failing to process ATC (Address Translation Cache)
invalidation requests leads to silent data corruption and continuous SMMU
CMDQ error spam.

Currently, when an ATC invalidation times out, the SMMUv3 driver skips the
CMDQ_ERR_CERROR_ATC_INV_IDX error. This leaves the device's ATS cache state
desynchronized from the SMMU: the device cache may retain stale ATC entries
for memory pages that the OS has already reclaimed and reassigned, creating
a direct vector for data corruption. Furthermore, the driver might continue
issuing ATC_INV commands, resulting in constant CMDQ errors:
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb84): ATC invalidate timeout
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb88): ATC invalidate timeout
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
    ...

To resolve this, introduce a mechanism to quarantine a broken device in the
SMMUv3 driver and the IOMMU core. To achive this, some preparatory changes:
 - Tighten the semantics of pci_dev_reset_iommu_done() that is now strictly
   called only upon a successful hardware reset
 - Introduce a reset_device_done op, allowing the core to signal the driver
   when the physical hardware has been cleanly recovered (e.g., via AER or
   a manual reset) so the quarantine can be lifted
 - Utilize a per-iommu_group WQ via an iommu_report_device_broken() helper
   Note that this implementation only supports single-device iommu_groups.

On the SMMUv3 driver side, introduce the bisection logic to identify which
device caused a batched ATC_INV timeout via an atc_sync_timeouts tracker.
Perform a surgical STE update and flag the ATS as broken to reject further
ATS/ATC requests at the hardware level and suppress further timeout spam.

This is on Github:
https://github.com/nicolinc/iommufd/commits/smmuv3_atc_timeout-v2

Changelog
v2:
 * Rebase on arm_smmu_invs-v13 series [0]
 * Bisect batched atc invalidation commands
 * Drop the direct pci_reset_function() call
 * Move the work queue from SMMUv3 to the core
 * Proceed a surgical STE update to disable EATS
 * Wait for pci_dev_reset_iommu_done() to signal a recovery
v1:
 https://lore.kernel.org/all/cover.1772686998.git.nicolinc@nvidia.com/

[0] https://lore.kernel.org/all/cover.1773733797.git.nicolinc@nvidia.com/

Thanks
Nicolin

Nicolin Chen (7):
  iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  iommu: Add reset_device_done callback for hardware fault recovery
  iommu: Add iommu_report_device_broken() to quarantine a broken device
  iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv
  iommu/arm-smmu-v3: Introduce master->ats_broken flag
  iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   4 +-
 include/linux/iommu.h                         |   4 +
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c  |  34 ++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 165 ++++++++++++++++--
 drivers/iommu/iommu.c                         |  87 ++++++++-
 drivers/pci/pci-acpi.c                        |  11 +-
 drivers/pci/pci.c                             |  50 +++++-
 drivers/pci/quirks.c                          |  11 +-
 8 files changed, 322 insertions(+), 44 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
@ 2026-03-17 19:15 ` Nicolin Chen
  2026-03-18  7:21   ` Tian, Kevin
  2026-03-18  8:02   ` Shuai Xue
  2026-03-17 19:15 ` [PATCH v2 2/7] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-17 19:15 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, xueshuai, kevin.tian,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

IOMMU drivers handle ATC cache maintenance. They may encounter ATC-related
errors (e.g., ATC invalidation request timeout), indicating that ATC cache
might have stale entries that can corrupt the memory. In this case, IOMMU
driver has no choice but to block the device's ATS function and wait for a
device recovery.

The pci_dev_reset_iommu_done() called at the end of a reset function could
serve as a reliable signal to the IOMMU subsystem that the physical device
cache is completely clean. However, the function is called unconditionally
even if the reset operation had actually failed, which would re-attach the
faulty device back to a normal translation domain. And this will leave the
system highly exposed, creating vulnerabilities for data corruption:
    IOMMU blocks RID/ATS
    pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Block RID/ATS
        __reset(); // Failed (ATC is still stale)
        pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)

The simplest fix is to use pci_dev_reset_iommu_done() only on a successful
reset:
    IOMMU blocks RID/ATS
    pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Block RID/ATS
        if (!__reset())
            pci_dev_reset_iommu_done(); // Unblock RID/ATS
        else
            // Keep the device blocked by IOMMU

However, this breaks the symmetric requirement of these reset APIs so that
we have to allow a re-entry to pass a second reset attempt:
    IOMMU blocks RID/ATS
    pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Block RID/ATS
        __reset(); // Failed (ATC is still stale)
        // Keep the device blocked by IOMMU
    Second reset:
        pci_reset_function():
            pci_dev_reset_iommu_prepare(); // Re-entry (!)

Update the function kdocs and all the existing callers to only unblock ATS
when the reset succeeds. Drop the WARN_ON in pci_dev_reset_iommu_prepare()
to allow re-entries.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c  | 16 +++++++++-----
 drivers/pci/pci-acpi.c | 11 +++++++++-
 drivers/pci/pci.c      | 50 +++++++++++++++++++++++++++++++++++++-----
 drivers/pci/quirks.c   | 11 +++++++++-
 4 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db517809540..40a15c9360bd1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3938,8 +3938,10 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
  * IOMMU activity while leaving the group->domain pointer intact. Later when the
  * reset is finished, pci_dev_reset_iommu_done() can restore everything.
  *
- * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to unset the resetting_domain.
+ * Caller must use pci_dev_reset_iommu_done() after a successful PCI-level reset
+ * to unset the resetting_domain. If the reset fails, caller can choose to keep
+ * the device in the resetting_domain to protect system memory using IOMMU from
+ * any bad ATS.
  *
  * Return: 0 on success or negative error code if the preparation failed.
  *
@@ -3961,9 +3963,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 
 	guard(mutex)(&group->mutex);
 
-	/* Re-entry is not allowed */
-	if (WARN_ON(group->resetting_domain))
-		return -EBUSY;
+	/* Already prepared */
+	if (group->resetting_domain)
+		return 0;
 
 	ret = __iommu_group_alloc_blocking_domain(group);
 	if (ret)
@@ -4001,7 +4003,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
  * re-attaching all RID/PASID of the device's back to the domains retained in
  * the core-level structure.
  *
- * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller should
+ * use it on a successful PCI-level reset. Otherwise, it's suggested for caller
+ * to keep the device in the resetting_domain to protect system memory.
  *
  * Note that, although unlikely, there is a risk that re-attaching domains might
  * fail due to some unexpected happening like OOM.
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 4d0f2cb6c695b..f1a918938242c 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -16,6 +16,7 @@
 #include <linux/pci_hotplug.h>
 #include <linux/module.h>
 #include <linux/pci-acpi.h>
+#include <linux/pci-ats.h>
 #include <linux/pci-ecam.h>
 #include <linux/pm_runtime.h>
 #include <linux/pm_qos.h>
@@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
 		ret = -ENOTTY;
 	}
 
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f1..80c5cf6eeebdc 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4358,7 +4358,15 @@ int pcie_flr(struct pci_dev *dev)
 
 	ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4436,7 +4444,15 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
 
 	ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
@@ -4490,7 +4506,15 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
 	pci_dev_d3_sleep(dev);
 
 	ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
@@ -4933,7 +4957,15 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
 
 	rc = pci_parent_bus_reset(dev, probe);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!rc || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return rc;
 }
 
@@ -4978,7 +5010,15 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
 		pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
 				      reg);
 
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!rc || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return rc;
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 48946cca4be72..d9a03a7772916 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -30,6 +30,7 @@
 #include <linux/ktime.h>
 #include <linux/mm.h>
 #include <linux/nvme.h>
+#include <linux/pci-ats.h>
 #include <linux/platform_data/x86/apple.h>
 #include <linux/pm_runtime.h>
 #include <linux/sizes.h>
@@ -4269,7 +4270,15 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
 	}
 
 	ret = i->reset(dev, probe);
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  2026-03-17 19:15 ` [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
@ 2026-03-18  7:21   ` Tian, Kevin
  2026-03-18 20:16     ` Nicolin Chen
  2026-03-18  8:02   ` Shuai Xue
  1 sibling, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-18  7:21 UTC (permalink / raw)
  To: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, jgg@nvidia.com
  Cc: rafael@kernel.org, lenb@kernel.org, praan@google.com,
	baolu.lu@linux.intel.com, xueshuai@linux.alibaba.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-pci@vger.kernel.org, Vikram Sethi

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, March 18, 2026 3:16 AM
> 
> @@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool
> probe)
>  		ret = -ENOTTY;
>  	}
> 
> -	pci_dev_reset_iommu_done(dev);
> +	/*
> +	 * The reset might be invoked to recover a serious error. E.g. when
> the
> +	 * ATC failed to invalidate its stale entries, which can result in data
> +	 * corruption. Thus, do not unblock ATS until a successful reset.
> +	 */
> +	if (!ret || !pci_ats_supported(dev))
> +		pci_dev_reset_iommu_done(dev);
> +	else
> +		pci_warn(dev, "Reset failed. Blocking ATS to protect
> memory\n");
>  	return ret;

let's pass the reset status to pci_dev_reset_iommu_done() then
put above detail inside. 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  2026-03-18  7:21   ` Tian, Kevin
@ 2026-03-18 20:16     ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-18 20:16 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Wed, Mar 18, 2026 at 07:21:53AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, March 18, 2026 3:16 AM
> > 
> > @@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool
> > probe)
> >  		ret = -ENOTTY;
> >  	}
> > 
> > -	pci_dev_reset_iommu_done(dev);
> > +	/*
> > +	 * The reset might be invoked to recover a serious error. E.g. when
> > the
> > +	 * ATC failed to invalidate its stale entries, which can result in data
> > +	 * corruption. Thus, do not unblock ATS until a successful reset.
> > +	 */
> > +	if (!ret || !pci_ats_supported(dev))
> > +		pci_dev_reset_iommu_done(dev);
> > +	else
> > +		pci_warn(dev, "Reset failed. Blocking ATS to protect
> > memory\n");
> >  	return ret;
> 
> let's pass the reset status to pci_dev_reset_iommu_done() then
> put above detail inside. 

Yea, that's cleaner. I will make a change:
@@ -4014,7 +4014,7 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
  * Note that, although unlikely, there is a risk that re-attaching domains might
  * fail due to some unexpected happening like OOM.
  */
-void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
 {
        struct iommu_group *group = pdev->dev.iommu_group;
        const struct iommu_ops *ops;
@@ -4025,6 +4025,11 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
                return;
        ops = dev_iommu_ops(&pdev->dev);

+       if (!reset_succeeds) {
+               pci_err(pdev, "Reset failed. Blocking ATS to protect memory\n");
+               return;
+       }
+
        guard(mutex)(&group->mutex);

        /* pci_dev_reset_iommu_prepare() was bypassed for the device */


Thanks
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  2026-03-17 19:15 ` [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
  2026-03-18  7:21   ` Tian, Kevin
@ 2026-03-18  8:02   ` Shuai Xue
  2026-03-18 20:27     ` Nicolin Chen
  1 sibling, 1 reply; 47+ messages in thread
From: Shuai Xue @ 2026-03-18  8:02 UTC (permalink / raw)
  To: Nicolin Chen, will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, kevin.tian, linux-arm-kernel,
	iommu, linux-kernel, linux-acpi, linux-pci, vsethi



On 3/18/26 3:15 AM, Nicolin Chen wrote:
> IOMMU drivers handle ATC cache maintenance. They may encounter ATC-related
> errors (e.g., ATC invalidation request timeout), indicating that ATC cache
> might have stale entries that can corrupt the memory. In this case, IOMMU
> driver has no choice but to block the device's ATS function and wait for a
> device recovery.
> 
> The pci_dev_reset_iommu_done() called at the end of a reset function could
> serve as a reliable signal to the IOMMU subsystem that the physical device
> cache is completely clean. However, the function is called unconditionally
> even if the reset operation had actually failed, which would re-attach the
> faulty device back to a normal translation domain. And this will leave the
> system highly exposed, creating vulnerabilities for data corruption:
>      IOMMU blocks RID/ATS
>      pci_reset_function():
>          pci_dev_reset_iommu_prepare(); // Block RID/ATS
>          __reset(); // Failed (ATC is still stale)
>          pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)
> 
> The simplest fix is to use pci_dev_reset_iommu_done() only on a successful
> reset:
>      IOMMU blocks RID/ATS
>      pci_reset_function():
>          pci_dev_reset_iommu_prepare(); // Block RID/ATS
>          if (!__reset())
>              pci_dev_reset_iommu_done(); // Unblock RID/ATS
>          else
>              // Keep the device blocked by IOMMU
> 
> However, this breaks the symmetric requirement of these reset APIs so that
> we have to allow a re-entry to pass a second reset attempt:
>      IOMMU blocks RID/ATS
>      pci_reset_function():
>          pci_dev_reset_iommu_prepare(); // Block RID/ATS
>          __reset(); // Failed (ATC is still stale)
>          // Keep the device blocked by IOMMU
>      Second reset:
>          pci_reset_function():
>              pci_dev_reset_iommu_prepare(); // Re-entry (!)
> 
> Update the function kdocs and all the existing callers to only unblock ATS
> when the reset succeeds. Drop the WARN_ON in pci_dev_reset_iommu_prepare()
> to allow re-entries.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/iommu.c  | 16 +++++++++-----
>   drivers/pci/pci-acpi.c | 11 +++++++++-
>   drivers/pci/pci.c      | 50 +++++++++++++++++++++++++++++++++++++-----
>   drivers/pci/quirks.c   | 11 +++++++++-
>   4 files changed, 75 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 35db517809540..40a15c9360bd1 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -3938,8 +3938,10 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
>    * IOMMU activity while leaving the group->domain pointer intact. Later when the
>    * reset is finished, pci_dev_reset_iommu_done() can restore everything.
>    *
> - * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
> - * before/after the core-level reset routine, to unset the resetting_domain.
> + * Caller must use pci_dev_reset_iommu_done() after a successful PCI-level reset
> + * to unset the resetting_domain. If the reset fails, caller can choose to keep
> + * the device in the resetting_domain to protect system memory using IOMMU from
> + * any bad ATS.

I think you mean ...to protect system memory from DMA via stale ATC entries.

>    *
>    * Return: 0 on success or negative error code if the preparation failed.
>    *
> @@ -3961,9 +3963,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
>   
>   	guard(mutex)(&group->mutex);
>   
> -	/* Re-entry is not allowed */
> -	if (WARN_ON(group->resetting_domain))
> -		return -EBUSY;
> +	/* Already prepared */
> +	if (group->resetting_domain)
> +		return 0;

Could you elaborate more on why Re-entry is allowed now? For example:

/*
  * Allow re-entry: if a previous reset failed, the device remains in
  * resetting_domain. A subsequent reset attempt must be able to call
  * prepare() again without failing.
  */

>   
>   	ret = __iommu_group_alloc_blocking_domain(group);
>   	if (ret)
> @@ -4001,7 +4003,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
>    * re-attaching all RID/PASID of the device's back to the domains retained in
>    * the core-level structure.
>    *
> - * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
> + * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller should
> + * use it on a successful PCI-level reset. Otherwise, it's suggested for caller
> + * to keep the device in the resetting_domain to protect system memory.
>    *
>    * Note that, although unlikely, there is a risk that re-attaching domains might
>    * fail due to some unexpected happening like OOM.
> diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
> index 4d0f2cb6c695b..f1a918938242c 100644
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -16,6 +16,7 @@
>   #include <linux/pci_hotplug.h>
>   #include <linux/module.h>
>   #include <linux/pci-acpi.h>
> +#include <linux/pci-ats.h>
>   #include <linux/pci-ecam.h>
>   #include <linux/pm_runtime.h>
>   #include <linux/pm_qos.h>
> @@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
>   		ret = -ENOTTY;
>   	}
>   
> -	pci_dev_reset_iommu_done(dev);
> +	/*
> +	 * The reset might be invoked to recover a serious error. E.g. when the
> +	 * ATC failed to invalidate its stale entries, which can result in data
> +	 * corruption. Thus, do not unblock ATS until a successful reset.
> +	 */
> +	if (!ret || !pci_ats_supported(dev))
> +		pci_dev_reset_iommu_done(dev);

pci_dev_reset_iommu_prepare() already does an early return for non-ATS
devices, and pci_dev_reset_iommu_done() also early-returns
when !pci_ats_supported(). So for non-ATS devices, done()
is always a no-op regardless of whether it's called or not.

Do we really need check pci_ats_supported(dev) here?

> +	else
> +		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
>   	return ret;
>   }
>   
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8479c2e1f74f1..80c5cf6eeebdc 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4358,7 +4358,15 @@ int pcie_flr(struct pci_dev *dev)
>   
>   	ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
>   done:
> -	pci_dev_reset_iommu_done(dev);
> +	/*
> +	 * The reset might be invoked to recover a serious error. E.g. when the
> +	 * ATC failed to invalidate its stale entries, which can result in data
> +	 * corruption. Thus, do not unblock ATS until a successful reset.
> +	 */
> +	if (!ret || !pci_ats_supported(dev))
> +		pci_dev_reset_iommu_done(dev);
> +	else
> +		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
>   	return ret;
>   }
>   EXPORT_SYMBOL_GPL(pcie_flr);
> @@ -4436,7 +4444,15 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
>   
>   	ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
>   done:
> -	pci_dev_reset_iommu_done(dev);
> +	/*
> +	 * The reset might be invoked to recover a serious error. E.g. when the
> +	 * ATC failed to invalidate its stale entries, which can result in data
> +	 * corruption. Thus, do not unblock ATS until a successful reset.
> +	 */
> +	if (!ret || !pci_ats_supported(dev))
> +		pci_dev_reset_iommu_done(dev);
> +	else
> +		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
>   	return ret;
>   }
>   
> @@ -4490,7 +4506,15 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
>   	pci_dev_d3_sleep(dev);
>   
>   	ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
> -	pci_dev_reset_iommu_done(dev);
> +	/*
> +	 * The reset might be invoked to recover a serious error. E.g. when the
> +	 * ATC failed to invalidate its stale entries, which can result in data
> +	 * corruption. Thus, do not unblock ATS until a successful reset.
> +	 */
> +	if (!ret || !pci_ats_supported(dev))
> +		pci_dev_reset_iommu_done(dev);
> +	else
> +		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
>   	return ret;
>   }
>   
> @@ -4933,7 +4957,15 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
>   
>   	rc = pci_parent_bus_reset(dev, probe);
>   done:
> -	pci_dev_reset_iommu_done(dev);
> +	/*
> +	 * The reset might be invoked to recover a serious error. E.g. when the
> +	 * ATC failed to invalidate its stale entries, which can result in data
> +	 * corruption. Thus, do not unblock ATS until a successful reset.
> +	 */
> +	if (!rc || !pci_ats_supported(dev))
> +		pci_dev_reset_iommu_done(dev);
> +	else
> +		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
>   	return rc;
>   }
>   
> @@ -4978,7 +5010,15 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
>   		pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
>   				      reg);
>   
> -	pci_dev_reset_iommu_done(dev);
> +	/*
> +	 * The reset might be invoked to recover a serious error. E.g. when the
> +	 * ATC failed to invalidate its stale entries, which can result in data
> +	 * corruption. Thus, do not unblock ATS until a successful reset.
> +	 */
> +	if (!rc || !pci_ats_supported(dev))
> +		pci_dev_reset_iommu_done(dev);
> +	else
> +		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
>   	return rc;
>   }

cxl_reset_bus_function() calls pci_reset_bus_function() which already
contains the conditional pci_dev_reset_iommu_done() logic. Then the
outer function calls done() again. The second call is a no-op because
resetting_domain has already been cleared, but this is confusing.

Additionally, cxl_reset_bus_function() does its own
pci_dev_reset_iommu_prepare(), while pci_reset_bus_function() also
calls prepare() internally. With the re-entry change in this patch,
the nested prepare() silently succeeds.

Is it a expected behavior?


Thanks.
Shuai


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  2026-03-18  8:02   ` Shuai Xue
@ 2026-03-18 20:27     ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-18 20:27 UTC (permalink / raw)
  To: Shuai Xue
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	baolu.lu, kevin.tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Wed, Mar 18, 2026 at 04:02:01PM +0800, Shuai Xue wrote:
> On 3/18/26 3:15 AM, Nicolin Chen wrote:
> > - * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
> > - * before/after the core-level reset routine, to unset the resetting_domain.
> > + * Caller must use pci_dev_reset_iommu_done() after a successful PCI-level reset
> > + * to unset the resetting_domain. If the reset fails, caller can choose to keep
> > + * the device in the resetting_domain to protect system memory using IOMMU from
> > + * any bad ATS.
> 
> I think you mean ...to protect system memory from DMA via stale ATC entries.

"the DMA via stale ATC entries" sounds too specific to this issue.

I intended to generalize the message in the kdocs. Maybe should've
drop "from any bad ATS" as well?

> >    *
> >    * Return: 0 on success or negative error code if the preparation failed.
> >    *
> > @@ -3961,9 +3963,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
> >   	guard(mutex)(&group->mutex);
> > -	/* Re-entry is not allowed */
> > -	if (WARN_ON(group->resetting_domain))
> > -		return -EBUSY;
> > +	/* Already prepared */
> > +	if (group->resetting_domain)
> > +		return 0;
> 
> Could you elaborate more on why Re-entry is allowed now? For example:
> 
> /*
>  * Allow re-entry: if a previous reset failed, the device remains in
>  * resetting_domain. A subsequent reset attempt must be able to call
>  * prepare() again without failing.
>  */

Sure. I will add this.

> > @@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> >   		ret = -ENOTTY;
> >   	}
> > -	pci_dev_reset_iommu_done(dev);
> > +	/*
> > +	 * The reset might be invoked to recover a serious error. E.g. when the
> > +	 * ATC failed to invalidate its stale entries, which can result in data
> > +	 * corruption. Thus, do not unblock ATS until a successful reset.
> > +	 */
> > +	if (!ret || !pci_ats_supported(dev))
> > +		pci_dev_reset_iommu_done(dev);
> 
> pci_dev_reset_iommu_prepare() already does an early return for non-ATS
> devices, and pci_dev_reset_iommu_done() also early-returns
> when !pci_ats_supported(). So for non-ATS devices, done()
> is always a no-op regardless of whether it's called or not.
> 
> Do we really need check pci_ats_supported(dev) here?

You are right. It was for a precaution as pci_dev_reset_iommu_done()
isn't maintained in the pci.c file.

Kevin suggested to move this part into pci_dev_reset_iommu_done() so
these conditions can be merged. And pci callers will be simply:
	pci_dev_reset_iommu_done(dev, !ret);

> > @@ -4978,7 +5010,15 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
> >   		pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
> >   				      reg);
> > -	pci_dev_reset_iommu_done(dev);
> > +	/*
> > +	 * The reset might be invoked to recover a serious error. E.g. when the
> > +	 * ATC failed to invalidate its stale entries, which can result in data
> > +	 * corruption. Thus, do not unblock ATS until a successful reset.
> > +	 */
> > +	if (!rc || !pci_ats_supported(dev))
> > +		pci_dev_reset_iommu_done(dev);
> > +	else
> > +		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
> >   	return rc;
> >   }
> 
> cxl_reset_bus_function() calls pci_reset_bus_function() which already
> contains the conditional pci_dev_reset_iommu_done() logic. Then the
> outer function calls done() again. The second call is a no-op because
> resetting_domain has already been cleared, but this is confusing.
> 
> Additionally, cxl_reset_bus_function() does its own
> pci_dev_reset_iommu_prepare(), while pci_reset_bus_function() also
> calls prepare() internally. With the re-entry change in this patch,
> the nested prepare() silently succeeds.
> 
> Is it a expected behavior?

No.. This is a bug prior to this patch. The nested prepare/done()
is wrong. I will submit a patch fixing this.

Thanks!
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 2/7] iommu: Add reset_device_done callback for hardware fault recovery
  2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
  2026-03-17 19:15 ` [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
@ 2026-03-17 19:15 ` Nicolin Chen
  2026-03-18  5:59   ` Baolu Lu
  2026-03-17 19:15 ` [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-17 19:15 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, xueshuai, kevin.tian,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

When an IOMMU hardware detects an error due to a faulty device (e.g. an ATS
invalidation timeout), IOMMU drivers may quarantine the device by disabling
specific hardware features or dropping translation capabilities.

To recover from these states, the IOMMU driver needs a reliable signal that
the underlying physical hardware has been cleanly reset (e.g., via PCIe AER
or a sysfs Function Level Reset) so as to lift the quarantine.

Introduce a reset_device_done callback in struct iommu_ops. Trigger it from
the existing pci_dev_reset_iommu_done() path to notify the underlying IOMMU
driver that the device's internal state has been sanitized.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommu.h |  2 ++
 drivers/iommu/iommu.c | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 54b8b48c762e8..9ba12b2164724 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -626,6 +626,7 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
  *                  group and attached to the groups domain
+ * @reset_device_done: Notify the driver about the completion of a device reset
  * @device_group: find iommu group for a particular device
  * @get_resv_regions: Request list of reserved regions for a device
  * @of_xlate: add OF master IDs to iommu grouping
@@ -683,6 +684,7 @@ struct iommu_ops {
 	struct iommu_device *(*probe_device)(struct device *dev);
 	void (*release_device)(struct device *dev);
 	void (*probe_finalize)(struct device *dev);
+	void (*reset_device_done)(struct device *dev);
 	struct iommu_group *(*device_group)(struct device *dev);
 
 	/* Request/Free a list of reserved regions for a device */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 40a15c9360bd1..fcd2902d9e8db 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -4013,11 +4013,13 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
 void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 {
 	struct iommu_group *group = pdev->dev.iommu_group;
+	const struct iommu_ops *ops;
 	unsigned long pasid;
 	void *entry;
 
 	if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
 		return;
+	ops = dev_iommu_ops(&pdev->dev);
 
 	guard(mutex)(&group->mutex);
 
@@ -4029,6 +4031,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 	if (WARN_ON(!group->blocking_domain))
 		return;
 
+	/*
+	 * A PCI device might have been in an error state, so the IOMMU driver
+	 * had to quarantine the device by disabling specific hardware feature
+	 * or dropping translation capability. Here notify the IOMMU driver as
+	 * a reliable signal that the faulty PCI device has been cleanly reset
+	 * so now it can lift its quarantine and restore full functionality.
+	 */
+	if (ops && ops->reset_device_done)
+		ops->reset_device_done(&pdev->dev);
+
 	/* Re-attach RID domain back to group->domain */
 	if (group->domain != group->blocking_domain) {
 		WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 2/7] iommu: Add reset_device_done callback for hardware fault recovery
  2026-03-17 19:15 ` [PATCH v2 2/7] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
@ 2026-03-18  5:59   ` Baolu Lu
  2026-03-18 18:42     ` Nicolin Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Baolu Lu @ 2026-03-18  5:59 UTC (permalink / raw)
  To: Nicolin Chen, will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, xueshuai, kevin.tian, linux-arm-kernel,
	iommu, linux-kernel, linux-acpi, linux-pci, vsethi

On 3/18/26 03:15, Nicolin Chen wrote:
> When an IOMMU hardware detects an error due to a faulty device (e.g. an ATS
> invalidation timeout), IOMMU drivers may quarantine the device by disabling
> specific hardware features or dropping translation capabilities.
> 
> To recover from these states, the IOMMU driver needs a reliable signal that
> the underlying physical hardware has been cleanly reset (e.g., via PCIe AER
> or a sysfs Function Level Reset) so as to lift the quarantine.
> 
> Introduce a reset_device_done callback in struct iommu_ops. Trigger it from
> the existing pci_dev_reset_iommu_done() path to notify the underlying IOMMU
> driver that the device's internal state has been sanitized.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   include/linux/iommu.h |  2 ++
>   drivers/iommu/iommu.c | 12 ++++++++++++
>   2 files changed, 14 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 54b8b48c762e8..9ba12b2164724 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -626,6 +626,7 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
>    * @release_device: Remove device from iommu driver handling
>    * @probe_finalize: Do final setup work after the device is added to an IOMMU
>    *                  group and attached to the groups domain
> + * @reset_device_done: Notify the driver about the completion of a device reset
>    * @device_group: find iommu group for a particular device
>    * @get_resv_regions: Request list of reserved regions for a device
>    * @of_xlate: add OF master IDs to iommu grouping
> @@ -683,6 +684,7 @@ struct iommu_ops {
>   	struct iommu_device *(*probe_device)(struct device *dev);
>   	void (*release_device)(struct device *dev);
>   	void (*probe_finalize)(struct device *dev);
> +	void (*reset_device_done)(struct device *dev);
>   	struct iommu_group *(*device_group)(struct device *dev);
>   
>   	/* Request/Free a list of reserved regions for a device */
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 40a15c9360bd1..fcd2902d9e8db 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -4013,11 +4013,13 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
>   void pci_dev_reset_iommu_done(struct pci_dev *pdev)
>   {
>   	struct iommu_group *group = pdev->dev.iommu_group;
> +	const struct iommu_ops *ops;
>   	unsigned long pasid;
>   	void *entry;
>   
>   	if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
>   		return;
> +	ops = dev_iommu_ops(&pdev->dev);
>   
>   	guard(mutex)(&group->mutex);
>   
> @@ -4029,6 +4031,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
>   	if (WARN_ON(!group->blocking_domain))
>   		return;
>   
> +	/*
> +	 * A PCI device might have been in an error state, so the IOMMU driver
> +	 * had to quarantine the device by disabling specific hardware feature
> +	 * or dropping translation capability. Here notify the IOMMU driver as
> +	 * a reliable signal that the faulty PCI device has been cleanly reset
> +	 * so now it can lift its quarantine and restore full functionality.
> +	 */
> +	if (ops && ops->reset_device_done)
> +		ops->reset_device_done(&pdev->dev);

Nit: dev_iommu_ops() ensures a valid iommu "ops". There is no need to
check "ops != NULL" here. Just

	if (ops->reset_device_done)
		ops->reset_device_done(&pdev->dev);

> +
>   	/* Re-attach RID domain back to group->domain */
>   	if (group->domain != group->blocking_domain) {
>   		WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,

Thanks,
baolu

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 2/7] iommu: Add reset_device_done callback for hardware fault recovery
  2026-03-18  5:59   ` Baolu Lu
@ 2026-03-18 18:42     ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-18 18:42 UTC (permalink / raw)
  To: Baolu Lu
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	xueshuai, kevin.tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Wed, Mar 18, 2026 at 01:59:58PM +0800, Baolu Lu wrote:
> On 3/18/26 03:15, Nicolin Chen wrote:
> > +	if (ops && ops->reset_device_done)
> > +		ops->reset_device_done(&pdev->dev);
> 
> Nit: dev_iommu_ops() ensures a valid iommu "ops". There is no need to
> check "ops != NULL" here. Just
> 
> 	if (ops->reset_device_done)
> 		ops->reset_device_done(&pdev->dev);

Yes, right.

Thanks!
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
  2026-03-17 19:15 ` [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
  2026-03-17 19:15 ` [PATCH v2 2/7] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
@ 2026-03-17 19:15 ` Nicolin Chen
  2026-03-18  6:13   ` Baolu Lu
                     ` (2 more replies)
  2026-03-17 19:15 ` [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-17 19:15 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, xueshuai, kevin.tian,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

When an IOMMU hardware detects an error due to a faulty device (e.g. an ATS
invalidation timeout), IOMMU drivers may quarantine the device by disabling
specific hardware features or dropping translation capabilities.

However, the core-level states of the faulty device are out of sync, as the
device can be still attached to a translation domain or even potentially be
moved to a new domain that might overwrite the driver-level quarantine.

Given that such an error can be likely an ISR, introduce a broken_work per
iommu_group, and add a helper function to allow driver to report the broken
device, so as to completely quarantine it in the core.

Use the existing pci_dev_reset_iommu_prepare() function to shift the device
to its resetting_domain/blocking_domain. A later pci_dev_reset_iommu_done()
call will clear it and move it out of the quarantine.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommu.h |  2 ++
 drivers/iommu/iommu.c | 59 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9ba12b2164724..9b5f94e566ff9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -891,6 +891,8 @@ static inline struct iommu_device *__iommu_get_iommu_dev(struct device *dev)
 #define iommu_get_iommu_dev(dev, type, member) \
 	container_of(__iommu_get_iommu_dev(dev), type, member)
 
+void iommu_report_device_broken(struct device *dev);
+
 static inline void iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
 {
 	*gather = (struct iommu_iotlb_gather) {
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fcd2902d9e8db..2f297f689a3a3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -55,6 +55,8 @@ struct iommu_group {
 	struct list_head devices;
 	struct xarray pasid_array;
 	struct mutex mutex;
+	struct work_struct broken_work;
+	bool requires_reset;
 	void *iommu_data;
 	void (*iommu_data_release)(void *iommu_data);
 	char *name;
@@ -146,6 +148,7 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
 						     struct device *dev);
 static void __iommu_group_free_device(struct iommu_group *group,
 				      struct group_device *grp_dev);
+static void iommu_group_broken_worker(struct work_struct *work);
 static void iommu_domain_init(struct iommu_domain *domain, unsigned int type,
 			      const struct iommu_ops *ops);
 
@@ -1057,6 +1060,7 @@ struct iommu_group *iommu_group_alloc(void)
 	if (!group)
 		return ERR_PTR(-ENOMEM);
 
+	INIT_WORK(&group->broken_work, iommu_group_broken_worker);
 	group->kobj.kset = iommu_group_kset;
 	mutex_init(&group->mutex);
 	INIT_LIST_HEAD(&group->devices);
@@ -4031,6 +4035,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 	if (WARN_ON(!group->blocking_domain))
 		return;
 
+	WRITE_ONCE(group->requires_reset, false);
 	/*
 	 * A PCI device might have been in an error state, so the IOMMU driver
 	 * had to quarantine the device by disabling specific hardware feature
@@ -4062,6 +4067,60 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
 
+static void iommu_group_broken_worker(struct work_struct *work)
+{
+	struct iommu_group *group =
+		container_of(work, struct iommu_group, broken_work);
+	struct pci_dev *pdev = NULL;
+	struct device *dev;
+
+	scoped_guard(mutex, &group->mutex) {
+		/* Do not block the device again if it has been recovered */
+		if (!READ_ONCE(group->requires_reset))
+			goto out_put;
+		if (list_is_singular(&group->devices)) {
+			/* Note: only support group with a single device */
+			dev = iommu_group_first_dev(group);
+			if (dev_is_pci(dev)) {
+				pdev = to_pci_dev(dev);
+				pci_dev_get(pdev);
+			}
+		}
+	}
+
+	if (pdev) {
+		/*
+		 * Quarantine the device completely. This will be cleared upon
+		 * a pci_dev_reset_iommu_done() call indicating the recovery.
+		 */
+		pci_dev_lock(pdev);
+		pci_dev_reset_iommu_prepare(pdev);
+		pci_dev_unlock(pdev);
+		pci_dev_put(pdev);
+	}
+out_put:
+	iommu_group_put(group);
+}
+
+void iommu_report_device_broken(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+
+	if (!group)
+		return;
+
+	if (READ_ONCE(group->requires_reset)) {
+		iommu_group_put(group);
+		return;
+	}
+	WRITE_ONCE(group->requires_reset, true);
+
+	/* Put the group now or later in iommu_group_broken_worker() */
+	if (!schedule_work(&group->broken_work))
+		iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_broken);
+
 #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
 /**
  * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-17 19:15 ` [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
@ 2026-03-18  6:13   ` Baolu Lu
  2026-03-19  1:31     ` Nicolin Chen
  2026-03-18  7:31   ` Tian, Kevin
  2026-03-18 11:45   ` Shuai Xue
  2 siblings, 1 reply; 47+ messages in thread
From: Baolu Lu @ 2026-03-18  6:13 UTC (permalink / raw)
  To: Nicolin Chen, will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, xueshuai, kevin.tian, linux-arm-kernel,
	iommu, linux-kernel, linux-acpi, linux-pci, vsethi

On 3/18/26 03:15, Nicolin Chen wrote:
> When an IOMMU hardware detects an error due to a faulty device (e.g. an ATS
> invalidation timeout), IOMMU drivers may quarantine the device by disabling
> specific hardware features or dropping translation capabilities.
> 
> However, the core-level states of the faulty device are out of sync, as the
> device can be still attached to a translation domain or even potentially be
> moved to a new domain that might overwrite the driver-level quarantine.
> 
> Given that such an error can be likely an ISR, introduce a broken_work per
> iommu_group, and add a helper function to allow driver to report the broken
> device, so as to completely quarantine it in the core.
> 
> Use the existing pci_dev_reset_iommu_prepare() function to shift the device
> to its resetting_domain/blocking_domain. A later pci_dev_reset_iommu_done()
> call will clear it and move it out of the quarantine.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   include/linux/iommu.h |  2 ++
>   drivers/iommu/iommu.c | 59 +++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 61 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 9ba12b2164724..9b5f94e566ff9 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -891,6 +891,8 @@ static inline struct iommu_device *__iommu_get_iommu_dev(struct device *dev)
>   #define iommu_get_iommu_dev(dev, type, member) \
>   	container_of(__iommu_get_iommu_dev(dev), type, member)
>   
> +void iommu_report_device_broken(struct device *dev);
> +
>   static inline void iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
>   {
>   	*gather = (struct iommu_iotlb_gather) {
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index fcd2902d9e8db..2f297f689a3a3 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -55,6 +55,8 @@ struct iommu_group {
>   	struct list_head devices;
>   	struct xarray pasid_array;
>   	struct mutex mutex;
> +	struct work_struct broken_work;
> +	bool requires_reset;
>   	void *iommu_data;
>   	void (*iommu_data_release)(void *iommu_data);
>   	char *name;
> @@ -146,6 +148,7 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
>   						     struct device *dev);
>   static void __iommu_group_free_device(struct iommu_group *group,
>   				      struct group_device *grp_dev);
> +static void iommu_group_broken_worker(struct work_struct *work);
>   static void iommu_domain_init(struct iommu_domain *domain, unsigned int type,
>   			      const struct iommu_ops *ops);
>   
> @@ -1057,6 +1060,7 @@ struct iommu_group *iommu_group_alloc(void)
>   	if (!group)
>   		return ERR_PTR(-ENOMEM);
>   
> +	INIT_WORK(&group->broken_work, iommu_group_broken_worker);
>   	group->kobj.kset = iommu_group_kset;
>   	mutex_init(&group->mutex);
>   	INIT_LIST_HEAD(&group->devices);
> @@ -4031,6 +4035,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
>   	if (WARN_ON(!group->blocking_domain))
>   		return;
>   
> +	WRITE_ONCE(group->requires_reset, false);
>   	/*
>   	 * A PCI device might have been in an error state, so the IOMMU driver
>   	 * had to quarantine the device by disabling specific hardware feature
> @@ -4062,6 +4067,60 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
>   }
>   EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
>   
> +static void iommu_group_broken_worker(struct work_struct *work)
> +{
> +	struct iommu_group *group =
> +		container_of(work, struct iommu_group, broken_work);
> +	struct pci_dev *pdev = NULL;
> +	struct device *dev;
> +
> +	scoped_guard(mutex, &group->mutex) {
> +		/* Do not block the device again if it has been recovered */
> +		if (!READ_ONCE(group->requires_reset))
> +			goto out_put;
> +		if (list_is_singular(&group->devices)) {
> +			/* Note: only support group with a single device */
> +			dev = iommu_group_first_dev(group);
> +			if (dev_is_pci(dev)) {
> +				pdev = to_pci_dev(dev);
> +				pci_dev_get(pdev);
> +			}
> +		}

The current mechanism is designed only for the PCIe devices within
singleton iommu groups. How about moving the above check to
iommu_report_device_broken() and emitting a message if it is called for
unsupported devices?

> +	}
> +
> +	if (pdev) {
> +		/*
> +		 * Quarantine the device completely. This will be cleared upon
> +		 * a pci_dev_reset_iommu_done() call indicating the recovery.
> +		 */
> +		pci_dev_lock(pdev);
> +		pci_dev_reset_iommu_prepare(pdev);
> +		pci_dev_unlock(pdev);
> +		pci_dev_put(pdev);
> +	}
> +out_put:
> +	iommu_group_put(group);
> +}
> +
> +void iommu_report_device_broken(struct device *dev)
> +{
> +	struct iommu_group *group = iommu_group_get(dev);
> +
> +	if (!group)
> +		return;
> +
> +	if (READ_ONCE(group->requires_reset)) {
> +		iommu_group_put(group);
> +		return;
> +	}
> +	WRITE_ONCE(group->requires_reset, true);
> +
> +	/* Put the group now or later in iommu_group_broken_worker() */
> +	if (!schedule_work(&group->broken_work))
> +		iommu_group_put(group);
> +}
> +EXPORT_SYMBOL_GPL(iommu_report_device_broken);
> +
>   #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
>   /**
>    * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain

Thanks,
baolu

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-18  6:13   ` Baolu Lu
@ 2026-03-19  1:31     ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-19  1:31 UTC (permalink / raw)
  To: Baolu Lu
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	xueshuai, kevin.tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Wed, Mar 18, 2026 at 02:13:09PM +0800, Baolu Lu wrote:
> On 3/18/26 03:15, Nicolin Chen wrote:
> > +	scoped_guard(mutex, &group->mutex) {
> > +		/* Do not block the device again if it has been recovered */
> > +		if (!READ_ONCE(group->requires_reset))
> > +			goto out_put;
> > +		if (list_is_singular(&group->devices)) {
> > +			/* Note: only support group with a single device */
> > +			dev = iommu_group_first_dev(group);
> > +			if (dev_is_pci(dev)) {
> > +				pdev = to_pci_dev(dev);
> > +				pci_dev_get(pdev);
> > +			}
> > +		}
> 
> The current mechanism is designed only for the PCIe devices within
> singleton iommu groups. How about moving the above check to
> iommu_report_device_broken() and emitting a message if it is called for
> unsupported devices?

Yes. That would be cleaner.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-17 19:15 ` [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
  2026-03-18  6:13   ` Baolu Lu
@ 2026-03-18  7:31   ` Tian, Kevin
  2026-03-19  1:30     ` Nicolin Chen
  2026-03-18 11:45   ` Shuai Xue
  2 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-18  7:31 UTC (permalink / raw)
  To: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, jgg@nvidia.com
  Cc: rafael@kernel.org, lenb@kernel.org, praan@google.com,
	baolu.lu@linux.intel.com, xueshuai@linux.alibaba.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-pci@vger.kernel.org, Vikram Sethi

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, March 18, 2026 3:16 AM
> 
> +static void iommu_group_broken_worker(struct work_struct *work)
> +{
> +	struct iommu_group *group =
> +		container_of(work, struct iommu_group, broken_work);
> +	struct pci_dev *pdev = NULL;
> +	struct device *dev;
> +
> +	scoped_guard(mutex, &group->mutex) {
> +		/* Do not block the device again if it has been recovered */
> +		if (!READ_ONCE(group->requires_reset))
> +			goto out_put;
> +		if (list_is_singular(&group->devices)) {
> +			/* Note: only support group with a single device */

this series is about fixing a vulnerability. Then it sounds incomplete to
leave certain configuration still under risk. Probably we should first 
ensure ATS can be enabled only in singleton group, just like how we
did for pci_enable_pasid()?

> +			dev = iommu_group_first_dev(group);
> +			if (dev_is_pci(dev)) {
> +				pdev = to_pci_dev(dev);
> +				pci_dev_get(pdev);
> +			}
> +		}
> +	}
> +
> +	if (pdev) {
> +		/*
> +		 * Quarantine the device completely. This will be cleared
> upon
> +		 * a pci_dev_reset_iommu_done() call indicating the recovery.
> +		 */
> +		pci_dev_lock(pdev);
> +		pci_dev_reset_iommu_prepare(pdev);

let's rename it to iommu_quarantine_device() to be called here. then
have another wrapper pci_dev_reset_iommu_prepare() to call it too.

this path has nothing to do with reset.

> +		pci_dev_unlock(pdev);
> +		pci_dev_put(pdev);
> +	}
> +out_put:
> +	iommu_group_put(group);
> +}
> +

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-18  7:31   ` Tian, Kevin
@ 2026-03-19  1:30     ` Nicolin Chen
  2026-03-19  2:35       ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-19  1:30 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Wed, Mar 18, 2026 at 07:31:18AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, March 18, 2026 3:16 AM
> > 
> > +static void iommu_group_broken_worker(struct work_struct *work)
> > +{
> > +	struct iommu_group *group =
> > +		container_of(work, struct iommu_group, broken_work);
> > +	struct pci_dev *pdev = NULL;
> > +	struct device *dev;
> > +
> > +	scoped_guard(mutex, &group->mutex) {
> > +		/* Do not block the device again if it has been recovered */
> > +		if (!READ_ONCE(group->requires_reset))
> > +			goto out_put;
> > +		if (list_is_singular(&group->devices)) {
> > +			/* Note: only support group with a single device */
> 
> this series is about fixing a vulnerability. Then it sounds incomplete to
> leave certain configuration still under risk. Probably we should first 
> ensure ATS can be enabled only in singleton group, just like how we
> did for pci_enable_pasid()?

I understand your concern. But I am not very sure about applying
limitation to ATS support. Would this block some existing cases?

> > +			dev = iommu_group_first_dev(group);
> > +			if (dev_is_pci(dev)) {
> > +				pdev = to_pci_dev(dev);
> > +				pci_dev_get(pdev);
> > +			}
> > +		}
> > +	}
> > +
> > +	if (pdev) {
> > +		/*
> > +		 * Quarantine the device completely. This will be cleared
> > upon
> > +		 * a pci_dev_reset_iommu_done() call indicating the recovery.
> > +		 */
> > +		pci_dev_lock(pdev);
> > +		pci_dev_reset_iommu_prepare(pdev);
> 
> let's rename it to iommu_quarantine_device() to be called here. then
> have another wrapper pci_dev_reset_iommu_prepare() to call it too.
> 
> this path has nothing to do with reset.

But the implementation of that iommu_quarantine_device would be
still to shift to resetting_domain. Perhaps, we can rename that
to quarantine_domain or so.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-19  1:30     ` Nicolin Chen
@ 2026-03-19  2:35       ` Tian, Kevin
  2026-03-19  3:13         ` Nicolin Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-19  2:35 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 19, 2026 9:31 AM
> 
> On Wed, Mar 18, 2026 at 07:31:18AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, March 18, 2026 3:16 AM
> > >
> > > +static void iommu_group_broken_worker(struct work_struct *work)
> > > +{
> > > +	struct iommu_group *group =
> > > +		container_of(work, struct iommu_group, broken_work);
> > > +	struct pci_dev *pdev = NULL;
> > > +	struct device *dev;
> > > +
> > > +	scoped_guard(mutex, &group->mutex) {
> > > +		/* Do not block the device again if it has been recovered */
> > > +		if (!READ_ONCE(group->requires_reset))
> > > +			goto out_put;
> > > +		if (list_is_singular(&group->devices)) {
> > > +			/* Note: only support group with a single device */
> >
> > this series is about fixing a vulnerability. Then it sounds incomplete to
> > leave certain configuration still under risk. Probably we should first
> > ensure ATS can be enabled only in singleton group, just like how we
> > did for pci_enable_pasid()?
> 
> I understand your concern. But I am not very sure about applying
> limitation to ATS support. Would this block some existing cases?

what about just throwing out a message to warn that enabling ATS
in a non-singleton iommu group may suffer from unquarantined
situation if any device in the group triggers a ATC invalidation timeout?

> 
> > > +			dev = iommu_group_first_dev(group);
> > > +			if (dev_is_pci(dev)) {
> > > +				pdev = to_pci_dev(dev);
> > > +				pci_dev_get(pdev);
> > > +			}
> > > +		}
> > > +	}
> > > +
> > > +	if (pdev) {
> > > +		/*
> > > +		 * Quarantine the device completely. This will be cleared
> > > upon
> > > +		 * a pci_dev_reset_iommu_done() call indicating the recovery.
> > > +		 */
> > > +		pci_dev_lock(pdev);
> > > +		pci_dev_reset_iommu_prepare(pdev);
> >
> > let's rename it to iommu_quarantine_device() to be called here. then
> > have another wrapper pci_dev_reset_iommu_prepare() to call it too.
> >
> > this path has nothing to do with reset.
> 
> But the implementation of that iommu_quarantine_device would be
> still to shift to resetting_domain. Perhaps, we can rename that
> to quarantine_domain or so.
> 

yes let's rename that too. the purpose of this function is clearly about
quarantine. reset is just one user of it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-19  2:35       ` Tian, Kevin
@ 2026-03-19  3:13         ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-19  3:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Thu, Mar 19, 2026 at 02:35:33AM +0000, Tian, Kevin wrote:
> > > > +	scoped_guard(mutex, &group->mutex) {
> > > > +		/* Do not block the device again if it has been recovered */
> > > > +		if (!READ_ONCE(group->requires_reset))
> > > > +			goto out_put;
> > > > +		if (list_is_singular(&group->devices)) {
> > > > +			/* Note: only support group with a single device */
> > >
> > > this series is about fixing a vulnerability. Then it sounds incomplete to
> > > leave certain configuration still under risk. Probably we should first
> > > ensure ATS can be enabled only in singleton group, just like how we
> > > did for pci_enable_pasid()?
> > 
> > I understand your concern. But I am not very sure about applying
> > limitation to ATS support. Would this block some existing cases?
> 
> what about just throwing out a message to warn that enabling ATS
> in a non-singleton iommu group may suffer from unquarantined
> situation if any device in the group triggers a ATC invalidation timeout?

Yes. Baolu suggested the same, we could move this condition into
iommu_report_device_broken().

> > > > +		/*
> > > > +		 * Quarantine the device completely. This will be cleared
> > > > upon
> > > > +		 * a pci_dev_reset_iommu_done() call indicating the recovery.
> > > > +		 */
> > > > +		pci_dev_lock(pdev);
> > > > +		pci_dev_reset_iommu_prepare(pdev);
> > >
> > > let's rename it to iommu_quarantine_device() to be called here. then
> > > have another wrapper pci_dev_reset_iommu_prepare() to call it too.
> > >
> > > this path has nothing to do with reset.
> > 
> > But the implementation of that iommu_quarantine_device would be
> > still to shift to resetting_domain. Perhaps, we can rename that
> > to quarantine_domain or so.
> 
> yes let's rename that too. the purpose of this function is clearly about
> quarantine. reset is just one user of it.

OK. I will make it happen.

Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-17 19:15 ` [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
  2026-03-18  6:13   ` Baolu Lu
  2026-03-18  7:31   ` Tian, Kevin
@ 2026-03-18 11:45   ` Shuai Xue
  2026-03-18 20:29     ` Nicolin Chen
  2 siblings, 1 reply; 47+ messages in thread
From: Shuai Xue @ 2026-03-18 11:45 UTC (permalink / raw)
  To: Nicolin Chen, will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, kevin.tian, linux-arm-kernel,
	iommu, linux-kernel, linux-acpi, linux-pci, vsethi



On 3/18/26 3:15 AM, Nicolin Chen wrote:
> When an IOMMU hardware detects an error due to a faulty device (e.g. an ATS
> invalidation timeout), IOMMU drivers may quarantine the device by disabling
> specific hardware features or dropping translation capabilities.
> 
> However, the core-level states of the faulty device are out of sync, as the
> device can be still attached to a translation domain or even potentially be
> moved to a new domain that might overwrite the driver-level quarantine.
> 
> Given that such an error can be likely an ISR, introduce a broken_work per
> iommu_group, and add a helper function to allow driver to report the broken
> device, so as to completely quarantine it in the core.
> 
> Use the existing pci_dev_reset_iommu_prepare() function to shift the device
> to its resetting_domain/blocking_domain. A later pci_dev_reset_iommu_done()
> call will clear it and move it out of the quarantine.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   include/linux/iommu.h |  2 ++
>   drivers/iommu/iommu.c | 59 +++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 61 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 9ba12b2164724..9b5f94e566ff9 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -891,6 +891,8 @@ static inline struct iommu_device *__iommu_get_iommu_dev(struct device *dev)
>   #define iommu_get_iommu_dev(dev, type, member) \
>   	container_of(__iommu_get_iommu_dev(dev), type, member)
>   
> +void iommu_report_device_broken(struct device *dev);
> +

This declaration is inside the #ifdef CONFIG_IOMMU_API section, but
there's no corresponding stub in the #else block. While current
callers (arm-smmu-v3) always have CONFIG_IOMMU_API, for API
completeness, please add a stub.
Thanks.
Shuai

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-03-18 11:45   ` Shuai Xue
@ 2026-03-18 20:29     ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-18 20:29 UTC (permalink / raw)
  To: Shuai Xue
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	baolu.lu, kevin.tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Wed, Mar 18, 2026 at 07:45:19PM +0800, Shuai Xue wrote:
> On 3/18/26 3:15 AM, Nicolin Chen wrote:
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index 9ba12b2164724..9b5f94e566ff9 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -891,6 +891,8 @@ static inline struct iommu_device *__iommu_get_iommu_dev(struct device *dev)
> >   #define iommu_get_iommu_dev(dev, type, member) \
> >   	container_of(__iommu_get_iommu_dev(dev), type, member)
> > +void iommu_report_device_broken(struct device *dev);
> > +
> 
> This declaration is inside the #ifdef CONFIG_IOMMU_API section, but
> there's no corresponding stub in the #else block. While current
> callers (arm-smmu-v3) always have CONFIG_IOMMU_API, for API
> completeness, please add a stub.

Yes. That's a miss.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (2 preceding siblings ...)
  2026-03-17 19:15 ` [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
@ 2026-03-17 19:15 ` Nicolin Chen
  2026-03-18  7:36   ` Tian, Kevin
  2026-03-18 22:02   ` Samiullah Khawaja
  2026-03-17 19:15 ` [PATCH v2 5/7] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv Nicolin Chen
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-17 19:15 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, xueshuai, kevin.tian,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware is
aware of the timeout via a GERROR interrupt, the driver thread issuing the
commands lacks a direct mechanism to verify whether its specific batch was
the cause or not, as polling the CMD_SYNC status doesn't natively return a
failure code, making it very difficult to coordinate per-device recovery.

Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
gap. When the ISR detects an ATC timeout, set the bit corresponding to the
physical CMDQ index of the faulting CMD_SYNC command.

On the issuer side, after polling completes (or times out), test and clear
its dedicated bit. If set, override any generic timeout, return -ETIMEDOUT
to trigger device quarantine.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 36de2b0b2ebe6..3eb12a34b086a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -633,6 +633,7 @@ struct arm_smmu_cmdq {
 	atomic_long_t			*valid_map;
 	atomic_t			owner_prod;
 	atomic_t			lock;
+	unsigned long			*atc_sync_timeouts;
 	bool				(*supports_cmd)(struct arm_smmu_cmdq_ent *ent);
 };
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 01030ffd2fe23..9c8972ebc94f9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -445,7 +445,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
 		 * at the CMD_SYNC. Attempt to complete other pending commands
 		 * by repeating the CMD_SYNC, though we might well end up back
 		 * here since the ATC invalidation may still be pending.
+		 *
+		 * Mark the faulty batch in the bitmap for the issuer to match.
 		 */
+		set_bit(Q_IDX(&q->llq, cons), cmdq->atc_sync_timeouts);
 		return;
 	case CMDQ_ERR_CERROR_ILL_IDX:
 	default:
@@ -895,9 +898,19 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 
 	/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
 	if (sync) {
+		u32 sync_prod;
+
 		llq.prod = queue_inc_prod_n(&llq, n);
+		sync_prod = llq.prod;
+
 		ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
-		if (ret) {
+		if (test_and_clear_bit(Q_IDX(&llq, sync_prod),
+				       cmdq->atc_sync_timeouts)) {
+			dev_err_ratelimited(smmu->dev,
+					    "CMD_SYNC for ATC_INV timeout at prod=0x%08x\n",
+					    sync_prod);
+			ret = -ETIMEDOUT;
+		} else if (ret) {
 			dev_err_ratelimited(smmu->dev,
 					    "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
 					    llq.prod,
@@ -4458,6 +4471,11 @@ int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
 	if (!cmdq->valid_map)
 		return -ENOMEM;
 
+	cmdq->atc_sync_timeouts =
+		devm_bitmap_zalloc(smmu->dev, nents, GFP_KERNEL);
+	if (!cmdq->atc_sync_timeouts)
+		return -ENOMEM;
+
 	return 0;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-17 19:15 ` [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
@ 2026-03-18  7:36   ` Tian, Kevin
  2026-03-18 19:26     ` Nicolin Chen
  2026-03-18 22:02   ` Samiullah Khawaja
  1 sibling, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-18  7:36 UTC (permalink / raw)
  To: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, jgg@nvidia.com
  Cc: rafael@kernel.org, lenb@kernel.org, praan@google.com,
	baolu.lu@linux.intel.com, xueshuai@linux.alibaba.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-pci@vger.kernel.org, Vikram Sethi

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, March 18, 2026 3:16 AM
> 
> An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware is
> aware of the timeout via a GERROR interrupt, the driver thread issuing the
> commands lacks a direct mechanism to verify whether its specific batch was
> the cause or not, as polling the CMD_SYNC status doesn't natively return a
> failure code, making it very difficult to coordinate per-device recovery.
> 
> Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
> gap. When the ISR detects an ATC timeout, set the bit corresponding to the
> physical CMDQ index of the faulting CMD_SYNC command.
> 

It's nice to see the ability of allowing sw to identify the faulting sync command
upon an ATC timeout! On VT-d it's not feasible when multiple wait descriptors
(similar to CMD_SYNC) are in-fly... :/

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-18  7:36   ` Tian, Kevin
@ 2026-03-18 19:26     ` Nicolin Chen
  2026-03-18 22:06       ` Samiullah Khawaja
  0 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-18 19:26 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Wed, Mar 18, 2026 at 07:36:20AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, March 18, 2026 3:16 AM
> > 
> > An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware is
> > aware of the timeout via a GERROR interrupt, the driver thread issuing the
> > commands lacks a direct mechanism to verify whether its specific batch was
> > the cause or not, as polling the CMD_SYNC status doesn't natively return a
> > failure code, making it very difficult to coordinate per-device recovery.
> > 
> > Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
> > gap. When the ISR detects an ATC timeout, set the bit corresponding to the
> > physical CMDQ index of the faulting CMD_SYNC command.
> > 
> 
> It's nice to see the ability of allowing sw to identify the faulting sync command
> upon an ATC timeout! On VT-d it's not feasible when multiple wait descriptors
> (similar to CMD_SYNC) are in-fly... :/

Actually SMMU doesn't know which device is faulting when CMD_SYNC
follows ATC_INV commands for multiple devices. The commit message
in PATCH-7 describes this in the end. So Jason suggested to retry
those ATC_INV commands by bisecting them per-device, which allows
us to pinpoint which device.

Could VT-d do the same?

Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-18 19:26     ` Nicolin Chen
@ 2026-03-18 22:06       ` Samiullah Khawaja
  2026-03-19  3:08         ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Samiullah Khawaja @ 2026-03-18 22:06 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, jgg@nvidia.com,
	rafael@kernel.org, lenb@kernel.org, praan@google.com,
	baolu.lu@linux.intel.com, xueshuai@linux.alibaba.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-pci@vger.kernel.org, Vikram Sethi

Hi Nicolin,

On Wed, Mar 18, 2026 at 12:26:33PM -0700, Nicolin Chen wrote:
>On Wed, Mar 18, 2026 at 07:36:20AM +0000, Tian, Kevin wrote:
>> > From: Nicolin Chen <nicolinc@nvidia.com>
>> > Sent: Wednesday, March 18, 2026 3:16 AM
>> >
>> > An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware is
>> > aware of the timeout via a GERROR interrupt, the driver thread issuing the
>> > commands lacks a direct mechanism to verify whether its specific batch was
>> > the cause or not, as polling the CMD_SYNC status doesn't natively return a
>> > failure code, making it very difficult to coordinate per-device recovery.
>> >
>> > Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
>> > gap. When the ISR detects an ATC timeout, set the bit corresponding to the
>> > physical CMDQ index of the faulting CMD_SYNC command.
>> >
>>
>> It's nice to see the ability of allowing sw to identify the faulting sync command
>> upon an ATC timeout! On VT-d it's not feasible when multiple wait descriptors
>> (similar to CMD_SYNC) are in-fly... :/
>
>Actually SMMU doesn't know which device is faulting when CMD_SYNC

VT-d is able to find out the SID of the device for which the device TLB
invalidation timed-out occured by using the SID reported in the
"Invalidation Queue Error Record Register" (VT-d Specs 11.4.9.9).
>follows ATC_INV commands for multiple devices. The commit message
>in PATCH-7 describes this in the end. So Jason suggested to retry
>those ATC_INV commands by bisecting them per-device, which allows
>us to pinpoint which device.

But for a software timeout, something like this would be needed.
>
>Could VT-d do the same?
>
>Nicolin
>

Thanks,
Sami

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-18 22:06       ` Samiullah Khawaja
@ 2026-03-19  3:08         ` Tian, Kevin
  2026-03-19  3:12           ` Nicolin Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-19  3:08 UTC (permalink / raw)
  To: Samiullah Khawaja, Nicolin Chen
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

> From: Samiullah Khawaja <skhawaja@google.com>
> Sent: Thursday, March 19, 2026 6:07 AM
> 
> Hi Nicolin,
> 
> On Wed, Mar 18, 2026 at 12:26:33PM -0700, Nicolin Chen wrote:
> >On Wed, Mar 18, 2026 at 07:36:20AM +0000, Tian, Kevin wrote:
> >> > From: Nicolin Chen <nicolinc@nvidia.com>
> >> > Sent: Wednesday, March 18, 2026 3:16 AM
> >> >
> >> > An ATC invalidation timeout is a fatal error. While the SMMUv3
> hardware is
> >> > aware of the timeout via a GERROR interrupt, the driver thread issuing
> the
> >> > commands lacks a direct mechanism to verify whether its specific batch
> was
> >> > the cause or not, as polling the CMD_SYNC status doesn't natively return
> a
> >> > failure code, making it very difficult to coordinate per-device recovery.
> >> >
> >> > Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge
> this
> >> > gap. When the ISR detects an ATC timeout, set the bit corresponding to
> the
> >> > physical CMDQ index of the faulting CMD_SYNC command.
> >> >
> >>
> >> It's nice to see the ability of allowing sw to identify the faulting sync
> command
> >> upon an ATC timeout! On VT-d it's not feasible when multiple wait
> descriptors
> >> (similar to CMD_SYNC) are in-fly... :/
> >
> >Actually SMMU doesn't know which device is faulting when CMD_SYNC
> 
> VT-d is able to find out the SID of the device for which the device TLB
> invalidation timed-out occured by using the SID reported in the
> "Invalidation Queue Error Record Register" (VT-d Specs 11.4.9.9).

yes. but when there are multiple submissions (each with a wait descriptor)
fetched/handled by the hw and then an invalidation timeout comes, all
pending wait descriptors will be aborted (not just the one corresponding
to the timeout). In this case all affected submitters need to re-try.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-19  3:08         ` Tian, Kevin
@ 2026-03-19  3:12           ` Nicolin Chen
  2026-03-23 23:51             ` Jason Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-19  3:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Samiullah Khawaja, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, jgg@nvidia.com,
	rafael@kernel.org, lenb@kernel.org, praan@google.com,
	baolu.lu@linux.intel.com, xueshuai@linux.alibaba.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-pci@vger.kernel.org, Vikram Sethi

On Thu, Mar 19, 2026 at 03:08:05AM +0000, Tian, Kevin wrote:
> > From: Samiullah Khawaja <skhawaja@google.com>
> > Sent: Thursday, March 19, 2026 6:07 AM
> > 
> > Hi Nicolin,
> > 
> > On Wed, Mar 18, 2026 at 12:26:33PM -0700, Nicolin Chen wrote:
> > >On Wed, Mar 18, 2026 at 07:36:20AM +0000, Tian, Kevin wrote:
> > >> > From: Nicolin Chen <nicolinc@nvidia.com>
> > >> > Sent: Wednesday, March 18, 2026 3:16 AM
> > >> >
> > >> > An ATC invalidation timeout is a fatal error. While the SMMUv3
> > hardware is
> > >> > aware of the timeout via a GERROR interrupt, the driver thread issuing
> > the
> > >> > commands lacks a direct mechanism to verify whether its specific batch
> > was
> > >> > the cause or not, as polling the CMD_SYNC status doesn't natively return
> > a
> > >> > failure code, making it very difficult to coordinate per-device recovery.
> > >> >
> > >> > Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge
> > this
> > >> > gap. When the ISR detects an ATC timeout, set the bit corresponding to
> > the
> > >> > physical CMDQ index of the faulting CMD_SYNC command.
> > >> >
> > >>
> > >> It's nice to see the ability of allowing sw to identify the faulting sync
> > command
> > >> upon an ATC timeout! On VT-d it's not feasible when multiple wait
> > descriptors
> > >> (similar to CMD_SYNC) are in-fly... :/
> > >
> > >Actually SMMU doesn't know which device is faulting when CMD_SYNC
> > 
> > VT-d is able to find out the SID of the device for which the device TLB
> > invalidation timed-out occured by using the SID reported in the
> > "Invalidation Queue Error Record Register" (VT-d Specs 11.4.9.9).
> 
> yes. but when there are multiple submissions (each with a wait descriptor)
> fetched/handled by the hw and then an invalidation timeout comes, all
> pending wait descriptors will be aborted (not just the one corresponding
> to the timeout). In this case all affected submitters need to re-try.

This sounds similar to SMMU then.

Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-19  3:12           ` Nicolin Chen
@ 2026-03-23 23:51             ` Jason Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Gunthorpe @ 2026-03-23 23:51 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, Samiullah Khawaja, will@kernel.org,
	robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
	rafael@kernel.org, lenb@kernel.org, praan@google.com,
	baolu.lu@linux.intel.com, xueshuai@linux.alibaba.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-pci@vger.kernel.org, Vikram Sethi

On Wed, Mar 18, 2026 at 08:12:04PM -0700, Nicolin Chen wrote:
> > > VT-d is able to find out the SID of the device for which the device TLB
> > > invalidation timed-out occured by using the SID reported in the
> > > "Invalidation Queue Error Record Register" (VT-d Specs 11.4.9.9).
> > 
> > yes. but when there are multiple submissions (each with a wait descriptor)
> > fetched/handled by the hw and then an invalidation timeout comes, all
> > pending wait descriptors will be aborted (not just the one corresponding
> > to the timeout). In this case all affected submitters need to re-try.
> 
> This sounds similar to SMMU then.

Not entirely.. smmu HW stops processing at a SYNC and waits for
everything pending to complete, then goes on forward. If there is a HW
reported ATC timeout then it is contained to the SYNC that followed
the ATC invalidation. The errored sync is skipped and whatever follows
continues forward, so it doesn't contaminate future work.

VT-d's wait descriptor with fence FN=1 sounds identical???

I guess if FN=0 then things start to become indeterminate what the
wait actually waits for..

Jason

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-17 19:15 ` [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
  2026-03-18  7:36   ` Tian, Kevin
@ 2026-03-18 22:02   ` Samiullah Khawaja
  2026-03-18 23:23     ` Nicolin Chen
  1 sibling, 1 reply; 47+ messages in thread
From: Samiullah Khawaja @ 2026-03-18 22:02 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	baolu.lu, xueshuai, kevin.tian, linux-arm-kernel, iommu,
	linux-kernel, linux-acpi, linux-pci, vsethi

Hi Nicolin,

On Tue, Mar 17, 2026 at 12:15:37PM -0700, Nicolin Chen wrote:
>An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware is
>aware of the timeout via a GERROR interrupt, the driver thread issuing the
>commands lacks a direct mechanism to verify whether its specific batch was
>the cause or not, as polling the CMD_SYNC status doesn't natively return a
>failure code, making it very difficult to coordinate per-device recovery.
>
>Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
>gap. When the ISR detects an ATC timeout, set the bit corresponding to the
>physical CMDQ index of the faulting CMD_SYNC command.
>
>On the issuer side, after polling completes (or times out), test and clear
>its dedicated bit. If set, override any generic timeout, return -ETIMEDOUT
>to trigger device quarantine.
>
>Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 +++++++++++++++++++-
> 2 files changed, 20 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>index 36de2b0b2ebe6..3eb12a34b086a 100644
>--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>@@ -633,6 +633,7 @@ struct arm_smmu_cmdq {
> 	atomic_long_t			*valid_map;
> 	atomic_t			owner_prod;
> 	atomic_t			lock;
>+	unsigned long			*atc_sync_timeouts;
> 	bool				(*supports_cmd)(struct arm_smmu_cmdq_ent *ent);
> };
>
>diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>index 01030ffd2fe23..9c8972ebc94f9 100644
>--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>@@ -445,7 +445,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
> 		 * at the CMD_SYNC. Attempt to complete other pending commands
> 		 * by repeating the CMD_SYNC, though we might well end up back
> 		 * here since the ATC invalidation may still be pending.
>+		 *
>+		 * Mark the faulty batch in the bitmap for the issuer to match.
> 		 */
>+		set_bit(Q_IDX(&q->llq, cons), cmdq->atc_sync_timeouts);
> 		return;
> 	case CMDQ_ERR_CERROR_ILL_IDX:
> 	default:
>@@ -895,9 +898,19 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>
> 	/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
> 	if (sync) {
>+		u32 sync_prod;
>+
> 		llq.prod = queue_inc_prod_n(&llq, n);
>+		sync_prod = llq.prod;
>+
> 		ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
>-		if (ret) {
>+		if (test_and_clear_bit(Q_IDX(&llq, sync_prod),
>+				       cmdq->atc_sync_timeouts)) {

This will not be set if a software timeout (1 second) occurs. Do you
know if the ATC timeout of Arm sMMUv3 is less than the software timeout
in the driver?

If not maybe we can handle the software timeout here also as the cmdlist
is already known?

Thanks,
Sami
>+			dev_err_ratelimited(smmu->dev,
>+					    "CMD_SYNC for ATC_INV timeout at prod=0x%08x\n",
>+					    sync_prod);
>+			ret = -ETIMEDOUT;
>+		} else if (ret) {
> 			dev_err_ratelimited(smmu->dev,
> 					    "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
> 					    llq.prod,
>@@ -4458,6 +4471,11 @@ int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
> 	if (!cmdq->valid_map)
> 		return -ENOMEM;
>
>+	cmdq->atc_sync_timeouts =
>+		devm_bitmap_zalloc(smmu->dev, nents, GFP_KERNEL);
>+	if (!cmdq->atc_sync_timeouts)
>+		return -ENOMEM;
>+
> 	return 0;
> }
>
>-- 
>2.43.0
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-18 22:02   ` Samiullah Khawaja
@ 2026-03-18 23:23     ` Nicolin Chen
  2026-03-19  0:08       ` Samiullah Khawaja
  2026-03-23 23:57       ` Jason Gunthorpe
  0 siblings, 2 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-18 23:23 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	baolu.lu, xueshuai, kevin.tian, linux-arm-kernel, iommu,
	linux-kernel, linux-acpi, linux-pci, vsethi

Hi Sami,

On Wed, Mar 18, 2026 at 10:02:32PM +0000, Samiullah Khawaja wrote:
> On Tue, Mar 17, 2026 at 12:15:37PM -0700, Nicolin Chen wrote:
> > @@ -895,9 +898,19 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> > 
> > 	/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
> > 	if (sync) {
> > +		u32 sync_prod;
> > +
> > 		llq.prod = queue_inc_prod_n(&llq, n);
> > +		sync_prod = llq.prod;
> > +
> > 		ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
> > -		if (ret) {
> > +		if (test_and_clear_bit(Q_IDX(&llq, sync_prod),
> > +				       cmdq->atc_sync_timeouts)) {
> 
> This will not be set if a software timeout (1 second) occurs. Do you
> know if the ATC timeout of Arm sMMUv3 is less than the software timeout
> in the driver?

You brought up a good point!

I think ATC timeout follows the PCI Completion Timeout Value in
"Device Control 2 Register", which is typically set [50us, 50ms]
but can be set up to [17s, 64s] according to PCI Base spec.

> If not maybe we can handle the software timeout here also as the cmdlist
> is already known?

I think it's trickier.

If the software times out first at 1s, it means the CMDQ is still
pending on wait for the completion of ATC invalidation. Then, the
caller sees -ETIMEOUT and tries to bisect the ATC batch or update
the STE directly, either of which involves CMDQ. But CMDQ has not
recovered yet.

Then, in case of a batch, all the reties could timeout again. So,
it will fail to identify which device is truly broken. This would
end badly by blindly disabling all the devices in the batch. Also
the disabling calls require CMDQ too, so they might fail as well.

Thus, partially to answer the question, in case software timeout,
I am afraid that we can hardly do anything.. :-/

This means I need to set a different return code for ATC timeouts
v.s. software timeouts.

Also, there is another problem: when PCI CTO finally reaches, the
GERROR ISR will set atc_sync_timeouts but nobody will clear it..
So, before calling arm_smmu_cmdq_issue_cmdlist(), we need to make
sure there is no dirty bit on the bitmap too.

Thanks!
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-18 23:23     ` Nicolin Chen
@ 2026-03-19  0:08       ` Samiullah Khawaja
  2026-03-19  1:15         ` Nicolin Chen
  2026-03-23 23:57       ` Jason Gunthorpe
  1 sibling, 1 reply; 47+ messages in thread
From: Samiullah Khawaja @ 2026-03-19  0:08 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	baolu.lu, xueshuai, kevin.tian, linux-arm-kernel, iommu,
	linux-kernel, linux-acpi, linux-pci, vsethi

On Wed, Mar 18, 2026 at 04:23:53PM -0700, Nicolin Chen wrote:
>Hi Sami,
>
>On Wed, Mar 18, 2026 at 10:02:32PM +0000, Samiullah Khawaja wrote:
>> On Tue, Mar 17, 2026 at 12:15:37PM -0700, Nicolin Chen wrote:
>> > @@ -895,9 +898,19 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>> >
>> > 	/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
>> > 	if (sync) {
>> > +		u32 sync_prod;
>> > +
>> > 		llq.prod = queue_inc_prod_n(&llq, n);
>> > +		sync_prod = llq.prod;
>> > +
>> > 		ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
>> > -		if (ret) {
>> > +		if (test_and_clear_bit(Q_IDX(&llq, sync_prod),
>> > +				       cmdq->atc_sync_timeouts)) {
>>
>> This will not be set if a software timeout (1 second) occurs. Do you
>> know if the ATC timeout of Arm sMMUv3 is less than the software timeout
>> in the driver?
>
>You brought up a good point!
>
>I think ATC timeout follows the PCI Completion Timeout Value in
>"Device Control 2 Register", which is typically set [50us, 50ms]
>but can be set up to [17s, 64s] according to PCI Base spec.

Agreed.
>
>> If not maybe we can handle the software timeout here also as the cmdlist
>> is already known?
>
>I think it's trickier.
>
>If the software times out first at 1s, it means the CMDQ is still
>pending on wait for the completion of ATC invalidation. Then, the
>caller sees -ETIMEOUT and tries to bisect the ATC batch or update
>the STE directly, either of which involves CMDQ. But CMDQ has not
>recovered yet.
>
>Then, in case of a batch, all the reties could timeout again. So,
>it will fail to identify which device is truly broken. This would
>end badly by blindly disabling all the devices in the batch. Also
>the disabling calls require CMDQ too, so they might fail as well.

Yes, looking at VT-d currently and the queue length is 256 and this
spirals out of control quickly.
>
>Thus, partially to answer the question, in case software timeout,
>I am afraid that we can hardly do anything.. :-/

Agreed.

Do you think we can maybe document this somewhere? Maybe add to the
cover letter?
>
>This means I need to set a different return code for ATC timeouts
>v.s. software timeouts.
>
>Also, there is another problem: when PCI CTO finally reaches, the
>GERROR ISR will set atc_sync_timeouts but nobody will clear it..
>So, before calling arm_smmu_cmdq_issue_cmdlist(), we need to make
>sure there is no dirty bit on the bitmap too.

Yes, Just to confirm, do you think this needs to be handled regardless
whether we handle the software timeout for the ATC invalidation?
Basically to cleanup the bit on bitmap.
>
>Thanks!
>Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-19  0:08       ` Samiullah Khawaja
@ 2026-03-19  1:15         ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-19  1:15 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	baolu.lu, xueshuai, kevin.tian, linux-arm-kernel, iommu,
	linux-kernel, linux-acpi, linux-pci, vsethi

On Thu, Mar 19, 2026 at 12:08:04AM +0000, Samiullah Khawaja wrote:
> On Wed, Mar 18, 2026 at 04:23:53PM -0700, Nicolin Chen wrote:
> > If the software times out first at 1s, it means the CMDQ is still
> > pending on wait for the completion of ATC invalidation. Then, the
> > caller sees -ETIMEOUT and tries to bisect the ATC batch or update
> > the STE directly, either of which involves CMDQ. But CMDQ has not
> > recovered yet.
> > 
> > Then, in case of a batch, all the reties could timeout again. So,
> > it will fail to identify which device is truly broken. This would
> > end badly by blindly disabling all the devices in the batch. Also
> > the disabling calls require CMDQ too, so they might fail as well.
> 
> Yes, looking at VT-d currently and the queue length is 256 and this
> spirals out of control quickly.
> > 
> > Thus, partially to answer the question, in case software timeout,
> > I am afraid that we can hardly do anything.. :-/
> 
> Agreed.
> 
> Do you think we can maybe document this somewhere? Maybe add to the
> cover letter?

Yes. I will add a note inline as well where software times out.

> > This means I need to set a different return code for ATC timeouts
> > v.s. software timeouts.
> > 
> > Also, there is another problem: when PCI CTO finally reaches, the
> > GERROR ISR will set atc_sync_timeouts but nobody will clear it..
> > So, before calling arm_smmu_cmdq_issue_cmdlist(), we need to make
> > sure there is no dirty bit on the bitmap too.
> 
> Yes, Just to confirm, do you think this needs to be handled regardless
> whether we handle the software timeout for the ATC invalidation?
> Basically to cleanup the bit on bitmap.

I don't see a reason not to. I think the next issuer who sees a
dirty slot in the bitmap will not have any idea about that ATC
timeout (batch). Basically the previous issuer was returned and
the batch is gone. So, it can do nothing but clear the slot in
the bitmap and move forward.

Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-18 23:23     ` Nicolin Chen
  2026-03-19  0:08       ` Samiullah Khawaja
@ 2026-03-23 23:57       ` Jason Gunthorpe
  2026-03-24  1:21         ` Nicolin Chen
  1 sibling, 1 reply; 47+ messages in thread
From: Jason Gunthorpe @ 2026-03-23 23:57 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Samiullah Khawaja, will, robin.murphy, joro, bhelgaas, rafael,
	lenb, praan, baolu.lu, xueshuai, kevin.tian, linux-arm-kernel,
	iommu, linux-kernel, linux-acpi, linux-pci, vsethi

On Wed, Mar 18, 2026 at 04:23:53PM -0700, Nicolin Chen wrote:

> If the software times out first at 1s, it means the CMDQ is still
> pending on wait for the completion of ATC invalidation. Then, the
> caller sees -ETIMEOUT and tries to bisect the ATC batch or update
> the STE directly, either of which involves CMDQ. But CMDQ has not
> recovered yet.

Yeah, I don't know if the SW timeout flow is really all that RASy here
right now. Without somehow recovering the CMDQ it is pointless to try
to continue after a timeout.

And we are really in trouble if things like normal IOTLB invalidation
start to fail.

I think the right thing is to somehow try to recover the cmdq and then
restart it on the commands that haven't been SYNC'd yet and just keep
trying, maybe with progressively longer timeouts.

Just ignoring the error and continuing doesn't seem safe.

But that's something else again, as long as ATC invalidation reliably
hits the HW timeout first we should be OK to ignore it in this
series..

Jason

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-03-23 23:57       ` Jason Gunthorpe
@ 2026-03-24  1:21         ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-24  1:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Samiullah Khawaja, will, robin.murphy, joro, bhelgaas, rafael,
	lenb, praan, baolu.lu, xueshuai, kevin.tian, linux-arm-kernel,
	iommu, linux-kernel, linux-acpi, linux-pci, vsethi

On Mon, Mar 23, 2026 at 08:57:56PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 18, 2026 at 04:23:53PM -0700, Nicolin Chen wrote:
> 
> > If the software times out first at 1s, it means the CMDQ is still
> > pending on wait for the completion of ATC invalidation. Then, the
> > caller sees -ETIMEOUT and tries to bisect the ATC batch or update
> > the STE directly, either of which involves CMDQ. But CMDQ has not
> > recovered yet.
> 
> Yeah, I don't know if the SW timeout flow is really all that RASy here
> right now. Without somehow recovering the CMDQ it is pointless to try
> to continue after a timeout.
> 
> And we are really in trouble if things like normal IOTLB invalidation
> start to fail.
> 
> I think the right thing is to somehow try to recover the cmdq and then
> restart it on the commands that haven't been SYNC'd yet and just keep
> trying, maybe with progressively longer timeouts.
> 
> Just ignoring the error and continuing doesn't seem safe.
> 
> But that's something else again, as long as ATC invalidation reliably
> hits the HW timeout first we should be OK to ignore it in this
> series..

Yea. I will leave a FIXME inline for now.

Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 5/7] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv
  2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (3 preceding siblings ...)
  2026-03-17 19:15 ` [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
@ 2026-03-17 19:15 ` Nicolin Chen
  2026-03-17 19:15 ` [PATCH v2 6/7] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-17 19:15 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, xueshuai, kevin.tian,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

Storing master allows to backtrack the master pointer from an invalidation
entry, which will be useful when handling ATC invalidation time outs.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  2 +-
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c  | 34 +++++++++++--------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 25 ++++++++------
 3 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 3eb12a34b086a..cb83ea1f3407f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -662,7 +662,7 @@ enum arm_smmu_inv_type {
 };
 
 struct arm_smmu_inv {
-	struct arm_smmu_device *smmu;
+	struct arm_smmu_master *master;
 	u8 type;
 	u8 size_opcode;
 	u8 nsize_opcode;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
index add671363c828..ef0c0bfe44206 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -653,39 +653,43 @@ static void arm_smmu_v3_invs_test_verify(struct kunit *test,
 	}
 }
 
+static struct arm_smmu_master invs_master = {
+	.smmu = &smmu,
+};
+
 static struct arm_smmu_invs invs1 = {
 	.num_invs = 3,
-	.inv = { { .type = INV_TYPE_S2_VMID, .id = 1, },
-		 { .type = INV_TYPE_S2_VMID_S1_CLEAR, .id = 1, },
-		 { .type = INV_TYPE_ATS, .id = 3, }, },
+	.inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, },
+		 { .master = &invs_master, .type = INV_TYPE_S2_VMID_S1_CLEAR, .id = 1, },
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 3, }, },
 };
 
 static struct arm_smmu_invs invs2 = {
 	.num_invs = 3,
-	.inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
-		 { .type = INV_TYPE_ATS, .id = 4, },
-		 { .type = INV_TYPE_ATS, .id = 5, }, },
+	.inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, }, /* dup */
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 4, },
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 5, }, },
 };
 
 static struct arm_smmu_invs invs3 = {
 	.num_invs = 3,
-	.inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
-		 { .type = INV_TYPE_ATS, .id = 5, }, /* recover a trash */
-		 { .type = INV_TYPE_ATS, .id = 6, }, },
+	.inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, }, /* dup */
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 5, }, /* recover a trash */
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 6, }, },
 };
 
 static struct arm_smmu_invs invs4 = {
 	.num_invs = 3,
-	.inv = { { .type = INV_TYPE_ATS, .id = 10, .ssid = 1 },
-		 { .type = INV_TYPE_ATS, .id = 10, .ssid = 3 },
-		 { .type = INV_TYPE_ATS, .id = 12, .ssid = 1 }, },
+	.inv = { { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 1 },
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 3 },
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 12, .ssid = 1 }, },
 };
 
 static struct arm_smmu_invs invs5 = {
 	.num_invs = 3,
-	.inv = { { .type = INV_TYPE_ATS, .id = 10, .ssid = 2 },
-		 { .type = INV_TYPE_ATS, .id = 10, .ssid = 3 }, /* duplicate */
-		 { .type = INV_TYPE_ATS, .id = 12, .ssid = 2 }, },
+	.inv = { { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 2 },
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 3 }, /* dup */
+		 { .master = &invs_master, .type = INV_TYPE_ATS, .id = 12, .ssid = 2 }, },
 };
 
 static void arm_smmu_v3_invs_test(struct kunit *test)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9c8972ebc94f9..aa42fe39d66b6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1071,8 +1071,9 @@ arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
 static int arm_smmu_inv_cmp(const struct arm_smmu_inv *inv_l,
 			    const struct arm_smmu_inv *inv_r)
 {
-	if (inv_l->smmu != inv_r->smmu)
-		return cmp_int((uintptr_t)inv_l->smmu, (uintptr_t)inv_r->smmu);
+	if (inv_l->master->smmu != inv_r->master->smmu)
+		return cmp_int((uintptr_t)inv_l->master->smmu,
+			       (uintptr_t)inv_r->master->smmu);
 	if (inv_l->type != inv_r->type)
 		return cmp_int(inv_l->type, inv_r->type);
 	if (inv_l->id != inv_r->id)
@@ -2629,22 +2630,22 @@ static void arm_smmu_inv_to_cmdq_batch(struct arm_smmu_inv *inv,
 				       unsigned long iova, size_t size,
 				       unsigned int granule)
 {
-	if (arm_smmu_inv_size_too_big(inv->smmu, size, granule)) {
+	if (arm_smmu_inv_size_too_big(inv->master->smmu, size, granule)) {
 		cmd->opcode = inv->nsize_opcode;
-		arm_smmu_cmdq_batch_add(inv->smmu, cmds, cmd);
+		arm_smmu_cmdq_batch_add(inv->master->smmu, cmds, cmd);
 		return;
 	}
 
 	cmd->opcode = inv->size_opcode;
-	arm_smmu_cmdq_batch_add_range(inv->smmu, cmds, cmd, iova, size, granule,
-				      inv->pgsize);
+	arm_smmu_cmdq_batch_add_range(inv->master->smmu, cmds, cmd, iova, size,
+				      granule, inv->pgsize);
 }
 
 static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
 					   struct arm_smmu_inv *next)
 {
 	/* Changing smmu means changing command queue */
-	if (cur->smmu != next->smmu)
+	if (cur->master->smmu != next->master->smmu)
 		return true;
 	/* The batch for S2 TLBI must be done before nested S1 ASIDs */
 	if (cur->type != INV_TYPE_S2_VMID_S1_CLEAR &&
@@ -2671,7 +2672,7 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
 		if (READ_ONCE(cur->users))
 			break;
 	while (cur != end) {
-		struct arm_smmu_device *smmu = cur->smmu;
+		struct arm_smmu_device *smmu = cur->master->smmu;
 		struct arm_smmu_cmdq_ent cmd = {
 			/*
 			 * Pick size_opcode to run arm_smmu_get_cmdq(). This can
@@ -2700,7 +2701,8 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
 			break;
 		case INV_TYPE_S2_VMID_S1_CLEAR:
 			/* CMDQ_OP_TLBI_S12_VMALL already flushed S1 entries */
-			if (arm_smmu_inv_size_too_big(cur->smmu, size, granule))
+			if (arm_smmu_inv_size_too_big(cur->master->smmu, size,
+						      granule))
 				continue;
 			cmd.tlbi.vmid = cur->id;
 			arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
@@ -3225,7 +3227,7 @@ arm_smmu_master_build_inv(struct arm_smmu_master *master,
 {
 	struct arm_smmu_invs *build_invs = master->build_invs;
 	struct arm_smmu_inv *cur, inv = {
-		.smmu = master->smmu,
+		.master = master,
 		.type = type,
 		.id = id,
 		.pgsize = pgsize,
@@ -3261,6 +3263,7 @@ arm_smmu_master_build_inv(struct arm_smmu_master *master,
 	case INV_TYPE_ATS:
 	case INV_TYPE_ATS_FULL:
 		cur->size_opcode = cur->nsize_opcode = CMDQ_OP_ATC_INV;
+		cur->master = master;
 		cur->ssid = ssid;
 		break;
 	}
@@ -3457,7 +3460,7 @@ static void arm_smmu_inv_flush_iotlb_tag(struct arm_smmu_inv *inv)
 	}
 
 	cmd.opcode = inv->nsize_opcode;
-	arm_smmu_cmdq_issue_cmd_with_sync(inv->smmu, &cmd);
+	arm_smmu_cmdq_issue_cmd_with_sync(inv->master->smmu, &cmd);
 }
 
 /* Should be installed after arm_smmu_install_ste_for_dev() */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v2 6/7] iommu/arm-smmu-v3: Introduce master->ats_broken flag
  2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (4 preceding siblings ...)
  2026-03-17 19:15 ` [PATCH v2 5/7] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv Nicolin Chen
@ 2026-03-17 19:15 ` Nicolin Chen
  2026-03-18  7:39   ` Tian, Kevin
  2026-03-17 19:15 ` [PATCH v2 7/7] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
  2026-03-18  7:47 ` [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon " Tian, Kevin
  7 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-17 19:15 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, xueshuai, kevin.tian,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

The flag will be set when IOMMU cannot trust device's ATS function. E.g.,
when ATC invalidation request to the device times out.

Once it is set, unsupport the ATS feature to prevent data corruption, and
skip further ATC invalidation commands to avoid new timeouts.

Unset the flag when the device finishes a reset for recovery.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 28 +++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index cb83ea1f3407f..0a0a88bb60e65 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -941,6 +941,7 @@ struct arm_smmu_master {
 	/* Locked by the iommu core using the group mutex */
 	struct arm_smmu_ctx_desc_cfg	cd_table;
 	unsigned int			num_streams;
+	bool				ats_broken;
 	bool				ats_enabled : 1;
 	bool				ste_ats_enabled : 1;
 	bool				stall_enabled;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index aa42fe39d66b6..366d812668011 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2502,6 +2502,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
 	struct arm_smmu_cmdq_ent cmd;
 	struct arm_smmu_cmdq_batch cmds;
 
+	/* Do not issue ATC_INV that will definitely time out */
+	if (READ_ONCE(master->ats_broken))
+		return 0;
+
 	arm_smmu_atc_inv_to_cmd(ssid, 0, 0, &cmd);
 
 	arm_smmu_cmdq_batch_init(master->smmu, &cmds, &cmd);
@@ -2708,11 +2712,17 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
 			arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
 			break;
 		case INV_TYPE_ATS:
+			/* Do not issue ATC_INV that will definitely time out */
+			if (READ_ONCE(cur->master->ats_broken))
+				continue;
 			arm_smmu_atc_inv_to_cmd(cur->ssid, iova, size, &cmd);
 			cmd.atc.sid = cur->id;
 			arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
 			break;
 		case INV_TYPE_ATS_FULL:
+			/* Do not issue ATC_INV that will definitely time out */
+			if (READ_ONCE(cur->master->ats_broken))
+				continue;
 			arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
 			cmd.atc.sid = cur->id;
 			arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
@@ -3048,6 +3058,15 @@ void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
 	}
 }
 
+static void arm_smmu_reset_device_done(struct device *dev)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+	if (WARN_ON(!master))
+		return;
+	WRITE_ONCE(master->ats_broken, false);
+}
+
 static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
 {
 	struct device *dev = master->dev;
@@ -3060,6 +3079,14 @@ static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
 	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS))
 		return false;
 
+	/*
+	 * Reject any new ATS request because ATC invalidation was timed out.
+	 * The PCI device should go through a recovery (reset) and notify the
+	 * SMMUv3 driver via a reset_device_done callback.
+	 */
+	if (READ_ONCE(master->ats_broken))
+		return false;
+
 	return dev_is_pci(dev) && pci_ats_supported(to_pci_dev(dev));
 }
 
@@ -4392,6 +4419,7 @@ static const struct iommu_ops arm_smmu_ops = {
 	.domain_alloc_paging_flags = arm_smmu_domain_alloc_paging_flags,
 	.probe_device		= arm_smmu_probe_device,
 	.release_device		= arm_smmu_release_device,
+	.reset_device_done	= arm_smmu_reset_device_done,
 	.device_group		= arm_smmu_device_group,
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 6/7] iommu/arm-smmu-v3: Introduce master->ats_broken flag
  2026-03-17 19:15 ` [PATCH v2 6/7] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
@ 2026-03-18  7:39   ` Tian, Kevin
  2026-03-18 20:00     ` Nicolin Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-18  7:39 UTC (permalink / raw)
  To: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, jgg@nvidia.com
  Cc: rafael@kernel.org, lenb@kernel.org, praan@google.com,
	baolu.lu@linux.intel.com, xueshuai@linux.alibaba.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-pci@vger.kernel.org, Vikram Sethi

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, March 18, 2026 3:16 AM
> 
> @@ -3060,6 +3079,14 @@ static bool arm_smmu_ats_supported(struct
> arm_smmu_master *master)
>  	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS))
>  		return false;
> 
> +	/*
> +	 * Reject any new ATS request because ATC invalidation was timed
> out.
> +	 * The PCI device should go through a recovery (reset) and notify the
> +	 * SMMUv3 driver via a reset_device_done callback.
> +	 */
> +	if (READ_ONCE(master->ats_broken))
> +		return false;
> +

"Reject any new ATS request" means any new request to enable ATS
on this device, instead of rejecting any new ATS translation request,
correct? next patch does the actual work to block ATS...

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 6/7] iommu/arm-smmu-v3: Introduce master->ats_broken flag
  2026-03-18  7:39   ` Tian, Kevin
@ 2026-03-18 20:00     ` Nicolin Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-18 20:00 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Wed, Mar 18, 2026 at 07:39:53AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, March 18, 2026 3:16 AM
> > 
> > @@ -3060,6 +3079,14 @@ static bool arm_smmu_ats_supported(struct
> > arm_smmu_master *master)
> >  	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS))
> >  		return false;
> > 
> > +	/*
> > +	 * Reject any new ATS request because ATC invalidation was timed
> > out.
> > +	 * The PCI device should go through a recovery (reset) and notify the
> > +	 * SMMUv3 driver via a reset_device_done callback.
> > +	 */
> > +	if (READ_ONCE(master->ats_broken))
> > +		return false;
> > +
> 
> "Reject any new ATS request" means any new request to enable ATS
> on this device, instead of rejecting any new ATS translation request,
> correct? next patch does the actual work to block ATS...

Yes. We won't call pci_enable_ats() due to !arm_smmu_ats_supported.
So, there shouldn't be any new ATS request any more.

I will change it to:
	/*
	 * Do not enable ATS if master->ats_broken is set. The PCI device should
	 * go through a recovery (reset) that shall notify the SMMUv3 driver via
	 * a reset_device_done callback.
	 */

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 7/7] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
  2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (5 preceding siblings ...)
  2026-03-17 19:15 ` [PATCH v2 6/7] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
@ 2026-03-17 19:15 ` Nicolin Chen
  2026-03-19  2:56   ` Shuai Xue
  2026-03-18  7:47 ` [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon " Tian, Kevin
  7 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-17 19:15 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, xueshuai, kevin.tian,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

Currently, when GERROR_CMDQ_ERR occurs, the arm_smmu_cmdq_skip_err() won't
do anything for the CMDQ_ERR_CERROR_ATC_INV_IDX.

When a device wasn't responsive to an ATC invalidation request, this often
results in constant CMDQ errors:
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb84): ATC invalidate timeout
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb88): ATC invalidate timeout
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
  ...

An ATC invalidation timeout indicates that the device failed to respond to
a protocol-critical coherency request, which means that device's internal
ATS state is desynchronized from the SMMU.

Furthermore, ignoring the timeout leaves the system in an unsafe state, as
the device cache may retain stale ATC entries for memory pages that the OS
has already reclaimed and reassigned. This might lead to data corruption.

Isolate the device that is confirmed to be unresponsive by a surgical STE
update to unset its EATS bit so as to reject any further ATS transaction,
which could corrupt the memory.

Also, set the master->ats_broken flag that is revertible after the device
completes a reset. This flag avoids further ATS requests and invalidations
from happening.

Finally, report this broken device to the IOMMU core to isolate the device
in the core level too.

For batched ATC_INV commands, SMMU hardware only reports a timeout at the
CMD_SYNC, which could follow the batch issued for multiple devices. So, it
isn't straightforward to identify which command in a batch resulted in the
timeout. Fortunately, the invs array has a sorted list of ATC entries. So,
the issued batch must be sorted as well. This makes it possible to bisect
the batch to retry the command per Stream ID and identify the master.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 92 ++++++++++++++++++++-
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 366d812668011..418ee2f40d96d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -107,6 +107,8 @@ static const char * const event_class_str[] = {
 	[3] = "Reserved",
 };
 
+static struct arm_smmu_ste *
+arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid);
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
 
 static void parse_driver_options(struct arm_smmu_device *smmu)
@@ -2495,10 +2497,43 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
 	cmd->atc.size	= log2_span;
 }
 
+static void arm_smmu_disable_eats_for_sid(struct arm_smmu_device *smmu,
+					  struct arm_smmu_cmdq *cmdq, u32 sid)
+{
+	struct arm_smmu_cmdq_ent ent = {
+		.opcode = CMDQ_OP_CFGI_STE,
+		.cfgi	= {
+			.sid = sid,
+			.leaf = true,
+		},
+	};
+	struct arm_smmu_ste *step;
+	u64 cmd[CMDQ_ENT_DWORDS];
+
+	step = arm_smmu_get_step_for_sid(smmu, sid);
+	WRITE_ONCE(step->data[1],
+		   READ_ONCE(step->data[1]) & cpu_to_le64(~STRTAB_STE_1_EATS));
+
+	arm_smmu_cmdq_build_cmd(cmd, &ent);
+	arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, cmd, 1, true);
+}
+
+static void arm_smmu_master_disable_ats(struct arm_smmu_master *master,
+					struct arm_smmu_cmdq *cmdq)
+{
+	int i;
+
+	for (i = 0; i < master->num_streams; i++)
+		arm_smmu_disable_eats_for_sid(master->smmu, cmdq,
+					      master->streams[i].id);
+	WRITE_ONCE(master->ats_broken, true);
+	iommu_report_device_broken(master->dev);
+}
+
 static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
 				   ioasid_t ssid)
 {
-	int i;
+	int i, ret;
 	struct arm_smmu_cmdq_ent cmd;
 	struct arm_smmu_cmdq_batch cmds;
 
@@ -2514,7 +2549,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
 		arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
 	}
 
-	return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+	ret = arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+	if (ret)
+		arm_smmu_master_disable_ats(master, cmds.cmdq);
+	return ret;
 }
 
 /* IO_PGTABLE API */
@@ -2661,6 +2699,53 @@ static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
 	return false;
 }
 
+static void arm_smmu_invs_disable_ats(struct arm_smmu_invs *invs,
+				      struct arm_smmu_cmdq *cmdq,
+				      struct arm_smmu_device *smmu, u32 sid)
+{
+	struct arm_smmu_inv *cur;
+	size_t i;
+
+	arm_smmu_invs_for_each_entry(invs, i, cur) {
+		if (cur->master->smmu == smmu && arm_smmu_inv_is_ats(cur) &&
+		    cur->id == sid) {
+			arm_smmu_master_disable_ats(cur->master, cmdq);
+			break;
+		}
+	}
+}
+
+static void arm_smmu_cmdq_batch_retry(struct arm_smmu_device *smmu,
+				      struct arm_smmu_invs *invs,
+				      struct arm_smmu_cmdq_batch *cmds)
+{
+	u64 atc[CMDQ_ENT_DWORDS] = {0};
+	int i;
+
+	/* Only a timed out ATC_INV command needs a retry */
+	if (!invs->has_ats)
+		return;
+
+	for (i = 0; i < cmds->num * CMDQ_ENT_DWORDS; i += CMDQ_ENT_DWORDS) {
+		struct arm_smmu_cmdq *cmdq = cmds->cmdq;
+		u32 sid;
+
+		/* Only need to retry ATC invalidations */
+		if (FIELD_GET(CMDQ_0_OP, cmds->cmds[i]) != CMDQ_OP_ATC_INV)
+			continue;
+
+		/* Only need to retry with one ATC_INV per Stream ID (device) */
+		sid = FIELD_GET(CMDQ_ATC_0_SID, cmds->cmds[i]);
+		if (atc[0] && sid == FIELD_GET(CMDQ_ATC_0_SID, atc[0]))
+			continue;
+
+		atc[0] = cmds->cmds[i];
+		atc[1] = cmds->cmds[i + 1];
+		if (arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, atc, 1, true))
+			arm_smmu_invs_disable_ats(invs, cmdq, smmu, sid);
+	}
+}
+
 static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
 					unsigned long iova, size_t size,
 					unsigned int granule, bool leaf)
@@ -2739,7 +2824,8 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
 
 		if (cmds.num &&
 		    (next == end || arm_smmu_invs_end_batch(cur, next))) {
-			arm_smmu_cmdq_batch_submit(smmu, &cmds);
+			if (arm_smmu_cmdq_batch_submit(smmu, &cmds))
+				arm_smmu_cmdq_batch_retry(smmu, invs, &cmds);
 			cmds.num = 0;
 		}
 		cur = next;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 7/7] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
  2026-03-17 19:15 ` [PATCH v2 7/7] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
@ 2026-03-19  2:56   ` Shuai Xue
  2026-03-19  3:26     ` Nicolin Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Shuai Xue @ 2026-03-19  2:56 UTC (permalink / raw)
  To: Nicolin Chen, will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, baolu.lu, kevin.tian, linux-arm-kernel,
	iommu, linux-kernel, linux-acpi, linux-pci, vsethi



On 3/18/26 3:15 AM, Nicolin Chen wrote:
> Currently, when GERROR_CMDQ_ERR occurs, the arm_smmu_cmdq_skip_err() won't
> do anything for the CMDQ_ERR_CERROR_ATC_INV_IDX.
> 
> When a device wasn't responsive to an ATC invalidation request, this often
> results in constant CMDQ errors:
>    unexpected global error reported (0x00000001), this could be serious
>    CMDQ error (cons 0x0302bb84): ATC invalidate timeout
>    unexpected global error reported (0x00000001), this could be serious
>    CMDQ error (cons 0x0302bb88): ATC invalidate timeout
>    unexpected global error reported (0x00000001), this could be serious
>    CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
>    ...
> 
> An ATC invalidation timeout indicates that the device failed to respond to
> a protocol-critical coherency request, which means that device's internal
> ATS state is desynchronized from the SMMU.
> 
> Furthermore, ignoring the timeout leaves the system in an unsafe state, as
> the device cache may retain stale ATC entries for memory pages that the OS
> has already reclaimed and reassigned. This might lead to data corruption.
> 
> Isolate the device that is confirmed to be unresponsive by a surgical STE
> update to unset its EATS bit so as to reject any further ATS transaction,
> which could corrupt the memory.
> 
> Also, set the master->ats_broken flag that is revertible after the device
> completes a reset. This flag avoids further ATS requests and invalidations
> from happening.
> 
> Finally, report this broken device to the IOMMU core to isolate the device
> in the core level too.
> 
> For batched ATC_INV commands, SMMU hardware only reports a timeout at the
> CMD_SYNC, which could follow the batch issued for multiple devices. So, it
> isn't straightforward to identify which command in a batch resulted in the
> timeout. Fortunately, the invs array has a sorted list of ATC entries. So,
> the issued batch must be sorted as well. This makes it possible to bisect
> the batch to retry the command per Stream ID and identify the master.

Nit: The implementation is a linear per-SID retry, not a binary
search / bisection. Suggest rewording to:

   "retry the ATC_INV command for each unique Stream ID in the batch
    to identify the unresponsive master"


> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 92 ++++++++++++++++++++-
>   1 file changed, 89 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 366d812668011..418ee2f40d96d 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -107,6 +107,8 @@ static const char * const event_class_str[] = {
>   	[3] = "Reserved",
>   };
>   
> +static struct arm_smmu_ste *
> +arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid);
>   static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
>   
>   static void parse_driver_options(struct arm_smmu_device *smmu)
> @@ -2495,10 +2497,43 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
>   	cmd->atc.size	= log2_span;
>   }
>   
> +static void arm_smmu_disable_eats_for_sid(struct arm_smmu_device *smmu,
> +					  struct arm_smmu_cmdq *cmdq, u32 sid)
> +{
> +	struct arm_smmu_cmdq_ent ent = {
> +		.opcode = CMDQ_OP_CFGI_STE,
> +		.cfgi	= {
> +			.sid = sid,
> +			.leaf = true,
> +		},
> +	};
> +	struct arm_smmu_ste *step;
> +	u64 cmd[CMDQ_ENT_DWORDS];
> +
> +	step = arm_smmu_get_step_for_sid(smmu, sid);
> +	WRITE_ONCE(step->data[1],
> +		   READ_ONCE(step->data[1]) & cpu_to_le64(~STRTAB_STE_1_EATS));


This non-atomic read-modify-write on step->data[1] can race with the
normal STE installation path (arm_smmu_write_entry → entry_set →
WRITE_ONCE).

The error path runs from:

   __arm_smmu_domain_inv_range()  (data path, no group->mutex)
     → arm_smmu_cmdq_batch_retry()
       → arm_smmu_master_disable_ats()
         → arm_smmu_disable_eats_for_sid()   ← NO locks on STE

The normal STE path runs from:

   iommu_attach_device()
     → mutex_lock(&group->mutex)
       → arm_smmu_attach_dev()
         → mutex_lock(&arm_smmu_asid_lock)
           → arm_smmu_install_ste_for_dev()
             → arm_smmu_write_entry()        ← holds both mutexes

Since the error path holds neither group->mutex nor arm_smmu_asid_lock,
the following race is possible:

   CPU A (error path):             CPU B (attach path):
   READ data[1] = X
                                   WRITE data[1] = Y (new STE config)
   WRITE data[1] = X & ~EATS
                                   // Y is lost

This could clobber a concurrent STE update from the attach path.


> +
> +	arm_smmu_cmdq_build_cmd(cmd, &ent);
> +	arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, cmd, 1, true);
> +}
> +
> +static void arm_smmu_master_disable_ats(struct arm_smmu_master *master,
> +					struct arm_smmu_cmdq *cmdq)
> +{
> +	int i;
> +
> +	for (i = 0; i < master->num_streams; i++)
> +		arm_smmu_disable_eats_for_sid(master->smmu, cmdq,
> +					      master->streams[i].id);
> +	WRITE_ONCE(master->ats_broken, true);
> +	iommu_report_device_broken(master->dev);
> +}
> +
>   static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
>   				   ioasid_t ssid)
>   {
> -	int i;
> +	int i, ret;
>   	struct arm_smmu_cmdq_ent cmd;
>   	struct arm_smmu_cmdq_batch cmds;
>   
> @@ -2514,7 +2549,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
>   		arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
>   	}
>   
> -	return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
> +	ret = arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
> +	if (ret)
> +		arm_smmu_master_disable_ats(master, cmds.cmdq);
> +	return ret;
>   }
>   
>   /* IO_PGTABLE API */
> @@ -2661,6 +2699,53 @@ static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
>   	return false;
>   }
>   
> +static void arm_smmu_invs_disable_ats(struct arm_smmu_invs *invs,
> +				      struct arm_smmu_cmdq *cmdq,
> +				      struct arm_smmu_device *smmu, u32 sid)
> +{
> +	struct arm_smmu_inv *cur;
> +	size_t i;
> +
> +	arm_smmu_invs_for_each_entry(invs, i, cur) {
> +		if (cur->master->smmu == smmu && arm_smmu_inv_is_ats(cur) &&
> +		    cur->id == sid) {
> +			arm_smmu_master_disable_ats(cur->master, cmdq);
> +			break;
> +		}
> +	}
> +}
> +
> +static void arm_smmu_cmdq_batch_retry(struct arm_smmu_device *smmu,
> +				      struct arm_smmu_invs *invs,
> +				      struct arm_smmu_cmdq_batch *cmds)
> +{
> +	u64 atc[CMDQ_ENT_DWORDS] = {0};
> +	int i;
> +
> +	/* Only a timed out ATC_INV command needs a retry */
> +	if (!invs->has_ats)
> +		return;
> +
> +	for (i = 0; i < cmds->num * CMDQ_ENT_DWORDS; i += CMDQ_ENT_DWORDS) {
> +		struct arm_smmu_cmdq *cmdq = cmds->cmdq;
> +		u32 sid;
> +
> +		/* Only need to retry ATC invalidations */
> +		if (FIELD_GET(CMDQ_0_OP, cmds->cmds[i]) != CMDQ_OP_ATC_INV)
> +			continue;
> +
> +		/* Only need to retry with one ATC_INV per Stream ID (device) */
> +		sid = FIELD_GET(CMDQ_ATC_0_SID, cmds->cmds[i]);
> +		if (atc[0] && sid == FIELD_GET(CMDQ_ATC_0_SID, atc[0]))
> +			continue;
> +
> +		atc[0] = cmds->cmds[i];
> +		atc[1] = cmds->cmds[i + 1];
> +		if (arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, atc, 1, true))
> +			arm_smmu_invs_disable_ats(invs, cmdq, smmu, sid);
> +	}
> +}
> +
>   static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
>   					unsigned long iova, size_t size,
>   					unsigned int granule, bool leaf)
> @@ -2739,7 +2824,8 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
>   
>   		if (cmds.num &&
>   		    (next == end || arm_smmu_invs_end_batch(cur, next))) {
> -			arm_smmu_cmdq_batch_submit(smmu, &cmds);
> +			if (arm_smmu_cmdq_batch_submit(smmu, &cmds))
> +				arm_smmu_cmdq_batch_retry(smmu, invs, &cmds);
>   			cmds.num = 0;
>   		}
>   		cur = next;


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 7/7] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
  2026-03-19  2:56   ` Shuai Xue
@ 2026-03-19  3:26     ` Nicolin Chen
  2026-03-19  7:41       ` Shuai Xue
  0 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-19  3:26 UTC (permalink / raw)
  To: Shuai Xue
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	baolu.lu, kevin.tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Thu, Mar 19, 2026 at 10:56:43AM +0800, Shuai Xue wrote:
> On 3/18/26 3:15 AM, Nicolin Chen wrote:
> > For batched ATC_INV commands, SMMU hardware only reports a timeout at the
> > CMD_SYNC, which could follow the batch issued for multiple devices. So, it
> > isn't straightforward to identify which command in a batch resulted in the
> > timeout. Fortunately, the invs array has a sorted list of ATC entries. So,
> > the issued batch must be sorted as well. This makes it possible to bisect
> > the batch to retry the command per Stream ID and identify the master.
> 
> Nit: The implementation is a linear per-SID retry, not a binary
> search / bisection. Suggest rewording to:
> 
>   "retry the ATC_INV command for each unique Stream ID in the batch
>    to identify the unresponsive master"

You are right. And that sounds OK.

> > +	step = arm_smmu_get_step_for_sid(smmu, sid);
> > +	WRITE_ONCE(step->data[1],
> > +		   READ_ONCE(step->data[1]) & cpu_to_le64(~STRTAB_STE_1_EATS));
> 
> 
> This non-atomic read-modify-write on step->data[1] can race with the
> normal STE installation path (arm_smmu_write_entry → entry_set →
> WRITE_ONCE).
> 
> The error path runs from:
> 
>   __arm_smmu_domain_inv_range()  (data path, no group->mutex)
>     → arm_smmu_cmdq_batch_retry()
>       → arm_smmu_master_disable_ats()
>         → arm_smmu_disable_eats_for_sid()   ← NO locks on STE
> 
> The normal STE path runs from:
> 
>   iommu_attach_device()
>     → mutex_lock(&group->mutex)
>       → arm_smmu_attach_dev()
>         → mutex_lock(&arm_smmu_asid_lock)
>           → arm_smmu_install_ste_for_dev()
>             → arm_smmu_write_entry()        ← holds both mutexes
> 
> Since the error path holds neither group->mutex nor arm_smmu_asid_lock,
> the following race is possible:

Because invalidations can be in atomic context so we can't hold
those mutex locks.

>   CPU A (error path):             CPU B (attach path):
>   READ data[1] = X
>                                   WRITE data[1] = Y (new STE config)
>   WRITE data[1] = X & ~EATS
>                                   // Y is lost
> 
> This could clobber a concurrent STE update from the attach path.

Oh, that's true. Maybe this:
	__le64 new, old = READ_ONCE(step->data[1]);
	[...]
	do {
		new = old & cpu_to_le64(~STRTAB_STE_1_EATS);
	} while (!try_cmpxchg64(&step->data[1], &old, new));
?

Thanks!
Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 7/7] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
  2026-03-19  3:26     ` Nicolin Chen
@ 2026-03-19  7:41       ` Shuai Xue
  0 siblings, 0 replies; 47+ messages in thread
From: Shuai Xue @ 2026-03-19  7:41 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, praan,
	baolu.lu, kevin.tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi



On 3/19/26 11:26 AM, Nicolin Chen wrote:
> On Thu, Mar 19, 2026 at 10:56:43AM +0800, Shuai Xue wrote:
>> On 3/18/26 3:15 AM, Nicolin Chen wrote:
>>> For batched ATC_INV commands, SMMU hardware only reports a timeout at the
>>> CMD_SYNC, which could follow the batch issued for multiple devices. So, it
>>> isn't straightforward to identify which command in a batch resulted in the
>>> timeout. Fortunately, the invs array has a sorted list of ATC entries. So,
>>> the issued batch must be sorted as well. This makes it possible to bisect
>>> the batch to retry the command per Stream ID and identify the master.
>>
>> Nit: The implementation is a linear per-SID retry, not a binary
>> search / bisection. Suggest rewording to:
>>
>>    "retry the ATC_INV command for each unique Stream ID in the batch
>>     to identify the unresponsive master"
> 
> You are right. And that sounds OK.
> 
>>> +	step = arm_smmu_get_step_for_sid(smmu, sid);
>>> +	WRITE_ONCE(step->data[1],
>>> +		   READ_ONCE(step->data[1]) & cpu_to_le64(~STRTAB_STE_1_EATS));
>>
>>
>> This non-atomic read-modify-write on step->data[1] can race with the
>> normal STE installation path (arm_smmu_write_entry → entry_set →
>> WRITE_ONCE).
>>
>> The error path runs from:
>>
>>    __arm_smmu_domain_inv_range()  (data path, no group->mutex)
>>      → arm_smmu_cmdq_batch_retry()
>>        → arm_smmu_master_disable_ats()
>>          → arm_smmu_disable_eats_for_sid()   ← NO locks on STE
>>
>> The normal STE path runs from:
>>
>>    iommu_attach_device()
>>      → mutex_lock(&group->mutex)
>>        → arm_smmu_attach_dev()
>>          → mutex_lock(&arm_smmu_asid_lock)
>>            → arm_smmu_install_ste_for_dev()
>>              → arm_smmu_write_entry()        ← holds both mutexes
>>
>> Since the error path holds neither group->mutex nor arm_smmu_asid_lock,
>> the following race is possible:
> 
> Because invalidations can be in atomic context so we can't hold
> those mutex locks.
> 
>>    CPU A (error path):             CPU B (attach path):
>>    READ data[1] = X
>>                                    WRITE data[1] = Y (new STE config)
>>    WRITE data[1] = X & ~EATS
>>                                    // Y is lost
>>
>> This could clobber a concurrent STE update from the attach path.
> 
> Oh, that's true. Maybe this:
> 	__le64 new, old = READ_ONCE(step->data[1]);
> 	[...]
> 	do {
> 		new = old & cpu_to_le64(~STRTAB_STE_1_EATS);
> 	} while (!try_cmpxchg64(&step->data[1], &old, new));
> ?

Yes, the cmpxchg loop looks correct to me.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
  2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (6 preceding siblings ...)
  2026-03-17 19:15 ` [PATCH v2 7/7] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
@ 2026-03-18  7:47 ` Tian, Kevin
  2026-03-18 20:04   ` Nicolin Chen
  7 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-18  7:47 UTC (permalink / raw)
  To: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, jgg@nvidia.com
  Cc: rafael@kernel.org, lenb@kernel.org, praan@google.com,
	baolu.lu@linux.intel.com, xueshuai@linux.alibaba.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-pci@vger.kernel.org, Vikram Sethi

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, March 18, 2026 3:16 AM
> 
> Hi all,
> 
> This series addresses a critical vulnerability and stability issue where an
> unresponsive PCIe device failing to process ATC (Address Translation Cache)
> invalidation requests leads to silent data corruption and continuous SMMU
> CMDQ error spam.
> 

None of the patches in this series contains a Fixed tag and cc stable.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
  2026-03-18  7:47 ` [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon " Tian, Kevin
@ 2026-03-18 20:04   ` Nicolin Chen
  2026-03-19  2:29     ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-18 20:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Wed, Mar 18, 2026 at 07:47:18AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, March 18, 2026 3:16 AM
> > 
> > Hi all,
> > 
> > This series addresses a critical vulnerability and stability issue where an
> > unresponsive PCIe device failing to process ATC (Address Translation Cache)
> > invalidation requests leads to silent data corruption and continuous SMMU
> > CMDQ error spam.
> > 
> 
> None of the patches in this series contains a Fixed tag and cc stable.

Hmm, I guess AI overly polished the cover letter so it sounds too
strong?

This is essentially a vulnerability (potential memory corruption).
And none of these patches actually fixes any regression. The PATCH
7 even requires the arm_smmu_invs series which has not been merged
yet :-/

Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
  2026-03-18 20:04   ` Nicolin Chen
@ 2026-03-19  2:29     ` Tian, Kevin
  2026-03-19  3:10       ` Nicolin Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-19  2:29 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 19, 2026 4:05 AM
> 
> On Wed, Mar 18, 2026 at 07:47:18AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, March 18, 2026 3:16 AM
> > >
> > > Hi all,
> > >
> > > This series addresses a critical vulnerability and stability issue where an
> > > unresponsive PCIe device failing to process ATC (Address Translation
> Cache)
> > > invalidation requests leads to silent data corruption and continuous
> SMMU
> > > CMDQ error spam.
> > >
> >
> > None of the patches in this series contains a Fixed tag and cc stable.
> 
> Hmm, I guess AI overly polished the cover letter so it sounds too
> strong?
> 
> This is essentially a vulnerability (potential memory corruption).
> And none of these patches actually fixes any regression. The PATCH
> 7 even requires the arm_smmu_invs series which has not been merged
> yet :-/
> 

Fixes tag and backporting are not just for regression. People certainly
want to see reported vulnerabilities fixed in stable kernels...


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
  2026-03-19  2:29     ` Tian, Kevin
@ 2026-03-19  3:10       ` Nicolin Chen
  2026-03-24  0:03         ` Jason Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Nicolin Chen @ 2026-03-19  3:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > This series addresses a critical vulnerability and stability issue where an
> > > > unresponsive PCIe device failing to process ATC (Address Translation
> > Cache)
> > > > invalidation requests leads to silent data corruption and continuous
> > SMMU
> > > > CMDQ error spam.
> > > >
> > >
> > > None of the patches in this series contains a Fixed tag and cc stable.
> > 
> > Hmm, I guess AI overly polished the cover letter so it sounds too
> > strong?
> > 
> > This is essentially a vulnerability (potential memory corruption).
> > And none of these patches actually fixes any regression. The PATCH
> > 7 even requires the arm_smmu_invs series which has not been merged
> > yet :-/
> > 
> 
> Fixes tag and backporting are not just for regression. People certainly
> want to see reported vulnerabilities fixed in stable kernels...

Well, maybe I'll just leave additional line telling people that this
can't be a bug "fix" because it's written on another unmerged series?

Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
  2026-03-19  3:10       ` Nicolin Chen
@ 2026-03-24  0:03         ` Jason Gunthorpe
  2026-03-24  1:30           ` Nicolin Chen
  2026-03-25  6:55           ` Tian, Kevin
  0 siblings, 2 replies; 47+ messages in thread
From: Jason Gunthorpe @ 2026-03-24  0:03 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Tian, Kevin, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > This series addresses a critical vulnerability and stability issue where an
> > > > > unresponsive PCIe device failing to process ATC (Address Translation
> > > Cache)
> > > > > invalidation requests leads to silent data corruption and continuous
> > > SMMU
> > > > > CMDQ error spam.
> > > > >
> > > >
> > > > None of the patches in this series contains a Fixed tag and cc stable.
> > > 
> > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > strong?
> > > 
> > > This is essentially a vulnerability (potential memory corruption).
> > > And none of these patches actually fixes any regression. The PATCH
> > > 7 even requires the arm_smmu_invs series which has not been merged
> > > yet :-/
> > > 
> > 
> > Fixes tag and backporting are not just for regression. People certainly
> > want to see reported vulnerabilities fixed in stable kernels...
> 
> Well, maybe I'll just leave additional line telling people that this
> can't be a bug "fix" because it's written on another unmerged series?

I think this is more of a feature (RAS support for SMMUv3) than a
specific fix.

Jason

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
  2026-03-24  0:03         ` Jason Gunthorpe
@ 2026-03-24  1:30           ` Nicolin Chen
  2026-03-25  6:55           ` Tian, Kevin
  1 sibling, 0 replies; 47+ messages in thread
From: Nicolin Chen @ 2026-03-24  1:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Mon, Mar 23, 2026 at 09:03:21PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> > On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > > This series addresses a critical vulnerability and stability issue where an
> > > > > > unresponsive PCIe device failing to process ATC (Address Translation
> > > > Cache)
> > > > > > invalidation requests leads to silent data corruption and continuous
> > > > SMMU
> > > > > > CMDQ error spam.
> > > > > >
> > > > >
> > > > > None of the patches in this series contains a Fixed tag and cc stable.
> > > > 
> > > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > > strong?
> > > > 
> > > > This is essentially a vulnerability (potential memory corruption).
> > > > And none of these patches actually fixes any regression. The PATCH
> > > > 7 even requires the arm_smmu_invs series which has not been merged
> > > > yet :-/
> > > > 
> > > 
> > > Fixes tag and backporting are not just for regression. People certainly
> > > want to see reported vulnerabilities fixed in stable kernels...
> > 
> > Well, maybe I'll just leave additional line telling people that this
> > can't be a bug "fix" because it's written on another unmerged series?
> 
> I think this is more of a feature (RAS support for SMMUv3) than a
> specific fix.

Adding that to the cover-letter. Thanks for the input.

Nicolin

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
  2026-03-24  0:03         ` Jason Gunthorpe
  2026-03-24  1:30           ` Nicolin Chen
@ 2026-03-25  6:55           ` Tian, Kevin
  2026-03-25 14:12             ` Jason Gunthorpe
  1 sibling, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2026-03-25  6:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	bhelgaas@google.com, rafael@kernel.org, lenb@kernel.org,
	praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 24, 2026 8:03 AM
> 
> On Wed, Mar 18, 2026 at 08:10:01PM -0700, Nicolin Chen wrote:
> > On Thu, Mar 19, 2026 at 02:29:38AM +0000, Tian, Kevin wrote:
> > > > > > This series addresses a critical vulnerability and stability issue where
> an
> > > > > > unresponsive PCIe device failing to process ATC (Address Translation
> > > > Cache)
> > > > > > invalidation requests leads to silent data corruption and continuous
> > > > SMMU
> > > > > > CMDQ error spam.
> > > > > >
> > > > >
> > > > > None of the patches in this series contains a Fixed tag and cc stable.
> > > >
> > > > Hmm, I guess AI overly polished the cover letter so it sounds too
> > > > strong?
> > > >
> > > > This is essentially a vulnerability (potential memory corruption).
> > > > And none of these patches actually fixes any regression. The PATCH
> > > > 7 even requires the arm_smmu_invs series which has not been merged
> > > > yet :-/
> > > >
> > >
> > > Fixes tag and backporting are not just for regression. People certainly
> > > want to see reported vulnerabilities fixed in stable kernels...
> >
> > Well, maybe I'll just leave additional line telling people that this
> > can't be a bug "fix" because it's written on another unmerged series?
> 
> I think this is more of a feature (RAS support for SMMUv3) than a
> specific fix.
> 

Not a RAS guy, but below is what I got from AI:

"
RAS improvements typically involve better error reporting, graceful
degradation, or improved recovery - but they usually don't involve
scenarios where the system continues operating with compromised
security assumptions."

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
  2026-03-25  6:55           ` Tian, Kevin
@ 2026-03-25 14:12             ` Jason Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Gunthorpe @ 2026-03-25 14:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	joro@8bytes.org, bhelgaas@google.com, rafael@kernel.org,
	lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
	xueshuai@linux.alibaba.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
	Vikram Sethi

On Wed, Mar 25, 2026 at 06:55:40AM +0000, Tian, Kevin wrote:
> > I think this is more of a feature (RAS support for SMMUv3) than a
> > specific fix.
> > 
> 
> Not a RAS guy, but below is what I got from AI:
> 
> "
> RAS improvements typically involve better error reporting, graceful
> degradation, or improved recovery - but they usually don't involve
> scenarios where the system continues operating with compromised
> security assumptions."

Right, so currently there is no RAS in smmuv3, if it hits this error
it continues with "compromised security assumptions". Adding RAS
support is to avoid this.

Jason


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2026-03-25 14:12 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-17 19:15 [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
2026-03-17 19:15 ` [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
2026-03-18  7:21   ` Tian, Kevin
2026-03-18 20:16     ` Nicolin Chen
2026-03-18  8:02   ` Shuai Xue
2026-03-18 20:27     ` Nicolin Chen
2026-03-17 19:15 ` [PATCH v2 2/7] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
2026-03-18  5:59   ` Baolu Lu
2026-03-18 18:42     ` Nicolin Chen
2026-03-17 19:15 ` [PATCH v2 3/7] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
2026-03-18  6:13   ` Baolu Lu
2026-03-19  1:31     ` Nicolin Chen
2026-03-18  7:31   ` Tian, Kevin
2026-03-19  1:30     ` Nicolin Chen
2026-03-19  2:35       ` Tian, Kevin
2026-03-19  3:13         ` Nicolin Chen
2026-03-18 11:45   ` Shuai Xue
2026-03-18 20:29     ` Nicolin Chen
2026-03-17 19:15 ` [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
2026-03-18  7:36   ` Tian, Kevin
2026-03-18 19:26     ` Nicolin Chen
2026-03-18 22:06       ` Samiullah Khawaja
2026-03-19  3:08         ` Tian, Kevin
2026-03-19  3:12           ` Nicolin Chen
2026-03-23 23:51             ` Jason Gunthorpe
2026-03-18 22:02   ` Samiullah Khawaja
2026-03-18 23:23     ` Nicolin Chen
2026-03-19  0:08       ` Samiullah Khawaja
2026-03-19  1:15         ` Nicolin Chen
2026-03-23 23:57       ` Jason Gunthorpe
2026-03-24  1:21         ` Nicolin Chen
2026-03-17 19:15 ` [PATCH v2 5/7] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv Nicolin Chen
2026-03-17 19:15 ` [PATCH v2 6/7] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
2026-03-18  7:39   ` Tian, Kevin
2026-03-18 20:00     ` Nicolin Chen
2026-03-17 19:15 ` [PATCH v2 7/7] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
2026-03-19  2:56   ` Shuai Xue
2026-03-19  3:26     ` Nicolin Chen
2026-03-19  7:41       ` Shuai Xue
2026-03-18  7:47 ` [PATCH v2 0/7] iommu/arm-smmu-v3: Quarantine device upon " Tian, Kevin
2026-03-18 20:04   ` Nicolin Chen
2026-03-19  2:29     ` Tian, Kevin
2026-03-19  3:10       ` Nicolin Chen
2026-03-24  0:03         ` Jason Gunthorpe
2026-03-24  1:30           ` Nicolin Chen
2026-03-25  6:55           ` Tian, Kevin
2026-03-25 14:12             ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox