* [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
@ 2026-04-16 23:28 Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 01/11] PCI: Propagate FLR return values to callers Nicolin Chen
` (10 more replies)
0 siblings, 11 replies; 12+ messages in thread
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
Hi all,
This series addresses a critical vulnerability and stability issue where an
unresponsive PCIe device failing to process ATC (Address Translation Cache)
invalidation requests leads to silent data corruption and continuous SMMU
CMDQ error spam.
[ As Jason pointed out, because this series fundamentally introduces a new
RAS feature to quarantine and recover from hardware faults and relies on
a recently accepted SMMU driver rework, it is not treated as a standard
bug fix. Thus, none of the patches here carries a "Fixes" tag. ]
Currently, when an ATC invalidation times out, the SMMUv3 driver skips the
CMDQ_ERR_CERROR_ATC_INV_IDX error. This leaves the device's ATS cache state
desynchronized from the SMMU: the device cache may retain stale ATC entries
for memory pages that the OS has already reclaimed and reassigned, creating
a direct vector for data corruption. Furthermore, the driver might continue
issuing ATC_INV commands, resulting in constant CMDQ errors:
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb84): ATC invalidate timeout
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb88): ATC invalidate timeout
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
...
To resolve this, introduce a mechanism in the SMMUv3 driver and the IOMMU
core to quarantine a broken device, with the following preparatory changes:
- Tighten the semantics of pci_dev_reset_iommu_done() so that it recovers
the device only upon a successful hardware reset
- Introduce a reset_device_done op, allowing the core to signal the driver
when the physical hardware has been cleanly recovered (e.g., via AER or
a manual reset) so the quarantine can be lifted
- Utilize a per-group_device WQ via an iommu_report_device_broken() helper
On the SMMUv3 driver side, retry the timed-out ATC_INV batch to identify the
faulty device(s) via an atc_sync_timeouts tracker. Then perform a surgical
STE update and flag ATS as broken on the master, rejecting further ATS/ATC
requests at the hardware level and suppressing the timeout spam.
This is on Github:
https://github.com/nicolinc/iommufd/commits/smmuv3_atc_timeout-v3
Note that the patches are rebased on a bug fix under review:
https://lore.kernel.org/all/20260407194644.171304-1-nicolinc@nvidia.com/
Changelog
v3:
* Rebase on arm/smmu/updates branch + bug fix
* Update commit messages and inline comments
* [iommu] Drop unnecessary ops validation
* [iommu] Add missed function stub when !CONFIG_IOMMU_API
* [iommu] Change iommu_report_device_broken() to per gdev
* [iommu] Separate quarantine from pci_dev_reset_prepare()
* [iommu] Check reset failure in pci_dev_reset_iommu_done()
* [smmuv3] Fix STE update with try_cmpxchg64()
* [smmuv3] Fix "continue" bug when skipping ATC commands
* [smmuv3] Replace atomic_t prod_err with a lockless bitmap
* [smmuv3] Drop master->invs_domain; disable ATS per-master directly
* [smmuv3] Return -EIO for ATC timeout vs. -ETIMEDOUT for poll timeout
* [smmuv3] Replace INV_TYPE_ATS_DISABLED with per-master ats_broken flag
v2:
https://lore.kernel.org/all/cover.1773774441.git.nicolinc@nvidia.com/
* Rebase on arm_smmu_invs-v13 series [0]
* Bisect batched atc invalidation commands
* Drop the direct pci_reset_function() call
* Move the work queue from SMMUv3 to the core
* Perform a surgical STE update to disable EATS
* Wait for pci_dev_reset_iommu_done() to signal a recovery
v1:
https://lore.kernel.org/all/cover.1772686998.git.nicolinc@nvidia.com/
[0] https://lore.kernel.org/all/cover.1773733797.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (11):
PCI: Propagate FLR return values to callers
iommu: Pass in reset result to pci_dev_reset_iommu_done()
iommu: Add reset_device_done callback for hardware fault recovery
iommu: Add __iommu_group_block_device helper
iommu: Change group->devices to RCU-protected list
iommu: Defer __iommu_group_free_device() to be outside group->mutex
iommu: Add iommu_report_device_broken() to quarantine a broken device
iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv
iommu/arm-smmu-v3: Introduce master->ats_broken flag
iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 4 +-
include/linux/iommu.h | 15 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 34 ++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 193 +++++++++++-
drivers/iommu/iommu.c | 284 ++++++++++++++----
drivers/pci/pci-acpi.c | 2 +-
drivers/pci/pci.c | 10 +-
drivers/pci/quirks.c | 24 +-
8 files changed, 454 insertions(+), 112 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v3 01/11] PCI: Propagate FLR return values to callers
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 02/11] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
A reset failure implies that the device might be unreliable, e.g. its ATC
might still retain stale entries. Thus, the IOMMU layer cannot trust the
device to resume its ATS function, as doing so could lead to memory
corruption. So, pci_dev_reset_iommu_done() will not recover the device's
IOMMU pathway if the device reset fails.
Several functions in the pci_dev_reset_methods array invoke pcie_flr() but
do not check its return value. Propagate the return values to the callers.
Given that these functions have been running fine and the return values are
only needed by upcoming work, this is not treated as a bug fix.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/pci/quirks.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 48946cca4be72..05ce12b6b2f76 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3957,7 +3957,7 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, bool probe)
* supported.
*/
if (!probe)
- pcie_flr(dev);
+ return pcie_flr(dev);
return 0;
}
@@ -4015,6 +4015,7 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
{
u16 old_command;
u16 msix_flags;
+ int ret;
/*
* If this isn't a Chelsio T4-based device, return -ENOTTY indicating
@@ -4060,7 +4061,7 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
PCI_MSIX_FLAGS_ENABLE |
PCI_MSIX_FLAGS_MASKALL);
- pcie_flr(dev);
+ ret = pcie_flr(dev);
/*
* Restore the configuration information (BAR values, etc.) including
@@ -4069,7 +4070,7 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
*/
pci_restore_state(dev);
pci_write_config_word(dev, PCI_COMMAND, old_command);
- return 0;
+ return ret;
}
#define PCI_DEVICE_ID_INTEL_82599_SFP_VF 0x10ed
@@ -4152,9 +4153,7 @@ static int nvme_disable_and_flr(struct pci_dev *dev, bool probe)
pci_iounmap(dev, bar);
- pcie_flr(dev);
-
- return 0;
+ return pcie_flr(dev);
}
/*
@@ -4166,14 +4165,16 @@ static int nvme_disable_and_flr(struct pci_dev *dev, bool probe)
*/
static int delay_250ms_after_flr(struct pci_dev *dev, bool probe)
{
+ int ret;
+
if (probe)
return pcie_reset_flr(dev, PCI_RESET_PROBE);
- pcie_reset_flr(dev, PCI_RESET_DO_RESET);
+ ret = pcie_reset_flr(dev, PCI_RESET_DO_RESET);
msleep(250);
- return 0;
+ return ret;
}
#define PCI_DEVICE_ID_HINIC_VF 0x375E
@@ -4189,6 +4190,7 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
unsigned long timeout;
void __iomem *bar;
u32 val;
+ int ret;
if (probe)
return 0;
@@ -4209,7 +4211,7 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
val = val | HINIC_VF_FLR_PROC_BIT;
iowrite32be(val, bar + HINIC_VF_OP);
- pcie_flr(pdev);
+ ret = pcie_flr(pdev);
/*
* The device must recapture its Bus and Device Numbers after FLR
@@ -4236,7 +4238,7 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
reset_complete:
pci_iounmap(pdev, bar);
- return 0;
+ return ret;
}
static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v3 02/11] iommu: Pass in reset result to pci_dev_reset_iommu_done()
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 01/11] PCI: Propagate FLR return values to callers Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 03/11] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
` (8 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
IOMMU drivers handle ATC maintenance. They may encounter ATC-related errors
(e.g., an ATC invalidation request timeout), indicating that the ATC might
hold stale entries that can corrupt memory. In this case, the IOMMU driver
has no choice but to block the device's ATS function and wait for the
device to recover.
pci_dev_reset_iommu_done(), called at the end of a reset function, could
serve as a reliable signal to the IOMMU subsystem that the physical device
cache is completely clean. However, the function is currently called
unconditionally, even if the reset operation actually failed, which
re-attaches the faulty device to a normal translation domain. This leaves
the system highly exposed to data corruption:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
__reset(); // Failed (ATC is still stale)
pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)
Instead, add a @reset_succeeds parameter to pci_dev_reset_iommu_done() and
pass the reset result from each caller:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
rc = __reset();
pci_dev_reset_iommu_done(!rc); // Unblock or quarantine
On a successful reset, done() restores the device to its RID/PASID domains
and decrements group->recovery_cnt. On failure, the device remains blocked,
and concurrent domain attachment will be rejected until a successful reset.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 5 +++--
drivers/iommu/iommu.c | 28 +++++++++++++++++++++++++---
drivers/pci/pci-acpi.c | 2 +-
drivers/pci/pci.c | 10 +++++-----
drivers/pci/quirks.c | 2 +-
5 files changed, 35 insertions(+), 12 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 54b8b48c762e8..d3685967e960a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1191,7 +1191,7 @@ void iommu_free_global_pasid(ioasid_t pasid);
/* PCI device reset functions */
int pci_dev_reset_iommu_prepare(struct pci_dev *pdev);
-void pci_dev_reset_iommu_done(struct pci_dev *pdev);
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds);
#else /* CONFIG_IOMMU_API */
struct iommu_ops {};
@@ -1521,7 +1521,8 @@ static inline int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
return 0;
}
-static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev,
+ bool reset_succeeds)
{
}
#endif /* CONFIG_IOMMU_API */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ff181db687bbf..28d4c1f143a08 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -80,6 +80,7 @@ struct group_device {
* Device is blocked for a pending recovery while its group->domain is
* retained. This can happen when:
* - Device is undergoing a reset
+ * - Device failed the last reset
*/
bool blocked;
unsigned int reset_depth;
@@ -3971,7 +3972,9 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
* reset is finished, pci_dev_reset_iommu_done() can restore everything.
*
* Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to decrement the recovery_cnt.
+ * before/after the core-level reset routine. On a successful reset, done() will
+ * decrement group->recovery_cnt and restore domains. On a failure, recovery_cnt
+ * is left intact and the device stays blocked.
*
* Return: 0 on success or negative error code if the preparation failed.
*
@@ -4000,6 +4003,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
if (gdev->reset_depth++)
return 0;
+ /* Device might be already blocked for a quarantine */
+ if (gdev->blocked)
+ return 0;
ret = __iommu_group_alloc_blocking_domain(group);
if (ret)
@@ -4047,18 +4053,22 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
/**
* pci_dev_reset_iommu_done() - Restore IOMMU after a PCI device reset is done
* @pdev: PCI device that has finished a reset routine
+ * @reset_succeeds: Whether the PCI device reset is successful or not
*
* After a PCIe device finishes a reset routine, it wants to restore its IOMMU
* activity, including new translation and cache invalidation, by re-attaching
* all RID/PASID of the device back to the domains retained in the core-level
* structure.
*
- * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller should
+ * pass in the reset state via @reset_succeeds. On a failed reset, the device
+ * remains blocked for a quarantine with the group->recovery_cnt intact, so as
+ * to protect system memory until a subsequent successful reset.
*
* Note that, although unlikely, there is a risk that re-attaching domains might
* fail due to some unexpected happening like OOM.
*/
-void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
{
struct iommu_group *group = pdev->dev.iommu_group;
struct group_device *gdev;
@@ -4083,6 +4093,18 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
if (WARN_ON(!group->blocking_domain))
return;
+ /*
+ * A reset failure implies that the device might be unreliable. E.g. its
+ * device cache might retain stale entries, which potentially results in
+ * memory corruption. Thus, do not unblock the device until a successful
+ * reset.
+ */
+ if (!reset_succeeds) {
+ pci_err(pdev,
+ "Reset failed. Keep it blocked to protect memory\n");
+ return;
+ }
+
/* Re-attach RID domain back to group->domain */
if (group->domain != group->blocking_domain) {
WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 4d0f2cb6c695b..9ffd7f013a7d4 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -977,7 +977,7 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
ret = -ENOTTY;
}
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f1..d78e724027c78 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4358,7 +4358,7 @@ int pcie_flr(struct pci_dev *dev)
ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
done:
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4436,7 +4436,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
done:
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
@@ -4490,7 +4490,7 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
pci_dev_d3_sleep(dev);
ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
@@ -4933,7 +4933,7 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
rc = pci_parent_bus_reset(dev, probe);
done:
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !rc);
return rc;
}
@@ -4978,7 +4978,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
reg);
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !rc);
return rc;
}
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 05ce12b6b2f76..6ce79a25e5c76 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4271,7 +4271,7 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
}
ret = i->reset(dev, probe);
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v3 03/11] iommu: Add reset_device_done callback for hardware fault recovery
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 01/11] PCI: Propagate FLR return values to callers Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 02/11] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 04/11] iommu: Add __iommu_group_block_device helper Nicolin Chen
` (7 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
When IOMMU hardware detects an error caused by a faulty device (e.g. an ATC
invalidation timeout), IOMMU drivers may quarantine the device by disabling
specific hardware features or dropping translation capabilities.
To recover from these states, the IOMMU driver needs a reliable signal that
the underlying physical hardware has been cleanly reset (e.g., via PCIe AER
or a sysfs Function Level Reset) so as to lift the quarantine.
Introduce a reset_device_done callback in struct iommu_ops. Trigger it from
the existing pci_dev_reset_iommu_done() path to notify the underlying IOMMU
driver that the device's internal state has been sanitized.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 4 ++++
drivers/iommu/iommu.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d3685967e960a..3c5c5fa5cdc6a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -626,6 +626,9 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
* @release_device: Remove device from iommu driver handling
* @probe_finalize: Do final setup work after the device is added to an IOMMU
* group and attached to the groups domain
+ * @reset_device_done: Notify the driver that a device has reset successfully.
+ * Note that the core invokes the callback function while
+ * holding the group->mutex
* @device_group: find iommu group for a particular device
* @get_resv_regions: Request list of reserved regions for a device
* @of_xlate: add OF master IDs to iommu grouping
@@ -683,6 +686,7 @@ struct iommu_ops {
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
void (*probe_finalize)(struct device *dev);
+ void (*reset_device_done)(struct device *dev);
struct iommu_group *(*device_group)(struct device *dev);
/* Request/Free a list of reserved regions for a device */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 28d4c1f143a08..df23ef0a26e6c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -4071,12 +4071,14 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
{
struct iommu_group *group = pdev->dev.iommu_group;
+ const struct iommu_ops *ops;
struct group_device *gdev;
unsigned long pasid;
void *entry;
if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
return;
+ ops = dev_iommu_ops(&pdev->dev);
guard(mutex)(&group->mutex);
@@ -4105,6 +4107,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
return;
}
+ /*
+ * A PCI device might have been in an error state, so the IOMMU driver
+ * had to quarantine the device by disabling specific hardware features
+ * or dropping translation capability. Here notify the IOMMU driver as
+ * a reliable signal that the faulty PCI device has been cleanly reset
+ * so now it can lift its quarantine and restore full functionality.
+ */
+ if (ops->reset_device_done)
+ ops->reset_device_done(&pdev->dev);
+
/* Re-attach RID domain back to group->domain */
if (group->domain != group->blocking_domain) {
WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v3 04/11] iommu: Add __iommu_group_block_device helper
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (2 preceding siblings ...)
2026-04-16 23:28 ` [PATCH v3 03/11] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 05/11] iommu: Change group->devices to RCU-protected list Nicolin Chen
` (6 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
Move the RID/PASID blocking routine into a separate helper, which will be
reused by a new function that quarantines the device without touching the
gdev->reset_depth counter.
No functional changes.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 99 ++++++++++++++++++++++++-------------------
1 file changed, 56 insertions(+), 43 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index df23ef0a26e6c..768ac728b4cc3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3958,6 +3958,57 @@ int iommu_replace_group_handle(struct iommu_group *group,
}
EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+static int __iommu_group_block_device(struct iommu_group *group,
+ struct group_device *gdev)
+{
+ unsigned long pasid;
+ void *entry;
+ int ret;
+
+ lockdep_assert_held(&group->mutex);
+
+ /* Device might be already blocked for a quarantine */
+ if (gdev->blocked)
+ return 0;
+
+ ret = __iommu_group_alloc_blocking_domain(group);
+ if (ret)
+ return ret;
+
+ /* Stage RID domain at blocking_domain while retaining group->domain */
+ if (group->domain != group->blocking_domain) {
+ ret = __iommu_attach_device(group->blocking_domain, gdev->dev,
+ group->domain);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Update gdev->blocked upon the domain change, as it is used to return
+ * the correct domain in iommu_driver_get_domain_for_dev() that might be
+ * called in a set_dev_pasid callback function.
+ */
+ gdev->blocked = true;
+
+ /*
+ * Stage PASID domains at blocking_domain while retaining pasid_array.
+ *
+ * The pasid_array is mostly fenced by group->mutex, except one reader
+ * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+ */
+ if (gdev->dev->iommu->max_pasids > 0) {
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+ struct iommu_domain *pasid_dom =
+ pasid_array_entry_to_domain(entry);
+
+ iommu_remove_dev_pasid(gdev->dev, pasid, pasid_dom);
+ }
+ }
+
+ group->recovery_cnt++;
+ return 0;
+}
+
/**
* pci_dev_reset_iommu_prepare() - Block IOMMU to prepare for a PCI device reset
* @pdev: PCI device that is going to enter a reset routine
@@ -3988,8 +4039,6 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
{
struct iommu_group *group = pdev->dev.iommu_group;
struct group_device *gdev;
- unsigned long pasid;
- void *entry;
int ret;
if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
@@ -4003,50 +4052,14 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
if (gdev->reset_depth++)
return 0;
- /* Device might be already blocked for a quarantine */
- if (gdev->blocked)
- return 0;
-
- ret = __iommu_group_alloc_blocking_domain(group);
- if (ret)
- goto err_depth;
- /* Stage RID domain at blocking_domain while retaining group->domain */
- if (group->domain != group->blocking_domain) {
- ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
- group->domain);
- if (ret)
- goto err_depth;
- }
-
- /*
- * Update gdev->blocked upon the domain change, as it is used to return
- * the correct domain in iommu_driver_get_domain_for_dev() that might be
- * called in a set_dev_pasid callback function.
- */
- gdev->blocked = true;
-
- /*
- * Stage PASID domains at blocking_domain while retaining pasid_array.
- *
- * The pasid_array is mostly fenced by group->mutex, except one reader
- * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
- */
- if (pdev->dev.iommu->max_pasids > 0) {
- xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
- struct iommu_domain *pasid_dom =
- pasid_array_entry_to_domain(entry);
-
- iommu_remove_dev_pasid(&pdev->dev, pasid, pasid_dom);
- }
+ ret = __iommu_group_block_device(group, gdev);
+ if (ret) {
+ gdev->reset_depth--;
+ return ret;
}
- group->recovery_cnt++;
- return ret;
-
-err_depth:
- gdev->reset_depth--;
- return ret;
+ return 0;
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v3 05/11] iommu: Change group->devices to RCU-protected list
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (3 preceding siblings ...)
2026-04-16 23:28 ` [PATCH v3 04/11] iommu: Add __iommu_group_block_device helper Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 06/11] iommu: Defer __iommu_group_free_device() to be outside group->mutex Nicolin Chen
` (5 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
To allow lockless iteration of the group->devices list from an ISR context
that cannot hold the group->mutex, change the list to be RCU-protected.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 768ac728b4cc3..d1be62a07904a 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -84,18 +84,20 @@ struct group_device {
*/
bool blocked;
unsigned int reset_depth;
+ struct rcu_head rcu;
};
/* Iterate over each struct group_device in a struct iommu_group */
#define for_each_group_device(group, pos) \
- list_for_each_entry(pos, &(group)->devices, list)
+ list_for_each_entry_rcu(pos, &(group)->devices, list, \
+ lockdep_is_held(&(group)->mutex))
static struct group_device *__dev_to_gdev(struct device *dev)
{
struct iommu_group *group = dev->iommu_group;
struct group_device *gdev;
- lockdep_assert_held(&group->mutex);
+ lockdep_assert(lockdep_is_held(&group->mutex) || rcu_read_lock_held());
for_each_group_device(group, gdev) {
if (gdev->dev == dev)
@@ -666,7 +668,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
* The gdev must be in the list before calling
* iommu_setup_default_domain()
*/
- list_add_tail(&gdev->list, &group->devices);
+ list_add_tail_rcu(&gdev->list, &group->devices);
WARN_ON(group->default_domain && !group->domain);
if (group->default_domain)
iommu_create_device_direct_mappings(group->default_domain, dev);
@@ -697,7 +699,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
return 0;
err_remove_gdev:
- list_del(&gdev->list);
+ list_del_rcu(&gdev->list);
__iommu_group_free_device(group, gdev);
err_put_group:
iommu_deinit_device(dev);
@@ -745,7 +747,7 @@ static void __iommu_group_free_device(struct iommu_group *group,
group->domain != group->default_domain);
kfree(grp_dev->name);
- kfree(grp_dev);
+ kfree_rcu(grp_dev, rcu);
}
/* Remove the iommu_group from the struct device. */
@@ -759,7 +761,7 @@ static void __iommu_group_remove_device(struct device *dev)
if (device->dev != dev)
continue;
- list_del(&device->list);
+ list_del_rcu(&device->list);
__iommu_group_free_device(group, device);
if (dev_has_iommu(dev))
iommu_deinit_device(dev);
@@ -1335,7 +1337,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
dev->iommu_group = group;
mutex_lock(&group->mutex);
- list_add_tail(&gdev->list, &group->devices);
+ list_add_tail_rcu(&gdev->list, &group->devices);
mutex_unlock(&group->mutex);
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v3 06/11] iommu: Defer __iommu_group_free_device() to be outside group->mutex
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (4 preceding siblings ...)
2026-04-16 23:28 ` [PATCH v3 05/11] iommu: Change group->devices to RCU-protected list Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 07/11] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
` (4 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
__iommu_group_remove_device() holds group->mutex across the entire call to
__iommu_group_free_device(), which performs sysfs removals, tracing, and the
final kfree_rcu(). In fact, most of these operations don't need the
group->mutex.
The group_device structure will gain a work_struct to quarantine broken
devices asynchronously. The work function must hold group->mutex to safely
update group state, so cancel_work_sync() must be called to cancel that work
before freeing the device. But doing so under group->mutex would deadlock
if the worker is already running and waiting to acquire the same lock.
Split the assertion out of __iommu_group_free_device() into a new helper,
__iommu_group_empty_assert_owner_cnt().
Defer __iommu_group_free_device() until the mutex is released.
This is a preparatory refactor with no functional change.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d1be62a07904a..810e7b94a1ae2 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -627,6 +627,19 @@ static struct iommu_domain *pasid_array_entry_to_domain(void *entry)
DEFINE_MUTEX(iommu_probe_device_lock);
+static void __iommu_group_empty_assert_owner_cnt(struct iommu_group *group)
+{
+ lockdep_assert_held(&group->mutex);
+ /*
+ * If the group has become empty then ownership must have been
+ * released, and the current domain must be set back to NULL or
+ * the default domain.
+ */
+ if (list_empty(&group->devices))
+ WARN_ON(group->owner_cnt ||
+ group->domain != group->default_domain);
+}
+
static int __iommu_probe_device(struct device *dev, struct list_head *group_list)
{
struct iommu_group *group;
@@ -700,10 +713,12 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
err_remove_gdev:
list_del_rcu(&gdev->list);
- __iommu_group_free_device(group, gdev);
+ __iommu_group_empty_assert_owner_cnt(group);
err_put_group:
iommu_deinit_device(dev);
mutex_unlock(&group->mutex);
+ if (!IS_ERR(gdev))
+ __iommu_group_free_device(group, gdev);
iommu_group_put(group);
return ret;
@@ -732,20 +747,13 @@ static void __iommu_group_free_device(struct iommu_group *group,
{
struct device *dev = grp_dev->dev;
+ lockdep_assert_not_held(&group->mutex);
+
sysfs_remove_link(group->devices_kobj, grp_dev->name);
sysfs_remove_link(&dev->kobj, "iommu_group");
trace_remove_device_from_group(group->id, dev);
- /*
- * If the group has become empty then ownership must have been
- * released, and the current domain must be set back to NULL or
- * the default domain.
- */
- if (list_empty(&group->devices))
- WARN_ON(group->owner_cnt ||
- group->domain != group->default_domain);
-
kfree(grp_dev->name);
kfree_rcu(grp_dev, rcu);
}
@@ -754,7 +762,7 @@ static void __iommu_group_free_device(struct iommu_group *group,
static void __iommu_group_remove_device(struct device *dev)
{
struct iommu_group *group = dev->iommu_group;
- struct group_device *device;
+ struct group_device *device, *to_free = NULL;
mutex_lock(&group->mutex);
for_each_group_device(group, device) {
@@ -762,15 +770,18 @@ static void __iommu_group_remove_device(struct device *dev)
continue;
list_del_rcu(&device->list);
- __iommu_group_free_device(group, device);
+ __iommu_group_empty_assert_owner_cnt(group);
if (dev_has_iommu(dev))
iommu_deinit_device(dev);
else
dev->iommu_group = NULL;
+ to_free = device;
break;
}
mutex_unlock(&group->mutex);
+ if (to_free)
+ __iommu_group_free_device(group, to_free);
/*
* Pairs with the get in iommu_init_device() or
* iommu_group_add_device()
--
2.43.0
* [PATCH v3 07/11] iommu: Add iommu_report_device_broken() to quarantine a broken device
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (5 preceding siblings ...)
2026-04-16 23:28 ` [PATCH v3 06/11] iommu: Defer __iommu_group_free_device() to be outside group->mutex Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 08/11] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
` (3 subsequent siblings)
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
When IOMMU hardware detects an error caused by a faulty device (e.g. an ATS
invalidation timeout), IOMMU drivers may quarantine the device by disabling
specific hardware features or dropping translation capabilities.
However, the core-level state of the faulty device then becomes out of sync:
the device can still be attached to a translation domain, or could even be
moved to a new domain that would overwrite the driver-level quarantine.
Given that such an error is likely triggered from an ISR, introduce an
asynchronous broken_work per group_device, and provide a helper function that
allows drivers to initiate a quarantine in the core.
Note that the worker function must not use dev->iommu_group, which is set to
NULL by iommu_deinit_device() while holding group->mutex; cancel_work_sync()
is only called afterwards, outside the mutex, so the worker could otherwise
dereference a NULL pointer. Add a stable group backpointer to struct
group_device instead.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 6 +++
drivers/iommu/iommu.c | 100 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 106 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 3c5c5fa5cdc6a..97d0e5b90c58f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -893,6 +893,8 @@ static inline struct iommu_device *__iommu_get_iommu_dev(struct device *dev)
#define iommu_get_iommu_dev(dev, type, member) \
container_of(__iommu_get_iommu_dev(dev), type, member)
+void iommu_report_device_broken(struct device *dev);
+
static inline void iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
{
*gather = (struct iommu_iotlb_gather) {
@@ -1207,6 +1209,10 @@ struct iommu_iotlb_gather {};
struct iommu_dirty_bitmap {};
struct iommu_dirty_ops {};
+static inline void iommu_report_device_broken(struct device *dev)
+{
+}
+
static inline bool device_iommu_capable(struct device *dev, enum iommu_cap cap)
{
return false;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 810e7b94a1ae2..bb00918e1b70d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -73,6 +73,7 @@ struct iommu_group {
};
struct group_device {
+ struct iommu_group *group;
struct list_head list;
struct device *dev;
char *name;
@@ -81,10 +82,12 @@ struct group_device {
* retained. This can happen when:
* - Device is undergoing a reset
* - Device failed the last reset
+ * - Device is broken and quarantined
*/
bool blocked;
unsigned int reset_depth;
struct rcu_head rcu;
+ struct work_struct broken_work;
};
/* Iterate over each struct group_device in a struct iommu_group */
@@ -170,6 +173,7 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
struct device *dev);
static void __iommu_group_free_device(struct iommu_group *group,
struct group_device *grp_dev);
+static void iommu_group_broken_worker(struct work_struct *work);
static void iommu_domain_init(struct iommu_domain *domain, unsigned int type,
const struct iommu_ops *ops);
@@ -752,6 +756,8 @@ static void __iommu_group_free_device(struct iommu_group *group,
sysfs_remove_link(group->devices_kobj, grp_dev->name);
sysfs_remove_link(&dev->kobj, "iommu_group");
+ /* Must wait for broken_work to prevent UAF */
+ cancel_work_sync(&grp_dev->broken_work);
trace_remove_device_from_group(group->id, dev);
kfree(grp_dev->name);
@@ -1284,6 +1290,8 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
return ERR_PTR(-ENOMEM);
device->dev = dev;
+ device->group = group;
+ INIT_WORK(&device->broken_work, iommu_group_broken_worker);
ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
if (ret)
@@ -4178,6 +4186,98 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
+static void iommu_group_broken_worker(struct work_struct *work)
+{
+ struct group_device *gdev =
+ container_of(work, struct group_device, broken_work);
+ struct iommu_group *group = gdev->group;
+ struct device *dev = gdev->dev;
+
+ mutex_lock(&group->mutex);
+
+ /*
+ * iommu_deinit_device() frees dev->iommu under group->mutex. Bail
+ * out if the device has already been removed from IOMMU handling.
+ */
+ if (!dev_has_iommu(dev))
+ goto out_unlock;
+
+ if (gdev->blocked) {
+ dev_dbg(dev, "IOMMU has already quarantined the device\n");
+ goto out_unlock;
+ }
+
+ /*
+ * Quarantine the device completely. For a PCI device, it will be lifted
+ * upon a pci_dev_reset_iommu_done(pdev, succeeds=true) call indicating
+ * a device recovery.
+ *
+ * For a non-PCI device, currently it has no recovery framework tied to
+ * the IOMMU subsystem. Quarantine it indefinitely until a recovery path
+ * is introduced.
+ */
+ if (!WARN_ON(__iommu_group_block_device(group, gdev)))
+ dev_warn(dev, "IOMMU has quarantined the device\n");
+
+out_unlock:
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+}
+
+/**
+ * iommu_report_device_broken() - Report a broken device to quarantine it
+ * @dev: Device that has encountered an unrecoverable IOMMU-related error
+ *
+ * When an IOMMU driver detects a critical error caused by a device (e.g. an ATC
+ * invalidation timeout), this function should be used to quarantine the device
+ * at the IOMMU core level.
+ *
+ * The quarantine moves the device's RID and PASIDs to group->blocking_domain to
+ * prevent any further DMA/ATS activity that can potentially corrupt the system
+ * memory due to stale device cache entries.
+ *
+ * This function is safe to call from any context, including interrupt handlers,
+ * as it schedules the actual quarantine work asynchronously. The caller should
+ * have already taken driver-level measures (e.g., disabling ATS in hardware) to
+ * contain the fault immediately, before calling this function.
+ *
+ * For PCI devices, the quarantine will be lifted by a successful device reset
+ * via pci_dev_reset_iommu_done(). For non-PCI devices, the quarantine remains
+ * in effect indefinitely until a recovery mechanism is introduced.
+ *
+ * If the device is concurrently being removed or has already been removed from
+ * the IOMMU subsystem, this function will silently return without any action.
+ */
+void iommu_report_device_broken(struct device *dev)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+ struct group_device *gdev;
+ bool scheduled = false;
+
+ if (!group)
+ return;
+ if (!dev_has_iommu(dev))
+ goto out;
+
+ rcu_read_lock();
+ /*
+ * Note the device might have been concurrently removed from the group
+ * (list_del_rcu) before iommu_deinit_device() cleared the dev->iommu.
+ */
+ list_for_each_entry_rcu(gdev, &group->devices, list) {
+ if (gdev->dev != dev)
+ continue;
+ /* iommu_group_broken_worker() must put the group ref */
+ scheduled = schedule_work(&gdev->broken_work);
+ break;
+ }
+ rcu_read_unlock();
+out:
+ if (!scheduled)
+ iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_broken);
+
#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
/**
* iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
--
2.43.0
* [PATCH v3 08/11] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (6 preceding siblings ...)
2026-04-16 23:28 ` [PATCH v3 07/11] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 09/11] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv Nicolin Chen
` (2 subsequent siblings)
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware
reports the timeout via a GERROR interrupt, the driver thread issuing the
commands has no direct way to verify whether its specific batch was the
cause, since polling the CMD_SYNC status does not natively return a failure
code. This makes it very difficult to coordinate per-device recovery.
Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
gap. When the ISR detects an ATC timeout, set the bit corresponding to the
physical CMDQ index of the faulting CMD_SYNC command.
On the issuer side, after polling completes (or times out), test and clear
its dedicated bit. If set, return -EIO to trigger device quarantine.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 41 ++++++++++++++++++++-
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index ef42df4753ec4..1d72e5040ea97 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -633,6 +633,7 @@ struct arm_smmu_cmdq {
atomic_long_t *valid_map;
atomic_t owner_prod;
atomic_t lock;
+ unsigned long *atc_sync_timeouts;
bool (*supports_cmd)(struct arm_smmu_cmdq_ent *ent);
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f6901c5437edc..f47943f860f3d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -445,7 +445,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
* at the CMD_SYNC. Attempt to complete other pending commands
* by repeating the CMD_SYNC, though we might well end up back
* here since the ATC invalidation may still be pending.
+ *
+ * Mark the faulty batch in the bitmap for the issuer to match.
*/
+ set_bit(Q_IDX(&q->llq, cons), cmdq->atc_sync_timeouts);
return;
case CMDQ_ERR_CERROR_ILL_IDX:
default:
@@ -895,9 +898,40 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
if (sync) {
+ u32 sync_prod;
+
llq.prod = queue_inc_prod_n(&llq, n);
+ sync_prod = llq.prod;
+ /*
+ * If there is an unhandled ATC timeout, we will have no choice
+ * but to ignore it, since this was left on the ring buffer in
+ * the last round. And we certainly don't want it to affect the
+ * current issue.
+ */
+ clear_bit(Q_IDX(&llq, sync_prod), cmdq->atc_sync_timeouts);
+
ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
- if (ret) {
+
+ /*
+ * Test atc_sync_timeouts first to see whether an ATC timeout
+ * resulted from this cmdlist. Return -EIO to distinguish from the
+ * ARM_SMMU_POLL_TIMEOUT_US software timeout.
+ *
+ * FIXME possible unhandled ATC invalidation timeout scenario:
+ * PCI Completion Timeout can be set to a range longer than the
+ * ARM_SMMU_POLL_TIMEOUT_US software timeout. -ETIMEDOUT can be
+ * returned by arm_smmu_cmdq_poll_until_sync() while ATC timeout
+ * might not be flagged on atc_sync_timeouts yet. In this case,
+ * we can hardly do anything here since the command queue HW is
+ * still pending on the ATC command.
+ */
+ if (test_and_clear_bit(Q_IDX(&llq, sync_prod),
+ cmdq->atc_sync_timeouts)) {
+ dev_err_ratelimited(smmu->dev,
+ "CMD_SYNC for ATC_INV timeout at prod=0x%08x\n",
+ sync_prod);
+ ret = -EIO;
+ } else if (ret) {
dev_err_ratelimited(smmu->dev,
"CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
llq.prod,
@@ -4458,6 +4492,11 @@ int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
if (!cmdq->valid_map)
return -ENOMEM;
+ cmdq->atc_sync_timeouts =
+ devm_bitmap_zalloc(smmu->dev, nents, GFP_KERNEL);
+ if (!cmdq->atc_sync_timeouts)
+ return -ENOMEM;
+
return 0;
}
--
2.43.0
* [PATCH v3 09/11] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (7 preceding siblings ...)
2026-04-16 23:28 ` [PATCH v3 08/11] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 10/11] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 11/11] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
Storing the master pointer allows backtracking from an invalidation entry to
its master, which will be useful when handling ATC invalidation timeouts.
No functional changes.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 34 +++++++++++--------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++------
3 files changed, 33 insertions(+), 27 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1d72e5040ea97..26e0ee0bb5b45 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -662,7 +662,7 @@ enum arm_smmu_inv_type {
};
struct arm_smmu_inv {
- struct arm_smmu_device *smmu;
+ struct arm_smmu_master *master;
u8 type;
u8 size_opcode;
u8 nsize_opcode;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
index add671363c828..ef0c0bfe44206 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -653,39 +653,43 @@ static void arm_smmu_v3_invs_test_verify(struct kunit *test,
}
}
+static struct arm_smmu_master invs_master = {
+ .smmu = &smmu,
+};
+
static struct arm_smmu_invs invs1 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, },
- { .type = INV_TYPE_S2_VMID_S1_CLEAR, .id = 1, },
- { .type = INV_TYPE_ATS, .id = 3, }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, },
+ { .master = &invs_master, .type = INV_TYPE_S2_VMID_S1_CLEAR, .id = 1, },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 3, }, },
};
static struct arm_smmu_invs invs2 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
- { .type = INV_TYPE_ATS, .id = 4, },
- { .type = INV_TYPE_ATS, .id = 5, }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, }, /* dup */
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 4, },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 5, }, },
};
static struct arm_smmu_invs invs3 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
- { .type = INV_TYPE_ATS, .id = 5, }, /* recover a trash */
- { .type = INV_TYPE_ATS, .id = 6, }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, }, /* dup */
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 5, }, /* recover a trash */
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 6, }, },
};
static struct arm_smmu_invs invs4 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_ATS, .id = 10, .ssid = 1 },
- { .type = INV_TYPE_ATS, .id = 10, .ssid = 3 },
- { .type = INV_TYPE_ATS, .id = 12, .ssid = 1 }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 1 },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 3 },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 12, .ssid = 1 }, },
};
static struct arm_smmu_invs invs5 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_ATS, .id = 10, .ssid = 2 },
- { .type = INV_TYPE_ATS, .id = 10, .ssid = 3 }, /* duplicate */
- { .type = INV_TYPE_ATS, .id = 12, .ssid = 2 }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 2 },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 3 }, /* dup */
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 12, .ssid = 2 }, },
};
static void arm_smmu_v3_invs_test(struct kunit *test)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f47943f860f3d..13f225f704e73 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1092,8 +1092,9 @@ arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
static int arm_smmu_inv_cmp(const struct arm_smmu_inv *inv_l,
const struct arm_smmu_inv *inv_r)
{
- if (inv_l->smmu != inv_r->smmu)
- return cmp_int((uintptr_t)inv_l->smmu, (uintptr_t)inv_r->smmu);
+ if (inv_l->master->smmu != inv_r->master->smmu)
+ return cmp_int((uintptr_t)inv_l->master->smmu,
+ (uintptr_t)inv_r->master->smmu);
if (inv_l->type != inv_r->type)
return cmp_int(inv_l->type, inv_r->type);
if (inv_l->id != inv_r->id)
@@ -2650,22 +2651,22 @@ static void arm_smmu_inv_to_cmdq_batch(struct arm_smmu_inv *inv,
unsigned long iova, size_t size,
unsigned int granule)
{
- if (arm_smmu_inv_size_too_big(inv->smmu, size, granule)) {
+ if (arm_smmu_inv_size_too_big(inv->master->smmu, size, granule)) {
cmd->opcode = inv->nsize_opcode;
- arm_smmu_cmdq_batch_add(inv->smmu, cmds, cmd);
+ arm_smmu_cmdq_batch_add(inv->master->smmu, cmds, cmd);
return;
}
cmd->opcode = inv->size_opcode;
- arm_smmu_cmdq_batch_add_range(inv->smmu, cmds, cmd, iova, size, granule,
- inv->pgsize);
+ arm_smmu_cmdq_batch_add_range(inv->master->smmu, cmds, cmd, iova, size,
+ granule, inv->pgsize);
}
static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
struct arm_smmu_inv *next)
{
/* Changing smmu means changing command queue */
- if (cur->smmu != next->smmu)
+ if (cur->master->smmu != next->master->smmu)
return true;
/* The batch for S2 TLBI must be done before nested S1 ASIDs */
if (cur->type != INV_TYPE_S2_VMID_S1_CLEAR &&
@@ -2692,7 +2693,7 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
if (READ_ONCE(cur->users))
break;
while (cur != end) {
- struct arm_smmu_device *smmu = cur->smmu;
+ struct arm_smmu_device *smmu = cur->master->smmu;
struct arm_smmu_cmdq_ent cmd = {
/*
* Pick size_opcode to run arm_smmu_get_cmdq(). This can
@@ -2721,7 +2722,8 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
break;
case INV_TYPE_S2_VMID_S1_CLEAR:
/* CMDQ_OP_TLBI_S12_VMALL already flushed S1 entries */
- if (arm_smmu_inv_size_too_big(cur->smmu, size, granule))
+ if (arm_smmu_inv_size_too_big(cur->master->smmu, size,
+ granule))
break;
cmd.tlbi.vmid = cur->id;
arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
@@ -3246,7 +3248,7 @@ arm_smmu_master_build_inv(struct arm_smmu_master *master,
{
struct arm_smmu_invs *build_invs = master->build_invs;
struct arm_smmu_inv *cur, inv = {
- .smmu = master->smmu,
+ .master = master,
.type = type,
.id = id,
.pgsize = pgsize,
@@ -3478,7 +3480,7 @@ static void arm_smmu_inv_flush_iotlb_tag(struct arm_smmu_inv *inv)
}
cmd.opcode = inv->nsize_opcode;
- arm_smmu_cmdq_issue_cmd_with_sync(inv->smmu, &cmd);
+ arm_smmu_cmdq_issue_cmd_with_sync(inv->master->smmu, &cmd);
}
/* Should be installed after arm_smmu_install_ste_for_dev() */
--
2.43.0
* [PATCH v3 10/11] iommu/arm-smmu-v3: Introduce master->ats_broken flag
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (8 preceding siblings ...)
2026-04-16 23:28 ` [PATCH v3 09/11] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 11/11] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
The flag will be set when the IOMMU can no longer trust the device's ATS
function, e.g. when an ATC invalidation request to the device times out.
Once it is set, report ATS as unsupported to prevent data corruption, and
skip further ATC invalidation commands to avoid new timeouts.
Clear the flag when the device finishes a reset for recovery.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 28 +++++++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 26e0ee0bb5b45..95bce9966374a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -941,6 +941,7 @@ struct arm_smmu_master {
/* Locked by the iommu core using the group mutex */
struct arm_smmu_ctx_desc_cfg cd_table;
unsigned int num_streams;
+ bool ats_broken;
bool ats_enabled : 1;
bool ste_ats_enabled : 1;
bool stall_enabled;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 13f225f704e73..5dead82cf1186 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2523,6 +2523,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
struct arm_smmu_cmdq_ent cmd;
struct arm_smmu_cmdq_batch cmds;
+ /* Do not issue ATC_INV that will definitely time out */
+ if (READ_ONCE(master->ats_broken))
+ return 0;
+
arm_smmu_atc_inv_to_cmd(ssid, 0, 0, &cmd);
arm_smmu_cmdq_batch_init(master->smmu, &cmds, &cmd);
@@ -2729,11 +2733,17 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
break;
case INV_TYPE_ATS:
+ /* Do not issue ATC_INV that will definitely time out */
+ if (READ_ONCE(cur->master->ats_broken))
+ break;
arm_smmu_atc_inv_to_cmd(cur->ssid, iova, size, &cmd);
cmd.atc.sid = cur->id;
arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
break;
case INV_TYPE_ATS_FULL:
+ /* Do not issue ATC_INV that will definitely time out */
+ if (READ_ONCE(cur->master->ats_broken))
+ break;
arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
cmd.atc.sid = cur->id;
arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
@@ -3069,6 +3079,15 @@ void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
}
}
+static void arm_smmu_reset_device_done(struct device *dev)
+{
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+ if (WARN_ON(!master))
+ return;
+ WRITE_ONCE(master->ats_broken, false);
+}
+
static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
{
struct device *dev = master->dev;
@@ -3081,6 +3100,14 @@ static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS))
return false;
+ /*
+ * Do not enable ATS if master->ats_broken is set. The PCI device should
+ * go through a recovery (reset) that shall notify the SMMUv3 driver via
+ * a reset_device_done callback.
+ */
+ if (READ_ONCE(master->ats_broken))
+ return false;
+
return dev_is_pci(dev) && pci_ats_supported(to_pci_dev(dev));
}
@@ -4412,6 +4439,7 @@ static const struct iommu_ops arm_smmu_ops = {
.domain_alloc_paging_flags = arm_smmu_domain_alloc_paging_flags,
.probe_device = arm_smmu_probe_device,
.release_device = arm_smmu_release_device,
+ .reset_device_done = arm_smmu_reset_device_done,
.device_group = arm_smmu_device_group,
.of_xlate = arm_smmu_of_xlate,
.get_resv_regions = arm_smmu_get_resv_regions,
--
2.43.0
* [PATCH v3 11/11] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (9 preceding siblings ...)
2026-04-16 23:28 ` [PATCH v3 10/11] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
@ 2026-04-16 23:28 ` Nicolin Chen
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
Currently, when a GERROR_CMDQ_ERR occurs, arm_smmu_cmdq_skip_err() does
nothing for the CMDQ_ERR_CERROR_ATC_INV_IDX case.
When a device is unresponsive to an ATC invalidation request, this often
results in constant CMDQ errors:
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb84): ATC invalidate timeout
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb88): ATC invalidate timeout
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
...
An ATC invalidation timeout indicates that the device failed to respond to
a protocol-critical coherency request, which means the device's internal
ATS state is desynchronized from the SMMU.
Furthermore, ignoring the timeout leaves the system in an unsafe state, as
the device cache may retain stale ATC entries for memory pages that the OS
has already reclaimed and reassigned. This might lead to data corruption.
Isolate a device confirmed to be unresponsive with a surgical STE update
that clears its EATS bit, rejecting any further ATS transactions that could
corrupt memory.
Also, set the master->ats_broken flag, which is cleared once the device
completes a reset. This flag prevents further ATS requests and invalidations
from being issued.
Finally, report this broken device to the IOMMU core to isolate the device
in the core level too.
For batched ATC_INV commands, the SMMU hardware only reports a timeout at
the CMD_SYNC, which may follow a batch issued for multiple devices, so it
isn't straightforward to identify which command in the batch caused the
timeout. Fortunately, the invs array keeps a sorted list of ATC entries, so
the issued batch is sorted as well. This makes it possible to retry an
ATC_INV command per unique Stream ID in the batch to identify the
unresponsive master.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 100 +++++++++++++++++++-
1 file changed, 97 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5dead82cf1186..7dbd9c5834314 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -107,6 +107,8 @@ static const char * const event_class_str[] = {
[3] = "Reserved",
};
+static struct arm_smmu_ste *
+arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid);
static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
static void parse_driver_options(struct arm_smmu_device *smmu)
@@ -2516,10 +2518,49 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
cmd->atc.size = log2_span;
}
+static void arm_smmu_disable_eats_for_sid(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq *cmdq, u32 sid)
+{
+ struct arm_smmu_cmdq_ent ent = {
+ .opcode = CMDQ_OP_CFGI_STE,
+ .cfgi = {
+ .sid = sid,
+ .leaf = true,
+ },
+ };
+ struct arm_smmu_ste *step;
+ u64 cmd[CMDQ_ENT_DWORDS];
+ __le64 old, new;
+
+ step = arm_smmu_get_step_for_sid(smmu, sid);
+
+ old = READ_ONCE(step->data[1]);
+ do {
+ new = old & cpu_to_le64(~STRTAB_STE_1_EATS);
+ } while (!try_cmpxchg64(&step->data[1], &old, new));
+
+ arm_smmu_cmdq_build_cmd(cmd, &ent);
+ if (arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, cmd, 1, true))
+ dev_err_ratelimited(smmu->dev,
+ "failed to disable ATS for sid %#x\n", sid);
+}
+
+static void arm_smmu_master_disable_ats(struct arm_smmu_master *master,
+ struct arm_smmu_cmdq *cmdq)
+{
+ int i;
+
+ for (i = 0; i < master->num_streams; i++)
+ arm_smmu_disable_eats_for_sid(master->smmu, cmdq,
+ master->streams[i].id);
+ WRITE_ONCE(master->ats_broken, true);
+ iommu_report_device_broken(master->dev);
+}
+
static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
ioasid_t ssid)
{
- int i;
+ int i, ret;
struct arm_smmu_cmdq_ent cmd;
struct arm_smmu_cmdq_batch cmds;
@@ -2535,7 +2576,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
}
- return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+ ret = arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+ if (ret == -EIO)
+ arm_smmu_master_disable_ats(master, cmds.cmdq);
+ return ret;
}
/* IO_PGTABLE API */
@@ -2682,6 +2726,55 @@ static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
return false;
}
+static void arm_smmu_invs_disable_ats(struct arm_smmu_invs *invs,
+ struct arm_smmu_cmdq *cmdq,
+ struct arm_smmu_device *smmu, u32 sid)
+{
+ struct arm_smmu_inv *cur;
+ size_t i;
+
+ arm_smmu_invs_for_each_entry(invs, i, cur) {
+ if (cur->master->smmu == smmu && arm_smmu_inv_is_ats(cur) &&
+ cur->id == sid) {
+ arm_smmu_master_disable_ats(cur->master, cmdq);
+ break;
+ }
+ }
+}
+
+static void arm_smmu_cmdq_batch_retry(struct arm_smmu_device *smmu,
+ struct arm_smmu_invs *invs,
+ struct arm_smmu_cmdq_batch *cmds)
+{
+ u64 atc[CMDQ_ENT_DWORDS] = {0};
+ int i;
+
+ /* Only a timed out ATC_INV command needs a retry */
+ if (!invs->has_ats)
+ return;
+
+ for (i = 0; i < cmds->num * CMDQ_ENT_DWORDS; i += CMDQ_ENT_DWORDS) {
+ struct arm_smmu_cmdq *cmdq = cmds->cmdq;
+ u32 sid;
+ int ret;
+
+ /* Only need to retry ATC invalidations */
+ if (FIELD_GET(CMDQ_0_OP, cmds->cmds[i]) != CMDQ_OP_ATC_INV)
+ continue;
+
+ /* Only need to retry with one ATC_INV per Stream ID (device) */
+ sid = FIELD_GET(CMDQ_ATC_0_SID, cmds->cmds[i]);
+ if (atc[0] && sid == FIELD_GET(CMDQ_ATC_0_SID, atc[0]))
+ continue;
+
+ atc[0] = cmds->cmds[i];
+ atc[1] = cmds->cmds[i + 1];
+ ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, atc, 1, true);
+ if (ret == -EIO)
+ arm_smmu_invs_disable_ats(invs, cmdq, smmu, sid);
+ }
+}
+
static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
unsigned long iova, size_t size,
unsigned int granule, bool leaf)
@@ -2760,7 +2853,8 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
if (cmds.num &&
(next == end || arm_smmu_invs_end_batch(cur, next))) {
- arm_smmu_cmdq_batch_submit(smmu, &cmds);
+ if (arm_smmu_cmdq_batch_submit(smmu, &cmds) == -EIO)
+ arm_smmu_cmdq_batch_retry(smmu, invs, &cmds);
cmds.num = 0;
}
cur = next;
--
2.43.0
Thread overview: 12+ messages
2026-04-16 23:28 [PATCH v3 00/11] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 01/11] PCI: Propagate FLR return values to callers Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 02/11] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 03/11] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 04/11] iommu: Add __iommu_group_block_device helper Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 05/11] iommu: Change group->devices to RCU-protected list Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 06/11] iommu: Defer __iommu_group_free_device() to be outside group->mutex Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 07/11] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 08/11] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 09/11] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 10/11] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
2026-04-16 23:28 ` [PATCH v3 11/11] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen