[PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout

* [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout
@ 2026-05-19  3:38 Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 01/24] PCI: Don't suspend IOMMU when probing reset capability Nicolin Chen
                   ` (23 more replies)
  0 siblings, 24 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

Hi all,

This series addresses a critical vulnerability and stability issue where an
unresponsive PCIe device failing to process ATC (Address Translation Cache)
invalidation requests leads to silent data corruption and continuous SMMU
CMDQ error spam.

[ As Jason pointed out, because this series fundamentally introduces a new
  RAS feature to quarantine and recover from hardware faults and relies on
  a recently accepted SMMU driver rework, it is not treated as a standard
  bug fix. Thus, most of the patches here don't carry a "Fixes" tag. ]

Currently, when an ATC invalidation times out, the SMMUv3 driver skips the
CMDQ_ERR_CERROR_ATC_INV_IDX error. This leaves the device's ATS cache state
desynchronized from the SMMU: the device cache may retain stale ATC entries
for memory pages that the OS has already reclaimed and reassigned, creating
a direct vector for data corruption. Furthermore, the driver might continue
issuing ATC_INV commands, resulting in constant CMDQ errors:
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb84): ATC invalidate timeout
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb88): ATC invalidate timeout
    unexpected global error reported (0x00000001), this could be serious
    CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
    ...

To resolve this, introduce a mechanism to quarantine a broken device in the
SMMUv3 driver and the IOMMU core. This requires a few preparatory changes:
 - Pass in PCI reset result to pci_dev_reset_iommu_done()
 - Co-clear pending CMDQ_ERR from the cmdq issuer under a raw_spinlock_t,
   so an ATC_INV timeout flagged in cmdq->atc_sync_timeouts is definitive
   when the issuer reads its bit after CMD_SYNC poll
 - Introduce a reset_device_done op, allowing the core to signal the driver
   when the physical hardware has been cleanly recovered (e.g., via AER or
   a manual reset) so the quarantine can be lifted
 - Utilize a per-group_device WQ via an iommu_report_device_broken() helper

On the SMMUv3 driver side, retry the timed-out ATC_INV batch to identify the
faulty device(s). Perform a surgical STE update and flag the device's ATS as
broken, to reject further ATS/ATC requests at the HW level and suppress the
timeout spam.
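
The retry step can be pictured with a small userspace model. This is only a
sketch under assumptions: struct fake_dev, issue_atc_inv() and
isolate_broken() are hypothetical names that do not exist in the driver, and
the real code works on CMDQ batches and STEs rather than on a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a master whose STE has ATS enabled. */
struct fake_dev {
	bool responsive;	/* does the device answer ATC_INV? */
	bool ats_enabled;	/* models STE.EATS */
	bool ats_broken;	/* models master->ats_broken */
};

/* Models one ATC_INV + CMD_SYNC; returns false on an invalidation timeout. */
static bool issue_atc_inv(const struct fake_dev *dev)
{
	return dev->responsive;
}

/*
 * After a batch timed out, reissue the invalidation one device at a time.
 * A device that still times out gets ATS disabled first and is flagged
 * afterwards, so that later flushes skip it. Returns how many devices were
 * newly quarantined.
 */
static size_t isolate_broken(struct fake_dev *devs, size_t n)
{
	size_t i, broken = 0;

	assert(devs || !n);
	for (i = 0; i < n; i++) {
		/* Already quarantined, or the retry succeeded: nothing to do */
		if (devs[i].ats_broken || issue_atc_inv(&devs[i]))
			continue;
		devs[i].ats_enabled = false;	/* clear EATS first */
		devs[i].ats_broken = true;	/* then publish the flag */
		broken++;
	}
	return broken;
}
```

A second call on the same array reports no new quarantines, which is the
"suppress timeout spam" effect in miniature.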

This is on GitHub:
https://github.com/nicolinc/iommufd/commits/smmuv3_atc_timeout-v4

Changelog
v4:
 * Rebase on Joerg's IOMMU "fixes" branch
 * Rebase on Jason's SMMUv3 cmd_ent series
   https://lore.kernel.org/all/0-v2-47b2bf710ad5+716ac-smmu_no_cmdq_ent_jgg@nvidia.com/
 * [PCI] Don't suspend IOMMU in probe mode
 * [iommu] kfree_rcu() iommu_group
 * [iommu] Convert gdev->blocked to enum gdev_blocked
 * [iommu] Use disable_work_sync() to fix UAF and ref leak
 * [iommu] Gate done() transitions to preserve BLOCKED_BROKEN
 * [iommu] Decrement recovery_cnt when unplugging a blocked gdev
 * [iommu] Drop racy dev_has_iommu() in iommu_report_device_broken()
 * [iommu] Add gdev->broken_pending to skip worker after racing recovery
 * [smmuv3] Add master->ats_invs scratch
 * [smmuv3] Add arm_smmu_cmdq_batch_issue() wrapper
 * [smmuv3] Force per-flush sync for has_ats batches
 * [smmuv3] Serialize STE.EATS and ats_broken updates
 * [smmuv3] Co-clear pending CMDQ_ERR from cmdq issuer
 * [smmuv3] Add invs and has_ats to arm_smmu_cmdq_batch
 * [smmuv3] Move arm_smmu_invs_for_each_entry to header
 * [smmuv3] Set master->ats_broken after clearing STE.EATS
 * [smmuv3] Issue CFGI_STE via arm_smmu_cmdq_issue_cmd_with_sync()
 * [smmuv3] Keep "smmu" pointer in arm_smmu_inv but add "master" for ATS
v3:
 https://lore.kernel.org/all/cover.1776381841.git.nicolinc@nvidia.com/
 * Rebase on arm/smmu/updates branch + bug fix
 * Update commit messages and inline comments
 * [iommu] Drop unnecessary ops validation
 * [iommu] Add missed function stub when !CONFIG_IOMMU_API
 * [iommu] Change iommu_report_device_broken() to per gdev
 * [iommu] Separate quarantine from pci_dev_reset_prepare()
 * [iommu] Check reset failure in pci_dev_reset_iommu_done()
 * [smmuv3] Fix STE update with try_cmpxchg64()
 * [smmuv3] Fix "continue" bug when skipping ATC commands
 * [smmuv3] Replace atomic_t prod_err with a lockless bitmap
 * [smmuv3] Drop master->invs_domain; disable ATS per-master directly
 * [smmuv3] Return -EIO for ATC timeout vs. -ETIMEDOUT for poll timeout
 * [smmuv3] Replace INV_TYPE_ATS_DISABLED with per-master ats_broken flag
v2:
 https://lore.kernel.org/all/cover.1773774441.git.nicolinc@nvidia.com/
 * Rebase on arm_smmu_invs-v13 series
 * Bisect batched ATC invalidation commands
 * Drop the direct pci_reset_function() call
 * Move the work queue from SMMUv3 to the core
 * Perform a surgical STE update to disable EATS
 * Wait for pci_dev_reset_iommu_done() to signal a recovery
v1:
 https://lore.kernel.org/all/cover.1772686998.git.nicolinc@nvidia.com/

Thanks
Nicolin

Nicolin Chen (24):
  PCI: Don't suspend IOMMU when probing reset capability
  PCI: Propagate FLR return values to callers
  iommu: Convert gdev->blocked from bool to enum gdev_blocked
  iommu: Pass in reset result to pci_dev_reset_iommu_done()
  iommu: Add reset_device_done callback for hardware fault recovery
  iommu: Defer iommu_group free via kfree_rcu()
  iommu: Defer __iommu_group_free_device() to be outside group->mutex
  iommu: Change group->devices to RCU-protected list
  iommu: Add group pointer to struct group_device
  iommu: Add __iommu_group_block_device helper
  iommu: Add iommu_report_device_broken() to quarantine a broken device
  iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM
  iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback
  iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out
  iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when queue_has_space()
    fails
  iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries
  iommu/arm-smmu-v3: Introduce master->ats_broken flag
  iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch
  iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper
  iommu/arm-smmu-v3: Move arm_smmu_invs_for_each_entry to header
  iommu/arm-smmu-v3: Introduce master->ats_invs
  iommu/arm-smmu-v3: Serialize STE.EATS and ats_broken updates
  iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  72 +++-
 include/linux/iommu.h                         |  18 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 387 ++++++++++++++---
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c    |  36 +-
 drivers/iommu/iommu.c                         | 406 ++++++++++++++----
 drivers/pci/pci-acpi.c                        |   2 +-
 drivers/pci/pci.c                             |  21 +-
 drivers/pci/quirks.c                          |  43 +-
 8 files changed, 820 insertions(+), 165 deletions(-)

-- 
2.43.0

* [PATCH v4 01/24] PCI: Don't suspend IOMMU when probing reset capability
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 02/24] PCI: Propagate FLR return values to callers Nicolin Chen
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

reset_method_store() in drivers/pci/pci-sysfs.c discovers supported reset
methods by calling reset_fn(pdev, PCI_RESET_PROBE, ...) without holding a
device_lock, since the probe path is expected to query the device's reset
capability without changing device state.

However, pci_reset_bus_function() and __pci_dev_specific_reset() have
violated that contract since pci_dev_reset_iommu_prepare/done() were added:
these calls move the device into a blocking domain and abruptly abort any
in-flight DMA. Doing this for a probe -- a state-query call that does not
even hold device_lock -- can cause driver timeouts and data loss on a DMAing
device.

The peer reset helpers all handle this correctly: they short-circuit on a
probe input before touching the IOMMU.

Skip pci_dev_reset_iommu_prepare()/_done() entirely when probe is set. The
inner reset routines already implement their own probe semantics, and they
perform the capability checks and return without changing device state.
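
The intended control flow can be modeled in userspace as below. This is a
sketch, not the kernel code: struct fake_pdev and all fake_*() helpers are
hypothetical, and the blocking domain is reduced to a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device model; it only counts how often DMA was blocked. */
struct fake_pdev {
	int iommu_blocks;	/* times prepare() moved it to a blocking domain */
	bool can_reset;		/* capability reported by the inner routine */
};

static int fake_iommu_prepare(struct fake_pdev *pdev)
{
	pdev->iommu_blocks++;	/* in-flight DMA would be aborted here */
	return 0;
}

static void fake_iommu_done(struct fake_pdev *pdev)
{
	(void)pdev;		/* re-attach would happen here */
}

/* Inner routine with its own probe semantics: a probe changes no state. */
static int fake_inner_reset(struct fake_pdev *pdev, bool probe)
{
	(void)probe;
	return pdev->can_reset ? 0 : -ENOTTY;
}

/* Wrapper that touches the IOMMU only when a reset is really attempted. */
static int fake_reset(struct fake_pdev *pdev, bool probe)
{
	int rc;

	assert(pdev);
	if (!probe) {
		rc = fake_iommu_prepare(pdev);
		if (rc)
			return rc;
	}

	rc = fake_inner_reset(pdev, probe);
	if (!probe)
		fake_iommu_done(pdev);
	return rc;
}
```

Calling fake_reset() with probe set leaves iommu_blocks untouched, which is
the property the probe path relies on.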

Fixes: f5b16b802174 ("PCI: Suspend iommu function prior to resetting a device")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/pci/pci.c    | 13 ++++++++-----
 drivers/pci/quirks.c | 13 ++++++++-----
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d34266651ad09..d0af8b5eca2ce 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4914,10 +4914,12 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
 	if (bridge && pcie_is_cxl(bridge) && cxl_sbr_masked(bridge))
 		return -ENOTTY;
 
-	rc = pci_dev_reset_iommu_prepare(dev);
-	if (rc) {
-		pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", rc);
-		return rc;
+	if (!probe) {
+		rc = pci_dev_reset_iommu_prepare(dev);
+		if (rc) {
+			pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", rc);
+			return rc;
+		}
 	}
 
 	rc = pci_dev_reset_slot_function(dev, probe);
@@ -4926,7 +4928,8 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
 
 	rc = pci_parent_bus_reset(dev, probe);
 done:
-	pci_dev_reset_iommu_done(dev);
+	if (!probe)
+		pci_dev_reset_iommu_done(dev);
 	return rc;
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index caaed1a01dc02..a344abd745947 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4260,14 +4260,17 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
 {
 	int ret;
 
-	ret = pci_dev_reset_iommu_prepare(dev);
-	if (ret) {
-		pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
-		return ret;
+	if (!probe) {
+		ret = pci_dev_reset_iommu_prepare(dev);
+		if (ret) {
+			pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
+			return ret;
+		}
 	}
 
 	ret = i->reset(dev, probe);
-	pci_dev_reset_iommu_done(dev);
+	if (!probe)
+		pci_dev_reset_iommu_done(dev);
 	return ret;
 }
 
-- 
2.43.0



* [PATCH v4 02/24] PCI: Propagate FLR return values to callers
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 01/24] PCI: Don't suspend IOMMU when probing reset capability Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 03/24] iommu: Convert gdev->blocked from bool to enum gdev_blocked Nicolin Chen
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

A reset failure implies that the device might be unreliable, e.g. its ATC
might still retain stale entries. The IOMMU layer therefore cannot trust the
device to resume its ATS function, as doing so can lead to memory corruption.
So, pci_dev_reset_iommu_done() won't recover the device's IOMMU pathway if
the device reset fails.

The quirk functions in the pci_dev_reset_methods array invoke pcie_flr(),
but do not check its return value. Propagate it to the callers.

Also propagate device-internal ack timeouts in reset_hinic_vf_dev().

Note: this change does not introduce any early return on failure, keeping
the status quo. It only propagates the error code properly.

Given that these functions have been running okay, and that the return
values will only be needed by upcoming work, this is not treated as a bug
fix.
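
The pattern applied to each quirk can be sketched as follows. The names
(struct fake_quirk_dev, fake_flr(), fake_quirk_reset()) are hypothetical and
only model the shape of the change: remember the FLR result, run the same
post-reset steps as before, then return the remembered result.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device model for a quirk that wraps an FLR. */
struct fake_quirk_dev {
	bool flr_ok;		/* whether the modeled FLR completes */
	bool state_restored;	/* set by the unconditional post-reset step */
};

/* Stand-in for pcie_flr(): 0 on success, negative errno on failure. */
static int fake_flr(const struct fake_quirk_dev *dev)
{
	return dev->flr_ok ? 0 : -ETIMEDOUT;
}

static int fake_quirk_reset(struct fake_quirk_dev *dev, bool probe)
{
	int ret;

	assert(dev);
	if (probe)
		return 0;

	ret = fake_flr(dev);

	/* No early return: the post-reset step still runs on failure */
	dev->state_restored = true;

	return ret;
}
```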

Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/pci/quirks.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index a344abd745947..6cded18c9a687 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3955,7 +3955,7 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, bool probe)
 	 * supported.
 	 */
 	if (!probe)
-		pcie_flr(dev);
+		return pcie_flr(dev);
 	return 0;
 }
 
@@ -4013,6 +4013,7 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
 {
 	u16 old_command;
 	u16 msix_flags;
+	int ret;
 
 	/*
 	 * If this isn't a Chelsio T4-based device, return -ENOTTY indicating
@@ -4058,16 +4059,15 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
 				      PCI_MSIX_FLAGS_ENABLE |
 				      PCI_MSIX_FLAGS_MASKALL);
 
-	pcie_flr(dev);
+	ret = pcie_flr(dev);
 
 	/*
 	 * Restore the configuration information (BAR values, etc.) including
-	 * the original PCI Configuration Space Command word, and return
-	 * success.
+	 * the original PCI Configuration Space Command word.
 	 */
 	pci_restore_state(dev);
 	pci_write_config_word(dev, PCI_COMMAND, old_command);
-	return 0;
+	return ret;
 }
 
 #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed
@@ -4150,9 +4150,7 @@ static int nvme_disable_and_flr(struct pci_dev *dev, bool probe)
 
 	pci_iounmap(dev, bar);
 
-	pcie_flr(dev);
-
-	return 0;
+	return pcie_flr(dev);
 }
 
 /*
@@ -4164,14 +4162,17 @@ static int nvme_disable_and_flr(struct pci_dev *dev, bool probe)
  */
 static int delay_250ms_after_flr(struct pci_dev *dev, bool probe)
 {
+	int ret;
+
 	if (probe)
 		return pcie_reset_flr(dev, PCI_RESET_PROBE);
 
-	pcie_reset_flr(dev, PCI_RESET_DO_RESET);
+	ret = pcie_reset_flr(dev, PCI_RESET_DO_RESET);
 
+	/* Settle the device even on a failed FLR */
 	msleep(250);
 
-	return 0;
+	return ret;
 }
 
 #define PCI_DEVICE_ID_HINIC_VF      0x375E
@@ -4187,6 +4188,7 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
 	unsigned long timeout;
 	void __iomem *bar;
 	u32 val;
+	int ret;
 
 	if (probe)
 		return 0;
@@ -4207,12 +4209,13 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
 	val = val | HINIC_VF_FLR_PROC_BIT;
 	iowrite32be(val, bar + HINIC_VF_OP);
 
-	pcie_flr(pdev);
+	ret = pcie_flr(pdev);
 
 	/*
 	 * The device must recapture its Bus and Device Numbers after FLR
 	 * in order generate Completions.  Issue a config write to let the
-	 * device capture this information.
+	 * device capture this information. Note that pcie_flr() can fail
+	 * after the reset is asserted. So, recapture it unconditionally.
 	 */
 	pci_write_config_word(pdev, PCI_VENDOR_ID, 0);
 
@@ -4230,11 +4233,12 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
 		goto reset_complete;
 
 	pci_warn(pdev, "Reset dev timeout, FLR ack reg: %#010x\n", val);
+	ret = -ETIMEDOUT;
 
 reset_complete:
 	pci_iounmap(pdev, bar);
 
-	return 0;
+	return ret;
 }
 
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
-- 
2.43.0



* [PATCH v4 03/24] iommu: Convert gdev->blocked from bool to enum gdev_blocked
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 01/24] PCI: Don't suspend IOMMU when probing reset capability Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 02/24] PCI: Propagate FLR return values to callers Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 04/24] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

The gdev->blocked flag tracks whether a device is individually being held
in the group->blocking_domain while group->domain is retained. Up to now,
a PCI reset in flight has been the only producer, so a bool suffices.

Subsequent changes will add more reasons to keep a device blocked, e.g. a
failed-reset case that must not auto-unblock, or a driver-side quarantine
for a hardware fault. These reasons are cleared by different events, which
a single bool cannot encode.

Convert "bool blocked" into "enum gdev_blocked blocked", provisioned with
two initial values: BLOCKED_NO and BLOCKED_RESETTING, for the existing use
cases. All readers keep the "if (gdev->blocked)" form, as BLOCKED_NO == 0.

This is a pure type change with no behavior change. Follow-on changes will
add new enum values along with their producers.
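
A short illustration of why the existing readers need no change. The types
below are hypothetical copies made for a userspace build, with a MODEL_
prefix to keep them apart from the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace copy of the idea: the "not blocked" value must stay 0. */
enum model_gdev_blocked {
	MODEL_BLOCKED_NO = 0,		/* not blocked */
	MODEL_BLOCKED_RESETTING,	/* PCI reset in flight */
};

struct model_gdev {
	enum model_gdev_blocked blocked;
};

/* A reader written for the old bool keeps working with the enum. */
static bool model_gdev_is_blocked(const struct model_gdev *gdev)
{
	assert(gdev);
	if (gdev->blocked)	/* any non-zero reason counts as blocked */
		return true;
	return false;
}
```

Any value added to the enum later is non-zero, so it is treated as blocked
by such readers without further edits.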

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e7bd28cc77eeb..c40f4bfc93352 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -73,16 +73,20 @@ struct iommu_group {
 	void *owner;
 };
 
+enum gdev_blocked {
+	BLOCKED_NO = 0, /* Not blocked */
+	BLOCKED_RESETTING, /* PCI reset in flight */
+};
+
 struct group_device {
 	struct list_head list;
 	struct device *dev;
 	char *name;
 	/*
 	 * Device is blocked for a pending recovery while its group->domain is
-	 * retained. This can happen when:
-	 *  - Device is undergoing a reset
+	 * retained.
 	 */
-	bool blocked;
+	enum gdev_blocked blocked;
 	unsigned int reset_depth;
 };
 
@@ -4083,7 +4087,7 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 	 * the correct domain in iommu_driver_get_domain_for_dev() that might be
 	 * called in a set_dev_pasid callback function.
 	 */
-	gdev->blocked = true;
+	gdev->blocked = BLOCKED_RESETTING;
 
 	/*
 	 * Stage PASID domains at blocking_domain while retaining pasid_array.
@@ -4209,7 +4213,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 	 * the correct domain in iommu_driver_get_domain_for_dev() that might be
 	 * called in a set_dev_pasid callback function.
 	 */
-	gdev->blocked = false;
+	gdev->blocked = BLOCKED_NO;
 
 	/*
 	 * Re-attach PASID domains back to the domains retained in pasid_array.
-- 
2.43.0



* [PATCH v4 04/24] iommu: Pass in reset result to pci_dev_reset_iommu_done()
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (2 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 03/24] iommu: Convert gdev->blocked from bool to enum gdev_blocked Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 05/24] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

IOMMU drivers handle ATC maintenance. They may encounter ATC-related errors
(e.g., an ATC invalidation timeout), indicating that the ATC may hold stale
entries that can corrupt memory. In this case, the IOMMU driver has no
choice but to block the device's ATS function and wait for a device
recovery.

The pci_dev_reset_iommu_done() called at the end of a reset function could
serve as a reliable signal to the IOMMU subsystem that the physical device
cache is completely clean. However, the function is called unconditionally,
even if the reset operation actually failed, which re-attaches the faulty
device to a normal translation domain. This leaves the system exposed to
data corruption:
    IOMMU blocks RID/ATS
    pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Block RID/ATS
        __reset(); // Failed (ATC is still stale)
        pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)

Instead, pass in @reset_result to pci_dev_reset_iommu_done() from callers:
    IOMMU blocks RID/ATS
    pci_reset_function():
        pci_dev_reset_iommu_prepare(); // Block RID/ATS
        rc = __reset();
        pci_dev_reset_iommu_done(rc); // Unblock or quarantine

On a successful reset, done() restores the device to its RID/PASID domains
and decrements group->recovery_cnt. On failure, the device remains blocked,
and concurrent domain attachment will be rejected until a successful reset.

Note: -ENOTTY is overloaded by the PCI reset functions. Some use it to mean
"reset was not attempted", while others use it to mean "the current method
failed; try the next reset method". The IOMMU core would have to react to
these two outcomes differently, but it cannot tell them apart, so it has no
choice but to keep the device blocked on -ENOTTY as well. Leave an inline
FIXME and a warning.

This introduces a new situation where a blocked device is being unplugged.
Decrement the group->recovery_cnt accordingly.
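
The resulting transitions can be summarized by a small userspace state
machine. This is a simplified sketch: all model_*() names are hypothetical,
reset_depth nesting and DMA-alias handling are left out, recovery_cnt is
kept per device rather than per group, and it is assumed to be incremented
when prepare() first blocks the device.

```c
#include <assert.h>

enum model_blocked {
	M_BLOCKED_NO = 0,	/* not blocked */
	M_BLOCKED_RESETTING,	/* PCI reset in flight */
	M_BLOCKED_RESET_FAILED,	/* PCI reset failed */
};

struct model_gdev {
	enum model_blocked blocked;
	unsigned int recovery_cnt;	/* lives in the group in the patch */
};

static void model_prepare(struct model_gdev *gdev)
{
	assert(gdev);
	/* Already held by an earlier failed reset: keep the state as is */
	if (gdev->blocked)
		return;
	gdev->blocked = M_BLOCKED_RESETTING;
	gdev->recovery_cnt++;
}

static void model_done(struct model_gdev *gdev, int reset_result)
{
	assert(gdev);
	if (reset_result) {
		/* Failure: stay blocked and leave recovery_cnt intact */
		if (gdev->blocked == M_BLOCKED_RESETTING)
			gdev->blocked = M_BLOCKED_RESET_FAILED;
		return;
	}
	if (!gdev->blocked)
		return;
	/* Success: unblock and drop the count taken when first blocked */
	gdev->blocked = M_BLOCKED_NO;
	if (gdev->recovery_cnt)
		gdev->recovery_cnt--;
}
```

A failed reset followed by a successful one ends with the device unblocked
and the count back at zero, so the count stays balanced across retries.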

Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommu.h  |  5 ++--
 drivers/iommu/iommu.c  | 62 ++++++++++++++++++++++++++++++++++++++++--
 drivers/pci/pci-acpi.c |  2 +-
 drivers/pci/pci.c      | 10 +++----
 drivers/pci/quirks.c   |  2 +-
 5 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e587d4ac4d331..e191d30d228ac 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1195,7 +1195,7 @@ void iommu_free_global_pasid(ioasid_t pasid);
 
 /* PCI device reset functions */
 int pci_dev_reset_iommu_prepare(struct pci_dev *pdev);
-void pci_dev_reset_iommu_done(struct pci_dev *pdev);
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result);
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
@@ -1525,7 +1525,8 @@ static inline int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 	return 0;
 }
 
-static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev,
+					    int reset_result)
 {
 }
 #endif /* CONFIG_IOMMU_API */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c40f4bfc93352..6c92b7a2b14cc 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -76,6 +76,7 @@ struct iommu_group {
 enum gdev_blocked {
 	BLOCKED_NO = 0, /* Not blocked */
 	BLOCKED_RESETTING, /* PCI reset in flight */
+	BLOCKED_RESET_FAILED, /* PCI reset failed */
 };
 
 struct group_device {
@@ -763,6 +764,9 @@ static void __iommu_group_remove_device(struct device *dev)
 		if (device->dev != dev)
 			continue;
 
+		/* Must drop the recovery_cnt when removing a blocked device */
+		if (device->blocked && !WARN_ON(group->recovery_cnt == 0))
+			group->recovery_cnt--;
 		list_del(&device->list);
 		__iommu_group_free_device(group, device);
 		if (dev_has_iommu(dev))
@@ -4036,7 +4040,12 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
  * reset is finished, pci_dev_reset_iommu_done() can restore everything.
  *
  * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to decrement the recovery_cnt.
+ * before/after the core-level reset routine. On a successful reset, done() will
+ * decrement group->recovery_cnt and restore domains. On a failure, recovery_cnt
+ * is left intact and the device stays blocked.
+ *
+ * Callers must skip pci_dev_reset_iommu_prepare/done() entirely when no reset
+ * is attempted (e.g. probe mode).
  *
  * Return: 0 on success or negative error code if the preparation failed.
  *
@@ -4066,6 +4075,10 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 	if (gdev->reset_depth++)
 		return 0;
 
+	/* Device might be already blocked for a quarantine */
+	if (gdev->blocked)
+		return 0;
+
 	ret = __iommu_group_alloc_blocking_domain(group);
 	if (ret) {
 		gdev->reset_depth--;
@@ -4147,20 +4160,28 @@ static bool group_device_dma_alias_is_blocked(struct iommu_group *group,
 /**
  * pci_dev_reset_iommu_done() - Restore IOMMU after a PCI device reset is done
  * @pdev: PCI device that has finished a reset routine
+ * @reset_result: Return code from the reset routine
  *
  * After a PCIe device finishes a reset routine, it wants to restore its IOMMU
  * activity, including new translation and cache invalidation, by re-attaching
  * all RID/PASID of the device back to the domains retained in the core-level
  * structure.
  *
- * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller passes
+ * the reset return value to @reset_result. On a failed reset, the device will
+ * remain blocked as a quarantine measure, with group->recovery_cnt intact, to
+ * protect system memory until a subsequent successful reset.
+ *
+ * Callers must skip pci_dev_reset_iommu_prepare/done() entirely when no reset
+ * is attempted (e.g. probe mode).
  *
  * Note that, although unlikely, there is a risk that re-attaching domains might
  * fail due to some unexpected happening like OOM.
  */
-void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result)
 {
 	struct iommu_group *group = pdev->dev.iommu_group;
+	enum gdev_blocked old_gdev_blocked;
 	struct group_device *gdev;
 	unsigned long pasid;
 	void *entry;
@@ -4183,6 +4204,37 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 	if (WARN_ON(!group->blocking_domain))
 		return;
 
+	/*
+	 * A reset failure implies that the device might be unreliable. E.g. its
+	 * device cache might retain stale entries, which might result in memory
+	 * corruption. Thus, do not unblock the device until a successful reset.
+	 */
+	if (reset_result) {
+		/*
+		 * FIXME: the int-return values from the PCI reset functions are
+		 * not consistent: some reset functions use -ENOTTY to indicate
+		 * "no reset was attempted" (in which case IOMMU should revert a
+		 * prepare), while others use -ENOTTY to indicate "reset failed;
+		 * try the next reset method" (in which case IOMMU should keep
+		 * the device blocked). Without fixing the PCI return result, we
+		 * cannot tell the difference between the two cases. Warn it.
+		 */
+		if (reset_result == -ENOTTY)
+			dev_warn_ratelimited(
+				&pdev->dev,
+				"Reset may have been skipped. Keep it blocked conservatively\n");
+		else
+			dev_err_ratelimited(
+				&pdev->dev,
+				"Reset failed. Keep it blocked to protect memory\n");
+		if (gdev->blocked == BLOCKED_RESETTING)
+			gdev->blocked = BLOCKED_RESET_FAILED;
+		return;
+	}
+
+	if (WARN_ON(!gdev->blocked))
+		return;
+
 	if (group_device_dma_alias_is_blocked(group, gdev)) {
 		/*
 		 * FIXME: DMA aliased devices share the same RID, which would be
@@ -4213,6 +4265,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 	 * the correct domain in iommu_driver_get_domain_for_dev() that might be
 	 * called in a set_dev_pasid callback function.
 	 */
+	old_gdev_blocked = gdev->blocked;
 	gdev->blocked = BLOCKED_NO;
 
 	/*
@@ -4234,6 +4287,9 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
 
 	if (!WARN_ON(group->recovery_cnt == 0))
 		group->recovery_cnt--;
+
+	if (old_gdev_blocked > BLOCKED_RESETTING)
+		pci_info(pdev, "Device is unblocked after successful reset\n");
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
 
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 4d0f2cb6c695b..280d7193cb4ca 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -977,7 +977,7 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
 		ret = -ENOTTY;
 	}
 
-	pci_dev_reset_iommu_done(dev);
+	pci_dev_reset_iommu_done(dev, ret);
 	return ret;
 }
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d0af8b5eca2ce..b71e3e10c7b52 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4355,7 +4355,7 @@ int pcie_flr(struct pci_dev *dev)
 
 	ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
 done:
-	pci_dev_reset_iommu_done(dev);
+	pci_dev_reset_iommu_done(dev, ret);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4433,7 +4433,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
 
 	ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
 done:
-	pci_dev_reset_iommu_done(dev);
+	pci_dev_reset_iommu_done(dev, ret);
 	return ret;
 }
 
@@ -4487,7 +4487,7 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
 	pci_dev_d3_sleep(dev);
 
 	ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
-	pci_dev_reset_iommu_done(dev);
+	pci_dev_reset_iommu_done(dev, ret);
 	return ret;
 }
 
@@ -4929,7 +4929,7 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
 	rc = pci_parent_bus_reset(dev, probe);
 done:
 	if (!probe)
-		pci_dev_reset_iommu_done(dev);
+		pci_dev_reset_iommu_done(dev, rc);
 	return rc;
 }
 
@@ -4974,7 +4974,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
 		pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
 				      reg);
 
-	pci_dev_reset_iommu_done(dev);
+	pci_dev_reset_iommu_done(dev, rc);
 	return rc;
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6cded18c9a687..39b1c6250a4d0 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4274,7 +4274,7 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
 
 	ret = i->reset(dev, probe);
 	if (!probe)
-		pci_dev_reset_iommu_done(dev);
+		pci_dev_reset_iommu_done(dev, ret);
 	return ret;
 }
 
-- 
2.43.0



* [PATCH v4 05/24] iommu: Add reset_device_done callback for hardware fault recovery
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (3 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 04/24] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 06/24] iommu: Defer iommu_group free via kfree_rcu() Nicolin Chen
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

When IOMMU hardware detects an error due to a faulty device (e.g. an ATS
invalidation timeout), IOMMU drivers may quarantine the device by disabling
specific hardware features or dropping translation capabilities.

To recover from these states, the IOMMU driver needs a reliable signal that
the underlying physical hardware has been cleanly reset (e.g., via PCIe AER
or a sysfs Function Level Reset) so as to lift the quarantine.

Introduce a reset_device_done callback in struct iommu_ops. Trigger it from
the existing pci_dev_reset_iommu_done() path to notify the underlying IOMMU
driver that the device's internal state has been sanitized, when the result
indicates a successful physical reset.

As the initial use case, this will be used by ATS-capable PCI devices.
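
For illustration only, the dispatch described above can be modelled in
plain userspace C. This is a sketch, not the kernel code: every model_*
name below is hypothetical, and the assumption that a non-zero reset result
means failure is taken from the pci_dev_reset_iommu_done(dev, ret) callers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device. */
struct model_device {
	int quarantined;	/* set by the driver when it detects a fault */
};

/* Hypothetical stand-in for struct iommu_ops. */
struct model_iommu_ops {
	/* Optional: drivers without quarantine support leave this NULL. */
	void (*reset_device_done)(struct model_device *dev);
};

/* Example driver callback: lift the driver-level quarantine. */
static void model_driver_reset_done(struct model_device *dev)
{
	dev->quarantined = 0;
}

/*
 * Mirrors the shape of the reset-done path: a non-zero reset_result means
 * the reset failed, so the device stays quarantined and no callback fires.
 */
static void model_reset_iommu_done(const struct model_iommu_ops *ops,
				   struct model_device *dev, int reset_result)
{
	if (reset_result)
		return;
	if (ops->reset_device_done)
		ops->reset_device_done(dev);
}
```

The callback is optional in this model, so a driver that never quarantines
a device needs no change.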

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommu.h |  4 ++++
 drivers/iommu/iommu.c | 12 ++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e191d30d228ac..6c124e9e9af8b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -629,6 +629,9 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
  *                  group and attached to the groups domain
+ * @reset_device_done: Notify the driver that a device has reset successfully.
+ *                     Note that the core invokes the callback function while
+ *                     holding the group->mutex
  * @device_group: find iommu group for a particular device
  * @get_resv_regions: Request list of reserved regions for a device
  * @of_xlate: add OF master IDs to iommu grouping
@@ -686,6 +689,7 @@ struct iommu_ops {
 	struct iommu_device *(*probe_device)(struct device *dev);
 	void (*release_device)(struct device *dev);
 	void (*probe_finalize)(struct device *dev);
+	void (*reset_device_done)(struct device *dev);
 	struct iommu_group *(*device_group)(struct device *dev);
 
 	/* Request/Free a list of reserved regions for a device */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 6c92b7a2b14cc..e68c7b142ad5a 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -4182,12 +4182,14 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result)
 {
 	struct iommu_group *group = pdev->dev.iommu_group;
 	enum gdev_blocked old_gdev_blocked;
+	const struct iommu_ops *ops;
 	struct group_device *gdev;
 	unsigned long pasid;
 	void *entry;
 
 	if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
 		return;
+	ops = dev_iommu_ops(&pdev->dev);
 
 	guard(mutex)(&group->mutex);
 
@@ -4249,6 +4251,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result)
 			 "DMA-aliased sibling may be prematurely unblocked\n");
 	}
 
+	/*
+	 * A PCI device might have been in an error state, so the IOMMU driver
+	 * had to quarantine the device by disabling specific hardware features
+	 * or dropping translation capability. Here notify the IOMMU driver as
+	 * a reliable signal that the faulty PCI device has been cleanly reset
+	 * so now it can lift its quarantine and restore full functionality.
+	 */
+	if (ops->reset_device_done)
+		ops->reset_device_done(&pdev->dev);
+
 	/*
 	 * Re-attach RID domain back to group->domain
 	 *
-- 
2.43.0




* [PATCH v4 06/24] iommu: Defer iommu_group free via kfree_rcu()
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (4 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 05/24] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19 11:39   ` Jason Gunthorpe
  2026-05-19  3:38 ` [PATCH v4 07/24] iommu: Defer __iommu_group_free_device() to be outside group->mutex Nicolin Chen
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

dev->iommu_group will be read in ISR context to look up a group_device for
fault reporting, where a mutex cannot be taken. For that read to be safe,
two things are needed:

 (1) The iommu_group memory that dev->iommu_group points to must outlive
     any in-flight rcu_read_lock section. Add rcu_head to iommu_group and
     switch iommu_group_release() to calling kfree_rcu().

 (2) The publication of dev->iommu_group must pair with rcu_dereference()
     at the upcoming reader, which can hold rcu_read_lock() but not the
     mutex, so the writers must use rcu_assign_pointer().

Existing readers do not use rcu_dereference(); they retain their current
synchronization model. Apply a __rcu __force cast at the writer sites to
satisfy sparse without forcing every reader to convert.

The new reader added by a subsequent change uses rcu_dereference() only to
reach group->devices for a list lookup; it does not touch group->name or
any other field. The kfree_rcu() here is meant to keep group->devices alive
across the read-side critical section; the other fields do not affect the
reader.

Note: this change alone does not yet make group->devices iteration safe
under rcu_read_lock(); a subsequent change will convert the group_device
list to RCU and switch struct group_device to kfree_rcu().
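
As a rough userspace analogy of the publish/read pairing in (2), the sketch
below uses C11 release/acquire atomics in place of rcu_assign_pointer() and
rcu_dereference(). All model_* names are hypothetical. It deliberately does
not model the grace period from (1): nothing here defers a free the way
kfree_rcu() does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_group {
	int id;
};

struct model_device {
	/* Plays the role of dev->iommu_group. */
	_Atomic(struct model_group *) group;
};

/* Writer side: stands in for rcu_assign_pointer(). */
static void model_publish_group(struct model_device *dev,
				struct model_group *group)
{
	atomic_store_explicit(&dev->group, group, memory_order_release);
}

/* Reader side: stands in for rcu_dereference() under rcu_read_lock(). */
static struct model_group *model_read_group(struct model_device *dev)
{
	return atomic_load_explicit(&dev->group, memory_order_acquire);
}

/* Returns the group id, or -1 if the device has been unpublished. */
static int model_lookup_group_id(struct model_device *dev)
{
	struct model_group *group = model_read_group(dev);

	return group ? group->id : -1;
}
```

The reader loads the pointer once and tests it for NULL before use, which
is the same discipline the upcoming RCU reader needs once writers can clear
dev->iommu_group concurrently.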

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e68c7b142ad5a..6727b6f7797bd 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -71,8 +71,12 @@ struct iommu_group {
 	 */
 	unsigned int recovery_cnt;
 	void *owner;
+	struct rcu_head rcu;
 };
 
+#define dev_iommu_group_rcu(dev) \
+	(*((struct iommu_group __rcu __force **)&(dev)->iommu_group))
+
 enum gdev_blocked {
 	BLOCKED_NO = 0, /* Not blocked */
 	BLOCKED_RESETTING, /* PCI reset in flight */
@@ -531,7 +535,7 @@ static int iommu_init_device(struct device *dev)
 		ret = PTR_ERR(group);
 		goto err_unlink;
 	}
-	dev->iommu_group = group;
+	rcu_assign_pointer(dev_iommu_group_rcu(dev), group);
 
 	dev->iommu->max_pasids = dev_iommu_get_max_pasids(dev);
 	if (ops->is_attach_deferred)
@@ -613,7 +617,7 @@ static void iommu_deinit_device(struct device *dev)
 	}
 
 	/* Caller must put iommu_group */
-	dev->iommu_group = NULL;
+	rcu_assign_pointer(dev_iommu_group_rcu(dev), NULL);
 	module_put(ops->owner);
 	dev_iommu_free(dev);
 #ifdef CONFIG_IOMMU_DMA
@@ -772,7 +776,7 @@ static void __iommu_group_remove_device(struct device *dev)
 		if (dev_has_iommu(dev))
 			iommu_deinit_device(dev);
 		else
-			dev->iommu_group = NULL;
+			rcu_assign_pointer(dev_iommu_group_rcu(dev), NULL);
 		break;
 	}
 	mutex_unlock(&group->mutex);
@@ -1059,7 +1063,7 @@ static void iommu_group_release(struct kobject *kobj)
 	WARN_ON(group->blocking_domain);
 
 	kfree(group->name);
-	kfree(group);
+	kfree_rcu(group, rcu);
 }
 
 static const struct kobj_type iommu_group_ktype = {
@@ -1344,7 +1348,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 		return PTR_ERR(gdev);
 
 	iommu_group_ref_get(group);
-	dev->iommu_group = group;
+	rcu_assign_pointer(dev_iommu_group_rcu(dev), group);
 
 	mutex_lock(&group->mutex);
 	list_add_tail(&gdev->list, &group->devices);
-- 
2.43.0




* [PATCH v4 07/24] iommu: Defer __iommu_group_free_device() to be outside group->mutex
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (5 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 06/24] iommu: Defer iommu_group free via kfree_rcu() Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19 11:47   ` Jason Gunthorpe
  2026-05-19  3:38 ` [PATCH v4 08/24] iommu: Change group->devices to RCU-protected list Nicolin Chen
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

__iommu_group_remove_device() holds group->mutex across the entire call to
__iommu_group_free_device() that performs sysfs removals, tracing, and the
final kfree(). But in fact, most of these operations don't really need the
group->mutex.

Subsequent changes will introduce sleepable operations to this function:
 + synchronize_rcu() to defer the gdev->dev put past a grace period.
 + disable_work_sync() to cancel a future broken_work.
Neither should run while holding group->mutex. Thus, move them outside.

Split the assertion out of __iommu_group_free_device() into a new helper,
__iommu_group_empty_assert_owner_cnt(). While moving it, revise the inline
comment to make it clearer.

Defer the __iommu_group_free_device() until the mutex is released.

This is a preparatory refactor with no functional change.
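
The "unlink under the lock, free after dropping it" shape can be sketched
in userspace C as follows. This is a toy model under stated assumptions:
the model_* names are hypothetical, the lock only tracks whether it is
held, and a singly linked list replaces the kernel list helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct model_gdev {
	int id;
	struct model_gdev *next;
};

struct model_group {
	bool mutex_held;		/* toy lock: tracks ownership only */
	struct model_gdev *devices;
};

static void model_lock(struct model_group *g)
{
	assert(!g->mutex_held);
	g->mutex_held = true;
}

static void model_unlock(struct model_group *g)
{
	assert(g->mutex_held);
	g->mutex_held = false;
}

/* Stands in for a free path that may sleep: must run with the lock dropped. */
static void model_free_device(struct model_group *g, struct model_gdev *gdev)
{
	assert(!g->mutex_held);
	free(gdev);
}

/* Returns true if a device with @id was found and removed. */
static bool model_remove_device(struct model_group *g, int id)
{
	struct model_gdev **pp, *to_free = NULL;

	model_lock(g);
	for (pp = &g->devices; *pp; pp = &(*pp)->next) {
		if ((*pp)->id != id)
			continue;
		to_free = *pp;
		*pp = to_free->next;	/* unlink while the list is stable */
		break;
	}
	model_unlock(g);

	/* The entry is no longer reachable, so freeing it needs no lock. */
	if (to_free)
		model_free_device(g, to_free);
	return to_free != NULL;
}
```

Carrying the unlinked entry in a local (to_free here) is what lets the free
path run sleepable operations later without holding the group lock.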

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 6727b6f7797bd..2f8f3ea13f490 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -634,6 +634,19 @@ static struct iommu_domain *pasid_array_entry_to_domain(void *entry)
 
 DEFINE_MUTEX(iommu_probe_device_lock);
 
+static void __iommu_group_empty_assert_owner_cnt(struct iommu_group *group)
+{
+	lockdep_assert_held(&group->mutex);
+	/*
+	 * If the group has become empty, the ownership must have been released,
+	 * and the current domain must be set back to the default domain (which
+	 * itself can be NULL).
+	 */
+	if (list_empty(&group->devices))
+		WARN_ON(group->owner_cnt ||
+			group->domain != group->default_domain);
+}
+
 static int __iommu_probe_device(struct device *dev, struct list_head *group_list)
 {
 	struct iommu_group *group;
@@ -707,10 +720,12 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 
 err_remove_gdev:
 	list_del(&gdev->list);
-	__iommu_group_free_device(group, gdev);
+	__iommu_group_empty_assert_owner_cnt(group);
 err_put_group:
 	iommu_deinit_device(dev);
 	mutex_unlock(&group->mutex);
+	if (!IS_ERR(gdev))
+		__iommu_group_free_device(group, gdev);
 	iommu_group_put(group);
 
 	return ret;
@@ -739,20 +754,13 @@ static void __iommu_group_free_device(struct iommu_group *group,
 {
 	struct device *dev = grp_dev->dev;
 
+	lockdep_assert_not_held(&group->mutex);
+
 	sysfs_remove_link(group->devices_kobj, grp_dev->name);
 	sysfs_remove_link(&dev->kobj, "iommu_group");
 
 	trace_remove_device_from_group(group->id, dev);
 
-	/*
-	 * If the group has become empty then ownership must have been
-	 * released, and the current domain must be set back to NULL or
-	 * the default domain.
-	 */
-	if (list_empty(&group->devices))
-		WARN_ON(group->owner_cnt ||
-			group->domain != group->default_domain);
-
 	kfree(grp_dev->name);
 	kfree(grp_dev);
 }
@@ -761,7 +769,7 @@ static void __iommu_group_free_device(struct iommu_group *group,
 static void __iommu_group_remove_device(struct device *dev)
 {
 	struct iommu_group *group = dev->iommu_group;
-	struct group_device *device;
+	struct group_device *device, *to_free = NULL;
 
 	mutex_lock(&group->mutex);
 	for_each_group_device(group, device) {
@@ -772,15 +780,18 @@ static void __iommu_group_remove_device(struct device *dev)
 		if (device->blocked && !WARN_ON(group->recovery_cnt == 0))
 			group->recovery_cnt--;
 		list_del(&device->list);
-		__iommu_group_free_device(group, device);
+		__iommu_group_empty_assert_owner_cnt(group);
 		if (dev_has_iommu(dev))
 			iommu_deinit_device(dev);
 		else
 			rcu_assign_pointer(dev_iommu_group_rcu(dev), NULL);
+		to_free = device;
 		break;
 	}
 	mutex_unlock(&group->mutex);
 
+	if (to_free)
+		__iommu_group_free_device(group, to_free);
 	/*
 	 * Pairs with the get in iommu_init_device() or
 	 * iommu_group_add_device()
-- 
2.43.0




* [PATCH v4 08/24] iommu: Change group->devices to RCU-protected list
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (6 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 07/24] iommu: Defer __iommu_group_free_device() to be outside group->mutex Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 09/24] iommu: Add group pointer to struct group_device Nicolin Chen
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

To allow lockless iterations of the group->devices list in an ISR context
that cannot hold the group->mutex, change the list to be RCU protected.

Mark the existing __dev_to_gdev() as valid for the group->mutex case only.
A subsequent change will add a __dev_to_gdev_rcu() for the RCU case.

Hold grp_dev->dev across the RCU grace period using synchronize_rcu(), in
__iommu_group_free_device(). Without that, the driver core might free the
struct device while an RCU reader is still mid-iteration.

Note: a call_rcu() callback runs in softirq context, but put_device() may
sleep -- the device release path can invoke devres_release_all() and
->release callbacks that take mutexes. Use synchronize_rcu() to defer the
put_device() to the (sleepable) caller context instead.

Note that bus_iommu_probe() has a for_each_group_device marked as FIXME
that can take neither the mutex nor the RCU read lock. Replace it with a
plain list_for_each_entry to preserve the status quo.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2f8f3ea13f490..4116b28258bde 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -97,8 +97,10 @@ struct group_device {
 
 /* Iterate over each struct group_device in a struct iommu_group */
 #define for_each_group_device(group, pos) \
-	list_for_each_entry(pos, &(group)->devices, list)
+	list_for_each_entry_rcu(pos, &(group)->devices, list, \
+				lockdep_is_held(&(group)->mutex))
 
+/* Caller must hold dev->iommu_group->mutex. */
 static struct group_device *__dev_to_gdev(struct device *dev)
 {
 	struct iommu_group *group = dev->iommu_group;
@@ -688,7 +690,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	 * The gdev must be in the list before calling
 	 * iommu_setup_default_domain()
 	 */
-	list_add_tail(&gdev->list, &group->devices);
+	list_add_tail_rcu(&gdev->list, &group->devices);
 	WARN_ON(group->default_domain && !group->domain);
 	if (group->default_domain)
 		iommu_create_device_direct_mappings(group->default_domain, dev);
@@ -719,7 +721,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	return 0;
 
 err_remove_gdev:
-	list_del(&gdev->list);
+	list_del_rcu(&gdev->list);
 	__iommu_group_empty_assert_owner_cnt(group);
 err_put_group:
 	iommu_deinit_device(dev);
@@ -762,6 +764,10 @@ static void __iommu_group_free_device(struct iommu_group *group,
 	trace_remove_device_from_group(group->id, dev);
 
 	kfree(grp_dev->name);
+
+	/* Wait for any in-flight reader to drop the reference to gdev->dev */
+	synchronize_rcu();
+	put_device(grp_dev->dev);
 	kfree(grp_dev);
 }
 
@@ -779,7 +785,7 @@ static void __iommu_group_remove_device(struct device *dev)
 		/* Must drop the recovery_cnt when removing a blocked device */
 		if (device->blocked && !WARN_ON(group->recovery_cnt == 0))
 			group->recovery_cnt--;
-		list_del(&device->list);
+		list_del_rcu(&device->list);
 		__iommu_group_empty_assert_owner_cnt(group);
 		if (dev_has_iommu(dev))
 			iommu_deinit_device(dev);
@@ -1298,6 +1304,8 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
 		return ERR_PTR(-ENOMEM);
 
 	device->dev = dev;
+	/* Keep dev alive for any in-flight RCU reader of grp_dev->dev. */
+	get_device(dev);
 
 	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
 	if (ret)
@@ -1337,6 +1345,7 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
 err_remove_link:
 	sysfs_remove_link(&dev->kobj, "iommu_group");
 err_free_device:
+	put_device(dev);
 	kfree(device);
 	dev_err(dev, "Failed to add to iommu group %d: %d\n", group->id, ret);
 	return ERR_PTR(ret);
@@ -1362,7 +1371,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 	rcu_assign_pointer(dev_iommu_group_rcu(dev), group);
 
 	mutex_lock(&group->mutex);
-	list_add_tail(&gdev->list, &group->devices);
+	list_add_tail_rcu(&gdev->list, &group->devices);
 	mutex_unlock(&group->mutex);
 	return 0;
 }
@@ -2011,9 +2020,11 @@ static int bus_iommu_probe(const struct bus_type *bus)
 		 * FIXME: Mis-locked because the ops->probe_finalize() call-back
 		 * of some IOMMU drivers calls arm_iommu_attach_device() which
 		 * in-turn might call back into IOMMU core code, where it tries
-		 * to take group->mutex, resulting in a deadlock.
+		 * to take group->mutex, resulting in a deadlock. Unfortunately,
+		 * as iommu_group_do_probe_finalize() can sleep, rcu_read_lock()
+		 * cannot be held to mitigate this.
 		 */
-		for_each_group_device(group, gdev)
+		list_for_each_entry(gdev, &group->devices, list)
 			iommu_group_do_probe_finalize(gdev->dev);
 	}
 
-- 
2.43.0




* [PATCH v4 09/24] iommu: Add group pointer to struct group_device
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (7 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 08/24] iommu: Change group->devices to RCU-protected list Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 10/24] iommu: Add __iommu_group_block_device helper Nicolin Chen
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

Though the group pointer is generally available at dev->iommu_group, it is
set to NULL by iommu_deinit_device(), which runs under group->mutex.

A subsequent change introduces an asynchronous worker that takes the mutex
as well, so its disable_work_sync() can only be called afterwards, outside
the mutex. By then dev->iommu_group is already NULL, and dereferencing it
would crash the kernel.

Add a group pointer to the gdev to prepare for that. No functional change.

Drop the group arguments that sit next to gdev in function parameters.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4116b28258bde..f745083c032d6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -84,6 +84,7 @@ enum gdev_blocked {
 };
 
 struct group_device {
+	struct iommu_group *group;
 	struct list_head list;
 	struct device *dev;
 	char *name;
@@ -177,8 +178,7 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
 static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
 						     struct device *dev);
-static void __iommu_group_free_device(struct iommu_group *group,
-				      struct group_device *grp_dev);
+static void __iommu_group_free_device(struct group_device *grp_dev);
 static void iommu_domain_init(struct iommu_domain *domain, unsigned int type,
 			      const struct iommu_ops *ops);
 
@@ -727,7 +727,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	iommu_deinit_device(dev);
 	mutex_unlock(&group->mutex);
 	if (!IS_ERR(gdev))
-		__iommu_group_free_device(group, gdev);
+		__iommu_group_free_device(gdev);
 	iommu_group_put(group);
 
 	return ret;
@@ -751,9 +751,9 @@ int iommu_probe_device(struct device *dev)
 	return 0;
 }
 
-static void __iommu_group_free_device(struct iommu_group *group,
-				      struct group_device *grp_dev)
+static void __iommu_group_free_device(struct group_device *grp_dev)
 {
+	struct iommu_group *group = grp_dev->group;
 	struct device *dev = grp_dev->dev;
 
 	lockdep_assert_not_held(&group->mutex);
@@ -797,7 +797,7 @@ static void __iommu_group_remove_device(struct device *dev)
 	mutex_unlock(&group->mutex);
 
 	if (to_free)
-		__iommu_group_free_device(group, to_free);
+		__iommu_group_free_device(to_free);
 	/*
 	 * Pairs with the get in iommu_init_device() or
 	 * iommu_group_add_device()
@@ -1304,6 +1304,7 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
 		return ERR_PTR(-ENOMEM);
 
 	device->dev = dev;
+	device->group = group;
 	/* Keep dev alive for any in-flight RCU reader of grp_dev->dev. */
 	get_device(dev);
 
@@ -4161,9 +4162,9 @@ static int group_device_cmp_dma_alias(struct pci_dev *dev, u16 alias,
 				      &alias);
 }
 
-static bool group_device_dma_alias_is_blocked(struct iommu_group *group,
-					      struct group_device *gdev)
+static bool group_device_dma_alias_is_blocked(struct group_device *gdev)
 {
+	struct iommu_group *group = gdev->group;
 	struct group_device *sibling;
 
 	lockdep_assert_held(&group->mutex);
@@ -4263,7 +4264,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result)
 	if (WARN_ON(!gdev->blocked))
 		return;
 
-	if (group_device_dma_alias_is_blocked(group, gdev)) {
+	if (group_device_dma_alias_is_blocked(gdev)) {
 		/*
 		 * FIXME: DMA aliased devices share the same RID, which would be
 		 * convoluted to handle, as "gdev->blocked" is not sufficient:
-- 
2.43.0




* [PATCH v4 10/24] iommu: Add __iommu_group_block_device helper
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (8 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 09/24] iommu: Add group pointer to struct group_device Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

Move the RID/PASID blocking routine into a separate helper, which will be
reused by a new function that quarantines the device without touching the
gdev->reset_depth counter.

Also, document the severity ordering at enum gdev_blocked.

No functional changes.
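
To illustrate the severity ordering, here is a minimal userspace model of
the escalation rule. The model_* names are hypothetical and the domain
attach work of the real helper is reduced to a comment; only the
"already blocked: keep the higher reason" logic is shown.

```c
#include <assert.h>

/* A bigger number indicates a higher severity, as in enum gdev_blocked. */
enum model_blocked {
	MODEL_BLOCKED_NO = 0,
	MODEL_BLOCKED_RESETTING,
	MODEL_BLOCKED_RESET_FAILED,
};

struct model_group {
	unsigned int recovery_cnt;	/* number of blocked devices */
};

struct model_gdev {
	struct model_group *group;
	enum model_blocked blocked;
};

/* Block @gdev, or only raise the recorded reason if it is already blocked. */
static int model_block_device(struct model_gdev *gdev,
			      enum model_blocked reason)
{
	if (gdev->blocked) {
		/* Escalate, never downgrade, the severity. */
		if (reason > gdev->blocked)
			gdev->blocked = reason;
		return 0;
	}

	/* The real helper stages the blocking domain here and may fail. */
	gdev->blocked = reason;
	gdev->group->recovery_cnt++;
	return 0;
}
```

In this model a second call on a blocked device does not bump recovery_cnt
again, which is why the helper can be reused just to update the reason.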

Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c | 106 ++++++++++++++++++++++++------------------
 1 file changed, 60 insertions(+), 46 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index f745083c032d6..b150d22d8015f 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -77,6 +77,7 @@ struct iommu_group {
 #define dev_iommu_group_rcu(dev) \
 	(*((struct iommu_group __rcu __force **)&(dev)->iommu_group))
 
+/* A bigger number indicates a higher severity */
 enum gdev_blocked {
 	BLOCKED_NO = 0, /* Not blocked */
 	BLOCKED_RESETTING, /* PCI reset in flight */
@@ -4053,6 +4054,62 @@ int iommu_replace_group_handle(struct iommu_group *group,
 }
 EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
 
+/* Caller can use this function on a blocked @gdev just to update the @reason */
+static int __iommu_group_block_device(struct group_device *gdev,
+				      enum gdev_blocked reason)
+{
+	struct iommu_group *group = gdev->group;
+	unsigned long pasid;
+	void *entry;
+	int ret;
+
+	lockdep_assert_held(&group->mutex);
+
+	/* Device might be already blocked for a quarantine */
+	if (gdev->blocked) {
+		/* Escalate the severity */
+		gdev->blocked = max(gdev->blocked, reason);
+		return 0;
+	}
+
+	ret = __iommu_group_alloc_blocking_domain(group);
+	if (ret)
+		return ret;
+
+	/* Stage RID domain at blocking_domain while retaining group->domain */
+	if (group->domain != group->blocking_domain) {
+		ret = __iommu_attach_device(group->blocking_domain, gdev->dev,
+					    group->domain);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Update gdev->blocked upon the domain change, as it is used to return
+	 * the correct domain in iommu_driver_get_domain_for_dev() that might be
+	 * called in a set_dev_pasid callback function.
+	 */
+	gdev->blocked = reason;
+
+	/*
+	 * Stage PASID domains at blocking_domain while retaining pasid_array.
+	 *
+	 * The pasid_array is mostly fenced by group->mutex, except one reader
+	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+	 */
+	if (gdev->dev->iommu->max_pasids > 0) {
+		xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+			struct iommu_domain *pasid_dom =
+				pasid_array_entry_to_domain(entry);
+
+			iommu_remove_dev_pasid(gdev->dev, pasid, pasid_dom);
+		}
+	}
+
+	group->recovery_cnt++;
+	return 0;
+}
+
 /**
  * pci_dev_reset_iommu_prepare() - Block IOMMU to prepare for a PCI device reset
  * @pdev: PCI device that is going to enter a reset routine
@@ -4086,8 +4143,6 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 {
 	struct iommu_group *group = pdev->dev.iommu_group;
 	struct group_device *gdev;
-	unsigned long pasid;
-	void *entry;
 	int ret;
 
 	if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
@@ -4102,49 +4157,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 	if (gdev->reset_depth++)
 		return 0;
 
-	/* Device might be already blocked for a quarantine */
-	if (gdev->blocked)
-		return 0;
-
-	ret = __iommu_group_alloc_blocking_domain(group);
-	if (ret) {
+	ret = __iommu_group_block_device(gdev, BLOCKED_RESETTING);
+	if (ret)
 		gdev->reset_depth--;
-		return ret;
-	}
-
-	/* Stage RID domain at blocking_domain while retaining group->domain */
-	if (group->domain != group->blocking_domain) {
-		ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
-					    group->domain);
-		if (ret) {
-			gdev->reset_depth--;
-			return ret;
-		}
-	}
-
-	/*
-	 * Update gdev->blocked upon the domain change, as it is used to return
-	 * the correct domain in iommu_driver_get_domain_for_dev() that might be
-	 * called in a set_dev_pasid callback function.
-	 */
-	gdev->blocked = BLOCKED_RESETTING;
-
-	/*
-	 * Stage PASID domains at blocking_domain while retaining pasid_array.
-	 *
-	 * The pasid_array is mostly fenced by group->mutex, except one reader
-	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
-	 */
-	if (pdev->dev.iommu->max_pasids > 0) {
-		xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
-			struct iommu_domain *pasid_dom =
-				pasid_array_entry_to_domain(entry);
-
-			iommu_remove_dev_pasid(&pdev->dev, pasid, pasid_dom);
-		}
-	}
-
-	group->recovery_cnt++;
 	return ret;
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
@@ -4256,8 +4271,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result)
 			dev_err_ratelimited(
 				&pdev->dev,
 				"Reset failed. Keep it blocked to protect memory\n");
-		if (gdev->blocked == BLOCKED_RESETTING)
-			gdev->blocked = BLOCKED_RESET_FAILED;
+		WARN_ON(__iommu_group_block_device(gdev, BLOCKED_RESET_FAILED));
 		return;
 	}
 
-- 
2.43.0




* [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (9 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 10/24] iommu: Add __iommu_group_block_device helper Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19 12:07   ` Jason Gunthorpe
  2026-05-19  3:38 ` [PATCH v4 12/24] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

When IOMMU hardware detects an error due to a faulty device (e.g. an ATS
invalidation timeout), IOMMU drivers may quarantine the device by disabling
specific hardware features or dropping translation capabilities.

However, the core-level states of the faulty device are out of sync, as the
device can still be attached to a translation domain or even potentially be
moved to a new domain that might overwrite the driver-level quarantine.

Given that such an error can likely be triggered from an ISR, introduce an
asynchronous broken_work per group_device, and provide a helper function to
allow the driver to initiate a quarantine in the core. __dev_to_gdev_rcu()
is required here to safely iterate the group_device list.

Note: gdev and gdev->group teardown will be blocked by disable_work_sync(),
so it's completely safe for the worker thread to access the group pointer.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommu.h |  11 +++-
 drivers/iommu/iommu.c | 139 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+), 1 deletion(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6c124e9e9af8b..c088c8e8c1e2b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -631,7 +631,10 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
  *                  group and attached to the groups domain
  * @reset_device_done: Notify the driver that a device has reset successfully.
  *                     Note that the core invokes the callback function while
- *                     holding the group->mutex
+ *                     holding the group->mutex. Before returning, the driver
+ *                     must drain or filter pre-reset fault reports so that
+ *                     subsequent calls to iommu_report_device_broken() will
+ *                     reflect only post-reset faults.
  * @device_group: find iommu group for a particular device
  * @get_resv_regions: Request list of reserved regions for a device
  * @of_xlate: add OF master IDs to iommu grouping
@@ -896,6 +899,8 @@ static inline struct iommu_device *__iommu_get_iommu_dev(struct device *dev)
 #define iommu_get_iommu_dev(dev, type, member) \
 	container_of(__iommu_get_iommu_dev(dev), type, member)
 
+void iommu_report_device_broken(struct device *dev);
+
 static inline void iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
 {
 	*gather = (struct iommu_iotlb_gather) {
@@ -1211,6 +1216,10 @@ struct iommu_iotlb_gather {};
 struct iommu_dirty_bitmap {};
 struct iommu_dirty_ops {};
 
+static inline void iommu_report_device_broken(struct device *dev)
+{
+}
+
 static inline bool device_iommu_capable(struct device *dev, enum iommu_cap cap)
 {
 	return false;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b150d22d8015f..6e5a7e38c5e67 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -82,6 +82,7 @@ enum gdev_blocked {
 	BLOCKED_NO = 0, /* Not blocked */
 	BLOCKED_RESETTING, /* PCI reset in flight */
 	BLOCKED_RESET_FAILED, /* PCI reset failed */
+	BLOCKED_BROKEN, /* Driver flagged a fault */
 };
 
 struct group_device {
@@ -95,6 +96,9 @@ struct group_device {
 	 */
 	enum gdev_blocked blocked;
 	unsigned int reset_depth;
+	struct work_struct broken_work;
+	/* Transient state before broken_worker thread starts */
+	bool broken_pending;
 };
 
 /* Iterate over each struct group_device in a struct iommu_group */
@@ -117,6 +121,25 @@ static struct group_device *__dev_to_gdev(struct device *dev)
 	return NULL;
 }
 
+/* Caller must be inside rcu_read_lock(). */
+static struct group_device *__dev_to_gdev_rcu(struct device *dev)
+{
+	struct iommu_group *group;
+	struct group_device *gdev;
+
+	lockdep_assert(rcu_read_lock_held());
+
+	group = rcu_dereference(dev_iommu_group_rcu(dev));
+	if (!group)
+		return NULL;
+
+	for_each_group_device(group, gdev) {
+		if (gdev->dev == dev)
+			return gdev;
+	}
+	return NULL;
+}
+
 struct iommu_group_attribute {
 	struct attribute attr;
 	ssize_t (*show)(struct iommu_group *group, char *buf);
@@ -180,6 +203,7 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
 static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
 						     struct device *dev);
 static void __iommu_group_free_device(struct group_device *grp_dev);
+static void iommu_group_broken_worker(struct work_struct *work);
 static void iommu_domain_init(struct iommu_domain *domain, unsigned int type,
 			      const struct iommu_ops *ops);
 
@@ -762,6 +786,12 @@ static void __iommu_group_free_device(struct group_device *grp_dev)
 	sysfs_remove_link(group->devices_kobj, grp_dev->name);
 	sysfs_remove_link(&dev->kobj, "iommu_group");
 
+	/*
+	 * Disable (not just cancel) broken_work to prevent UAF; otherwise a
+	 * concurrent schedule_work() from iommu_report_device_broken() would
+	 * queue onto an about-to-be-freed gdev.
+	 */
+	disable_work_sync(&grp_dev->broken_work);
 	trace_remove_device_from_group(group->id, dev);
 
 	kfree(grp_dev->name);
@@ -1308,6 +1338,7 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
 	device->group = group;
 	/* Keep dev alive for any in-flight RCU reader of grp_dev->dev. */
 	get_device(dev);
+	INIT_WORK(&device->broken_work, iommu_group_broken_worker);
 
 	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
 	if (ret)
@@ -4301,6 +4332,14 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result)
 	 */
 	if (ops->reset_device_done)
 		ops->reset_device_done(&pdev->dev);
+	/*
+	 * Clear broken_pending after the reset_device_done callback to
+	 * neutralize any stale pre-reset report. Until this moment, a
+	 * driver may call iommu_report_device_broken() on a pre-reset
+	 * fault. Legitimate post-reset reports can only fire after the
+	 * re-attach below, so this clear cannot hide them.
+	 */
+	WRITE_ONCE(gdev->broken_pending, false);
 
 	/*
 	 * Re-attach RID domain back to group->domain
@@ -4346,6 +4385,106 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result)
 }
 EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
 
+static void iommu_group_broken_worker(struct work_struct *work)
+{
+	struct group_device *gdev =
+		container_of(work, struct group_device, broken_work);
+	struct iommu_group *group = gdev->group;
+	struct device *dev = gdev->dev;
+
+	guard(mutex)(&group->mutex);
+
+	/*
+	 * iommu_deinit_device() frees dev->iommu under group->mutex. Bail
+	 * out if the device has already been removed from IOMMU handling.
+	 */
+	if (!dev_has_iommu(dev))
+		return;
+
+	/*
+	 * A successful reset between schedule_work() and now would have cleared
+	 * broken_pending. Skip as the device has already been recovered.
+	 */
+	if (!READ_ONCE(gdev->broken_pending))
+		return;
+
+	/*
+	 * Quarantine the device completely. For a PCI device, it will be lifted
+	 * upon a pci_dev_reset_iommu_done(pdev, reset_result=0) call indicating
+	 * a device recovery.
+	 *
+	 * For a non-PCI device, currently it has no recovery framework tied to
+	 * the IOMMU subsystem. Quarantine it indefinitely until a recovery path
+	 * is introduced.
+	 */
+	if (WARN_ON(__iommu_group_block_device(gdev, BLOCKED_BROKEN)))
+		return;
+
+	dev_warn(dev, "IOMMU has quarantined the device\n");
+	WRITE_ONCE(gdev->broken_pending, false);
+}
+
+/**
+ * iommu_report_device_broken() - Report a broken device to quarantine it
+ * @dev: Device that has encountered an unrecoverable IOMMU-related error
+ *
+ * When an IOMMU driver detects a critical error caused by a device (e.g. an ATC
+ * invalidation timeout), this function should be used to quarantine the device
+ * at the IOMMU core level.
+ *
+ * The quarantine moves the device's RID and PASIDs to group->blocking_domain to
+ * prevent any further DMA/ATS activity that can potentially corrupt the system
+ * memory due to stale device cache entries.
+ *
+ * This function must not be called from a reset_device_done callback: setting
+ * broken_pending there would race the clear in pci_dev_reset_iommu_done() that
+ * follows. Otherwise, this function is safe to call from any context (including
+ * interrupt handlers), as the actual quarantine is done in an asynchronous work
+ * thread. The caller should have already taken driver-level measures (e.g., ATS
+ * disabled in HW) to contain the fault promptly before calling this function.
+ *
+ * An asynchronous reset can occur while the driver is handling an IOMMU-related
+ * error. The driver is responsible for calling this only for post-reset faults.
+ * A queued or delayed pre-reset fault must be drained or filtered by the driver
+ * before delivery. Otherwise, a stale report would falsely quarantine a freshly
+ * recovered device.
+ *
+ * For PCI devices, the quarantine will be lifted by a successful device reset
+ * via pci_dev_reset_iommu_done(). For non-PCI devices, the quarantine remains
+ * in effect indefinitely until a recovery mechanism is introduced.
+ *
+ * If the device is concurrently being removed or has already been removed from
+ * the IOMMU subsystem, this function will silently return without any action.
+ */
+void iommu_report_device_broken(struct device *dev)
+{
+	struct group_device *gdev;
+
+	/*
+	 * We cannot hold group->mutex here. Rely on iommu_group_broken_worker()
+	 * to validate dev_has_iommu(). The iommu_group memory is RCU-protected
+	 * via kfree_rcu() in iommu_group_release(), and group->devices is an
+	 * RCU-protected list, so the lookup runs entirely under rcu_read_lock.
+	 *
+	 * Note the device might have been concurrently removed from the group
+	 * (list_del_rcu) before iommu_deinit_device() cleared the dev->iommu.
+	 */
+	rcu_read_lock();
+	gdev = __dev_to_gdev_rcu(dev);
+	if (gdev) {
+		/*
+		 * Narrow chance we re-set broken_pending right after a concurrent
+		 * worker cleared it. Benign: the worker we are queueing here will
+		 * read it true and clear it again (skipping if already blocked);
+		 * pci_dev_reset_iommu_done() also clears it on a successful reset.
+		 */
+		WRITE_ONCE(gdev->broken_pending, true);
+		schedule_work(&gdev->broken_work);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_broken);
+
 #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
 /**
  * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
-- 
2.43.0
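
For readers following the broken_pending handshake in the patch above, here
is a rough single-threaded userspace sketch of its lifecycle. It is only a
model under stated assumptions: the model_* names are hypothetical, there is
no locking, RCU, or workqueue, and "blocked" merely stands in for the
BLOCKED_BROKEN state. It shows why a report made before a successful reset
does not quarantine the recovered device.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields this patch adds to group_device. */
struct model_gdev {
	bool broken_pending;	/* set by the reporter, consumed by the worker */
	bool blocked;		/* stands in for BLOCKED_BROKEN */
	bool work_queued;	/* stands in for schedule_work() */
};

/* Reporter side: cheap enough to run from an ISR in the real code. */
static void model_report_broken(struct model_gdev *g)
{
	g->broken_pending = true;
	g->work_queued = true;
}

/* Assumed: a successful reset lifts the quarantine and drops stale reports. */
static void model_reset_done(struct model_gdev *g)
{
	g->broken_pending = false;
	g->blocked = false;
}

/* Worker side: returns true only if it actually quarantined the device. */
static bool model_broken_worker(struct model_gdev *g)
{
	g->work_queued = false;
	if (!g->broken_pending)
		return false;	/* a reset already recovered the device */
	g->blocked = true;
	g->broken_pending = false;
	return true;
}
```

In this model, report followed by worker quarantines the device, while
report, reset-done, worker leaves it unblocked, which mirrors the
READ_ONCE(gdev->broken_pending) check in iommu_group_broken_worker().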




* [PATCH v4 12/24] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (10 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 13/24] iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM Nicolin Chen
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware
reports the timeout via a GERROR interrupt, the driver thread issuing the
commands has no direct way to tell whether its own batch was the cause:
polling the CMD_SYNC status doesn't return a failure code. This makes it
very difficult to coordinate per-device recovery.

Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
gap. When the ISR detects an ATC timeout, set the bit corresponding to the
physical CMDQ index of the faulting CMD_SYNC command.

On the issuer side, after polling completes (or times out), test and clear
the bit for the issuer's own CMD_SYNC slot. If it was set, return -EIO to
trigger device quarantine.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 42 ++++++++++++++++++++-
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 16353596e08ad..46f9e292a1cc8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -700,6 +700,7 @@ struct arm_smmu_cmdq {
 	atomic_long_t			*valid_map;
 	atomic_t			owner_prod;
 	atomic_t			lock;
+	unsigned long			*atc_sync_timeouts;
 	bool				(*supports_cmd)(struct arm_smmu_cmd *cmd);
 };
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9be589d14a3bd..1065301a54eeb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -343,7 +343,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
 		 * at the CMD_SYNC. Attempt to complete other pending commands
 		 * by repeating the CMD_SYNC, though we might well end up back
 		 * here since the ATC invalidation may still be pending.
+		 *
+		 * Mark the faulty batch in the bitmap for the issuer to match.
 		 */
+		set_bit(Q_IDX(&q->llq, cons), cmdq->atc_sync_timeouts);
 		return;
 	case CMDQ_ERR_CERROR_ILL_IDX:
 	default:
@@ -750,6 +753,14 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		queue_write(Q_ENT(&cmdq->q, prod), cmd_sync.data,
 			    ARRAY_SIZE(cmd_sync.data));
 
+		/*
+		 * Clear any stale ATC-timeout bit left in the slot from a prior
+		 * wraparound, before the slot becomes visible to the SMMU. Must
+		 * do this prior to step 3 to prevent potential races with the
+		 * GERROR ISR calling set_bit() for our own CMD_SYNC.
+		 */
+		clear_bit(Q_IDX(&llq, prod), cmdq->atc_sync_timeouts);
+
 		/*
 		 * In order to determine completion of our CMD_SYNC, we must
 		 * ensure that the queue can't wrap twice without us noticing.
@@ -796,9 +807,33 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 
 	/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
 	if (sync) {
+		u32 sync_prod;
+
 		llq.prod = queue_inc_prod_n(&llq, n);
+		sync_prod = llq.prod;
 		ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
-		if (ret) {
+
+		/*
+		 * Test atc_sync_timeouts first and see if there is ATC timeout
+		 * resulted from this cmdlist. Return -EIO to separate from the
+		 * ARM_SMMU_POLL_TIMEOUT_US software timeout.
+		 *
+		 * FIXME possible unhandled ATC invalidation timeout scenario:
+		 * PCI Completion Timeout can be set to a range longer than the
+		 * ARM_SMMU_POLL_TIMEOUT_US software timeout. -ETIMEDOUT can be
+		 * returned by arm_smmu_cmdq_poll_until_sync() while the ATC_INV
+		 * is still pending and not yet reflected in GERROR, so the bit
+		 * on atc_sync_timeouts is not set. In this case, we can hardly
+		 * do anything here, since the command queue HW is still pending
+		 * on the ATC command.
+		 */
+		if (test_and_clear_bit(Q_IDX(&llq, sync_prod),
+				       cmdq->atc_sync_timeouts)) {
+			dev_err_ratelimited(smmu->dev,
+					    "CMD_SYNC for ATC_INV timeout at prod=0x%08x\n",
+					    sync_prod);
+			ret = -EIO;
+		} else if (ret) {
 			dev_err_ratelimited(smmu->dev,
 					    "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
 					    llq.prod,
@@ -4332,6 +4367,11 @@ int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
 	if (!cmdq->valid_map)
 		return -ENOMEM;
 
+	cmdq->atc_sync_timeouts =
+		devm_bitmap_zalloc(smmu->dev, nents, GFP_KERNEL);
+	if (!cmdq->atc_sync_timeouts)
+		return -ENOMEM;
+
 	return 0;
 }
 
-- 
2.43.0
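
The bitmap handshake in the patch above can be approximated in userspace as
follows. This is a minimal sketch, not the driver code: it assumes a
hypothetical 16-entry queue, uses C11 atomics in place of the kernel's
set_bit()/clear_bit()/test_and_clear_bit(), and all model_* names are made
up for illustration. The point it shows is that the error handler and the
issuer agree on a slot by masking off the wrap bits, as Q_IDX() does.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_Q_SHIFT	4u			/* hypothetical: 16-entry queue */
#define MODEL_Q_ENTS	(1u << MODEL_Q_SHIFT)

/* One bit per queue slot; only the low 16 bits are used in this model. */
static _Atomic uint32_t model_atc_timeouts;

/* Drop the wrap bits so prod/cons values map to a physical slot index. */
static unsigned int model_q_idx(uint32_t prod_or_cons)
{
	return prod_or_cons & (MODEL_Q_ENTS - 1);
}

/* Issuer, before the CMD_SYNC slot is published: clear any stale bit. */
static void model_issuer_prepare(uint32_t sync_prod)
{
	atomic_fetch_and(&model_atc_timeouts, ~(1u << model_q_idx(sync_prod)));
}

/* Error handler: mark the slot of the CMD_SYNC the queue stopped at. */
static void model_isr_mark_timeout(uint32_t cons)
{
	atomic_fetch_or(&model_atc_timeouts, 1u << model_q_idx(cons));
}

/* Issuer, after polling: true if its own slot was marked (then cleared). */
static bool model_issuer_check(uint32_t sync_prod)
{
	uint32_t bit = 1u << model_q_idx(sync_prod);

	return atomic_fetch_and(&model_atc_timeouts, ~bit) & bit;
}
```

An issuer whose check returns true would map that to -EIO in the real
driver; a false result leaves the ordinary poll return value in place.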




* [PATCH v4 13/24] iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (11 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 12/24] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 14/24] iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback Nicolin Chen
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

When the SMMU enters Service Failure Mode (SFM), arm_smmu_device_disable()
clears CR0 and the SMMU stops processing requests entirely. The remaining
GERROR causes (MSI write aborts, PRIQ/EVTQ aborts, CMDQ_ERR) are moot at
that point: the cmdq is dead so arm_smmu_cmdq_skip_err() would just twiddle
bookkeeping for a queue nobody's reading, and the per-cause dev_warn lines
add little diagnostic value beyond the SFM message itself.

On SFM, ack GERRORN, call arm_smmu_device_disable(), and return right away.
This intentionally duplicates the last two lines of the function because
this SFM path will be slightly different.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 1065301a54eeb..d9fe48989fcd7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2268,8 +2268,11 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 		 active);
 
 	if (active & GERROR_SFM_ERR) {
+		/* SMMU is being disabled, so other errors don't matter */
+		writel(gerror, smmu->base + ARM_SMMU_GERRORN);
 		dev_err(smmu->dev, "device has entered Service Failure Mode!\n");
 		arm_smmu_device_disable(smmu);
+		return IRQ_HANDLED;
 	}
 
 	if (active & GERROR_MSI_GERROR_ABT_ERR)
-- 
2.43.0




* [PATCH v4 14/24] iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (12 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 13/24] iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 15/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out Nicolin Chen
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

A subsequent change will need arm_smmu_cmdq_issue_cmdlist() to co-clear a
pending CMDQ_ERR after a CMD_SYNC poll timeout. This needs to be done for
both smmu->cmdq and tegra241-cmdq.

Add a cmdq_err_handler and a paired cmdq_err_lock to struct arm_smmu_cmdq.
arm_smmu_gerror_handler() and tegra241_vintf0_handle_error() now take the
per-cmdq cmdq_err_lock when acking CMDQ_ERR.

tegra241_vintf0_handle_error() also checks (gerror ^ gerrorn) inside the
cmdq_err_lock since a concurrent cmdq_err_handler may have already acked
the error. arm_smmu_gerror_handler() already covers this via its existing
early-exit on no-active-bits.

The implementation functions and their caller will be added in the
subsequent change.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h    | 12 ++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c    | 18 ++++++++++++++----
 drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 15 ++++++++++++---
 3 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 46f9e292a1cc8..604f7edf54158 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -695,6 +695,10 @@ struct arm_smmu_queue_poll {
 	bool				wfe;
 };
 
+struct arm_smmu_cmdq;
+typedef void (*arm_smmu_cmdq_err_fn)(struct arm_smmu_device *smmu,
+				     struct arm_smmu_cmdq *cmdq);
+
 struct arm_smmu_cmdq {
 	struct arm_smmu_queue		q;
 	atomic_long_t			*valid_map;
@@ -702,6 +706,10 @@ struct arm_smmu_cmdq {
 	atomic_t			lock;
 	unsigned long			*atc_sync_timeouts;
 	bool				(*supports_cmd)(struct arm_smmu_cmd *cmd);
+
+	/* Drain a pending CMDQ_ERR; will hold cmdq_err_lock with irqsave */
+	arm_smmu_cmdq_err_fn		cmdq_err_handler;
+	raw_spinlock_t			cmdq_err_lock;
 };
 
 static inline bool arm_smmu_cmdq_supports_cmd(struct arm_smmu_cmdq *cmdq,
@@ -1163,8 +1171,8 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 			    struct arm_smmu_queue *q, void __iomem *page,
 			    unsigned long prod_off, unsigned long cons_off,
 			    size_t dwords, const char *name);
-int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
-		       struct arm_smmu_cmdq *cmdq);
+int arm_smmu_cmdq_init(struct arm_smmu_device *smmu, struct arm_smmu_cmdq *cmdq,
+		       arm_smmu_cmdq_err_fn cmdq_err_handler);
 
 static inline bool arm_smmu_master_canwbs(struct arm_smmu_master *master)
 {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d9fe48989fcd7..fc0757359b783 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2255,13 +2255,18 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 {
 	u32 gerror, gerrorn, active;
 	struct arm_smmu_device *smmu = dev;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&smmu->cmdq.cmdq_err_lock, flags);
 
 	gerror = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
 	gerrorn = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
 
 	active = gerror ^ gerrorn;
-	if (!(active & GERROR_ERR_MASK))
+	if (!(active & GERROR_ERR_MASK)) {
+		raw_spin_unlock_irqrestore(&smmu->cmdq.cmdq_err_lock, flags);
 		return IRQ_NONE; /* No errors pending */
+	}
 
 	dev_warn(smmu->dev,
 		 "unexpected global error reported (0x%08x), this could be serious\n",
@@ -2270,6 +2275,8 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 	if (active & GERROR_SFM_ERR) {
 		/* SMMU is being disabled, so other errors don't matter */
 		writel(gerror, smmu->base + ARM_SMMU_GERRORN);
+		/* Release before arm_smmu_device_disable() that sleeps */
+		raw_spin_unlock_irqrestore(&smmu->cmdq.cmdq_err_lock, flags);
 		dev_err(smmu->dev, "device has entered Service Failure Mode!\n");
 		arm_smmu_device_disable(smmu);
 		return IRQ_HANDLED;
@@ -2297,6 +2304,7 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 		arm_smmu_cmdq_skip_err(smmu);
 
 	writel(gerror, smmu->base + ARM_SMMU_GERRORN);
+	raw_spin_unlock_irqrestore(&smmu->cmdq.cmdq_err_lock, flags);
 	return IRQ_HANDLED;
 }
 
@@ -4357,13 +4365,15 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 	return 0;
 }
 
-int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
-		       struct arm_smmu_cmdq *cmdq)
+int arm_smmu_cmdq_init(struct arm_smmu_device *smmu, struct arm_smmu_cmdq *cmdq,
+		       arm_smmu_cmdq_err_fn cmdq_err_handler)
 {
 	unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
 
 	atomic_set(&cmdq->owner_prod, 0);
 	atomic_set(&cmdq->lock, 0);
+	raw_spin_lock_init(&cmdq->cmdq_err_lock);
+	cmdq->cmdq_err_handler = cmdq_err_handler;
 
 	cmdq->valid_map = (atomic_long_t *)devm_bitmap_zalloc(smmu->dev, nents,
 							      GFP_KERNEL);
@@ -4389,7 +4399,7 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
 	if (ret)
 		return ret;
 
-	ret = arm_smmu_cmdq_init(smmu, &smmu->cmdq);
+	ret = arm_smmu_cmdq_init(smmu, &smmu->cmdq, NULL);
 	if (ret)
 		return ret;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 67be62a6e7640..fb2f8f68fa344 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -319,10 +319,19 @@ static void tegra241_vintf0_handle_error(struct tegra241_vintf *vintf)
 		while (map) {
 			unsigned long lidx = __ffs64(map);
 			struct tegra241_vcmdq *vcmdq = vintf->lvcmdqs[lidx];
-			u32 gerror = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR));
+			struct arm_smmu_cmdq *cmdq = &vcmdq->cmdq;
+			unsigned long flags;
+			u32 gerror, gerrorn;
 
-			__arm_smmu_cmdq_skip_err(&vintf->cmdqv->smmu, &vcmdq->cmdq);
+			raw_spin_lock_irqsave(&cmdq->cmdq_err_lock, flags);
+			gerror = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR));
+			gerrorn = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERRORN));
+
+			if ((gerror ^ gerrorn) & GERROR_CMDQ_ERR)
+				__arm_smmu_cmdq_skip_err(&vintf->cmdqv->smmu,
+							 cmdq);
 			writel(gerror, REG_VCMDQ_PAGE0(vcmdq, GERRORN));
+			raw_spin_unlock_irqrestore(&cmdq->cmdq_err_lock, flags);
 			map &= ~BIT_ULL(lidx);
 		}
 	}
@@ -643,7 +652,7 @@ static int tegra241_vcmdq_alloc_smmu_cmdq(struct tegra241_vcmdq *vcmdq)
 	q->q_base = q->base_dma & VCMDQ_ADDR;
 	q->q_base |= FIELD_PREP(VCMDQ_LOG2SIZE, q->llq.max_n_shift);
 
-	return arm_smmu_cmdq_init(smmu, cmdq);
+	return arm_smmu_cmdq_init(smmu, cmdq, NULL);
 }
 
 /* VINTF Logical VCMDQ Resource Helpers */
-- 
2.43.0




* [PATCH v4 15/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (13 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 14/24] iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:38 ` [PATCH v4 16/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when queue_has_space() fails Nicolin Chen
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

Once arm_smmu_cmdq_poll_until_sync() returns, arm_smmu_cmdq_issue_cmdlist()
tests its CMD_SYNC slot in atc_sync_timeouts to decide whether there was an
ATC_INV timeout.

However, when that poll times out, the GERROR ISR may have been delayed past
the poll deadline, so the atc_sync_timeouts test can miss an ATC_INV timeout,
misclassify it as a generic CMD_SYNC timeout, and bypass the per-device
quarantine.

Add two cmdq_err_handler implementations:
 - arm_smmu_cmdq_err_handler() reads SMMU GERROR/GERRORN.
 - tegra241_vcmdq_handle_cmdq_err() reads VCMDQ GERROR/GERRORN.

When polling on a CMD_SYNC times out, co-clear any pending CMDQ_ERR in the
issuer. Each cmdq implementation serializes the synchronous drain against
its own IRQ handler with cmdq->cmdq_err_lock.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 36 ++++++++++++++++++-
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c    | 23 +++++++++++-
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fc0757359b783..7f81fd2e92480 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -813,6 +813,15 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 		sync_prod = llq.prod;
 		ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
 
+		/*
+		 * When the poll above timed out, the GERROR ISR might have been
+		 * delayed past the poll deadline, so the atc_sync_timeouts test
+		 * below could miss our ATC_INV timeout. Thus, drain any pending
+		 * CMDQ_ERR synchronously first via the per-cmdq callback.
+		 */
+		if (ret && cmdq->cmdq_err_handler)
+			cmdq->cmdq_err_handler(smmu, cmdq);
+
 		/*
 		 * Test atc_sync_timeouts first and see if there is ATC timeout
 		 * resulted from this cmdlist. Return -EIO to separate from the
@@ -2251,6 +2260,31 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
 
 static int arm_smmu_device_disable(struct arm_smmu_device *smmu);
 
+/*
+ * Drain a pending CMDQ_ERR on the primary cmdq. Installed as the primary
+ * cmdq's cmdq_err_handler so arm_smmu_cmdq_issue_cmdlist() can drain after
+ * a CMD_SYNC poll timeout; serialized against arm_smmu_gerror_handler() by
+ * cmdq->cmdq_err_lock.
+ */
+static void arm_smmu_cmdq_err_handler(struct arm_smmu_device *smmu,
+				      struct arm_smmu_cmdq *cmdq)
+{
+	u32 gerror, gerrorn;
+
+	guard(raw_spinlock_irqsave)(&cmdq->cmdq_err_lock);
+
+	gerror = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
+	gerrorn = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
+
+	if (!((gerror ^ gerrorn) & GERROR_CMDQ_ERR))
+		return;
+
+	__arm_smmu_cmdq_skip_err(smmu, cmdq);
+
+	/* Toggle only the CMDQ_ERR bit; other bits are left for the ISR. */
+	writel(gerrorn ^ GERROR_CMDQ_ERR, smmu->base + ARM_SMMU_GERRORN);
+}
+
 static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 {
 	u32 gerror, gerrorn, active;
@@ -4399,7 +4433,7 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
 	if (ret)
 		return ret;
 
-	ret = arm_smmu_cmdq_init(smmu, &smmu->cmdq, NULL);
+	ret = arm_smmu_cmdq_init(smmu, &smmu->cmdq, arm_smmu_cmdq_err_handler);
 	if (ret)
 		return ret;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index fb2f8f68fa344..e04107f0490c9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -337,6 +337,27 @@ static void tegra241_vintf0_handle_error(struct tegra241_vintf *vintf)
 	}
 }
 
+static void tegra241_vcmdq_handle_cmdq_err(struct arm_smmu_device *smmu,
+					   struct arm_smmu_cmdq *cmdq)
+{
+	struct tegra241_vcmdq *vcmdq =
+		container_of(cmdq, struct tegra241_vcmdq, cmdq);
+	u32 gerror, gerrorn;
+
+	guard(raw_spinlock_irqsave)(&cmdq->cmdq_err_lock);
+
+	gerror = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR));
+	gerrorn = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERRORN));
+
+	if (!((gerror ^ gerrorn) & GERROR_CMDQ_ERR))
+		return;
+
+	__arm_smmu_cmdq_skip_err(smmu, cmdq);
+
+	/* Toggle only the CMDQ_ERR bit on this VCMDQ's GERRORN */
+	writel(gerrorn ^ GERROR_CMDQ_ERR, REG_VCMDQ_PAGE0(vcmdq, GERRORN));
+}
+
 static irqreturn_t tegra241_cmdqv_isr(int irq, void *devid)
 {
 	struct tegra241_cmdqv *cmdqv = (struct tegra241_cmdqv *)devid;
@@ -652,7 +673,7 @@ static int tegra241_vcmdq_alloc_smmu_cmdq(struct tegra241_vcmdq *vcmdq)
 	q->q_base = q->base_dma & VCMDQ_ADDR;
 	q->q_base |= FIELD_PREP(VCMDQ_LOG2SIZE, q->llq.max_n_shift);
 
-	return arm_smmu_cmdq_init(smmu, cmdq, NULL);
+	return arm_smmu_cmdq_init(smmu, cmdq, tegra241_vcmdq_handle_cmdq_err);
 }
 
 /* VINTF Logical VCMDQ Resource Helpers */
-- 
2.43.0
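
The "toggle only the CMDQ_ERR bit" step used by both handlers in the patch
above relies on the GERROR/GERRORN convention that a cause is active while
its bit differs between the two registers. A small sketch of that arithmetic
follows, with plain integers standing in for the MMIO registers. The model_*
names and the second bit are hypothetical; the CMDQ_ERR bit position is an
assumption chosen to match the 0x00000001 value in the cover letter's log.

```c
#include <stdint.h>

#define MODEL_GERROR_CMDQ_ERR	(1u << 0)	/* assumed bit position */
#define MODEL_GERROR_OTHER	(1u << 8)	/* some other cause, illustrative */

/* A cause is active while its bit differs between GERROR and GERRORN. */
static uint32_t model_gerror_active(uint32_t gerror, uint32_t gerrorn)
{
	return gerror ^ gerrorn;
}

/* Returns the GERRORN value to write back when acking CMDQ_ERR alone. */
static uint32_t model_ack_cmdq_err(uint32_t gerror, uint32_t gerrorn)
{
	if (!(model_gerror_active(gerror, gerrorn) & MODEL_GERROR_CMDQ_ERR))
		return gerrorn;		/* nothing pending, leave GERRORN alone */

	/* Flip only the CMDQ_ERR bit; other active causes stay pending. */
	return gerrorn ^ MODEL_GERROR_CMDQ_ERR;
}
```

With GERROR = 0x101 and GERRORN = 0, the write-back value is 0x001, which
leaves bit 8 still active for the regular ISR to handle later.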




* [PATCH v4 16/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when queue_has_space() fails
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (14 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 15/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out Nicolin Chen
@ 2026-05-19  3:38 ` Nicolin Chen
  2026-05-19  3:39 ` [PATCH v4 17/24] iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries Nicolin Chen
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:38 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

It's unusual for the command queue to fail the queue_has_space() test:
something must be stalling the HW so that the queue does not advance.

One possible scenario today: arm_smmu_cmdq_issue_cmdlist() may be called in
a context where IRQs are already disabled, e.g. from ata_sg_clean() in
drivers/ata/libata-core.c.

When the GERROR interrupt is affined to the CPU currently running with IRQs
disabled, the GERROR ISR will not run and a CERROR_ILL will not be cleared,
which stalls the HW; arm_smmu_cmdq_poll_until_not_full() then either times
out or loops without seeing CONS advance.

The window is narrow and it's very difficult to trigger this lockup. Yet, a
subsequent change requires serializing the STE update routines between the
attach_dev path (mutex-ed) and the invalidation path (non-mutexed), where a
spin_lock_irqsave() is unavoidable. That would widen the currently narrow
window to cover arm_smmu_write_ste() as well.

Since we have a cmdq_err_handler, call it when queue_has_space() fails, to
give the CMDQ hardware a chance to advance its CONS.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7f81fd2e92480..0e4f34ed036c6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -723,6 +723,13 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,

 		while (!queue_has_space(&llq, n + sync)) {
 			local_irq_restore(flags);
+			/*
+			 * If the CMDQ is nearly full, it's possible that the HW
+			 * is stalled by an unhandled GERROR_CMDQ_ERR. Thus give
+			 * cmdq_err_handler a chance before each poll.
+			 */
+			if (cmdq->cmdq_err_handler)
+				cmdq->cmdq_err_handler(smmu, cmdq);
 			if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
 				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 			local_irq_save(flags);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 17/24] iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (15 preceding siblings ...)
  2026-05-19  3:38 ` [PATCH v4 16/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when queue_has_space() fails Nicolin Chen
@ 2026-05-19  3:39 ` Nicolin Chen
  2026-05-19 12:01   ` Jason Gunthorpe
  2026-05-19  3:39 ` [PATCH v4 18/24] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:39 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

Storing the master pointer allows backtracking it from an ATS invalidation
entry, which will be useful when handling ATC invalidation timeouts.

Don't simply swap the "smmu" pointer for the "master": a non-ATS entry may
be shared across multiple devices (masters). An ATS entry is okay since it
is tied to a unique SID.

Master must outlive any concurrent RCU reader iterating the domain->invs,
because inv->master is dereferenced inside the read-side critical section.

Add a synchronize_rcu() in arm_smmu_release_device() before freeing master.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 +++++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 604f7edf54158..df6e539f75274 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -738,6 +738,7 @@ enum arm_smmu_inv_type {
 
 struct arm_smmu_inv {
 	struct arm_smmu_device *smmu;
+	struct arm_smmu_master *master; /* INV_TYPE_ATS* */
 	u8 type;
 	u8 size_opcode;
 	u8 nsize_opcode;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0e4f34ed036c6..cde2ff2dcc49b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3211,6 +3211,7 @@ arm_smmu_master_build_inv(struct arm_smmu_master *master,
 	case INV_TYPE_ATS_FULL:
 		cur->size_opcode = cur->nsize_opcode = CMDQ_OP_ATC_INV;
 		cur->ssid = ssid;
+		cur->master = master;
 		break;
 	}
 
@@ -4168,9 +4169,6 @@ static void arm_smmu_remove_master(struct arm_smmu_master *master)
 	for (i = 0; i < fwspec->num_ids; i++)
 		rb_erase(&master->streams[i].node, &smmu->streams);
 	mutex_unlock(&smmu->streams_mutex);
-
-	kfree(master->streams);
-	kfree(master->build_invs);
 }
 
 static struct iommu_device *arm_smmu_probe_device(struct device *dev)
@@ -4244,6 +4242,16 @@ static void arm_smmu_release_device(struct device *dev)
 	arm_smmu_remove_master(master);
 	if (arm_smmu_cdtab_allocated(&master->cd_table))
 		arm_smmu_free_cd_tables(master);
+
+	/*
+	 * The iommu core detaches @dev from every iommu domain before invoking
+	 * release_device. So the updated domain->invs no longer references the
+	 * @master; IOW, new RCU readers cannot reach it. Wait one grace period
+	 * for in-flight readers to drop their references.
+	 */
+	synchronize_rcu();
+	kfree(master->streams);
+	kfree(master->build_invs);
 	kfree(master);
 }
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 18/24] iommu/arm-smmu-v3: Introduce master->ats_broken flag
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (16 preceding siblings ...)
  2026-05-19  3:39 ` [PATCH v4 17/24] iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries Nicolin Chen
@ 2026-05-19  3:39 ` Nicolin Chen
  2026-05-19 12:06   ` Jason Gunthorpe
  2026-05-19  3:39 ` [PATCH v4 19/24] iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch Nicolin Chen
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:39 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

A subsequent change will set this flag when the IOMMU can no longer trust a
device's ATS function, e.g. when an ATC invalidation request to the device
times out.

Once the flag is set, stop reporting ATS as supported to prevent data
corruption, and skip further ATC invalidation commands to avoid new timeouts.

Clear the flag when the device finishes a reset for recovery.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
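For illustration only, a minimal userspace sketch of the flag's life cycle
described above: set on a fault, checked by the ATS capability test and by
the invalidation path, cleared on reset. All toy_* names are hypothetical.
C11 relaxed atomics stand in for READ_ONCE()/WRITE_ONCE() here, which is an
approximation and not what the driver uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the driver. */
struct toy_master {
	atomic_bool ats_broken;		/* read locklessly by several paths */
	bool hw_ats_capable;		/* models pci_ats_supported() */
	unsigned int atc_inv_issued;	/* counts commands actually issued */
};

static bool toy_ats_broken(struct toy_master *m)
{
	return atomic_load_explicit(&m->ats_broken, memory_order_relaxed);
}

/* ATS is reported as unsupported for as long as the flag is set */
static bool toy_ats_supported(struct toy_master *m)
{
	if (toy_ats_broken(m))
		return false;
	return m->hw_ats_capable;
}

/* Skip an invalidation that would time out; report success to the caller */
static int toy_atc_inv(struct toy_master *m)
{
	if (toy_ats_broken(m))
		return 0;
	m->atc_inv_issued++;
	return 0;
}

/* Models the fault path that stops trusting the device's ATS function */
static void toy_mark_broken(struct toy_master *m)
{
	atomic_store_explicit(&m->ats_broken, true, memory_order_relaxed);
}

/* Models the reset-done notification that re-enables ATS */
static void toy_reset_done(struct toy_master *m)
{
	atomic_store_explicit(&m->ats_broken, false, memory_order_relaxed);
}
```

After toy_mark_broken(), toy_ats_supported() returns false and toy_atc_inv()
no longer increments the counter, until toy_reset_done() is called.
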
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 +++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index df6e539f75274..44956daf83dfa 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1017,6 +1017,7 @@ struct arm_smmu_master {
 	/* Locked by the iommu core using the group mutex */
 	struct arm_smmu_ctx_desc_cfg	cd_table;
 	unsigned int			num_streams;
+	bool				ats_broken;
 	bool				ats_enabled : 1;
 	bool				ste_ats_enabled : 1;
 	bool				stall_enabled;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index cde2ff2dcc49b..638956e2535b4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2429,6 +2429,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
 	struct arm_smmu_cmd cmd;
 	struct arm_smmu_cmdq_batch cmds;
 
+	/* Do not issue ATC_INV that will definitely time out */
+	if (READ_ONCE(master->ats_broken))
+		return 0;
+
 	cmd = arm_smmu_make_cmd_atc_inv_all(0, IOMMU_NO_PASID);
 	arm_smmu_cmdq_batch_init_cmd(master->smmu, &cmds, &cmd);
 	for (i = 0; i < master->num_streams; i++)
@@ -2651,12 +2655,18 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
 						       cur->id));
 			break;
 		case INV_TYPE_ATS:
+			/* Do not issue ATC_INV that will definitely time out */
+			if (READ_ONCE(cur->master->ats_broken))
+				break;
 			arm_smmu_cmdq_batch_add_cmd(
 				smmu, &cmds,
 				arm_smmu_atc_inv_to_cmd(cur->id, cur->ssid,
 							iova, size));
 			break;
 		case INV_TYPE_ATS_FULL:
+			/* Do not issue ATC_INV that will definitely time out */
+			if (READ_ONCE(cur->master->ats_broken))
+				break;
 			arm_smmu_cmdq_batch_add_cmd(
 				smmu, &cmds,
 				arm_smmu_make_cmd_atc_inv_all(cur->id,
@@ -2995,6 +3005,16 @@ void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
 	}
 }
 
+static void arm_smmu_reset_device_done(struct device *dev)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+	if (WARN_ON(!master))
+		return;
+	/* Pair with lockless readers */
+	WRITE_ONCE(master->ats_broken, false);
+}
+
 static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
 {
 	struct device *dev = master->dev;
@@ -3007,6 +3027,14 @@ static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
 	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS))
 		return false;
 
+	/*
+	 * Do not enable ATS if master->ats_broken is set. The PCI device should
+	 * go through a recovery (reset) that shall notify the SMMUv3 driver via
+	 * a reset_device_done callback.
+	 */
+	if (READ_ONCE(master->ats_broken))
+		return false;
+
 	return dev_is_pci(dev) && pci_ats_supported(to_pci_dev(dev));
 }
 
@@ -4345,6 +4373,7 @@ static const struct iommu_ops arm_smmu_ops = {
 	.domain_alloc_paging_flags = arm_smmu_domain_alloc_paging_flags,
 	.probe_device		= arm_smmu_probe_device,
 	.release_device		= arm_smmu_release_device,
+	.reset_device_done	= arm_smmu_reset_device_done,
 	.device_group		= arm_smmu_device_group,
 	.of_xlate		= arm_smmu_of_xlate,
 	.get_resv_regions	= arm_smmu_get_resv_regions,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 19/24] iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (17 preceding siblings ...)
  2026-05-19  3:39 ` [PATCH v4 18/24] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
@ 2026-05-19  3:39 ` Nicolin Chen
  2026-05-19 12:09   ` Jason Gunthorpe
  2026-05-19  3:39 ` [PATCH v4 20/24] iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper Nicolin Chen
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:39 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

The arm_smmu_cmdq_batch_add_cmd_p() might flush a sub-batch mid-way, when
the ARM_SMMU_OPT_CMDQ_FORCE_SYNC is set or when a batch is full. To allow
a future change to retry these sub-batch flushes on a timeout and identify
the broken master, the batch needs to carry both the per-domain invs and
a per-batch indicator of whether the batch contains an ATC_INV.

Add an "invs" pointer to record the per-domain invalidation array (set via
a new arm_smmu_cmdq_batch_init_cmd() parameter), and a "has_ats" flag set
in arm_smmu_cmdq_batch_add_cmd_p() when an ATC_INV command is queued. Any
caller that does not associate a batch with an invs array can pass NULL.

No functional changes.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
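For illustration only, a minimal userspace sketch of a batch that carries an
"invs" pointer across mid-way flushes and tracks a per-batch "has_ats" flag.
All toy_* names are hypothetical; the flush is modelled by a counter rather
than by a real command submission.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the driver. */
#define TOY_BATCH_ENTRIES 4

enum toy_op { TOY_OP_TLBI, TOY_OP_ATC_INV };

struct toy_batch {
	enum toy_op cmds[TOY_BATCH_ENTRIES];
	const void *invs;		/* per-domain array, may be NULL */
	bool has_ats;			/* set once an ATC_INV is queued */
	int num;
	/* Bookkeeping for this sketch only */
	int flushes;
	bool last_flush_had_ats;
};

static void toy_batch_init(struct toy_batch *b, const void *invs)
{
	b->num = 0;
	b->invs = invs;
	b->has_ats = false;
}

/* Models issuing the queued commands; a retry could inspect b->invs here */
static void toy_batch_flush(struct toy_batch *b)
{
	b->flushes++;
	b->last_flush_had_ats = b->has_ats;
}

static void toy_batch_add(struct toy_batch *b, enum toy_op op)
{
	if (b->num == TOY_BATCH_ENTRIES) {
		toy_batch_flush(b);
		/* Start a new sub-batch but keep the same invs pointer */
		toy_batch_init(b, b->invs);
	}
	if (op == TOY_OP_ATC_INV)
		b->has_ats = true;
	b->cmds[b->num++] = op;
}
```

A caller zero-initializes the batch, calls toy_batch_init() with its array
(or NULL), then adds commands; has_ats describes only the current sub-batch.
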
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  4 ++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +++++++++++------
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 44956daf83dfa..2074814534fef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -721,6 +721,10 @@ static inline bool arm_smmu_cmdq_supports_cmd(struct arm_smmu_cmdq *cmdq,
 struct arm_smmu_cmdq_batch {
 	struct arm_smmu_cmd		cmds[CMDQ_BATCH_ENTRIES];
 	struct arm_smmu_cmdq		*cmdq;
+	/* Per-domain invalidation array, for sub-batch retry-on-EIO lookup */
+	struct arm_smmu_invs		*invs;
+	/* Set when an ATC_INV is queued; gates the retry-aware sync decision */
+	bool				has_ats;
 	int				num;
 };
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 638956e2535b4..a31f8b1a94979 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -892,10 +892,13 @@ static int arm_smmu_cmdq_issue_cmd_p(struct arm_smmu_device *smmu,
 
 static void arm_smmu_cmdq_batch_init_cmd(struct arm_smmu_device *smmu,
 					 struct arm_smmu_cmdq_batch *cmds,
-					 struct arm_smmu_cmd *cmd)
+					 struct arm_smmu_cmd *cmd,
+					 struct arm_smmu_invs *invs)
 {
 	cmds->num = 0;
 	cmds->cmdq = arm_smmu_get_cmdq(smmu, cmd);
+	cmds->invs = invs;
+	cmds->has_ats = false;
 }
 
 static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
@@ -910,15 +913,17 @@ static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
 	if (force_sync || unsupported_cmd) {
 		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
 					    cmds->num, true);
-		arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd);
+		arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd, cmds->invs);
 	}
 
 	if (cmds->num == CMDQ_BATCH_ENTRIES) {
 		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
 					    cmds->num, false);
-		arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd);
+		arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd, cmds->invs);
 	}
 
+	if (FIELD_GET(CMDQ_0_OP, cmd->data[0]) == CMDQ_OP_ATC_INV)
+		cmds->has_ats = true;
 	cmds->cmds[cmds->num++] = *cmd;
 }
 
@@ -1486,7 +1491,7 @@ static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_cmd cmd = arm_smmu_make_cmd_cfgi_cd(0, ssid, leaf);
 
-	arm_smmu_cmdq_batch_init_cmd(smmu, &cmds, &cmd);
+	arm_smmu_cmdq_batch_init_cmd(smmu, &cmds, &cmd, NULL);
 	for (i = 0; i < master->num_streams; i++)
 		arm_smmu_cmdq_batch_add_cmd(
 			smmu, &cmds,
@@ -2434,7 +2439,7 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
 		return 0;
 
 	cmd = arm_smmu_make_cmd_atc_inv_all(0, IOMMU_NO_PASID);
-	arm_smmu_cmdq_batch_init_cmd(master->smmu, &cmds, &cmd);
+	arm_smmu_cmdq_batch_init_cmd(master->smmu, &cmds, &cmd, NULL);
 	for (i = 0; i < master->num_streams; i++)
 		arm_smmu_cmdq_batch_add_cmd(
 			master->smmu, &cmds,
@@ -2630,7 +2635,7 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
 		struct arm_smmu_inv *next;
 
 		if (!cmds.num)
-			arm_smmu_cmdq_batch_init_cmd(smmu, &cmds, &cmd);
+			arm_smmu_cmdq_batch_init_cmd(smmu, &cmds, &cmd, invs);
 
 		switch (cur->type) {
 		case INV_TYPE_S1_ASID:
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 20/24] iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (18 preceding siblings ...)
  2026-05-19  3:39 ` [PATCH v4 19/24] iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch Nicolin Chen
@ 2026-05-19  3:39 ` Nicolin Chen
  2026-05-19  3:39 ` [PATCH v4 21/24] iommu/arm-smmu-v3: Move arm_smmu_invs_for_each_entry to header Nicolin Chen
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:39 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

Both arm_smmu_cmdq_batch_submit() and arm_smmu_cmdq_batch_add_cmd_p() call
arm_smmu_cmdq_issue_cmdlist() to flush batches. A future change will retry
the issued commands on -EIO, using the arm_smmu_invs carried in the batch.
So, a single hook point is preferred.

Introduce an arm_smmu_cmdq_batch_issue() wrapper, so that the retry logic
can later be added in one place.

No functional changes.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a31f8b1a94979..4f2b23b1e8163 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -901,6 +901,14 @@ static void arm_smmu_cmdq_batch_init_cmd(struct arm_smmu_device *smmu,
 	cmds->has_ats = false;
 }
 
+static int arm_smmu_cmdq_batch_issue(struct arm_smmu_device *smmu,
+				     struct arm_smmu_cmdq_batch *cmds,
+				     bool sync)
+{
+	return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
+					   cmds->num, sync);
+}
+
 static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
 					  struct arm_smmu_cmdq_batch *cmds,
 					  struct arm_smmu_cmd *cmd)
@@ -911,14 +919,12 @@ static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
 
 	unsupported_cmd = !arm_smmu_cmdq_supports_cmd(cmds->cmdq, cmd);
 	if (force_sync || unsupported_cmd) {
-		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
-					    cmds->num, true);
+		arm_smmu_cmdq_batch_issue(smmu, cmds, true);
 		arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd, cmds->invs);
 	}
 
 	if (cmds->num == CMDQ_BATCH_ENTRIES) {
-		arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
-					    cmds->num, false);
+		arm_smmu_cmdq_batch_issue(smmu, cmds, false);
 		arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd, cmds->invs);
 	}
 
@@ -936,8 +942,7 @@ static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
 static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
 				      struct arm_smmu_cmdq_batch *cmds)
 {
-	return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
-					   cmds->num, true);
+	return arm_smmu_cmdq_batch_issue(smmu, cmds, true);
 }
 
 static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 21/24] iommu/arm-smmu-v3: Move arm_smmu_invs_for_each_entry to header
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (19 preceding siblings ...)
  2026-05-19  3:39 ` [PATCH v4 20/24] iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper Nicolin Chen
@ 2026-05-19  3:39 ` Nicolin Chen
  2026-05-19  3:39 ` [PATCH v4 22/24] iommu/arm-smmu-v3: Introduce master->ats_invs Nicolin Chen
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:39 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

A subsequent change will use this helper in a lockless context. Since it
is a macro, move it to the header file so other functions can use it.

Also, add READ_ONCE to invs->num_invs for lockless use.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
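For illustration only, a minimal userspace sketch of an iterator that skips
unused entries and re-reads the element count on every step. All toy_* names
are hypothetical. TOY_READ_ONCE is a simplified volatile-cast stand-in for
the kernel's READ_ONCE() and assumes GCC or Clang for __typeof__.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the driver. */
#define TOY_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct toy_inv {
	unsigned int id;
	size_t users;		/* 0 marks an unused ("trash") entry */
};

struct toy_invs {
	size_t num_invs;
	struct toy_inv inv[8];
};

/* Return the first used entry at or after 'next', or NULL at the end */
static struct toy_inv *toy_iter_next(struct toy_invs *invs, size_t next,
				     size_t *idx)
{
	for (; next < TOY_READ_ONCE(invs->num_invs); next++) {
		if (TOY_READ_ONCE(invs->inv[next].users)) {
			*idx = next;
			return &invs->inv[next];
		}
	}
	*idx = next;
	return NULL;
}

#define toy_for_each_entry(invs, idx, cur)                         \
	for (cur = toy_iter_next(invs, 0, &(idx)); cur;            \
	     cur = toy_iter_next(invs, (idx) + 1, &(idx)))

/* Example walker: sum the ids of all used entries */
static unsigned int toy_sum_ids(struct toy_invs *invs)
{
	struct toy_inv *cur;
	unsigned int sum = 0;
	size_t i;

	toy_for_each_entry(invs, i, cur)
		sum += cur->id;
	return sum;
}
```

The volatile reads only keep the compiler from caching the values; they do
not by themselves order accesses against a concurrent writer.
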
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 28 +++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 28 ---------------------
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 2074814534fef..b5ace01c05a5d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -803,6 +803,34 @@ struct arm_smmu_invs {
 	struct arm_smmu_inv inv[] __counted_by(max_invs);
 };
 
+/* Invalidation array manipulation functions */
+static inline struct arm_smmu_inv *
+arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
+{
+	while (true) {
+		if (next >= READ_ONCE(invs->num_invs)) {
+			*idx = next;
+			return NULL;
+		}
+		if (!READ_ONCE(invs->inv[next].users)) {
+			next++;
+			continue;
+		}
+		*idx = next;
+		return &invs->inv[next];
+	}
+}
+
+/**
+ * arm_smmu_invs_for_each_entry - Iterate over all non-trash entries in invs
+ * @invs: the base invalidation array
+ * @idx: a stack variable of 'size_t', to store the array index
+ * @cur: a stack variable of 'struct arm_smmu_inv *'
+ */
+#define arm_smmu_invs_for_each_entry(invs, idx, cur)                           \
+	for (cur = arm_smmu_invs_iter_next(invs, 0, &(idx)); cur;              \
+	     cur = arm_smmu_invs_iter_next(invs, idx + 1, &(idx)))
+
 static inline struct arm_smmu_invs *arm_smmu_invs_alloc(size_t num_invs)
 {
 	struct arm_smmu_invs *new_invs;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4f2b23b1e8163..c95297acf2cfe 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -979,34 +979,6 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
 	 */
 }
 
-/* Invalidation array manipulation functions */
-static inline struct arm_smmu_inv *
-arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
-{
-	while (true) {
-		if (next >= invs->num_invs) {
-			*idx = next;
-			return NULL;
-		}
-		if (!READ_ONCE(invs->inv[next].users)) {
-			next++;
-			continue;
-		}
-		*idx = next;
-		return &invs->inv[next];
-	}
-}
-
-/**
- * arm_smmu_invs_for_each_entry - Iterate over all non-trash entries in invs
- * @invs: the base invalidation array
- * @idx: a stack variable of 'size_t', to store the array index
- * @cur: a stack variable of 'struct arm_smmu_inv *'
- */
-#define arm_smmu_invs_for_each_entry(invs, idx, cur)                           \
-	for (cur = arm_smmu_invs_iter_next(invs, 0, &(idx)); cur;              \
-	     cur = arm_smmu_invs_iter_next(invs, idx + 1, &(idx)))
-
 static int arm_smmu_inv_cmp(const struct arm_smmu_inv *inv_l,
 			    const struct arm_smmu_inv *inv_r)
 {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 22/24] iommu/arm-smmu-v3: Introduce master->ats_invs
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (20 preceding siblings ...)
  2026-05-19  3:39 ` [PATCH v4 21/24] iommu/arm-smmu-v3: Move arm_smmu_invs_for_each_entry to header Nicolin Chen
@ 2026-05-19  3:39 ` Nicolin Chen
  2026-05-19 12:12   ` Jason Gunthorpe
  2026-05-19  3:39 ` [PATCH v4 23/24] iommu/arm-smmu-v3: Serialize STE.EATS and ats_broken updates Nicolin Chen
  2026-05-19  3:39 ` [PATCH v4 24/24] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
  23 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:39 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

Similar to master->build_invs used by a per-domain invalidation, add a new
master->ats_invs to be used by arm_smmu_atc_inv_master().

Since arm_smmu_cmdq_batch_init_cmd() now takes an invs pointer, pass it in.

This will let arm_smmu_cmdq_batch_issue() backtrack the master pointer from
a timed-out ATC invalidation command in a subsequent change.

Also replace the streams loop with arm_smmu_invs_for_each_entry(), since the
ats_invs entries are already initialized (except ssid) in
arm_smmu_insert_master().

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 46 ++++++++++++++++++---
 2 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index b5ace01c05a5d..186efcbed1ea9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1045,6 +1045,8 @@ struct arm_smmu_master {
 	 * iommu_group mutex.
 	 */
 	struct arm_smmu_invs		*build_invs;
+	/* Scratch memory for arm_smmu_atc_inv_master() to build an ATS array */
+	struct arm_smmu_invs		*ats_invs;
 	struct arm_smmu_vmaster		*vmaster; /* use smmu->streams_mutex */
 	/* Locked by the iommu core using the group mutex */
 	struct arm_smmu_ctx_desc_cfg	cd_table;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c95297acf2cfe..9591e4ab2b14a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2407,21 +2407,28 @@ arm_smmu_atc_inv_to_cmd(u32 sid, int ssid, unsigned long iova, size_t size)
 static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
 				   ioasid_t ssid)
 {
-	int i;
+	struct arm_smmu_invs *invs = master->ats_invs;
 	struct arm_smmu_cmd cmd;
 	struct arm_smmu_cmdq_batch cmds;
+	struct arm_smmu_inv *inv;
+	size_t i;
+
+	/* No concurrent user on master->ats_invs */
+	iommu_group_mutex_assert(master->dev);
 
 	/* Do not issue ATC_INV that will definitely time out */
 	if (READ_ONCE(master->ats_broken))
 		return 0;
 
 	cmd = arm_smmu_make_cmd_atc_inv_all(0, IOMMU_NO_PASID);
-	arm_smmu_cmdq_batch_init_cmd(master->smmu, &cmds, &cmd, NULL);
-	for (i = 0; i < master->num_streams; i++)
+	arm_smmu_cmdq_batch_init_cmd(master->smmu, &cmds, &cmd, invs);
+
+	arm_smmu_invs_for_each_entry(invs, i, inv) {
+		inv->ssid = ssid;
 		arm_smmu_cmdq_batch_add_cmd(
 			master->smmu, &cmds,
-			arm_smmu_make_cmd_atc_inv_all(master->streams[i].id,
-						      ssid));
+			arm_smmu_make_cmd_atc_inv_all(inv->id, ssid));
+	}
 
 	return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
 }
@@ -4087,6 +4094,18 @@ static int arm_smmu_stream_id_cmp(const void *_l, const void *_r)
 	return cmp_int(*l, *r);
 }
 
+static void arm_smmu_master_init_ats_inv(struct arm_smmu_master *master,
+					 struct arm_smmu_inv *inv, u32 sid)
+{
+	inv->id = sid;
+	inv->users = 1;
+	inv->master = master;
+	inv->smmu = master->smmu;
+	inv->type = INV_TYPE_ATS;
+	inv->size_opcode = CMDQ_OP_ATC_INV;
+	inv->nsize_opcode = CMDQ_OP_ATC_INV;
+}
+
 static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
 				  struct arm_smmu_master *master)
 {
@@ -4105,11 +4124,19 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
 		/* Base case has 1 ASID entry or maximum 2 VMID entries */
 		master->build_invs = arm_smmu_invs_alloc(2);
 	} else {
+		master->ats_invs = arm_smmu_invs_alloc(fwspec->num_ids);
+		if (!master->ats_invs) {
+			kfree(master->streams);
+			return -ENOMEM;
+		}
+		master->ats_invs->has_ats = true;
+
 		/* ATS case adds num_ids of entries, on top of the base case */
 		master->build_invs = arm_smmu_invs_alloc(2 + fwspec->num_ids);
 	}
 	if (!master->build_invs) {
 		kfree(master->streams);
+		kfree(master->ats_invs);
 		return -ENOMEM;
 	}
 
@@ -4125,6 +4152,13 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
 		       sizeof(master->streams[0]), arm_smmu_stream_id_cmp,
 		       NULL);
 
+	if (master->ats_invs) {
+		for (i = 0; i < fwspec->num_ids; i++)
+			arm_smmu_master_init_ats_inv(master,
+						     &master->ats_invs->inv[i],
+						     master->streams[i].id);
+	}
+
 	mutex_lock(&smmu->streams_mutex);
 	for (i = 0; i < fwspec->num_ids; i++) {
 		struct arm_smmu_stream *new_stream = &master->streams[i];
@@ -4159,6 +4193,7 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
 		for (i--; i >= 0; i--)
 			rb_erase(&master->streams[i].node, &smmu->streams);
 		kfree(master->streams);
+		kfree(master->ats_invs);
 		kfree(master->build_invs);
 	}
 	mutex_unlock(&smmu->streams_mutex);
@@ -4261,6 +4296,7 @@ static void arm_smmu_release_device(struct device *dev)
 	 */
 	synchronize_rcu();
 	kfree(master->streams);
+	kfree(master->ats_invs);
 	kfree(master->build_invs);
 	kfree(master);
 }
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 23/24] iommu/arm-smmu-v3: Serialize STE.EATS and ats_broken updates
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (21 preceding siblings ...)
  2026-05-19  3:39 ` [PATCH v4 22/24] iommu/arm-smmu-v3: Introduce master->ats_invs Nicolin Chen
@ 2026-05-19  3:39 ` Nicolin Chen
  2026-05-19  3:39 ` [PATCH v4 24/24] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:39 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

A subsequent change adding an ATS-broken path will set master->ats_broken
flag and overwrite STE.EATS to 0 for a broken master. This will introduce
race conditions:
 - A concurrent attachment that read ats_broken=false prior to ats_broken
   being set to true could re-enable STE.EATS.
 - A concurrent reset_device_done callback could clear master->ats_broken
   during the ATS-broken path, leading to iommu_report_device_broken() on
   a pre-reset fault.
 - When the ATS-broken path reads the old_data[1] and writes (old_data[1]
   & ~EATS), a concurrent attachment could write a new_data[1] in-between.

Because the ATS-broken path can run in an atomic context (invalidation), it
cannot take a mutex.

Introduce a per-master spinlock_t ats_broken_lock to fence these cases so
as to guarantee that in concurrent cases:
 a) A master->ats_broken update is observed by every concurrent attach
 b) STE.EATS is never re-enabled while master->ats_broken is true
 c) data[1] writes are not lost to a concurrent ATS-broken path

Note that IRQs must be disabled while holding the lock, because the
ATS-broken path can be entered from a hardirq that interrupts another lock
holder on the same CPU, leading to a deadlock.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
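For illustration only, a minimal userspace sketch of guarantees (b) and (c)
above: the STE writer and the ATS-broken path take the same lock, so EATS is
never re-enabled while the flag is set. All toy_* names and the TOY_STE_EATS
bit position are hypothetical. A C11 atomic_flag spinlock stands in for
spin_lock_irqsave(); IRQ disabling is not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the driver. */
#define TOY_STE_EATS (UINT64_C(3) << 28)	/* made-up bit position */

struct toy_master {
	atomic_flag lock;	/* stand-in for ats_broken_lock */
	bool ats_broken;
	uint64_t ste_data1;	/* models the STE word holding EATS */
};

static void toy_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait until the holder clears the flag */
}

static void toy_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Attach path: never install EATS while the master is marked broken */
static void toy_write_ste(struct toy_master *m, uint64_t target)
{
	toy_spin_lock(&m->lock);
	if (m->ats_broken)
		target &= ~TOY_STE_EATS;
	m->ste_data1 = target;
	toy_spin_unlock(&m->lock);
}

/* ATS-broken path: set the flag and clear EATS as one locked step */
static void toy_mark_ats_broken(struct toy_master *m)
{
	toy_spin_lock(&m->lock);
	m->ats_broken = true;
	m->ste_data1 &= ~TOY_STE_EATS;
	toy_spin_unlock(&m->lock);
}

/* Reset-done path: clear the flag under the same lock */
static void toy_reset_done(struct toy_master *m)
{
	toy_spin_lock(&m->lock);
	m->ats_broken = false;
	toy_spin_unlock(&m->lock);
}
```

Whichever order toy_write_ste() and toy_mark_ats_broken() run in, the final
ste_data1 has EATS cleared and keeps the other bits of the latest target.
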
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  5 +++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 22 ++++++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 186efcbed1ea9..e3eb4c4a62d3a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1048,6 +1048,11 @@ struct arm_smmu_master {
 	/* Scratch memory for arm_smmu_atc_inv_master() to build an ATS array */
 	struct arm_smmu_invs		*ats_invs;
 	struct arm_smmu_vmaster		*vmaster; /* use smmu->streams_mutex */
+	/*
+	 * Serializes arm_smmu_write_ste(), reset_device_done, and an ATS-broken
+	 * path, preventing races on ats_broken flag and STE updates.
+	 */
+	spinlock_t			ats_broken_lock;
 	/* Locked by the iommu core using the group mutex */
 	struct arm_smmu_ctx_desc_cfg	cd_table;
 	unsigned int			num_streams;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9591e4ab2b14a..ee864046f0baa 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1791,7 +1791,23 @@ static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
 		.sid = sid,
 	};
 
-	arm_smmu_write_entry(&ste_writer.writer, ste->data, target->data);
+	/*
+	 * Fence against the ATS-broken path concurrently overwriting STE.EATS.
+	 * It's fine if the ATS-broken path writes after arm_smmu_write_entry.
+	 * Otherwise, we must clear STE.EATS before sending a CFGI_STE command.
+	 *
+	 * Must disable IRQs; otherwise a hardirq-context invalidation path on
+	 * this CPU could deadlock at ats_broken_lock on an ATC_INV timeout.
+	 */
+	scoped_guard(spinlock_irqsave, &master->ats_broken_lock) {
+		struct arm_smmu_ste local_target = *target;
+
+		if (master->ats_broken)
+			local_target.data[1] &=
+				~cpu_to_le64(STRTAB_STE_1_EATS);
+		arm_smmu_write_entry(&ste_writer.writer, ste->data,
+				     local_target.data);
+	}
 
 	/* It's likely that we'll want to use the new STE soon */
 	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
@@ -3000,6 +3016,9 @@ static void arm_smmu_reset_device_done(struct device *dev)
 
 	if (WARN_ON(!master))
 		return;
+
+	/* Ensure the device recovery is seen, to flush any pre-reset fault */
+	guard(spinlock_irqsave)(&master->ats_broken_lock);
 	/* Pair with lockless readers */
 	WRITE_ONCE(master->ats_broken, false);
 }
@@ -4236,6 +4255,7 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 
 	master->dev = dev;
 	master->smmu = smmu;
+	spin_lock_init(&master->ats_broken_lock);
 	dev_iommu_priv_set(dev, master);
 
 	ret = arm_smmu_insert_master(smmu, master);
-- 
2.43.0
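
A minimal userspace sketch of the fencing idea in the patch above, under a
simplified model. Every model_* name is hypothetical, the EATS mask position
is illustrative only, and a C11 atomic_flag spin loop stands in for the kernel
spinlock. It only shows why the STE writer masks EATS under the same lock that
guards ats_broken. It is not the driver code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for STRTAB_STE_1_EATS; bit position is illustrative. */
#define MODEL_STE_EATS	(UINT64_C(3) << 28)

struct model_master {
	atomic_flag lock;	/* models master->ats_broken_lock */
	bool ats_broken;	/* models master->ats_broken */
	uint64_t ste1;		/* models the live STE data[1] word */
};

static void model_lock(struct model_master *m)
{
	/* Busy-wait until the flag was previously clear */
	while (atomic_flag_test_and_set_explicit(&m->lock,
						 memory_order_acquire))
		;
}

static void model_unlock(struct model_master *m)
{
	atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

void model_master_init(struct model_master *m)
{
	atomic_flag_clear(&m->lock);
	m->ats_broken = false;
	m->ste1 = 0;
}

/* Attach path: install a new STE word, masking EATS while quarantined */
void model_write_ste(struct model_master *m, uint64_t target)
{
	model_lock(m);
	if (m->ats_broken)
		target &= ~MODEL_STE_EATS;
	m->ste1 = target;
	model_unlock(m);
}

/* ATS-broken path: clear EATS in the live STE, then set the flag */
void model_mark_ats_broken(struct model_master *m)
{
	model_lock(m);
	m->ste1 &= ~MODEL_STE_EATS;
	m->ats_broken = true;
	model_unlock(m);
}

/* Reset-done path: lift the quarantine; later writes may set EATS again */
void model_reset_done(struct model_master *m)
{
	model_lock(m);
	m->ats_broken = false;
	model_unlock(m);
}
```

In this model, a write that races with the quarantine either lands before it
(and has EATS cleared by model_mark_ats_broken()) or after it (and has EATS
masked by model_write_ste()), so EATS never stays set while ats_broken is true.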



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 24/24] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
  2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
                   ` (22 preceding siblings ...)
  2026-05-19  3:39 ` [PATCH v4 23/24] iommu/arm-smmu-v3: Serialize STE.EATS and ats_broken updates Nicolin Chen
@ 2026-05-19  3:39 ` Nicolin Chen
  23 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19  3:39 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Jason Gunthorpe
  Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

Currently, when GERROR_CMDQ_ERR occurs, arm_smmu_cmdq_skip_err() does
nothing for CMDQ_ERR_CERROR_ATC_INV_IDX.

When a device is unresponsive to an ATC invalidation request, this often
results in constant CMDQ errors:
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb84): ATC invalidate timeout
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb88): ATC invalidate timeout
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
  ...

An ATC invalidation timeout indicates that the device failed to respond to
a protocol-critical coherency request, which means that the device's
internal ATS state is desynchronized from the SMMU.

Furthermore, ignoring the timeout leaves the system in an unsafe state, as
the device cache may retain stale ATC entries for memory pages that the OS
has already reclaimed and reassigned. This might lead to data corruption.

Isolate a device that is confirmed to be unresponsive with a surgical STE
update that clears its EATS bit, rejecting any further ATS transaction
that could corrupt memory.

Also, set the master->ats_broken flag, which is cleared after the device
completes a reset. This flag prevents further ATS requests and
invalidations.

Finally, report the broken device to the IOMMU core so that the device is
isolated at the core level too.

Since the three steps above are invoked in an invalidation path (which can
be an atomic context), hold the ats_broken_lock instead of any mutex.

For batched ATC_INV commands, SMMU hardware only reports a timeout at the
CMD_SYNC, which could follow the batch issued for multiple devices. So, it
isn't straightforward to identify which command in a batch resulted in the
timeout. Fortunately, the invs array has a sorted list of ATC entries. So,
the issued batch must be sorted as well. This makes it possible to retry
the ATC_INV command for each unique Stream ID in the batch to identify the
unresponsive master.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  18 +++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 118 +++++++++++++++++++-
 2 files changed, 133 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e3eb4c4a62d3a..43d4a35500500 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -831,6 +831,24 @@ arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
 	for (cur = arm_smmu_invs_iter_next(invs, 0, &(idx)); cur;              \
 	     cur = arm_smmu_invs_iter_next(invs, idx + 1, &(idx)))
 
+static inline struct arm_smmu_master *
+arm_smmu_invs_find_ats_master(struct arm_smmu_invs *invs,
+			      struct arm_smmu_device *smmu, u32 sid)
+{
+	struct arm_smmu_inv *cur;
+	size_t i;
+
+	if (!invs->has_ats)
+		return NULL;
+
+	arm_smmu_invs_for_each_entry(invs, i, cur) {
+		if (cur->smmu == smmu && arm_smmu_inv_is_ats(cur) &&
+		    cur->id == sid)
+			return cur->master;
+	}
+	return NULL;
+}
+
 static inline struct arm_smmu_invs *arm_smmu_invs_alloc(size_t num_invs)
 {
 	struct arm_smmu_invs *new_invs;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index ee864046f0baa..0323fd3f33b7f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -107,8 +107,13 @@ static const char * const event_class_str[] = {
 	[3] = "Reserved",
 };
 
+static struct arm_smmu_ste *
+arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid);
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
 static bool arm_smmu_ats_supported(struct arm_smmu_master *master);
+static void arm_smmu_cmdq_batch_retry(struct arm_smmu_device *smmu,
+				      struct arm_smmu_invs *invs,
+				      struct arm_smmu_cmdq_batch *cmds);
 
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
@@ -905,8 +910,13 @@ static int arm_smmu_cmdq_batch_issue(struct arm_smmu_device *smmu,
 				     struct arm_smmu_cmdq_batch *cmds,
 				     bool sync)
 {
-	return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
-					   cmds->num, sync);
+	int ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
+					      cmds->num, sync);
+
+	/* Identify the timed-out master via cmds->invs */
+	if (ret == -EIO && cmds->invs)
+		arm_smmu_cmdq_batch_retry(smmu, cmds->invs, cmds);
+	return ret;
 }
 
 static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
@@ -924,7 +934,11 @@ static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
 	}
 
 	if (cmds->num == CMDQ_BATCH_ENTRIES) {
-		arm_smmu_cmdq_batch_issue(smmu, cmds, false);
+		/*
+		 * Force sync for ATS-bearing batches so the timeout is caught
+		 * here, not at a later unrelated batch's CMD_SYNC.
+		 */
+		arm_smmu_cmdq_batch_issue(smmu, cmds, cmds->has_ats);
 		arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd, cmds->invs);
 	}
 
@@ -945,6 +959,104 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
 	return arm_smmu_cmdq_batch_issue(smmu, cmds, true);
 }
 
+static void arm_smmu_master_disable_ats(struct arm_smmu_master *master)
+{
+	struct arm_smmu_cmd cmd = arm_smmu_make_cmd_op(CMDQ_OP_CFGI_STE);
+	struct arm_smmu_device *smmu = master->smmu;
+	struct arm_smmu_cmdq_batch cmds;
+	struct arm_smmu_inv *cur;
+	size_t i;
+
+	lockdep_assert_held(&master->ats_broken_lock);
+
+	/* Disable STE.EATS on every SID */
+	arm_smmu_cmdq_batch_init_cmd(smmu, &cmds, &cmd, NULL);
+	arm_smmu_invs_for_each_entry(master->ats_invs, i, cur) {
+		struct arm_smmu_ste *step =
+			arm_smmu_get_step_for_sid(smmu, cur->id);
+
+		/* EATS is safe to update. See arm_smmu_get_ste_update_safe() */
+		WRITE_ONCE(step->data[1],
+			   step->data[1] & ~cpu_to_le64(STRTAB_STE_1_EATS));
+
+		arm_smmu_cmdq_batch_add_cmd(
+			smmu, &cmds, arm_smmu_make_cmd_cfgi_ste(cur->id, true));
+	}
+	if (arm_smmu_cmdq_batch_submit(smmu, &cmds))
+		dev_err_ratelimited(smmu->dev,
+				    "failed to disable ATS for master\n");
+
+	/* Pair with lockless readers */
+	WRITE_ONCE(master->ats_broken, true);
+
+	/* Lastly, report to the core to schedule a full blocking procedure */
+	iommu_report_device_broken(master->dev);
+
+	/*
+	 * When a concurrent pci_dev_reset_iommu_done() runs after this report
+	 * (e.g. an AER recovery in flight), the broken_worker may transiently
+	 * block a recovering device. pci_dev_reset_iommu_done() will lift it
+	 * immediately. Net end-state is correct.
+	 */
+}
+
+static void arm_smmu_cmdq_batch_retry(struct arm_smmu_device *smmu,
+				      struct arm_smmu_invs *invs,
+				      struct arm_smmu_cmdq_batch *cmds)
+{
+	struct arm_smmu_cmd atc = {};
+	int i;
+
+	/* Only a timed out ATC_INV command needs a retry */
+	if (!invs->has_ats)
+		return;
+
+	for (i = 0; i < cmds->num; i++) {
+		struct arm_smmu_cmdq *cmdq = cmds->cmdq;
+		struct arm_smmu_master *master = NULL;
+		unsigned long flags;
+		u32 sid;
+		int ret;
+
+		/* Only need to retry ATC invalidations */
+		if (FIELD_GET(CMDQ_0_OP, cmds->cmds[i].data[0]) !=
+		    CMDQ_OP_ATC_INV)
+			continue;
+
+		/* Only need to retry with one ATC_INV per Stream ID (device) */
+		sid = FIELD_GET(CMDQ_ATC_0_SID, cmds->cmds[i].data[0]);
+		if (atc.data[0] &&
+		    sid == FIELD_GET(CMDQ_ATC_0_SID, atc.data[0]))
+			continue;
+
+		master = arm_smmu_invs_find_ats_master(invs, smmu, sid);
+		if (WARN_ON(!master))
+			continue;
+
+		atc = cmds->cmds[i];
+		/*
+		 * Hold ats_broken_lock across the per-master re-issue and the
+		 * possible disable_ats, so a concurrent reset_device_done()
+		 * cannot clear ats_broken between the timeout observation and
+		 * the quarantine action.
+		 */
+		spin_lock_irqsave(&master->ats_broken_lock, flags);
+		/*
+		 * A previous retry on a sibling SID may have already disabled
+		 * ATS across all the STEs owned by this master's SIDs. Skip it.
+		 */
+		if (master->ats_broken) {
+			spin_unlock_irqrestore(&master->ats_broken_lock, flags);
+			continue;
+		}
+
+		ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, &atc, 1, true);
+		if (ret == -EIO)
+			arm_smmu_master_disable_ats(master);
+		spin_unlock_irqrestore(&master->ats_broken_lock, flags);
+	}
+}
+
 static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
 				   struct iommu_page_response *resp)
 {
-- 
2.43.0
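
A hedged sketch of the per-SID retry described in the commit message above,
as a standalone model rather than the driver code. The model_* names, the
two-opcode command type, and the probe callback are all assumptions made for
illustration. The probe stands in for "re-issue one ATC_INV plus CMD_SYNC and
see whether it times out". Because the batch is sorted by SID, remembering
only the last retried SID is enough to probe each device once.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_op { MODEL_OP_TLBI, MODEL_OP_ATC_INV };

struct model_cmd {
	enum model_op op;
	uint32_t sid;		/* only meaningful for MODEL_OP_ATC_INV */
};

/* Returns true if the device behind @sid answered a single ATC_INV */
typedef bool (*model_probe_fn)(uint32_t sid);

/*
 * Walk a SID-sorted batch, probe each unique SID once, and record the SIDs
 * that did not respond. Returns the number of SIDs written to @broken.
 */
size_t model_batch_retry(const struct model_cmd *cmds, size_t num,
			 model_probe_fn probe, uint32_t *broken,
			 size_t max_broken)
{
	bool have_last = false;
	uint32_t last_sid = 0;
	size_t n = 0;

	for (size_t i = 0; i < num; i++) {
		/* Only ATC invalidations can time out this way */
		if (cmds[i].op != MODEL_OP_ATC_INV)
			continue;
		/* Sorted batch: a repeated SID is adjacent to its first use */
		if (have_last && cmds[i].sid == last_sid)
			continue;

		have_last = true;
		last_sid = cmds[i].sid;
		if (!probe(last_sid) && n < max_broken)
			broken[n++] = last_sid;
	}
	return n;
}

/* Example probe for illustration: pretend SID 0x20 never responds */
unsigned int model_probe_calls;

bool model_probe_example(uint32_t sid)
{
	model_probe_calls++;
	return sid != 0x20;
}
```

With a batch holding one non-ATS command and ATC_INV commands for SIDs 0x10,
0x10, 0x20, 0x20, 0x30, the example probe runs three times and only 0x20 is
reported as unresponsive.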



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 06/24] iommu: Defer iommu_group free via kfree_rcu()
  2026-05-19  3:38 ` [PATCH v4 06/24] iommu: Defer iommu_group free via kfree_rcu() Nicolin Chen
@ 2026-05-19 11:39   ` Jason Gunthorpe
  2026-05-19 18:54     ` Nicolin Chen
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 11:39 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Mon, May 18, 2026 at 08:38:49PM -0700, Nicolin Chen wrote:
> dev->iommu_group will be read in an ISR-context to look up a group_device
> for fault reporting, in which case mutex cannot be used. For that read to
> be safe, two things are needed:

What driver does this? iommu_report_device_fault() has to be called in
a sleepable context - usually a threaded IRQ handler. So mutex is no
problem.

This seems like Sashiko slop - iommu_group does not change while a
driver is attached and a driver is not permitted to do any "fault
handling" after it has detached, it must flush and synchronize its
IRQ if it is using this from a hard IRQ for some bad reason.

You need to be a lot more critical about the noise that Sashiko
generates: a lot is useful, a lot is not. It is not a tool where you
want to have 0 reports; complaining about hallucinations is expected.

> +#define dev_iommu_group_rcu(dev) \
> +	(*((struct iommu_group __rcu __force **)&(dev)->iommu_group))

Don't do things like this.

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 07/24] iommu: Defer __iommu_group_free_device() to be outside group->mutex
  2026-05-19  3:38 ` [PATCH v4 07/24] iommu: Defer __iommu_group_free_device() to be outside group->mutex Nicolin Chen
@ 2026-05-19 11:47   ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 11:47 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Mon, May 18, 2026 at 08:38:50PM -0700, Nicolin Chen wrote:
> __iommu_group_remove_device() holds group->mutex across the entire call to
> __iommu_group_free_device() that performs sysfs removals, tracing, and the
> final kfree(). But in fact, most of these operations don't really need the
> group->mutex.

Are you sure? sysfs requires unique names, and this mutex is providing
a guarantee that will happen. While it shouldn't be possible to race
remove and attach, it is easier to reason about if we don't have to
make this assumption.

> Subsequent changes will introduce sleepable operations to this function:
>  + synchronize_rcu() to defer the gdev->dev put past a grace period.

I'm not keen on that at all.

Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 17/24] iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries
  2026-05-19  3:39 ` [PATCH v4 17/24] iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries Nicolin Chen
@ 2026-05-19 12:01   ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 12:01 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Mon, May 18, 2026 at 08:39:00PM -0700, Nicolin Chen wrote:
> Storing the master pointer allows backtracking it from an ATS invalidation
> entry, which will be useful when handling ATC invalidation timeouts.
> 
> Don't simply swap the "smmu" pointer for the "master": a non-ATS entry may
> be shared across multiple devices (masters). An ATS entry is okay since it
> is tied to a unique SID.
> 
> Master must outlive any concurrent RCU reader iterating the domain->invs,
> because inv->master is dereferenced inside the read-side critical section.
> 
> Add a synchronize_rcu() in arm_smmu_release_device() before freeing master.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 +++++++++++---
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 604f7edf54158..df6e539f75274 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -738,6 +738,7 @@ enum arm_smmu_inv_type {
>  
>  struct arm_smmu_inv {
>  	struct arm_smmu_device *smmu;
> +	struct arm_smmu_master *master; /* INV_TYPE_ATS* */

I don't like this and the locking for just a slow error case.. You
should use the SID for this and have a spinlocked ATS master list to
search for the SID.

Jason
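
A rough sketch of the SID-based lookup suggested in this reply, assuming a
table of (SID, master) pairs that is kept sorted by SID and searched under a
lock. The model_* names and the integer master_id are hypothetical, a sorted
array with bsearch() stands in for whatever list or tree the driver would use,
and a C11 atomic_flag spin loop stands in for the spinlock.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct model_stream {
	uint32_t sid;
	int master_id;		/* stands in for a struct arm_smmu_master * */
};

struct model_smmu {
	atomic_flag lock;			/* models the lookup spinlock */
	const struct model_stream *streams;	/* sorted by sid, ascending */
	size_t num;
};

void model_smmu_init(struct model_smmu *smmu,
		     const struct model_stream *streams, size_t num)
{
	atomic_flag_clear(&smmu->lock);
	smmu->streams = streams;
	smmu->num = num;
}

/* bsearch() comparator: key is a SID, element is a model_stream */
static int model_cmp_sid(const void *key, const void *elem)
{
	uint32_t sid = *(const uint32_t *)key;
	const struct model_stream *s = elem;

	if (sid < s->sid)
		return -1;
	if (sid > s->sid)
		return 1;
	return 0;
}

/* Returns the master_id that owns @sid, or -1 if no stream matches */
int model_find_master(struct model_smmu *smmu, uint32_t sid)
{
	const struct model_stream *s;
	int id;

	while (atomic_flag_test_and_set_explicit(&smmu->lock,
						 memory_order_acquire))
		;
	s = bsearch(&sid, smmu->streams, smmu->num, sizeof(*s),
		    model_cmp_sid);
	id = s ? s->master_id : -1;
	atomic_flag_clear_explicit(&smmu->lock, memory_order_release);

	return id;
}
```

The point of the model is that the invalidation entry only needs to carry the
SID; the master is resolved on the slow error path, so no master pointer has
to stay valid inside the invalidation array.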


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 18/24] iommu/arm-smmu-v3: Introduce master->ats_broken flag
  2026-05-19  3:39 ` [PATCH v4 18/24] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
@ 2026-05-19 12:06   ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 12:06 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Mon, May 18, 2026 at 08:39:01PM -0700, Nicolin Chen wrote:

> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index cde2ff2dcc49b..638956e2535b4 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2429,6 +2429,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
>  	struct arm_smmu_cmd cmd;
>  	struct arm_smmu_cmdq_batch cmds;
>  
> +	/* Do not issue ATC_INV that will definitely time out */
> +	if (READ_ONCE(master->ats_broken))
> +		return 0;
> +
>  	cmd = arm_smmu_make_cmd_atc_inv_all(0, IOMMU_NO_PASID);
>  	arm_smmu_cmdq_batch_init_cmd(master->smmu, &cmds, &cmd);
>  	for (i = 0; i < master->num_streams; i++)
> @@ -2651,12 +2655,18 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
>  						       cur->id));
>  			break;
>  		case INV_TYPE_ATS:
> +			/* Do not issue ATC_INV that will definitely time out */
> +			if (READ_ONCE(cur->master->ats_broken))
> +				break;

Yuk, this should be a bool in the invs list, not the master.

If the flow is to have the core code always attach a blocked domain before
reset_done then the invs list will naturally fix itself.

> @@ -3007,6 +3027,14 @@ static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
>  	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS))
>  		return false;
>  
> +	/*
> +	 * Do not enable ATS if master->ats_broken is set. The PCI device should
> +	 * go through a recovery (reset) that shall notify the SMMUv3 driver via
> +	 * a reset_device_done callback.
> +	 */

Should just fail attach of paging domains at this point, the core code
should arguably be preventing races like this. Attaching a paging
domain in a broken way is going to create more problems.

Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-05-19  3:38 ` [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
@ 2026-05-19 12:07   ` Jason Gunthorpe
  2026-05-19 18:29     ` Nicolin Chen
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 12:07 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Mon, May 18, 2026 at 08:38:54PM -0700, Nicolin Chen wrote:
> +void iommu_report_device_broken(struct device *dev)
> +{
> +	struct group_device *gdev;
> +
> +	/*
> +	 * We cannot hold group->mutex here. Rely on iommu_group_broken_worker()
> +	 * to validate dev_has_iommu(). The iommu_group memory is RCU-protected
> +	 * via kfree_rcu() in iommu_group_release(), and group->devices is an
> +	 * RCU-protected list, so the lookup runs entirely under rcu_read_lock.
> +	 *
> +	 * Note the device might have been concurrently removed from the group
> +	 * (list_del_rcu) before iommu_deinit_device() cleared the dev->iommu.
> +	 */
> +	rcu_read_lock();
> +	gdev = __dev_to_gdev_rcu(dev);
> +	if (gdev) {

If this is why the RCU is being added it seems like overkill.

Just add the worker to struct dev_iommu and push it there so it can
use a mutex but I'm confused why are we even adding this function?

The entire design of this series was supposed to have the IOMMU driver
itself adjust its "STE" to inhibit translated TLPs synchronously
within its fully locked invalidation loop.

What's the async worker for?

Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 19/24] iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch
  2026-05-19  3:39 ` [PATCH v4 19/24] iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch Nicolin Chen
@ 2026-05-19 12:09   ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 12:09 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Mon, May 18, 2026 at 08:39:02PM -0700, Nicolin Chen wrote:
> The arm_smmu_cmdq_batch_add_cmd_p() might flush a sub-batch mid-way, when
> the ARM_SMMU_OPT_CMDQ_FORCE_SYNC is set or when a batch is full. To allow
> a future change to retry these sub-batch flushes on a timeout and identify
> the broken master, the batch needs to carry both the per-domain invs and
> a per-batch indicator of whether the batch contains an ATC_INV.

This seems like too much to tackle in one series. Let's just assume
all the ATCs in the batch have failed and blow up everything for this
point.

I think retrying will be a lot easier by removing the batch..

This series has become way too big now anyhow..

Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 22/24] iommu/arm-smmu-v3: Introduce master->ats_invs
  2026-05-19  3:39 ` [PATCH v4 22/24] iommu/arm-smmu-v3: Introduce master->ats_invs Nicolin Chen
@ 2026-05-19 12:12   ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 12:12 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Mon, May 18, 2026 at 08:39:05PM -0700, Nicolin Chen wrote:
> Similar to master->build_invs used by a per-domain invalidation, add a new
> master->ats_invs to be used by arm_smmu_atc_inv_master().
> 
> Since arm_smmu_cmdq_batch_init_cmd() now takes an invs pointer, pass it in.
> 
> This will be useful by arm_smmu_cmdq_batch_issue() to backtrack the master
> pointer from a timed out ATC invalidation command in a subsequent change.

Again this is a good place to just use the SID and get back to the
master through the rbtree under a spinlock.

Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-05-19 12:07   ` Jason Gunthorpe
@ 2026-05-19 18:29     ` Nicolin Chen
  2026-05-19 19:16       ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19 18:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Tue, May 19, 2026 at 09:07:37AM -0300, Jason Gunthorpe wrote:
> On Mon, May 18, 2026 at 08:38:54PM -0700, Nicolin Chen wrote:
> > +void iommu_report_device_broken(struct device *dev)
> > +{
> > +	struct group_device *gdev;
> > +
> > +	/*
> > +	 * We cannot hold group->mutex here. Rely on iommu_group_broken_worker()
> > +	 * to validate dev_has_iommu(). The iommu_group memory is RCU-protected
> > +	 * via kfree_rcu() in iommu_group_release(), and group->devices is an
> > +	 * RCU-protected list, so the lookup runs entirely under rcu_read_lock.
> > +	 *
> > +	 * Note the device might have been concurrently removed from the group
> > +	 * (list_del_rcu) before iommu_deinit_device() cleared the dev->iommu.
> > +	 */
> > +	rcu_read_lock();
> > +	gdev = __dev_to_gdev_rcu(dev);
> > +	if (gdev) {
> 
> If this is why the RCU is being added it seems like overkill.
> 
> Just add the worker to struct dev_iommu and push it there so it can
> use a mutex but I'm confused why are we even adding this function?
> 
> The entire design of this series was supposed to have the IOMMU driver
> itself adjust its "STE" to inhibit translated TLPs synchronously
> within its fully locked invalidation loop.

Yes. Surgical STE is done in the driver. But, core-level attaching
state doesn't reflect correctly. So the driver calls this function
to notify the core (this is in an invalidation context -- not able
to use mutex).

> What's the async worker for?

Then, the core needs to block the device using the similar routine
to the reset prepare(). And that needs to hold group->mutex, so it
needs an async worker.

Do you see a much simpler way?

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 06/24] iommu: Defer iommu_group free via kfree_rcu()
  2026-05-19 11:39   ` Jason Gunthorpe
@ 2026-05-19 18:54     ` Nicolin Chen
  0 siblings, 0 replies; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19 18:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Tue, May 19, 2026 at 08:39:16AM -0300, Jason Gunthorpe wrote:
> On Mon, May 18, 2026 at 08:38:49PM -0700, Nicolin Chen wrote:
> > dev->iommu_group will be read in an ISR-context to look up a group_device
> > for fault reporting, in which case mutex cannot be used. For that read to
> > be safe, two things are needed:
> 
> What driver does this? iommu_report_device_fault() has to be called in
> a sleepable context - usually a threaded IRQ handler. So mutex is no
> problem.

It's used in an invalidation context, where mutex is a problem.

> This seems like Sashiko slop - iommu_group does not change while a
> driver is attached and a driver is not permitted to do any "fault
> handling" after it has detached, it must flush and synchronize its
> IRQ if it is using this from a hard IRQ for some bad reason.

Oh, I probably picked a wrong word. "fault" here means "ATS broken".

IOMMU driver sees ATC timeout during invalidation, and reports "ATS
broken" in a lockless context where device could be detached. No?

> You need to be a lot more critical about the noise that Sashiko
> generates: a lot is useful, a lot is not. It is not a tool where you
> want to have 0 reports; complaining about hallucinations is expected.

OK. I will be more aware of that in the future.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-05-19 18:29     ` Nicolin Chen
@ 2026-05-19 19:16       ` Jason Gunthorpe
  2026-05-19 22:30         ` Nicolin Chen
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 19:16 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Tue, May 19, 2026 at 11:29:23AM -0700, Nicolin Chen wrote:
> On Tue, May 19, 2026 at 09:07:37AM -0300, Jason Gunthorpe wrote:
> > On Mon, May 18, 2026 at 08:38:54PM -0700, Nicolin Chen wrote:
> > > +void iommu_report_device_broken(struct device *dev)
> > > +{
> > > +	struct group_device *gdev;
> > > +
> > > +	/*
> > > +	 * We cannot hold group->mutex here. Rely on iommu_group_broken_worker()
> > > +	 * to validate dev_has_iommu(). The iommu_group memory is RCU-protected
> > > +	 * via kfree_rcu() in iommu_group_release(), and group->devices is an
> > > +	 * RCU-protected list, so the lookup runs entirely under rcu_read_lock.
> > > +	 *
> > > +	 * Note the device might have been concurrently removed from the group
> > > +	 * (list_del_rcu) before iommu_deinit_device() cleared the dev->iommu.
> > > +	 */
> > > +	rcu_read_lock();
> > > +	gdev = __dev_to_gdev_rcu(dev);
> > > +	if (gdev) {
> > 
> > If this is why the RCU is being added it seems like overkill.
> > 
> > Just add the worker to struct dev_iommu and push it there so it can
> > use a mutex but I'm confused why are we even adding this function?
> > 
> > The entire design of this series was supposed to have the IOMMU driver
> > itself adjust its "STE" to inhibit translated TLPs synchronously
> > within its fully locked invalidation loop.
> 
> Yes. Surgical STE is done in the driver. But, core-level attaching
> state doesn't reflect correctly. So the driver calls this function
> to notify the core (this is in an invalidation context -- not able
> to use mutex).
> 
> > What's the async worker for?
> 
> Then, the core needs to block the device using the similar routine
> to the reset prepare(). And that needs to hold group->mutex, so it
> needs an async worker.
> 
> Do you see a much simpler way?

Put the work on the dev_iommu and forget about rcu.

But this is all probably better as some later series if at all. The
driver can block the ATS and the expectation is something will FLR the
device. The FLR will set the blocking and then restore the
domain. None of this async work seems functionally necessary, though
it would be nice to have. Let's focus on the bare minimum here; it
is already a difficult enough problem without tacking on these
extras.

Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-05-19 19:16       ` Jason Gunthorpe
@ 2026-05-19 22:30         ` Nicolin Chen
  2026-05-19 23:02           ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Nicolin Chen @ 2026-05-19 22:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Tue, May 19, 2026 at 04:16:26PM -0300, Jason Gunthorpe wrote:
> On Tue, May 19, 2026 at 11:29:23AM -0700, Nicolin Chen wrote:
> > On Tue, May 19, 2026 at 09:07:37AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 18, 2026 at 08:38:54PM -0700, Nicolin Chen wrote:
> > Then, the core needs to block the device using the similar routine
> > to the reset prepare(). And that needs to hold group->mutex, so it
> > needs an async worker.
> > 
> > Do you see a much simpler way?
> 
> Put the work on the dev_iommu and forget about rcu.
> 
> But this is all probably better as some later series if at all. The
> driver can block the ATS and the expectation is something will FLR the
> device. The FLR will set the blocking and then restore the
> domain. None of this async work seems functionally necessary, though
> it would be nice to have. Let's focus on the bare minimum here; it
> is already a difficult enough problem without tacking on these
> extras.

OK. So you are suggesting a quarantine at the driver-level only:

1. Driver detects ATC_INV timeout during an invalidation.
2. Driver retries the commands to identify the master.
3. Driver calls pci_disable_ats() and clears STE.EATS.
4. Driver marks domain->invs ATS entries as BROKEN.
   (optional since pci_disable_ats() is done?)
5. Driver sets master->ats_broken to fence concurrent attach:
   arm_smmu_write_ste() and arm_smmu_ats_supported().
6. Something external triggers an FLR (sysfs or AER).
7. FLR goes through pci_dev_reset_iommu_prepare()/done(). done()
   reverts 3+4 and calls the reset_device_done callback clearing
   master->ats_broken (5).

Right?

Then, we'll have very limited work in the core for this series.

Nicolin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
  2026-05-19 22:30         ` Nicolin Chen
@ 2026-05-19 23:02           ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2026-05-19 23:02 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
	Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
	Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi, Shuai Xue

On Tue, May 19, 2026 at 03:30:45PM -0700, Nicolin Chen wrote:
> On Tue, May 19, 2026 at 04:16:26PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 19, 2026 at 11:29:23AM -0700, Nicolin Chen wrote:
> > > On Tue, May 19, 2026 at 09:07:37AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, May 18, 2026 at 08:38:54PM -0700, Nicolin Chen wrote:
> > > Then, the core needs to block the device using a routine similar
> > > to the reset prepare(). That needs to hold group->mutex, so it
> > > needs an async worker.
> > > 
> > > Do you see a much simpler way?
> > 
> > Put the work on the dev_iommu and forget about rcu.
> > 
> > But this is all probably better as some later series if at all. The
> > driver can block the ATS and the expectation is something will FLR the
> > device. The FLR will set the blocking and then restore the
> > domain. None of this async work seems functionally necessary, though
> > it would be nice to have. Let's focus on the bare minimum here; it
> > is already a difficult enough problem without tacking on these
> > extras.
> 
> OK. So you are suggesting a quarantine at the driver-level only:
> 
> 1. Driver detects ATC_INV timeout during an invalidation.
> 2. Driver retries the commands to identify the master.

I might argue to push even this out to a followup series given it is
complex and I suspect it becomes much simpler after the batch
removal...

> 3. Driver calls pci_disable_ats() and clears STE.EATS.
> 4. Driver marks domain->invs ATS entries as BROKEN.
>    (optional since pci_disable_ats() is done?)

We need to stop sending invs, otherwise there will be trouble making
forward progress.

> 5. Driver sets master->ats_broken to fence concurrent attach:
>    arm_smmu_write_ste() and arm_smmu_ats_supported().

Not sure this is needed, if we race some attach then the attach will
re-set EATS, get another timeout and clear EATS. Doesn't seem worth
trying to optimize for.
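
The self-correcting behaviour can be illustrated with a tiny userspace
model. This is a sketch under assumptions: the struct and function
names are hypothetical, and it assumes every ATC_INV sent to the dead
device times out and that the timeout handler always clears EATS.

```c
/*
 * Userspace model only. All names are hypothetical; this is not the
 * arm-smmu-v3 code.
 */
#include <assert.h>
#include <stdbool.h>

struct race_model {
	bool ste_eats;		/* models STE.EATS */
	bool device_responds;	/* false while the device ignores ATC_INV */
	unsigned int timeouts;	/* ATC_INV timeouts observed so far */
};

/* An attach with no ats_broken fence: it simply re-enables ATS. */
static void race_attach(struct race_model *r)
{
	assert(r);
	r->ste_eats = true;
}

/*
 * An invalidation. With EATS clear there is nothing to send. With EATS
 * set, an ATC_INV goes out; a dead device times out and the handler
 * clears EATS again.
 */
static void race_invalidate(struct race_model *r)
{
	assert(r);
	if (!r->ste_eats)
		return;
	if (!r->device_responds) {
		r->timeouts++;
		r->ste_eats = false;
	}
}
```

Each racing attach costs at most one extra timeout in this model before
EATS is clear again, and later invalidations send nothing.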

> 6. Something external triggers an FLR (sysfs or AER).
> 7. FLR goes through pci_dev_reset_iommu_prepare()/done(). done()
>    reverts 3+4 and calls the reset_device_done callback clearing
>    master->ats_broken (5).

It should restore core/driver/hw synchronization of EATS and the
pci_enable_ats() by installing a blocking domain. Then it can go on to
re-attach a translating domain and everything is back to correct.

We do need to push a pci error event (didn't see that in this series)
so the driver can catch it and start the FLR process. I suppose that
will still need to bounce through a workqueue, and once you have that
it can also set the blocked domain prior to calling out to the driver.

Jason



end of thread, other threads:[~2026-05-19 23:02 UTC | newest]

Thread overview: 37+ messages
2026-05-19  3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 01/24] PCI: Don't suspend IOMMU when probing reset capability Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 02/24] PCI: Propagate FLR return values to callers Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 03/24] iommu: Convert gdev->blocked from bool to enum gdev_blocked Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 04/24] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 05/24] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 06/24] iommu: Defer iommu_group free via kfree_rcu() Nicolin Chen
2026-05-19 11:39   ` Jason Gunthorpe
2026-05-19 18:54     ` Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 07/24] iommu: Defer __iommu_group_free_device() to be outside group->mutex Nicolin Chen
2026-05-19 11:47   ` Jason Gunthorpe
2026-05-19  3:38 ` [PATCH v4 08/24] iommu: Change group->devices to RCU-protected list Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 09/24] iommu: Add group pointer to struct group_device Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 10/24] iommu: Add __iommu_group_block_device helper Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
2026-05-19 12:07   ` Jason Gunthorpe
2026-05-19 18:29     ` Nicolin Chen
2026-05-19 19:16       ` Jason Gunthorpe
2026-05-19 22:30         ` Nicolin Chen
2026-05-19 23:02           ` Jason Gunthorpe
2026-05-19  3:38 ` [PATCH v4 12/24] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 13/24] iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 14/24] iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 15/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out Nicolin Chen
2026-05-19  3:38 ` [PATCH v4 16/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when queue_has_space() fails Nicolin Chen
2026-05-19  3:39 ` [PATCH v4 17/24] iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries Nicolin Chen
2026-05-19 12:01   ` Jason Gunthorpe
2026-05-19  3:39 ` [PATCH v4 18/24] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
2026-05-19 12:06   ` Jason Gunthorpe
2026-05-19  3:39 ` [PATCH v4 19/24] iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch Nicolin Chen
2026-05-19 12:09   ` Jason Gunthorpe
2026-05-19  3:39 ` [PATCH v4 20/24] iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper Nicolin Chen
2026-05-19  3:39 ` [PATCH v4 21/24] iommu/arm-smmu-v3: Move arm_smmu_invs_for_each_entry to header Nicolin Chen
2026-05-19  3:39 ` [PATCH v4 22/24] iommu/arm-smmu-v3: Introduce master->ats_invs Nicolin Chen
2026-05-19 12:12   ` Jason Gunthorpe
2026-05-19  3:39 ` [PATCH v4 23/24] iommu/arm-smmu-v3: Serialize STE.EATS and ats_broken updates Nicolin Chen
2026-05-19  3:39 ` [PATCH v4 24/24] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
