[PATCH v1 0/2] iommu/arm-smmu-v3: Reset PCI device upon ATC invalidate timeout

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v1 0/2] iommu/arm-smmu-v3: Reset PCI device upon ATC invalidate timeout
@ 2026-03-05  5:21 Nicolin Chen
  2026-03-05  5:21 ` [PATCH v1 1/2] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
  2026-03-05  5:21 ` [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts Nicolin Chen
  0 siblings, 2 replies; 44+ messages in thread
From: Nicolin Chen @ 2026-03-05  5:21 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, kees, baolu.lu, smostafa, Alexander.Grest,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-acpi, linux-pci, vsethi

This is a small series to handle ATC invalidate timeout in the CMDQ Error
handler. When the ATC is out of sync, the device might corrupt the memory
due to the stale cache entries. The safest way is to reset the device for
a recovery that wipes out the ATC as well.

On the other hand, a reset attempt in the ATC recovery thread might fail,
which leaves the IOMMU driver no choice but to isolate the device.

pci_dev_reset_iommu_prepare() was introduced previously to fence the ATS
and ATC during a PCI reset. It's easy to use yet doesn't allow a re-entry
nor support the recovery very well.

In this series,
   - loosen the re-entry and block ATS if reset fails
   - add a reset routine in SMMUv3 driver

This is on Github:
https://github.com/nicolinc/iommufd/commits/smmuv3_atc_timeout

Nicolin Chen (2):
  iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  iommu/arm-smmu-v3: Recover ATC invalidate timeouts

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   5 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 131 +++++++++++++++++++-
 drivers/iommu/iommu.c                       |  16 ++-
 drivers/pci/pci-acpi.c                      |  11 +-
 drivers/pci/pci.c                           |  50 +++++++-
 drivers/pci/quirks.c                        |  11 +-
 6 files changed, 207 insertions(+), 17 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v1 1/2] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
  2026-03-05  5:21 [PATCH v1 0/2] iommu/arm-smmu-v3: Reset PCI device upon ATC invalidate timeout Nicolin Chen
@ 2026-03-05  5:21 ` Nicolin Chen
  2026-03-05  5:21 ` [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts Nicolin Chen
  1 sibling, 0 replies; 44+ messages in thread
From: Nicolin Chen @ 2026-03-05  5:21 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, kees, baolu.lu, smostafa, Alexander.Grest,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-acpi, linux-pci, vsethi

IOMMU drivers handle ATC cache maintenance. They may encounter ATC-related
errors (e.g., ATC invalidation request timeout), which are typically sent
to the driver's ISR. To recover from such errors, the driver would need to
initiate a device reset procedure (I/O waiting) in an asynchronous thread.

If somehow the reset procedure fails, the ATC will be out of sync with the
OS, since the memory is already unmmaped and could be even re-assigned. In
this case, the device must be kept in the resetting domain, to prevent any
memory corruption.

Yet, currently pci_dev_reset_iommu_done() is called unconditionally:
    IOMMU recovery thread():
        pci_reset_function():
            pci_dev_reset_iommu_prepare(); // Block RID/ATS
	    __reset(); // Failed (ATC is still stale)
            pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)

The simplest fix is to use pci_dev_reset_iommu_done() only on a successful
reset:
    IOMMU recovery thread():
        pci_reset_function():
            pci_dev_reset_iommu_prepare(); // Block RID/ATS
	    if (!__reset())
                pci_dev_reset_iommu_done(); // Unblock RID/ATS
            else
                // keep the device blocked by IOMMU

However, this breaks the symmetric requirement of these reset APIs so that
we have to allow a re-entry to pass a second reset attempt:
    IOMMU recovery thread():
        pci_reset_function():
            pci_dev_reset_iommu_prepare(); // Block RID/ATS
	    __reset(); // Failed (ATC is still stale)
            // Keep the device blocked by IOMMU
    ...
    Another thread():
        pci_reset_function():
            pci_dev_reset_iommu_prepare(); // Re-entry (!)

Update the function kdocs and all the existing callers to only unblock ATS
when the reset succeeds. Drop the WARN_ON in pci_dev_reset_iommu_prepare()
to allow re-entries.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommu.c  | 16 +++++++++-----
 drivers/pci/pci-acpi.c | 11 +++++++++-
 drivers/pci/pci.c      | 50 +++++++++++++++++++++++++++++++++++++-----
 drivers/pci/quirks.c   | 11 +++++++++-
 4 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db517809540..40a15c9360bd1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3938,8 +3938,10 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
  * IOMMU activity while leaving the group->domain pointer intact. Later when the
  * reset is finished, pci_dev_reset_iommu_done() can restore everything.
  *
- * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to unset the resetting_domain.
+ * Caller must use pci_dev_reset_iommu_done() after a successful PCI-level reset
+ * to unset the resetting_domain. If the reset fails, caller can choose to keep
+ * the device in the resetting_domain to protect system memory using IOMMU from
+ * any bad ATS.
  *
  * Return: 0 on success or negative error code if the preparation failed.
  *
@@ -3961,9 +3963,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
 
 	guard(mutex)(&group->mutex);
 
-	/* Re-entry is not allowed */
-	if (WARN_ON(group->resetting_domain))
-		return -EBUSY;
+	/* Already prepared */
+	if (group->resetting_domain)
+		return 0;
 
 	ret = __iommu_group_alloc_blocking_domain(group);
 	if (ret)
@@ -4001,7 +4003,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
  * re-attaching all RID/PASID of the device's back to the domains retained in
  * the core-level structure.
  *
- * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller should
+ * use it on a successful PCI-level reset. Otherwise, it's suggested for caller
+ * to keep the device in the resetting_domain to protect system memory.
  *
  * Note that, although unlikely, there is a risk that re-attaching domains might
  * fail due to some unexpected happening like OOM.
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 4d0f2cb6c695b..f1a918938242c 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -16,6 +16,7 @@
 #include <linux/pci_hotplug.h>
 #include <linux/module.h>
 #include <linux/pci-acpi.h>
+#include <linux/pci-ats.h>
 #include <linux/pci-ecam.h>
 #include <linux/pm_runtime.h>
 #include <linux/pm_qos.h>
@@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
 		ret = -ENOTTY;
 	}
 
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f1..80c5cf6eeebdc 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4358,7 +4358,15 @@ int pcie_flr(struct pci_dev *dev)
 
 	ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4436,7 +4444,15 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
 
 	ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
@@ -4490,7 +4506,15 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
 	pci_dev_d3_sleep(dev);
 
 	ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
@@ -4933,7 +4957,15 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
 
 	rc = pci_parent_bus_reset(dev, probe);
 done:
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!rc || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return rc;
 }
 
@@ -4978,7 +5010,15 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
 		pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
 				      reg);
 
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!rc || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return rc;
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 48946cca4be72..d9a03a7772916 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -30,6 +30,7 @@
 #include <linux/ktime.h>
 #include <linux/mm.h>
 #include <linux/nvme.h>
+#include <linux/pci-ats.h>
 #include <linux/platform_data/x86/apple.h>
 #include <linux/pm_runtime.h>
 #include <linux/sizes.h>
@@ -4269,7 +4270,15 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
 	}
 
 	ret = i->reset(dev, probe);
-	pci_dev_reset_iommu_done(dev);
+	/*
+	 * The reset might be invoked to recover a serious error. E.g. when the
+	 * ATC failed to invalidate its stale entries, which can result in data
+	 * corruption. Thus, do not unblock ATS until a successful reset.
+	 */
+	if (!ret || !pci_ats_supported(dev))
+		pci_dev_reset_iommu_done(dev);
+	else
+		pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
 	return ret;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05  5:21 [PATCH v1 0/2] iommu/arm-smmu-v3: Reset PCI device upon ATC invalidate timeout Nicolin Chen
  2026-03-05  5:21 ` [PATCH v1 1/2] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
@ 2026-03-05  5:21 ` Nicolin Chen
  2026-03-05 15:15   ` kernel test robot
                     ` (4 more replies)
  1 sibling, 5 replies; 44+ messages in thread
From: Nicolin Chen @ 2026-03-05  5:21 UTC (permalink / raw)
  To: will, robin.murphy, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, kees, baolu.lu, smostafa, Alexander.Grest,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-acpi, linux-pci, vsethi

Currently, when GERROR_CMDQ_ERR occurs, the arm_smmu_cmdq_skip_err() won't
do anything for the CMDQ_ERR_CERROR_ATC_INV_IDX.

When a device wasn't responsive to an ATC invalidation request, this often
results in constant CMDQ errors:
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb84): ATC invalidate timeout
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb88): ATC invalidate timeout
  unexpected global error reported (0x00000001), this could be serious
  CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
  ...

An ATC invalidation timeout indicates that the device failed to respond to
a protocol-critical coherency request, which means that device's internal
ATS state is desynchronized from the SMMU.

Furthermore, ignoring the timeout leaves the system in an unsafe state, as
the device cache may retain stale ATC entries for memory pages that the OS
has already reclaimed and reassigned. This might lead to data corruption.

The only safe recovery action is to issue a PCI reset, which guarantees to
flush all internal device caches and recover the device.

Read the ATC_INV command that led to the timeouts, and schedule a recovery
worker to reset the device corresponding to the Stream ID. If reset fails,
keep the device in the resetting/blocking domain to avoid data corruption.

Though it'd be ideal to block it immediately in the ISR, it cannot be done
because an STE update would require another CFIG_STE command that couldn't
finish in the context of an ISR handling a CMDQ error.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   5 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 131 +++++++++++++++++++-
 2 files changed, 132 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 3c6d65d36164f..8789cf8294504 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -803,6 +803,11 @@ struct arm_smmu_device {
 
 	struct rb_root			streams;
 	struct mutex			streams_mutex;
+
+	struct {
+		struct list_head	list;
+		spinlock_t		lock; /* Lock the list */
+	} atc_recovery;
 };
 
 struct arm_smmu_stream {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4d00d796f0783..de182c27c77c4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -106,6 +106,8 @@ static const char * const event_class_str[] = {
 	[3] = "Reserved",
 };
 
+static struct arm_smmu_master *
+arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid);
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
 
 static void parse_driver_options(struct arm_smmu_device *smmu)
@@ -174,6 +176,13 @@ static void queue_inc_cons(struct arm_smmu_ll_queue *q)
 	q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
 }
 
+static u32 queue_prev_cons(struct arm_smmu_ll_queue *q, u32 cons)
+{
+	u32 idx_wrp = (Q_WRP(q, cons) | Q_IDX(q, cons)) - 1;
+
+	return Q_OVF(cons) | Q_WRP(q, idx_wrp) | Q_IDX(q, idx_wrp);
+}
+
 static void queue_sync_cons_ovf(struct arm_smmu_queue *q)
 {
 	struct arm_smmu_ll_queue *llq = &q->llq;
@@ -410,6 +419,97 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
 		u64p_replace_bits(cmd, CMDQ_SYNC_0_CS_NONE, CMDQ_SYNC_0_CS);
 }
 
+/* ATC recovery upon ATC invalidation timeout */
+struct arm_smmu_atc_recovery_param {
+	struct arm_smmu_device *smmu;
+	struct pci_dev *pdev;
+	u32 sid;
+
+	struct work_struct work;
+	struct list_head node;
+};
+
+static void arm_smmu_atc_recovery_worker(struct work_struct *work)
+{
+	struct arm_smmu_atc_recovery_param *param =
+		container_of(work, struct arm_smmu_atc_recovery_param, work);
+	struct pci_dev *pdev;
+
+	scoped_guard(mutex, &param->smmu->streams_mutex) {
+		struct arm_smmu_master *master;
+
+		master = arm_smmu_find_master(param->smmu, param->sid);
+		if (!master || WARN_ON(!dev_is_pci(master->dev)))
+			goto free_param;
+		pdev = to_pci_dev(master->dev);
+		pci_dev_get(pdev);
+	}
+
+	scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock) {
+		struct arm_smmu_atc_recovery_param *e;
+
+		list_for_each_entry(e, &param->smmu->atc_recovery.list, node) {
+			/* Device is already being recovered */
+			if (e->pdev == pdev)
+				goto put_pdev;
+		}
+		param->pdev = pdev;
+		list_add(&param->node, &param->smmu->atc_recovery.list);
+	}
+
+	/*
+	 * Stop DMA (PCI) and block ATS (IOMMU) immediately, to prevent memory
+	 * corruption. This must take pci_dev_lock to prevent any racy unplug.
+	 *
+	 * If pci_dev_reset_iommu_prepare() fails, pci_reset_function will call
+	 * it again internally.
+	 */
+	pci_dev_lock(pdev);
+	pci_clear_master(pdev);
+	if (pci_dev_reset_iommu_prepare(pdev))
+		pci_err(pdev, "failed to block ATS!\n");
+	pci_dev_unlock(pdev);
+
+	/*
+	 * ATC timeout indicates the device has stopped responding to coherence
+	 * protocol requests. The only safe recovery is a reset to flush stale
+	 * cached translations. Note that pci_reset_function() internally calls
+	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
+	 * if PCI-level reset fails.
+	 */
+	if (!pci_reset_function(pdev)) {
+		/*
+		 * If reset succeeds, set BME back. Otherwise, fence the system
+		 * from a faulty device, in which case user will have to replug
+		 * the device to invoke pci_set_master().
+		 */
+		pci_dev_lock(pdev);
+		pci_set_master(pdev);
+		pci_dev_unlock(pdev);
+	}
+	scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock)
+		list_del(&param->node);
+put_pdev:
+	pci_dev_put(pdev);
+free_param:
+	kfree(param);
+}
+
+static int arm_smmu_sched_atc_recovery(struct arm_smmu_device *smmu, u32 sid)
+{
+	struct arm_smmu_atc_recovery_param *param;
+
+	param = kzalloc_obj(*param, GFP_ATOMIC);
+	if (!param)
+		return -ENOMEM;
+	param->smmu = smmu;
+	param->sid = sid;
+
+	INIT_WORK(&param->work, arm_smmu_atc_recovery_worker);
+	queue_work(system_unbound_wq, &param->work);
+	return 0;
+}
+
 void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
 			      struct arm_smmu_cmdq *cmdq)
 {
@@ -441,11 +541,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
 	case CMDQ_ERR_CERROR_ATC_INV_IDX:
 		/*
 		 * ATC Invalidation Completion timeout. CONS is still pointing
-		 * at the CMD_SYNC. Attempt to complete other pending commands
-		 * by repeating the CMD_SYNC, though we might well end up back
-		 * here since the ATC invalidation may still be pending.
+		 * at the CMD_SYNC. Rewind it to read the ATC_INV command.
 		 */
-		return;
+		cons = queue_prev_cons(&q->llq, cons);
+		fallthrough;
 	case CMDQ_ERR_CERROR_ILL_IDX:
 	default:
 		break;
@@ -456,6 +555,27 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
 	 * not to touch any of the shadow cmdq state.
 	 */
 	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
+
+	if (idx == CMDQ_ERR_CERROR_ATC_INV_IDX) {
+		/*
+		 * Since commands can be issued in batch making it difficult to
+		 * identify which CMDQ_OP_ATC_INV actually timed out, the driver
+		 * must ensure only CMDQ_OP_ATC_INV commands for the same device
+		 * can be batched.
+		 */
+		WARN_ON(FIELD_GET(CMDQ_0_OP, cmd[0]) != CMDQ_OP_ATC_INV);
+
+		/*
+		 * If we failed to schedule a recovery worker, we would well end
+		 * up back here since the ATC invalidation may still be pending.
+		 * This gives us another chance to reschedule a recovery worker.
+		 */
+		arm_smmu_sched_atc_recovery(smmu,
+					    FIELD_GET(CMDQ_ATC_0_SID, cmd[0]));
+		return;
+	}
+
+	/* idx == CMDQ_ERR_CERROR_ILL_IDX */
 	dev_err(smmu->dev, "skipping command in error state:\n");
 	for (i = 0; i < ARRAY_SIZE(cmd); ++i)
 		dev_err(smmu->dev, "\t0x%016llx\n", (unsigned long long)cmd[i]);
@@ -3942,6 +4062,9 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
 {
 	int ret;
 
+	INIT_LIST_HEAD(&smmu->atc_recovery.list);
+	spin_lock_init(&smmu->atc_recovery.lock);
+
 	mutex_init(&smmu->streams_mutex);
 	smmu->streams = RB_ROOT;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05  5:21 ` [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts Nicolin Chen
@ 2026-03-05 15:15   ` kernel test robot
  2026-03-05 15:24   ` Robin Murphy
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2026-03-05 15:15 UTC (permalink / raw)
  To: Nicolin Chen, will, robin.murphy, joro, bhelgaas, jgg
  Cc: llvm, oe-kbuild-all, rafael, lenb, praan, kees, baolu.lu,
	smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

Hi Nicolin,

kernel test robot noticed the following build errors:

[auto build test ERROR on pci/next]
[also build test ERROR on pci/for-linus rafael-pm/linux-next rafael-pm/bleeding-edge soc/for-next linus/master v7.0-rc2 next-20260304]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nicolin-Chen/iommu-Do-not-call-pci_dev_reset_iommu_done-unless-reset-succeeds/20260305-132923
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link:    https://lore.kernel.org/r/ca7ab934bf0f433b62a5c15d42241632c4cb9366.1772686998.git.nicolinc%40nvidia.com
patch subject: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
config: arm64-randconfig-003-20260305 (https://download.01.org/0day-ci/archive/20260305/202603052303.jrfwR7CV-lkp@intel.com/config)
compiler: clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260305/202603052303.jrfwR7CV-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603052303.jrfwR7CV-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:467:2: error: call to undeclared function 'pci_dev_lock'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           pci_dev_lock(pdev);
           ^
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:471:2: error: call to undeclared function 'pci_dev_unlock'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           pci_dev_unlock(pdev);
           ^
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:480:7: error: call to undeclared function 'pci_reset_function'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           if (!pci_reset_function(pdev)) {
                ^
   3 errors generated.


vim +/pci_dev_lock +467 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c

   431	
   432	static void arm_smmu_atc_recovery_worker(struct work_struct *work)
   433	{
   434		struct arm_smmu_atc_recovery_param *param =
   435			container_of(work, struct arm_smmu_atc_recovery_param, work);
   436		struct pci_dev *pdev;
   437	
   438		scoped_guard(mutex, &param->smmu->streams_mutex) {
   439			struct arm_smmu_master *master;
   440	
   441			master = arm_smmu_find_master(param->smmu, param->sid);
   442			if (!master || WARN_ON(!dev_is_pci(master->dev)))
   443				goto free_param;
   444			pdev = to_pci_dev(master->dev);
   445			pci_dev_get(pdev);
   446		}
   447	
   448		scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock) {
   449			struct arm_smmu_atc_recovery_param *e;
   450	
   451			list_for_each_entry(e, &param->smmu->atc_recovery.list, node) {
   452				/* Device is already being recovered */
   453				if (e->pdev == pdev)
   454					goto put_pdev;
   455			}
   456			param->pdev = pdev;
   457			list_add(&param->node, &param->smmu->atc_recovery.list);
   458		}
   459	
   460		/*
   461		 * Stop DMA (PCI) and block ATS (IOMMU) immediately, to prevent memory
   462		 * corruption. This must take pci_dev_lock to prevent any racy unplug.
   463		 *
   464		 * If pci_dev_reset_iommu_prepare() fails, pci_reset_function will call
   465		 * it again internally.
   466		 */
 > 467		pci_dev_lock(pdev);
   468		pci_clear_master(pdev);
   469		if (pci_dev_reset_iommu_prepare(pdev))
   470			pci_err(pdev, "failed to block ATS!\n");
 > 471		pci_dev_unlock(pdev);
   472	
   473		/*
   474		 * ATC timeout indicates the device has stopped responding to coherence
   475		 * protocol requests. The only safe recovery is a reset to flush stale
   476		 * cached translations. Note that pci_reset_function() internally calls
   477		 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
   478		 * if PCI-level reset fails.
   479		 */
 > 480		if (!pci_reset_function(pdev)) {
   481			/*
   482			 * If reset succeeds, set BME back. Otherwise, fence the system
   483			 * from a faulty device, in which case user will have to replug
   484			 * the device to invoke pci_set_master().
   485			 */
   486			pci_dev_lock(pdev);
   487			pci_set_master(pdev);
   488			pci_dev_unlock(pdev);
   489		}
   490		scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock)
   491			list_del(&param->node);
   492	put_pdev:
   493		pci_dev_put(pdev);
   494	free_param:
   495		kfree(param);
   496	}
   497	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05  5:21 ` [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts Nicolin Chen
  2026-03-05 15:15   ` kernel test robot
@ 2026-03-05 15:24   ` Robin Murphy
  2026-03-05 21:06     ` Nicolin Chen
  2026-03-05 15:39   ` Jason Gunthorpe
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Robin Murphy @ 2026-03-05 15:24 UTC (permalink / raw)
  To: Nicolin Chen, will, joro, bhelgaas, jgg
  Cc: rafael, lenb, praan, kees, baolu.lu, smostafa, Alexander.Grest,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-acpi, linux-pci, vsethi

On 2026-03-05 5:21 am, Nicolin Chen wrote:
> Currently, when GERROR_CMDQ_ERR occurs, the arm_smmu_cmdq_skip_err() won't
> do anything for the CMDQ_ERR_CERROR_ATC_INV_IDX.
> 
> When a device wasn't responsive to an ATC invalidation request, this often
> results in constant CMDQ errors:
>    unexpected global error reported (0x00000001), this could be serious
>    CMDQ error (cons 0x0302bb84): ATC invalidate timeout
>    unexpected global error reported (0x00000001), this could be serious
>    CMDQ error (cons 0x0302bb88): ATC invalidate timeout
>    unexpected global error reported (0x00000001), this could be serious
>    CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
>    ...
> 
> An ATC invalidation timeout indicates that the device failed to respond to
> a protocol-critical coherency request, which means that device's internal
> ATS state is desynchronized from the SMMU.
> 
> Furthermore, ignoring the timeout leaves the system in an unsafe state, as
> the device cache may retain stale ATC entries for memory pages that the OS
> has already reclaimed and reassigned. This might lead to data corruption.
> 
> The only safe recovery action is to issue a PCI reset, which guarantees to
> flush all internal device caches and recover the device.
> 
> Read the ATC_INV command that led to the timeouts, and schedule a recovery
> worker to reset the device corresponding to the Stream ID. If reset fails,
> keep the device in the resetting/blocking domain to avoid data corruption.
> 
> Though it'd be ideal to block it immediately in the ISR, it cannot be done
> because an STE update would require another CFIG_STE command that couldn't
> finish in the context of an ISR handling a CMDQ error.

Why not? As soon as we've acked GERRORN.CMDQ_ERR, command consumption 
will resume and we're free to do whatever we fancy. Admittedly this 
probably represents more work than we *want* to be doing in the SMMU's 
IRQ handler (arguably even in a thread, since all the PCI housekeeping 
isn't really the SMMU driver's own problem), but I would say the 
workqueue is a definite design choice, not a functional requirement.

> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   5 +
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 131 +++++++++++++++++++-
>   2 files changed, 132 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 3c6d65d36164f..8789cf8294504 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -803,6 +803,11 @@ struct arm_smmu_device {
>   
>   	struct rb_root			streams;
>   	struct mutex			streams_mutex;
> +
> +	struct {
> +		struct list_head	list;
> +		spinlock_t		lock; /* Lock the list */
> +	} atc_recovery;
>   };
>   
>   struct arm_smmu_stream {
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 4d00d796f0783..de182c27c77c4 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -106,6 +106,8 @@ static const char * const event_class_str[] = {
>   	[3] = "Reserved",
>   };
>   
> +static struct arm_smmu_master *
> +arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid);
>   static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
>   
>   static void parse_driver_options(struct arm_smmu_device *smmu)
> @@ -174,6 +176,13 @@ static void queue_inc_cons(struct arm_smmu_ll_queue *q)
>   	q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
>   }
>   
> +static u32 queue_prev_cons(struct arm_smmu_ll_queue *q, u32 cons)
> +{
> +	u32 idx_wrp = (Q_WRP(q, cons) | Q_IDX(q, cons)) - 1;
> +
> +	return Q_OVF(cons) | Q_WRP(q, idx_wrp) | Q_IDX(q, idx_wrp);
> +}
> +
>   static void queue_sync_cons_ovf(struct arm_smmu_queue *q)
>   {
>   	struct arm_smmu_ll_queue *llq = &q->llq;
> @@ -410,6 +419,97 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
>   		u64p_replace_bits(cmd, CMDQ_SYNC_0_CS_NONE, CMDQ_SYNC_0_CS);
>   }
>   
> +/* ATC recovery upon ATC invalidation timeout */
> +struct arm_smmu_atc_recovery_param {
> +	struct arm_smmu_device *smmu;
> +	struct pci_dev *pdev;
> +	u32 sid;
> +
> +	struct work_struct work;
> +	struct list_head node;
> +};
> +
> +static void arm_smmu_atc_recovery_worker(struct work_struct *work)
> +{
> +	struct arm_smmu_atc_recovery_param *param =
> +		container_of(work, struct arm_smmu_atc_recovery_param, work);
> +	struct pci_dev *pdev;
> +
> +	scoped_guard(mutex, &param->smmu->streams_mutex) {
> +		struct arm_smmu_master *master;
> +
> +		master = arm_smmu_find_master(param->smmu, param->sid);
> +		if (!master || WARN_ON(!dev_is_pci(master->dev)))
> +			goto free_param;
> +		pdev = to_pci_dev(master->dev);
> +		pci_dev_get(pdev);
> +	}
> +
> +	scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock) {
> +		struct arm_smmu_atc_recovery_param *e;
> +
> +		list_for_each_entry(e, &param->smmu->atc_recovery.list, node) {
> +			/* Device is already being recovered */
> +			if (e->pdev == pdev)
> +				goto put_pdev;
> +		}
> +		param->pdev = pdev;
> +		list_add(&param->node, &param->smmu->atc_recovery.list);
> +	}
> +
> +	/*
> +	 * Stop DMA (PCI) and block ATS (IOMMU) immediately, to prevent memory
> +	 * corruption. This must take pci_dev_lock to prevent any racy unplug.
> +	 *
> +	 * If pci_dev_reset_iommu_prepare() fails, pci_reset_function will call
> +	 * it again internally.
> +	 */
> +	pci_dev_lock(pdev);
> +	pci_clear_master(pdev);
> +	if (pci_dev_reset_iommu_prepare(pdev))
> +		pci_err(pdev, "failed to block ATS!\n");
> +	pci_dev_unlock(pdev);
> +
> +	/*
> +	 * ATC timeout indicates the device has stopped responding to coherence
> +	 * protocol requests. The only safe recovery is a reset to flush stale
> +	 * cached translations. Note that pci_reset_function() internally calls
> +	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
> +	 * if PCI-level reset fails.
> +	 */
> +	if (!pci_reset_function(pdev)) {
> +		/*
> +		 * If reset succeeds, set BME back. Otherwise, fence the system
> +		 * from a faulty device, in which case user will have to replug
> +		 * the device to invoke pci_set_master().
> +		 */
> +		pci_dev_lock(pdev);
> +		pci_set_master(pdev);
> +		pci_dev_unlock(pdev);
> +	}
> +	scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock)
> +		list_del(&param->node);
> +put_pdev:
> +	pci_dev_put(pdev);
> +free_param:
> +	kfree(param);
> +}

The only thing SMMU-specific about this seems to be the use of 
arm_smmu_find_master() to resolve the device, which could just as well 
be done upon submission anyway - why isn't this a generic IOMMU/IOPF 
mechanism?

> +
> +static int arm_smmu_sched_atc_recovery(struct arm_smmu_device *smmu, u32 sid)
> +{
> +	struct arm_smmu_atc_recovery_param *param;
> +
> +	param = kzalloc_obj(*param, GFP_ATOMIC);
> +	if (!param)
> +		return -ENOMEM;
> +	param->smmu = smmu;
> +	param->sid = sid;
> +
> +	INIT_WORK(&param->work, arm_smmu_atc_recovery_worker);
> +	queue_work(system_unbound_wq, &param->work);

Might it make more sense to have a single work item associated with the 
list_head and use the latter as an actual queue, such that the work runs 
until the list is empty, then here at submisison time we do the 
list_add() and schedule_work()? That could perhaps even be a global 
queue, since ATS timeouts can hardly be expected to be a 
highly-contended high-perfoamnce concern.

Right now it seems the list is barely doing anything - a "deduplication" 
mechanism that only works if multiple resets for the same device happen 
to have their work scheduled concurrently seems pretty ineffective.

> +	return 0;
> +}
> +
>   void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
>   			      struct arm_smmu_cmdq *cmdq)
>   {
> @@ -441,11 +541,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
>   	case CMDQ_ERR_CERROR_ATC_INV_IDX:
>   		/*
>   		 * ATC Invalidation Completion timeout. CONS is still pointing
> -		 * at the CMD_SYNC. Attempt to complete other pending commands
> -		 * by repeating the CMD_SYNC, though we might well end up back
> -		 * here since the ATC invalidation may still be pending.
> +		 * at the CMD_SYNC. Rewind it to read the ATC_INV command.
>   		 */
> -		return;
> +		cons = queue_prev_cons(&q->llq, cons);
> +		fallthrough;
>   	case CMDQ_ERR_CERROR_ILL_IDX:
>   	default:
>   		break;
> @@ -456,6 +555,27 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
>   	 * not to touch any of the shadow cmdq state.
>   	 */
>   	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
> +
> +	if (idx == CMDQ_ERR_CERROR_ATC_INV_IDX) {
> +		/*
> +		 * Since commands can be issued in batch making it difficult to
> +		 * identify which CMDQ_OP_ATC_INV actually timed out, the driver
> +		 * must ensure only CMDQ_OP_ATC_INV commands for the same device
> +		 * can be batched.

But this *is* "the driver" - arm_smmu_atc_inv_domain() is literally 
further down this same C file, and does not do what this comment is 
saying it must do, so how are you expecting this to work correctly?

Thanks,
Robin.

> +		 */
> +		WARN_ON(FIELD_GET(CMDQ_0_OP, cmd[0]) != CMDQ_OP_ATC_INV);
> +
> +		/*
> +		 * If we failed to schedule a recovery worker, we would well end
> +		 * up back here since the ATC invalidation may still be pending.
> +		 * This gives us another chance to reschedule a recovery worker.
> +		 */
> +		arm_smmu_sched_atc_recovery(smmu,
> +					    FIELD_GET(CMDQ_ATC_0_SID, cmd[0]));
> +		return;
> +	}
> +
> +	/* idx == CMDQ_ERR_CERROR_ILL_IDX */
>   	dev_err(smmu->dev, "skipping command in error state:\n");
>   	for (i = 0; i < ARRAY_SIZE(cmd); ++i)
>   		dev_err(smmu->dev, "\t0x%016llx\n", (unsigned long long)cmd[i]);
> @@ -3942,6 +4062,9 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
>   {
>   	int ret;
>   
> +	INIT_LIST_HEAD(&smmu->atc_recovery.list);
> +	spin_lock_init(&smmu->atc_recovery.lock);
> +
>   	mutex_init(&smmu->streams_mutex);
>   	smmu->streams = RB_ROOT;
>   


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05  5:21 ` [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts Nicolin Chen
  2026-03-05 15:15   ` kernel test robot
  2026-03-05 15:24   ` Robin Murphy
@ 2026-03-05 15:39   ` Jason Gunthorpe
  2026-03-05 21:15     ` Nicolin Chen
  2026-03-06  3:22     ` Baolu Lu
  2026-03-06  2:35   ` kernel test robot
  2026-03-10 19:16   ` Pranjal Shrivastava
  4 siblings, 2 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-05 15:39 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Wed, Mar 04, 2026 at 09:21:42PM -0800, Nicolin Chen wrote:
> +	/*
> +	 * ATC timeout indicates the device has stopped responding to coherence
> +	 * protocol requests. The only safe recovery is a reset to flush stale
> +	 * cached translations. Note that pci_reset_function() internally calls
> +	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
> +	 * if PCI-level reset fails.
> +	 */
> +	if (!pci_reset_function(pdev)) {
> +		/*
> +		 * If reset succeeds, set BME back. Otherwise, fence the system
> +		 * from a faulty device, in which case user will have to replug
> +		 * the device to invoke pci_set_master().
> +		 */
> +		pci_dev_lock(pdev);
> +		pci_set_master(pdev);
> +		pci_dev_unlock(pdev);
> +	}

I thought we talked about this, the iommu driver cannot just blindly
issue a reset like this, the reset has to come from the actual device
driver through the AERish mechanism. Otherwise the driver RAS is going
to explode.

The smmu driver should immediately block the STE (reject translated
requests) to protect the system before resuming whatever command
submissio n has encountered the error.

You could delegate the STE change to the interrupted command
submission to avoid doing it from a ISR, that makes alot of sense
because the submission thread is already operating a cmdq so it could
stick in a STE invalidation command, possibly even in place of the
failed ATC command.

I think I'd break this up into smaller steps, just focus on this STE
mechanism at start and have any future attach callback fix the STE.

Then we can talk about how to properly trigger the PCI RAS flow and so
on.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 15:24   ` Robin Murphy
@ 2026-03-05 21:06     ` Nicolin Chen
  2026-03-05 23:30       ` Nicolin Chen
  2026-03-05 23:52       ` Jason Gunthorpe
  0 siblings, 2 replies; 44+ messages in thread
From: Nicolin Chen @ 2026-03-05 21:06 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, joro, bhelgaas, jgg, rafael, lenb, praan, kees, baolu.lu,
	smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 03:24:30PM +0000, Robin Murphy wrote:
> On 2026-03-05 5:21 am, Nicolin Chen wrote:
> > Though it'd be ideal to block it immediately in the ISR, it cannot be done
> > because an STE update would require another CFIG_STE command that couldn't
> > finish in the context of an ISR handling a CMDQ error.
> 
> Why not? As soon as we've acked GERRORN.CMDQ_ERR, command consumption will
> resume and we're free to do whatever we fancy. Admittedly this probably
> represents more work than we *want* to be doing in the SMMU's IRQ handler
> (arguably even in a thread, since all the PCI housekeeping isn't really the
> SMMU driver's own problem), but I would say the workqueue is a definite
> design choice, not a functional requirement.

Hmm, you are right, after writel(gerror, ARM_SMMU_GERRORN), it
would be doable.

Though SID->pdev conversion requires streams_mutex, which can't
happen in ISR?

> > +static void arm_smmu_atc_recovery_worker(struct work_struct *work)
> > +{
> > +	struct arm_smmu_atc_recovery_param *param =
> > +		container_of(work, struct arm_smmu_atc_recovery_param, work);
> > +	struct pci_dev *pdev;
> > +
> > +	scoped_guard(mutex, &param->smmu->streams_mutex) {
> > +		struct arm_smmu_master *master;
> > +
> > +		master = arm_smmu_find_master(param->smmu, param->sid);
> 
> The only thing SMMU-specific about this seems to be the use of
> arm_smmu_find_master() to resolve the device, which could just as well be
> done upon submission anyway - why isn't this a generic IOMMU/IOPF mechanism?

You mean treating this as a page fault? That's an interesting idea.

So, the IOMMU-level fence could be done in arm_smmu_page_response().

> > +static int arm_smmu_sched_atc_recovery(struct arm_smmu_device *smmu, u32 sid)
> > +{
> > +	struct arm_smmu_atc_recovery_param *param;
> > +
> > +	param = kzalloc_obj(*param, GFP_ATOMIC);
> > +	if (!param)
> > +		return -ENOMEM;
> > +	param->smmu = smmu;
> > +	param->sid = sid;
> > +
> > +	INIT_WORK(&param->work, arm_smmu_atc_recovery_worker);
> > +	queue_work(system_unbound_wq, &param->work);
> 
> Might it make more sense to have a single work item associated with the
> list_head and use the latter as an actual queue, such that the work runs
> until the list is empty, then here at submisison time we do the list_add()
> and schedule_work()? That could perhaps even be a global queue, since ATS
> timeouts can hardly be expected to be a highly-contended high-perfoamnce
> concern.

That sounds like the IOPF implementation. Maybe inventing another
IOMMU_FAULT_ATC_TIMEOUT to reuse the existing infrastructure would
make things cleaner.

> Right now it seems the list is barely doing anything - a "deduplication"
> mechanism that only works if multiple resets for the same device happen to
> have their work scheduled concurrently seems pretty ineffective.

From my testing results, it does effectively block duplications, so
I think it's somehow meaningful.

If I wasn't wrong, the duplicated ATC timeout might come from this
exact reset thread, as it did a CMDQ_OP_CFGI_STE while the CONS was
still pointing to previous CMDQ_OP_ATC_INV.

IIUIC, the IOPF queue doesn't expect duplicated PRI events. So, it
might invoke the handler twice if there is a duplication. The core
handling list_add() might need a deduplication.

> > @@ -456,6 +555,27 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
> >   	 * not to touch any of the shadow cmdq state.
> >   	 */
> >   	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
> > +
> > +	if (idx == CMDQ_ERR_CERROR_ATC_INV_IDX) {
> > +		/*
> > +		 * Since commands can be issued in batch making it difficult to
> > +		 * identify which CMDQ_OP_ATC_INV actually timed out, the driver
> > +		 * must ensure only CMDQ_OP_ATC_INV commands for the same device
> > +		 * can be batched.
> 
> But this *is* "the driver" - arm_smmu_atc_inv_domain() is literally further
> down this same C file, and does not do what this comment is saying it must
> do, so how are you expecting this to work correctly?

Oh, that's a big miss..

I imaged these changes to be based on the arm_smmu_invs series,
where it could break the ATC in the batching function. Here, it
should have made some change to arm_smmu_atc_inv_domain().

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 15:39   ` Jason Gunthorpe
@ 2026-03-05 21:15     ` Nicolin Chen
  2026-03-05 23:41       ` Jason Gunthorpe
  2026-03-06  3:22     ` Baolu Lu
  1 sibling, 1 reply; 44+ messages in thread
From: Nicolin Chen @ 2026-03-05 21:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 11:39:11AM -0400, Jason Gunthorpe wrote:
> On Wed, Mar 04, 2026 at 09:21:42PM -0800, Nicolin Chen wrote:
> > +	/*
> > +	 * ATC timeout indicates the device has stopped responding to coherence
> > +	 * protocol requests. The only safe recovery is a reset to flush stale
> > +	 * cached translations. Note that pci_reset_function() internally calls
> > +	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
> > +	 * if PCI-level reset fails.
> > +	 */
> > +	if (!pci_reset_function(pdev)) {
> > +		/*
> > +		 * If reset succeeds, set BME back. Otherwise, fence the system
> > +		 * from a faulty device, in which case user will have to replug
> > +		 * the device to invoke pci_set_master().
> > +		 */
> > +		pci_dev_lock(pdev);
> > +		pci_set_master(pdev);
> > +		pci_dev_unlock(pdev);
> > +	}
> 
> I thought we talked about this, the iommu driver cannot just blindly
> issue a reset like this, the reset has to come from the actual device
> driver through the AERish mechanism. Otherwise the driver RAS is going
> to explode.
> 
> The smmu driver should immediately block the STE (reject translated
> requests) to protect the system before resuming whatever command
> submissio n has encountered the error.
> 
> You could delegate the STE change to the interrupted command
> submission to avoid doing it from a ISR, that makes alot of sense
> because the submission thread is already operating a cmdq so it could
> stick in a STE invalidation command, possibly even in place of the
> failed ATC command.

You mean in arm_smmu_cmdq_issue_cmdlist() that issued the timed
out ATC command?

So my test case was to trigger a device fault followed by an ATC
command. But, I found that the ATC command submission returned 0
while only the ISR received:
    CMDQ error (cons 0x03000003): ATC invalidate timeout
    arm_smmu_debugfs_atc_write: ATC_INV ret=0

It seems difficult to insert a CMDQ_OP_CFGI_STE in the submission
thread?

> I think I'd break this up into smaller steps, just focus on this STE
> mechanism at start and have any future attach callback fix the STE.
> 
> Then we can talk about how to properly trigger the PCI RAS flow and so
> on.

OK.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 21:06     ` Nicolin Chen
@ 2026-03-05 23:30       ` Nicolin Chen
  2026-03-05 23:52       ` Jason Gunthorpe
  1 sibling, 0 replies; 44+ messages in thread
From: Nicolin Chen @ 2026-03-05 23:30 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, joro, bhelgaas, jgg, rafael, lenb, praan, kees, baolu.lu,
	smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 01:06:24PM -0800, Nicolin Chen wrote:
> On Thu, Mar 05, 2026 at 03:24:30PM +0000, Robin Murphy wrote:
> > On 2026-03-05 5:21 am, Nicolin Chen wrote:
> > > Though it'd be ideal to block it immediately in the ISR, it cannot be done
> > > because an STE update would require another CFIG_STE command that couldn't
> > > finish in the context of an ISR handling a CMDQ error.
> > 
> > Why not? As soon as we've acked GERRORN.CMDQ_ERR, command consumption will
> > resume and we're free to do whatever we fancy. Admittedly this probably
> > represents more work than we *want* to be doing in the SMMU's IRQ handler
> > (arguably even in a thread, since all the PCI housekeeping isn't really the
> > SMMU driver's own problem), but I would say the workqueue is a definite
> > design choice, not a functional requirement.
> 
> Hmm, you are right, after writel(gerror, ARM_SMMU_GERRORN), it
> would be doable.
> 
> Though SID->pdev conversion requires streams_mutex, which can't
> happen in ISR?

I forgot that (if we don't mind the core-level domain attachment
and the driver-level ATS state being temporarily out of sync with
the physical STE) SID alone was enough for a surgical STE update
in the ISR to unset EATS.

Thanks
Nicolin

> > > +static void arm_smmu_atc_recovery_worker(struct work_struct *work)
> > > +{
> > > +	struct arm_smmu_atc_recovery_param *param =
> > > +		container_of(work, struct arm_smmu_atc_recovery_param, work);
> > > +	struct pci_dev *pdev;
> > > +
> > > +	scoped_guard(mutex, &param->smmu->streams_mutex) {
> > > +		struct arm_smmu_master *master;
> > > +
> > > +		master = arm_smmu_find_master(param->smmu, param->sid);
> > 
> > The only thing SMMU-specific about this seems to be the use of
> > arm_smmu_find_master() to resolve the device, which could just as well be
> > done upon submission anyway - why isn't this a generic IOMMU/IOPF mechanism?
> 
> You mean treating this as a page fault? That's an interesting idea.
> 
> So, the IOMMU-level fence could be done in arm_smmu_page_response().
> 
> > > +static int arm_smmu_sched_atc_recovery(struct arm_smmu_device *smmu, u32 sid)
> > > +{
> > > +	struct arm_smmu_atc_recovery_param *param;
> > > +
> > > +	param = kzalloc_obj(*param, GFP_ATOMIC);
> > > +	if (!param)
> > > +		return -ENOMEM;
> > > +	param->smmu = smmu;
> > > +	param->sid = sid;
> > > +
> > > +	INIT_WORK(&param->work, arm_smmu_atc_recovery_worker);
> > > +	queue_work(system_unbound_wq, &param->work);
> > 
> > Might it make more sense to have a single work item associated with the
> > list_head and use the latter as an actual queue, such that the work runs
> > until the list is empty, then here at submisison time we do the list_add()
> > and schedule_work()? That could perhaps even be a global queue, since ATS
> > timeouts can hardly be expected to be a highly-contended high-perfoamnce
> > concern.
> 
> That sounds like the IOPF implementation. Maybe inventing another
> IOMMU_FAULT_ATC_TIMEOUT to reuse the existing infrastructure would
> make things cleaner.
> 
> > Right now it seems the list is barely doing anything - a "deduplication"
> > mechanism that only works if multiple resets for the same device happen to
> > have their work scheduled concurrently seems pretty ineffective.
> 
> From my testing results, it does effectively block duplications, so
> I think it's somehow meaningful.
> 
> If I wasn't wrong, the duplicated ATC timeout might come from this
> exact reset thread, as it did a CMDQ_OP_CFGI_STE while the CONS was
> still pointing to previous CMDQ_OP_ATC_INV.
> 
> IIUIC, the IOPF queue doesn't expect duplicated PRI events. So, it
> might invoke the handler twice if there is a duplication. The core
> handling list_add() might need a deduplication.
> 
> > > @@ -456,6 +555,27 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
> > >   	 * not to touch any of the shadow cmdq state.
> > >   	 */
> > >   	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
> > > +
> > > +	if (idx == CMDQ_ERR_CERROR_ATC_INV_IDX) {
> > > +		/*
> > > +		 * Since commands can be issued in batch making it difficult to
> > > +		 * identify which CMDQ_OP_ATC_INV actually timed out, the driver
> > > +		 * must ensure only CMDQ_OP_ATC_INV commands for the same device
> > > +		 * can be batched.
> > 
> > But this *is* "the driver" - arm_smmu_atc_inv_domain() is literally further
> > down this same C file, and does not do what this comment is saying it must
> > do, so how are you expecting this to work correctly?
> 
> Oh, that's a big miss..
> 
> I imaged these changes to be based on the arm_smmu_invs series,
> where it could break the ATC in the batching function. Here, it
> should have made some change to arm_smmu_atc_inv_domain().
> 
> Thanks
> Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 21:15     ` Nicolin Chen
@ 2026-03-05 23:41       ` Jason Gunthorpe
  2026-03-06  1:29         ` Nicolin Chen
  2026-03-06 13:22         ` Robin Murphy
  0 siblings, 2 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-05 23:41 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 01:15:45PM -0800, Nicolin Chen wrote:

> You mean in arm_smmu_cmdq_issue_cmdlist() that issued the timed
> out ATC command?

Yes, it was my off hand thought.

> So my test case was to trigger a device fault followed by an ATC
> command. But, I found that the ATC command submission returned 0
> while only the ISR received:
>     CMDQ error (cons 0x03000003): ATC invalidate timeout
>     arm_smmu_debugfs_atc_write: ATC_INV ret=0
> 
> It seems difficult to insert a CMDQ_OP_CFGI_STE in the submission
> thread?

I didn't look, but I thought the CMDQ stops on the ATC invalidation,
flags the error and the ISR NOP's the failing CMDQ entry and restarts
it to resume the thread? Is that something else?

If so you could insert the STE flush instead of a NOP

Otherwise the arm_smmu_cmdq_issue_cmdlist() can just push another CMD
to the queue and sync, it is obviously in a context that can do that.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 21:06     ` Nicolin Chen
  2026-03-05 23:30       ` Nicolin Chen
@ 2026-03-05 23:52       ` Jason Gunthorpe
  2026-03-06 15:24         ` Robin Murphy
  1 sibling, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-05 23:52 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, will, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 01:06:21PM -0800, Nicolin Chen wrote:
> That sounds like the IOPF implementation. Maybe inventing another
> IOMMU_FAULT_ATC_TIMEOUT to reuse the existing infrastructure would
> make things cleaner.

I think the routing is quite different, IOPF wants to route an event
the domain creator, here you want to route an event to the IOMMU core
then the PCIe RAS callbacks.

IDK if there is much to be reused there, especially since IOPF
requires a memory allocation and ideally we should not be allocating
memory to resolve this critical error condition.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 23:41       ` Jason Gunthorpe
@ 2026-03-06  1:29         ` Nicolin Chen
  2026-03-06  1:33           ` Jason Gunthorpe
  2026-03-06 13:22         ` Robin Murphy
  1 sibling, 1 reply; 44+ messages in thread
From: Nicolin Chen @ 2026-03-06  1:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 07:41:58PM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 05, 2026 at 01:15:45PM -0800, Nicolin Chen wrote:
> 
> > You mean in arm_smmu_cmdq_issue_cmdlist() that issued the timed
> > out ATC command?
> 
> Yes, it was my off hand thought.
>
> > So my test case was to trigger a device fault followed by an ATC
> > command. But, I found that the ATC command submission returned 0
> > while only the ISR received:
> >     CMDQ error (cons 0x03000003): ATC invalidate timeout
> >     arm_smmu_debugfs_atc_write: ATC_INV ret=0
> > 
> > It seems difficult to insert a CMDQ_OP_CFGI_STE in the submission
> > thread?
> 
> I didn't look, but I thought the CMDQ stops on the ATC invalidation,
> flags the error and the ISR NOP's the failing CMDQ entry and restarts
> it to resume the thread? Is that something else?
> 
> If so you could insert the STE flush instead of a NOP

Yea, we could do a surgical STE update/flush in the ISR, bypassing
the arm_smmu_ste_writer that has dependency on "master" vs "smmu".

> Otherwise the arm_smmu_cmdq_issue_cmdlist() can just push another CMD
> to the queue and sync, it is obviously in a context that can do that.

It was actually a good idea and would make things cleaner..

But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
CMD. In my case where ATC_INV irq occurs, the return value from the
arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
are also matched. Actually, at this point that NOP ISR has already
finished.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06  1:29         ` Nicolin Chen
@ 2026-03-06  1:33           ` Jason Gunthorpe
  2026-03-06  5:06             ` Nicolin Chen
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06  1:33 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 05:29:22PM -0800, Nicolin Chen wrote:

> But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
> CMD. In my case where ATC_INV irq occurs, the return value from the
> arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
> are also matched. Actually, at this point that NOP ISR has already
> finished.

Yes, you'd need a sneaky way to convay the error from the ISR to the
cmdlist code that didn't harm performance. Maybe we could come up with
something, but if it works replacing the NOP with flush sounds fairly
appealing - though can you do a single WORD edit to the STE that will
block translated requests? Zero EATS?

Also, will the SMMU start spamming with blocked translation events or
something that will need suppression too?

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05  5:21 ` [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts Nicolin Chen
                     ` (2 preceding siblings ...)
  2026-03-05 15:39   ` Jason Gunthorpe
@ 2026-03-06  2:35   ` kernel test robot
  2026-03-10 19:16   ` Pranjal Shrivastava
  4 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2026-03-06  2:35 UTC (permalink / raw)
  To: Nicolin Chen, will, robin.murphy, joro, bhelgaas, jgg
  Cc: oe-kbuild-all, rafael, lenb, praan, kees, baolu.lu, smostafa,
	Alexander.Grest, kevin.tian, miko.lenczewski, linux-arm-kernel,
	iommu, linux-kernel, linux-acpi, linux-pci, vsethi

Hi Nicolin,

kernel test robot noticed the following build errors:

[auto build test ERROR on pci/next]
[also build test ERROR on pci/for-linus rafael-pm/linux-next rafael-pm/bleeding-edge soc/for-next linus/master v7.0-rc2 next-20260305]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nicolin-Chen/iommu-Do-not-call-pci_dev_reset_iommu_done-unless-reset-succeeds/20260305-132923
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link:    https://lore.kernel.org/r/ca7ab934bf0f433b62a5c15d42241632c4cb9366.1772686998.git.nicolinc%40nvidia.com
patch subject: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
config: arm64-randconfig-001-20260306 (https://download.01.org/0day-ci/archive/20260306/202603061001.fesCQb1B-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 12.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260306/202603061001.fesCQb1B-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603061001.fesCQb1B-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c: In function 'arm_smmu_atc_recovery_worker':
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:467:9: error: implicit declaration of function 'pci_dev_lock'; did you mean 'pci_dev_get'? [-Werror=implicit-function-declaration]
     467 |         pci_dev_lock(pdev);
         |         ^~~~~~~~~~~~
         |         pci_dev_get
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:471:9: error: implicit declaration of function 'pci_dev_unlock'; did you mean 'inode_unlock'? [-Werror=implicit-function-declaration]
     471 |         pci_dev_unlock(pdev);
         |         ^~~~~~~~~~~~~~
         |         inode_unlock
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:480:14: error: implicit declaration of function 'pci_reset_function' [-Werror=implicit-function-declaration]
     480 |         if (!pci_reset_function(pdev)) {
         |              ^~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +467 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c

   431	
   432	static void arm_smmu_atc_recovery_worker(struct work_struct *work)
   433	{
   434		struct arm_smmu_atc_recovery_param *param =
   435			container_of(work, struct arm_smmu_atc_recovery_param, work);
   436		struct pci_dev *pdev;
   437	
   438		scoped_guard(mutex, &param->smmu->streams_mutex) {
   439			struct arm_smmu_master *master;
   440	
   441			master = arm_smmu_find_master(param->smmu, param->sid);
   442			if (!master || WARN_ON(!dev_is_pci(master->dev)))
   443				goto free_param;
   444			pdev = to_pci_dev(master->dev);
   445			pci_dev_get(pdev);
   446		}
   447	
   448		scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock) {
   449			struct arm_smmu_atc_recovery_param *e;
   450	
   451			list_for_each_entry(e, &param->smmu->atc_recovery.list, node) {
   452				/* Device is already being recovered */
   453				if (e->pdev == pdev)
   454					goto put_pdev;
   455			}
   456			param->pdev = pdev;
   457			list_add(&param->node, &param->smmu->atc_recovery.list);
   458		}
   459	
   460		/*
   461		 * Stop DMA (PCI) and block ATS (IOMMU) immediately, to prevent memory
   462		 * corruption. This must take pci_dev_lock to prevent any racy unplug.
   463		 *
   464		 * If pci_dev_reset_iommu_prepare() fails, pci_reset_function will call
   465		 * it again internally.
   466		 */
 > 467		pci_dev_lock(pdev);
   468		pci_clear_master(pdev);
   469		if (pci_dev_reset_iommu_prepare(pdev))
   470			pci_err(pdev, "failed to block ATS!\n");
 > 471		pci_dev_unlock(pdev);
   472	
   473		/*
   474		 * ATC timeout indicates the device has stopped responding to coherence
   475		 * protocol requests. The only safe recovery is a reset to flush stale
   476		 * cached translations. Note that pci_reset_function() internally calls
   477		 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
   478		 * if PCI-level reset fails.
   479		 */
 > 480		if (!pci_reset_function(pdev)) {
   481			/*
   482			 * If reset succeeds, set BME back. Otherwise, fence the system
   483			 * from a faulty device, in which case user will have to replug
   484			 * the device to invoke pci_set_master().
   485			 */
   486			pci_dev_lock(pdev);
   487			pci_set_master(pdev);
   488			pci_dev_unlock(pdev);
   489		}
   490		scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock)
   491			list_del(&param->node);
   492	put_pdev:
   493		pci_dev_put(pdev);
   494	free_param:
   495		kfree(param);
   496	}
   497	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 15:39   ` Jason Gunthorpe
  2026-03-05 21:15     ` Nicolin Chen
@ 2026-03-06  3:22     ` Baolu Lu
  2026-03-06 13:00       ` Jason Gunthorpe
  1 sibling, 1 reply; 44+ messages in thread
From: Baolu Lu @ 2026-03-06  3:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On 3/5/26 23:39, Jason Gunthorpe wrote:
> On Wed, Mar 04, 2026 at 09:21:42PM -0800, Nicolin Chen wrote:
>> +	/*
>> +	 * ATC timeout indicates the device has stopped responding to coherence
>> +	 * protocol requests. The only safe recovery is a reset to flush stale
>> +	 * cached translations. Note that pci_reset_function() internally calls
>> +	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
>> +	 * if PCI-level reset fails.
>> +	 */
>> +	if (!pci_reset_function(pdev)) {
>> +		/*
>> +		 * If reset succeeds, set BME back. Otherwise, fence the system
>> +		 * from a faulty device, in which case user will have to replug
>> +		 * the device to invoke pci_set_master().
>> +		 */
>> +		pci_dev_lock(pdev);
>> +		pci_set_master(pdev);
>> +		pci_dev_unlock(pdev);
>> +	}
> I thought we talked about this, the iommu driver cannot just blindly
> issue a reset like this, the reset has to come from the actual device
> driver through the AERish mechanism. Otherwise the driver RAS is going
> to explode.
> 
> The smmu driver should immediately block the STE (reject translated
> requests) to protect the system before resuming whatever command
> submissio n has encountered the error.
> 
> You could delegate the STE change to the interrupted command
> submission to avoid doing it from a ISR, that makes alot of sense
> because the submission thread is already operating a cmdq so it could
> stick in a STE invalidation command, possibly even in place of the
> failed ATC command.
> 
> I think I'd break this up into smaller steps, just focus on this STE
> mechanism at start and have any future attach callback fix the STE.
> 
> Then we can talk about how to properly trigger the PCI RAS flow and so
> on.

I believe this issue is not unique to the arm-smmu-v3 driver. Device ATC
invalidation timeout is a generic challenge across all IOMMU
architectures that support PCI ATS. Would it be feasible to implement a
common 'fencing and recovery' mechanism in the IOMMU core so that all
IOMMU drivers could benefit?

Thanks,
baolu

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06  1:33           ` Jason Gunthorpe
@ 2026-03-06  5:06             ` Nicolin Chen
  2026-03-06 13:02               ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolin Chen @ 2026-03-06  5:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 09:33:47PM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 05, 2026 at 05:29:22PM -0800, Nicolin Chen wrote:
> 
> > But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
> > CMD. In my case where ATC_INV irq occurs, the return value from the
> > arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
> > are also matched. Actually, at this point that NOP ISR has already
> > finished.
> 
> Yes, you'd need a sneaky way to convay the error from the ISR to the
> cmdlist code that didn't harm performance. Maybe we could come up with
> something, but if it works replacing the NOP with flush sounds fairly
> appealing - though can you do a single WORD edit to the STE that will
> block translated requests? Zero EATS?

Yea. I can give that a try.

> Also, will the SMMU start spamming with blocked translation events or
> something that will need suppression too?

CD.R=0 can suppress fault records, but we would need to override
that in every CD of the device.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06  3:22     ` Baolu Lu
@ 2026-03-06 13:00       ` Jason Gunthorpe
  2026-03-06 19:35         ` Samiullah Khawaja
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 13:00 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Nicolin Chen, will, robin.murphy, joro, bhelgaas, rafael, lenb,
	praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 11:22:52AM +0800, Baolu Lu wrote:
> I believe this issue is not unique to the arm-smmu-v3 driver. Device ATC
> invalidation timeout is a generic challenge across all IOMMU
> architectures that support PCI ATS. Would it be feasible to implement a
> common 'fencing and recovery' mechanism in the IOMMU core so that all
> IOMMU drivers could benefit?

I think yes, for parts, but the driver itself has to do something deep
inside it's invalidation to allow the flush to complete without
exposing the system to memory corruption - meaning it has to block
translated requests before completing the flush

I don't see how that can be made too generalized since we are running
this flush stuff in irq and reclaim contexts, it has to be very small
and targtted without memory allocation or sleeping locks.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06  5:06             ` Nicolin Chen
@ 2026-03-06 13:02               ` Jason Gunthorpe
  2026-03-06 19:20                 ` Nicolin Chen
  2026-03-10 19:40                 ` Pranjal Shrivastava
  0 siblings, 2 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 13:02 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Thu, Mar 05, 2026 at 09:06:17PM -0800, Nicolin Chen wrote:
> On Thu, Mar 05, 2026 at 09:33:47PM -0400, Jason Gunthorpe wrote:
> > On Thu, Mar 05, 2026 at 05:29:22PM -0800, Nicolin Chen wrote:
> > 
> > > But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
> > > CMD. In my case where ATC_INV irq occurs, the return value from the
> > > arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
> > > are also matched. Actually, at this point that NOP ISR has already
> > > finished.
> > 
> > Yes, you'd need a sneaky way to convay the error from the ISR to the
> > cmdlist code that didn't harm performance. Maybe we could come up with
> > something, but if it works replacing the NOP with flush sounds fairly
> > appealing - though can you do a single WORD edit to the STE that will
> > block translated requests? Zero EATS?
> 
> Yea. I can give that a try.

This also really needs to go after the invalidation changes because it
is feasible to also edit the lockless RCU invalidation list from the
ISR and disable the ATC for the failed device too.

> > Also, will the SMMU start spamming with blocked translation events or
> > something that will need suppression too?
> 
> CD.R=0 can suppress fault records, but we would need to override
> that in every CD of the device.

That's too much to do from ISR, but maybe we can do it from a WQ..

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 23:41       ` Jason Gunthorpe
  2026-03-06  1:29         ` Nicolin Chen
@ 2026-03-06 13:22         ` Robin Murphy
  2026-03-06 14:01           ` Jason Gunthorpe
  1 sibling, 1 reply; 44+ messages in thread
From: Robin Murphy @ 2026-03-06 13:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: will, joro, bhelgaas, rafael, lenb, praan, kees, baolu.lu,
	smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On 2026-03-05 11:41 pm, Jason Gunthorpe wrote:
> On Thu, Mar 05, 2026 at 01:15:45PM -0800, Nicolin Chen wrote:
> 
>> You mean in arm_smmu_cmdq_issue_cmdlist() that issued the timed
>> out ATC command?
> 
> Yes, it was my off hand thought.
> 
>> So my test case was to trigger a device fault followed by an ATC
>> command. But, I found that the ATC command submission returned 0
>> while only the ISR received:
>>      CMDQ error (cons 0x03000003): ATC invalidate timeout
>>      arm_smmu_debugfs_atc_write: ATC_INV ret=0
>>
>> It seems difficult to insert a CMDQ_OP_CFGI_STE in the submission
>> thread?
> 
> I didn't look, but I thought the CMDQ stops on the ATC invalidation,
> flags the error and the ISR NOP's the failing CMDQ entry and restarts
> it to resume the thread? Is that something else?
> 
> If so you could insert the STE flush instead of a NOP

Nope, sadly the timeout is asynchronous, and CERROR_ATC_INV_SYNC is only 
reported on the *next* CMD_SYNC - it can't even tell us which 
CMD_ATC_INV(s) had a problem. Also there is no NOP; currently the only 
command rewriting we do is for CERROR_ILL, where we turn the illegal 
command into a CMD_SYNC.

We couldn't necessarily rely on being able to rewind the hardware CONS 
pointer from a CMD_SYNC, as by that point we're likely to have observed 
it and updated llq->cons, such that other threads could move llq->prod 
forward and fill that space with new commands.

Thanks,
Robin.

> Otherwise the arm_smmu_cmdq_issue_cmdlist() can just push another CMD
> to the queue and sync, it is obviously in a context that can do that.
> 
> Jason


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 13:22         ` Robin Murphy
@ 2026-03-06 14:01           ` Jason Gunthorpe
  2026-03-06 20:18             ` Nicolin Chen
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 14:01 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, will, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 01:22:11PM +0000, Robin Murphy wrote:
> On 2026-03-05 11:41 pm, Jason Gunthorpe wrote:
> > On Thu, Mar 05, 2026 at 01:15:45PM -0800, Nicolin Chen wrote:
> > 
> > > You mean in arm_smmu_cmdq_issue_cmdlist() that issued the timed
> > > out ATC command?
> > 
> > Yes, it was my off hand thought.
> > 
> > > So my test case was to trigger a device fault followed by an ATC
> > > command. But, I found that the ATC command submission returned 0
> > > while only the ISR received:
> > >      CMDQ error (cons 0x03000003): ATC invalidate timeout
> > >      arm_smmu_debugfs_atc_write: ATC_INV ret=0
> > > 
> > > It seems difficult to insert a CMDQ_OP_CFGI_STE in the submission
> > > thread?
> > 
> > I didn't look, but I thought the CMDQ stops on the ATC invalidation,
> > flags the error and the ISR NOP's the failing CMDQ entry and restarts
> > it to resume the thread? Is that something else?
> > 
> > If so you could insert the STE flush instead of a NOP
> 
> Nope, sadly the timeout is asynchronous, and CERROR_ATC_INV_SYNC is only
> reported on the *next* CMD_SYNC - it can't even tell us which CMD_ATC_INV(s)
> had a problem.

!! That's a good point! The new invalidation code runs many ATC
invalidations under one sync to optimize for SVA performance so we
have no idea what devices need to be reset :(

So we really do need to signal to the issuing thread and it will have
to go back and check how many ATC invalidations are under this sync
and re-issue one by one to isolate the error then issue the STE change
and sync. Nothing from an ISR then..

> Also there is no NOP; currently the only command rewriting we
> do is for CERROR_ILL, where we turn the illegal command into a CMD_SYNC.

Oh, OK thats what I was half remembering then..

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05 23:52       ` Jason Gunthorpe
@ 2026-03-06 15:24         ` Robin Murphy
  2026-03-06 15:56           ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Robin Murphy @ 2026-03-06 15:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: will, joro, bhelgaas, rafael, lenb, praan, kees, baolu.lu,
	smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On 2026-03-05 11:52 pm, Jason Gunthorpe wrote:
> On Thu, Mar 05, 2026 at 01:06:21PM -0800, Nicolin Chen wrote:
>> That sounds like the IOPF implementation. Maybe inventing another
>> IOMMU_FAULT_ATC_TIMEOUT to reuse the existing infrastructure would
>> make things cleaner.
> 
> I think the routing is quite different, IOPF wants to route an event
> the domain creator, here you want to route an event to the IOMMU core
> then the PCIe RAS callbacks.
> 
> IDK if there is much to be reused there, especially since IOPF
> requires a memory allocation and ideally we should not be allocating
> memory to resolve this critical error condition.

Yeah, sorry, for a moment there I somehow forgot that we can expect to 
use ATS without PRI, so indeed tying this to IOPF wouldn't be 
appropriate. And given the general difficulty of trying to infer what 
went wrong and what to do from the CMDQ contents alone, I do like your 
idea of trying to return a new kind of sync failure back to 
arm_smmu_atc_inv_{master,domain}() so that we can take any defensive 
action from there, with all the information to hand. We'd just have to 
ensure that if a large set of ATCI commands needs to span multiple 
batches, every batch must contain its own sync (since if some other 
batch of unrelated commands could get interleaved in the middle and 
issue a sync that then fails due to someone else's ATC timeout, 
everything's likely to get confused and go wrong).

The fiddly thing then is that we might also have to be prepared to 
"handle" CMD_SYNC timeout by manually checking for GERRORs, in case the 
whole invalidation is in the context of an dma_unmap within some other 
device's IRQ handler, which happens to be on the same CPU where the 
GERROR IRQ is now pending, but can't be taken until we can complete the 
inv and return out of the current IRQ :/

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 15:24         ` Robin Murphy
@ 2026-03-06 15:56           ` Jason Gunthorpe
  2026-03-10 19:34             ` Pranjal Shrivastava
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 15:56 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, will, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 03:24:20PM +0000, Robin Murphy wrote:
> On 2026-03-05 11:52 pm, Jason Gunthorpe wrote:
> > On Thu, Mar 05, 2026 at 01:06:21PM -0800, Nicolin Chen wrote:
> > > That sounds like the IOPF implementation. Maybe inventing another
> > > IOMMU_FAULT_ATC_TIMEOUT to reuse the existing infrastructure would
> > > make things cleaner.
> > 
> > I think the routing is quite different, IOPF wants to route an event
> > the domain creator, here you want to route an event to the IOMMU core
> > then the PCIe RAS callbacks.
> > 
> > IDK if there is much to be reused there, especially since IOPF
> > requires a memory allocation and ideally we should not be allocating
> > memory to resolve this critical error condition.
> 
> Yeah, sorry, for a moment there I somehow forgot that we can expect to use
> ATS without PRI, so indeed tying this to IOPF wouldn't be appropriate. And
> given the general difficulty of trying to infer what went wrong and what to
> do from the CMDQ contents alone, I do like your idea of trying to return a
> new kind of sync failure back to arm_smmu_atc_inv_{master,domain}() so that
> we can take any defensive action from there, with all the information to
> hand. We'd just have to ensure that if a large set of ATCI commands needs to
> span multiple batches, every batch must contain its own sync (since if some
> other batch of unrelated commands could get interleaved in the middle and
> issue a sync that then fails due to someone else's ATC timeout, everything's
> likely to get confused and go wrong).

Yeah, that all makes sense to me.

The batching issue is scary, we definately can't allow an ATC
invalidation to be pushed without a SYNC that localizes any failure to
this specific thread, or we can't properly disambiguate the failures
anymore.

My feeling is when the sync "fails", it can bubble up the error and we
can get back to the invalidation list processor which can then see it
failed to process an ATC batch and take an appropriate action.

> The fiddly thing then is that we might also have to be prepared to "handle"
> CMD_SYNC timeout by manually checking for GERRORs, in case the whole
> invalidation is in the context of an dma_unmap within some other device's
> IRQ handler, which happens to be on the same CPU where the GERROR IRQ is now
> pending, but can't be taken until we can complete the inv and return out of
> the current IRQ :/

IIRC didn't the PM patches propose to add this anyhow?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 13:02               ` Jason Gunthorpe
@ 2026-03-06 19:20                 ` Nicolin Chen
  2026-03-06 19:22                   ` Jason Gunthorpe
  2026-03-10 19:40                 ` Pranjal Shrivastava
  1 sibling, 1 reply; 44+ messages in thread
From: Nicolin Chen @ 2026-03-06 19:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 09:02:02AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 05, 2026 at 09:06:17PM -0800, Nicolin Chen wrote:
> > On Thu, Mar 05, 2026 at 09:33:47PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Mar 05, 2026 at 05:29:22PM -0800, Nicolin Chen wrote:
> > > 
> > > > But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
> > > > CMD. In my case where ATC_INV irq occurs, the return value from the
> > > > arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
> > > > are also matched. Actually, at this point that NOP ISR has already
> > > > finished.
> > > 
> > > Yes, you'd need a sneaky way to convay the error from the ISR to the
> > > cmdlist code that didn't harm performance. Maybe we could come up with
> > > something, but if it works replacing the NOP with flush sounds fairly
> > > appealing - though can you do a single WORD edit to the STE that will
> > > block translated requests? Zero EATS?
> > 
> > Yea. I can give that a try.
> 
> This also really needs to go after the invalidation changes because it
> is feasible to also edit the lockless RCU invalidation list from the
> ISR and disable the ATC for the failed device too.

Yea, it is likely something that we have to do to deduplicate new
ATC timeouts triggering another reset.

In general, the maximum users count of an INV_TYPE_ATS would be 1.
So, an unref() would be sufficient to mute it, though it'd require
the unref() API to support a mismatched users counter, because the
PCI reset in the WQ would block ATS, which would try to unref the
removed command once again.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 19:20                 ` Nicolin Chen
@ 2026-03-06 19:22                   ` Jason Gunthorpe
  2026-03-06 19:39                     ` Nicolin Chen
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 19:22 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 11:20:35AM -0800, Nicolin Chen wrote:

> In general, the maximum users count of an INV_TYPE_ATS would be 1.
> So, an unref() would be sufficient to mute it, though it'd require
> the unref() API to support a mismatched users counter, because the
> PCI reset in the WQ would block ATS, which would try to unref the
> removed command once again.

INV_TYPE_ATS_DISABLED could be made to work too..

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 13:00       ` Jason Gunthorpe
@ 2026-03-06 19:35         ` Samiullah Khawaja
  2026-03-06 19:43           ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Samiullah Khawaja @ 2026-03-06 19:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Nicolin Chen, will, robin.murphy, joro, bhelgaas,
	rafael, lenb, praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 09:00:06AM -0400, Jason Gunthorpe wrote:
>On Fri, Mar 06, 2026 at 11:22:52AM +0800, Baolu Lu wrote:
>> I believe this issue is not unique to the arm-smmu-v3 driver. Device ATC
>> invalidation timeout is a generic challenge across all IOMMU
>> architectures that support PCI ATS. Would it be feasible to implement a
>> common 'fencing and recovery' mechanism in the IOMMU core so that all
>> IOMMU drivers could benefit?
>
>I think yes, for parts, but the driver itself has to do something deep
>inside it's invalidation to allow the flush to complete without
>exposing the system to memory corruption - meaning it has to block
>translated requests before completing the flush

Yes and currently the underlying drivers have software timeouts
(AMD=100millisecond, arm-smmu-v3=1second) defined which could timeout
before the actual ATC invalidation timeout occurs. Do you think maybe
the timeout needs to be propagated to the caller (flush callback) so the
memory/IOVA is not allocated to something else? Or blocking translated
requests for such devices should be enough?
>
>I don't see how that can be made too generalized since we are running
>this flush stuff in irq and reclaim contexts, it has to be very small
>and targtted without memory allocation or sleeping locks.
>

>Jason
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 19:22                   ` Jason Gunthorpe
@ 2026-03-06 19:39                     ` Nicolin Chen
  2026-03-06 19:47                       ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolin Chen @ 2026-03-06 19:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 03:22:11PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2026 at 11:20:35AM -0800, Nicolin Chen wrote:
> 
> > In general, the maximum users count of an INV_TYPE_ATS would be 1.
> > So, an unref() would be sufficient to mute it, though it'd require
> > the unref() API to support a mismatched users counter, because the
> > PCI reset in the WQ would block ATS, which would try to unref the
> > removed command once again.
> 
> INV_TYPE_ATS_DISABLED could be made to work too..

Overriding the type.. hmm, that's brilliant.

Rechecking the unref API, I realized that even the "users" in the
latest version isn't atomic anymore. I recall that we are fine to
do READ_ONCE/WRITE_ONCE on an int type because invalidation array
is serialized by the asid mutex, which wouldn't work either in an
ISR or just in the arm_smmu_master/domain_atc_inv()?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 19:35         ` Samiullah Khawaja
@ 2026-03-06 19:43           ` Jason Gunthorpe
  2026-03-06 19:59             ` Samiullah Khawaja
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 19:43 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Baolu Lu, Nicolin Chen, will, robin.murphy, joro, bhelgaas,
	rafael, lenb, praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 07:35:19PM +0000, Samiullah Khawaja wrote:
> On Fri, Mar 06, 2026 at 09:00:06AM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2026 at 11:22:52AM +0800, Baolu Lu wrote:
> > > I believe this issue is not unique to the arm-smmu-v3 driver. Device ATC
> > > invalidation timeout is a generic challenge across all IOMMU
> > > architectures that support PCI ATS. Would it be feasible to implement a
> > > common 'fencing and recovery' mechanism in the IOMMU core so that all
> > > IOMMU drivers could benefit?
> > 
> > I think yes, for parts, but the driver itself has to do something deep
> > inside it's invalidation to allow the flush to complete without
> > exposing the system to memory corruption - meaning it has to block
> > translated requests before completing the flush
> 
> Yes and currently the underlying drivers have software timeouts
> (AMD=100millisecond, arm-smmu-v3=1second) defined which could timeout
> before the actual ATC invalidation timeout occurs. Do you think maybe
> the timeout needs to be propagated to the caller (flush callback) so the
> memory/IOVA is not allocated to something else?

No, definitely not, that's basically impossible, so many callers just
can't handle such an idea, and you can't ever fully recover from such
a thing.

> Or blocking translated requests for such devices should be enough?

Yes, we have to fence the hardware and then allow the existing SW
stack to continue without any fear of UAF from the broken HW.

Fencing the HW means using the IOMMU to block translated requests.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 19:39                     ` Nicolin Chen
@ 2026-03-06 19:47                       ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 19:47 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 11:39:09AM -0800, Nicolin Chen wrote:
> On Fri, Mar 06, 2026 at 03:22:11PM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2026 at 11:20:35AM -0800, Nicolin Chen wrote:
> > 
> > > In general, the maximum users count of an INV_TYPE_ATS would be 1.
> > > So, an unref() would be sufficient to mute it, though it'd require
> > > the unref() API to support a mismatched users counter, because the
> > > PCI reset in the WQ would block ATS, which would try to unref the
> > > removed command once again.
> > 
> > INV_TYPE_ATS_DISABLED could be made to work too..
> 
> Overriding the type.. hmm, that's brilliant.
> 
> Rechecking the unref API, I realized that even the "users" in the
> latest version isn't atomic anymore. I recall that we are fine to
> do READ_ONCE/WRITE_ONCE on an int type because invalidation array
> is serialized by the asid mutex, which wouldn't work either in an
> ISR or just in the arm_smmu_master/domain_atc_inv()?

I think doing anything from an ISR is dead at this point, you'll be in
a flush context instead

We can be sloppy with locking here because the goal is to
opportunistically inhibit future ATS invalidations, if one leaks
through because of sloppy locking then it will get handled anyhow.

So you can WRITE_ONCE the type and READ_ONCE from the merge side and
that is close enough. Just document why the race is OK. We also hold a
lock around the ATS invalidation push so maybe it can be fully locked
anyhow?

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 19:43           ` Jason Gunthorpe
@ 2026-03-06 19:59             ` Samiullah Khawaja
  2026-03-06 20:03               ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Samiullah Khawaja @ 2026-03-06 19:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Nicolin Chen, will, robin.murphy, joro, bhelgaas,
	rafael, lenb, praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 03:43:12PM -0400, Jason Gunthorpe wrote:
>On Fri, Mar 06, 2026 at 07:35:19PM +0000, Samiullah Khawaja wrote:
>> On Fri, Mar 06, 2026 at 09:00:06AM -0400, Jason Gunthorpe wrote:
>> > On Fri, Mar 06, 2026 at 11:22:52AM +0800, Baolu Lu wrote:
>> > > I believe this issue is not unique to the arm-smmu-v3 driver. Device ATC
>> > > invalidation timeout is a generic challenge across all IOMMU
>> > > architectures that support PCI ATS. Would it be feasible to implement a
>> > > common 'fencing and recovery' mechanism in the IOMMU core so that all
>> > > IOMMU drivers could benefit?
>> >
>> > I think yes, for parts, but the driver itself has to do something deep
>> > inside it's invalidation to allow the flush to complete without
>> > exposing the system to memory corruption - meaning it has to block
>> > translated requests before completing the flush
>>
>> Yes and currently the underlying drivers have software timeouts
>> (AMD=100millisecond, arm-smmu-v3=1second) defined which could timeout
>> before the actual ATC invalidation timeout occurs. Do you think maybe
>> the timeout needs to be propagated to the caller (flush callback) so the
>> memory/IOVA is not allocated to something else?
>
>No, definitely not, that's basically impossible, so many callers just
>can't handle such an idea, and you can't ever fully recover from such
>a thing.
>

Agreed.
>> Or blocking translated requests for such devices should be enough?
>
>Yes, we have to fence the hardware and then allow the existing SW
>stack to continue without any fear of UAF from the broken HW.

And this applies to software timeout also I think, since both have same
end result.

I am working on a series to solve this for VT-d and testing it
internally.
>
>Fencing the HW means using the IOMMU to block translated requests.
>
>Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 19:59             ` Samiullah Khawaja
@ 2026-03-06 20:03               ` Jason Gunthorpe
  2026-03-06 20:22                 ` Samiullah Khawaja
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 20:03 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Baolu Lu, Nicolin Chen, will, robin.murphy, joro, bhelgaas,
	rafael, lenb, praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 07:59:33PM +0000, Samiullah Khawaja wrote:
> On Fri, Mar 06, 2026 at 03:43:12PM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2026 at 07:35:19PM +0000, Samiullah Khawaja wrote:
> > > On Fri, Mar 06, 2026 at 09:00:06AM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Mar 06, 2026 at 11:22:52AM +0800, Baolu Lu wrote:
> > > > > I believe this issue is not unique to the arm-smmu-v3 driver. Device ATC
> > > > > invalidation timeout is a generic challenge across all IOMMU
> > > > > architectures that support PCI ATS. Would it be feasible to implement a
> > > > > common 'fencing and recovery' mechanism in the IOMMU core so that all
> > > > > IOMMU drivers could benefit?
> > > >
> > > > I think yes, for parts, but the driver itself has to do something deep
> > > > inside it's invalidation to allow the flush to complete without
> > > > exposing the system to memory corruption - meaning it has to block
> > > > translated requests before completing the flush
> > > 
> > > Yes and currently the underlying drivers have software timeouts
> > > (AMD=100millisecond, arm-smmu-v3=1second) defined which could timeout
> > > before the actual ATC invalidation timeout occurs. Do you think maybe
> > > the timeout needs to be propagated to the caller (flush callback) so the
> > > memory/IOVA is not allocated to something else?
> > 
> > No, definitely not, that's basically impossible, so many callers just
> > can't handle such an idea, and you can't ever fully recover from such
> > a thing.
> > 
> 
> Agreed.
> > > Or blocking translated requests for such devices should be enough?
> > 
> > Yes, we have to fence the hardware and then allow the existing SW
> > stack to continue without any fear of UAF from the broken HW.
> 
> And this applies to software timeout also I think, since both have same
> end result.

Any situation where the ATC flush doesn't get a positive response from
the HW must fence the HW before continuing to avoid UAF bugs.

Obviously today we just succeed the flush anyhow and hope for the
best, and I think that is a good starting point for VT-d. We need at
least that to build anything more complex on to.

Fencing the device also has to come with a full RAS flow to eventually
unfence it, so I wouldn't do it in isolation.

I would like the unfence to be done with a fresh domain attach (or
re-attach I guess) that just rewrites the context entry with the
correct one.

For VT-d that probably also means it will need all the domain attach
fixing we've talked about as a precondition too.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 14:01           ` Jason Gunthorpe
@ 2026-03-06 20:18             ` Nicolin Chen
  2026-03-06 20:22               ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolin Chen @ 2026-03-06 20:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, will, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 10:01:15AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2026 at 01:22:11PM +0000, Robin Murphy wrote:
> > On 2026-03-05 11:41 pm, Jason Gunthorpe wrote:
> > > On Thu, Mar 05, 2026 at 01:15:45PM -0800, Nicolin Chen wrote:
> > > 
> > > > You mean in arm_smmu_cmdq_issue_cmdlist() that issued the timed
> > > > out ATC command?
> > > 
> > > Yes, it was my off hand thought.
> > > 
> > > > So my test case was to trigger a device fault followed by an ATC
> > > > command. But, I found that the ATC command submission returned 0
> > > > while only the ISR received:
> > > >      CMDQ error (cons 0x03000003): ATC invalidate timeout
> > > >      arm_smmu_debugfs_atc_write: ATC_INV ret=0
> > > > 
> > > > It seems difficult to insert a CMDQ_OP_CFGI_STE in the submission
> > > > thread?
> > > 
> > > I didn't look, but I thought the CMDQ stops on the ATC invalidation,
> > > flags the error and the ISR NOP's the failing CMDQ entry and restarts
> > > it to resume the thread? Is that something else?
> > > 
> > > If so you could insert the STE flush instead of a NOP
> > 
> > Nope, sadly the timeout is asynchronous, and CERROR_ATC_INV_SYNC is only
> > reported on the *next* CMD_SYNC - it can't even tell us which CMD_ATC_INV(s)
> > had a problem.
> 
> !! That's a good point! The new invalidation code runs many ATC
> invalidations under one sync to optimize for SVA performance so we
> have no idea what devices need to be reset :(
> 
> So we really do need to signal to the issuing thread and it will have
> to go back and check how many ATC invalidations are under this sync
> and re-issue one by one to isolate the error then issue the STE change
> and sync. Nothing from an ISR then..

IIUIC, we would have two timeouts to identify the device(s), so we
wouldn't need to give away the optimization of batching ATCI cmds?

Will letting a faulty device time out once again give it a window
to corrupt the memory?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 20:03               ` Jason Gunthorpe
@ 2026-03-06 20:22                 ` Samiullah Khawaja
  2026-03-06 20:26                   ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Samiullah Khawaja @ 2026-03-06 20:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Nicolin Chen, will, robin.murphy, joro, bhelgaas,
	rafael, lenb, praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 04:03:21PM -0400, Jason Gunthorpe wrote:
>On Fri, Mar 06, 2026 at 07:59:33PM +0000, Samiullah Khawaja wrote:
>> On Fri, Mar 06, 2026 at 03:43:12PM -0400, Jason Gunthorpe wrote:
>> > On Fri, Mar 06, 2026 at 07:35:19PM +0000, Samiullah Khawaja wrote:
>> > > On Fri, Mar 06, 2026 at 09:00:06AM -0400, Jason Gunthorpe wrote:
>> > > > On Fri, Mar 06, 2026 at 11:22:52AM +0800, Baolu Lu wrote:
>> > > > > I believe this issue is not unique to the arm-smmu-v3 driver. Device ATC
>> > > > > invalidation timeout is a generic challenge across all IOMMU
>> > > > > architectures that support PCI ATS. Would it be feasible to implement a
>> > > > > common 'fencing and recovery' mechanism in the IOMMU core so that all
>> > > > > IOMMU drivers could benefit?
>> > > >
>> > > > I think yes, for parts, but the driver itself has to do something deep
>> > > > inside it's invalidation to allow the flush to complete without
>> > > > exposing the system to memory corruption - meaning it has to block
>> > > > translated requests before completing the flush
>> > >
>> > > Yes and currently the underlying drivers have software timeouts
>> > > (AMD=100millisecond, arm-smmu-v3=1second) defined which could timeout
>> > > before the actual ATC invalidation timeout occurs. Do you think maybe
>> > > the timeout needs to be propagated to the caller (flush callback) so the
>> > > memory/IOVA is not allocated to something else?
>> >
>> > No, definitely not, that's basically impossible, so many callers just
>> > can't handle such an idea, and you can't ever fully recover from such
>> > a thing.
>> >
>>
>> Agreed.
>> > > Or blocking translated requests for such devices should be enough?
>> >
>> > Yes, we have to fence the hardware and then allow the existing SW
>> > stack to continue without any fear of UAF from the broken HW.
>>
>> And this applies to software timeout also I think, since both have same
>> end result.
>
>Any situation where the ATC flush doesn't get a positive response from
>the HW must fence the HW before continuing to avoid UAF bugs.
>
>Obviously today we just succeed the flush anyhow and hope for the
>best, and I think that is a good starting point for VT-d. We need at
>least that to build anything more complex on to.
>
>Fencing the device also has to come with a full RAS flow to eventually
>unfence it, so I wouldn't do it in isolation.

But do you think doing the timeout logic without fencing would be good
enough? Currently VT-d blocks itself, until it gets an Invalidation
Timeout from HW, and system ends up in a hardlockup since interrupts are
disabled.

Are you concerned that if fencing is done without an RAS flow, the
device might not be able to detect the failure (if it really needs ATS
to work)?

I am thinking, we can do translated fence and timeout change for VT-d.
And the device can use existing RAS mechanism to recover itself. This
way we atleast make sure that caller of flush can reuse the memory/IOVAs
without UAFs.
>
>I would like the unfence to be done with a fresh domain attach (or
>re-attach I guess) that just rewrites the context entry with the
>correct one.

Agreed.
>
>For VT-d that probably also means it will need all the domain attach
>fixing we've talked about as a precondition too.
>
>Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 20:18             ` Nicolin Chen
@ 2026-03-06 20:22               ` Jason Gunthorpe
  2026-03-06 20:34                 ` Nicolin Chen
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 20:22 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, will, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 12:18:24PM -0800, Nicolin Chen wrote:

> IIUIC, we would have two timeouts to identify the device(s), so we
> wouldn't need to give away the optimization of batching ATCI cmds?

Yes we should not harm success path performance to compensate for a
RAS error case.

> Will letting a faulty device time out once again give it a window
> to corrupt the memory?

No, you are doing this within the flush callback so everything relying
on that flush should remain held.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 20:22                 ` Samiullah Khawaja
@ 2026-03-06 20:26                   ` Jason Gunthorpe
  2026-03-10 20:00                     ` Samiullah Khawaja
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-06 20:26 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Baolu Lu, Nicolin Chen, will, robin.murphy, joro, bhelgaas,
	rafael, lenb, praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 08:22:08PM +0000, Samiullah Khawaja wrote:

> But do you think doing the timeout logic without fencing would be good
> enough?

It is what ARM and AMD do, so I wouldn't object to it.

> Currently VT-d blocks itself, until it gets an Invalidation Timeout
> from HW, and system ends up in a hardlockup since interrupts are
> disabled.
> 
> Are you concerned that if fencing is done without an RAS flow, the
> device might not be able to detect the failure (if it really needs ATS
> to work)?

Yes, and then the device is badly locked because nothing will fix the
IOMMU fence.

VFIO might fix it if it is restarted, but other approahces like
rmmod/insmod won't restore the broken device.

So I'd rather see a more complete solution before we add fencing to
the iommu drivers. Minimally userspace doing a rmmod, flr, insmod
should be able to restore the device.

Then auto-FLR through RAS could sit on top of that.

> I am thinking, we can do translated fence and timeout change for VT-d.
> And the device can use existing RAS mechanism to recover itself. This
> way we atleast make sure that caller of flush can reuse the memory/IOVAs
> without UAFs.

Without a larger framework to unfence I think this will get devices
stuck..

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 20:22               ` Jason Gunthorpe
@ 2026-03-06 20:34                 ` Nicolin Chen
  0 siblings, 0 replies; 44+ messages in thread
From: Nicolin Chen @ 2026-03-06 20:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, will, joro, bhelgaas, rafael, lenb, praan, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Fri, Mar 06, 2026 at 04:22:16PM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2026 at 12:18:24PM -0800, Nicolin Chen wrote:
> 
> > IIUIC, we would have two timeouts to identify the device(s), so we
> > wouldn't need to give away the optimization of batching ATCI cmds?
> 
> Yes we should not harm success path performance to compensate for a
> RAS error case.
> 
> > Will letting a faulty device time out once again give it a window
> > to corrupt the memory?
> 
> No, you are doing this within the flush callback so everything relying
> on that flush should remain held.

Oh, right! I didn't see that coming..

I will rebase the next version on the invalidation series as we
have multiple things relying on that now.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-05  5:21 ` [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts Nicolin Chen
                     ` (3 preceding siblings ...)
  2026-03-06  2:35   ` kernel test robot
@ 2026-03-10 19:16   ` Pranjal Shrivastava
  2026-03-10 19:51     ` Nicolin Chen
  4 siblings, 1 reply; 44+ messages in thread
From: Pranjal Shrivastava @ 2026-03-10 19:16 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Wed, Mar 04, 2026 at 09:21:42PM -0800, Nicolin Chen wrote:
> Currently, when GERROR_CMDQ_ERR occurs, the arm_smmu_cmdq_skip_err() won't
> do anything for the CMDQ_ERR_CERROR_ATC_INV_IDX.
> 
> When a device wasn't responsive to an ATC invalidation request, this often
> results in constant CMDQ errors:
>   unexpected global error reported (0x00000001), this could be serious
>   CMDQ error (cons 0x0302bb84): ATC invalidate timeout
>   unexpected global error reported (0x00000001), this could be serious
>   CMDQ error (cons 0x0302bb88): ATC invalidate timeout
>   unexpected global error reported (0x00000001), this could be serious
>   CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
>   ...
> 
> An ATC invalidation timeout indicates that the device failed to respond to
> a protocol-critical coherency request, which means that device's internal
> ATS state is desynchronized from the SMMU.
> 
> Furthermore, ignoring the timeout leaves the system in an unsafe state, as
> the device cache may retain stale ATC entries for memory pages that the OS
> has already reclaimed and reassigned. This might lead to data corruption.
> 
> The only safe recovery action is to issue a PCI reset, which guarantees to
> flush all internal device caches and recover the device.
> 
> Read the ATC_INV command that led to the timeouts, and schedule a recovery
> worker to reset the device corresponding to the Stream ID. If reset fails,
> keep the device in the resetting/blocking domain to avoid data corruption.
> 
> Though it'd be ideal to block it immediately in the ISR, it cannot be done
> because an STE update would require another CFIG_STE command that couldn't

Nit: s/CFIG_STE/CFGI_STE

> finish in the context of an ISR handling a CMDQ error.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   5 +
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 131 +++++++++++++++++++-
>  2 files changed, 132 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 3c6d65d36164f..8789cf8294504 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -803,6 +803,11 @@ struct arm_smmu_device {
>  
>  	struct rb_root			streams;
>  	struct mutex			streams_mutex;
> +
> +	struct {
> +		struct list_head	list;
> +		spinlock_t		lock; /* Lock the list */
> +	} atc_recovery;
>  };
>  
>  struct arm_smmu_stream {
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 4d00d796f0783..de182c27c77c4 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -106,6 +106,8 @@ static const char * const event_class_str[] = {
>  	[3] = "Reserved",
>  };
>  
> +static struct arm_smmu_master *
> +arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid);
>  static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
>  
>  static void parse_driver_options(struct arm_smmu_device *smmu)
> @@ -174,6 +176,13 @@ static void queue_inc_cons(struct arm_smmu_ll_queue *q)
>  	q->cons = Q_OVF(q->cons) | Q_WRP(q, cons) | Q_IDX(q, cons);
>  }
>  
> +static u32 queue_prev_cons(struct arm_smmu_ll_queue *q, u32 cons)
> +{
> +	u32 idx_wrp = (Q_WRP(q, cons) | Q_IDX(q, cons)) - 1;
> +
> +	return Q_OVF(cons) | Q_WRP(q, idx_wrp) | Q_IDX(q, idx_wrp);
> +}
> +
>  static void queue_sync_cons_ovf(struct arm_smmu_queue *q)
>  {
>  	struct arm_smmu_ll_queue *llq = &q->llq;
> @@ -410,6 +419,97 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
>  		u64p_replace_bits(cmd, CMDQ_SYNC_0_CS_NONE, CMDQ_SYNC_0_CS);
>  }
>  
> +/* ATC recovery upon ATC invalidation timeout */
> +struct arm_smmu_atc_recovery_param {
> +	struct arm_smmu_device *smmu;
> +	struct pci_dev *pdev;
> +	u32 sid;
> +
> +	struct work_struct work;
> +	struct list_head node;
> +};
> +
> +static void arm_smmu_atc_recovery_worker(struct work_struct *work)
> +{
> +	struct arm_smmu_atc_recovery_param *param =
> +		container_of(work, struct arm_smmu_atc_recovery_param, work);
> +	struct pci_dev *pdev;
> +
> +	scoped_guard(mutex, &param->smmu->streams_mutex) {
> +		struct arm_smmu_master *master;
> +
> +		master = arm_smmu_find_master(param->smmu, param->sid);
> +		if (!master || WARN_ON(!dev_is_pci(master->dev)))
> +			goto free_param;
> +		pdev = to_pci_dev(master->dev);
> +		pci_dev_get(pdev);
> +	}
> +
> +	scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock) {
> +		struct arm_smmu_atc_recovery_param *e;
> +
> +		list_for_each_entry(e, &param->smmu->atc_recovery.list, node) {
> +			/* Device is already being recovered */
> +			if (e->pdev == pdev)
> +				goto put_pdev;
> +		}
> +		param->pdev = pdev;
> +		list_add(&param->node, &param->smmu->atc_recovery.list);
> +	}
> +
> +	/*
> +	 * Stop DMA (PCI) and block ATS (IOMMU) immediately, to prevent memory
> +	 * corruption. This must take pci_dev_lock to prevent any racy unplug.
> +	 *
> +	 * If pci_dev_reset_iommu_prepare() fails, pci_reset_function will call
> +	 * it again internally.
> +	 */
> +	pci_dev_lock(pdev);
> +	pci_clear_master(pdev);
> +	if (pci_dev_reset_iommu_prepare(pdev))
> +		pci_err(pdev, "failed to block ATS!\n");
> +	pci_dev_unlock(pdev);
> +
> +	/*
> +	 * ATC timeout indicates the device has stopped responding to coherence
> +	 * protocol requests. The only safe recovery is a reset to flush stale
> +	 * cached translations. Note that pci_reset_function() internally calls
> +	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
> +	 * if PCI-level reset fails.
> +	 */
> +	if (!pci_reset_function(pdev)) {

I'm a little uncomfortable with this, why is an IOMMU driver poking into
the PCI mechanics? I agree that a reset might be the right thing to do
here but we wouldn't want the IOMMU driver to trigger it.. Ideally, we'd
need a mechanism that bubbles up fatal IOMMU faults to the PCI core and
let it decide/perform the reset. Maybe this could mean adding another op
to struct pci_error_handlers or something like that?

> +		/*
> +		 * If reset succeeds, set BME back. Otherwise, fence the system
> +		 * from a faulty device, in which case user will have to replug
> +		 * the device to invoke pci_set_master().
> +		 */
> +		pci_dev_lock(pdev);

Why are we using spinlock_irqsave across the worker? Also, why does
atc_recovery.lock have to be a spinlock? The workers run in process
context, and I also don't see anyone else take the atc_recovery.lock?
Why does it need to be irq-safe? If this can somehow run in irq context,
we also seem to be using pci_dev_lock and streams_mutex across the
worker? Mixing mutexes with spinlocks is brittle and invites 
"sleep-while-atomic" bugs in future refactors..

> +		pci_set_master(pdev);
> +		pci_dev_unlock(pdev);
> +	}
> +	scoped_guard(spinlock_irqsave, &param->smmu->atc_recovery.lock)
> +		list_del(&param->node);
> +put_pdev:
> +	pci_dev_put(pdev);
> +free_param:
> +	kfree(param);
> +}
> +
> +static int arm_smmu_sched_atc_recovery(struct arm_smmu_device *smmu, u32 sid)
> +{
> +	struct arm_smmu_atc_recovery_param *param;
> +
> +	param = kzalloc_obj(*param, GFP_ATOMIC);
> +	if (!param)
> +		return -ENOMEM;
> +	param->smmu = smmu;
> +	param->sid = sid;
> +
> +	INIT_WORK(&param->work, arm_smmu_atc_recovery_worker);
> +	queue_work(system_unbound_wq, &param->work);
> +	return 0;
> +}
> +
>  void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
>  			      struct arm_smmu_cmdq *cmdq)
>  {
> @@ -441,11 +541,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
>  	case CMDQ_ERR_CERROR_ATC_INV_IDX:
>  		/*
>  		 * ATC Invalidation Completion timeout. CONS is still pointing
> -		 * at the CMD_SYNC. Attempt to complete other pending commands
> -		 * by repeating the CMD_SYNC, though we might well end up back
> -		 * here since the ATC invalidation may still be pending.
> +		 * at the CMD_SYNC. Rewind it to read the ATC_INV command.
>  		 */
> -		return;
> +		cons = queue_prev_cons(&q->llq, cons);

What about batched commands? We might never know which command caused
the timeout? This just fetches the "previous" command and we might end
up resetting the wrong device if the invalidations were batched?

> +		fallthrough;
>  	case CMDQ_ERR_CERROR_ILL_IDX:
>  	default:
>  		break;
> @@ -456,6 +555,27 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
>  	 * not to touch any of the shadow cmdq state.
>  	 */
>  	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
> +
> +	if (idx == CMDQ_ERR_CERROR_ATC_INV_IDX) {
> +		/*
> +		 * Since commands can be issued in batch making it difficult to
> +		 * identify which CMDQ_OP_ATC_INV actually timed out, the driver
> +		 * must ensure only CMDQ_OP_ATC_INV commands for the same device
> +		 * can be batched.
> +		 */
> +		WARN_ON(FIELD_GET(CMDQ_0_OP, cmd[0]) != CMDQ_OP_ATC_INV);
> +
> +		/*
> +		 * If we failed to schedule a recovery worker, we would well end
> +		 * up back here since the ATC invalidation may still be pending.
> +		 * This gives us another chance to reschedule a recovery worker.
> +		 */
> +		arm_smmu_sched_atc_recovery(smmu,
> +					    FIELD_GET(CMDQ_ATC_0_SID, cmd[0]));

I guess instead of attempting recovery, could we have the worker
attempt to mark the STE invalid / ABORT. Once we Ack the Gerror, we
should be able to issue a CFGI_STE?

> +		return;
> +	}
> +
> +	/* idx == CMDQ_ERR_CERROR_ILL_IDX */
>  	dev_err(smmu->dev, "skipping command in error state:\n");
>  	for (i = 0; i < ARRAY_SIZE(cmd); ++i)
>  		dev_err(smmu->dev, "\t0x%016llx\n", (unsigned long long)cmd[i]);
> @@ -3942,6 +4062,9 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
>  {
>  	int ret;
>  
> +	INIT_LIST_HEAD(&smmu->atc_recovery.list);
> +	spin_lock_init(&smmu->atc_recovery.lock);
> +
>  	mutex_init(&smmu->streams_mutex);
>  	smmu->streams = RB_ROOT;

Thanks,
Praan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 15:56           ` Jason Gunthorpe
@ 2026-03-10 19:34             ` Pranjal Shrivastava
  0 siblings, 0 replies; 44+ messages in thread
From: Pranjal Shrivastava @ 2026-03-10 19:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Nicolin Chen, will, joro, bhelgaas, rafael, lenb,
	kees, baolu.lu, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 11:56:46AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2026 at 03:24:20PM +0000, Robin Murphy wrote:
> > On 2026-03-05 11:52 pm, Jason Gunthorpe wrote:
> > > On Thu, Mar 05, 2026 at 01:06:21PM -0800, Nicolin Chen wrote:
> > > > That sounds like the IOPF implementation. Maybe inventing another
> > > > IOMMU_FAULT_ATC_TIMEOUT to reuse the existing infrastructure would
> > > > make things cleaner.
> > > 
> > > I think the routing is quite different, IOPF wants to route an event
> > > the domain creator, here you want to route an event to the IOMMU core
> > > then the PCIe RAS callbacks.
> > > 
> > > IDK if there is much to be reused there, especially since IOPF
> > > requires a memory allocation and ideally we should not be allocating
> > > memory to resolve this critical error condition.
> > 
> > Yeah, sorry, for a moment there I somehow forgot that we can expect to use
> > ATS without PRI, so indeed tying this to IOPF wouldn't be appropriate. And
> > given the general difficulty of trying to infer what went wrong and what to
> > do from the CMDQ contents alone, I do like your idea of trying to return a
> > new kind of sync failure back to arm_smmu_atc_inv_{master,domain}() so that
> > we can take any defensive action from there, with all the information to
> > hand. We'd just have to ensure that if a large set of ATCI commands needs to
> > span multiple batches, every batch must contain its own sync (since if some
> > other batch of unrelated commands could get interleaved in the middle and
> > issue a sync that then fails due to someone else's ATC timeout, everything's
> > likely to get confused and go wrong).
> 
> Yeah, that all makes sense to me.
> 
> The batching issue is scary, we definately can't allow an ATC
> invalidation to be pushed without a SYNC that localizes any failure to
> this specific thread, or we can't properly disambiguate the failures
> anymore.
> 
> My feeling is when the sync "fails", it can bubble up the error and we
> can get back to the invalidation list processor which can then see it
> failed to process an ATC batch and take an appropriate action.
> 

+1 just saw this thread (replied something similar)

> > The fiddly thing then is that we might also have to be prepared to "handle"
> > CMD_SYNC timeout by manually checking for GERRORs, in case the whole
> > invalidation is in the context of an dma_unmap within some other device's
> > IRQ handler, which happens to be on the same CPU where the GERROR IRQ is now
> > pending, but can't be taken until we can complete the inv and return out of
> > the current IRQ :/
> 
> IIRC didn't the PM patches propose to add this anyhow?

If this is regarding the runtime pm patches, I've tried to address the
Gerror issue (pointed out by you in v4) in the v5 [1]

Thanks,
Praan

[1] https://lore.kernel.org/all/20260126151157.3418145-9-praan@google.com/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 13:02               ` Jason Gunthorpe
  2026-03-06 19:20                 ` Nicolin Chen
@ 2026-03-10 19:40                 ` Pranjal Shrivastava
  2026-03-10 19:57                   ` Nicolin Chen
  1 sibling, 1 reply; 44+ messages in thread
From: Pranjal Shrivastava @ 2026-03-10 19:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, will, robin.murphy, joro, bhelgaas, rafael, lenb,
	kees, baolu.lu, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 09:02:02AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 05, 2026 at 09:06:17PM -0800, Nicolin Chen wrote:
> > On Thu, Mar 05, 2026 at 09:33:47PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Mar 05, 2026 at 05:29:22PM -0800, Nicolin Chen wrote:
> > > 
> > > > But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
> > > > CMD. In my case where ATC_INV irq occurs, the return value from the
> > > > arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
> > > > are also matched. Actually, at this point that NOP ISR has already
> > > > finished.
> > > 
> > > Yes, you'd need a sneaky way to convay the error from the ISR to the
> > > cmdlist code that didn't harm performance. Maybe we could come up with
> > > something, but if it works replacing the NOP with flush sounds fairly
> > > appealing - though can you do a single WORD edit to the STE that will
> > > block translated requests? Zero EATS?
> > 
> > Yea. I can give that a try.
> 
> This also really needs to go after the invalidation changes because it
> is feasible to also edit the lockless RCU invalidation list from the
> ISR and disable the ATC for the failed device too.
> 
> > > Also, will the SMMU start spamming with blocked translation events or
> > > something that will need suppression too?
> > 
> > CD.R=0 can suppress fault records, but we would need to override
> > that in every CD of the device.
> 
> That's too much to do from ISR, but maybe we can do it from a WQ..
> 

(Skimming through these, apologies if I'm losing context), shouldn't we
do all that (marking it as an inv STE / abort STE, suppressing the
faults) in the worker instead of trying to reset/recover the device?

> Jason

Thanks
Praan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-10 19:16   ` Pranjal Shrivastava
@ 2026-03-10 19:51     ` Nicolin Chen
  2026-03-10 20:00       ` Pranjal Shrivastava
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolin Chen @ 2026-03-10 19:51 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Tue, Mar 10, 2026 at 07:16:02PM +0000, Pranjal Shrivastava wrote:
> On Wed, Mar 04, 2026 at 09:21:42PM -0800, Nicolin Chen wrote:
> > +	/*
> > +	 * ATC timeout indicates the device has stopped responding to coherence
> > +	 * protocol requests. The only safe recovery is a reset to flush stale
> > +	 * cached translations. Note that pci_reset_function() internally calls
> > +	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
> > +	 * if PCI-level reset fails.
> > +	 */
> > +	if (!pci_reset_function(pdev)) {
> 
> I'm a little uncomfortable with this, why is an IOMMU driver poking into
> the PCI mechanics? I agree that a reset might be the right thing to do
> here but we wouldn't want the IOMMU driver to trigger it.. Ideally, we'd
> need a mechanism that bubbles up fatal IOMMU faults to the PCI core and
> let it decide/perform the reset. Maybe this could mean adding another op
> to struct pci_error_handlers or something like that?

Robin/Jason already had similar remarks (to most of your other
comments as well). I have acked their comments, and am already
reworking on these.

> > +		/*
> > +		 * If reset succeeds, set BME back. Otherwise, fence the system
> > +		 * from a faulty device, in which case user will have to replug
> > +		 * the device to invoke pci_set_master().
> > +		 */
> > +		pci_dev_lock(pdev);
> 
> Why are we using spinlock_irqsave across the worker? Also, why does
> atc_recovery.lock have to be a spinlock? The workers run in process
> context, and I also don't see anyone else take the atc_recovery.lock?

I guess mutex would be okay here, since there is no other place
access the linked list. Pairing a linked list with a spinlock is
just a common practice..

> Why does it need to be irq-safe? If this can somehow run in irq context,
> we also seem to be using pci_dev_lock and streams_mutex across the
> worker?

pci_dev_lock was to fence race on the PCI level. Yet, the entire
BME call is probably not a good idea. So, dropping that means we
won't need pci_dev_lock.

> Mixing mutexes with spinlocks is brittle and invites 
> "sleep-while-atomic" bugs in future refactors..

Either streams_mutex or atc_recovery.lock was scoped for only a
few lines each section. Each was released before the other one
was taken. Where is the "mixing" or "sleep-while-atomic" case?

Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-10 19:40                 ` Pranjal Shrivastava
@ 2026-03-10 19:57                   ` Nicolin Chen
  2026-03-10 20:04                     ` Pranjal Shrivastava
  0 siblings, 1 reply; 44+ messages in thread
From: Nicolin Chen @ 2026-03-10 19:57 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Jason Gunthorpe, will, robin.murphy, joro, bhelgaas, rafael, lenb,
	kees, baolu.lu, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Tue, Mar 10, 2026 at 07:40:56PM +0000, Pranjal Shrivastava wrote:
> On Fri, Mar 06, 2026 at 09:02:02AM -0400, Jason Gunthorpe wrote:
> > On Thu, Mar 05, 2026 at 09:06:17PM -0800, Nicolin Chen wrote:
> > > On Thu, Mar 05, 2026 at 09:33:47PM -0400, Jason Gunthorpe wrote:
> > > > On Thu, Mar 05, 2026 at 05:29:22PM -0800, Nicolin Chen wrote:
> > > > 
> > > > > But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
> > > > > CMD. In my case where ATC_INV irq occurs, the return value from the
> > > > > arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
> > > > > are also matched. Actually, at this point that NOP ISR has already
> > > > > finished.
> > > > 
> > > > Yes, you'd need a sneaky way to convay the error from the ISR to the
> > > > cmdlist code that didn't harm performance. Maybe we could come up with
> > > > something, but if it works replacing the NOP with flush sounds fairly
> > > > appealing - though can you do a single WORD edit to the STE that will
> > > > block translated requests? Zero EATS?
> > > 
> > > Yea. I can give that a try.
> > 
> > This also really needs to go after the invalidation changes because it
> > is feasible to also edit the lockless RCU invalidation list from the
> > ISR and disable the ATC for the failed device too.
> > 
> > > > Also, will the SMMU start spamming with blocked translation events or
> > > > something that will need suppression too?
> > > 
> > > CD.R=0 can suppress fault records, but we would need to override
> > > that in every CD of the device.
> > 
> > That's too much to do from ISR, but maybe we can do it from a WQ..
> > 
> 
> (Skimming through these, apologies if I'm losing context), shouldn't we
> do all that (marking it as an inv STE / abort STE, suppressing the
> faults) in the worker instead of trying to reset/recover the device?

EATS should be unset asap to avoid memory corruption. It's best
to do in the unmap() context where the page isn't reclaimed yet
by the kernel.

Worker thread will be a bit late, but it is good enough for any
further step.

Nicolin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-06 20:26                   ` Jason Gunthorpe
@ 2026-03-10 20:00                     ` Samiullah Khawaja
  2026-03-11 12:12                       ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Samiullah Khawaja @ 2026-03-10 20:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Nicolin Chen, will, robin.murphy, joro, bhelgaas,
	rafael, lenb, praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Fri, Mar 06, 2026 at 04:26:52PM -0400, Jason Gunthorpe wrote:
>On Fri, Mar 06, 2026 at 08:22:08PM +0000, Samiullah Khawaja wrote:
>
>> But do you think doing the timeout logic without fencing would be good
>> enough?
>
>It is what ARM and AMD do, so I wouldn't object to it.

I think without any back pressure to the caller, a device will be able
to fill the invalidation queue with device IOTLB invalidations that get
stuck until the HW timeout occurs.
>
>> Currently VT-d blocks itself, until it gets an Invalidation Timeout
>> from HW, and system ends up in a hardlockup since interrupts are
>> disabled.
>>
>> Are you concerned that if fencing is done without an RAS flow, the
>> device might not be able to detect the failure (if it really needs ATS
>> to work)?
>
>Yes, and then the device is badly locked because nothing will fix the
>IOMMU fence.
>
>VFIO might fix it if it is restarted, but other approahces like
>rmmod/insmod won't restore the broken device.
>
>So I'd rather see a more complete solution before we add fencing to
>the iommu drivers. Minimally userspace doing a rmmod, flr, insmod
>should be able to restore the device.
>
>Then auto-FLR through RAS could sit on top of that.
>
>> I am thinking, we can do translated fence and timeout change for VT-d.
>> And the device can use existing RAS mechanism to recover itself. This
>> way we atleast make sure that caller of flush can reuse the memory/IOVAs
>> without UAFs.
>
>Without a larger framework to unfence I think this will get devices
>stuck..
>
>Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-10 19:51     ` Nicolin Chen
@ 2026-03-10 20:00       ` Pranjal Shrivastava
  0 siblings, 0 replies; 44+ messages in thread
From: Pranjal Shrivastava @ 2026-03-10 20:00 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, joro, bhelgaas, jgg, rafael, lenb, kees,
	baolu.lu, smostafa, Alexander.Grest, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci,
	vsethi

On Tue, Mar 10, 2026 at 12:51:51PM -0700, Nicolin Chen wrote:
> On Tue, Mar 10, 2026 at 07:16:02PM +0000, Pranjal Shrivastava wrote:
> > On Wed, Mar 04, 2026 at 09:21:42PM -0800, Nicolin Chen wrote:
> > > +	/*
> > > +	 * ATC timeout indicates the device has stopped responding to coherence
> > > +	 * protocol requests. The only safe recovery is a reset to flush stale
> > > +	 * cached translations. Note that pci_reset_function() internally calls
> > > +	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
> > > +	 * if PCI-level reset fails.
> > > +	 */
> > > +	if (!pci_reset_function(pdev)) {
> > 
> > I'm a little uncomfortable with this, why is an IOMMU driver poking into
> > the PCI mechanics? I agree that a reset might be the right thing to do
> > here but we wouldn't want the IOMMU driver to trigger it.. Ideally, we'd
> > need a mechanism that bubbles up fatal IOMMU faults to the PCI core and
> > let it decide/perform the reset. Maybe this could mean adding another op
> > to struct pci_error_handlers or something like that?
> 
> Robin/Jason already had similar remarks (to most of your other
> comments as well). I have acked their comments, and am already
> reworking on these.
> 

Yea just saw those discussions as well, replied before seeing those.

> > > +		/*
> > > +		 * If reset succeeds, set BME back. Otherwise, fence the system
> > > +		 * from a faulty device, in which case user will have to replug
> > > +		 * the device to invoke pci_set_master().
> > > +		 */
> > > +		pci_dev_lock(pdev);
> > 
> > Why are we using spinlock_irqsave across the worker? Also, why does
> > atc_recovery.lock have to be a spinlock? The workers run in process
> > context, and I also don't see anyone else take the atc_recovery.lock?
> 
> I guess mutex would be okay here, since there is no other place
> access the linked list. Pairing a linked list with a spinlock is
> just a common practice..
> 

Ack agreed. No problem with the type of the lock, just questioning the
choice to use spinlock_irqsave et al since I don't believe this could be
in interrupt context.

> > Why does it need to be irq-safe? If this can somehow run in irq context,
> > we also seem to be using pci_dev_lock and streams_mutex across the
> > worker?
> 
> pci_dev_lock was to fence race on the PCI level. Yet, the entire
> BME call is probably not a good idea. So, dropping that means we
> won't need pci_dev_lock.
> 

Ack.

> > Mixing mutexes with spinlocks is brittle and invites 
> > "sleep-while-atomic" bugs in future refactors..
> 
> Either streams_mutex or atc_recovery.lock was scoped for only a
> few lines each section. Each was released before the other one
> was taken. Where is the "mixing" or "sleep-while-atomic" case?

The case doesn't exist yet, I meant it as a warning against future
re-factors, since I didn't see the need to use a spinlock here, I didn't
understand why couldn't all 3 be mutexes when the existing 2 already
were.

Praan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-10 19:57                   ` Nicolin Chen
@ 2026-03-10 20:04                     ` Pranjal Shrivastava
  0 siblings, 0 replies; 44+ messages in thread
From: Pranjal Shrivastava @ 2026-03-10 20:04 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, will, robin.murphy, joro, bhelgaas, rafael, lenb,
	kees, baolu.lu, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Tue, Mar 10, 2026 at 12:57:30PM -0700, Nicolin Chen wrote:
> On Tue, Mar 10, 2026 at 07:40:56PM +0000, Pranjal Shrivastava wrote:
> > On Fri, Mar 06, 2026 at 09:02:02AM -0400, Jason Gunthorpe wrote:
> > > On Thu, Mar 05, 2026 at 09:06:17PM -0800, Nicolin Chen wrote:
> > > > On Thu, Mar 05, 2026 at 09:33:47PM -0400, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 05, 2026 at 05:29:22PM -0800, Nicolin Chen wrote:
> > > > > 
> > > > > > But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
> > > > > > CMD. In my case where ATC_INV irq occurs, the return value from the
> > > > > > arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
> > > > > > are also matched. Actually, at this point that NOP ISR has already
> > > > > > finished.
> > > > > 
> > > > > Yes, you'd need a sneaky way to convay the error from the ISR to the
> > > > > cmdlist code that didn't harm performance. Maybe we could come up with
> > > > > something, but if it works replacing the NOP with flush sounds fairly
> > > > > appealing - though can you do a single WORD edit to the STE that will
> > > > > block translated requests? Zero EATS?
> > > > 
> > > > Yea. I can give that a try.
> > > 
> > > This also really needs to go after the invalidation changes because it
> > > is feasible to also edit the lockless RCU invalidation list from the
> > > ISR and disable the ATC for the failed device too.
> > > 
> > > > > Also, will the SMMU start spamming with blocked translation events or
> > > > > something that will need suppression too?
> > > > 
> > > > CD.R=0 can suppress fault records, but we would need to override
> > > > that in every CD of the device.
> > > 
> > > That's too much to do from ISR, but maybe we can do it from a WQ..
> > > 
> > 
> > (Skimming through these, apologies if I'm losing context), shouldn't we
> > do all that (marking it as an inv STE / abort STE, suppressing the
> > faults) in the worker instead of trying to reset/recover the device?
> 
> EATS should be unset asap to avoid memory corruption. It's best
> to do in the unmap() context where the page isn't reclaimed yet
> by the kernel.
> 

Makes sense.

> Worker thread will be a bit late, but it is good enough for any
> further step.
>

Ack.
Praan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
  2026-03-10 20:00                     ` Samiullah Khawaja
@ 2026-03-11 12:12                       ` Jason Gunthorpe
  0 siblings, 0 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2026-03-11 12:12 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: Baolu Lu, Nicolin Chen, will, robin.murphy, joro, bhelgaas,
	rafael, lenb, praan, kees, smostafa, Alexander.Grest, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel,
	linux-acpi, linux-pci, vsethi

On Tue, Mar 10, 2026 at 08:00:45PM +0000, Samiullah Khawaja wrote:
> On Fri, Mar 06, 2026 at 04:26:52PM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 06, 2026 at 08:22:08PM +0000, Samiullah Khawaja wrote:
> > 
> > > But do you think doing the timeout logic without fencing would be good
> > > enough?
> > 
> > It is what ARM and AMD do, so I wouldn't object to it.
> 
> I think without any back pressure to the caller, a device will be able
> to fill the invalidation queue with device IOTLB invalidations that get
> stuck until the HW timeout occurs.

Yes, all drivers have this problem currently.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2026-03-11 12:12 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-05  5:21 [PATCH v1 0/2] iommu/arm-smmu-v3: Reset PCI device upon ATC invalidate timeout Nicolin Chen
2026-03-05  5:21 ` [PATCH v1 1/2] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds Nicolin Chen
2026-03-05  5:21 ` [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts Nicolin Chen
2026-03-05 15:15   ` kernel test robot
2026-03-05 15:24   ` Robin Murphy
2026-03-05 21:06     ` Nicolin Chen
2026-03-05 23:30       ` Nicolin Chen
2026-03-05 23:52       ` Jason Gunthorpe
2026-03-06 15:24         ` Robin Murphy
2026-03-06 15:56           ` Jason Gunthorpe
2026-03-10 19:34             ` Pranjal Shrivastava
2026-03-05 15:39   ` Jason Gunthorpe
2026-03-05 21:15     ` Nicolin Chen
2026-03-05 23:41       ` Jason Gunthorpe
2026-03-06  1:29         ` Nicolin Chen
2026-03-06  1:33           ` Jason Gunthorpe
2026-03-06  5:06             ` Nicolin Chen
2026-03-06 13:02               ` Jason Gunthorpe
2026-03-06 19:20                 ` Nicolin Chen
2026-03-06 19:22                   ` Jason Gunthorpe
2026-03-06 19:39                     ` Nicolin Chen
2026-03-06 19:47                       ` Jason Gunthorpe
2026-03-10 19:40                 ` Pranjal Shrivastava
2026-03-10 19:57                   ` Nicolin Chen
2026-03-10 20:04                     ` Pranjal Shrivastava
2026-03-06 13:22         ` Robin Murphy
2026-03-06 14:01           ` Jason Gunthorpe
2026-03-06 20:18             ` Nicolin Chen
2026-03-06 20:22               ` Jason Gunthorpe
2026-03-06 20:34                 ` Nicolin Chen
2026-03-06  3:22     ` Baolu Lu
2026-03-06 13:00       ` Jason Gunthorpe
2026-03-06 19:35         ` Samiullah Khawaja
2026-03-06 19:43           ` Jason Gunthorpe
2026-03-06 19:59             ` Samiullah Khawaja
2026-03-06 20:03               ` Jason Gunthorpe
2026-03-06 20:22                 ` Samiullah Khawaja
2026-03-06 20:26                   ` Jason Gunthorpe
2026-03-10 20:00                     ` Samiullah Khawaja
2026-03-11 12:12                       ` Jason Gunthorpe
2026-03-06  2:35   ` kernel test robot
2026-03-10 19:16   ` Pranjal Shrivastava
2026-03-10 19:51     ` Nicolin Chen
2026-03-10 20:00       ` Pranjal Shrivastava

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox