* [PATCH v5 01/18] PCI: Don't suspend IOMMU when probing reset capability
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 02/18] PCI/CXL: Probe the underlying bus reset in cxl_reset_bus_function() Nicolin Chen
` (16 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
reset_method_store() in drivers/pci/pci-sysfs.c discovers supported reset
methods by calling reset_fn(pdev, PCI_RESET_PROBE, ...) without holding a
device_lock, since the probe path is expected to query the device's reset
capability without changing device state.
However, pci_reset_bus_function() and __pci_dev_specific_reset() have
violated that contract since pci_dev_reset_iommu_prepare/done() were added:
these helpers move the device into a blocking domain and abruptly abort any
in-flight DMA. Doing this for a probe -- a state-query call that does not
even hold device_lock -- can cause driver timeouts and data loss on a
device that is actively performing DMA.
The peer reset helpers all handle this correctly: they short-circuit on a
probe input before touching the IOMMU.
Skip pci_dev_reset_iommu_prepare()/_done() entirely when probe is set. The
inner reset routines already implement their own probe semantics, and they
perform the capability checks and return without changing device state.
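The gating described above can be sketched in a small userspace model. This is only an illustration, not the kernel code: every mock_* name and the two counters are hypothetical stand-ins, assuming the IOMMU helpers can be reduced to call counters.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the IOMMU helpers; they only count calls. */
static int iommu_prepare_calls;
static int iommu_done_calls;

static int mock_iommu_prepare(void)
{
	iommu_prepare_calls++;
	return 0;
}

static void mock_iommu_done(void)
{
	iommu_done_calls++;
}

/* Hypothetical inner routine: a probe only answers "is this supported?". */
static int mock_inner_reset(bool supported, bool probe)
{
	if (!supported)
		return -ENOTTY;
	if (probe)
		return 0;	/* capability check only, no state change */
	return 0;		/* the actual reset would happen here */
}

/* Wrapper following the pattern in this patch: no IOMMU work on a probe. */
static int mock_reset_function(bool supported, bool probe)
{
	int rc;

	if (!probe) {
		rc = mock_iommu_prepare();
		if (rc)
			return rc;
	}

	rc = mock_inner_reset(supported, probe);

	if (!probe)
		mock_iommu_done();
	return rc;
}
```

Calling mock_reset_function(true, true) leaves both counters untouched, while a call with probe set to false bumps each counter once.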
Fixes: f5b16b802174 ("PCI: Suspend iommu function prior to resetting a device")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/pci/pci.c | 13 ++++++++-----
drivers/pci/quirks.c | 13 ++++++++-----
2 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 77b17b13ee615..01cf310540561 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4946,10 +4946,12 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
if (bridge && pcie_is_cxl(bridge) && cxl_sbr_masked(bridge))
return -ENOTTY;
- rc = pci_dev_reset_iommu_prepare(dev);
- if (rc) {
- pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", rc);
- return rc;
+ if (!probe) {
+ rc = pci_dev_reset_iommu_prepare(dev);
+ if (rc) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", rc);
+ return rc;
+ }
}
rc = pci_dev_reset_slot_function(dev, probe);
@@ -4958,7 +4960,8 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
rc = pci_parent_bus_reset(dev, probe);
done:
- pci_dev_reset_iommu_done(dev);
+ if (!probe)
+ pci_dev_reset_iommu_done(dev);
return rc;
}
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index b09f27f7846fc..8ecd1bc561d28 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4250,14 +4250,17 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
{
int ret;
- ret = pci_dev_reset_iommu_prepare(dev);
- if (ret) {
- pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
- return ret;
+ if (!probe) {
+ ret = pci_dev_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
+ return ret;
+ }
}
ret = i->reset(dev, probe);
- pci_dev_reset_iommu_done(dev);
+ if (!probe)
+ pci_dev_reset_iommu_done(dev);
return ret;
}
--
2.43.0
* [PATCH v5 02/18] PCI/CXL: Probe the underlying bus reset in cxl_reset_bus_function()
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 01/18] PCI: Don't suspend IOMMU when probing reset capability Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 03/18] PCI: Propagate FLR return values to callers Nicolin Chen
` (15 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
cxl_reset_bus_function() reports "supported" to a probe after checking only
that the upstream bridge carries a CXL port DVSEC. The underlying bus reset
can still be unavailable, e.g. on a bus shared with other devices, so both
the reset_methods[] array and the reset_method sysfs node end up listing a
"cxl_bus" that is guaranteed to fail with -ENOTTY when it is attempted.
Probe the underlying pci_dev_reset_slot_function() and then, if it is not
applicable, pci_parent_bus_reset(). These are the same two checks that the
actual reset runs, so a shared-bus CXL device no longer advertises a method
that can never succeed.
Probing via pci_reset_bus_function() would not work: its cxl_sbr_masked()
check rejects every CXL port with a masked SBR, while the do-reset path in
this function unmasks the SBR before resetting. Such a port would wrongly
probe as unsupported.
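The probe order described above can be modeled in a few lines of userspace C. This is a sketch under assumptions: struct mock_bus and mock_cxl_probe() are hypothetical, and the two fields simply hold whatever the slot and parent-bus probes would return.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical probe results for the two underlying reset paths. */
struct mock_bus {
	int slot_probe;		/* result of probing the slot reset */
	int parent_probe;	/* result of probing the parent bus reset */
};

/*
 * Mirrors the probe branch added by this patch: -ENOTTY from the slot probe
 * means "not applicable", so fall through to the parent bus probe. Any other
 * result (success or a real error) is returned as is.
 */
static int mock_cxl_probe(const struct mock_bus *bus)
{
	int rc = bus->slot_probe;

	if (rc != -ENOTTY)
		return rc;
	return bus->parent_probe;
}
```

With both fields set to -ENOTTY, as on a bus shared with other devices, the probe reports -ENOTTY and the method would not be advertised.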
Also pass an explicit PCI_RESET_DO_RESET at the do-reset call site, since
probe is always false at that point.
Fixes: 53c49b6e6dd2e ("PCI/CXL: Add 'cxl_bus' reset method for devices below CXL Ports")
Assisted-by: Claude:claude-fable-5
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/pci/pci.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 01cf310540561..8102989673333 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4979,8 +4979,16 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
if (!dvsec)
return -ENOTTY;
- if (probe)
- return 0;
+ /*
+ * Do not probe via pci_reset_bus_function(), which would reject a
+ * masked SBR that the do-reset path below unmasks before resetting.
+ */
+ if (probe) {
+ rc = pci_dev_reset_slot_function(dev, PCI_RESET_PROBE);
+ if (rc != -ENOTTY)
+ return rc;
+ return pci_parent_bus_reset(dev, PCI_RESET_PROBE);
+ }
rc = pci_read_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL, &reg);
if (rc)
@@ -5000,7 +5008,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
val);
}
- rc = pci_reset_bus_function(dev, probe);
+ rc = pci_reset_bus_function(dev, PCI_RESET_DO_RESET);
if (reg != val)
pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
--
2.43.0
* [PATCH v5 03/18] PCI: Propagate FLR return values to callers
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 01/18] PCI: Don't suspend IOMMU when probing reset capability Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 02/18] PCI/CXL: Probe the underlying bus reset in cxl_reset_bus_function() Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 04/18] iommu: Convert gdev->blocked from bool to enum gdev_blocked Nicolin Chen
` (14 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
A reset failure implies that the device might be unreliable, e.g. its ATC
might still retain stale entries. The IOMMU layer therefore cannot trust
the device to resume its ATS function, since doing so could lead to memory
corruption. A subsequent change to pci_dev_reset_iommu_done() will keep
such a failed device blocked instead of restoring its IOMMU pathway.
Several quirk functions in the pci_dev_reset_methods array call pcie_flr()
but discard its return value and unconditionally report success, hiding a
failed reset from __pci_reset_function_locked(). Return the value instead.
Nothing is short-circuited on failure: each quirk still runs its restore or
settle steps as before, and reset_hinic_vf_dev() also reports its firmware
FLR-completion timeout, keeping an earlier pcie_flr() error if any.
Note that a kept pcie_flr() error does not misreport a normal reset: as per
the erratum in commit 411e2a43d210e ("PCI: Work around Huawei Intelligent
NIC VF FLR erratum"), the VF responds to config reads before its firmware
completes the reset processing, so pci_dev_wait() returns promptly instead
of timing out.
However, pcie_flr() reports a device that did not come back after the FLR
as -ENOTTY, and __pci_reset_function_locked() takes any -ENOTTY as "try the
next method". Since these quirks have always returned success, escalating
would be a new and harmful behavior: the next method is typically the plain
"flr" method, which would re-run the same FLR without the extra steps the
quirk wraps around its pcie_flr() call. Add quirk_flr_err() converting that
-ENOTTY to -ETIMEDOUT, so the cascade still stops at the quirk as before.
The early "not applicable" -ENOTTY returns are left intact, and they still
ask the caller to try the next method.
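To illustrate why the conversion matters, here is a simplified userspace model of the method cascade. Only quirk_flr_err() is taken from this patch; the mock_* functions, the plain_flr_calls counter and the loop are hypothetical simplifications of how __pci_reset_function_locked() is described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same conversion as quirk_flr_err() in this patch. */
static int quirk_flr_err(int err)
{
	return err == -ENOTTY ? -ETIMEDOUT : err;
}

typedef int (*mock_reset_fn)(void);

static int plain_flr_calls;

/* Hypothetical FLR that times out, reported as -ENOTTY. */
static int mock_flr_timeout(void)
{
	return -ENOTTY;
}

/* Hypothetical quirk that passes the raw FLR error through. */
static int mock_quirk_raw(void)
{
	return mock_flr_timeout();
}

/* Hypothetical quirk that converts the FLR error as this patch does. */
static int mock_quirk_converted(void)
{
	return quirk_flr_err(mock_flr_timeout());
}

/* Hypothetical plain "flr" method that follows the quirk in the list. */
static int mock_plain_flr(void)
{
	plain_flr_calls++;
	return 0;
}

/* Simplified cascade: only -ENOTTY moves on to the next method. */
static int mock_reset_cascade(const mock_reset_fn *methods, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int rc = methods[i]();

		if (rc != -ENOTTY)
			return rc;
	}
	return -ENOTTY;
}
```

In this model, a raw -ENOTTY from the quirk lets the cascade re-run the plain FLR, whereas the converted -ETIMEDOUT stops at the quirk.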
Convert it at the quirk level rather than in pcie_flr() or pci_dev_wait(),
because those also serve the native "flr"/"af_flr"/"pm" methods, where the
-ENOTTY on timeout is deliberate per commit 91295d79d658 ("PCI: Handle FLR
failure and allow other reset types"): a timed-out native FLR escalates to
a stronger reset (e.g. bus reset), which skips no workaround there and may
recover the device. Converting in the core would also change the value seen
by every direct pcie_flr() caller.
This is not a bug fix, since these functions have always returned success
and the propagated values are consumed only by later changes.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Assisted-by: Claude:claude-fable-5
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/pci/quirks.c | 43 ++++++++++++++++++++++++++++++-------------
1 file changed, 30 insertions(+), 13 deletions(-)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 8ecd1bc561d28..7f6d1574fe2bf 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3934,6 +3934,18 @@ DECLARE_PCI_FIXUP_SUSPEND_LATE(PCI_VENDOR_ID_INTEL,
* reset a single function if other methods (e.g. FLR, PM D0->D3) are
* not available.
*/
+
+/*
+ * pcie_flr() reports a device that did not come back after the FLR as -ENOTTY.
+ * The caller __pci_reset_function_locked() takes that as "try the next method"
+ * and ignores the error, which is wrong for the quirk functions below. Return
+ * -ETIMEDOUT instead.
+ */
+static int quirk_flr_err(int err)
+{
+ return err == -ENOTTY ? -ETIMEDOUT : err;
+}
+
static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, bool probe)
{
/*
@@ -3945,7 +3957,7 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, bool probe)
* supported.
*/
if (!probe)
- pcie_flr(dev);
+ return quirk_flr_err(pcie_flr(dev));
return 0;
}
@@ -4003,6 +4015,7 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
{
u16 old_command;
u16 msix_flags;
+ int ret;
/*
* If this isn't a Chelsio T4-based device, return -ENOTTY indicating
@@ -4048,16 +4061,15 @@ static int reset_chelsio_generic_dev(struct pci_dev *dev, bool probe)
PCI_MSIX_FLAGS_ENABLE |
PCI_MSIX_FLAGS_MASKALL);
- pcie_flr(dev);
+ ret = quirk_flr_err(pcie_flr(dev));
/*
* Restore the configuration information (BAR values, etc.) including
- * the original PCI Configuration Space Command word, and return
- * success.
+ * the original PCI Configuration Space Command word.
*/
pci_restore_state(dev);
pci_write_config_word(dev, PCI_COMMAND, old_command);
- return 0;
+ return ret;
}
#define PCI_DEVICE_ID_INTEL_82599_SFP_VF 0x10ed
@@ -4140,9 +4152,7 @@ static int nvme_disable_and_flr(struct pci_dev *dev, bool probe)
pci_iounmap(dev, bar);
- pcie_flr(dev);
-
- return 0;
+ return quirk_flr_err(pcie_flr(dev));
}
/*
@@ -4154,14 +4164,17 @@ static int nvme_disable_and_flr(struct pci_dev *dev, bool probe)
*/
static int delay_250ms_after_flr(struct pci_dev *dev, bool probe)
{
+ int ret;
+
if (probe)
return pcie_reset_flr(dev, PCI_RESET_PROBE);
- pcie_reset_flr(dev, PCI_RESET_DO_RESET);
+ ret = quirk_flr_err(pcie_reset_flr(dev, PCI_RESET_DO_RESET));
+ /* Settle the device even on a failed FLR */
msleep(250);
- return 0;
+ return ret;
}
#define PCI_DEVICE_ID_HINIC_VF 0x375E
@@ -4177,6 +4190,7 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
unsigned long timeout;
void __iomem *bar;
u32 val;
+ int ret;
if (probe)
return 0;
@@ -4197,12 +4211,13 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
val = val | HINIC_VF_FLR_PROC_BIT;
iowrite32be(val, bar + HINIC_VF_OP);
- pcie_flr(pdev);
+ ret = quirk_flr_err(pcie_flr(pdev));
/*
* The device must recapture its Bus and Device Numbers after FLR
* in order generate Completions. Issue a config write to let the
- * device capture this information.
+ * device capture this information. Note that pcie_flr() can fail
+ * after the reset is asserted. So, recapture it unconditionally.
*/
pci_write_config_word(pdev, PCI_VENDOR_ID, 0);
@@ -4220,11 +4235,13 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
goto reset_complete;
pci_warn(pdev, "Reset dev timeout, FLR ack reg: %#010x\n", val);
+ /* Preserve pcie_flr()'s error if it failed before the device ack stage */
+ ret = ret ? : -ETIMEDOUT;
reset_complete:
pci_iounmap(pdev, bar);
- return 0;
+ return ret;
}
static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
--
2.43.0
* [PATCH v5 04/18] iommu: Convert gdev->blocked from bool to enum gdev_blocked
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (2 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 03/18] PCI: Propagate FLR return values to callers Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 05/18] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
` (13 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
The gdev->blocked flag tracks whether a device is individually held in the
group->blocking_domain while group->domain is retained. So far, an
in-flight PCI reset has been the only producer, so a bool has sufficed.
Subsequent changes will add more reasons to keep a device blocked, e.g. a
failed-reset case that must not auto-unblock, or a driver-side quarantine
for a hardware fault. These reasons are cleared by different events, which
a single bool cannot encode.
Convert "bool blocked" into "enum gdev_blocked blocked", provisioned with
two initial values: BLOCKED_NO and BLOCKED_RESETTING, for the existing use
cases. All readers keep the "if (gdev->blocked)" form, as BLOCKED_NO == 0.
This is a pure type change with no behavior change. Follow-on changes will
add new enum values along with their producers.
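A short sketch of why existing readers need no change: because BLOCKED_NO is zero, a truthiness test on the enum behaves like the old bool. The enum matches the patch; struct mock_gdev and mock_is_blocked() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum gdev_blocked {
	BLOCKED_NO = 0,		/* Not blocked */
	BLOCKED_RESETTING,	/* PCI reset in flight */
};

/* Hypothetical reduction of struct group_device to the one field of interest. */
struct mock_gdev {
	enum gdev_blocked blocked;
};

/* Readers keep the "if (gdev->blocked)" form; any non-zero reason is "blocked". */
static bool mock_is_blocked(const struct mock_gdev *gdev)
{
	if (gdev->blocked)
		return true;
	return false;
}
```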
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e8f13dcebbde5..342e8a5ad628c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -73,16 +73,20 @@ struct iommu_group {
void *owner;
};
+enum gdev_blocked {
+ BLOCKED_NO = 0, /* Not blocked */
+ BLOCKED_RESETTING, /* PCI reset in flight */
+};
+
struct group_device {
struct list_head list;
struct device *dev;
char *name;
/*
* Device is blocked for a pending recovery while its group->domain is
- * retained. This can happen when:
- * - Device is undergoing a reset
+ * retained.
*/
- bool blocked;
+ enum gdev_blocked blocked;
unsigned int reset_depth;
};
@@ -4072,7 +4076,7 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
* the correct domain in iommu_driver_get_domain_for_dev() that might be
* called in a set_dev_pasid callback function.
*/
- gdev->blocked = true;
+ gdev->blocked = BLOCKED_RESETTING;
/*
* Stage PASID domains at blocking_domain while retaining pasid_array.
@@ -4198,7 +4202,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
* the correct domain in iommu_driver_get_domain_for_dev() that might be
* called in a set_dev_pasid callback function.
*/
- gdev->blocked = false;
+ gdev->blocked = BLOCKED_NO;
/*
* Re-attach PASID domains back to the domains retained in pasid_array.
--
2.43.0
* [PATCH v5 05/18] iommu: Pass in reset result to pci_dev_reset_iommu_done()
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (3 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 04/18] iommu: Convert gdev->blocked from bool to enum gdev_blocked Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 06/18] iommu/arm-smmu-v3: Don't rb_erase() a never-inserted stream node Nicolin Chen
` (12 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
IOMMU drivers handle ATC maintenance. They may encounter ATC-related errors
(e.g., an ATC invalidation timeout), indicating that the ATC may hold stale
entries that can corrupt memory. In this case, the IOMMU driver has no
choice but to block the device's ATS function and wait for a device
recovery.
The pci_dev_reset_iommu_done() call at the end of a reset function could
serve as a reliable signal to the IOMMU subsystem that the physical device
cache is completely clean. However, the function is called unconditionally,
even if the reset operation actually failed, which re-attaches the faulty
device to a normal translation domain. This leaves the system exposed to
data corruption:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
__reset(); // Failed (ATC is still stale)
pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)
Instead, pass in @reset_result to pci_dev_reset_iommu_done() from callers:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
rc = __reset();
pci_dev_reset_iommu_done(rc); // Unblock or quarantine
On a successful reset, done() restores the device to its RID/PASID domains
and decrements group->recovery_cnt. On failure, the device remains blocked,
and concurrent domain attachment will be rejected until a successful reset.
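The resulting state transitions can be sketched as a small userspace model. It is deliberately simplified and rests on assumptions: struct mock_gdev folds group->recovery_cnt into the device (a single-device group), reset_depth and domain attachment are omitted, and mock_prepare()/mock_done() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

enum gdev_blocked {
	BLOCKED_NO = 0,		/* Not blocked */
	BLOCKED_RESETTING,	/* PCI reset in flight */
	BLOCKED_RESET_FAILED,	/* PCI reset failed */
};

/* Hypothetical model: one device, with the group counter folded in. */
struct mock_gdev {
	enum gdev_blocked blocked;
	unsigned int recovery_cnt;
};

static void mock_prepare(struct mock_gdev *gdev)
{
	/* A device quarantined by an earlier failure is already blocked. */
	if (gdev->blocked)
		return;
	gdev->blocked = BLOCKED_RESETTING;
	gdev->recovery_cnt++;
}

static void mock_done(struct mock_gdev *gdev, int reset_result)
{
	if (reset_result) {
		/* Failed reset: stay blocked and leave recovery_cnt intact. */
		if (gdev->blocked == BLOCKED_RESETTING)
			gdev->blocked = BLOCKED_RESET_FAILED;
		return;
	}

	if (!gdev->blocked)
		return;

	/* Successful reset: unblock and drop the recovery count. */
	gdev->blocked = BLOCKED_NO;
	if (gdev->recovery_cnt)
		gdev->recovery_cnt--;
}
```

A failed reset followed by a successful one walks the model through BLOCKED_RESETTING, BLOCKED_RESET_FAILED and back to BLOCKED_NO, with the counter returning to zero only at the end.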
Note: -ENOTTY is overloaded by the PCI reset functions. Some use it to mean
"reset was not attempted", while others use it to mean "the current method
failed; try the next reset method". The IOMMU core would need to react to
these two outcomes differently but cannot tell them apart, so it has no
choice but to keep the device blocked on -ENOTTY as well. Leave an inline
FIXME and a warning.
This introduces a new situation where a device can be unplugged while it is
still blocked. Decrement the group->recovery_cnt accordingly.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 5 ++--
drivers/iommu/iommu.c | 62 ++++++++++++++++++++++++++++++++++++++++--
drivers/pci/pci-acpi.c | 2 +-
drivers/pci/pci.c | 10 +++----
drivers/pci/quirks.c | 2 +-
5 files changed, 69 insertions(+), 12 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d20aa6f6863ab..59ea7e601a2d7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1224,7 +1224,7 @@ void iommu_free_global_pasid(ioasid_t pasid);
/* PCI device reset functions */
int pci_dev_reset_iommu_prepare(struct pci_dev *pdev);
-void pci_dev_reset_iommu_done(struct pci_dev *pdev);
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result);
#else /* CONFIG_IOMMU_API */
struct iommu_ops {};
@@ -1554,7 +1554,8 @@ static inline int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
return 0;
}
-static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev,
+ int reset_result)
{
}
#endif /* CONFIG_IOMMU_API */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 342e8a5ad628c..6e2e607de8d8f 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -76,6 +76,7 @@ struct iommu_group {
enum gdev_blocked {
BLOCKED_NO = 0, /* Not blocked */
BLOCKED_RESETTING, /* PCI reset in flight */
+ BLOCKED_RESET_FAILED, /* PCI reset failed */
};
struct group_device {
@@ -762,6 +763,9 @@ static void __iommu_group_remove_device(struct device *dev)
if (device->dev != dev)
continue;
+ /* Must drop the recovery_cnt when removing a blocked device */
+ if (device->blocked && !WARN_ON(group->recovery_cnt == 0))
+ group->recovery_cnt--;
list_del(&device->list);
__iommu_group_free_device(group, device);
if (dev_has_iommu(dev))
@@ -4025,7 +4029,12 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
* reset is finished, pci_dev_reset_iommu_done() can restore everything.
*
* Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to decrement the recovery_cnt.
+ * before/after the core-level reset routine. On a successful reset, done() will
+ * decrement group->recovery_cnt and restore domains. On a failure, recovery_cnt
+ * is left intact and the device stays blocked.
+ *
+ * Callers must skip pci_dev_reset_iommu_prepare/done() entirely when no reset
+ * is attempted (e.g. probe mode).
*
* Return: 0 on success or negative error code if the preparation failed.
*
@@ -4055,6 +4064,10 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
if (gdev->reset_depth++)
return 0;
+ /* Device might be already blocked for a quarantine */
+ if (gdev->blocked)
+ return 0;
+
ret = __iommu_group_alloc_blocking_domain(group);
if (ret) {
gdev->reset_depth--;
@@ -4136,20 +4149,28 @@ static bool group_device_dma_alias_is_blocked(struct iommu_group *group,
/**
* pci_dev_reset_iommu_done() - Restore IOMMU after a PCI device reset is done
* @pdev: PCI device that has finished a reset routine
+ * @reset_result: Return code from the reset routine
*
* After a PCIe device finishes a reset routine, it wants to restore its IOMMU
* activity, including new translation and cache invalidation, by re-attaching
* all RID/PASID of the device back to the domains retained in the core-level
* structure.
*
- * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller passes
+ * the reset return value to @reset_result. On a failed reset, the device will
+ * remain blocked as a quarantine measure, with group->recovery_cnt intact, to
+ * protect system memory until a subsequent successful reset.
+ *
+ * Callers must skip pci_dev_reset_iommu_prepare/done() entirely when no reset
+ * is attempted (e.g. probe mode).
*
* Note that, although unlikely, there is a risk that re-attaching domains might
* fail due to some unexpected happening like OOM.
*/
-void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, int reset_result)
{
struct iommu_group *group = pdev->dev.iommu_group;
+ enum gdev_blocked old_gdev_blocked;
struct group_device *gdev;
unsigned long pasid;
void *entry;
@@ -4172,6 +4193,37 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
if (WARN_ON(!group->blocking_domain))
return;
+ /*
+ * A reset failure implies that the device might be unreliable. E.g. its
+ * device cache might retain stale entries, which might result in memory
+ * corruption. Thus, do not unblock the device until a successful reset.
+ */
+ if (reset_result) {
+ /*
+ * FIXME: the int-return values from the PCI reset functions are
+ * not consistent: some reset functions use -ENOTTY to indicate
+ * "no reset was attempted" (in which case IOMMU should revert a
+ * prepare), while others use -ENOTTY to indicate "reset failed;
+ * try the next reset method" (in which case IOMMU should keep
+ * the device blocked). Without fixing the PCI return result, we
+ * cannot tell the difference between the two cases. Warn it.
+ */
+ if (reset_result == -ENOTTY)
+ dev_warn_ratelimited(
+ &pdev->dev,
+ "Reset may have been skipped. Keep it blocked conservatively\n");
+ else
+ dev_err_ratelimited(
+ &pdev->dev,
+ "Reset failed. Keep it blocked to protect memory\n");
+ if (gdev->blocked == BLOCKED_RESETTING)
+ gdev->blocked = BLOCKED_RESET_FAILED;
+ return;
+ }
+
+ if (WARN_ON(!gdev->blocked))
+ return;
+
if (group_device_dma_alias_is_blocked(group, gdev)) {
/*
* FIXME: DMA aliased devices share the same RID, which would be
@@ -4202,6 +4254,7 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
* the correct domain in iommu_driver_get_domain_for_dev() that might be
* called in a set_dev_pasid callback function.
*/
+ old_gdev_blocked = gdev->blocked;
gdev->blocked = BLOCKED_NO;
/*
@@ -4223,6 +4276,9 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
if (!WARN_ON(group->recovery_cnt == 0))
group->recovery_cnt--;
+
+ if (old_gdev_blocked > BLOCKED_RESETTING)
+ pci_info(pdev, "Device is unblocked after successful reset\n");
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 4d0f2cb6c695b..280d7193cb4ca 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -977,7 +977,7 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
ret = -ENOTTY;
}
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, ret);
return ret;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8102989673333..c974f1e2cffe5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4387,7 +4387,7 @@ int pcie_flr(struct pci_dev *dev)
ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
done:
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, ret);
return ret;
}
EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4465,7 +4465,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
done:
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, ret);
return ret;
}
@@ -4519,7 +4519,7 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
pci_dev_d3_sleep(dev);
ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, ret);
return ret;
}
@@ -4961,7 +4961,7 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
rc = pci_parent_bus_reset(dev, probe);
done:
if (!probe)
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, rc);
return rc;
}
@@ -5014,7 +5014,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
reg);
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, rc);
return rc;
}
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 7f6d1574fe2bf..a478e6e36248f 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4277,7 +4277,7 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
ret = i->reset(dev, probe);
if (!probe)
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, ret);
return ret;
}
--
2.43.0
* [PATCH v5 06/18] iommu/arm-smmu-v3: Don't rb_erase() a never-inserted stream node
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (4 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 05/18] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 07/18] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
` (11 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
arm_smmu_insert_master() skips inserting a stream whose StreamID duplicates
one the same master already owns (bridged PCI devices can present duplicate
IDs), leaving that master->streams[i].node zeroed and unlinked from the
smmu->streams rb-tree.
Both the insert error-rollback loop and arm_smmu_remove_master() then call
rb_erase() on every master->streams[i].node unconditionally. rb_erase() on
a zeroed node sees a NULL parent, treats the node as the tree root and sets
root->rb_node = NULL, silently emptying the whole SID tree and breaking SID
lookups (and DMA) for every other master on the SMMU.
Mark each node with RB_CLEAR_NODE() after sort_nonatomic() reorders the
array, since sorting relocates the entries and would leave the earlier
self-referential RB_CLEAR_NODE() pointer stale. An un-inserted node then
stays RB_EMPTY_NODE() and is skipped in both erase loops; inserted nodes
are linked by rb_find_add() and erased as before.
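The stale-marker problem can be reproduced in a userspace model. This is a sketch, not the kernel's rbtree: struct mock_rb_node and the mock_* helpers are hypothetical, assuming only that an "empty" node is one that records its own address, as the commit message describes for RB_CLEAR_NODE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of the "empty" marker; not the kernel's struct rb_node. */
struct mock_rb_node {
	uintptr_t parent_color;
};

/* Like RB_CLEAR_NODE(): an empty node records its own address. */
static void mock_clear_node(struct mock_rb_node *node)
{
	node->parent_color = (uintptr_t)node;
}

/* Like RB_EMPTY_NODE(): empty only if the node still points at itself. */
static bool mock_empty_node(const struct mock_rb_node *node)
{
	return node->parent_color == (uintptr_t)node;
}

/* Swap two array entries by value, as a sort does when reordering. */
static void mock_swap(struct mock_rb_node *a, struct mock_rb_node *b)
{
	struct mock_rb_node tmp = *a;

	*a = *b;
	*b = tmp;
}
```

In this model a zeroed node is not seen as empty, and a node cleared before a swap stops being empty afterwards because its recorded address now belongs to another slot. Clearing after the reorder restores the marker.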
Fixes: b00d24997a11 ("iommu/arm-smmu-v3: Fix iommu_device_probe bug due to duplicated stream ids")
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a10affb483a4f..d9734cfe6f989 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -4051,6 +4051,13 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
sizeof(master->streams[0]), arm_smmu_stream_id_cmp,
NULL);
+ /*
+ * Clear after sorting: RB_CLEAR_NODE() records the node's own address,
+ * which sort_nonatomic() invalidates by relocating the entries.
+ */
+ for (i = 0; i < fwspec->num_ids; i++)
+ RB_CLEAR_NODE(&master->streams[i].node);
+
mutex_lock(&smmu->streams_mutex);
for (i = 0; i < fwspec->num_ids; i++) {
struct arm_smmu_stream *new_stream = &master->streams[i];
@@ -4083,7 +4090,9 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
if (ret) {
for (i--; i >= 0; i--)
- rb_erase(&master->streams[i].node, &smmu->streams);
+ if (!RB_EMPTY_NODE(&master->streams[i].node))
+ rb_erase(&master->streams[i].node,
+ &smmu->streams);
kfree(master->streams);
kfree(master->build_invs);
}
@@ -4103,7 +4112,8 @@ static void arm_smmu_remove_master(struct arm_smmu_master *master)
mutex_lock(&smmu->streams_mutex);
for (i = 0; i < fwspec->num_ids; i++)
- rb_erase(&master->streams[i].node, &smmu->streams);
+ if (!RB_EMPTY_NODE(&master->streams[i].node))
+ rb_erase(&master->streams[i].node, &smmu->streams);
mutex_unlock(&smmu->streams_mutex);
kfree(master->streams);
--
2.43.0
* [PATCH v5 07/18] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (5 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 06/18] iommu/arm-smmu-v3: Don't rb_erase() a never-inserted stream node Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 08/18] iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM Nicolin Chen
` (10 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
An ATC invalidation timeout is a fatal error. The SMMUv3 hardware reports
the timeout via a GERROR interrupt, but the driver thread issuing the
commands has no direct way to tell whether its own batch was the cause:
polling the CMD_SYNC status does not return a failure code. This makes it
very difficult to coordinate per-device recovery.
Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
gap. When the ISR detects an ATC timeout, set the bit corresponding to the
physical CMDQ index of the faulting CMD_SYNC command.
On the issuer side, after polling completes (or times out), test and clear
the bit for the issuer's own CMD_SYNC slot. If set, return -EIO to trigger
device quarantine. The issuer tests with a plain test_bit() first and calls
clear_bit() only when the bit is set, sparing the shared cache line an
atomic RMW in the common no-timeout case. An smp_rmb() ahead of the
issuer-side test orders it after the completion poll; the poll may observe
the completion with a relaxed load, which would otherwise allow the bitmap
read to be hoisted above it.
When inserting a CMD_SYNC, clear any stale bit left in its slot by a prior
wraparound, before the slot becomes visible to the SMMU, so that the GERROR
ISR can only set the bit for the new CMD_SYNC.
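The bitmap handshake described above can be sketched in userspace C11 as
follows. This is only an illustrative model, not the driver code: NENTS,
isr_mark_timeout(), issuer_prepare_slot() and issuer_take_timeout() are
hypothetical names, C11 atomics stand in for set_bit()/clear_bit(), and an
acquire fence stands in for smp_rmb().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NENTS		64	/* hypothetical queue depth */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define WORD(idx)	(&timeouts[(idx) / WORD_BITS])
#define MASK(idx)	(1UL << ((idx) % WORD_BITS))

/* One bit per command queue slot, zero-initialized */
static atomic_ulong timeouts[(NENTS + WORD_BITS - 1) / WORD_BITS];

/* ISR side: mark the slot of the CMD_SYNC that hit the ATC timeout */
static void isr_mark_timeout(unsigned int idx)
{
	atomic_fetch_or(WORD(idx), MASK(idx));
}

/* Issuer side, at insertion: drop a stale bit left by a previous lap */
static void issuer_prepare_slot(unsigned int idx)
{
	atomic_fetch_and(WORD(idx), ~MASK(idx));
}

/* Issuer side, after the poll: returns true if this slot timed out */
static bool issuer_take_timeout(unsigned int idx)
{
	/* Stand-in for smp_rmb(): order the bitmap read after the poll */
	atomic_thread_fence(memory_order_acquire);

	/* Plain load first, so the common case avoids an atomic RMW */
	if (!(atomic_load_explicit(WORD(idx), memory_order_relaxed) & MASK(idx)))
		return false;

	atomic_fetch_and(WORD(idx), ~MASK(idx));
	return true;
}
```

In this model, a slot that was prepared and never marked reports no
timeout, a marked slot reports it exactly once, and preparing a slot
discards any mark left over from an earlier use of the same index.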
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 70 ++++++++++++++++++++-
2 files changed, 70 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index c909c9a88538b..56e872e59afeb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -700,6 +700,7 @@ struct arm_smmu_cmdq {
atomic_long_t *valid_map;
atomic_t owner_prod;
atomic_t lock;
+ unsigned long *atc_sync_timeouts;
bool (*supports_cmd)(struct arm_smmu_cmd *cmd);
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d9734cfe6f989..da29e523d78b7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -343,7 +343,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
* at the CMD_SYNC. Attempt to complete other pending commands
* by repeating the CMD_SYNC, though we might well end up back
* here since the ATC invalidation may still be pending.
+ *
+ * Mark the faulty batch in the bitmap for the issuer to match.
*/
+ set_bit(Q_IDX(&q->llq, cons), cmdq->atc_sync_timeouts);
return;
case CMDQ_ERR_CERROR_ILL_IDX:
default:
@@ -750,6 +753,14 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
queue_write(Q_ENT(&cmdq->q, prod), cmd_sync.data,
ARRAY_SIZE(cmd_sync.data));
+ /*
+ * Clear any stale ATC-timeout bit left in the slot from a prior
+ * wraparound, before the slot becomes visible to the SMMU. Must
+ * do this prior to step 3, to prevent a potential race with the
+ * GERROR ISR calling set_bit() for our own CMD_SYNC.
+ */
+ clear_bit(Q_IDX(&llq, prod), cmdq->atc_sync_timeouts);
+
/*
* In order to determine completion of our CMD_SYNC, we must
* ensure that the queue can't wrap twice without us noticing.
@@ -796,9 +807,61 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
if (sync) {
+ u32 sync_prod;
+
llq.prod = queue_inc_prod_n(&llq, n);
+ sync_prod = llq.prod;
ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
- if (ret) {
+
+ /*
+ * Ensure that the read in __arm_smmu_cmdq_poll_until_msi() or
+ * __arm_smmu_cmdq_poll_until_consumed() are completed, before
+ * testing the atc_sync_timeouts bitmap below.
+ *
+ * Without it, the test_bit() below could be reordered before
+ * the relaxed reads in the two poll functions, missing a bit
+ * that is set before the CMD_SYNC completion. The wmb in the
+ * writel(GERRORN) ensures that the set_bit() in the ISR must
+ * be completed, followed by the SMMU consuming the CMD_SYNC.
+ *
+ * [CPU0 - issuer] | [CPU1 - GERROR ISR]
+ * | __arm_smmu_cmdq_skip_err() {
+ * | set_bit(atc_sync_timeouts);
+ * | }
+ * | writel(gerror, GERRORN);
+ * | // wmb: SMMU then resumes,
+ * // completion generated by | // consuming the CMD_SYNC
+ * // the consumed CMD_SYNC |
+ * read CMD_SYNC completion; |
+ * smp_rmb(); // ensure reads |
+ * // are completed |
+ * test_bit(atc_sync_timeouts);|
+ */
+ smp_rmb();
+
+ /*
+ * Test atc_sync_timeouts first and see if there is ATC timeout
+ * resulted from this cmdlist. Return -EIO to separate from the
+ * ARM_SMMU_POLL_TIMEOUT_US software timeout. Use a non-atomic
+ * test_bit() first, sparing an atomic RMW in the common case.
+ *
+ * FIXME possible unhandled ATC invalidation timeout scenario:
+ * PCI Completion Timeout can be set to a range longer than the
+ * ARM_SMMU_POLL_TIMEOUT_US software timeout. -ETIMEDOUT can be
+ * returned by arm_smmu_cmdq_poll_until_sync() while the ATC_INV
+ * is still pending and not yet reflected in GERROR, so the bit
+ * on atc_sync_timeouts is not set. In this case, we can hardly
+ * do anything here, since the command queue HW is still pending
+ * on the ATC command.
+ */
+ if (test_bit(Q_IDX(&llq, sync_prod), cmdq->atc_sync_timeouts)) {
+ clear_bit(Q_IDX(&llq, sync_prod),
+ cmdq->atc_sync_timeouts);
+ dev_err_ratelimited(smmu->dev,
+ "CMD_SYNC for ATC_INV timeout at prod=0x%08x\n",
+ sync_prod);
+ ret = -EIO;
+ } else if (ret) {
dev_err_ratelimited(smmu->dev,
"CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
llq.prod,
@@ -4405,6 +4468,11 @@ int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
if (!cmdq->valid_map)
return -ENOMEM;
+ cmdq->atc_sync_timeouts =
+ devm_bitmap_zalloc(smmu->dev, nents, GFP_KERNEL);
+ if (!cmdq->atc_sync_timeouts)
+ return -ENOMEM;
+
return 0;
}
--
2.43.0
* [PATCH v5 08/18] iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (6 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 07/18] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 09/18] iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback Nicolin Chen
` (9 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
When the SMMU enters Service Failure Mode (SFM), arm_smmu_device_disable()
clears CR0 and the SMMU stops processing requests entirely. The remaining
GERROR causes (MSI write aborts, PRIQ/EVTQ aborts, CMDQ_ERR) are moot at
that point: the cmdq is dead so arm_smmu_cmdq_skip_err() would just twiddle
bookkeeping for a queue nobody's reading, and the per-cause dev_warn lines
add little diagnostic value beyond the SFM message itself.
Ack the GERROR before arm_smmu_device_disable() and return. Acking before
the multi-ms disable wait keeps a level-triggered IRQ source from re-firing
the handler. The writel+return here duplicates the non-SFM tail because a
subsequent commit will give the two paths different locking. SFM is one-way
and the SMMU does not generate new GERROR causes, so the ack is final.
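As a rough illustration of the control flow above, the following userspace
sketch models GERROR/GERRORN as two plain words, where a cause is active
while its bit differs between them. The struct, the function names and the
bit positions are assumptions for the model only; the real handler reads
and writes MMIO registers and handles more causes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions are assumptions of this model */
#define ERR_CMDQ	(1u << 0)
#define ERR_SFM		(1u << 8)

struct gerr_regs {
	uint32_t gerror;	/* toggled by the (modelled) hardware */
	uint32_t gerrorn;	/* toggled by software to acknowledge */
};

static uint32_t gerr_active(const struct gerr_regs *r)
{
	return r->gerror ^ r->gerrorn;
}

/* Returns true when the handler took the SFM early exit */
static bool handle_gerror(struct gerr_regs *r, bool *disabled,
			  unsigned int *cmdq_skips)
{
	uint32_t gerror = r->gerror;
	uint32_t active = gerror ^ r->gerrorn;

	if (!active)
		return false;

	if (active & ERR_SFM) {
		/* Ack every cause first, then disable and stop here */
		r->gerrorn = gerror;
		*disabled = true;
		return true;
	}

	if (active & ERR_CMDQ)
		(*cmdq_skips)++;	/* stand-in for the skip_err path */

	r->gerrorn = gerror;
	return false;
}
```

With both ERR_SFM and ERR_CMDQ active, the model acks everything, marks the
device disabled and never reaches the CMDQ error path.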
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index da29e523d78b7..8a4edefeec770 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2315,8 +2315,11 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
active);
if (active & GERROR_SFM_ERR) {
+ /* SMMU is being disabled, so other errors don't matter */
+ writel(gerror, smmu->base + ARM_SMMU_GERRORN);
dev_err(smmu->dev, "device has entered Service Failure Mode!\n");
arm_smmu_device_disable(smmu);
+ return IRQ_HANDLED;
}
if (active & GERROR_MSI_GERROR_ABT_ERR)
--
2.43.0
* [PATCH v5 09/18] iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (7 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 08/18] iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 10/18] iommu/arm-smmu-v3: Recheck CMDQ_ERR in tegra241_vintf0_handle_error() Nicolin Chen
` (8 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
A subsequent change will need arm_smmu_cmdq_issue_cmdlist() to co-clear a
pending CMDQ_ERR after a CMD_SYNC poll timeout, for both smmu->cmdq and
tegra241-cmdq.
Add a cmdq_err_handler and a paired cmdq_err_lock to struct arm_smmu_cmdq.
arm_smmu_gerror_handler() now takes the per-cmdq cmdq_err_lock when acking
CMDQ_ERR. It already covers a concurrent ack from cmdq_err_handler via its
existing early-exit on no-active-bits.
The implementation functions and their caller will be added in a
subsequent change.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 12 ++++++++++--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++++++++++++++----
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 2 +-
3 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 56e872e59afeb..86e934d046eaa 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -695,6 +695,10 @@ struct arm_smmu_queue_poll {
bool wfe;
};
+struct arm_smmu_cmdq;
+typedef void (*arm_smmu_cmdq_err_fn)(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq *cmdq);
+
struct arm_smmu_cmdq {
struct arm_smmu_queue q;
atomic_long_t *valid_map;
@@ -702,6 +706,10 @@ struct arm_smmu_cmdq {
atomic_t lock;
unsigned long *atc_sync_timeouts;
bool (*supports_cmd)(struct arm_smmu_cmd *cmd);
+
+ /* Drain a pending CMDQ_ERR; will hold cmdq_err_lock with irqsave */
+ arm_smmu_cmdq_err_fn cmdq_err_handler;
+ raw_spinlock_t cmdq_err_lock;
};
static inline bool arm_smmu_cmdq_supports_cmd(struct arm_smmu_cmdq *cmdq,
@@ -1164,8 +1172,8 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
struct arm_smmu_queue *q, void __iomem *page,
unsigned long prod_off, unsigned long cons_off,
size_t dwords, const char *name);
-int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
- struct arm_smmu_cmdq *cmdq);
+int arm_smmu_cmdq_init(struct arm_smmu_device *smmu, struct arm_smmu_cmdq *cmdq,
+ arm_smmu_cmdq_err_fn cmdq_err_handler);
static inline bool arm_smmu_master_canwbs(struct arm_smmu_master *master)
{
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8a4edefeec770..c6e3d1be23403 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2302,13 +2302,18 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
{
u32 gerror, gerrorn, active;
struct arm_smmu_device *smmu = dev;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&smmu->cmdq.cmdq_err_lock, flags);
gerror = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
gerrorn = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
active = gerror ^ gerrorn;
- if (!(active & GERROR_ERR_MASK))
+ if (!(active & GERROR_ERR_MASK)) {
+ raw_spin_unlock_irqrestore(&smmu->cmdq.cmdq_err_lock, flags);
return IRQ_NONE; /* No errors pending */
+ }
dev_warn(smmu->dev,
"unexpected global error reported (0x%08x), this could be serious\n",
@@ -2317,6 +2322,8 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
if (active & GERROR_SFM_ERR) {
/* SMMU is being disabled, so other errors don't matter */
writel(gerror, smmu->base + ARM_SMMU_GERRORN);
+ /* Release before arm_smmu_device_disable() that sleeps */
+ raw_spin_unlock_irqrestore(&smmu->cmdq.cmdq_err_lock, flags);
dev_err(smmu->dev, "device has entered Service Failure Mode!\n");
arm_smmu_device_disable(smmu);
return IRQ_HANDLED;
@@ -2344,6 +2351,7 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
arm_smmu_cmdq_skip_err(smmu);
writel(gerror, smmu->base + ARM_SMMU_GERRORN);
+ raw_spin_unlock_irqrestore(&smmu->cmdq.cmdq_err_lock, flags);
return IRQ_HANDLED;
}
@@ -4458,13 +4466,15 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
return 0;
}
-int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
- struct arm_smmu_cmdq *cmdq)
+int arm_smmu_cmdq_init(struct arm_smmu_device *smmu, struct arm_smmu_cmdq *cmdq,
+ arm_smmu_cmdq_err_fn cmdq_err_handler)
{
unsigned int nents = 1 << cmdq->q.llq.max_n_shift;
atomic_set(&cmdq->owner_prod, 0);
atomic_set(&cmdq->lock, 0);
+ raw_spin_lock_init(&cmdq->cmdq_err_lock);
+ cmdq->cmdq_err_handler = cmdq_err_handler;
cmdq->valid_map = (atomic_long_t *)devm_bitmap_zalloc(smmu->dev, nents,
GFP_KERNEL);
@@ -4490,7 +4500,7 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
if (ret)
return ret;
- ret = arm_smmu_cmdq_init(smmu, &smmu->cmdq);
+ ret = arm_smmu_cmdq_init(smmu, &smmu->cmdq, NULL);
if (ret)
return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 67be62a6e7640..9012ab584d1dd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -643,7 +643,7 @@ static int tegra241_vcmdq_alloc_smmu_cmdq(struct tegra241_vcmdq *vcmdq)
q->q_base = q->base_dma & VCMDQ_ADDR;
q->q_base |= FIELD_PREP(VCMDQ_LOG2SIZE, q->llq.max_n_shift);
- return arm_smmu_cmdq_init(smmu, cmdq);
+ return arm_smmu_cmdq_init(smmu, cmdq, NULL);
}
/* VINTF Logical VCMDQ Resource Helpers */
--
2.43.0
* [PATCH v5 10/18] iommu/arm-smmu-v3: Recheck CMDQ_ERR in tegra241_vintf0_handle_error()
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (8 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 09/18] iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 11/18] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out Nicolin Chen
` (7 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
A subsequent change will allow cmdq_err_handler to ack a pending CMDQ_ERR
concurrently with tegra241_vintf0_handle_error(). Take cmdq_err_lock around
the gerror read and ack, and recheck (gerror ^ gerrorn) & GERROR_CMDQ_ERR
before calling __arm_smmu_cmdq_skip_err() so a concurrent ack doesn't cause
us to skip_err on an already-handled error.
arm_smmu_gerror_handler() already covers this via its existing early-exit
on no-active-bits.
tegra241_vcmdq_hw_deinit() acks the same GERROR/GERRORN pair unlocked, and
the error IRQ is live from probe, so a latched-error ISR can race a VCMDQ
deinit during a device reset. Take the lock around that ack as well. Since
a user-owned VCMDQ never goes through arm_smmu_cmdq_init() yet does reach
tegra241_vcmdq_hw_deinit(), initialize its cmdq_err_lock at allocation.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 9012ab584d1dd..666dd23b0c7ca 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -319,10 +319,19 @@ static void tegra241_vintf0_handle_error(struct tegra241_vintf *vintf)
while (map) {
unsigned long lidx = __ffs64(map);
struct tegra241_vcmdq *vcmdq = vintf->lvcmdqs[lidx];
- u32 gerror = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR));
+ struct arm_smmu_cmdq *cmdq = &vcmdq->cmdq;
+ unsigned long flags;
+ u32 gerror, gerrorn;
- __arm_smmu_cmdq_skip_err(&vintf->cmdqv->smmu, &vcmdq->cmdq);
+ raw_spin_lock_irqsave(&cmdq->cmdq_err_lock, flags);
+ gerror = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR));
+ gerrorn = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERRORN));
+
+ if ((gerror ^ gerrorn) & GERROR_CMDQ_ERR)
+ __arm_smmu_cmdq_skip_err(&vintf->cmdqv->smmu,
+ cmdq);
writel(gerror, REG_VCMDQ_PAGE0(vcmdq, GERRORN));
+ raw_spin_unlock_irqrestore(&cmdq->cmdq_err_lock, flags);
map &= ~BIT_ULL(lidx);
}
}
@@ -444,6 +453,7 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
{
char header[64], *h = lvcmdq_error_header(vcmdq, header, 64);
u32 gerrorn, gerror;
+ unsigned long flags;
if (vcmdq_write_config(vcmdq, 0)) {
dev_err(vcmdq->cmdqv->dev,
@@ -459,6 +469,7 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
writeq_relaxed(0, REG_VCMDQ_PAGE1(vcmdq, BASE));
writeq_relaxed(0, REG_VCMDQ_PAGE1(vcmdq, CONS_INDX_BASE));
+ raw_spin_lock_irqsave(&vcmdq->cmdq.cmdq_err_lock, flags);
gerrorn = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERRORN));
gerror = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR));
if (gerror != gerrorn) {
@@ -466,6 +477,7 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
"%suncleared error detected, resetting\n", h);
writel(gerror, REG_VCMDQ_PAGE0(vcmdq, GERRORN));
}
+ raw_spin_unlock_irqrestore(&vcmdq->cmdq.cmdq_err_lock, flags);
dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
}
@@ -1126,6 +1138,8 @@ static int tegra241_vintf_alloc_lvcmdq_user(struct iommufd_hw_queue *hw_queue,
vcmdq->cmdq.q.q_base = base_addr_pa & VCMDQ_ADDR;
vcmdq->cmdq.q.q_base |= log2size;
+ raw_spin_lock_init(&vcmdq->cmdq.cmdq_err_lock);
+
ret = tegra241_vcmdq_hw_init_user(vcmdq);
if (ret)
goto unmap_lvcmdq;
--
2.43.0
* [PATCH v5 11/18] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (9 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 10/18] iommu/arm-smmu-v3: Recheck CMDQ_ERR in tegra241_vintf0_handle_error() Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 12/18] iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper Nicolin Chen
` (6 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
Once arm_smmu_cmdq_poll_until_sync() returns, arm_smmu_cmdq_issue_cmdlist()
tests its CMD_SYNC slot in atc_sync_timeouts to decide whether there was an
ATC_INV timeout.
However, when that poll times out, the GERROR ISR might have been delayed
past the poll deadline. The atc_sync_timeouts test could then miss an
ATC_INV timeout, classify it as a generic CMD_SYNC timeout, and bypass the
per-device quarantine.
Add two cmdq_err_handler implementations:
- arm_smmu_cmdq_err_handler() reads SMMU GERROR/GERRORN.
- tegra241_vcmdq_handle_cmdq_err() reads VCMDQ GERROR/GERRORN.
Co-clear any pending CMDQ_ERR in the issuer, when the polling on a CMD_SYNC
times out. Each cmdq impl serializes the synchronous drain against its own
IRQ handler with cmdq->cmdq_err_lock.
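The issuer-side drain can be modelled in userspace roughly as below. This
is a sketch under stated assumptions: the registers are plain words, the
cmdq_err_lock is omitted, ERR_OTHER is a made-up stand-in for any non-CMDQ
cause, and the skips counter replaces the real skip_err work. The point it
shows is that only the CMDQ_ERR bit is toggled in GERRORN, so other causes
stay active for the ISR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ERR_CMDQ	(1u << 0)
#define ERR_OTHER	(1u << 2)	/* hypothetical non-CMDQ cause */

struct gerr_model {
	uint32_t gerror;
	uint32_t gerrorn;
	unsigned int skips;	/* counts modelled skip_err calls */
};

/* Returns true if a pending CMDQ_ERR was found and acknowledged */
static bool drain_cmdq_err(struct gerr_model *m)
{
	uint32_t gerror = m->gerror;
	uint32_t gerrorn = m->gerrorn;

	/* Recheck: the ISR may already have handled and acked it */
	if (!((gerror ^ gerrorn) & ERR_CMDQ))
		return false;

	m->skips++;

	/* Toggle only the CMDQ_ERR bit; leave the rest for the ISR */
	m->gerrorn = gerrorn ^ ERR_CMDQ;
	return true;
}
```

A second call on the same state finds no active CMDQ_ERR and does nothing,
which is the behaviour the recheck is meant to give when the ISR and the
issuer race.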
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 31 ++++++++++++++++++-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 24 +++++++++++++-
2 files changed, 53 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c6e3d1be23403..4b4e8108d5944 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -813,6 +813,15 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
sync_prod = llq.prod;
ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
+ /*
+ * When the poll above timed out, the GERROR ISR might have been
+ * delayed past the poll deadline, so the atc_sync_timeouts test
+ * below could miss our ATC_INV timeout. Thus, drain any pending
+ * CMDQ_ERR synchronously first via the per-cmdq callback.
+ */
+ if (ret && cmdq->cmdq_err_handler)
+ cmdq->cmdq_err_handler(smmu, cmdq);
+
/*
* Ensure that the read in __arm_smmu_cmdq_poll_until_msi() or
* __arm_smmu_cmdq_poll_until_consumed() are completed, before
@@ -2298,6 +2307,26 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
static int arm_smmu_device_disable(struct arm_smmu_device *smmu);
+/* Drain a pending CMDQ_ERR, used by arm_smmu_cmdq_issue_cmdlist() */
+static void arm_smmu_cmdq_err_handler(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq *cmdq)
+{
+ u32 gerror, gerrorn;
+
+ guard(raw_spinlock_irqsave)(&cmdq->cmdq_err_lock);
+
+ gerror = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
+ gerrorn = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
+
+ if (!((gerror ^ gerrorn) & GERROR_CMDQ_ERR))
+ return;
+
+ __arm_smmu_cmdq_skip_err(smmu, cmdq);
+
+ /* Toggle only the CMDQ_ERR bit; other bits are left for the ISR. */
+ writel(gerrorn ^ GERROR_CMDQ_ERR, smmu->base + ARM_SMMU_GERRORN);
+}
+
static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
{
u32 gerror, gerrorn, active;
@@ -4500,7 +4529,7 @@ static int arm_smmu_init_queues(struct arm_smmu_device *smmu)
if (ret)
return ret;
- ret = arm_smmu_cmdq_init(smmu, &smmu->cmdq, NULL);
+ ret = arm_smmu_cmdq_init(smmu, &smmu->cmdq, arm_smmu_cmdq_err_handler);
if (ret)
return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 666dd23b0c7ca..628a3a7cc0335 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -337,6 +337,28 @@ static void tegra241_vintf0_handle_error(struct tegra241_vintf *vintf)
}
}
+/* Drain a pending CMDQ_ERR, used by arm_smmu_cmdq_issue_cmdlist() */
+static void tegra241_vcmdq_handle_cmdq_err(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq *cmdq)
+{
+ struct tegra241_vcmdq *vcmdq =
+ container_of(cmdq, struct tegra241_vcmdq, cmdq);
+ u32 gerror, gerrorn;
+
+ guard(raw_spinlock_irqsave)(&cmdq->cmdq_err_lock);
+
+ gerror = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR));
+ gerrorn = readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERRORN));
+
+ if (!((gerror ^ gerrorn) & GERROR_CMDQ_ERR))
+ return;
+
+ __arm_smmu_cmdq_skip_err(smmu, cmdq);
+
+ /* Toggle only the CMDQ_ERR bit on this VCMDQ's GERRORN */
+ writel(gerrorn ^ GERROR_CMDQ_ERR, REG_VCMDQ_PAGE0(vcmdq, GERRORN));
+}
+
static irqreturn_t tegra241_cmdqv_isr(int irq, void *devid)
{
struct tegra241_cmdqv *cmdqv = (struct tegra241_cmdqv *)devid;
@@ -655,7 +677,7 @@ static int tegra241_vcmdq_alloc_smmu_cmdq(struct tegra241_vcmdq *vcmdq)
q->q_base = q->base_dma & VCMDQ_ADDR;
q->q_base |= FIELD_PREP(VCMDQ_LOG2SIZE, q->llq.max_n_shift);
- return arm_smmu_cmdq_init(smmu, cmdq, NULL);
+ return arm_smmu_cmdq_init(smmu, cmdq, tegra241_vcmdq_handle_cmdq_err);
}
/* VINTF Logical VCMDQ Resource Helpers */
--
2.43.0
* [PATCH v5 12/18] iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (10 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 11/18] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 13/18] iommu/arm-smmu-v3: Add streams_lock for atomic-context SID->master lookup Nicolin Chen
` (5 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
Both arm_smmu_cmdq_batch_submit() and arm_smmu_cmdq_batch_add_cmd_p() call
arm_smmu_cmdq_issue_cmdlist() to flush batches. A future change will retry
the issued commands on -EIO, using the arm_smmu_invs carried in the batch,
so a single hook point is preferred.
Introduce an arm_smmu_cmdq_batch_issue() wrapper, so that the retry logic
can simply be added to the wrapper.
No functional changes.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4b4e8108d5944..6c099338a17ef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -919,6 +919,14 @@ static void arm_smmu_cmdq_batch_init_cmd(struct arm_smmu_device *smmu,
cmds->cmdq = arm_smmu_get_cmdq(smmu, cmd);
}
+static int arm_smmu_cmdq_batch_issue(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq_batch *cmds,
+ bool sync)
+{
+ return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
+ cmds->num, sync);
+}
+
static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds,
struct arm_smmu_cmd *cmd)
@@ -929,14 +937,12 @@ static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
unsupported_cmd = !arm_smmu_cmdq_supports_cmd(cmds->cmdq, cmd);
if (force_sync || unsupported_cmd) {
- arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
- cmds->num, true);
+ arm_smmu_cmdq_batch_issue(smmu, cmds, true);
arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd);
}
if (cmds->num == CMDQ_BATCH_ENTRIES) {
- arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
- cmds->num, false);
+ arm_smmu_cmdq_batch_issue(smmu, cmds, false);
arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd);
}
@@ -952,8 +958,7 @@ static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds)
{
- return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
- cmds->num, true);
+ return arm_smmu_cmdq_batch_issue(smmu, cmds, true);
}
static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
--
2.43.0
* [PATCH v5 13/18] iommu/arm-smmu-v3: Add streams_lock for atomic-context SID->master lookup
2026-07-03 4:06 [PATCH v5 00/18] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
` (11 preceding siblings ...)
2026-07-03 4:06 ` [PATCH v5 12/18] iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper Nicolin Chen
@ 2026-07-03 4:06 ` Nicolin Chen
2026-07-03 4:06 ` [PATCH v5 14/18] iommu/arm-smmu-v3: Add has_ats to struct arm_smmu_cmdq_batch Nicolin Chen
` (4 subsequent siblings)
17 siblings, 0 replies; 19+ messages in thread
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
A subsequent change will look up arm_smmu_master entries by SID from inside
arm_smmu_cmdq_batch_retry(), which runs with invs->rwlock read_lock held in
IRQ-disabled context, and so cannot take the sleeping streams_mutex.
Add a spinlock_t streams_lock that protects rb_root mutations alongside the
existing streams_mutex:
- atomic-context readers will hold the spinlock alone
- writers (insert/remove paths) take both
A reader under the streams_lock uses all the streams of the found master,
so the insertion has to be all-or-nothing: make arm_smmu_insert_master()
initialize all the L2 strtabs first and then insert all the stream nodes in
one critical section, making a master found via any single SID always fully
initialized.
Update the lockdep assertion in arm_smmu_find_master() to accept either of
the locks so the helper is callable from both contexts.
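The all-or-nothing insertion can be illustrated with a small userspace
model. Everything here is an assumption made for the sketch: a fixed table
indexed by SID replaces the rb-tree, MAX_SID and struct master_model are
invented names, SIDs are assumed to be below MAX_SID, the spinlock is left
out, and -1 stands in for -ENODEV.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SID	16	/* hypothetical SID space of the model */

struct master_model {
	const unsigned int *sids;
	size_t num_sids;
};

/* owner[sid] points to the master that published that SID, or NULL */
static const struct master_model *owner[MAX_SID];

/* Publishes every SID of the master, or none of them */
static int insert_streams(const struct master_model *m)
{
	size_t i;

	/* The real code holds streams_lock across this whole loop */
	for (i = 0; i < m->num_sids; i++) {
		const struct master_model **slot = &owner[m->sids[i]];

		if (!*slot) {
			*slot = m;
			continue;
		}

		/* The same master may list a SID more than once */
		if (*slot == m)
			continue;

		/* Another master owns this SID: undo what was published */
		while (i-- > 0)
			if (owner[m->sids[i]] == m)
				owner[m->sids[i]] = NULL;
		return -1;
	}
	return 0;
}
```

A lookup that runs while no insertion is in progress therefore finds either
all SIDs of a master or none of them, and a failed insertion leaves the
entries of other masters untouched.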
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Assisted-by: Claude:claude-fable-5
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 95 ++++++++++++++-------
2 files changed, 65 insertions(+), 32 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 86e934d046eaa..c34be7c59ad45 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -966,6 +966,8 @@ struct arm_smmu_device {
struct rb_root streams;
struct mutex streams_mutex;
+ /* Held during rb_root updates; allows atomic-context lookups */
+ spinlock_t streams_lock;
};
struct arm_smmu_stream {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6c099338a17ef..e2fa9d27c6586 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2061,7 +2061,8 @@ arm_smmu_find_master(struct arm_smmu_device *smmu, u32 sid)
{
struct rb_node *node;
- lockdep_assert_held(&smmu->streams_mutex);
+ lockdep_assert(lockdep_is_held(&smmu->streams_mutex) ||
+ lockdep_is_held(&smmu->streams_lock));
node = rb_find(&sid, &smmu->streams, arm_smmu_streams_cmp_key);
if (!node)
@@ -4121,6 +4122,51 @@ static int arm_smmu_stream_id_cmp(const void *_l, const void *_r)
return cmp_int(*l, *r);
}
+/* Caller must hold the streams_mutex. Publishes all the nodes, or none */
+static int arm_smmu_insert_streams(struct arm_smmu_device *smmu,
+ struct arm_smmu_master *master)
+{
+ struct arm_smmu_master *existing_master = NULL;
+ u32 existing_sid = 0;
+ unsigned long flags;
+ int ret = 0;
+ int i;
+
+ spin_lock_irqsave(&smmu->streams_lock, flags);
+ for (i = 0; i < master->num_streams; i++) {
+ struct rb_node *existing;
+
+ existing = rb_find_add(&master->streams[i].node,
+ &smmu->streams,
+ arm_smmu_streams_cmp_node);
+ if (!existing)
+ continue;
+
+ existing_master = rb_entry(existing, struct arm_smmu_stream,
+ node)->master;
+
+ /* Bridged PCI devices may end up with duplicated IDs */
+ if (existing_master == master)
+ continue;
+
+ existing_sid = master->streams[i].id;
+ ret = -ENODEV;
+ break;
+ }
+ if (ret)
+ for (i--; i >= 0; i--)
+ if (!RB_EMPTY_NODE(&master->streams[i].node))
+ rb_erase(&master->streams[i].node,
+ &smmu->streams);
+ spin_unlock_irqrestore(&smmu->streams_lock, flags);
+
+ if (ret)
+ dev_warn(master->dev,
+ "Aliasing StreamID 0x%x (from %s) unsupported, expect DMA to be broken\n",
+ existing_sid, dev_name(existing_master->dev));
+ return ret;
+}
+
static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
struct arm_smmu_master *master)
{
@@ -4167,40 +4213,22 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
RB_CLEAR_NODE(&master->streams[i].node);
mutex_lock(&smmu->streams_mutex);
- for (i = 0; i < fwspec->num_ids; i++) {
- struct arm_smmu_stream *new_stream = &master->streams[i];
- struct rb_node *existing;
- u32 sid = new_stream->id;
- ret = arm_smmu_init_sid_strtab(smmu, sid);
+ /*
+ * Initialize the L2 strtabs before publishing any stream node, and
+ * insert all the nodes in one critical section, so an atomic reader
+ * never sees a partially initialized master.
+ */
+ for (i = 0; i < fwspec->num_ids; i++) {
+ ret = arm_smmu_init_sid_strtab(smmu, master->streams[i].id);
if (ret)
break;
-
- /* Insert into SID tree */
- existing = rb_find_add(&new_stream->node, &smmu->streams,
- arm_smmu_streams_cmp_node);
- if (existing) {
- struct arm_smmu_master *existing_master =
- rb_entry(existing, struct arm_smmu_stream, node)
- ->master;
-
- /* Bridged PCI devices may end up with duplicated IDs */
- if (existing_master == master)
- continue;
-
- dev_warn(master->dev,
- "Aliasing StreamID 0x%x (from %s) unsupported, expect DMA to be broken\n",
- sid, dev_name(existing_master->dev));
- ret = -ENODEV;
- break;
- }
}
+ if (!ret)
+ ret = arm_smmu_insert_streams(smmu, master);
+
if (ret) {
- for (i--; i >= 0; i--)
- if (!RB_EMPTY_NODE(&master->streams[i].node))
- rb_erase(&master->streams[i].node,
- &smmu->streams);
kfree(master->streams);
kfree(master->build_invs);
}
@@ -4219,9 +4247,11 @@ static void arm_smmu_remove_master(struct arm_smmu_master *master)
return;
mutex_lock(&smmu->streams_mutex);
- for (i = 0; i < fwspec->num_ids; i++)
- if (!RB_EMPTY_NODE(&master->streams[i].node))
- rb_erase(&master->streams[i].node, &smmu->streams);
+ scoped_guard(spinlock_irqsave, &smmu->streams_lock)
+ for (i = 0; i < fwspec->num_ids; i++)
+ if (!RB_EMPTY_NODE(&master->streams[i].node))
+ rb_erase(&master->streams[i].node,
+ &smmu->streams);
mutex_unlock(&smmu->streams_mutex);
kfree(master->streams);
@@ -4636,6 +4666,7 @@ static int arm_smmu_init_structures(struct arm_smmu_device *smmu)
int ret;
mutex_init(&smmu->streams_mutex);
+ spin_lock_init(&smmu->streams_lock);
smmu->streams = RB_ROOT;
ret = arm_smmu_init_queues(smmu);
--
2.43.0
* [PATCH v5 14/18] iommu/arm-smmu-v3: Add has_ats to struct arm_smmu_cmdq_batch
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
arm_smmu_cmdq_batch_add_cmd_p() may flush a sub-batch midway, either when
ARM_SMMU_OPT_CMDQ_FORCE_SYNC is set or when the batch is full. To allow
a future change to retry these sub-batch flushes on a timeout and identify
the broken master, the batch needs to know whether it holds an ATC_INV.
Add a "has_ats" flag, set by arm_smmu_cmdq_batch_add_cmd_p() when it queues
an ATC_INV command.
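
The flag handling can be pictured with a small user-space sketch. It is not the driver code: the names (sketch_ats_batch, sketch_batch_init, sketch_batch_add) are hypothetical, and the opcode mask, the ATC_INV opcode value and the batch capacity are assumptions made only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_OP_MASK		UINT64_C(0xff)	/* assumed opcode field */
#define SKETCH_OP_ATC_INV	UINT64_C(0x40)	/* assumed ATC_INV opcode */
#define SKETCH_BATCH_ENTRIES	8		/* assumed batch capacity */

/* Hypothetical, reduced batch: first command word only */
struct sketch_ats_batch {
	uint64_t cmds[SKETCH_BATCH_ENTRIES];
	bool has_ats;	/* set once an ATC_INV has been queued */
	int num;
};

/* Every (re)initialization starts a batch with no ATC_INV in it */
static void sketch_batch_init(struct sketch_ats_batch *b)
{
	b->num = 0;
	b->has_ats = false;
}

/* The caller is expected to flush and re-init before the batch overflows */
static void sketch_batch_add(struct sketch_ats_batch *b, uint64_t cmd)
{
	if ((cmd & SKETCH_OP_MASK) == SKETCH_OP_ATC_INV)
		b->has_ats = true;
	b->cmds[b->num++] = cmd;
}
```

In this sketch the flag is sticky until the next init, which matches the intent above: a later flush only needs to know whether any ATC_INV was queued, not where.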
No functional changes.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
2 files changed, 5 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index c34be7c59ad45..56e9a94826a12 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -721,6 +721,8 @@ static inline bool arm_smmu_cmdq_supports_cmd(struct arm_smmu_cmdq *cmdq,
struct arm_smmu_cmdq_batch {
struct arm_smmu_cmd cmds[CMDQ_BATCH_ENTRIES];
struct arm_smmu_cmdq *cmdq;
+ /* Set when an ATC_INV is queued; gates the retry-aware sync decision */
+ bool has_ats;
int num;
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e2fa9d27c6586..78e2559bdc491 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -917,6 +917,7 @@ static void arm_smmu_cmdq_batch_init_cmd(struct arm_smmu_device *smmu,
{
cmds->num = 0;
cmds->cmdq = arm_smmu_get_cmdq(smmu, cmd);
+ cmds->has_ats = false;
}
static int arm_smmu_cmdq_batch_issue(struct arm_smmu_device *smmu,
@@ -946,6 +947,8 @@ static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd);
}
+ if (FIELD_GET(CMDQ_0_OP, cmd->data[0]) == CMDQ_OP_ATC_INV)
+ cmds->has_ats = true;
cmds->cmds[cmds->num++] = *cmd;
}
--
2.43.0
* [PATCH v5 15/18] iommu/arm-smmu-v3: Add INV_TYPE_ATS_BROKEN to skip quarantined ATS masters
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
A subsequent change quarantines a master whose ATC invalidation timed out,
marking its INV_TYPE_ATS / INV_TYPE_ATS_FULL entries as broken in the invs
of every domain it is attached to. Clearing STE.EATS makes the SMMU reject
the device's ATS but does not stop the driver from issuing ATC_INV, so
without a marker those commands would keep timing out on that master.
Add the INV_TYPE_ATS_BROKEN type. __arm_smmu_domain_inv_range() skips it in
its switch without issuing commands. arm_smmu_inv_is_ats() recognizes it so
the iter's batch-boundary logic places it next to other ATS group entries.
The setter writes cur->type via WRITE_ONCE while the inv_range iter holds
read_lock on invs->rwlock, so its loads of cur->type and next->type race
with it. A new arm_smmu_inv_type() helper wraps the load in READ_ONCE, and
the iter and arm_smmu_inv_cmp() read through it. cur->type is u8 so the
access is already atomic; READ_ONCE annotates it for KCSAN and matches the
cur->users pattern.
arm_smmu_invs_merge() and arm_smmu_invs_purge() also read cur->type, but
through a whole-struct copy of the live array that cannot be a READ_ONCE.
Wrap those copies in data_race(): the u8 load returns the old or flipped
type, and a stale read at worst yields one more ATC_INV timeout, which
re-quarantines. The hot-path WRITE_ONCE and READ_ONCE stay; data_race()
only covers the cold copies that cannot be marked.
An in-place flip of cur->type to INV_TYPE_ATS_BROKEN must not change its
sort position, or arm_smmu_invs_merge() and unref() walks would no longer
match it against an incoming ATS / ATS_FULL identity. Treat all three ATS
variants as one sort class in arm_smmu_inv_cmp() so a flip stays in place
and the flipped entry still matches its pre-flip ssid on attach and detach.
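
The sort-class idea can be sketched in standalone C. This is only an illustration under stated assumptions: the struct is reduced to the fields the comparison needs, the smmu pointer comparison is left out, and all sketch_* names are hypothetical rather than taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced mirror of the invalidation types */
enum sketch_inv_type {
	SKETCH_S1_ASID,
	SKETCH_S2_VMID,
	SKETCH_ATS,
	SKETCH_ATS_FULL,
	SKETCH_ATS_BROKEN,
};

struct sketch_inv {
	uint8_t type;
	uint32_t id;	/* ASID, VMID or StreamID */
	uint32_t ssid;	/* only meaningful for ATS entries */
};

static int sketch_cmp_u32(uint32_t l, uint32_t r)
{
	return (l > r) - (l < r);
}

static bool sketch_is_ats(const struct sketch_inv *inv)
{
	return inv->type == SKETCH_ATS || inv->type == SKETCH_ATS_FULL ||
	       inv->type == SKETCH_ATS_BROKEN;
}

/* All ATS variants form one sort class, ordered by id and then ssid */
static int sketch_inv_cmp(const struct sketch_inv *l,
			  const struct sketch_inv *r)
{
	bool are_ats = sketch_is_ats(l) && sketch_is_ats(r);

	if (!are_ats && l->type != r->type)
		return sketch_cmp_u32(l->type, r->type);
	if (l->id != r->id)
		return sketch_cmp_u32(l->id, r->id);
	if (are_ats)
		return sketch_cmp_u32(l->ssid, r->ssid);
	return 0;
}
```

With this comparator, flipping an entry's type from SKETCH_ATS to SKETCH_ATS_BROKEN changes neither its position relative to its neighbours nor its equality with an incoming ATS entry carrying the same id and ssid.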
No functional change yet; the new type is introduced but never set
anywhere.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 12 +++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 33 +++++++++++++++------
2 files changed, 35 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 56e9a94826a12..8eb5684696316 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -736,6 +736,7 @@ enum arm_smmu_inv_type {
INV_TYPE_S2_VMID_S1_CLEAR,
INV_TYPE_ATS,
INV_TYPE_ATS_FULL,
+ INV_TYPE_ATS_BROKEN,
};
struct arm_smmu_inv {
@@ -752,9 +753,18 @@ struct arm_smmu_inv {
int users; /* users=0 to mark as a trash to be purged */
};
+/* cur->type may flip to INV_TYPE_ATS_BROKEN concurrently with readers */
+static inline u8 arm_smmu_inv_type(const struct arm_smmu_inv *inv)
+{
+ return READ_ONCE(inv->type);
+}
+
static inline bool arm_smmu_inv_is_ats(const struct arm_smmu_inv *inv)
{
- return inv->type == INV_TYPE_ATS || inv->type == INV_TYPE_ATS_FULL;
+ u8 type = arm_smmu_inv_type(inv);
+
+ return type == INV_TYPE_ATS || type == INV_TYPE_ATS_FULL ||
+ type == INV_TYPE_ATS_BROKEN;
}
/**
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 78e2559bdc491..a18a56ceeb7fb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1029,13 +1029,21 @@ arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
static int arm_smmu_inv_cmp(const struct arm_smmu_inv *inv_l,
const struct arm_smmu_inv *inv_r)
{
+ /*
+ * Treat all ATS types as one class, so an in-place flip to ATS_BROKEN
+ * preserves the sort order and still matches the original ATS entry.
+ */
+ bool are_ats = arm_smmu_inv_is_ats(inv_l) && arm_smmu_inv_is_ats(inv_r);
+ u8 type_l = arm_smmu_inv_type(inv_l);
+ u8 type_r = arm_smmu_inv_type(inv_r);
+
if (inv_l->smmu != inv_r->smmu)
return cmp_int((uintptr_t)inv_l->smmu, (uintptr_t)inv_r->smmu);
- if (inv_l->type != inv_r->type)
- return cmp_int(inv_l->type, inv_r->type);
+ if (!are_ats && type_l != type_r)
+ return cmp_int(type_l, type_r);
if (inv_l->id != inv_r->id)
return cmp_int(inv_l->id, inv_r->id);
- if (arm_smmu_inv_is_ats(inv_l))
+ if (are_ats)
return cmp_int(inv_l->ssid, inv_r->ssid);
return 0;
}
@@ -1121,11 +1129,12 @@ struct arm_smmu_invs *arm_smmu_invs_merge(struct arm_smmu_invs *invs,
return ERR_PTR(-ENOMEM);
new = new_invs->inv;
+ /* data_race(): a racing quarantine may flip ->type; the u8 is safe */
arm_smmu_invs_for_each_cmp(invs, i, to_merge, j, cmp) {
if (cmp < 0) {
- *new = invs->inv[i];
+ *new = data_race(invs->inv[i]);
} else if (cmp == 0) {
- *new = invs->inv[i];
+ *new = data_race(invs->inv[i]);
WRITE_ONCE(new->users, READ_ONCE(new->users) + 1);
} else {
*new = to_merge->inv[j];
@@ -1247,8 +1256,9 @@ struct arm_smmu_invs *arm_smmu_invs_purge(struct arm_smmu_invs *invs)
if (!new_invs)
return NULL;
+ /* data_race(): a racing quarantine may flip ->type; the u8 is safe */
arm_smmu_invs_for_each_entry(invs, i, inv) {
- new_invs->inv[num_invs] = *inv;
+ new_invs->inv[num_invs] = data_race(*inv);
if (arm_smmu_inv_is_ats(inv))
new_invs->has_ats = true;
num_invs++;
@@ -2635,8 +2645,8 @@ static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
if (cur->smmu != next->smmu)
return true;
/* The batch for S2 TLBI must be done before nested S1 ASIDs */
- if (cur->type != INV_TYPE_S2_VMID_S1_CLEAR &&
- next->type == INV_TYPE_S2_VMID_S1_CLEAR)
+ if (arm_smmu_inv_type(cur) != INV_TYPE_S2_VMID_S1_CLEAR &&
+ arm_smmu_inv_type(next) == INV_TYPE_S2_VMID_S1_CLEAR)
return true;
/* ATS must be after a sync of the S1/S2 invalidations */
if (!arm_smmu_inv_is_ats(cur) && arm_smmu_inv_is_ats(next))
@@ -2672,7 +2682,7 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
if (!cmds.num)
arm_smmu_cmdq_batch_init_cmd(smmu, &cmds, &cmd);
- switch (cur->type) {
+ switch (arm_smmu_inv_type(cur)) {
case INV_TYPE_S1_ASID:
cmd = arm_smmu_make_cmd_tlbi(cur->size_opcode,
cur->id, 0);
@@ -2706,6 +2716,9 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
arm_smmu_make_cmd_atc_inv_all(cur->id,
IOMMU_NO_PASID));
break;
+ case INV_TYPE_ATS_BROKEN:
+ /* Master is quarantined; skip its ATC_INV */
+ break;
default:
WARN_ON_ONCE(1);
break;
@@ -3256,6 +3269,8 @@ arm_smmu_master_build_inv(struct arm_smmu_master *master,
cur->size_opcode = cur->nsize_opcode = CMDQ_OP_ATC_INV;
cur->ssid = ssid;
break;
+ case INV_TYPE_ATS_BROKEN:
+ break;
}
return cur;
--
2.43.0
* [PATCH v5 16/18] iommu/arm-smmu-v3: Factor out CMDQ batch force-sync conditions
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
arm_smmu_cmdq_batch_add_cmd_p() carries two distinct reasons for flushing
the current batch with a CMD_SYNC before appending the new command:
- The batch's pre-assigned cmdq does not support the new command.
- The Arm erratum 2812531 workaround (ARM_SMMU_OPT_CMDQ_FORCE_SYNC)
forces a SYNC one entry before the batch is full.
Factor those checks into a new arm_smmu_cmdq_batch_force_sync() helper so
that adding another force-sync condition becomes a one-line addition.
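
The shape of such a predicate can be sketched in isolation. The sketch below is not the driver helper: the option bit, the batch capacity and the cmdq_supports_cmd field (a stand-in for the result of arm_smmu_cmdq_supports_cmd()) are assumptions, and all sketch_* names are hypothetical.

```c
#include <stdbool.h>

#define SKETCH_BATCH_ENTRIES	64		/* assumed batch capacity */
#define SKETCH_OPT_FORCE_SYNC	(1u << 0)	/* stand-in for the erratum option */

struct sketch_sync_batch {
	int num;		/* commands queued so far */
	bool cmdq_supports_cmd;	/* stand-in for the cmdq capability check */
};

/* One early return per reason; a new reason is one more "if" block */
static bool sketch_batch_force_sync(unsigned int options,
				    const struct sketch_sync_batch *b)
{
	/* The pre-assigned queue cannot take the new command */
	if (!b->cmdq_supports_cmd)
		return true;

	/* Erratum workaround: sync one entry before the batch is full */
	if (b->num == SKETCH_BATCH_ENTRIES - 1 &&
	    (options & SKETCH_OPT_FORCE_SYNC))
		return true;

	return false;
}
```

Keeping each reason as its own early return is what makes a later condition a small, local addition instead of a change to a compound boolean expression.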
No functional change.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 23 +++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a18a56ceeb7fb..0697bbc558d0e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -928,16 +928,27 @@ static int arm_smmu_cmdq_batch_issue(struct arm_smmu_device *smmu,
cmds->num, sync);
}
+static bool arm_smmu_cmdq_batch_force_sync(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq_batch *cmds,
+ struct arm_smmu_cmd *cmd)
+{
+ /* The batch's pre-assigned cmdq doesn't support the new command */
+ if (!arm_smmu_cmdq_supports_cmd(cmds->cmdq, cmd))
+ return true;
+
+ /* Arm erratum 2812531 */
+ if (cmds->num == CMDQ_BATCH_ENTRIES - 1 &&
+ (smmu->options & ARM_SMMU_OPT_CMDQ_FORCE_SYNC))
+ return true;
+
+ return false;
+}
+
static void arm_smmu_cmdq_batch_add_cmd_p(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds,
struct arm_smmu_cmd *cmd)
{
- bool force_sync = (cmds->num == CMDQ_BATCH_ENTRIES - 1) &&
- (smmu->options & ARM_SMMU_OPT_CMDQ_FORCE_SYNC);
- bool unsupported_cmd;
-
- unsupported_cmd = !arm_smmu_cmdq_supports_cmd(cmds->cmdq, cmd);
- if (force_sync || unsupported_cmd) {
+ if (arm_smmu_cmdq_batch_force_sync(smmu, cmds, cmd)) {
arm_smmu_cmdq_batch_issue(smmu, cmds, true);
arm_smmu_cmdq_batch_init_cmd(smmu, cmds, cmd);
}
--
2.43.0
* [PATCH v5 17/18] iommu/arm-smmu-v3: Thread arm_smmu_master_domain on a per-master list
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
A subsequent change needs to enumerate, from the CMDQ error path in atomic
context, every domain a master is attached to so it can mark this master's
ATS entries broken in each domain's invs after an ATC invalidation timeout.
The existing per-domain smmu_domain->devices list tracks the inverse
direction (masters in a given domain), so introduce a per-master list.
Add a second list_head master_elm to arm_smmu_master_domain, threaded onto
a new master->master_domains list under master_domains_lock. The CMDQ error
path walks the list while holding smmu->streams_lock; that path runs under
the invs->rwlock read side, which is itself sleepable on PREEMPT_RT, so a
plain spinlock_t suffices for both. The attach and detach sites now take it
with spin_lock(), nested inside the existing devices_lock critical section
that already disables IRQs; it is a leaf in the lock order, so no inversion
is introduced.
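
As a rough picture of the double threading, the standalone sketch below links one attachment object onto both a per-domain list and a per-master list. It uses a minimal hand-rolled list instead of the kernel's list_head, omits all locking, and every sketch_* name is hypothetical.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, standing in for list_head */
struct sketch_list {
	struct sketch_list *prev, *next;
};

static void sketch_list_init(struct sketch_list *head)
{
	head->prev = head->next = head;
}

static void sketch_list_add(struct sketch_list *elm, struct sketch_list *head)
{
	elm->next = head->next;
	elm->prev = head;
	head->next->prev = elm;
	head->next = elm;
}

static void sketch_list_del(struct sketch_list *elm)
{
	elm->prev->next = elm->next;
	elm->next->prev = elm->prev;
}

static size_t sketch_list_len(const struct sketch_list *head)
{
	const struct sketch_list *pos;
	size_t n = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

struct sketch_domain { struct sketch_list devices; };
struct sketch_master { struct sketch_list master_domains; };

/* One attachment, reachable from its domain and from its master */
struct sketch_master_domain {
	struct sketch_list devices_elm;	/* on sketch_domain.devices */
	struct sketch_list master_elm;	/* on sketch_master.master_domains */
};

static void sketch_attach(struct sketch_master_domain *md,
			  struct sketch_master *m, struct sketch_domain *d)
{
	sketch_list_add(&md->devices_elm, &d->devices);
	sketch_list_add(&md->master_elm, &m->master_domains);
}

static void sketch_detach(struct sketch_master_domain *md)
{
	sketch_list_del(&md->devices_elm);
	sketch_list_del(&md->master_elm);
}
```

Walking sketch_master.master_domains then visits every attachment of one master without scanning any domain's device list, which is the direction the error path needs.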
No functional change.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 4 ++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8 ++++++++
2 files changed, 12 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 8eb5684696316..4aa0c5fedff71 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1029,6 +1029,9 @@ struct arm_smmu_master {
struct arm_smmu_vmaster *vmaster; /* use smmu->streams_mutex */
/* Locked by the iommu core using the group mutex */
struct arm_smmu_ctx_desc_cfg cd_table;
+ struct list_head master_domains;
+ /* Protects master_domains */
+ spinlock_t master_domains_lock;
unsigned int num_streams;
bool ats_enabled : 1;
bool ste_ats_enabled : 1;
@@ -1123,6 +1126,7 @@ struct arm_smmu_invs *arm_smmu_invs_purge(struct arm_smmu_invs *invs);
struct arm_smmu_master_domain {
struct list_head devices_elm;
+ struct list_head master_elm;
struct arm_smmu_master *master;
/*
* For nested domains the master_domain is threaded onto the S2 parent,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0697bbc558d0e..fd9a095154c72 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3372,6 +3372,9 @@ static void arm_smmu_remove_master_domain(struct arm_smmu_master *master,
ssid, nested_ats_flush);
if (master_domain) {
list_del(&master_domain->devices_elm);
+ spin_lock(&master->master_domains_lock);
+ list_del(&master_domain->master_elm);
+ spin_unlock(&master->master_domains_lock);
if (master->ats_enabled)
atomic_dec(&smmu_domain->nr_ats_masters);
}
@@ -3622,6 +3625,9 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
if (state->ats_enabled)
atomic_inc(&smmu_domain->nr_ats_masters);
list_add(&master_domain->devices_elm, &smmu_domain->devices);
+ spin_lock(&master->master_domains_lock);
+ list_add(&master_domain->master_elm, &master->master_domains);
+ spin_unlock(&master->master_domains_lock);
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
arm_smmu_install_new_domain_invs(state);
@@ -4346,6 +4352,8 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
master->dev = dev;
master->smmu = smmu;
dev_iommu_priv_set(dev, master);
+ INIT_LIST_HEAD(&master->master_domains);
+ spin_lock_init(&master->master_domains_lock);
ret = arm_smmu_insert_master(smmu, master);
if (ret)
--
2.43.0
* [PATCH v5 18/18] iommu/arm-smmu-v3: Block ATS for a master upon an ATC invalidation timeout
From: Nicolin Chen @ 2026-07-03 4:06 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
When a CMD_ATC_INV times out, the SMMU stalls at the trailing CMD_SYNC and
arm_smmu_cmdq_issue_cmdlist() returns -EIO. The CMDQ HW reports the timeout
on the CMD_SYNC, not the failing CMD_ATC_INV, so the master that caused it
cannot be identified from the error alone.
cmds->cmds is sorted by SID, so arm_smmu_cmdq_batch_retry() walks the batch
and re-issues one CMD_ATC_INV per unique ATS SID; a second -EIO can confirm
which master is broken. arm_smmu_quarantine_ats() then quarantines it:
- clear STE.EATS on every SID it owns
- walk master->master_domains marking its INV_TYPE_ATS/_ATS_FULL entries
in every domain's invs as INV_TYPE_ATS_BROKEN so later walks skip them
When a batch carries only one unique Stream ID, the timed-out CMD_SYNC by
itself identifies the target, in which case quarantine it directly and skip
the re-issue probes. This is the common case, since it is uncommon for an
ATS-capable PCI device to have multiple Stream IDs.
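
The identification flow described above can be sketched in standalone C. This is a simplified model, not the driver code: commands are reduced to an "is ATC_INV" flag plus a SID, the probe callback is a hypothetical stand-in for re-issuing one CMD_ATC_INV with a CMD_SYNC, and the sample probe that times out only for SID 0x30 exists purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a queued command */
struct sketch_cmd {
	bool is_atc_inv;
	uint32_t sid;
};

/* Stand-in for one ATC_INV + CMD_SYNC re-issue; true means it timed out */
typedef bool (*sketch_probe_fn)(uint32_t sid);

/* Count unique SIDs among the ATC_INV commands of a SID-sorted batch */
static int sketch_count_ats_sids(const struct sketch_cmd *cmds, size_t num,
				 uint32_t *last_sid)
{
	int nr_sids = 0;
	size_t i;

	for (i = 0; i < num; i++) {
		if (!cmds[i].is_atc_inv)
			continue;
		if (!nr_sids || cmds[i].sid != *last_sid) {
			nr_sids++;
			*last_sid = cmds[i].sid;
		}
	}
	return nr_sids;
}

/*
 * Fill @quarantined (room for @num entries) with the SIDs to quarantine
 * and return how many were stored.
 */
static size_t sketch_batch_retry(const struct sketch_cmd *cmds, size_t num,
				 sketch_probe_fn probe_times_out,
				 uint32_t *quarantined)
{
	uint32_t last_sid = 0;
	bool have_prev = false;
	size_t nr_out = 0;
	size_t i;

	/* A lone SID is already identified by the original timeout */
	if (sketch_count_ats_sids(cmds, num, &last_sid) == 1) {
		quarantined[nr_out++] = last_sid;
		return nr_out;
	}

	/* Otherwise probe once per unique SID, relying on the sort order */
	for (i = 0; i < num; i++) {
		if (!cmds[i].is_atc_inv)
			continue;
		if (have_prev && cmds[i].sid == last_sid)
			continue;
		have_prev = true;
		last_sid = cmds[i].sid;
		if (probe_times_out(last_sid))
			quarantined[nr_out++] = last_sid;
	}
	return nr_out;
}

/* Sample probe for illustration: only SID 0x30 fails to respond */
static bool sketch_only_0x30_times_out(uint32_t sid)
{
	return sid == 0x30;
}
```

The single-SID shortcut saves one full timeout wait in the case the message calls common, while the multi-SID path costs at most one probe per unique SID.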
The marking spans every domain because a master may be attached at the RID
and at multiple PASIDs; marking only the invs that hit the timeout would
leave its other invs issuing CMD_ATC_INV that keep timing out. Clearing
STE.EATS makes the SMMU reject the device's ATS but does not by itself
stop the driver from issuing CMD_ATC_INV, so the marking is what suppresses
the recurring timeouts.
The marking is gated on the STE.EATS clear: it runs only after the CFGI_STE
batch completes successfully. If that batch fails the entries are left as
ATS/ATS_FULL, so invalidations keep issuing CMD_ATC_INV and re-quarantine
until the clear is confirmed, rather than suppressing ATC_INV while ATS may
still be enabled.
The flip to INV_TYPE_ATS_BROKEN is a WRITE_ONCE on inv->type, paired with
the READ_ONCE in arm_smmu_inv_type(). The invs->rwlock is not taken here:
the caller holds the read side for the timed-out batch, so taking a write
side would ABBA-deadlock against a concurrent timeout. It is safe unlocked
because STE.EATS is cleared first, so a racing CMD_ATC_INV forms no ATC
entry and at worst times out again.
The STE.EATS clear uses try_cmpxchg64() to avoid losing a concurrent
arm_smmu_write_ste() update to data[1]. try_cmpxchg64() would be UB on the
Non-Cacheable stream table of a non-coherent SMMU, and a non-atomic
fallback could revert such a concurrent update (e.g. an S1DSS change). So
leave a non-coherent SMMU unquarantined, keeping the pre-existing behavior
of reporting every ATC_INV timeout.
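
The compare-and-exchange loop can be illustrated with C11 atomics in user space. This is an analogy rather than the kernel primitive: atomic_compare_exchange_weak() stands in for try_cmpxchg64(), the word is in CPU byte order (no __le64 handling), and the EATS field position is an assumption made for the sketch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed position of the EATS field within the 64-bit word */
#define SKETCH_EATS_MASK	(UINT64_C(3) << 28)

/*
 * Clear the EATS field without losing concurrent updates to other bits.
 * On failure the compare-exchange reloads @old with the current value,
 * so the next iteration recomputes @new from fresh data.
 */
static uint64_t sketch_clear_eats(_Atomic uint64_t *word)
{
	uint64_t old = atomic_load_explicit(word, memory_order_relaxed);
	uint64_t new;

	do {
		new = old & ~SKETCH_EATS_MASK;
	} while (!atomic_compare_exchange_weak(word, &old, new));

	return new;
}
```

A plain read-modify-write would instead store a value computed from a possibly stale read, which is the lost-update hazard the paragraph above describes for the non-atomic fallback.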
Also force a CMD_SYNC on every sub-batch flush carrying an ATC_INV so the
timeout is observed at the call site that issued the commands.
Identification is synchronous: each master that keeps timing out adds one
more CMD_SYNC poll, capped at the ARM_SMMU_POLL_TIMEOUT_US software limit.
That cap is rarely reached: a non-responding ATC_INV is completed in error
when the device's PCIe Completion Timeout expires, which defaults to a
short interval.
Assisted-by: Claude:claude-fable-5
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 201 +++++++++++++++++++-
1 file changed, 198 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fd9a095154c72..528d816479d7f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -109,6 +109,10 @@ static const char * const event_class_str[] = {
static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
static bool arm_smmu_ats_supported(struct arm_smmu_master *master);
+static struct arm_smmu_ste *
+arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid);
+static struct arm_smmu_domain *
+to_smmu_domain_devices(struct iommu_domain *domain);
static void parse_driver_options(struct arm_smmu_device *smmu)
{
@@ -920,12 +924,25 @@ static void arm_smmu_cmdq_batch_init_cmd(struct arm_smmu_device *smmu,
cmds->has_ats = false;
}
+static void arm_smmu_cmdq_batch_retry(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq_batch *cmds);
+
static int arm_smmu_cmdq_batch_issue(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds,
bool sync)
{
- return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
- cmds->num, sync);
+ int ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
+ cmds->num, sync);
+
+ /*
+ * The CMDQ HW reports an ATC invalidation timeout at the trailing
+ * CMD_SYNC, not at the failing CMD_ATC_INV. Re-issue each unique ATS
+ * SID in the batch to identify the unresponsive master and block its
+ * ATS so subsequent invalidations make forward progress.
+ */
+ if (ret == -EIO && cmds->has_ats)
+ arm_smmu_cmdq_batch_retry(smmu, cmds);
+ return ret;
}
static bool arm_smmu_cmdq_batch_force_sync(struct arm_smmu_device *smmu,
@@ -941,6 +958,10 @@ static bool arm_smmu_cmdq_batch_force_sync(struct arm_smmu_device *smmu,
(smmu->options & ARM_SMMU_OPT_CMDQ_FORCE_SYNC))
return true;
+ /* ATC_INV timeout is reported to CMD_SYNC; catch at the call site */
+ if (cmds->num == CMDQ_BATCH_ENTRIES && cmds->has_ats)
+ return true;
+
return false;
}
@@ -1014,7 +1035,11 @@ static inline struct arm_smmu_inv *
arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
{
while (true) {
- if (next >= invs->num_invs) {
+ /*
+ * Lockless readers (arm_smmu_invs_set_ats_broken) pair with the
+ * WRITE_ONCE() in arm_smmu_invs_unref(); num_invs only shrinks.
+ */
+ if (next >= READ_ONCE(invs->num_invs)) {
*idx = next;
return NULL;
}
@@ -2505,6 +2530,176 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
}
+static void arm_smmu_invs_set_ats_broken(struct arm_smmu_invs *invs,
+ struct arm_smmu_device *smmu, u32 sid)
+{
+ struct arm_smmu_inv *inv;
+ size_t i;
+
+ /* arm_smmu_atc_inv_master() submits batches with invs=NULL */
+ if (!invs)
+ return;
+
+ /*
+ * invs->rwlock is deliberately not taken: the caller holds one domain's
+ * read side for the timed-out batch, so taking another domain's write
+ * side while a concurrent timeout does the reverse would ABBA-deadlock.
+ *
+ * This leaves some races open, but they are harmless since EATS was
+ * already cleared:
+ * - WRITE_ONCE() may hit a stale invs copy if an attach just installed
+ * a new invs, which might result in another ATC_INV timeout.
+ * - a concurrent invalidation may still issue an ATC_INV that may time
+ * out again.
+ */
+ arm_smmu_invs_for_each_entry(invs, i, inv) {
+ u8 type = arm_smmu_inv_type(inv);
+
+ if (inv->smmu == smmu && inv->id == sid &&
+ (type == INV_TYPE_ATS || type == INV_TYPE_ATS_FULL))
+ WRITE_ONCE(inv->type, INV_TYPE_ATS_BROKEN);
+ }
+}
+
+/* Find the master by SID and block its ATS at the SMMU */
+static void arm_smmu_quarantine_ats(struct arm_smmu_device *smmu, u32 stream_id)
+{
+ struct arm_smmu_cmd cmd = arm_smmu_make_cmd_op(CMDQ_OP_CFGI_STE);
+ struct arm_smmu_master_domain *md;
+ struct arm_smmu_cmdq_batch cmds;
+ struct arm_smmu_master *master;
+ struct arm_smmu_invs *invs;
+ unsigned long flags;
+ int i;
+
+ /*
+ * The in-place STE.EATS clear relies on try_cmpxchg64(), which is UB
+ * on Non-Cacheable memory. Leave a non-coherent SMMU unquarantined:
+ * its invalidations keep issuing ATC_INV and reporting the timeouts.
+ */
+ if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+ return;
+
+ guard(spinlock_irqsave)(&smmu->streams_lock);
+ master = arm_smmu_find_master(smmu, stream_id);
+ /*
+ * A concurrent hot-unplug can release the master while a stale ATS
+ * entry for it still lingers in the invs snapshot being walked here.
+ */
+ if (!master)
+ return;
+
+ /* Clear STE.EATS for every SID and sync to the SMMU */
+ arm_smmu_cmdq_batch_init_cmd(smmu, &cmds, &cmd);
+
+ for (i = 0; i < master->num_streams; i++) {
+ u32 sid = master->streams[i].id;
+ struct arm_smmu_ste *ste = arm_smmu_get_step_for_sid(smmu, sid);
+ __le64 old, new;
+
+ /*
+ * A concurrent arm_smmu_write_ste() of a domain attachment may
+ * overwrite the data[1] and set EATS, which is recoverable by
+ * another ATC_INV issued by its arm_smmu_attach_commit().
+ */
+ old = READ_ONCE(ste->data[1]);
+ do {
+ new = old & ~cpu_to_le64(STRTAB_STE_1_EATS);
+ } while (!try_cmpxchg64(&ste->data[1], &old, new));
+
+ arm_smmu_cmdq_batch_add_cmd(
+ smmu, &cmds, arm_smmu_make_cmd_cfgi_ste(sid, true));
+ }
+
+ /*
+ * Only proceed to mark the entries broken if the STE.EATS clear above
+ * is confirmed; otherwise return so invalidations keep issuing ATC_INV
+ * (and re-quarantine) until ATS is actually disabled.
+ */
+ if (arm_smmu_cmdq_batch_submit(smmu, &cmds)) {
+ dev_err_ratelimited(smmu->dev,
+ "failed to disable ATS for master\n");
+ return;
+ }
+
+ /*
+ * Mark this master's ATS entries broken in every attached domain, so that
+ * later invalidations skip the ATC_INV that would time out again.
+ */
+ rcu_read_lock();
+ spin_lock_irqsave(&master->master_domains_lock, flags);
+ list_for_each_entry(md, &master->master_domains, master_elm) {
+ struct arm_smmu_domain *smmu_domain =
+ to_smmu_domain_devices(md->domain);
+
+ if (!smmu_domain)
+ continue;
+ invs = rcu_dereference(smmu_domain->invs);
+ for (i = 0; i < master->num_streams; i++)
+ arm_smmu_invs_set_ats_broken(invs, smmu,
+ master->streams[i].id);
+ }
+ spin_unlock_irqrestore(&master->master_domains_lock, flags);
+ rcu_read_unlock();
+}
+
+/* Re-issue every unique ATS SID in @cmds to identify and quarantine masters. */
+static void arm_smmu_cmdq_batch_retry(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq_batch *cmds)
+{
+ struct arm_smmu_cmd atc = {};
+ u32 last_sid = 0;
+ int nr_sids = 0;
+ int i;
+
+ /*
+ * Count unique Stream IDs, taking advantage of the sorted commands. An
+ * ATS-capable PCI device rarely has multiple SIDs, so a batch commonly
+ * carries a single SID, where a re-issue probe would be pointless.
+ */
+ for (i = 0; i < cmds->num; i++) {
+ u32 sid;
+
+ /* Only ATC_INV commands can time out */
+ if (FIELD_GET(CMDQ_0_OP, cmds->cmds[i].data[0]) !=
+ CMDQ_OP_ATC_INV)
+ continue;
+ sid = FIELD_GET(CMDQ_ATC_0_SID, cmds->cmds[i].data[0]);
+ if (!nr_sids || sid != last_sid) {
+ nr_sids++;
+ last_sid = sid;
+ }
+ }
+
+ /* The timed-out CMD_SYNC already identifies the lone Stream ID */
+ if (nr_sids == 1) {
+ arm_smmu_quarantine_ats(smmu, last_sid);
+ return;
+ }
+
+ for (i = 0; i < cmds->num; i++) {
+ u32 sid;
+
+ if (FIELD_GET(CMDQ_0_OP, cmds->cmds[i].data[0]) !=
+ CMDQ_OP_ATC_INV)
+ continue;
+
+ /*
+ * One retry per Stream ID. So, only try the first command since
+ * commands are sorted. And each dead master costs one CMD_SYNC,
+ * bounded by its PCIe Completion Timeout (usually <= 250ms).
+ */
+ sid = FIELD_GET(CMDQ_ATC_0_SID, cmds->cmds[i].data[0]);
+ if (atc.data[0] &&
+ sid == FIELD_GET(CMDQ_ATC_0_SID, atc.data[0]))
+ continue;
+
+ atc = cmds->cmds[i];
+ if (arm_smmu_cmdq_issue_cmd_p(smmu, &atc, true) == -EIO)
+ arm_smmu_quarantine_ats(smmu, sid);
+ }
+}
+
/* IO_PGTABLE API */
static void arm_smmu_tlb_inv_context(void *cookie)
{
--
2.43.0