* [PATCH v3 03/11] iommu: Add reset_device_done callback for hardware fault recovery
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
When an IOMMU hardware detects an error due to a faulty device (e.g. an ATS
invalidation timeout), IOMMU drivers may quarantine the device by disabling
specific hardware features or dropping translation capabilities.
To recover from these states, the IOMMU driver needs a reliable signal that
the underlying physical hardware has been cleanly reset (e.g., via PCIe AER
or a sysfs Function Level Reset) so as to lift the quarantine.
Introduce a reset_device_done callback in struct iommu_ops. Trigger it from
the existing pci_dev_reset_iommu_done() path to notify the underlying IOMMU
driver that the device's internal state has been sanitized.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 4 ++++
drivers/iommu/iommu.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d3685967e960a..3c5c5fa5cdc6a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -626,6 +626,9 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
* @release_device: Remove device from iommu driver handling
* @probe_finalize: Do final setup work after the device is added to an IOMMU
* group and attached to the groups domain
+ * @reset_device_done: Notify the driver that a device has reset successfully.
+ * Note that the core invokes the callback function while
+ * holding the group->mutex
* @device_group: find iommu group for a particular device
* @get_resv_regions: Request list of reserved regions for a device
* @of_xlate: add OF master IDs to iommu grouping
@@ -683,6 +686,7 @@ struct iommu_ops {
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
void (*probe_finalize)(struct device *dev);
+ void (*reset_device_done)(struct device *dev);
struct iommu_group *(*device_group)(struct device *dev);
/* Request/Free a list of reserved regions for a device */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 28d4c1f143a08..df23ef0a26e6c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -4071,12 +4071,14 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
{
struct iommu_group *group = pdev->dev.iommu_group;
+ const struct iommu_ops *ops;
struct group_device *gdev;
unsigned long pasid;
void *entry;
if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
return;
+ ops = dev_iommu_ops(&pdev->dev);
guard(mutex)(&group->mutex);
@@ -4105,6 +4107,16 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
return;
}
+ /*
+ * A PCI device might have been in an error state, so the IOMMU driver
+ * had to quarantine the device by disabling specific hardware features
+ * or dropping translation capability. Here notify the IOMMU driver as
+ * a reliable signal that the faulty PCI device has been cleanly reset
+ * so now it can lift its quarantine and restore full functionality.
+ */
+ if (ops->reset_device_done)
+ ops->reset_device_done(&pdev->dev);
+
/* Re-attach RID domain back to group->domain */
if (group->domain != group->blocking_domain) {
WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
--
2.43.0
^ permalink raw reply related
* [PATCH v3 02/11] iommu: Pass in reset result to pci_dev_reset_iommu_done()
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
IOMMU drivers handle ATC cache maintenance. They may encounter ATC-related
errors (e.g., ATC invalidation request timeout), indicating that ATC cache
might have stale entries that can corrupt the memory. In this case, IOMMU
driver has no choice but to block the device's ATS function and wait for a
device recovery.
The pci_dev_reset_iommu_done() called at the end of a reset function could
serve as a reliable signal to the IOMMU subsystem that the physical device
cache is completely clean. However, the function is called unconditionally
even if the reset operation had actually failed, which would re-attach the
faulty device back to a normal translation domain. And this will leave the
system highly exposed, creating vulnerabilities for data corruption:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
__reset(); // Failed (ATC is still stale)
pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)
Instead, add a @reset_succeeds parameter to pci_dev_reset_iommu_done() and
pass the reset result from each caller:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
rc = __reset();
pci_dev_reset_iommu_done(!rc); // Unblock or quarantine
On a successful reset, done() restores the device to its RID/PASID domains
and decrements group->recovery_cnt. On failure, the device remains blocked,
and concurrent domain attachment will be rejected until a successful reset.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 5 +++--
drivers/iommu/iommu.c | 28 +++++++++++++++++++++++++---
drivers/pci/pci-acpi.c | 2 +-
drivers/pci/pci.c | 10 +++++-----
drivers/pci/quirks.c | 2 +-
5 files changed, 35 insertions(+), 12 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 54b8b48c762e8..d3685967e960a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1191,7 +1191,7 @@ void iommu_free_global_pasid(ioasid_t pasid);
/* PCI device reset functions */
int pci_dev_reset_iommu_prepare(struct pci_dev *pdev);
-void pci_dev_reset_iommu_done(struct pci_dev *pdev);
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds);
#else /* CONFIG_IOMMU_API */
struct iommu_ops {};
@@ -1521,7 +1521,8 @@ static inline int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
return 0;
}
-static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+static inline void pci_dev_reset_iommu_done(struct pci_dev *pdev,
+ bool reset_succeeds)
{
}
#endif /* CONFIG_IOMMU_API */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ff181db687bbf..28d4c1f143a08 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -80,6 +80,7 @@ struct group_device {
* Device is blocked for a pending recovery while its group->domain is
* retained. This can happen when:
* - Device is undergoing a reset
+ * - Device failed the last reset
*/
bool blocked;
unsigned int reset_depth;
@@ -3971,7 +3972,9 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
* reset is finished, pci_dev_reset_iommu_done() can restore everything.
*
* Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to decrement the recovery_cnt.
+ * before/after the core-level reset routine. On a successful reset, done() will
+ * decrement group->recovery_cnt and restore domains. On a failure, recovery_cnt
+ * is left intact and the device stays blocked.
*
* Return: 0 on success or negative error code if the preparation failed.
*
@@ -4000,6 +4003,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
if (gdev->reset_depth++)
return 0;
+ /* Device might be already blocked for a quarantine */
+ if (gdev->blocked)
+ return 0;
ret = __iommu_group_alloc_blocking_domain(group);
if (ret)
@@ -4047,18 +4053,22 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
/**
* pci_dev_reset_iommu_done() - Restore IOMMU after a PCI device reset is done
* @pdev: PCI device that has finished a reset routine
+ * @reset_succeeds: Whether the PCI device reset is successful or not
*
* After a PCIe device finishes a reset routine, it wants to restore its IOMMU
* activity, including new translation and cache invalidation, by re-attaching
* all RID/PASID of the device back to the domains retained in the core-level
* structure.
*
- * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller should
+ * pass in the reset state via @reset_succeeds. On a failed reset, the device
+ * remains blocked for a quarantine with the group->recovery_cnt intact, so as
+ * to protect system memory until a subsequent successful reset.
*
* Note that, although unlikely, there is a risk that re-attaching domains might
* fail due to some unexpected happening like OOM.
*/
-void pci_dev_reset_iommu_done(struct pci_dev *pdev)
+void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
{
struct iommu_group *group = pdev->dev.iommu_group;
struct group_device *gdev;
@@ -4083,6 +4093,18 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
if (WARN_ON(!group->blocking_domain))
return;
+ /*
+ * A reset failure implies that the device might be unreliable. E.g. its
+ * device cache might retain stale entries, which potentially results in
+ * memory corruption. Thus, do not unblock the device until a successful
+ * reset.
+ */
+ if (!reset_succeeds) {
+ pci_err(pdev,
+ "Reset failed. Keep it blocked to protect memory\n");
+ return;
+ }
+
/* Re-attach RID domain back to group->domain */
if (group->domain != group->blocking_domain) {
WARN_ON(__iommu_attach_device(group->domain, &pdev->dev,
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 4d0f2cb6c695b..9ffd7f013a7d4 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -977,7 +977,7 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
ret = -ENOTTY;
}
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f1..d78e724027c78 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4358,7 +4358,7 @@ int pcie_flr(struct pci_dev *dev)
ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
done:
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4436,7 +4436,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
done:
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
@@ -4490,7 +4490,7 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
pci_dev_d3_sleep(dev);
ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
@@ -4933,7 +4933,7 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
rc = pci_parent_bus_reset(dev, probe);
done:
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !rc);
return rc;
}
@@ -4978,7 +4978,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
reg);
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !rc);
return rc;
}
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 05ce12b6b2f76..6ce79a25e5c76 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4271,7 +4271,7 @@ static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
}
ret = i->reset(dev, probe);
- pci_dev_reset_iommu_done(dev);
+ pci_dev_reset_iommu_done(dev, !ret);
return ret;
}
--
2.43.0
^ permalink raw reply related
* [PATCH v3 04/11] iommu: Add __iommu_group_block_device helper
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
Move the RID/PASID blocking routine into a separate helper, which will be
reused by a new function to quarantine the device but does not bother the
gdev->reset_depth counter.
No functional changes.
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 99 ++++++++++++++++++++++++-------------------
1 file changed, 56 insertions(+), 43 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index df23ef0a26e6c..768ac728b4cc3 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3958,6 +3958,57 @@ int iommu_replace_group_handle(struct iommu_group *group,
}
EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+static int __iommu_group_block_device(struct iommu_group *group,
+ struct group_device *gdev)
+{
+ unsigned long pasid;
+ void *entry;
+ int ret;
+
+ lockdep_assert_held(&group->mutex);
+
+ /* Device might be already blocked for a quarantine */
+ if (gdev->blocked)
+ return 0;
+
+ ret = __iommu_group_alloc_blocking_domain(group);
+ if (ret)
+ return ret;
+
+ /* Stage RID domain at blocking_domain while retaining group->domain */
+ if (group->domain != group->blocking_domain) {
+ ret = __iommu_attach_device(group->blocking_domain, gdev->dev,
+ group->domain);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Update gdev->blocked upon the domain change, as it is used to return
+ * the correct domain in iommu_driver_get_domain_for_dev() that might be
+ * called in a set_dev_pasid callback function.
+ */
+ gdev->blocked = true;
+
+ /*
+ * Stage PASID domains at blocking_domain while retaining pasid_array.
+ *
+ * The pasid_array is mostly fenced by group->mutex, except one reader
+ * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+ */
+ if (gdev->dev->iommu->max_pasids > 0) {
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+ struct iommu_domain *pasid_dom =
+ pasid_array_entry_to_domain(entry);
+
+ iommu_remove_dev_pasid(gdev->dev, pasid, pasid_dom);
+ }
+ }
+
+ group->recovery_cnt++;
+ return 0;
+}
+
/**
* pci_dev_reset_iommu_prepare() - Block IOMMU to prepare for a PCI device reset
* @pdev: PCI device that is going to enter a reset routine
@@ -3988,8 +4039,6 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
{
struct iommu_group *group = pdev->dev.iommu_group;
struct group_device *gdev;
- unsigned long pasid;
- void *entry;
int ret;
if (!pci_ats_supported(pdev) || !dev_has_iommu(&pdev->dev))
@@ -4003,50 +4052,14 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
if (gdev->reset_depth++)
return 0;
- /* Device might be already blocked for a quarantine */
- if (gdev->blocked)
- return 0;
-
- ret = __iommu_group_alloc_blocking_domain(group);
- if (ret)
- goto err_depth;
- /* Stage RID domain at blocking_domain while retaining group->domain */
- if (group->domain != group->blocking_domain) {
- ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
- group->domain);
- if (ret)
- goto err_depth;
- }
-
- /*
- * Update gdev->blocked upon the domain change, as it is used to return
- * the correct domain in iommu_driver_get_domain_for_dev() that might be
- * called in a set_dev_pasid callback function.
- */
- gdev->blocked = true;
-
- /*
- * Stage PASID domains at blocking_domain while retaining pasid_array.
- *
- * The pasid_array is mostly fenced by group->mutex, except one reader
- * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
- */
- if (pdev->dev.iommu->max_pasids > 0) {
- xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
- struct iommu_domain *pasid_dom =
- pasid_array_entry_to_domain(entry);
-
- iommu_remove_dev_pasid(&pdev->dev, pasid, pasid_dom);
- }
+ ret = __iommu_group_block_device(group, gdev);
+ if (ret) {
+ gdev->reset_depth--;
+ return ret;
}
- group->recovery_cnt++;
- return ret;
-
-err_depth:
- gdev->reset_depth--;
- return ret;
+ return 0;
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
--
2.43.0
^ permalink raw reply related
* [PATCH v3 05/11] iommu: Change group->devices to RCU-protected list
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
To allow lockless iterations of the group->devices list in an ISR context
that cannot hold the group->mutex, change the list to be RCU protected.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 768ac728b4cc3..d1be62a07904a 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -84,18 +84,20 @@ struct group_device {
*/
bool blocked;
unsigned int reset_depth;
+ struct rcu_head rcu;
};
/* Iterate over each struct group_device in a struct iommu_group */
#define for_each_group_device(group, pos) \
- list_for_each_entry(pos, &(group)->devices, list)
+ list_for_each_entry_rcu(pos, &(group)->devices, list, \
+ lockdep_is_held(&(group)->mutex))
static struct group_device *__dev_to_gdev(struct device *dev)
{
struct iommu_group *group = dev->iommu_group;
struct group_device *gdev;
- lockdep_assert_held(&group->mutex);
+ lockdep_assert(lockdep_is_held(&group->mutex) || rcu_read_lock_held());
for_each_group_device(group, gdev) {
if (gdev->dev == dev)
@@ -666,7 +668,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
* The gdev must be in the list before calling
* iommu_setup_default_domain()
*/
- list_add_tail(&gdev->list, &group->devices);
+ list_add_tail_rcu(&gdev->list, &group->devices);
WARN_ON(group->default_domain && !group->domain);
if (group->default_domain)
iommu_create_device_direct_mappings(group->default_domain, dev);
@@ -697,7 +699,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
return 0;
err_remove_gdev:
- list_del(&gdev->list);
+ list_del_rcu(&gdev->list);
__iommu_group_free_device(group, gdev);
err_put_group:
iommu_deinit_device(dev);
@@ -745,7 +747,7 @@ static void __iommu_group_free_device(struct iommu_group *group,
group->domain != group->default_domain);
kfree(grp_dev->name);
- kfree(grp_dev);
+ kfree_rcu(grp_dev, rcu);
}
/* Remove the iommu_group from the struct device. */
@@ -759,7 +761,7 @@ static void __iommu_group_remove_device(struct device *dev)
if (device->dev != dev)
continue;
- list_del(&device->list);
+ list_del_rcu(&device->list);
__iommu_group_free_device(group, device);
if (dev_has_iommu(dev))
iommu_deinit_device(dev);
@@ -1335,7 +1337,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
dev->iommu_group = group;
mutex_lock(&group->mutex);
- list_add_tail(&gdev->list, &group->devices);
+ list_add_tail_rcu(&gdev->list, &group->devices);
mutex_unlock(&group->mutex);
return 0;
}
--
2.43.0
^ permalink raw reply related
* [PATCH v3 06/11] iommu: Defer __iommu_group_free_device() to be outside group->mutex
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
__iommu_group_remove_device() holds group->mutex across the entire call to
__iommu_group_free_device() that performs sysfs removals, tracing, and the
final kfree_rcu(). But in fact, most of these operations don't really need
the group->mutex.
The group_device structure will support a work_struct to quarantine broken
devices asynchronously. The work function must hold group->mutex to safely
update group state. cancel_work_sync() must be called, to cancel that work
before freeing the device. But doing so under group->mutex would deadlock
if the worker is already running and waiting to acquire the same lock.
Separate the assertion from __iommu_group_free_device() to another helper
__iommu_group_empty_assert_owner_cnt().
Defer the __iommu_group_free_device() until the mutex is released.
This is a preparatory refactor with no functional change.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d1be62a07904a..810e7b94a1ae2 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -627,6 +627,19 @@ static struct iommu_domain *pasid_array_entry_to_domain(void *entry)
DEFINE_MUTEX(iommu_probe_device_lock);
+static void __iommu_group_empty_assert_owner_cnt(struct iommu_group *group)
+{
+ lockdep_assert_held(&group->mutex);
+ /*
+ * If the group has become empty then ownership must have been
+ * released, and the current domain must be set back to NULL or
+ * the default domain.
+ */
+ if (list_empty(&group->devices))
+ WARN_ON(group->owner_cnt ||
+ group->domain != group->default_domain);
+}
+
static int __iommu_probe_device(struct device *dev, struct list_head *group_list)
{
struct iommu_group *group;
@@ -700,10 +713,12 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
err_remove_gdev:
list_del_rcu(&gdev->list);
- __iommu_group_free_device(group, gdev);
+ __iommu_group_empty_assert_owner_cnt(group);
err_put_group:
iommu_deinit_device(dev);
mutex_unlock(&group->mutex);
+ if (!IS_ERR(gdev))
+ __iommu_group_free_device(group, gdev);
iommu_group_put(group);
return ret;
@@ -732,20 +747,13 @@ static void __iommu_group_free_device(struct iommu_group *group,
{
struct device *dev = grp_dev->dev;
+ lockdep_assert_not_held(&group->mutex);
+
sysfs_remove_link(group->devices_kobj, grp_dev->name);
sysfs_remove_link(&dev->kobj, "iommu_group");
trace_remove_device_from_group(group->id, dev);
- /*
- * If the group has become empty then ownership must have been
- * released, and the current domain must be set back to NULL or
- * the default domain.
- */
- if (list_empty(&group->devices))
- WARN_ON(group->owner_cnt ||
- group->domain != group->default_domain);
-
kfree(grp_dev->name);
kfree_rcu(grp_dev, rcu);
}
@@ -754,7 +762,7 @@ static void __iommu_group_free_device(struct iommu_group *group,
static void __iommu_group_remove_device(struct device *dev)
{
struct iommu_group *group = dev->iommu_group;
- struct group_device *device;
+ struct group_device *device, *to_free = NULL;
mutex_lock(&group->mutex);
for_each_group_device(group, device) {
@@ -762,15 +770,18 @@ static void __iommu_group_remove_device(struct device *dev)
continue;
list_del_rcu(&device->list);
- __iommu_group_free_device(group, device);
+ __iommu_group_empty_assert_owner_cnt(group);
if (dev_has_iommu(dev))
iommu_deinit_device(dev);
else
dev->iommu_group = NULL;
+ to_free = device;
break;
}
mutex_unlock(&group->mutex);
+ if (to_free)
+ __iommu_group_free_device(group, to_free);
/*
* Pairs with the get in iommu_init_device() or
* iommu_group_add_device()
--
2.43.0
^ permalink raw reply related
* [PATCH v3 07/11] iommu: Add iommu_report_device_broken() to quarantine a broken device
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
When an IOMMU hardware detects an error due to a faulty device (e.g. an ATS
invalidation timeout), IOMMU drivers may quarantine the device by disabling
specific hardware features or dropping translation capabilities.
However, the core-level states of the faulty device are out of sync, as the
device can be still attached to a translation domain or even potentially be
moved to a new domain that might overwrite the driver-level quarantine.
Given that such an error can likely be triggered from an ISR, introduce an
asynchronous broken_work per group_device, and provide a helper function to
allow driver initiate a quarantine in the core.
Note that the worker function must not use dev->iommu_group that is NULLed
by iommu_deinit_device() holding group->mutex. The cancel_work_sync() only
gets called afterwards outside the mutex. So, this would be a NULL pointer
dereference. Add a stable group backpointer to struct group_device instead.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 6 +++
drivers/iommu/iommu.c | 100 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 106 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 3c5c5fa5cdc6a..97d0e5b90c58f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -893,6 +893,8 @@ static inline struct iommu_device *__iommu_get_iommu_dev(struct device *dev)
#define iommu_get_iommu_dev(dev, type, member) \
container_of(__iommu_get_iommu_dev(dev), type, member)
+void iommu_report_device_broken(struct device *dev);
+
static inline void iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
{
*gather = (struct iommu_iotlb_gather) {
@@ -1207,6 +1209,10 @@ struct iommu_iotlb_gather {};
struct iommu_dirty_bitmap {};
struct iommu_dirty_ops {};
+static inline void iommu_report_device_broken(struct device *dev)
+{
+}
+
static inline bool device_iommu_capable(struct device *dev, enum iommu_cap cap)
{
return false;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 810e7b94a1ae2..bb00918e1b70d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -73,6 +73,7 @@ struct iommu_group {
};
struct group_device {
+ struct iommu_group *group;
struct list_head list;
struct device *dev;
char *name;
@@ -81,10 +82,12 @@ struct group_device {
* retained. This can happen when:
* - Device is undergoing a reset
* - Device failed the last reset
+ * - Device is broken and quarantined
*/
bool blocked;
unsigned int reset_depth;
struct rcu_head rcu;
+ struct work_struct broken_work;
};
/* Iterate over each struct group_device in a struct iommu_group */
@@ -170,6 +173,7 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
struct device *dev);
static void __iommu_group_free_device(struct iommu_group *group,
struct group_device *grp_dev);
+static void iommu_group_broken_worker(struct work_struct *work);
static void iommu_domain_init(struct iommu_domain *domain, unsigned int type,
const struct iommu_ops *ops);
@@ -752,6 +756,8 @@ static void __iommu_group_free_device(struct iommu_group *group,
sysfs_remove_link(group->devices_kobj, grp_dev->name);
sysfs_remove_link(&dev->kobj, "iommu_group");
+ /* Must wait for broken_work to prevent UAF */
+ cancel_work_sync(&grp_dev->broken_work);
trace_remove_device_from_group(group->id, dev);
kfree(grp_dev->name);
@@ -1284,6 +1290,8 @@ static struct group_device *iommu_group_alloc_device(struct iommu_group *group,
return ERR_PTR(-ENOMEM);
device->dev = dev;
+ device->group = group;
+ INIT_WORK(&device->broken_work, iommu_group_broken_worker);
ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
if (ret)
@@ -4178,6 +4186,98 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev, bool reset_succeeds)
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
+static void iommu_group_broken_worker(struct work_struct *work)
+{
+ struct group_device *gdev =
+ container_of(work, struct group_device, broken_work);
+ struct iommu_group *group = gdev->group;
+ struct device *dev = gdev->dev;
+
+ mutex_lock(&group->mutex);
+
+ /*
+ * iommu_deinit_device() frees dev->iommu under group->mutex. Bail
+ * out if the device has already been removed from IOMMU handling.
+ */
+ if (!dev_has_iommu(dev))
+ goto out_unlock;
+
+ if (gdev->blocked) {
+ dev_dbg(dev, "IOMMU has already quarantined the device\n");
+ goto out_unlock;
+ }
+
+ /*
+ * Quarantine the device completely. For a PCI device, it will be lifted
+ * upon a pci_dev_reset_iommu_done(pdev, succeeds=true) call indicating
+ * a device recovery.
+ *
+ * For a non-PCI device, currently it has no recovery framework tied to
+ * the IOMMU subsystem. Quarantine it indefinitely until a recovery path
+ * is introduced.
+ */
+ if (!WARN_ON(__iommu_group_block_device(group, gdev)))
+ dev_warn(dev, "IOMMU has quarantined the device\n");
+
+out_unlock:
+ mutex_unlock(&group->mutex);
+ iommu_group_put(group);
+}
+
+/**
+ * iommu_report_device_broken() - Report a broken device to quarantine it
+ * @dev: Device that has encountered an unrecoverable IOMMU-related error
+ *
+ * When an IOMMU driver detects a critical error caused by a device (e.g. an ATC
+ * invalidation timeout), this function should be used to quarantine the device
+ * at the IOMMU core level.
+ *
+ * The quarantine moves the device's RID and PASIDs to group->blocking_domain to
+ * prevent any further DMA/ATS activity that can potentially corrupt the system
+ * memory due to stale device cache entries.
+ *
+ * This function is safe to call from any context, including interrupt handlers,
+ * as it schedules the actual quarantine work asynchronously. The caller should
+ * have already taken driver-level measures (e.g., disabling ATS in hardware) to
+ * contain the fault immediately, before calling this function.
+ *
+ * For PCI devices, the quarantine will be lifted by a successful device reset
+ * via pci_dev_reset_iommu_done(). For non-PCI devices, the quarantine remains
+ * in effect indefinitely until a recovery mechanism is introduced.
+ *
+ * If the device is concurrently being removed or has already been removed from
+ * the IOMMU subsystem, this function will silently return without any action.
+ */
+void iommu_report_device_broken(struct device *dev)
+{
+ struct iommu_group *group = iommu_group_get(dev);
+ struct group_device *gdev;
+ bool scheduled = false;
+
+ if (!group)
+ return;
+ if (!dev_has_iommu(dev))
+ goto out;
+
+ rcu_read_lock();
+ /*
+ * Note the device might have been concurrently removed from the group
+ * (list_del_rcu) before iommu_deinit_device() cleared the dev->iommu.
+ */
+ list_for_each_entry_rcu(gdev, &group->devices, list) {
+ if (gdev->dev != dev)
+ continue;
+ /* iommu_group_broken_worker() must put the group ref */
+ scheduled = schedule_work(&gdev->broken_work);
+ break;
+ }
+ rcu_read_unlock();
+out:
+ if (!scheduled)
+ iommu_group_put(group);
+}
+EXPORT_SYMBOL_GPL(iommu_report_device_broken);
+
#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
/**
* iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
--
2.43.0
^ permalink raw reply related
* [PATCH v3 08/11] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware is
aware of the timeout via a GERROR interrupt, the driver thread issuing the
commands lacks a direct mechanism to verify whether its specific batch was
the cause or not, as polling the CMD_SYNC status doesn't natively return a
failure code, making it very difficult to coordinate per-device recovery.
Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
gap. When the ISR detects an ATC timeout, set the bit corresponding to the
physical CMDQ index of the faulting CMD_SYNC command.
On the issuer side, after polling completes (or times out), test and clear
its dedicated bit. If set, return -EIO to trigger device quarantine.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 41 ++++++++++++++++++++-
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index ef42df4753ec4..1d72e5040ea97 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -633,6 +633,7 @@ struct arm_smmu_cmdq {
atomic_long_t *valid_map;
atomic_t owner_prod;
atomic_t lock;
+ unsigned long *atc_sync_timeouts;
bool (*supports_cmd)(struct arm_smmu_cmdq_ent *ent);
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f6901c5437edc..f47943f860f3d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -445,7 +445,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
* at the CMD_SYNC. Attempt to complete other pending commands
* by repeating the CMD_SYNC, though we might well end up back
* here since the ATC invalidation may still be pending.
+ *
+ * Mark the faulty batch in the bitmap for the issuer to match.
*/
+ set_bit(Q_IDX(&q->llq, cons), cmdq->atc_sync_timeouts);
return;
case CMDQ_ERR_CERROR_ILL_IDX:
default:
@@ -895,9 +898,40 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
/* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
if (sync) {
+ u32 sync_prod;
+
llq.prod = queue_inc_prod_n(&llq, n);
+ sync_prod = llq.prod;
+ /*
+ * If there is an unhandled ATC timeout, we will have no choice
+ * but to ignore it, since this was left on the ring buffer in
+ * the last round. And we certainly don't want it to affect the
+ * current issue.
+ */
+ clear_bit(Q_IDX(&llq, sync_prod), cmdq->atc_sync_timeouts);
+
ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
- if (ret) {
+
+ /*
+ * Test atc_sync_timeouts first and see if there is ATC timeout
+ * resulted from this cmdlist. Return -EIO to separate from the
+ * ARM_SMMU_POLL_TIMEOUT_US software timeout.
+ *
+ * FIXME possible unhandled ATC invalidation timeout scenario:
+ * PCI Completion Timeout can be set to a range longer than the
+ * ARM_SMMU_POLL_TIMEOUT_US software timeout. -ETIMEDOUT can be
+ * returned by arm_smmu_cmdq_poll_until_sync() while ATC timeout
+ * might not be flagged on atc_sync_timeouts yet. In this case,
+ * we can hardly do anything here since the command queue HW is
+ * still pending on the ATC command.
+ */
+ if (test_and_clear_bit(Q_IDX(&llq, sync_prod),
+ cmdq->atc_sync_timeouts)) {
+ dev_err_ratelimited(smmu->dev,
+ "CMD_SYNC for ATC_INV timeout at prod=0x%08x\n",
+ sync_prod);
+ ret = -EIO;
+ } else if (ret) {
dev_err_ratelimited(smmu->dev,
"CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
llq.prod,
@@ -4458,6 +4492,11 @@ int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
if (!cmdq->valid_map)
return -ENOMEM;
+ cmdq->atc_sync_timeouts =
+ devm_bitmap_zalloc(smmu->dev, nents, GFP_KERNEL);
+ if (!cmdq->atc_sync_timeouts)
+ return -ENOMEM;
+
return 0;
}
--
2.43.0
^ permalink raw reply related
* [PATCH v3 09/11] iommu/arm-smmu-v3: Replace smmu with master in arm_smmu_inv
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
Storing master allows to backtrack the master pointer from an invalidation
entry, which will be useful when handling ATC invalidation time outs.
No functional changes.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 34 +++++++++++--------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++------
3 files changed, 33 insertions(+), 27 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1d72e5040ea97..26e0ee0bb5b45 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -662,7 +662,7 @@ enum arm_smmu_inv_type {
};
struct arm_smmu_inv {
- struct arm_smmu_device *smmu;
+ struct arm_smmu_master *master;
u8 type;
u8 size_opcode;
u8 nsize_opcode;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
index add671363c828..ef0c0bfe44206 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -653,39 +653,43 @@ static void arm_smmu_v3_invs_test_verify(struct kunit *test,
}
}
+static struct arm_smmu_master invs_master = {
+ .smmu = &smmu,
+};
+
static struct arm_smmu_invs invs1 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, },
- { .type = INV_TYPE_S2_VMID_S1_CLEAR, .id = 1, },
- { .type = INV_TYPE_ATS, .id = 3, }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, },
+ { .master = &invs_master, .type = INV_TYPE_S2_VMID_S1_CLEAR, .id = 1, },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 3, }, },
};
static struct arm_smmu_invs invs2 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
- { .type = INV_TYPE_ATS, .id = 4, },
- { .type = INV_TYPE_ATS, .id = 5, }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, }, /* dup */
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 4, },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 5, }, },
};
static struct arm_smmu_invs invs3 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
- { .type = INV_TYPE_ATS, .id = 5, }, /* recover a trash */
- { .type = INV_TYPE_ATS, .id = 6, }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_S2_VMID, .id = 1, }, /* dup */
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 5, }, /* recover a trash */
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 6, }, },
};
static struct arm_smmu_invs invs4 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_ATS, .id = 10, .ssid = 1 },
- { .type = INV_TYPE_ATS, .id = 10, .ssid = 3 },
- { .type = INV_TYPE_ATS, .id = 12, .ssid = 1 }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 1 },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 3 },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 12, .ssid = 1 }, },
};
static struct arm_smmu_invs invs5 = {
.num_invs = 3,
- .inv = { { .type = INV_TYPE_ATS, .id = 10, .ssid = 2 },
- { .type = INV_TYPE_ATS, .id = 10, .ssid = 3 }, /* duplicate */
- { .type = INV_TYPE_ATS, .id = 12, .ssid = 2 }, },
+ .inv = { { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 2 },
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 10, .ssid = 3 }, /* dup */
+ { .master = &invs_master, .type = INV_TYPE_ATS, .id = 12, .ssid = 2 }, },
};
static void arm_smmu_v3_invs_test(struct kunit *test)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f47943f860f3d..13f225f704e73 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1092,8 +1092,9 @@ arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
static int arm_smmu_inv_cmp(const struct arm_smmu_inv *inv_l,
const struct arm_smmu_inv *inv_r)
{
- if (inv_l->smmu != inv_r->smmu)
- return cmp_int((uintptr_t)inv_l->smmu, (uintptr_t)inv_r->smmu);
+ if (inv_l->master->smmu != inv_r->master->smmu)
+ return cmp_int((uintptr_t)inv_l->master->smmu,
+ (uintptr_t)inv_r->master->smmu);
if (inv_l->type != inv_r->type)
return cmp_int(inv_l->type, inv_r->type);
if (inv_l->id != inv_r->id)
@@ -2650,22 +2651,22 @@ static void arm_smmu_inv_to_cmdq_batch(struct arm_smmu_inv *inv,
unsigned long iova, size_t size,
unsigned int granule)
{
- if (arm_smmu_inv_size_too_big(inv->smmu, size, granule)) {
+ if (arm_smmu_inv_size_too_big(inv->master->smmu, size, granule)) {
cmd->opcode = inv->nsize_opcode;
- arm_smmu_cmdq_batch_add(inv->smmu, cmds, cmd);
+ arm_smmu_cmdq_batch_add(inv->master->smmu, cmds, cmd);
return;
}
cmd->opcode = inv->size_opcode;
- arm_smmu_cmdq_batch_add_range(inv->smmu, cmds, cmd, iova, size, granule,
- inv->pgsize);
+ arm_smmu_cmdq_batch_add_range(inv->master->smmu, cmds, cmd, iova, size,
+ granule, inv->pgsize);
}
static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
struct arm_smmu_inv *next)
{
/* Changing smmu means changing command queue */
- if (cur->smmu != next->smmu)
+ if (cur->master->smmu != next->master->smmu)
return true;
/* The batch for S2 TLBI must be done before nested S1 ASIDs */
if (cur->type != INV_TYPE_S2_VMID_S1_CLEAR &&
@@ -2692,7 +2693,7 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
if (READ_ONCE(cur->users))
break;
while (cur != end) {
- struct arm_smmu_device *smmu = cur->smmu;
+ struct arm_smmu_device *smmu = cur->master->smmu;
struct arm_smmu_cmdq_ent cmd = {
/*
* Pick size_opcode to run arm_smmu_get_cmdq(). This can
@@ -2721,7 +2722,8 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
break;
case INV_TYPE_S2_VMID_S1_CLEAR:
/* CMDQ_OP_TLBI_S12_VMALL already flushed S1 entries */
- if (arm_smmu_inv_size_too_big(cur->smmu, size, granule))
+ if (arm_smmu_inv_size_too_big(cur->master->smmu, size,
+ granule))
break;
cmd.tlbi.vmid = cur->id;
arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
@@ -3246,7 +3248,7 @@ arm_smmu_master_build_inv(struct arm_smmu_master *master,
{
struct arm_smmu_invs *build_invs = master->build_invs;
struct arm_smmu_inv *cur, inv = {
- .smmu = master->smmu,
+ .master = master,
.type = type,
.id = id,
.pgsize = pgsize,
@@ -3478,7 +3480,7 @@ static void arm_smmu_inv_flush_iotlb_tag(struct arm_smmu_inv *inv)
}
cmd.opcode = inv->nsize_opcode;
- arm_smmu_cmdq_issue_cmd_with_sync(inv->smmu, &cmd);
+ arm_smmu_cmdq_issue_cmd_with_sync(inv->master->smmu, &cmd);
}
/* Should be installed after arm_smmu_install_ste_for_dev() */
--
2.43.0
^ permalink raw reply related
* [PATCH v3 10/11] iommu/arm-smmu-v3: Introduce master->ats_broken flag
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
The flag will be set when IOMMU cannot trust device's ATS function. E.g.,
when ATC invalidation request to the device times out.
Once it is set, unsupport the ATS feature to prevent data corruption, and
skip further ATC invalidation commands to avoid new timeouts.
Unset the flag when the device finishes a reset for recovery.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 28 +++++++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 26e0ee0bb5b45..95bce9966374a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -941,6 +941,7 @@ struct arm_smmu_master {
/* Locked by the iommu core using the group mutex */
struct arm_smmu_ctx_desc_cfg cd_table;
unsigned int num_streams;
+ bool ats_broken;
bool ats_enabled : 1;
bool ste_ats_enabled : 1;
bool stall_enabled;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 13f225f704e73..5dead82cf1186 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2523,6 +2523,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
struct arm_smmu_cmdq_ent cmd;
struct arm_smmu_cmdq_batch cmds;
+ /* Do not issue ATC_INV that will definitely time out */
+ if (READ_ONCE(master->ats_broken))
+ return 0;
+
arm_smmu_atc_inv_to_cmd(ssid, 0, 0, &cmd);
arm_smmu_cmdq_batch_init(master->smmu, &cmds, &cmd);
@@ -2729,11 +2733,17 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
break;
case INV_TYPE_ATS:
+ /* Do not issue ATC_INV that will definitely time out */
+ if (READ_ONCE(cur->master->ats_broken))
+ break;
arm_smmu_atc_inv_to_cmd(cur->ssid, iova, size, &cmd);
cmd.atc.sid = cur->id;
arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
break;
case INV_TYPE_ATS_FULL:
+ /* Do not issue ATC_INV that will definitely time out */
+ if (READ_ONCE(cur->master->ats_broken))
+ break;
arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
cmd.atc.sid = cur->id;
arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
@@ -3069,6 +3079,15 @@ void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
}
}
+static void arm_smmu_reset_device_done(struct device *dev)
+{
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+ if (WARN_ON(!master))
+ return;
+ WRITE_ONCE(master->ats_broken, false);
+}
+
static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
{
struct device *dev = master->dev;
@@ -3081,6 +3100,14 @@ static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_ATS))
return false;
+ /*
+ * Do not enable ATS if master->ats_broken is set. The PCI device should
+ * go through a recovery (reset) that shall notify the SMMUv3 driver via
+ * a reset_device_done callback.
+ */
+ if (READ_ONCE(master->ats_broken))
+ return false;
+
return dev_is_pci(dev) && pci_ats_supported(to_pci_dev(dev));
}
@@ -4412,6 +4439,7 @@ static const struct iommu_ops arm_smmu_ops = {
.domain_alloc_paging_flags = arm_smmu_domain_alloc_paging_flags,
.probe_device = arm_smmu_probe_device,
.release_device = arm_smmu_release_device,
+ .reset_device_done = arm_smmu_reset_device_done,
.device_group = arm_smmu_device_group,
.of_xlate = arm_smmu_of_xlate,
.get_resv_regions = arm_smmu_get_resv_regions,
--
2.43.0
^ permalink raw reply related
* [PATCH v3 11/11] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout
From: Nicolin Chen @ 2026-04-16 23:28 UTC (permalink / raw)
To: Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Helgaas,
Jason Gunthorpe
Cc: Rafael J . Wysocki, Len Brown, Pranjal Shrivastava, Mostafa Saleh,
Lu Baolu, Kevin Tian, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, vsethi, Shuai Xue
In-Reply-To: <cover.1776381841.git.nicolinc@nvidia.com>
Currently, when GERROR_CMDQ_ERR occurs, the arm_smmu_cmdq_skip_err() won't
do anything for the CMDQ_ERR_CERROR_ATC_INV_IDX.
When a device wasn't responsive to an ATC invalidation request, this often
results in constant CMDQ errors:
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb84): ATC invalidate timeout
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb88): ATC invalidate timeout
unexpected global error reported (0x00000001), this could be serious
CMDQ error (cons 0x0302bb8c): ATC invalidate timeout
...
An ATC invalidation timeout indicates that the device failed to respond to
a protocol-critical coherency request, which means that device's internal
ATS state is desynchronized from the SMMU.
Furthermore, ignoring the timeout leaves the system in an unsafe state, as
the device cache may retain stale ATC entries for memory pages that the OS
has already reclaimed and reassigned. This might lead to data corruption.
Isolate the device that is confirmed to be unresponsive by a surgical STE
update to unset its EATS bit so as to reject any further ATS transaction,
which could corrupt the memory.
Also, set the master->ats_broken flag that is revertible after the device
completes a reset. This flag avoids further ATS requests and invalidations
from happening.
Finally, report this broken device to the IOMMU core to isolate the device
in the core level too.
For batched ATC_INV commands, SMMU hardware only reports a timeout at the
CMD_SYNC, which could follow the batch issued for multiple devices. So, it
isn't straightforward to identify which command in a batch resulted in the
timeout. Fortunately, the invs array has a sorted list of ATC entries. So,
the issued batch must be sorted as well. This makes it possible to retry
the ATC_INV command for each unique Stream ID in the batch to identify the
unresponsive master.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 100 +++++++++++++++++++-
1 file changed, 97 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5dead82cf1186..7dbd9c5834314 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -107,6 +107,8 @@ static const char * const event_class_str[] = {
[3] = "Reserved",
};
+static struct arm_smmu_ste *
+arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid);
static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
static void parse_driver_options(struct arm_smmu_device *smmu)
@@ -2516,10 +2518,49 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
cmd->atc.size = log2_span;
}
+static void arm_smmu_disable_eats_for_sid(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq *cmdq, u32 sid)
+{
+ struct arm_smmu_cmdq_ent ent = {
+ .opcode = CMDQ_OP_CFGI_STE,
+ .cfgi = {
+ .sid = sid,
+ .leaf = true,
+ },
+ };
+ struct arm_smmu_ste *step;
+ u64 cmd[CMDQ_ENT_DWORDS];
+ __le64 old, new;
+
+ step = arm_smmu_get_step_for_sid(smmu, sid);
+
+ old = READ_ONCE(step->data[1]);
+ do {
+ new = old & cpu_to_le64(~STRTAB_STE_1_EATS);
+ } while (!try_cmpxchg64(&step->data[1], &old, new));
+
+ arm_smmu_cmdq_build_cmd(cmd, &ent);
+ if (arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, cmd, 1, true))
+ dev_err_ratelimited(smmu->dev,
+ "failed to disable ATS for sid %#x\n", sid);
+}
+
+static void arm_smmu_master_disable_ats(struct arm_smmu_master *master,
+ struct arm_smmu_cmdq *cmdq)
+{
+ int i;
+
+ for (i = 0; i < master->num_streams; i++)
+ arm_smmu_disable_eats_for_sid(master->smmu, cmdq,
+ master->streams[i].id);
+ WRITE_ONCE(master->ats_broken, true);
+ iommu_report_device_broken(master->dev);
+}
+
static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
ioasid_t ssid)
{
- int i;
+ int i, ret;
struct arm_smmu_cmdq_ent cmd;
struct arm_smmu_cmdq_batch cmds;
@@ -2535,7 +2576,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
}
- return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+ ret = arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+ if (ret == -EIO)
+ arm_smmu_master_disable_ats(master, cmds.cmdq);
+ return ret;
}
/* IO_PGTABLE API */
@@ -2682,6 +2726,55 @@ static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
return false;
}
+static void arm_smmu_invs_disable_ats(struct arm_smmu_invs *invs,
+ struct arm_smmu_cmdq *cmdq,
+ struct arm_smmu_device *smmu, u32 sid)
+{
+ struct arm_smmu_inv *cur;
+ size_t i;
+
+ arm_smmu_invs_for_each_entry(invs, i, cur) {
+ if (cur->master->smmu == smmu && arm_smmu_inv_is_ats(cur) &&
+ cur->id == sid) {
+ arm_smmu_master_disable_ats(cur->master, cmdq);
+ break;
+ }
+ }
+}
+
+static void arm_smmu_cmdq_batch_retry(struct arm_smmu_device *smmu,
+ struct arm_smmu_invs *invs,
+ struct arm_smmu_cmdq_batch *cmds)
+{
+ u64 atc[CMDQ_ENT_DWORDS] = {0};
+ int i;
+
+ /* Only a timed out ATC_INV command needs a retry */
+ if (!invs->has_ats)
+ return;
+
+ for (i = 0; i < cmds->num * CMDQ_ENT_DWORDS; i += CMDQ_ENT_DWORDS) {
+ struct arm_smmu_cmdq *cmdq = cmds->cmdq;
+ u32 sid;
+ int ret;
+
+ /* Only need to retry ATC invalidations */
+ if (FIELD_GET(CMDQ_0_OP, cmds->cmds[i]) != CMDQ_OP_ATC_INV)
+ continue;
+
+ /* Only need to retry with one ATC_INV per Stream ID (device) */
+ sid = FIELD_GET(CMDQ_ATC_0_SID, cmds->cmds[i]);
+ if (atc[0] && sid == FIELD_GET(CMDQ_ATC_0_SID, atc[0]))
+ continue;
+
+ atc[0] = cmds->cmds[i];
+ atc[1] = cmds->cmds[i + 1];
+ ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmdq, atc, 1, true);
+ if (ret == -EIO)
+ arm_smmu_invs_disable_ats(invs, cmdq, smmu, sid);
+ }
+}
+
static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
unsigned long iova, size_t size,
unsigned int granule, bool leaf)
@@ -2760,7 +2853,8 @@ static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
if (cmds.num &&
(next == end || arm_smmu_invs_end_batch(cur, next))) {
- arm_smmu_cmdq_batch_submit(smmu, &cmds);
+ if (arm_smmu_cmdq_batch_submit(smmu, &cmds) == -EIO)
+ arm_smmu_cmdq_batch_retry(smmu, invs, &cmds);
cmds.num = 0;
}
cur = next;
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v2 2/4] KVM: arm64: sefltests: Add helpers for guest hypervisors
From: Itaru Kitayama @ 2026-04-16 23:39 UTC (permalink / raw)
To: Wei-Lin Chang
Cc: linux-arm-kernel, kvmarm, kvm, linux-kselftest, linux-kernel,
Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Paolo Bonzini,
Shuah Khan
In-Reply-To: <2grav6nkabbab3e4gm4vtca63kqrpaegq4xqlhdhwc4l32v65i@x5ksiimtk4er>
On Thu, Apr 16, 2026 at 11:15:57PM +0100, Wei-Lin Chang wrote:
> On Wed, Apr 15, 2026 at 07:14:46AM +0900, Itaru Kitayama wrote:
> > On Sun, Apr 12, 2026 at 03:22:14PM +0100, Wei-Lin Chang wrote:
> > > Add helpers so that guest hypervisors can run nested guests. SP_EL1
> > > save/restore is added to allow nested guests to use a stack.
> > >
> > > Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
> > > ---
> > > .../selftests/kvm/include/arm64/nested.h | 17 +++++++
> > > tools/testing/selftests/kvm/lib/arm64/entry.S | 5 ++
> > > .../testing/selftests/kvm/lib/arm64/nested.c | 46 +++++++++++++++++++
> > > 3 files changed, 68 insertions(+)
> > >
> > > diff --git a/tools/testing/selftests/kvm/include/arm64/nested.h b/tools/testing/selftests/kvm/include/arm64/nested.h
> > > index 86d931facacb..7928ef89494a 100644
> > > --- a/tools/testing/selftests/kvm/include/arm64/nested.h
> > > +++ b/tools/testing/selftests/kvm/include/arm64/nested.h
> > > @@ -21,8 +21,17 @@
> > >
> > > extern char hyp_vectors[];
> > >
> > > +enum vcpu_sysreg {
> > > + __INVALID_SYSREG__, /* 0 is reserved as an invalid value */
> > > +
> > > + SP_EL1,
> > > +
> > > + NR_SYS_REGS
> > > +};
> > > +
> > > struct cpu_context {
> > > struct user_pt_regs regs; /* sp = sp_el0 */
> > > + u64 sys_regs[NR_SYS_REGS];
> > > };
> > >
> > > struct vcpu {
> > > @@ -37,9 +46,17 @@ struct hyp_data {
> > > struct cpu_context hyp_context;
> > > };
> >
> > I am not sure of these structs you introduced only for nested guest feature
> > testing, as the KVM arm64 code they are quite complex and involved,
> > extracring part of those and add members as hello_nested or simliar
> > tests evolve, then add test cases to me seems fragile.
> > But if you have strong reason to add these would you mind explaining a bit?
>
> Sorry, I don't quite get all of your points. I understand your argument
> being evolving these structs as time goes is fragile. For this didn't
> KVM itself evolve like this?
>
> As for having these structs, how can we make L1 a small hypervisor
> without them?
You're correct and I was wrong. We will just have to change the structs for
nested virtualization selftests as we add more test cases.
Itaru.
>
> Thanks,
> Wei-Lin Chang
>
> >
> > Thanks,
> > Itaru.
> >
> > >
> > > +void prepare_hyp(void);
> > > +void init_vcpu(struct vcpu *vcpu, vm_paddr_t l2_pc, vm_paddr_t l2_stack_top);
> > > +int run_l2(struct vcpu *vcpu, struct hyp_data *hyp_data);
> > > +
> > > +void do_hvc(void);
> > > u64 __guest_enter(struct vcpu *vcpu, struct cpu_context *hyp_context);
> > > void __hyp_exception(u64 type);
> > >
> > > +void __sysreg_save_el1_state(struct cpu_context *ctxt);
> > > +void __sysreg_restore_el1_state(struct cpu_context *ctxt);
> > > +
> > > #endif /* !__ASSEMBLER__ */
> > >
> > > #endif /* SELFTEST_KVM_NESTED_H */
> > > diff --git a/tools/testing/selftests/kvm/lib/arm64/entry.S b/tools/testing/selftests/kvm/lib/arm64/entry.S
> > > index 33bedf5e7fb2..df3af3463c6c 100644
> > > --- a/tools/testing/selftests/kvm/lib/arm64/entry.S
> > > +++ b/tools/testing/selftests/kvm/lib/arm64/entry.S
> > > @@ -3,6 +3,11 @@
> > > * adapted from arch/arm64/kvm/hyp/entry.S
> > > */
> > >
> > > + .globl do_hvc
> > > + do_hvc:
> > > + hvc #0
> > > + ret
> > > +
> > > /*
> > > * Manually define these for now
> > > */
> > > diff --git a/tools/testing/selftests/kvm/lib/arm64/nested.c b/tools/testing/selftests/kvm/lib/arm64/nested.c
> > > index 06ddaab2436f..b30d20b101c4 100644
> > > --- a/tools/testing/selftests/kvm/lib/arm64/nested.c
> > > +++ b/tools/testing/selftests/kvm/lib/arm64/nested.c
> > > @@ -4,7 +4,53 @@
> > > */
> > >
> > > #include "nested.h"
> > > +#include "processor.h"
> > > #include "test_util.h"
> > > +#include <asm/sysreg.h>
> > > +
> > > +void prepare_hyp(void)
> > > +{
> > > + write_sysreg(HCR_EL2_E2H | HCR_EL2_RW, hcr_el2);
> > > + write_sysreg(hyp_vectors, vbar_el2);
> > > + isb();
> > > +}
> > > +
> > > +void init_vcpu(struct vcpu *vcpu, vm_paddr_t l2_pc, vm_paddr_t l2_stack_top)
> > > +{
> > > + memset(vcpu, 0, sizeof(*vcpu));
> > > + vcpu->context.regs.pc = l2_pc;
> > > + vcpu->context.regs.pstate = PSR_MODE_EL1h | PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT;
> > > + vcpu->context.sys_regs[SP_EL1] = l2_stack_top;
> > > +}
> > > +
> > > +void __sysreg_save_el1_state(struct cpu_context *ctxt)
> > > +{
> > > + ctxt->sys_regs[SP_EL1] = read_sysreg(sp_el1);
> > > +}
> > > +
> > > +void __sysreg_restore_el1_state(struct cpu_context *ctxt)
> > > +{
> > > + write_sysreg(ctxt->sys_regs[SP_EL1], sp_el1);
> > > +}
> > > +
> > > +int run_l2(struct vcpu *vcpu, struct hyp_data *hyp_data)
> > > +{
> > > + u64 ret;
> > > +
> > > + __sysreg_restore_el1_state(&vcpu->context);
> > > +
> > > + write_sysreg(vcpu->context.regs.pstate, spsr_el2);
> > > + write_sysreg(vcpu->context.regs.pc, elr_el2);
> > > +
> > > + ret = __guest_enter(vcpu, &hyp_data->hyp_context);
> > > +
> > > + vcpu->context.regs.pc = read_sysreg(elr_el2);
> > > + vcpu->context.regs.pstate = read_sysreg(spsr_el2);
> > > +
> > > + __sysreg_save_el1_state(&vcpu->context);
> > > +
> > > + return ret;
> > > +}
> > >
> > > void __hyp_exception(u64 type)
> > > {
> > > --
> > > 2.43.0
> > >
^ permalink raw reply
* Re: [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests
From: Yang Shi @ 2026-04-16 23:41 UTC (permalink / raw)
To: Kevin Brodsky, Catalin Marinas
Cc: Ryan Roberts, Will Deacon, David Hildenbrand (Arm), Dev Jain,
Suzuki K Poulose, Jinjiang Tu, linux-arm-kernel, linux-kernel,
stable, Kalyazin, Nikita
In-Reply-To: <315131b7-237b-4705-ba84-e03a484128da@arm.com>
On 4/13/26 7:57 AM, Kevin Brodsky wrote:
> On 10/04/2026 01:08, Yang Shi wrote:
>> On 4/9/26 11:33 AM, Catalin Marinas wrote:
>>> On Thu, Apr 09, 2026 at 09:48:58AM -0700, Yang Shi wrote:
>>>> On 4/9/26 8:20 AM, Catalin Marinas wrote:
>>>>> On Thu, Apr 09, 2026 at 11:53:41AM +0200, Kevin Brodsky wrote:
>>>>>> What would make more sense to me is to enable the use of
>>>>>> BBML2-noabort
>>>>>> unconditionally if !force_pte_mapping(). We can then have
>>>>>> can_set_direct_map() return true if we have BBML2-noabort, and we no
>>>>>> longer need to check it in map_mem().
>>>>> Indeed.
>>>> I'm trying to wrap up my head for this discussion. IIUC, if none of the
>>>> features is enabled, it means we don't need do anything because the
>>>> direct
>>>> map is not changed. For example, if vmalloc doesn't change direct map
>>>> permission when rodata != full, there is no need to call
>>>> set_direct_map_*_noflush(). So unconditionally checking
>>>> BBML2_NOABORT will
>>>> change the behavior unnecessarily. Did I miss something?
>>>>
>>>> I think the only exception is secretmem if I don't miss something.
>>>> Currently, secretmem is actually not supported if none of the
>>>> features is
>>>> enabled. But BBML2_NOABORT allows to lift the restriction.
>>> Yes, it's secretmem only AFAICT. I think execmem will only change the
>>> linear map if rodata_full anyway.
>> Yes, execmem calls set_memory_rox(), which won't change linear map
>> permission if rodata_full is not enabled.
> That is a good point, AFAICT set_direct_map_*_noflush() are only used by
> execmem and secretmem. excmem only modifies the direct map if
> rodata=full, so the proposed change would only be useful for secretmem.
>
> The current situation with execmem is pretty strange: if rodata!=full,
> but another feature is enabled (say kfence), then set_memory_rox() won't
> touch the direct map but we will still use set_direct_map_*_noflush() to
> reset it (directly or via VM_FLUSH_RESET_PERMS). Checking BBML2-noabort
> in can_set_direct_map() would make these unnecessary calls more likely,
> but it doesn't fundamentally change the situation.
>
> It's also worth considering the series unmapping parts of the direct map
> for guest_memfd [1], since it gates the use of
> set_direct_map_*_noflush() on can_set_direct_map().
>
> I think it makes complete sense to enable secretmem and the guest_memfd
> use-case if BBML2-noabort is available, regardless of the other
> features. The question is: are we worried about the overhead of
Yes, agreed.
> needlessly calling set_direct_map_*_noflush() for execmem mappings? If
> so, it seems that the right solution is to introduce a new API to check
> whether set_memory_ro() and friends actually modify the direct map or not.
I don't have data regarding the overhead. The set_direct_map_*_noflush()
does walk the page table and they will be called for each page for the
area. It sounds not cheap anyway. In addition, it may split direct map
into smaller granules unnecessarily, it may result in unexpected direct
map fragmentation when rodata != full.
So it seems like introducing a new API is worth it.
Thanks,
Yang
>
> - Kevin
>
> [1] https://lore.kernel.org/lkml/20260317141031.514-1-kalyazin@amazon.com/
^ permalink raw reply
* Re: [PATCH v5 06/12] coresight: etm4x: fix leaked trace id
From: Jie Gan @ 2026-04-17 1:01 UTC (permalink / raw)
To: Leo Yan, Yeoreum Yun
Cc: coresight, linux-arm-kernel, linux-kernel, suzuki.poulose,
mike.leach, james.clark, alexander.shishkin
In-Reply-To: <20260416165541.GN356832@e132581.arm.com>
On 4/17/2026 12:55 AM, Leo Yan wrote:
> On Wed, Apr 15, 2026 at 05:55:22PM +0100, Yeoreum Yun wrote:
>> If etm4_enable_sysfs() fails in cscfg_csdev_enable_active_config(),
>> the trace ID may be leaked because it is not released.
>>
>> To address this, call etm4_release_trace_id() when etm4_enable_sysfs()
>> fails in cscfg_csdev_enable_active_config().
>>
>> Reviewed-by: Jie Gan <jie.gan@oss.qualcomm.com>
>> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
>> ---
>> drivers/hwtracing/coresight/coresight-etm4x-core.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
>> index f55338a4989d..b199aebbdb60 100644
>> --- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
>> +++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
>> @@ -920,8 +920,10 @@ static int etm4_enable_sysfs(struct coresight_device *csdev, struct coresight_pa
>> cscfg_config_sysfs_get_active_cfg(&cfg_hash, &preset);
>> if (cfg_hash) {
>> ret = cscfg_csdev_enable_active_config(csdev, cfg_hash, preset);
>> - if (ret)
>> + if (ret) {
>> + etm4_release_trace_id(drvdata);
>> return ret;
>> + }
>
> LGTM:
>
> Reviewed-by: Leo Yan <leo.yan@arm.com>
>
> Just recording a bit thoughts. As Suzuki mentioned, it would be better
> to allocate trace IDs within a session. We might consider maintaining
> the trace ID map in the sink driver data, since the sink driver is
> unique within a session so it is a central place to allocate trace ID.
>
> We should use paired way for allocation and release. For example:
>
> coresight_enable_sysfs()
> {
> ...
> coresight_path_assign_trace_id(path);
>
> failed:
> coresight_path_unassign_trace_id(path);
> }
>
> coresight_disable_sysfs()
> {
> coresight_path_unassign_trace_id(path);
> }
>
> But this requires broader refactoring. E.g., the STM driver currently
> allocates system trace IDs statically during probe, we might need to
> consolidate for all modules to use dynamic allocation.
Agree. That's making sense. Currently, the trace ID of some devices is
allocated during probe, and never to be released. It's kind of waste of
our trace ID resource if the device never to be enabled. But we still
need support static trace ID allocation in parallel for the dummy
sources and we should not break this logic in future refactor.
Thanks,
Jie
>
> Thanks,
> Leo
^ permalink raw reply
* Re: arm64: Supporting DYNAMIC_FTRACE_WITH_CALL_OPS with CLANG_CFI
From: Clayton Craft @ 2026-04-17 1:10 UTC (permalink / raw)
To: Puranjay Mohan, catalin.marinas, ast, daniel, mark.rutland,
broonie, linux-arm-kernel, linux-kernel, bpf
Cc: craftyguy
In-Reply-To: <mb61pedcvxdhw.fsf@gmail.com>
On Thu Feb 29, 2024 at 11:43 AM PST, Puranjay Mohan wrote:
> puranjay12@gmail.com writes:
> I hacked some patches and tried the above approach:
>
> Here are the patches(RFC) that I created:
>
> LLVM Patch:
Hi Puranjay,
Did this LLVM patch ever make it upstream? I'd like to use CFI together
with BPF LSM on arm64 and wondering what the current state of this
effort is.
Thanks,
Clayton
^ permalink raw reply
* Re: [PATCH net v2] net: airoha: Wait for NPU PPE configuration to complete in airoha_ppe_offload_setup()
From: patchwork-bot+netdevbpf @ 2026-04-17 2:10 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-arm-kernel,
linux-mediatek, netdev
In-Reply-To: <20260414-airoha-wait-for-npu-config-offload-setup-v2-1-5a9bf6d43aee@kernel.org>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 14 Apr 2026 16:08:52 +0200 you wrote:
> In order to properly enable flowtable hw offloading, poll
> REG_PPE_FLOW_CFG register in airoha_ppe_offload_setup routine and
> wait for NPU PPE configuration triggered by ppe_init callback to complete
> before running airoha_ppe_hw_init().
>
> Fixes: 00a7678310fe3 ("net: airoha: Introduce flowtable offload support")
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
>
> [...]
Here is the summary with links:
- [net,v2] net: airoha: Wait for NPU PPE configuration to complete in airoha_ppe_offload_setup()
https://git.kernel.org/netdev/net/c/f3206328bb52
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* [PATCH] dt-bindings: Remove the redundant 'type: boolean'
From: phucduc.bui @ 2026-04-17 2:18 UTC (permalink / raw)
To: robh, krzk+dt, conor+dt
Cc: nick, dmitry.torokhov, nicolas.ferre, alexandre.belloni,
claudiu.beznea, lee, heiko, gregkh, linusw, zyw, zhangqing,
gene_chen, linux-input, devicetree, linux-arm-kernel, linux-usb,
bui duc phuc
From: bui duc phuc <phucduc.bui@gmail.com>
The 'wakeup-source' property already has its type defined in the core
schema. Remove the redundant 'type: boolean' from the binding file to
clean up the binding files.
Signed-off-by: bui duc phuc <phucduc.bui@gmail.com>
---
Documentation/devicetree/bindings/input/atmel,maxtouch.yaml | 3 +--
Documentation/devicetree/bindings/mfd/rockchip,rk816.yaml | 3 +--
Documentation/devicetree/bindings/usb/richtek,rt1711h.yaml | 3 +--
3 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/Documentation/devicetree/bindings/input/atmel,maxtouch.yaml b/Documentation/devicetree/bindings/input/atmel,maxtouch.yaml
index 9bf07acea599..26ea78df27c4 100644
--- a/Documentation/devicetree/bindings/input/atmel,maxtouch.yaml
+++ b/Documentation/devicetree/bindings/input/atmel,maxtouch.yaml
@@ -88,8 +88,7 @@ properties:
- 2 # ATMEL_MXT_WAKEUP_GPIO
default: 0
- wakeup-source:
- type: boolean
+ wakeup-source: true
required:
- compatible
diff --git a/Documentation/devicetree/bindings/mfd/rockchip,rk816.yaml b/Documentation/devicetree/bindings/mfd/rockchip,rk816.yaml
index 0676890f101e..a58d9455a1a5 100644
--- a/Documentation/devicetree/bindings/mfd/rockchip,rk816.yaml
+++ b/Documentation/devicetree/bindings/mfd/rockchip,rk816.yaml
@@ -44,8 +44,7 @@ properties:
description:
Telling whether or not this PMIC is controlling the system power.
- wakeup-source:
- type: boolean
+ wakeup-source: true
vcc1-supply:
description:
diff --git a/Documentation/devicetree/bindings/usb/richtek,rt1711h.yaml b/Documentation/devicetree/bindings/usb/richtek,rt1711h.yaml
index ae611f7e57ca..ec0d83220527 100644
--- a/Documentation/devicetree/bindings/usb/richtek,rt1711h.yaml
+++ b/Documentation/devicetree/bindings/usb/richtek,rt1711h.yaml
@@ -33,8 +33,7 @@ properties:
vbus-supply:
description: VBUS power supply
- wakeup-source:
- type: boolean
+ wakeup-source: true
connector:
type: object
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v7 1/3] dt-bindings: pinctrl: Add aspeed,ast2700-soc0-pinctrl
From: Billy Tsai @ 2026-04-17 2:20 UTC (permalink / raw)
To: Conor Dooley
Cc: Lee Jones, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Joel Stanley, Andrew Jeffery, Linus Walleij, Bartosz Golaszewski,
Ryan Chen, Andrew Jeffery, devicetree@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-aspeed@lists.ozlabs.org, linux-kernel@vger.kernel.org,
openbmc@lists.ozlabs.org, linux-gpio@vger.kernel.org,
linux-clk@vger.kernel.org
In-Reply-To: <20260416-brutishly-saga-ba7168a4cd14@spud>
> > + properties:
> > + function:
> > + enum:
> > + - EMMC
> > + - JTAGDDR
> > + - JTAGM0
> > + - JTAGPCIEA
> > + - JTAGPCIEB
> > + - JTAGPSP
> > + - JTAGSSP
> > + - JTAGTSP
> > + - JTAGUSB3A
> > + - JTAGUSB3B
> > + - PCIERC0PERST
> > + - PCIERC1PERST
> > + - TSPRSTN
> > + - UFSCLKI
> > + - USB2AD0
> > + - USB2AD1
> > + - USB2AH
> > + - USB2AHP
> > + - USB2AHPD0
> > + - USB2AXH
> > + - USB2AXH2B
> > + - USB2AXHD1
> > + - USB2AXHP
> > + - USB2AXHP2B
> > + - USB2AXHPD1
> > + - USB2BD0
> > + - USB2BD1
> > + - USB2BH
> > + - USB2BHP
> > + - USB2BHPD0
> > + - USB2BXH
> > + - USB2BXH2A
> > + - USB2BXHD1
> > + - USB2BXHP
> > + - USB2BXHP2A
> > + - USB2BXHPD1
> > + - USB3AXH
> > + - USB3AXH2B
> > + - USB3AXHD
> > + - USB3AXHP
> > + - USB3AXHP2B
> > + - USB3AXHPD
> > + - USB3BXH
> > + - USB3BXH2A
> > + - USB3BXHD
> > + - USB3BXHP
> > + - USB3BXHP2A
> > + - USB3BXHPD
> > + - VB
> > + - VGADDC
> > +
> > + groups:
> > + enum:
> > + - EMMCCDN
> > + - EMMCG1
> > + - EMMCG4
> > + - EMMCG8
> > + - EMMCWPN
> > + - JTAG0
> > + - PCIERC0PERST
> > + - PCIERC1PERST
> > + - TSPRSTN
> > + - UFSCLKI
> > + - USB2A
> > + - USB2AAP
> > + - USB2ABP
> > + - USB2ADAP
> > + - USB2AH
> > + - USB2AHAP
> > + - USB2B
> > + - USB2BAP
> > + - USB2BBP
> > + - USB2BDBP
> > + - USB2BH
> > + - USB2BHBP
> > + - USB3A
> > + - USB3AAP
> > + - USB3ABP
> > + - USB3B
> > + - USB3BAP
> > + - USB3BBP
> > + - VB0
> > + - VB1
> > + - VGADDC
> > + pins:
> > + enum:
> > + - AB13
> > + - AB14
> > + - AC13
> > + - AC14
> > + - AD13
> > + - AD14
> > + - AE13
> > + - AE14
> > + - AE15
> > + - AF13
> > + - AF14
> > + - AF15
> Why do you have groups and pins?
> Is it valid in your device to have groups and pins in the same node?
The intent is to support both group-based mux selection and
configuration, as well as per-pin configuration.
In our hardware:
- `function` + `groups` are used for pinmux selection.
- `pins` is used for per-pin configuration (e.g. drive strength,
bias settings).
- `groups` may also be used for group-level configuration.
As a result, both `groups` and `pins` may appear in the same node,
but they serve different purposes and do not conflict:
- `groups` selects the mux function and may apply configuration to
the entire group.
- `pins` allows overriding or specifying configuration for individual
pins.
In most cases, only one of them is needed, but both are allowed when
both group-level and per-pin configuration are required.
Thanks
Billy Tsai
^ permalink raw reply
* Re: [PATCH v2 net 0/2] net: enetc: fix command BD ring issues
From: patchwork-bot+netdevbpf @ 2026-04-17 2:40 UTC (permalink / raw)
To: Wei Fang
Cc: claudiu.manoil, vladimir.oltean, xiaoning.wang, andrew+netdev,
davem, edumazet, kuba, pabeni, chleroy, netdev, linux-kernel, imx,
linuxppc-dev, linux-arm-kernel
In-Reply-To: <20260415060833.2303846-1-wei.fang@nxp.com>
Hello:
This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 15 Apr 2026 14:08:31 +0800 you wrote:
> Currently, the implementation of command BD ring has two issues, one is
> that the driver may obtain wrong consumer index of the ring, because the
> driver does not mask out the SBE bit of the CIR value, so a wrong index
> will be obtained when a SBE error ouccrs. The other one is that the DMA
> buffer may be used after free. If netc_xmit_ntmp_cmd() times out and
> returns an error, the pending command is not explicitly aborted, while
> ntmp_free_data_mem() unconditionally frees the DMA buffer. If the buffer
> has already been reallocated elsewhere, this may lead to silent memory
> corruption. Because the hardware eventually processes the pending command
> and perform a DMA write of the response to the physical address of the
> freed buffer. So this patch set is to fix these two issues.
>
> [...]
Here is the summary with links:
- [v2,net,1/2] net: enetc: correct the command BD ring consumer index
https://git.kernel.org/netdev/net/c/759a32900b6f
- [v2,net,2/2] net: enetc: fix NTMP DMA use-after-free issue
https://git.kernel.org/netdev/net/c/3cade698881e
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* RE: [PATCH V13 02/12] PCI: host-generic: Add common helpers for parsing Root Port properties
From: Sherry Sun @ 2026-04-17 3:17 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: robh@kernel.org, krzk+dt@kernel.org, conor+dt@kernel.org,
Frank Li, s.hauer@pengutronix.de, kernel@pengutronix.de,
festevam@gmail.com, lpieralisi@kernel.org, kwilczynski@kernel.org,
mani@kernel.org, bhelgaas@google.com, Hongxing Zhu,
l.stach@pengutronix.de, imx@lists.linux.dev,
linux-pci@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
devicetree@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260416203905.GA29913@bhelgaas>
> On Thu, Apr 16, 2026 at 07:14:12PM +0800, Sherry Sun wrote:
> > Introduce generic helper functions to parse Root Port device tree
> > nodes and extract common properties like reset GPIOs. This allows
> > multiple PCI host controller drivers to share the same parsing logic.
> >
> > Define struct pci_host_port to hold common Root Port properties
> > (currently only reset GPIO descriptor) and add
> > pci_host_common_parse_ports() to parse Root Port nodes from device
> tree.
>
> Are the Root Port and the RC the only possible places for 'reset' GPIO
> descriptions in DT? I think PERST# routing is outside the PCIe spec, so it
> seems like a system could provide a PERST# GPIO routed to any Switch
> Upstream Port or Endpoint (I assume a PERST# connected to a switch would
> apply to both the upstream port and the downstream ports).
Hi Bjorn,
Thanks for the feedback. You're right that PERST# routing could theoretically be
connected to any device in the hierarchy. However, for this patch series, I've focused
on the most common use case in practice: use Root Port level PERST# instead of the
legacy Root Complex level PERST#.
Root Port level PERST# - This is the primary target, where each Root Port has individual
control over devices connected to it.
RC level PERST# - Legacy binding support, where a single GPIO controls all ports.
We can extend this framework later if real hardware emerges that needs Switch or
EP-level PERST# control. I can add a comment documenting this limitation if needed.
BTW, Mani and Rob had some great discussions in dt-schema about PERST# and WAKE#
sideband signals settings.
You can check here:
https://github.com/devicetree-org/dt-schema/issues/168
https://github.com/devicetree-org/dt-schema/pull/126
https://github.com/devicetree-org/dt-schema/pull/170
Best Regards
Sherry
^ permalink raw reply
* [PATCH] gpio: drop bitmap_complement() where feasible
From: Yury Norov @ 2026-04-17 3:34 UTC (permalink / raw)
To: Linus Walleij, Andy Shevchenko, Bartosz Golaszewski,
Shubhrajyoti Datta, Srinivas Neeli, Michal Simek, linux-gpio,
linux-kernel, linux-arm-kernel
Cc: Yury Norov
The gpio drivers reproduce the following pattern:
bitmap_complement(tmp, data1, nbits);
bitmap_and(dst, data2, tmp, nbits);
This can be done in a single pass:
bitmap_andnot(dst, data2, data1t, nbits);
Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
drivers/gpio/gpio-pca953x.c | 7 ++-----
drivers/gpio/gpio-xilinx.c | 6 ++----
2 files changed, 4 insertions(+), 9 deletions(-)
diff --git a/drivers/gpio/gpio-pca953x.c b/drivers/gpio/gpio-pca953x.c
index 52e96cc5f67b..1fef733fe1f0 100644
--- a/drivers/gpio/gpio-pca953x.c
+++ b/drivers/gpio/gpio-pca953x.c
@@ -877,11 +877,9 @@ static void pca953x_irq_bus_sync_unlock(struct irq_data *d)
bitmap_or(irq_mask, chip->irq_trig_fall, chip->irq_trig_raise, gc->ngpio);
bitmap_or(irq_mask, irq_mask, chip->irq_trig_level_high, gc->ngpio);
bitmap_or(irq_mask, irq_mask, chip->irq_trig_level_low, gc->ngpio);
- bitmap_complement(reg_direction, reg_direction, gc->ngpio);
- bitmap_and(irq_mask, irq_mask, reg_direction, gc->ngpio);
/* Look for any newly setup interrupt */
- for_each_set_bit(level, irq_mask, gc->ngpio)
+ for_each_andnot_bit(level, irq_mask, reg_direction, gc->ngpio)
pca953x_gpio_direction_input(&chip->gpio_chip, level);
mutex_unlock(&chip->irq_lock);
@@ -1005,8 +1003,7 @@ static bool pca953x_irq_pending(struct pca953x_chip *chip, unsigned long *pendin
bitmap_and(cur_stat, cur_stat, chip->irq_mask, gc->ngpio);
bitmap_or(pending, pending, cur_stat, gc->ngpio);
- bitmap_complement(cur_stat, new_stat, gc->ngpio);
- bitmap_and(cur_stat, cur_stat, reg_direction, gc->ngpio);
+ bitmap_andnot(cur_stat, reg_direction, new_stat, gc->ngpio);
bitmap_and(old_stat, cur_stat, chip->irq_trig_level_low, gc->ngpio);
bitmap_and(old_stat, old_stat, chip->irq_mask, gc->ngpio);
bitmap_or(pending, pending, old_stat, gc->ngpio);
diff --git a/drivers/gpio/gpio-xilinx.c b/drivers/gpio/gpio-xilinx.c
index be4b4d730547..532205175827 100644
--- a/drivers/gpio/gpio-xilinx.c
+++ b/drivers/gpio/gpio-xilinx.c
@@ -495,13 +495,11 @@ static void xgpio_irqhandler(struct irq_desc *desc)
xgpio_read_ch_all(chip, XGPIO_DATA_OFFSET, hw);
- bitmap_complement(rising, chip->last_irq_read, 64);
- bitmap_and(rising, rising, hw, 64);
+ bitmap_andnot(rising, hw, chip->last_irq_read, 64);
bitmap_and(rising, rising, chip->enable, 64);
bitmap_and(rising, rising, chip->rising_edge, 64);
- bitmap_complement(falling, hw, 64);
- bitmap_and(falling, falling, chip->last_irq_read, 64);
+ bitmap_andnot(falling, chip->last_irq_read, hw, 64);
bitmap_and(falling, falling, chip->enable, 64);
bitmap_and(falling, falling, chip->falling_edge, 64);
--
2.51.0
^ permalink raw reply related
* [PATCH net] net: dsa: mt7530: fix .get_stats64 sleeping in atomic context
From: Daniel Golle @ 2026-04-17 3:55 UTC (permalink / raw)
To: Chester A. Unal, Daniel Golle, Andrew Lunn, Vladimir Oltean,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Matthias Brugger, AngeloGioacchino Del Regno, Russell King,
Christian Marangi, netdev, linux-kernel, linux-arm-kernel,
linux-mediatek
Cc: Frank Wunderlich, John Crispin
The .get_stats64 callback runs in atomic context, but on
MDIO-connected switches every register read acquires the MDIO bus
mutex, which can sleep:
[ 12.645973] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:609
[ 12.654442] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 759, name: grep
[ 12.663377] preempt_count: 0, expected: 0
[ 12.667410] RCU nest depth: 1, expected: 0
[ 12.671511] INFO: lockdep is turned off.
[ 12.675441] CPU: 0 UID: 0 PID: 759 Comm: grep Tainted: G S W 7.0.0+ #0 PREEMPT
[ 12.675453] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 12.675456] Hardware name: Bananapi BPI-R64 (DT)
[ 12.675459] Call trace:
[ 12.675462] show_stack+0x14/0x1c (C)
[ 12.675477] dump_stack_lvl+0x68/0x8c
[ 12.675487] dump_stack+0x14/0x1c
[ 12.675495] __might_resched+0x14c/0x220
[ 12.675504] __might_sleep+0x44/0x80
[ 12.675511] __mutex_lock+0x50/0xb10
[ 12.675523] mutex_lock_nested+0x20/0x30
[ 12.675532] mt7530_get_stats64+0x40/0x2ac
[ 12.675542] dsa_user_get_stats64+0x2c/0x40
[ 12.675553] dev_get_stats+0x44/0x1e0
[ 12.675564] dev_seq_printf_stats+0x24/0xe0
[ 12.675575] dev_seq_show+0x14/0x3c
[ 12.675583] seq_read_iter+0x37c/0x480
[ 12.675595] seq_read+0xd0/0xec
[ 12.675605] proc_reg_read+0x94/0xe4
[ 12.675615] vfs_read+0x98/0x29c
[ 12.675625] ksys_read+0x54/0xdc
[ 12.675633] __arm64_sys_read+0x18/0x20
[ 12.675642] invoke_syscall.constprop.0+0x54/0xec
[ 12.675653] do_el0_svc+0x3c/0xb4
[ 12.675662] el0_svc+0x38/0x200
[ 12.675670] el0t_64_sync_handler+0x98/0xdc
[ 12.675679] el0t_64_sync+0x158/0x15c
For MDIO-connected switches, poll MIB counters asynchronously using a
delayed workqueue every second and let .get_stats64 return the cached
values under a per-port spinlock. A mod_delayed_work() call on each
read triggers an immediate refresh so counters stay responsive when
queried more frequently.
MMIO-connected switches (MT7988, EN7581, AN7583) are not affected
because their regmap does not sleep, so they continue to read MIB
counters directly in .get_stats64.
Fixes: 88c810f35ed5 ("net: dsa: mt7530: implement .get_stats64")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
---
This bug highlights a bigger problem and the actual cause:
Locking in the mt7530 driver deserves a cleanup, and refactoring
towards cleanly and directly using the regmap API.
I've prepared this already and am going to submit a series doing
most of that using Coccinelle semantic patches once net-next opens
again.
drivers/net/dsa/mt7530.c | 54 +++++++++++++++++++++++++++++++++++++---
drivers/net/dsa/mt7530.h | 6 +++++
2 files changed, 57 insertions(+), 3 deletions(-)
diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index b9423389c2ef0..786d3a8492bcb 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -25,6 +25,8 @@
#include "mt7530.h"
+#define MT7530_STATS_POLL_INTERVAL (1 * HZ)
+
static struct mt753x_pcs *pcs_to_mt753x_pcs(struct phylink_pcs *pcs)
{
return container_of(pcs, struct mt753x_pcs, pcs);
@@ -906,10 +908,9 @@ static void mt7530_get_rmon_stats(struct dsa_switch *ds, int port,
*ranges = mt7530_rmon_ranges;
}
-static void mt7530_get_stats64(struct dsa_switch *ds, int port,
- struct rtnl_link_stats64 *storage)
+static void mt7530_read_port_stats64(struct mt7530_priv *priv, int port,
+ struct rtnl_link_stats64 *storage)
{
- struct mt7530_priv *priv = ds->priv;
uint64_t data;
/* MIB counter doesn't provide a FramesTransmittedOK but instead
@@ -951,6 +952,43 @@ static void mt7530_get_stats64(struct dsa_switch *ds, int port,
&storage->rx_crc_errors);
}
+static void mt7530_stats_poll(struct work_struct *work)
+{
+ struct mt7530_priv *priv = container_of(work, struct mt7530_priv,
+ stats_work.work);
+ struct rtnl_link_stats64 stats = {};
+ struct dsa_port *dp;
+ int port;
+
+ dsa_switch_for_each_user_port(dp, priv->ds) {
+ port = dp->index;
+
+ mt7530_read_port_stats64(priv, port, &stats);
+
+ spin_lock(&priv->stats_lock);
+ priv->ports[port].stats = stats;
+ spin_unlock(&priv->stats_lock);
+ }
+
+ schedule_delayed_work(&priv->stats_work,
+ MT7530_STATS_POLL_INTERVAL);
+}
+
+static void mt7530_get_stats64(struct dsa_switch *ds, int port,
+ struct rtnl_link_stats64 *storage)
+{
+ struct mt7530_priv *priv = ds->priv;
+
+ if (priv->bus) {
+ spin_lock(&priv->stats_lock);
+ *storage = priv->ports[port].stats;
+ spin_unlock(&priv->stats_lock);
+ mod_delayed_work(system_wq, &priv->stats_work, 0);
+ } else {
+ mt7530_read_port_stats64(priv, port, storage);
+ }
+}
+
static void mt7530_get_eth_ctrl_stats(struct dsa_switch *ds, int port,
struct ethtool_eth_ctrl_stats *ctrl_stats)
{
@@ -3137,6 +3175,13 @@ mt753x_setup(struct dsa_switch *ds)
if (ret && priv->irq_domain)
mt7530_free_mdio_irq(priv);
+ if (!ret && priv->bus) {
+ spin_lock_init(&priv->stats_lock);
+ INIT_DELAYED_WORK(&priv->stats_work, mt7530_stats_poll);
+ schedule_delayed_work(&priv->stats_work,
+ MT7530_STATS_POLL_INTERVAL);
+ }
+
return ret;
}
@@ -3404,6 +3449,9 @@ EXPORT_SYMBOL_GPL(mt7530_probe_common);
void
mt7530_remove_common(struct mt7530_priv *priv)
{
+ if (priv->bus)
+ cancel_delayed_work_sync(&priv->stats_work);
+
if (priv->irq_domain)
mt7530_free_mdio_irq(priv);
diff --git a/drivers/net/dsa/mt7530.h b/drivers/net/dsa/mt7530.h
index 3e0090bed298d..44c1dc75baea8 100644
--- a/drivers/net/dsa/mt7530.h
+++ b/drivers/net/dsa/mt7530.h
@@ -796,6 +796,7 @@ struct mt7530_fdb {
* @pvid: The VLAN specified is to be considered a PVID at ingress. Any
* untagged frames will be assigned to the related VLAN.
* @sgmii_pcs: Pointer to PCS instance for SerDes ports
+ * @stats: Cached port statistics for MDIO-connected switches
*/
struct mt7530_port {
bool enable;
@@ -803,6 +804,7 @@ struct mt7530_port {
u32 pm;
u16 pvid;
struct phylink_pcs *sgmii_pcs;
+ struct rtnl_link_stats64 stats;
};
/* Port 5 mode definitions of the MT7530 switch */
@@ -875,6 +877,8 @@ struct mt753x_info {
* @create_sgmii: Pointer to function creating SGMII PCS instance(s)
* @active_cpu_ports: Holding the active CPU ports
* @mdiodev: The pointer to the MDIO device structure
+ * @stats_lock: Protects cached per-port stats from concurrent access
+ * @stats_work: Delayed work for polling MIB counters on MDIO switches
*/
struct mt7530_priv {
struct device *dev;
@@ -900,6 +904,8 @@ struct mt7530_priv {
int (*create_sgmii)(struct mt7530_priv *priv);
u8 active_cpu_ports;
struct mdio_device *mdiodev;
+ spinlock_t stats_lock; /* protects cached stats counters */
+ struct delayed_work stats_work;
};
struct mt7530_hw_vlan_entry {
--
2.53.0
^ permalink raw reply related
* Re: [GIT PULL 1/4] soc: dt changes for 7.1
From: pr-tracker-bot @ 2026-04-17 4:10 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Linus Torvalds, soc, linux-kernel, linux-arm-kernel
In-Reply-To: <b3203b1d-4c1d-49b8-924e-c17cf6be7ec0@app.fastmail.com>
The pull request you sent on Fri, 17 Apr 2026 00:22:48 +0200:
> https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git tags/soc-dt-7.1
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/e65f4718a577fcc84d40431f022985898b6dbf2e
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: [GIT PULL 3/4] soc: defconfig updates for 7.1
From: pr-tracker-bot @ 2026-04-17 4:10 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Linus Torvalds, soc, linux-kernel, linux-arm-kernel
In-Reply-To: <704b8a79-4fe5-4815-82aa-026de29dcf1a@app.fastmail.com>
The pull request you sent on Fri, 17 Apr 2026 00:30:16 +0200:
> https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git tags/soc-defconfig-7.1
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/231d703058b2e2ee59884c8531e02c60a2a109ab
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: Re: [GIT PULL 2/4] soc: drivers for 7.1
From: pr-tracker-bot @ 2026-04-17 4:10 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Linus Torvalds, soc, linux-kernel, linux-arm-kernel
In-Reply-To: <d80b4370-3d20-4d7c-b91f-455797c52b39@app.fastmail.com>
The pull request you sent on Fri, 17 Apr 2026 00:28:45 +0200:
> https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git tags/soc-drivers-7.1
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/31b43c079f9aa55754c20404a42bca9a49e01f60
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: [GIT PULL 4/4] soc: ARM code changes for 7.1
From: pr-tracker-bot @ 2026-04-17 4:10 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Linus Torvalds, soc, linux-kernel, linux-arm-kernel
In-Reply-To: <2af6e914-7d5f-4976-8b18-0a98ad3a427c@app.fastmail.com>
The pull request you sent on Fri, 17 Apr 2026 00:31:18 +0200:
> https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git tags/soc-arm-7.1
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8242c709d4ba858c483ef7ef3cc2dc1280f5383c
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox