* [PATCH v5 0/5] Disable ATS via iommu during PCI resets
@ 2025-11-11 5:12 Nicolin Chen
2025-11-11 5:12 ` [PATCH v5 1/5] iommu: Lock group->mutex in iommu_deferred_attach() Nicolin Chen
` (4 more replies)
0 siblings, 5 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-11 5:12 UTC (permalink / raw)
To: joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, baolu.lu, linux-arm-kernel, iommu,
linux-kernel, linux-acpi, linux-pci, kvm, patches, pjaroszynski,
vsethi, helgaas, etzhao1900
Hi all,
PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS, where an ATS invalidation
command will time out: e.g. an SVA domain has no coordination with a reset
event and can racily issue ATS invalidations to a resetting device.
The OS should do something to mitigate this, as we do not want production
systems to be reporting critical ATS failures, especially in a hypervisor
environment. Broadly, the OS could arrange to ignore the timeouts, block
page table mutations to prevent invalidations, or disable and block ATS.
The PCIe spec, in the sec 10.3.1 IMPLEMENTATION NOTE, recommends disabling
and blocking ATS before initiating a Function Level Reset. It also notes
that other reset methods could have the same vulnerability.
Provide a callback from the PCI subsystem that will enclose the reset and
have the iommu core temporarily change domains to group->blocking_domain,
so that IOMMU drivers can fence any incoming ATS queries, synchronously
stop issuing new ATS invalidations, and wait for existing ATS invalidations
to complete. Doing this avoids any ATS invalidation timeouts.
When a device is resetting, any new domain attachment has to be rejected
until the reset finishes, to prevent ATS activity from being re-activated
between the two callback functions. Introduce a new resetting_domain, and
reject any concurrent __iommu_attach_device/set_group_pasid().
Finally, apply these iommu_dev_reset_prepare/done() functions in the PCI
reset functions.
Note that this series does not support shared iommu_group cases. Also, a
PF will be blocked during its reset, even though its VFs (which are already
broken by that reset) will not be aware of it.
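For illustration, each reset path ends up bracketed roughly like the sketch
below (a simplified view of the pcie_flr() change in patch 5/5; the
imm_ready shortcut is omitted):

	int pcie_flr(struct pci_dev *dev)
	{
		int ret;

		if (!pci_wait_for_pending_transaction(dev))
			pci_err(dev, "timed out waiting for pending transaction\n");

		/* Attach to the blocked domain; quiesce ATS before the reset */
		ret = pci_reset_iommu_prepare(dev);
		if (ret)
			return ret;

		pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
		msleep(100);
		ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);

		/* Re-attach the retained domains now that the reset is done */
		pci_reset_iommu_done(dev);
		return ret;
	}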
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommu_dev_reset-v5
Changelog
v5
* Rebase on Joerg's next tree
* [iommu] Skip in shared iommu_group cases
* [iommu] Pass in default_domain to iommu_setup_dma_ops
* [iommu] Add kdocs to iommu_get_domain_for_dev_locked()
* [iommu] s/get_domain_for_dev_locked/driver_get_domain_for_dev
* [iommu] Replace per-gdev pending_reset with per-group resetting_domain
v4
https://lore.kernel.org/all/cover.1756682135.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu
* [iommu] Use guard(mutex)
* [iommu] Update kdocs for typos and revisions
* [iommu] Skip two corner cases (alias and SRIOV)
* [iommu] Rework attach_dev to pass in old domain pointer
* [iommu] Reject concurrent attach_dev/set_dev_pasid for compatibility
concern
* [smmuv3] Drop the old_domain dependency in its release_dev callback
* [pci] Add pci_reset_iommu_prepare/_done() wrappers checking ATS cap
v3
https://lore.kernel.org/all/cover.1754952762.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [iommu] Add a fast return in iommu_deferred_attach()
* [iommu] Update kdocs, inline comments, and commit logs
* [iommu] Use group->blocking_domain v.s. ops->blocked_domain
* [iommu] Drop require_direct, iommu_group_get(), and xa_lock()
* [iommu] Set the pending_reset flag after RID/PASID domain setups
* [iommu] Do not bypass PASID domains when RID domain is already the
blocking_domain
* [iommu] Add iommu_get_domain_for_dev_locked to correctly return the
blocking_domain
v2
https://lore.kernel.org/all/cover.1751096303.git.nicolinc@nvidia.com/
* [iommu] Update kdocs, inline comments, and commit logs
* [iommu] Replace long-holding group->mutex with a pending_reset flag
* [pci] Abort reset routines if iommu_dev_reset_prepare() fails
* [pci] Apply the same vulnerability fix to other reset functions
v1
https://lore.kernel.org/all/cover.1749494161.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (5):
iommu: Lock group->mutex in iommu_deferred_attach()
iommu: Tiny domain for iommu_setup_dma_ops()
iommu: Add iommu_driver_get_domain_for_dev() helper
iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
pci: Suspend iommu function prior to resetting a device
drivers/iommu/dma-iommu.h | 5 +-
drivers/pci/pci.h | 2 +
include/linux/iommu.h | 13 ++
include/uapi/linux/vfio.h | 3 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 +-
drivers/iommu/dma-iommu.c | 4 +-
drivers/iommu/iommu.c | 230 +++++++++++++++++++-
drivers/pci/pci-acpi.c | 12 +-
drivers/pci/pci.c | 68 +++++-
drivers/pci/quirks.c | 18 +-
10 files changed, 339 insertions(+), 21 deletions(-)
--
2.43.0
* [PATCH v5 1/5] iommu: Lock group->mutex in iommu_deferred_attach()
2025-11-11 5:12 [PATCH v5 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
@ 2025-11-11 5:12 ` Nicolin Chen
2025-11-12 2:47 ` Baolu Lu
2025-11-11 5:12 ` [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops() Nicolin Chen
` (3 subsequent siblings)
4 siblings, 1 reply; 36+ messages in thread
From: Nicolin Chen @ 2025-11-11 5:12 UTC (permalink / raw)
To: joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, baolu.lu, linux-arm-kernel, iommu,
linux-kernel, linux-acpi, linux-pci, kvm, patches, pjaroszynski,
vsethi, helgaas, etzhao1900
The iommu_deferred_attach() function invokes __iommu_attach_device(), but
doesn't hold the group->mutex like other __iommu_attach_device() callers.
Though no practical bug has been triggered so far, it would be better to
apply the same locking to this __iommu_attach_device(), since the IOMMU
drivers nowadays are more aware of the group->mutex -- some of them use the
iommu_group_mutex_assert() function, which could potentially be in the path
of an attach_dev callback invoked by __iommu_attach_device().
It is worth mentioning that iommu_deferred_attach() will soon need to check
group->resetting_domain, which must also be read under the same mutex.
Thus, grab the mutex to guard __iommu_attach_device() like other callers.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2ca990dfbb884..170e522b5bda4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2185,10 +2185,17 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
{
- if (dev->iommu && dev->iommu->attach_deferred)
- return __iommu_attach_device(domain, dev, NULL);
+ /*
+ * This is called on the dma mapping fast path so avoid locking. This is
+ * racy, but we have an expectation that the driver will setup its DMAs
+ * inside probe while being single threaded to avoid racing.
+ */
+ if (!dev->iommu || !dev->iommu->attach_deferred)
+ return 0;
- return 0;
+ guard(mutex)(&dev->iommu_group->mutex);
+
+ return __iommu_attach_device(domain, dev, NULL);
}
void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
--
2.43.0
* [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops()
2025-11-11 5:12 [PATCH v5 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-11-11 5:12 ` [PATCH v5 1/5] iommu: Lock group->mutex in iommu_deferred_attach() Nicolin Chen
@ 2025-11-11 5:12 ` Nicolin Chen
2025-11-12 5:22 ` Baolu Lu
` (2 more replies)
2025-11-11 5:12 ` [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper Nicolin Chen
` (2 subsequent siblings)
4 siblings, 3 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-11 5:12 UTC (permalink / raw)
To: joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, baolu.lu, linux-arm-kernel, iommu,
linux-kernel, linux-acpi, linux-pci, kvm, patches, pjaroszynski,
vsethi, helgaas, etzhao1900
This function can only be called on the default_domain. Trivially pass it
in. In all three existing cases, the default domain was just attached to
the device.
This avoids iommu_setup_dma_ops() calling iommu_get_domain_for_dev(), which
will be used by external callers.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/dma-iommu.h | 5 +++--
drivers/iommu/dma-iommu.c | 4 +---
drivers/iommu/iommu.c | 6 +++---
3 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
index eca201c1f9639..040d002525632 100644
--- a/drivers/iommu/dma-iommu.h
+++ b/drivers/iommu/dma-iommu.h
@@ -9,7 +9,7 @@
#ifdef CONFIG_IOMMU_DMA
-void iommu_setup_dma_ops(struct device *dev);
+void iommu_setup_dma_ops(struct device *dev, struct iommu_domain *domain);
int iommu_get_dma_cookie(struct iommu_domain *domain);
void iommu_put_dma_cookie(struct iommu_domain *domain);
@@ -26,7 +26,8 @@ extern bool iommu_dma_forcedac;
#else /* CONFIG_IOMMU_DMA */
-static inline void iommu_setup_dma_ops(struct device *dev)
+static inline void iommu_setup_dma_ops(struct device *dev,
+ struct iommu_domain *domain)
{
}
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7944a3af4545e..e8ffb50c66dbf 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -2096,10 +2096,8 @@ void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
}
EXPORT_SYMBOL_GPL(dma_iova_destroy);
-void iommu_setup_dma_ops(struct device *dev)
+void iommu_setup_dma_ops(struct device *dev, struct iommu_domain *domain)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-
if (dev_is_pci(dev))
dev->iommu->pci_32bit_workaround = !iommu_dma_forcedac;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 170e522b5bda4..1e322f87b1710 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -661,7 +661,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
}
if (group->default_domain)
- iommu_setup_dma_ops(dev);
+ iommu_setup_dma_ops(dev, group->default_domain);
mutex_unlock(&group->mutex);
@@ -1949,7 +1949,7 @@ static int bus_iommu_probe(const struct bus_type *bus)
return ret;
}
for_each_group_device(group, gdev)
- iommu_setup_dma_ops(gdev->dev);
+ iommu_setup_dma_ops(gdev->dev, group->default_domain);
mutex_unlock(&group->mutex);
/*
@@ -3155,7 +3155,7 @@ static ssize_t iommu_group_store_type(struct iommu_group *group,
/* Make sure dma_ops is appropriatley set */
for_each_group_device(group, gdev)
- iommu_setup_dma_ops(gdev->dev);
+ iommu_setup_dma_ops(gdev->dev, group->default_domain);
out_unlock:
mutex_unlock(&group->mutex);
--
2.43.0
* [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-11 5:12 [PATCH v5 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-11-11 5:12 ` [PATCH v5 1/5] iommu: Lock group->mutex in iommu_deferred_attach() Nicolin Chen
2025-11-11 5:12 ` [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops() Nicolin Chen
@ 2025-11-11 5:12 ` Nicolin Chen
2025-11-12 5:58 ` Baolu Lu
` (2 more replies)
2025-11-11 5:12 ` [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-11-11 5:12 ` [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
4 siblings, 3 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-11 5:12 UTC (permalink / raw)
To: joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, baolu.lu, linux-arm-kernel, iommu,
linux-kernel, linux-acpi, linux-pci, kvm, patches, pjaroszynski,
vsethi, helgaas, etzhao1900
There is a need to stage a resetting PCI device temporarily to the blocked
domain and then attach it back to its previously attached domain after the
reset.
This can be simply done by keeping the "previously attached domain" in the
iommu_group->domain pointer while adding an iommu_group->resetting_domain,
which causes trouble for IOMMU drivers that use iommu_get_domain_for_dev()
to get a device's physical domain in order to program the IOMMU hardware.
In such for-driver use cases, the iommu_group->mutex must be held, which
doesn't fit external callers that don't hold the iommu_group->mutex.
Introduce a new iommu_driver_get_domain_for_dev() helper, exclusively for
driver use cases that hold the iommu_group->mutex, to separate them from
the external use cases.
Add a lockdep_assert_not_held to the existing iommu_get_domain_for_dev()
and highlight that in its kdoc.
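In other words, the intended split is roughly the following (an illustrative
sketch, not part of the patch; the real driver-side usage is in the
arm-smmu-v3 hunk below):

	/* A driver bound to dev; group->mutex must NOT be held here */
	struct iommu_domain *dma_domain = iommu_get_domain_for_dev(dev);

	/*
	 * An IOMMU driver inside an attach_dev/set_dev_pasid callback,
	 * where the core already holds group->mutex:
	 */
	struct iommu_domain *sid_domain = iommu_driver_get_domain_for_dev(dev);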
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 ++--
drivers/iommu/iommu.c | 28 +++++++++++++++++++++
3 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 801b2bd9e8d49..a42a2d1d7a0b7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -910,6 +910,7 @@ extern int iommu_attach_device(struct iommu_domain *domain,
extern void iommu_detach_device(struct iommu_domain *domain,
struct device *dev);
extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
+struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev);
extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a33fbd12a0dd9..412d1a9b31275 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3125,7 +3125,8 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
struct arm_smmu_cd *cd, struct iommu_domain *old)
{
- struct iommu_domain *sid_domain = iommu_get_domain_for_dev(master->dev);
+ struct iommu_domain *sid_domain =
+ iommu_driver_get_domain_for_dev(master->dev);
struct arm_smmu_attach_state state = {
.master = master,
.ssid = pasid,
@@ -3191,7 +3192,7 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
*/
if (!arm_smmu_ssids_in_use(&master->cd_table)) {
struct iommu_domain *sid_domain =
- iommu_get_domain_for_dev(master->dev);
+ iommu_driver_get_domain_for_dev(master->dev);
if (sid_domain->type == IOMMU_DOMAIN_IDENTITY ||
sid_domain->type == IOMMU_DOMAIN_BLOCKED)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1e322f87b1710..1f4d6ca0937bc 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2217,6 +2217,15 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
}
EXPORT_SYMBOL_GPL(iommu_detach_device);
+/**
+ * iommu_get_domain_for_dev() - Return the DMA API domain pointer
+ * @dev - Device to query
+ *
+ * This function can be called within a driver bound to dev. The returned
+ * pointer is valid for the lifetime of the bound driver.
+ *
+ * It should not be called by drivers with driver_managed_dma = true.
+ */
struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
{
/* Caller must be a probed driver on dev */
@@ -2225,10 +2234,29 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
if (!group)
return NULL;
+ lockdep_assert_not_held(&group->mutex);
+
return group->domain;
}
EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
+/**
+ * iommu_driver_get_domain_for_dev() - Return the driver-level domain pointer
+ * @dev - Device to query
+ *
+ * This function can be called by an iommu driver that wants to get the physical
+ * domain within an iommu callback function where group->mutex is held.
+ */
+struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+
+ lockdep_assert_held(&group->mutex);
+
+ return group->domain;
+}
+EXPORT_SYMBOL_GPL(iommu_driver_get_domain_for_dev);
+
/*
* For IOMMU_DOMAIN_DMA implementations which already provide their own
* guarantees that the group and its default domain are valid and correct.
--
2.43.0
* [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-11-11 5:12 [PATCH v5 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
` (2 preceding siblings ...)
2025-11-11 5:12 ` [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper Nicolin Chen
@ 2025-11-11 5:12 ` Nicolin Chen
2025-11-12 6:18 ` Baolu Lu
` (3 more replies)
2025-11-11 5:12 ` [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
4 siblings, 4 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-11 5:12 UTC (permalink / raw)
To: joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, baolu.lu, linux-arm-kernel, iommu,
linux-kernel, linux-acpi, linux-pci, kvm, patches, pjaroszynski,
vsethi, helgaas, etzhao1900
PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS, where an ATS invalidation
command will time out: e.g. an SVA domain has no coordination with a reset
event and can racily issue ATS invalidations to a resetting device.
The OS should do something to mitigate this, as we do not want production
systems to be reporting critical ATS failures, especially in a hypervisor
environment. Broadly, the OS could arrange to ignore the timeouts, block
page table mutations to prevent invalidations, or disable and block ATS.
The PCIe spec, in the sec 10.3.1 IMPLEMENTATION NOTE, recommends disabling
and blocking ATS before initiating a Function Level Reset. It also notes
that other reset methods could have the same vulnerability.
Provide a callback from the PCI subsystem that will enclose the reset and
have the iommu core temporarily change all the attached domains to BLOCKED.
After attaching a BLOCKED domain, the IOMMU hardware will fence any incoming
ATS queries. IOMMU drivers should also synchronously stop issuing new ATS
invalidations and wait for all outstanding ATS invalidations to complete.
This avoids any ATS invalidation timeouts.
However, if a domain attachment/replacement happens during an ongoing
reset, ATS routines may be re-activated between the two function calls.
So, introduce a new resetting_domain in the iommu_group structure to
reject any concurrent attach_dev/set_dev_pasid call during a reset, out
of concern for compatibility failures. Since this changes the behavior of
an attach operation, update the uAPI accordingly.
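Each attach path then gains a check of roughly this shape (abbreviated from
the hunks below):

	guard(mutex)(&dev->iommu_group->mutex);

	/* Reject a concurrent attach while a group device is resetting */
	if (dev->iommu_group->resetting_domain)
		return -EBUSY;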
Note that there are two corner cases:
1. Devices in the same iommu_group
Since an attachment is always per iommu_group, disallowing one device
to switch domains (or HWPTs in iommufd) would have to disallow the others
in the same iommu_group from switching domains as well. So, play safe by
preventing a shared iommu_group from going through the iommu reset.
2. SR-IOV devices whose PF is resetting while its VFs aren't
In such a case, the VF itself is already broken. So, there is no point
in preventing the PF from going through the iommu reset.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 12 +++
include/uapi/linux/vfio.h | 3 +
drivers/iommu/iommu.c | 183 ++++++++++++++++++++++++++++++++++++++
3 files changed, 198 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a42a2d1d7a0b7..25a2c2b00c9f7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1169,6 +1169,9 @@ void dev_iommu_priv_set(struct device *dev, void *priv);
extern struct mutex iommu_probe_device_lock;
int iommu_probe_device(struct device *dev);
+int iommu_dev_reset_prepare(struct device *dev);
+void iommu_dev_reset_done(struct device *dev);
+
int iommu_device_use_default_domain(struct device *dev);
void iommu_device_unuse_default_domain(struct device *dev);
@@ -1453,6 +1456,15 @@ static inline int iommu_fwspec_add_ids(struct device *dev, u32 *ids,
return -ENODEV;
}
+static inline int iommu_dev_reset_prepare(struct device *dev)
+{
+ return 0;
+}
+
+static inline void iommu_dev_reset_done(struct device *dev)
+{
+}
+
static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
{
return NULL;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 75100bf009baf..6cc9d2709d13a 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -963,6 +963,9 @@ struct vfio_device_bind_iommufd {
* hwpt corresponding to the given pt_id.
*
* Return: 0 on success, -errno on failure.
+ *
+ * When a device gets reset, any attach will be rejected with -EBUSY until that
+ * reset routine finishes.
*/
struct vfio_device_attach_iommufd_pt {
__u32 argsz;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1f4d6ca0937bc..74b9f2bfc0458 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -61,6 +61,11 @@ struct iommu_group {
int id;
struct iommu_domain *default_domain;
struct iommu_domain *blocking_domain;
+ /*
+ * During a group device reset, @resetting_domain points to the physical
+ * domain, while @domain points to the attached domain before the reset.
+ */
+ struct iommu_domain *resetting_domain;
struct iommu_domain *domain;
struct list_head entry;
unsigned int owner_cnt;
@@ -2195,6 +2200,12 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
guard(mutex)(&dev->iommu_group->mutex);
+ /*
+ * This is a concurrent attach while a group device is resetting. Reject
+ * it until iommu_dev_reset_done() attaches the device to group->domain.
+ */
+ if (dev->iommu_group->resetting_domain)
+ return -EBUSY;
return __iommu_attach_device(domain, dev, NULL);
}
@@ -2253,6 +2264,16 @@ struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
lockdep_assert_held(&group->mutex);
+ /*
+ * Driver handles the low-level __iommu_attach_device(), including the
+ * one invoked by iommu_dev_reset_done(), in which case the driver must
+ * get the resetting_domain over group->domain caching the one prior to
+ * iommu_dev_reset_prepare(), so that it wouldn't end up with attaching
+ * the device from group->domain (old) to group->domain (new).
+ */
+ if (group->resetting_domain)
+ return group->resetting_domain;
+
return group->domain;
}
EXPORT_SYMBOL_GPL(iommu_driver_get_domain_for_dev);
@@ -2409,6 +2430,13 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
if (WARN_ON(!new_domain))
return -EINVAL;
+ /*
+ * This is a concurrent attach while a group device is resetting. Reject
+ * it until iommu_dev_reset_done() attaches the device to group->domain.
+ */
+ if (group->resetting_domain)
+ return -EBUSY;
+
/*
* Changing the domain is done by calling attach_dev() on the new
* domain. This switch does not have to be atomic and DMA can be
@@ -3527,6 +3555,16 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
return -EINVAL;
mutex_lock(&group->mutex);
+
+ /*
+ * This is a concurrent attach while a group device is resetting. Reject
+ * it until iommu_dev_reset_done() attaches the device to group->domain.
+ */
+ if (group->resetting_domain) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
for_each_group_device(group, device) {
/*
* Skip PASID validation for devices without PASID support
@@ -3610,6 +3648,16 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
return -EINVAL;
mutex_lock(&group->mutex);
+
+ /*
+ * This is a concurrent attach while a group device is resetting. Reject
+ * it until iommu_dev_reset_done() attaches the device to group->domain.
+ */
+ if (group->resetting_domain) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
entry = iommu_make_pasid_array_entry(domain, handle);
curr = xa_cmpxchg(&group->pasid_array, pasid, NULL,
XA_ZERO_ENTRY, GFP_KERNEL);
@@ -3867,6 +3915,141 @@ int iommu_replace_group_handle(struct iommu_group *group,
}
EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+/**
+ * iommu_dev_reset_prepare() - Block IOMMU to prepare for a device reset
+ * @dev: device that is going to enter a reset routine
+ *
+ * When certain device is entering a reset routine, it wants to block any IOMMU
+ * activity during the reset routine. This includes blocking any translation as
+ * well as cache invalidation (especially the device cache).
+ *
+ * This function attaches all RIDs/PASIDs of the device to IOMMU_DOMAIN_BLOCKED,
+ * allowing any blocked-domain-supporting IOMMU driver to pause translation and
+ * cache invalidation, but leaves the software domain pointers intact so later
+ * the iommu_dev_reset_done() can restore everything.
+ *
+ * Return: 0 on success or negative error code if the preparation failed.
+ *
+ * Caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done() together
+ * before/after the core-level reset routine, to unset the resetting_domain.
+ *
+ * These two functions are designed to be used by PCI reset functions that would
+ * not invoke any racy iommu_release_device(), since PCI sysfs node gets removed
+ * before it notifies with a BUS_NOTIFY_REMOVED_DEVICE. When using them in other
+ * case, callers must ensure there will be no racy iommu_release_device() call,
+ * which otherwise would UAF the dev->iommu_group pointer.
+ */
+int iommu_dev_reset_prepare(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ unsigned long pasid;
+ void *entry;
+ int ret = 0;
+
+ if (!dev_has_iommu(dev))
+ return 0;
+
+ guard(mutex)(&group->mutex);
+
+ /*
+ * Once the resetting_domain is set, any concurrent attachment to this
+ * iommu_group will be rejected, which would break the attach routines
+ * of the sibling devices in the same iommu_group. So, skip this case.
+ */
+ if (dev_is_pci(dev)) {
+ struct group_device *gdev;
+
+ for_each_group_device(group, gdev) {
+ if (gdev->dev != dev)
+ return 0;
+ }
+ }
+
+ /* Re-entry is not allowed */
+ if (WARN_ON(group->resetting_domain))
+ return -EBUSY;
+
+ ret = __iommu_group_alloc_blocking_domain(group);
+ if (ret)
+ return ret;
+
+ /* Stage RID domain at blocking_domain while retaining group->domain */
+ if (group->domain != group->blocking_domain) {
+ ret = __iommu_attach_device(group->blocking_domain, dev,
+ group->domain);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Stage PASID domains at blocking_domain while retaining pasid_array.
+ *
+ * The pasid_array is mostly fenced by group->mutex, except one reader
+ * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+ */
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+ iommu_remove_dev_pasid(dev, pasid,
+ pasid_array_entry_to_domain(entry));
+
+ group->resetting_domain = group->blocking_domain;
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
+
+/**
+ * iommu_dev_reset_done() - Restore IOMMU after a device reset is finished
+ * @dev: device that has finished a reset routine
+ *
+ * When certain device has finished a reset routine, it wants to restore its
+ * IOMMU activity, including new translation as well as cache invalidation, by
+ * re-attaching all RID/PASID of the device's back to the domains retained in
+ * the core-level structure.
+ *
+ * Caller must pair it with a successfully returned iommu_dev_reset_prepare().
+ *
+ * Note that, although unlikely, there is a risk that re-attaching domains might
+ * fail due to some unexpected happening like OOM.
+ */
+void iommu_dev_reset_done(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ unsigned long pasid;
+ void *entry;
+
+ if (!dev_has_iommu(dev))
+ return;
+
+ guard(mutex)(&group->mutex);
+
+ /* iommu_dev_reset_prepare() was bypassed for the device */
+ if (!group->resetting_domain)
+ return;
+
+ /* iommu_dev_reset_prepare() was not successfully called */
+ if (WARN_ON(!group->blocking_domain))
+ return;
+
+ /* Re-attach RID domain back to group->domain */
+ if (group->domain != group->blocking_domain) {
+ WARN_ON(__iommu_attach_device(group->domain, dev,
+ group->blocking_domain));
+ }
+
+ /*
+ * Re-attach PASID domains back to the domains retained in pasid_array.
+ *
+ * The pasid_array is mostly fenced by group->mutex, except one reader
+ * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+ */
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+ WARN_ON(__iommu_set_group_pasid(
+ pasid_array_entry_to_domain(entry), group, pasid,
+ group->blocking_domain));
+
+ group->resetting_domain = NULL;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
+
#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
/**
* iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
--
2.43.0
* [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-11 5:12 [PATCH v5 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
` (3 preceding siblings ...)
2025-11-11 5:12 ` [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
@ 2025-11-11 5:12 ` Nicolin Chen
2025-11-14 9:45 ` Tian, Kevin
2025-11-17 22:58 ` Bjorn Helgaas
4 siblings, 2 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-11 5:12 UTC (permalink / raw)
To: joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, baolu.lu, linux-arm-kernel, iommu,
linux-kernel, linux-acpi, linux-pci, kvm, patches, pjaroszynski,
vsethi, helgaas, etzhao1900
PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS, where an ATS invalidation
command will time out: e.g. an SVA domain has no coordination with a reset
event and can racily issue ATS invalidations to a resetting device.
The PCIe spec, in the sec 10.3.1 IMPLEMENTATION NOTE, recommends disabling
and blocking ATS before initiating a Function Level Reset. It also notes
that other reset methods could have the same vulnerability.
Now that the iommu_dev_reset_prepare/done() helpers are introduced for this
purpose, use them in all the existing reset functions. These will attach the
device to an IOMMU_DOMAIN_BLOCKED domain during a reset, so as to allow the
IOMMU driver to:
- invoke pci_disable_ats() and pci_enable_ats(), if necessary
- wait for all ATS invalidations to complete
- stop issuing new ATS invalidations
- fence any incoming ATS queries
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/pci/pci.h | 2 ++
drivers/pci/pci-acpi.c | 12 ++++++--
drivers/pci/pci.c | 68 ++++++++++++++++++++++++++++++++++++++----
drivers/pci/quirks.c | 18 ++++++++++-
4 files changed, 92 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4492b809094b5..a29286dfd870c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -198,6 +198,8 @@ void pci_init_reset_methods(struct pci_dev *dev);
int pci_bridge_secondary_bus_reset(struct pci_dev *dev);
int pci_bus_error_reset(struct pci_dev *dev);
int __pci_reset_bus(struct pci_bus *bus);
+int pci_reset_iommu_prepare(struct pci_dev *dev);
+void pci_reset_iommu_done(struct pci_dev *dev);
struct pci_cap_saved_data {
u16 cap_nr;
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 9369377725fa0..60d29b183f2c2 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -971,6 +971,7 @@ void pci_set_acpi_fwnode(struct pci_dev *dev)
int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
{
acpi_handle handle = ACPI_HANDLE(&dev->dev);
+ int ret = 0;
if (!handle || !acpi_has_method(handle, "_RST"))
return -ENOTTY;
@@ -978,12 +979,19 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
if (probe)
return 0;
+ ret = pci_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
if (ACPI_FAILURE(acpi_evaluate_object(handle, "_RST", NULL, NULL))) {
pci_warn(dev, "ACPI _RST failed\n");
- return -ENOTTY;
+ ret = -ENOTTY;
}
- return 0;
+ pci_reset_iommu_done(dev);
+ return ret;
}
bool acpi_pci_power_manageable(struct pci_dev *dev)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b14dd064006cc..52461d952cbf1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -13,6 +13,7 @@
#include <linux/delay.h>
#include <linux/dmi.h>
#include <linux/init.h>
+#include <linux/iommu.h>
#include <linux/msi.h>
#include <linux/of.h>
#include <linux/pci.h>
@@ -25,6 +26,7 @@
#include <linux/logic_pio.h>
#include <linux/device.h>
#include <linux/pm_runtime.h>
+#include <linux/pci-ats.h>
#include <linux/pci_hotplug.h>
#include <linux/vmalloc.h>
#include <asm/dma.h>
@@ -95,6 +97,23 @@ bool pci_reset_supported(struct pci_dev *dev)
return dev->reset_methods[0] != 0;
}
+/*
+ * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS before
+ * initiating a reset. Notify the iommu driver that enabled ATS.
+ */
+int pci_reset_iommu_prepare(struct pci_dev *dev)
+{
+ if (pci_ats_supported(dev))
+ return iommu_dev_reset_prepare(&dev->dev);
+ return 0;
+}
+
+void pci_reset_iommu_done(struct pci_dev *dev)
+{
+ if (pci_ats_supported(dev))
+ iommu_dev_reset_done(&dev->dev);
+}
+
#ifdef CONFIG_PCI_DOMAINS
int pci_domains_supported = 1;
#endif
@@ -4478,13 +4497,22 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
*/
int pcie_flr(struct pci_dev *dev)
{
+ int ret = 0;
+
if (!pci_wait_for_pending_transaction(dev))
pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
+ /* Have to call it after waiting for pending DMA transaction */
+ ret = pci_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
if (dev->imm_ready)
- return 0;
+ goto done;
/*
* Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within
@@ -4493,7 +4521,10 @@ int pcie_flr(struct pci_dev *dev)
*/
msleep(100);
- return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+done:
+ pci_reset_iommu_done(dev);
+ return ret;
}
EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4521,6 +4552,7 @@ EXPORT_SYMBOL_GPL(pcie_reset_flr);
static int pci_af_flr(struct pci_dev *dev, bool probe)
{
+ int ret = 0;
int pos;
u8 cap;
@@ -4547,10 +4579,17 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
PCI_AF_STATUS_TP << 8))
pci_err(dev, "timed out waiting for pending transaction; performing AF function level reset anyway\n");
+ /* Have to call it after waiting for pending DMA transaction */
+ ret = pci_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
if (dev->imm_ready)
- return 0;
+ goto done;
/*
* Per Advanced Capabilities for Conventional PCI ECN, 13 April 2006,
@@ -4560,7 +4599,10 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
*/
msleep(100);
- return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+done:
+ pci_reset_iommu_done(dev);
+ return ret;
}
/**
@@ -4581,6 +4623,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
static int pci_pm_reset(struct pci_dev *dev, bool probe)
{
u16 csr;
+ int ret;
if (!dev->pm_cap || dev->dev_flags & PCI_DEV_FLAGS_NO_PM_RESET)
return -ENOTTY;
@@ -4595,6 +4638,12 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
if (dev->current_state != PCI_D0)
return -EINVAL;
+ ret = pci_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
csr &= ~PCI_PM_CTRL_STATE_MASK;
csr |= PCI_D3hot;
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
@@ -4605,7 +4654,9 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
pci_dev_d3_sleep(dev);
- return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
+ pci_reset_iommu_done(dev);
+ return ret;
}
/**
@@ -5060,6 +5111,12 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
if (rc)
return -ENOTTY;
+ rc = pci_reset_iommu_prepare(dev);
+ if (rc) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return rc;
+ }
+
if (reg & PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR) {
val = reg;
} else {
@@ -5074,6 +5131,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
reg);
+ pci_reset_iommu_done(dev);
return rc;
}
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 214ed060ca1b3..891d9e5a97e93 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4226,6 +4226,22 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ 0 }
};
+static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
+ const struct pci_dev_reset_methods *i)
+{
+ int ret;
+
+ ret = pci_reset_iommu_prepare(dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
+ ret = i->reset(dev, probe);
+ pci_reset_iommu_done(dev);
+ return ret;
+}
+
/*
* These device-specific reset methods are here rather than in a driver
* because when a host assigns a device to a guest VM, the host may need
@@ -4240,7 +4256,7 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
i->vendor == (u16)PCI_ANY_ID) &&
(i->device == dev->device ||
i->device == (u16)PCI_ANY_ID))
- return i->reset(dev, probe);
+ return __pci_dev_specific_reset(dev, probe, i);
}
return -ENOTTY;
--
2.43.0
* Re: [PATCH v5 1/5] iommu: Lock group->mutex in iommu_deferred_attach()
2025-11-11 5:12 ` [PATCH v5 1/5] iommu: Lock group->mutex in iommu_deferred_attach() Nicolin Chen
@ 2025-11-12 2:47 ` Baolu Lu
0 siblings, 0 replies; 36+ messages in thread
From: Baolu Lu @ 2025-11-12 2:47 UTC (permalink / raw)
To: Nicolin Chen, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
On 11/11/25 13:12, Nicolin Chen wrote:
> The iommu_deferred_attach() function invokes __iommu_attach_device(), but
> doesn't hold the group->mutex like other __iommu_attach_device() callers.
>
> Though there is no pratical bug being triggered so far, it would be better
> to apply the same locking to this __iommu_attach_device(), since the IOMMU
> drivers nowaday are more aware of the group->mutex -- some of them use the
> iommu_group_mutex_assert() function that could be potentially in the path
> of an attach_dev callback function invoked by the __iommu_attach_device().
>
> Worth mentioning that the iommu_deferred_attach() will soon need to check
> group->resetting_domain that must be locked also.
>
> Thus, grab the mutex to guard __iommu_attach_device() like other callers.
>
> Reviewed-by: Jason Gunthorpe<jgg@nvidia.com>
> Reviewed-by: Kevin Tian<kevin.tian@intel.com>
> Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
> ---
> drivers/iommu/iommu.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
* Re: [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops()
2025-11-11 5:12 ` [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops() Nicolin Chen
@ 2025-11-12 5:22 ` Baolu Lu
2025-11-14 9:17 ` Tian, Kevin
2025-11-14 9:18 ` Tian, Kevin
2 siblings, 0 replies; 36+ messages in thread
From: Baolu Lu @ 2025-11-12 5:22 UTC (permalink / raw)
To: Nicolin Chen, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
On 11/11/25 13:12, Nicolin Chen wrote:
> This function can only be called on the default_domain. Trivally pass it
> in. In all three existing cases, the default domain was just attached to
> the device.
>
> This avoids iommu_setup_dma_ops() calling iommu_get_domain_for_dev() the
> that will be used by external callers.
>
> Suggested-by: Jason Gunthorpe<jgg@nvidia.com>
> Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
> ---
> drivers/iommu/dma-iommu.h | 5 +++--
> drivers/iommu/dma-iommu.c | 4 +---
> drivers/iommu/iommu.c | 6 +++---
> 3 files changed, 7 insertions(+), 8 deletions(-)
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
* Re: [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-11 5:12 ` [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper Nicolin Chen
@ 2025-11-12 5:58 ` Baolu Lu
2025-11-12 17:41 ` Nicolin Chen
2025-11-12 8:52 ` kernel test robot
2025-11-14 9:18 ` Tian, Kevin
2 siblings, 1 reply; 36+ messages in thread
From: Baolu Lu @ 2025-11-12 5:58 UTC (permalink / raw)
To: Nicolin Chen, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
On 11/11/25 13:12, Nicolin Chen wrote:
> There is a need to stage a resetting PCI device to temporally the blocked
> domain and then attach back to its previously attached domain after reset.
>
> This can be simply done by keeping the "previously attached domain" in the
> iommu_group->domain pointer while adding an iommu_group->resetting_domain,
> which gives troubles to IOMMU drivers using the iommu_get_domain_for_dev()
> for a device's physical domain in order to program IOMMU hardware.
>
> And in such for-driver use cases, the iommu_group->mutex must be held, so
> it doesn't fit in external callers that don't hold the iommu_group->mutex.
>
> Introduce a new iommu_driver_get_domain_for_dev() helper, exclusively for
> driver use cases that hold the iommu_group->mutex, to separate from those
> external use cases.
>
> Add a lockdep_assert_not_held to the existing iommu_get_domain_for_dev()
> and highlight that in a kdoc.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/linux/iommu.h | 1 +
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 5 ++--
> drivers/iommu/iommu.c | 28 +++++++++++++++++++++
> 3 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 801b2bd9e8d49..a42a2d1d7a0b7 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -910,6 +910,7 @@ extern int iommu_attach_device(struct iommu_domain *domain,
> extern void iommu_detach_device(struct iommu_domain *domain,
> struct device *dev);
> extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
> +struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev);
> extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
> extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
> phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index a33fbd12a0dd9..412d1a9b31275 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3125,7 +3125,8 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
> struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
> struct arm_smmu_cd *cd, struct iommu_domain *old)
> {
> - struct iommu_domain *sid_domain = iommu_get_domain_for_dev(master->dev);
> + struct iommu_domain *sid_domain =
> + iommu_driver_get_domain_for_dev(master->dev);
> struct arm_smmu_attach_state state = {
> .master = master,
> .ssid = pasid,
> @@ -3191,7 +3192,7 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
> */
> if (!arm_smmu_ssids_in_use(&master->cd_table)) {
> struct iommu_domain *sid_domain =
> - iommu_get_domain_for_dev(master->dev);
> + iommu_driver_get_domain_for_dev(master->dev);
>
> if (sid_domain->type == IOMMU_DOMAIN_IDENTITY ||
> sid_domain->type == IOMMU_DOMAIN_BLOCKED)
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 1e322f87b1710..1f4d6ca0937bc 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2217,6 +2217,15 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
> }
> EXPORT_SYMBOL_GPL(iommu_detach_device);
>
> +/**
> + * iommu_get_domain_for_dev() - Return the DMA API domain pointer
> + * @dev - Device to query
> + *
> + * This function can be called within a driver bound to dev. The returned
> + * pointer is valid for the lifetime of the bound driver.
> + *
> + * It should not be called by drivers with driver_managed_dma = true.
"driver_managed_dma != true" means the driver will use the default
domain allocated by the iommu core during iommu probe. The iommu core
ensures that this domain stored at group->domain will not be changed
during the driver's whole lifecycle. That's reasonable.
How about making some code to enforce this requirement? Something like
below ...
> + */
> struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
> {
> /* Caller must be a probed driver on dev */
> @@ -2225,10 +2234,29 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
> if (!group)
> return NULL;
>
> + lockdep_assert_not_held(&group->mutex);
...
if (WARN_ON(!dev->driver || !group->owner_cnt || group->owner))
return NULL;
> +
> return group->domain;
> }
> EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
>
> +/**
> + * iommu_driver_get_domain_for_dev() - Return the driver-level domain pointer
> + * @dev - Device to query
> + *
> + * This function can be called by an iommu driver that wants to get the physical
> + * domain within an iommu callback function where group->mutex is held.
> + */
> +struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
> +{
> + struct iommu_group *group = dev->iommu_group;
> +
> + lockdep_assert_held(&group->mutex);
> +
> + return group->domain;
> +}
> +EXPORT_SYMBOL_GPL(iommu_driver_get_domain_for_dev);
> +
> /*
> * For IOMMU_DOMAIN_DMA implementations which already provide their own
> * guarantees that the group and its default domain are valid and correct.
Others look good to me.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
* Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-11-11 5:12 ` [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
@ 2025-11-12 6:18 ` Baolu Lu
2025-11-12 17:43 ` Nicolin Chen
2025-11-14 9:37 ` Tian, Kevin
` (2 subsequent siblings)
3 siblings, 1 reply; 36+ messages in thread
From: Baolu Lu @ 2025-11-12 6:18 UTC (permalink / raw)
To: Nicolin Chen, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: will, robin.murphy, lenb, linux-arm-kernel, iommu, linux-kernel,
linux-acpi, linux-pci, kvm, patches, pjaroszynski, vsethi,
helgaas, etzhao1900
On 11/11/25 13:12, Nicolin Chen wrote:
> +/**
> + * iommu_dev_reset_prepare() - Block IOMMU to prepare for a device reset
> + * @dev: device that is going to enter a reset routine
> + *
> + * When certain device is entering a reset routine, it wants to block any IOMMU
> + * activity during the reset routine. This includes blocking any translation as
> + * well as cache invalidation (especially the device cache).
> + *
> + * This function attaches all RID/PASID of the device's to IOMMU_DOMAIN_BLOCKED
> + * allowing any blocked-domain-supporting IOMMU driver to pause translation and
> + * cahce invalidation, but leaves the software domain pointers intact so later
> + * the iommu_dev_reset_done() can restore everything.
> + *
> + * Return: 0 on success or negative error code if the preparation failed.
> + *
> + * Caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done() together
> + * before/after the core-level reset routine, to unset the resetting_domain.
> + *
> + * These two functions are designed to be used by PCI reset functions that would
> + * not invoke any racy iommu_release_device(), since PCI sysfs node gets removed
> + * before it notifies with a BUS_NOTIFY_REMOVED_DEVICE. When using them in other
> + * case, callers must ensure there will be no racy iommu_release_device() call,
> + * which otherwise would UAF the dev->iommu_group pointer.
> + */
> +int iommu_dev_reset_prepare(struct device *dev)
> +{
> + struct iommu_group *group = dev->iommu_group;
> + unsigned long pasid;
> + void *entry;
> + int ret = 0;
> +
> + if (!dev_has_iommu(dev))
> + return 0;
Nit: This interface is only for the PCI layer, so why not just
if (WARN_ON(!dev_is_pci(dev)))
return -EINVAL;
?
> +
> + guard(mutex)(&group->mutex);
> +
> + /*
> + * Once the resetting_domain is set, any concurrent attachment to this
> + * iommu_group will be rejected, which would break the attach routines
> + * of the sibling devices in the same iommu_group. So, skip this case.
> + */
> + if (dev_is_pci(dev)) {
> + struct group_device *gdev;
> +
> + for_each_group_device(group, gdev) {
> + if (gdev->dev != dev)
> + return 0;
> + }
> + }
With above dev_is_pci() check, here it can simply be,
if (list_count_nodes(&group->devices) != 1)
return 0;
> +
> + /* Re-entry is not allowed */
> + if (WARN_ON(group->resetting_domain))
> + return -EBUSY;
> +
> + ret = __iommu_group_alloc_blocking_domain(group);
> + if (ret)
> + return ret;
> +
> + /* Stage RID domain at blocking_domain while retaining group->domain */
> + if (group->domain != group->blocking_domain) {
> + ret = __iommu_attach_device(group->blocking_domain, dev,
> + group->domain);
> + if (ret)
> + return ret;
> + }
> +
> + /*
> + * Stage PASID domains at blocking_domain while retaining pasid_array.
> + *
> + * The pasid_array is mostly fenced by group->mutex, except one reader
> + * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
> + */
> + xa_for_each_start(&group->pasid_array, pasid, entry, 1)
> + iommu_remove_dev_pasid(dev, pasid,
> + pasid_array_entry_to_domain(entry));
> +
> + group->resetting_domain = group->blocking_domain;
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
> +
> +/**
> + * iommu_dev_reset_done() - Restore IOMMU after a device reset is finished
> + * @dev: device that has finished a reset routine
> + *
> + * When certain device has finished a reset routine, it wants to restore its
> + * IOMMU activity, including new translation as well as cache invalidation, by
> + * re-attaching all RID/PASID of the device's back to the domains retained in
> + * the core-level structure.
> + *
> + * Caller must pair it with a successfully returned iommu_dev_reset_prepare().
> + *
> + * Note that, although unlikely, there is a risk that re-attaching domains might
> + * fail due to some unexpected happening like OOM.
> + */
> +void iommu_dev_reset_done(struct device *dev)
> +{
> + struct iommu_group *group = dev->iommu_group;
> + unsigned long pasid;
> + void *entry;
> +
> + if (!dev_has_iommu(dev))
> + return;
> +
> + guard(mutex)(&group->mutex);
> +
> + /* iommu_dev_reset_prepare() was bypassed for the device */
> + if (!group->resetting_domain)
> + return;
> +
> + /* iommu_dev_reset_prepare() was not successfully called */
> + if (WARN_ON(!group->blocking_domain))
> + return;
> +
> + /* Re-attach RID domain back to group->domain */
> + if (group->domain != group->blocking_domain) {
> + WARN_ON(__iommu_attach_device(group->domain, dev,
> + group->blocking_domain));
> + }
> +
> + /*
> + * Re-attach PASID domains back to the domains retained in pasid_array.
> + *
> + * The pasid_array is mostly fenced by group->mutex, except one reader
> + * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
> + */
> + xa_for_each_start(&group->pasid_array, pasid, entry, 1)
> + WARN_ON(__iommu_set_group_pasid(
> + pasid_array_entry_to_domain(entry), group, pasid,
> + group->blocking_domain));
> +
> + group->resetting_domain = NULL;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
> +
> #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
> /**
> * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
Thanks,
baolu
* Re: [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-11 5:12 ` [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper Nicolin Chen
2025-11-12 5:58 ` Baolu Lu
@ 2025-11-12 8:52 ` kernel test robot
2025-11-14 9:18 ` Tian, Kevin
2 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2025-11-12 8:52 UTC (permalink / raw)
To: Nicolin Chen, joro, afael, bhelgaas, alex, jgg, kevin.tian
Cc: oe-kbuild-all, will, robin.murphy, lenb, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-acpi, linux-pci, kvm,
patches, pjaroszynski, vsethi, helgaas, etzhao1900
Hi Nicolin,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20251110]
[cannot apply to pci/next pci/for-linus awilliam-vfio/next awilliam-vfio/for-linus rafael-pm/linux-next rafael-pm/bleeding-edge linus/master v6.18-rc5 v6.18-rc4 v6.18-rc3 v6.18-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Nicolin-Chen/iommu-Lock-group-mutex-in-iommu_deferred_attach/20251111-131520
base: next-20251110
patch link: https://lore.kernel.org/r/0303739735f3f49bcebc244804e9eeb82b1c41dc.1762835355.git.nicolinc%40nvidia.com
patch subject: [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
config: microblaze-randconfig-r072-20251112 (https://download.01.org/0day-ci/archive/20251112/202511121645.Oi9e0lrt-lkp@intel.com/config)
compiler: microblaze-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251112/202511121645.Oi9e0lrt-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511121645.Oi9e0lrt-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> Warning: drivers/iommu/iommu.c:2229 function parameter 'dev' not described in 'iommu_get_domain_for_dev'
>> Warning: drivers/iommu/iommu.c:2250 function parameter 'dev' not described in 'iommu_driver_get_domain_for_dev'
>> Warning: drivers/iommu/iommu.c:2229 function parameter 'dev' not described in 'iommu_get_domain_for_dev'
>> Warning: drivers/iommu/iommu.c:2250 function parameter 'dev' not described in 'iommu_driver_get_domain_for_dev'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-12 5:58 ` Baolu Lu
@ 2025-11-12 17:41 ` Nicolin Chen
2025-11-18 7:02 ` Nicolin Chen
0 siblings, 1 reply; 36+ messages in thread
From: Nicolin Chen @ 2025-11-12 17:41 UTC (permalink / raw)
To: Baolu Lu
Cc: joro, afael, bhelgaas, alex, jgg, kevin.tian, will, robin.murphy,
lenb, linux-arm-kernel, iommu, linux-kernel, linux-acpi,
linux-pci, kvm, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
Hi Baolu,
On Wed, Nov 12, 2025 at 01:58:51PM +0800, Baolu Lu wrote:
> On 11/11/25 13:12, Nicolin Chen wrote:
> > +/**
> > + * iommu_get_domain_for_dev() - Return the DMA API domain pointer
> > + * @dev - Device to query
> > + *
> > + * This function can be called within a driver bound to dev. The returned
> > + * pointer is valid for the lifetime of the bound driver.
> > + *
> > + * It should not be called by drivers with driver_managed_dma = true.
>
> "driver_managed_dma != true" means the driver will use the default
> domain allocated by the iommu core during iommu probe.
Hmm, I am not very sure. Jason's remarks pointed out that there
is an exception in host1x_client_iommu_detach():
https://lore.kernel.org/all/20250924191055.GJ2617119@nvidia.com/
Where the group->domain could be NULL, i.e. not attached to the
default domain?
> The iommu core
> ensures that this domain stored at group->domain will not be changed
> during the driver's whole lifecycle. That's reasonable.
>
> How about making some code to enforce this requirement? Something like
> below ...
>
> > + */
> > struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
> > {
> > /* Caller must be a probed driver on dev */
> > @@ -2225,10 +2234,29 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
> > if (!group)
> > return NULL;
> > + lockdep_assert_not_held(&group->mutex);
>
> ...
> if (WARN_ON(!dev->driver || !group->owner_cnt || group->owner))
> return NULL;
With that, could host1x_client_iommu_detach() trigger WARN_ON?
Thanks!
Nicolin
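For reference, a minimal sketch of where such a check would sit, assuming the
current group->owner_cnt/group->owner semantics (illustrative only, not part
of the posted patch):

struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
{
	/* Caller must be a probed driver on dev */
	struct iommu_group *group = dev->iommu_group;

	if (!group)
		return NULL;

	/* Reject callers without a bound driver, or with user (e.g. VFIO) ownership */
	if (WARN_ON(!dev->driver || !group->owner_cnt || group->owner))
		return NULL;

	return group->domain;
}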
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-11-12 6:18 ` Baolu Lu
@ 2025-11-12 17:43 ` Nicolin Chen
0 siblings, 0 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-12 17:43 UTC (permalink / raw)
To: Baolu Lu
Cc: joro, afael, bhelgaas, alex, jgg, kevin.tian, will, robin.murphy,
lenb, linux-arm-kernel, iommu, linux-kernel, linux-acpi,
linux-pci, kvm, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On Wed, Nov 12, 2025 at 02:18:09PM +0800, Baolu Lu wrote:
> On 11/11/25 13:12, Nicolin Chen wrote:
> > +int iommu_dev_reset_prepare(struct device *dev)
> > +{
> > + struct iommu_group *group = dev->iommu_group;
> > + unsigned long pasid;
> > + void *entry;
> > + int ret = 0;
> > +
> > + if (!dev_has_iommu(dev))
> > + return 0;
>
> Nit: This interface is only for PCI layer, so why not just
>
> if (WARN_ON(!dev_is_pci(dev)))
> return -EINVAL;
> ?
The function naming was a bit generic, but we do have a specific
use case here. So, yea, let's add one.
> > +
> > + guard(mutex)(&group->mutex);
> > +
> > + /*
> > + * Once the resetting_domain is set, any concurrent attachment to this
> > + * iommu_group will be rejected, which would break the attach routines
> > + * of the sibling devices in the same iommu_group. So, skip this case.
> > + */
> > + if (dev_is_pci(dev)) {
> > + struct group_device *gdev;
> > +
> > + for_each_group_device(group, gdev) {
> > + if (gdev->dev != dev)
> > + return 0;
> > + }
> > + }
>
> With above dev_is_pci() check, here it can simply be,
>
> if (list_count_nodes(&group->devices) != 1)
> return 0;
Will replace that.
Thanks!
Nicolin
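For reference, a fragment sketch of that simplification inside
iommu_dev_reset_prepare(), assuming group->mutex is already held by the
guard() above (not the posted patch):

	guard(mutex)(&group->mutex);

	/*
	 * A shared iommu_group: setting resetting_domain would also reject
	 * the attach routines of the sibling devices, so skip this case.
	 */
	if (list_count_nodes(&group->devices) != 1)
		return 0;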
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops()
2025-11-11 5:12 ` [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops() Nicolin Chen
2025-11-12 5:22 ` Baolu Lu
@ 2025-11-14 9:17 ` Tian, Kevin
2025-11-14 9:18 ` Tian, Kevin
2 siblings, 0 replies; 36+ messages in thread
From: Tian, Kevin @ 2025-11-14 9:17 UTC (permalink / raw)
To: Nicolin Chen, joro@8bytes.org, afael@kernel.org,
bhelgaas@google.com, alex@shazbot.org, jgg@nvidia.com
Cc: will@kernel.org, robin.murphy@arm.com, lenb@kernel.org,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, patches@lists.linux.dev, Jaroszynski, Piotr,
Sethi, Vikram, helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 11, 2025 1:13 PM
>
> This function can only be called on the default_domain. Trivally pass it
> in. In all three existing cases, the default domain was just attached to
> the device.
>
> This avoids iommu_setup_dma_ops() calling iommu_get_domain_for_dev()
> the that will be used by external callers.
remove 'the' before 'that'
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops()
2025-11-11 5:12 ` [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops() Nicolin Chen
2025-11-12 5:22 ` Baolu Lu
2025-11-14 9:17 ` Tian, Kevin
@ 2025-11-14 9:18 ` Tian, Kevin
2 siblings, 0 replies; 36+ messages in thread
From: Tian, Kevin @ 2025-11-14 9:18 UTC (permalink / raw)
To: Nicolin Chen, joro@8bytes.org, afael@kernel.org,
bhelgaas@google.com, alex@shazbot.org, jgg@nvidia.com
Cc: will@kernel.org, robin.murphy@arm.com, lenb@kernel.org,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, patches@lists.linux.dev, Jaroszynski, Piotr,
Sethi, Vikram, helgaas@kernel.org, etzhao1900@gmail.com
> From: Tian, Kevin
> Sent: Friday, November 14, 2025 5:17 PM
>
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, November 11, 2025 1:13 PM
> >
> > This function can only be called on the default_domain. Trivally pass it
> > in. In all three existing cases, the default domain was just attached to
> > the device.
> >
> > This avoids iommu_setup_dma_ops() calling iommu_get_domain_for_dev()
> > the that will be used by external callers.
>
> remove 'the' before 'that'
Reviewed-by: Kevin Tian<kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-11 5:12 ` [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper Nicolin Chen
2025-11-12 5:58 ` Baolu Lu
2025-11-12 8:52 ` kernel test robot
@ 2025-11-14 9:18 ` Tian, Kevin
2 siblings, 0 replies; 36+ messages in thread
From: Tian, Kevin @ 2025-11-14 9:18 UTC (permalink / raw)
To: Nicolin Chen, joro@8bytes.org, afael@kernel.org,
bhelgaas@google.com, alex@shazbot.org, jgg@nvidia.com
Cc: will@kernel.org, robin.murphy@arm.com, lenb@kernel.org,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, patches@lists.linux.dev, Jaroszynski, Piotr,
Sethi, Vikram, helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 11, 2025 1:13 PM
>
> There is a need to stage a resetting PCI device to temporally the blocked
s/temporally/temporarily/
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-11-11 5:12 ` [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-11-12 6:18 ` Baolu Lu
@ 2025-11-14 9:37 ` Tian, Kevin
2025-11-14 18:26 ` Nicolin Chen
2025-11-17 4:59 ` Tian, Kevin
2025-11-17 23:04 ` Bjorn Helgaas
3 siblings, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2025-11-14 9:37 UTC (permalink / raw)
To: Nicolin Chen, joro@8bytes.org, afael@kernel.org,
bhelgaas@google.com, alex@shazbot.org, jgg@nvidia.com
Cc: will@kernel.org, robin.murphy@arm.com, lenb@kernel.org,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, patches@lists.linux.dev, Jaroszynski, Piotr,
Sethi, Vikram, helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 11, 2025 1:13 PM
>
> Note that there are two corner cases:
> 1. Devices in the same iommu_group
> Since an attachment is always per iommu_group, disallowing one device
> to switch domains (or HWPTs in iommufd) would have to disallow others
> in the same iommu_group to switch domains as well. So, play safe by
> preventing a shared iommu_group from going through the iommu reset.
It'd be good to make clear that 'preventing' means that the racing problem
is not addressed.
> + /*
> + * During a group device reset, @resetting_domain points to the
> physical
> + * domain, while @domain points to the attached domain before the
> reset.
> + */
> + struct iommu_domain *resetting_domain;
'a group device' is a bit confusing. Just remove 'group'?
> @@ -2195,6 +2200,12 @@ int iommu_deferred_attach(struct device *dev,
> struct iommu_domain *domain)
>
> guard(mutex)(&dev->iommu_group->mutex);
>
> + /*
> + * This is a concurrent attach while a group device is resetting. Reject
> + * it until iommu_dev_reset_done() attaches the device to group-
> >domain.
> + */
> + if (dev->iommu_group->resetting_domain)
> + return -EBUSY;
It might be worth noting that failing a deferred attach leads to failing
the dma map operation. It's different from other explicit attaching paths,
but there is nothing more we can do here.
> @@ -2253,6 +2264,16 @@ struct iommu_domain
> *iommu_driver_get_domain_for_dev(struct device *dev)
>
> lockdep_assert_held(&group->mutex);
>
> + /*
> + * Driver handles the low-level __iommu_attach_device(), including
> the
> + * one invoked by iommu_dev_reset_done(), in which case the driver
> must
> + * get the resetting_domain over group->domain caching the one
> prior to
> + * iommu_dev_reset_prepare(), so that it wouldn't end up with
> attaching
> + * the device from group->domain (old) to group->domain (new).
> + */
> + if (group->resetting_domain)
> + return group->resetting_domain;
It's a pretty long sentence. Let's break it.
> +int iommu_dev_reset_prepare(struct device *dev)
If this is intended to be used by PCI for now, it's clearer to have 'pci'
in the name. Later, when there is a demand to call it from other
buses, the discussion will catch eyes to ensure there is no race or UAF, etc.
> + /*
> + * Once the resetting_domain is set, any concurrent attachment to
> this
> + * iommu_group will be rejected, which would break the attach
> routines
> + * of the sibling devices in the same iommu_group. So, skip this case.
> + */
> + if (dev_is_pci(dev)) {
> + struct group_device *gdev;
> +
> + for_each_group_device(group, gdev) {
> + if (gdev->dev != dev)
> + return 0;
> + }
> + }
btw, what would be the real impact of rejecting concurrent attachment for
sibling devices? This series already documents the impact in the uAPI for
the device under attachment, and userspace already knows the restriction
that devices in the group must be attached to the same hwpt.
Combining that knowledge, I don't think there is a problem with
userspace being aware that resetting a device in a multi-dev
group affects concurrent attachment of sibling devices...
> + /* Re-attach RID domain back to group->domain */
> + if (group->domain != group->blocking_domain) {
> + WARN_ON(__iommu_attach_device(group->domain, dev,
> + group->blocking_domain));
> + }
Even if we disallow resetting on a multi-dev group, there is still a
corner case not taken care of here.
It's possible that there is only one device in the group at prepare time,
then a hotplugged device is added to the group in the middle, before
reset_done runs.
In this case the newly-added device will inherit the blocking domain.
Then reset_done should loop over all devices in the group and re-attach
all of them to the cached domain.
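A rough sketch of what that re-attach loop could look like on the reset_done
side, reusing the v5 helpers quoted above and assuming group->mutex is held
(PASID restore and error handling omitted; illustrative only):

	/* Re-attach every device in the group, covering hot-added siblings */
	if (group->domain != group->blocking_domain) {
		struct group_device *gdev;

		for_each_group_device(group, gdev)
			WARN_ON(__iommu_attach_device(group->domain, gdev->dev,
						      group->blocking_domain));
	}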
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-11 5:12 ` [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
@ 2025-11-14 9:45 ` Tian, Kevin
2025-11-14 18:00 ` Nicolin Chen
2025-11-17 22:58 ` Bjorn Helgaas
1 sibling, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2025-11-14 9:45 UTC (permalink / raw)
To: Nicolin Chen, joro@8bytes.org, afael@kernel.org,
bhelgaas@google.com, alex@shazbot.org, jgg@nvidia.com
Cc: will@kernel.org, robin.murphy@arm.com, lenb@kernel.org,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, patches@lists.linux.dev, Jaroszynski, Piotr,
Sethi, Vikram, helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 11, 2025 1:13 PM
>
> PCIe permits a device to ignore ATS invalidation TLPs, while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out: e.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.
>
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable
> and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
>
> Now iommu_dev_reset_prepare/done() helpers are introduced for this
> matter.
> Use them in all the existing reset functions, which will attach the device
Looks like pci_reset_bus_function() was missed?
> @@ -971,6 +971,7 @@ void pci_set_acpi_fwnode(struct pci_dev *dev)
> int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> {
> acpi_handle handle = ACPI_HANDLE(&dev->dev);
> + int ret = 0;
no need to initialize it. ditto for other reset functions.
> +/*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> before
> + * initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> +int pci_reset_iommu_prepare(struct pci_dev *dev)
> +{
> + if (pci_ats_supported(dev))
> + return iommu_dev_reset_prepare(&dev->dev);
> + return 0;
> +}
the comment says "driver that enabled ATS", but the code checks
whether ATS is supported.
which one is desired?
>
> + /* Have to call it after waiting for pending DMA transaction */
> + ret = pci_reset_iommu_prepare(dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
the error message could be more informative.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-14 9:45 ` Tian, Kevin
@ 2025-11-14 18:00 ` Nicolin Chen
2025-11-17 4:52 ` Tian, Kevin
0 siblings, 1 reply; 36+ messages in thread
From: Nicolin Chen @ 2025-11-14 18:00 UTC (permalink / raw)
To: Tian, Kevin
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Fri, Nov 14, 2025 at 09:45:31AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, November 11, 2025 1:13 PM
> >
> > PCIe permits a device to ignore ATS invalidation TLPs, while processing a
> > reset. This creates a problem visible to the OS where an ATS invalidation
> > command will time out: e.g. an SVA domain will have no coordination with a
> > reset event and can racily issue ATS invalidations to a resetting device.
> >
> > The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable
> > and
> > block ATS before initiating a Function Level Reset. It also mentions that
> > other reset methods could have the same vulnerability as well.
> >
> > Now iommu_dev_reset_prepare/done() helpers are introduced for this
> > matter.
> > Use them in all the existing reset functions, which will attach the device
>
> looks pci_reset_bus_function() was missed?
Will add that.
> > @@ -971,6 +971,7 @@ void pci_set_acpi_fwnode(struct pci_dev *dev)
> > int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> > {
> > acpi_handle handle = ACPI_HANDLE(&dev->dev);
> > + int ret = 0;
>
> no need to initialize it. ditto for other reset functions.
Ack.
> > +/*
> > + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> > before
> > + * initiating a reset. Notify the iommu driver that enabled ATS.
> > + */
> > +int pci_reset_iommu_prepare(struct pci_dev *dev)
> > +{
> > + if (pci_ats_supported(dev))
> > + return iommu_dev_reset_prepare(&dev->dev);
> > + return 0;
> > +}
>
> the comment says "driver that enabled ATS", but the code checks
> whether ATS is supported.
>
> which one is desired?
The comment says "the iommu driver that enabled ATS". It doesn't
conflict with what the PCI core checks here?
> > + /* Have to call it after waiting for pending DMA transaction */
> > + ret = pci_reset_iommu_prepare(dev);
> > + if (ret) {
> > + pci_err(dev, "failed to stop IOMMU\n");
>
> the error message could be more informative.
OK. Perhaps print the ret value.
Thanks!
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-11-14 9:37 ` Tian, Kevin
@ 2025-11-14 18:26 ` Nicolin Chen
0 siblings, 0 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-14 18:26 UTC (permalink / raw)
To: Tian, Kevin
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Fri, Nov 14, 2025 at 09:37:27AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > @@ -2195,6 +2200,12 @@ int iommu_deferred_attach(struct device *dev,
> > struct iommu_domain *domain)
> >
> > guard(mutex)(&dev->iommu_group->mutex);
> >
> > + /*
> > + * This is a concurrent attach while a group device is resetting. Reject
> > + * it until iommu_dev_reset_done() attaches the device to group-
> > >domain.
> > + */
> > + if (dev->iommu_group->resetting_domain)
> > + return -EBUSY;
>
> It might be worth noting that failing a deferred attach leads to failing
> the dma map operation. It's different from other explicit attaching paths,
> but there is nothing more we can do here.
OK.
/*
 * This is a concurrent attach while a group device is resetting. Reject
 * it until iommu_dev_reset_done() attaches the device to group->domain.
 *
 * Worth noting that this may fail the dma map operation. But there is
 * nothing more we can do here.
 */
> > @@ -2253,6 +2264,16 @@ struct iommu_domain
> > *iommu_driver_get_domain_for_dev(struct device *dev)
> >
> > lockdep_assert_held(&group->mutex);
> >
> > + /*
> > + * Driver handles the low-level __iommu_attach_device(), including
> > the
> > + * one invoked by iommu_dev_reset_done(), in which case the driver
> > must
> > + * get the resetting_domain over group->domain caching the one
> > prior to
> > + * iommu_dev_reset_prepare(), so that it wouldn't end up with
> > attaching
> > + * the device from group->domain (old) to group->domain (new).
> > + */
> > + if (group->resetting_domain)
> > + return group->resetting_domain;
>
> It's a pretty long sentence. Let's break it.
OK.
/*
 * Driver handles the low-level __iommu_attach_device(), including the
 * one invoked by iommu_dev_reset_done() that reattaches the device to
 * the cached group->domain. In this case, the driver must get the old
 * domain from group->resetting_domain rather than group->domain. This
 * prevents it from reattaching the device from group->domain (old) to
 * group->domain (new).
 */
>> +int iommu_dev_reset_prepare(struct device *dev)
>
> If this is intended to be used by pci for now, it's clearer to have a 'pci'
> word in the name. Later when there is a demand calling it from other
> buses, discussion will catch eyes to ensure no racy of UAF etc.
Well, if we make it exclusive to PCI, perhaps just move these two
from pci.c to iommu.c:
int pci_reset_iommu_prepare(struct pci_dev *dev);
void pci_reset_iommu_done(struct pci_dev *dev);
> > + /*
> > + * Once the resetting_domain is set, any concurrent attachment to
> > this
> > + * iommu_group will be rejected, which would break the attach
> > routines
> > + * of the sibling devices in the same iommu_group. So, skip this case.
> > + */
> > + if (dev_is_pci(dev)) {
> > + struct group_device *gdev;
> > +
> > + for_each_group_device(group, gdev) {
> > + if (gdev->dev != dev)
> > + return 0;
> > + }
> > + }
>
> btw what'd be a real impact to reject concurrent attachment for sibling
> devices? This series already documents the impact in uAPI for the device
> under attachment, and the userspace already knows the restriction
> of devices in the group which must be attached to a same hwpt.
>
> Combining those knowledge I don't think there is a problem for
> userspace to be aware of that resetting a device in a multi-dev
> group affects concurrent attachment of sibling devices...
It's following Jason's remarks:
https://lore.kernel.org/linux-iommu/20250915125357.GH1024672@nvidia.com/
Perhaps we should add that to the uAPI, given the race condition
that you mentioned below.
> > + /* Re-attach RID domain back to group->domain */
> > + if (group->domain != group->blocking_domain) {
> > + WARN_ON(__iommu_attach_device(group->domain, dev,
> > + group->blocking_domain));
> > + }
>
> Even if we disallow resetting on a multi-dev group, there is still a
> corner case not taken care here.
>
> It's possible that there is only one device in the group at prepare,
> coming with a device hotplug added to the group in the middle,
> then doing reset_done.
>
> In this case the newly-added device will inherit the blocking domain.
>
> Then reset_done should loop all devices in the group and re-attach
> all of them to the cached domain.
Oh, that's a good catch!
I will address all of your notes.
Thank you
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-14 18:00 ` Nicolin Chen
@ 2025-11-17 4:52 ` Tian, Kevin
2025-11-17 19:26 ` Nicolin Chen
0 siblings, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2025-11-17 4:52 UTC (permalink / raw)
To: Nicolin Chen
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, November 15, 2025 2:01 AM
>
> On Fri, Nov 14, 2025 at 09:45:31AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, November 11, 2025 1:13 PM
> > >
> > > +/*
> > > + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables
> ATS
> > > before
> > > + * initiating a reset. Notify the iommu driver that enabled ATS.
> > > + */
> > > +int pci_reset_iommu_prepare(struct pci_dev *dev)
> > > +{
> > > + if (pci_ats_supported(dev))
> > > + return iommu_dev_reset_prepare(&dev->dev);
> > > + return 0;
> > > +}
> >
> > the comment says "driver that enabled ATS", but the code checks
> > whether ATS is supported.
> >
> > which one is desired?
>
> The comments says "the iommu driver that enabled ATS". It doesn't
> conflict with what the PCI core checks here?
Actually this is sent to all IOMMU drivers. There is no check on whether
a specific driver has enabled ATS in this path.
>
> > > + /* Have to call it after waiting for pending DMA transaction */
> > > + ret = pci_reset_iommu_prepare(dev);
> > > + if (ret) {
> > > + pci_err(dev, "failed to stop IOMMU\n");
> >
> > the error message could be more informative.
>
> OK. Perhaps print the ret value.
>
and mention that it's for PCI reset.
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-11-11 5:12 ` [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-11-12 6:18 ` Baolu Lu
2025-11-14 9:37 ` Tian, Kevin
@ 2025-11-17 4:59 ` Tian, Kevin
2025-11-17 19:27 ` Nicolin Chen
2025-11-17 23:04 ` Bjorn Helgaas
3 siblings, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2025-11-17 4:59 UTC (permalink / raw)
To: Nicolin Chen, joro@8bytes.org, afael@kernel.org,
bhelgaas@google.com, alex@shazbot.org, jgg@nvidia.com
Cc: will@kernel.org, robin.murphy@arm.com, lenb@kernel.org,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, patches@lists.linux.dev, Jaroszynski, Piotr,
Sethi, Vikram, helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 11, 2025 1:13 PM
>
> + *
> + * This function attaches all RID/PASID of the device's to
> IOMMU_DOMAIN_BLOCKED
> + * allowing any blocked-domain-supporting IOMMU driver to pause
> translation and
__iommu_group_alloc_blocking_domain() will allocate a paging
domain if a driver doesn't support blocked_domain itself. So in the
end this applies to all IOMMU drivers.
I saw several other places mention IOMMU_DOMAIN_BLOCKED in this
series. Not very accurate.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-17 4:52 ` Tian, Kevin
@ 2025-11-17 19:26 ` Nicolin Chen
2025-11-18 0:29 ` Tian, Kevin
0 siblings, 1 reply; 36+ messages in thread
From: Nicolin Chen @ 2025-11-17 19:26 UTC (permalink / raw)
To: Tian, Kevin
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Mon, Nov 17, 2025 at 04:52:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, November 15, 2025 2:01 AM
> >
> > On Fri, Nov 14, 2025 at 09:45:31AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Tuesday, November 11, 2025 1:13 PM
> > > >
> > > > +/*
> > > > + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables
> > ATS
> > > > before
> > > > + * initiating a reset. Notify the iommu driver that enabled ATS.
> > > > + */
> > > > +int pci_reset_iommu_prepare(struct pci_dev *dev)
> > > > +{
> > > > + if (pci_ats_supported(dev))
> > > > + return iommu_dev_reset_prepare(&dev->dev);
> > > > + return 0;
> > > > +}
> > >
> > > the comment says "driver that enabled ATS", but the code checks
> > > whether ATS is supported.
> > >
> > > which one is desired?
> >
> > The comments says "the iommu driver that enabled ATS". It doesn't
> > conflict with what the PCI core checks here?
>
> actually this is sent to all IOMMU drivers. there is no check on whether
> a specific driver has enabled ATS in this path.
But the comment doesn't say "check"..
How about "Notify the iommu driver that enables/disables ATS"?
The point is that pci_enable_ats() is called in iommu drivers.
> > > > + /* Have to call it after waiting for pending DMA transaction */
> > > > + ret = pci_reset_iommu_prepare(dev);
> > > > + if (ret) {
> > > > + pci_err(dev, "failed to stop IOMMU\n");
> > >
> > > the error message could be more informative.
> >
> > OK. Perhaps print the ret value.
> >
>
> and mention that it's for PCI reset.
OK.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-11-17 4:59 ` Tian, Kevin
@ 2025-11-17 19:27 ` Nicolin Chen
0 siblings, 0 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-17 19:27 UTC (permalink / raw)
To: Tian, Kevin
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Mon, Nov 17, 2025 at 04:59:33AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, November 11, 2025 1:13 PM
> >
> > + *
> > + * This function attaches all RID/PASID of the device's to
> > IOMMU_DOMAIN_BLOCKED
> > + * allowing any blocked-domain-supporting IOMMU driver to pause
> > translation and
>
> __iommu_group_alloc_blocking_domain() will allocate a paging
> domain if a driver doesn't support blocked_domain itself. So in the
> end this applies to all IOMMU drivers.
>
> I saw several other places mention IOMMU_DOMAIN_BLOCKED in this
> series. Not very accurate.
OK. I will replace that with "group->blocked_domain"
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-11 5:12 ` [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
2025-11-14 9:45 ` Tian, Kevin
@ 2025-11-17 22:58 ` Bjorn Helgaas
2025-11-18 8:16 ` Nicolin Chen
1 sibling, 1 reply; 36+ messages in thread
From: Bjorn Helgaas @ 2025-11-17 22:58 UTC (permalink / raw)
To: Nicolin Chen
Cc: joro, rafael, bhelgaas, alex, jgg, kevin.tian, will, robin.murphy,
lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel, linux-acpi,
linux-pci, kvm, patches, pjaroszynski, vsethi, etzhao1900
On Mon, Nov 10, 2025 at 09:12:55PM -0800, Nicolin Chen wrote:
Run "git log --oneline drivers/pci/pci.c" and match the subject line
style.
> PCIe permits a device to ignore ATS invalidation TLPs, while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out: e.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.
s/TLPs, while/TLPs while/
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
Include spec revision, e.g., "PCIe r7.0, sec 10.3.1".
> Now iommu_dev_reset_prepare/done() helpers are introduced for this matter.
s/Now ... are introduced for this matter/Add ...helpers/
> Use them in all the existing reset functions, which will attach the device
> to an IOMMU_DOMAIN_BLOCKED during a reset, so as to allow IOMMU driver to:
> - invoke pci_disable_ats() and pci_enable_ats(), if necessary
> - wait for all ATS invalidations to complete
> - stop issuing new ATS invalidations
> - fence any incoming ATS queries
Thanks for addressing this problem.
> +++ b/drivers/pci/pci-acpi.c
> @@ -971,6 +971,7 @@ void pci_set_acpi_fwnode(struct pci_dev *dev)
> int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> {
> acpi_handle handle = ACPI_HANDLE(&dev->dev);
> + int ret = 0;
Unnecessary initialization.
> +int pci_reset_iommu_prepare(struct pci_dev *dev)
> +{
> + if (pci_ats_supported(dev))
> + return iommu_dev_reset_prepare(&dev->dev);
Why bother checking pci_ats_supported() here? That could be done
inside iommu_dev_reset_prepare(), since iommu.c already uses
dev_is_pci() and pci_ats_supported() is already exported outside
drivers/pci/.
> +void pci_reset_iommu_done(struct pci_dev *dev)
> +{
> + if (pci_ats_supported(dev))
> + iommu_dev_reset_done(&dev->dev);
And here.
> int pcie_flr(struct pci_dev *dev)
> {
> + int ret = 0;
Unnecessary initialization.
> static int pci_af_flr(struct pci_dev *dev, bool probe)
> {
> + int ret = 0;
Unnecessary initialization.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-11-11 5:12 ` [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
` (2 preceding siblings ...)
2025-11-17 4:59 ` Tian, Kevin
@ 2025-11-17 23:04 ` Bjorn Helgaas
3 siblings, 0 replies; 36+ messages in thread
From: Bjorn Helgaas @ 2025-11-17 23:04 UTC (permalink / raw)
To: Nicolin Chen
Cc: joro, rafael, bhelgaas, alex, jgg, kevin.tian, will, robin.murphy,
lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel, linux-acpi,
linux-pci, kvm, patches, pjaroszynski, vsethi, etzhao1900
On Mon, Nov 10, 2025 at 09:12:54PM -0800, Nicolin Chen wrote:
> PCIe permits a device to ignore ATS invalidation TLPs, while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out. E.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.
s/TLPs, while/TLPs while/
> The OS should do something to mitigate this as we do not want production
> systems to be reporting critical ATS failures, especially in a hypervisor
> environment. Broadly, OS could arrange to ignore the timeouts, block page
> table mutations to prevent invalidations, or disable and block ATS.
>
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
>
> Provide a callback from the PCI subsystem that will enclose the reset and
> have the iommu core temporarily change all the attached domain to BLOCKED.
> After attaching a BLOCKED domain, IOMMU hardware would fence any incoming
> ATS queries. And IOMMU drivers should also synchronously stop issuing new
> ATS invalidations and wait for all ATS invalidations to complete. This can
> avoid any ATS invaliation timeouts.
>
> However, if there is a domain attachment/replacement happening during an
> ongoing reset, ATS routines may be re-activated between the two function
> calls. So, introduce a new resetting_domain in the iommu_group structure
> to reject any concurrent attach_dev/set_dev_pasid call during a reset for
> a concern of compatibility failure. Since this changes the behavior of an
> attach operation, update the uAPI accordingly.
>
> Note that there are two corner cases:
> 1. Devices in the same iommu_group
> Since an attachment is always per iommu_group, disallowing one device
> to switch domains (or HWPTs in iommufd) would have to disallow others
> in the same iommu_group to switch domains as well. So, play safe by
> preventing a shared iommu_group from going through the iommu reset.
> 2. SRIOV devices that its PF is resetting while its VF isn't
Slightly awkward. Maybe:
2. An SR-IOV PF that is being reset while its VF is not
(Obviously resetting a PF destroys all the VFs, which I guess is what
you're hinting at below.)
> In such case, the VF itself is already broken. So, there is no point
> in preventing PF from going through the iommu reset.
> + * iommu_dev_reset_prepare() - Block IOMMU to prepare for a device reset
> + * @dev: device that is going to enter a reset routine
> + *
> + * When certain device is entering a reset routine, it wants to block any IOMMU
> + * activity during the reset routine. This includes blocking any translation as
> + * well as cache invalidation (especially the device cache).
> + *
> + * This function attaches all RID/PASID of the device's to IOMMU_DOMAIN_BLOCKED
> + * allowing any blocked-domain-supporting IOMMU driver to pause translation and
> + * cahce invalidation, but leaves the software domain pointers intact so later
s/cahce/cache/
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-17 19:26 ` Nicolin Chen
@ 2025-11-18 0:29 ` Tian, Kevin
2025-11-18 1:42 ` Nicolin Chen
0 siblings, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2025-11-18 0:29 UTC (permalink / raw)
To: Nicolin Chen
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 18, 2025 3:27 AM
>
> On Mon, Nov 17, 2025 at 04:52:05AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Saturday, November 15, 2025 2:01 AM
> > >
> > > On Fri, Nov 14, 2025 at 09:45:31AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Tuesday, November 11, 2025 1:13 PM
> > > > >
> > > > > +/*
> > > > > + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software
> disables
> > > ATS
> > > > > before
> > > > > + * initiating a reset. Notify the iommu driver that enabled ATS.
> > > > > + */
> > > > > +int pci_reset_iommu_prepare(struct pci_dev *dev)
> > > > > +{
> > > > > + if (pci_ats_supported(dev))
> > > > > + return iommu_dev_reset_prepare(&dev->dev);
> > > > > + return 0;
> > > > > +}
> > > >
> > > > the comment says "driver that enabled ATS", but the code checks
> > > > whether ATS is supported.
> > > >
> > > > which one is desired?
> > >
> > > The comments says "the iommu driver that enabled ATS". It doesn't
> > > conflict with what the PCI core checks here?
> >
> > actually this is sent to all IOMMU drivers. there is no check on whether
> > a specific driver has enabled ATS in this path.
>
> But the comment doesn't say "check"..
>
> How about "Notify the iommu driver that enables/disables ATS"?
>
> The point is that pci_enable_ats() is called in iommu drivers.
>
But in the current way, even an IOMMU driver which doesn't call
pci_enable_ats() will also be notified, so I don't see the
point of adding an attribute to "the iommu driver".
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-18 0:29 ` Tian, Kevin
@ 2025-11-18 1:42 ` Nicolin Chen
2025-11-18 5:38 ` Baolu Lu
2025-11-18 7:53 ` Tian, Kevin
0 siblings, 2 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-18 1:42 UTC (permalink / raw)
To: Tian, Kevin
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Tue, Nov 18, 2025 at 12:29:43AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, November 18, 2025 3:27 AM
> >
> > On Mon, Nov 17, 2025 at 04:52:05AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Saturday, November 15, 2025 2:01 AM
> > > >
> > > > On Fri, Nov 14, 2025 at 09:45:31AM +0000, Tian, Kevin wrote:
> > > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > > Sent: Tuesday, November 11, 2025 1:13 PM
> > > > > >
> > > > > > +/*
> > > > > > + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software
> > disables
> > > > ATS
> > > > > > before
> > > > > > + * initiating a reset. Notify the iommu driver that enabled ATS.
> > > > > > + */
> > > > > > +int pci_reset_iommu_prepare(struct pci_dev *dev)
> > > > > > +{
> > > > > > + if (pci_ats_supported(dev))
> > > > > > + return iommu_dev_reset_prepare(&dev->dev);
> > > > > > + return 0;
> > > > > > +}
> > > > >
> > > > > the comment says "driver that enabled ATS", but the code checks
> > > > > whether ATS is supported.
> > > > >
> > > > > which one is desired?
> > > >
> > > > The comments says "the iommu driver that enabled ATS". It doesn't
> > > > conflict with what the PCI core checks here?
> > >
> > > actually this is sent to all IOMMU drivers. there is no check on whether
> > > a specific driver has enabled ATS in this path.
> >
> > But the comment doesn't say "check"..
> >
> > How about "Notify the iommu driver that enables/disables ATS"?
> >
> > The point is that pci_enable_ats() is called in iommu drivers.
> >
>
> but in current way even an iommu driver which doesn't call
> pci_enable_ats() will also be notified then I didn't see the
> point of adding an attribute to "the iommu driver".
Hmm, that's a fair point.
Having looked closely, I see that only AMD and ARM call that to enable
ATS. How do others (e.g. Intel) enable it?
And what do you think of the following?
/*
 * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS before
 * initiating a reset. Though not all IOMMU drivers call pci_enable_ats(), it
 * only gets invoked in IOMMU drivers. And it is racy to check dev->ats_enabled
 * here, as a concurrent IOMMU attachment can enable ATS right after this line.
 *
 * Notify the IOMMU driver to stop IOMMU translations until the reset is done,
 * to ensure that the ATS function and its related invalidations are disabled.
 */
Thanks
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-18 1:42 ` Nicolin Chen
@ 2025-11-18 5:38 ` Baolu Lu
2025-11-18 6:53 ` Nicolin Chen
2025-11-18 7:53 ` Tian, Kevin
1 sibling, 1 reply; 36+ messages in thread
From: Baolu Lu @ 2025-11-18 5:38 UTC (permalink / raw)
To: Nicolin Chen, Tian, Kevin
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On 11/18/25 09:42, Nicolin Chen wrote:
> On Tue, Nov 18, 2025 at 12:29:43AM +0000, Tian, Kevin wrote:
>>> From: Nicolin Chen<nicolinc@nvidia.com>
>>> Sent: Tuesday, November 18, 2025 3:27 AM
>>>
>>> On Mon, Nov 17, 2025 at 04:52:05AM +0000, Tian, Kevin wrote:
>>>>> From: Nicolin Chen<nicolinc@nvidia.com>
>>>>> Sent: Saturday, November 15, 2025 2:01 AM
>>>>>
>>>>> On Fri, Nov 14, 2025 at 09:45:31AM +0000, Tian, Kevin wrote:
>>>>>>> From: Nicolin Chen<nicolinc@nvidia.com>
>>>>>>> Sent: Tuesday, November 11, 2025 1:13 PM
>>>>>>>
>>>>>>> +/*
>>>>>>> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software
>>> disables
>>>>> ATS
>>>>>>> before
>>>>>>> + * initiating a reset. Notify the iommu driver that enabled ATS.
>>>>>>> + */
>>>>>>> +int pci_reset_iommu_prepare(struct pci_dev *dev)
>>>>>>> +{
>>>>>>> + if (pci_ats_supported(dev))
>>>>>>> + return iommu_dev_reset_prepare(&dev->dev);
>>>>>>> + return 0;
>>>>>>> +}
>>>>>> the comment says "driver that enabled ATS", but the code checks
>>>>>> whether ATS is supported.
>>>>>>
>>>>>> which one is desired?
>>>>> The comments says "the iommu driver that enabled ATS". It doesn't
>>>>> conflict with what the PCI core checks here?
>>>> actually this is sent to all IOMMU drivers. there is no check on whether
>>>> a specific driver has enabled ATS in this path.
>>> But the comment doesn't say "check"..
>>>
>>> How about "Notify the iommu driver that enables/disables ATS"?
>>>
>>> The point is that pci_enable_ats() is called in iommu drivers.
>>>
>> but in current way even an iommu driver which doesn't call
>> pci_enable_ats() will also be notified then I didn't see the
>> point of adding an attribute to "the iommu driver".
> Hmm, that's a fair point.
>
> Having looked closely, I see only AMD and ARM call that to enable
> ATs. How others (e.g. Intel) enable it?
The VT-d driver enables ATS in the iommu probe_finalize() path (for
scalable mode).
static void intel_iommu_probe_finalize(struct device *dev)
{
	[...]
	if (sm_supported(iommu) && !dev_is_real_dma_subdevice(dev)) {
		iommu_enable_pci_ats(info);
		/* Assign a DEVTLB cache tag to the default domain. */
		if (info->ats_enabled && info->domain) {
			u16 did = domain_id_iommu(info->domain, iommu);
			if (cache_tag_assign(info->domain, did, dev,
					     IOMMU_NO_PASID,
					     CACHE_TAG_DEVTLB))
				iommu_disable_pci_ats(info);
		}
	}
	[...]
}
iommu_enable_pci_ats() will eventually call pci_enable_ats() after some
necessary checks.
Thanks,
baolu
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-18 5:38 ` Baolu Lu
@ 2025-11-18 6:53 ` Nicolin Chen
0 siblings, 0 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-18 6:53 UTC (permalink / raw)
To: Baolu Lu
Cc: Tian, Kevin, joro@8bytes.org, afael@kernel.org,
bhelgaas@google.com, alex@shazbot.org, jgg@nvidia.com,
will@kernel.org, robin.murphy@arm.com, lenb@kernel.org,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Tue, Nov 18, 2025 at 01:38:40PM +0800, Baolu Lu wrote:
> The VT-d driver enables ATS in the iommu probe_finalize() path (for
> scalable mode).
..
> iommu_enable_pci_ats() will eventually call pci_enable_ats() after some
> necessary checks.
Oh, I missed that one.
Thanks!
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-12 17:41 ` Nicolin Chen
@ 2025-11-18 7:02 ` Nicolin Chen
2025-11-19 2:47 ` Baolu Lu
0 siblings, 1 reply; 36+ messages in thread
From: Nicolin Chen @ 2025-11-18 7:02 UTC (permalink / raw)
To: Baolu Lu
Cc: joro, afael, bhelgaas, alex, jgg, kevin.tian, will, robin.murphy,
lenb, linux-arm-kernel, iommu, linux-kernel, linux-acpi,
linux-pci, kvm, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On Wed, Nov 12, 2025 at 09:41:25AM -0800, Nicolin Chen wrote:
> Hi Baolu,
>
> On Wed, Nov 12, 2025 at 01:58:51PM +0800, Baolu Lu wrote:
> > On 11/11/25 13:12, Nicolin Chen wrote:
> > > +/**
> > > + * iommu_get_domain_for_dev() - Return the DMA API domain pointer
> > > + * @dev - Device to query
> > > + *
> > > + * This function can be called within a driver bound to dev. The returned
> > > + * pointer is valid for the lifetime of the bound driver.
> > > + *
> > > + * It should not be called by drivers with driver_managed_dma = true.
> >
> > "driver_managed_dma != true" means the driver will use the default
> > domain allocated by the iommu core during iommu probe.
>
> Hmm, I am not very sure. Jason's remarks pointed out that There
> is an exception in host1x_client_iommu_detach():
> https://lore.kernel.org/all/20250924191055.GJ2617119@nvidia.com/
>
> Where the group->domain could be NULL, i.e. not attached to the
> default domain?
>
> > The iommu core
> > ensures that this domain stored at group->domain will not be changed
> > during the driver's whole lifecycle. That's reasonable.
> >
> > How about making some code to enforce this requirement? Something like
> > below ...
> >
> > > + */
> > > struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
> > > {
> > > /* Caller must be a probed driver on dev */
> > > @@ -2225,10 +2234,29 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
> > > if (!group)
> > > return NULL;
> > > + lockdep_assert_not_held(&group->mutex);
> >
> > ...
> > if (WARN_ON(!dev->driver || !group->owner_cnt || group->owner))
> > return NULL;
>
> With that, could host1x_client_iommu_detach() trigger WARN_ON?
Hi Baolu,
For v6, I tend to keep this API as-is, trying not to cause trouble
for existing callers. Jason suggested a potential followup series:
https://lore.kernel.org/linux-iommu/20250821131304.GM802098@nvidia.com/
That would replace this function, so maybe we can think about that.
If you have a strong feeling about the WARN_ON, please let me know.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-18 1:42 ` Nicolin Chen
2025-11-18 5:38 ` Baolu Lu
@ 2025-11-18 7:53 ` Tian, Kevin
2025-11-18 8:17 ` Nicolin Chen
1 sibling, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2025-11-18 7:53 UTC (permalink / raw)
To: Nicolin Chen
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 18, 2025 9:42 AM
>
> On Tue, Nov 18, 2025 at 12:29:43AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, November 18, 2025 3:27 AM
> > >
> > > On Mon, Nov 17, 2025 at 04:52:05AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Saturday, November 15, 2025 2:01 AM
> > > > >
> > > > > On Fri, Nov 14, 2025 at 09:45:31AM +0000, Tian, Kevin wrote:
> > > > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > > > Sent: Tuesday, November 11, 2025 1:13 PM
> > > > > > >
> > > > > > > +/*
> > > > > > > + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software
> > > disables
> > > > > ATS
> > > > > > > before
> > > > > > > + * initiating a reset. Notify the iommu driver that enabled ATS.
> > > > > > > + */
> > > > > > > +int pci_reset_iommu_prepare(struct pci_dev *dev)
> > > > > > > +{
> > > > > > > + if (pci_ats_supported(dev))
> > > > > > > + return iommu_dev_reset_prepare(&dev->dev);
> > > > > > > + return 0;
> > > > > > > +}
> > > > > >
> > > > > > the comment says "driver that enabled ATS", but the code checks
> > > > > > whether ATS is supported.
> > > > > >
> > > > > > which one is desired?
> > > > >
> > > > > The comments says "the iommu driver that enabled ATS". It doesn't
> > > > > conflict with what the PCI core checks here?
> > > >
> > > > actually this is sent to all IOMMU drivers. there is no check on whether
> > > > a specific driver has enabled ATS in this path.
> > >
> > > But the comment doesn't say "check"..
> > >
> > > How about "Notify the iommu driver that enables/disables ATS"?
> > >
> > > The point is that pci_enable_ats() is called in iommu drivers.
> > >
> >
> > but in current way even an iommu driver which doesn't call
> > pci_enable_ats() will also be notified then I didn't see the
> > point of adding an attribute to "the iommu driver".
>
> Hmm, that's a fair point.
>
> Having looked closely, I see only AMD and ARM call that to enable
> ATs. How others (e.g. Intel) enable it?
>
> And how do you think of the followings?
>
> /*
> * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> before
> * initiating a reset. Though not all IOMMU drivers calls pci_enable_ats(), it
> * only gets invoked in IOMMU driver. And it is racy to check dev-
> >ats_enabled
> * here, as a concurrent IOMMU attachment can enable ATS right after this
> line.
> *
> * Notify the IOMMU driver to stop IOMMU translations until the reset is
> done,
> * to ensure that the ATS function and its related invalidations are disabled.
> */
>
I'd remove the words between "Though not ..." and "after this line", which
could be explained on the iommu side, following Bjorn's suggestion to not check
pci_ats_supported() in the PCI core.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-17 22:58 ` Bjorn Helgaas
@ 2025-11-18 8:16 ` Nicolin Chen
0 siblings, 0 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-18 8:16 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: joro, rafael, bhelgaas, alex, jgg, kevin.tian, will, robin.murphy,
lenb, baolu.lu, linux-arm-kernel, iommu, linux-kernel, linux-acpi,
linux-pci, kvm, patches, pjaroszynski, vsethi, etzhao1900
On Mon, Nov 17, 2025 at 04:58:52PM -0600, Bjorn Helgaas wrote:
> On Mon, Nov 10, 2025 at 09:12:55PM -0800, Nicolin Chen wrote:
> > +int pci_reset_iommu_prepare(struct pci_dev *dev)
> > +{
> > + if (pci_ats_supported(dev))
> > + return iommu_dev_reset_prepare(&dev->dev);
>
> Why bother checking pci_ats_supported() here? That could be done
> inside iommu_dev_reset_prepare(), since iommu.c already uses
> dev_is_pci() and pci_ats_supported() is already exported outside
> drivers/pci/.
Ack. I will fix all of these.
Thanks for the review!
Nicolin
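To illustrate, moving the check as suggested could end up looking roughly like
this (a sketch under those assumptions, not the eventual v6 code):

/* pci.c side becomes a thin pass-through */
int pci_reset_iommu_prepare(struct pci_dev *dev)
{
	return iommu_dev_reset_prepare(&dev->dev);
}

/* and iommu_dev_reset_prepare() filters on ATS support itself */
int iommu_dev_reset_prepare(struct device *dev)
{
	if (!dev_has_iommu(dev))
		return 0;
	if (!dev_is_pci(dev) || !pci_ats_supported(to_pci_dev(dev)))
		return 0;
	/* ... continue with the group->mutex path as in the patch ... */
	return 0;
}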
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device
2025-11-18 7:53 ` Tian, Kevin
@ 2025-11-18 8:17 ` Nicolin Chen
0 siblings, 0 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-18 8:17 UTC (permalink / raw)
To: Tian, Kevin
Cc: joro@8bytes.org, afael@kernel.org, bhelgaas@google.com,
alex@shazbot.org, jgg@nvidia.com, will@kernel.org,
robin.murphy@arm.com, lenb@kernel.org, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Tue, Nov 18, 2025 at 07:53:27AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > And how do you think of the followings?
> >
> > /*
> > * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> > before
> > * initiating a reset. Though not all IOMMU drivers calls pci_enable_ats(), it
> > * only gets invoked in IOMMU driver. And it is racy to check dev-
> > >ats_enabled
> > * here, as a concurrent IOMMU attachment can enable ATS right after this
> > line.
> > *
> > * Notify the IOMMU driver to stop IOMMU translations until the reset is
> > done,
> > * to ensure that the ATS function and its related invalidations are disabled.
> > */
> >
>
> I'd remove the words between "Though not ..." and "after this line", which
> could be explained in iommu side following Bjorn's suggestion to not check
> pci_ats_supported() in pci core.
OK. Thanks!
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-18 7:02 ` Nicolin Chen
@ 2025-11-19 2:47 ` Baolu Lu
2025-11-19 2:57 ` Nicolin Chen
0 siblings, 1 reply; 36+ messages in thread
From: Baolu Lu @ 2025-11-19 2:47 UTC (permalink / raw)
To: Nicolin Chen
Cc: joro, afael, bhelgaas, alex, jgg, kevin.tian, will, robin.murphy,
lenb, linux-arm-kernel, iommu, linux-kernel, linux-acpi,
linux-pci, kvm, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On 11/18/25 15:02, Nicolin Chen wrote:
> On Wed, Nov 12, 2025 at 09:41:25AM -0800, Nicolin Chen wrote:
>> Hi Baolu,
>>
>> On Wed, Nov 12, 2025 at 01:58:51PM +0800, Baolu Lu wrote:
>>> On 11/11/25 13:12, Nicolin Chen wrote:
>>>> +/**
>>>> + * iommu_get_domain_for_dev() - Return the DMA API domain pointer
>>>> + * @dev - Device to query
>>>> + *
>>>> + * This function can be called within a driver bound to dev. The returned
>>>> + * pointer is valid for the lifetime of the bound driver.
>>>> + *
>>>> + * It should not be called by drivers with driver_managed_dma = true.
>>>
>>> "driver_managed_dma != true" means the driver will use the default
>>> domain allocated by the iommu core during iommu probe.
>>
>> Hmm, I am not very sure. Jason's remarks pointed out that There
>> is an exception in host1x_client_iommu_detach():
>> https://lore.kernel.org/all/20250924191055.GJ2617119@nvidia.com/
>>
>> Where the group->domain could be NULL, i.e. not attached to the
>> default domain?
>>
>>> The iommu core
>>> ensures that this domain stored at group->domain will not be changed
>>> during the driver's whole lifecycle. That's reasonable.
>>>
>>> How about making some code to enforce this requirement? Something like
>>> below ...
>>>
>>>> + */
>>>> struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
>>>> {
>>>> /* Caller must be a probed driver on dev */
>>>> @@ -2225,10 +2234,29 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
>>>> if (!group)
>>>> return NULL;
>>>> + lockdep_assert_not_held(&group->mutex);
>>>
>>> ...
>>> if (WARN_ON(!dev->driver || !group->owner_cnt || group->owner))
>>> return NULL;
>>
>> With that, could host1x_client_iommu_detach() trigger WARN_ON?
>
> Hi Baolu,
>
> For v6, I tend to keep this API as-is, trying not to give troubles
> to existing callers. Jason suggested a potential followup series:
> https://lore.kernel.org/linux-iommu/20250821131304.GM802098@nvidia.com/
> That would replace this function, so maybe we can think about that.
>
> If you have a strong feeling about the WARN_ON, please let me know.
>
> Thanks
> Nicolin
No strong feeling. I am fine with it because the comments have already
stated that "This function can be called within a driver bound to dev.".
Thanks,
baolu
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
2025-11-19 2:47 ` Baolu Lu
@ 2025-11-19 2:57 ` Nicolin Chen
0 siblings, 0 replies; 36+ messages in thread
From: Nicolin Chen @ 2025-11-19 2:57 UTC (permalink / raw)
To: Baolu Lu
Cc: joro, afael, bhelgaas, alex, jgg, kevin.tian, will, robin.murphy,
lenb, linux-arm-kernel, iommu, linux-kernel, linux-acpi,
linux-pci, kvm, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On Wed, Nov 19, 2025 at 10:47:26AM +0800, Baolu Lu wrote:
> On 11/18/25 15:02, Nicolin Chen wrote:
> > For v6, I tend to keep this API as-is, trying not to give troubles
> > to existing callers. Jason suggested a potential followup series:
> > https://lore.kernel.org/linux-iommu/20250821131304.GM802098@nvidia.com/
> > That would replace this function, so maybe we can think about that.
> >
> > If you have a strong feeling about the WARN_ON, please let me know.
> >
> No strong feeling. I am fine with it because the comments have already
> stated that "This function can be called within a driver bound to dev.".
Ack. Thanks!
Nicolin
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2025-11-19 2:57 UTC | newest]
Thread overview: 36+ messages
-- links below jump to the message on this page --
2025-11-11 5:12 [PATCH v5 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-11-11 5:12 ` [PATCH v5 1/5] iommu: Lock group->mutex in iommu_deferred_attach() Nicolin Chen
2025-11-12 2:47 ` Baolu Lu
2025-11-11 5:12 ` [PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops() Nicolin Chen
2025-11-12 5:22 ` Baolu Lu
2025-11-14 9:17 ` Tian, Kevin
2025-11-14 9:18 ` Tian, Kevin
2025-11-11 5:12 ` [PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper Nicolin Chen
2025-11-12 5:58 ` Baolu Lu
2025-11-12 17:41 ` Nicolin Chen
2025-11-18 7:02 ` Nicolin Chen
2025-11-19 2:47 ` Baolu Lu
2025-11-19 2:57 ` Nicolin Chen
2025-11-12 8:52 ` kernel test robot
2025-11-14 9:18 ` Tian, Kevin
2025-11-11 5:12 ` [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-11-12 6:18 ` Baolu Lu
2025-11-12 17:43 ` Nicolin Chen
2025-11-14 9:37 ` Tian, Kevin
2025-11-14 18:26 ` Nicolin Chen
2025-11-17 4:59 ` Tian, Kevin
2025-11-17 19:27 ` Nicolin Chen
2025-11-17 23:04 ` Bjorn Helgaas
2025-11-11 5:12 ` [PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
2025-11-14 9:45 ` Tian, Kevin
2025-11-14 18:00 ` Nicolin Chen
2025-11-17 4:52 ` Tian, Kevin
2025-11-17 19:26 ` Nicolin Chen
2025-11-18 0:29 ` Tian, Kevin
2025-11-18 1:42 ` Nicolin Chen
2025-11-18 5:38 ` Baolu Lu
2025-11-18 6:53 ` Nicolin Chen
2025-11-18 7:53 ` Tian, Kevin
2025-11-18 8:17 ` Nicolin Chen
2025-11-17 22:58 ` Bjorn Helgaas
2025-11-18 8:16 ` Nicolin Chen