* [PATCH v3 0/5] Disable ATS via iommu during PCI resets
@ 2025-08-11 22:59 Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach Nicolin Chen
` (4 more replies)
0 siblings, 5 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-11 22:59 UTC (permalink / raw)
To: robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, baolu.lu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
Hi all,
PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS, where an ATS invalidation
command will time out: e.g. an SVA domain has no coordination with a reset
event and can racily issue ATS invalidations to a resetting device.
The OS should do something to mitigate this, as we do not want production
systems reporting critical ATS failures, especially in a hypervisor
environment. Broadly, the OS could arrange to ignore the timeouts, block
page table mutations to prevent invalidations, or disable and block ATS.
The IMPLEMENTATION NOTE in sec 10.3.1 of the PCIe spec recommends disabling
and blocking ATS before initiating a Function Level Reset, and notes that
other reset methods can have the same vulnerability as well.
Provide a callback from the PCI subsystem that will enclose the reset and
have the iommu core temporarily change domains to group->blocking_domain,
so IOMMU drivers can fence any incoming ATS queries, synchronously stop
issuing new ATS invalidations, and wait for existing ATS invalidations to
complete. Doing this avoids any ATS invalidation timeouts.
When a device is resetting, any new domain attachment should be deferred
until the reset is finished, to prevent ATS activity from being activated
between the two callback functions. Introduce a new pending_reset flag that
allows the iommu core to cache the target domains at the SW level while
bypassing the driver-level attachment. Later, iommu_dev_reset_done() will
re-attach these soft-attached domains via __iommu_attach_device/set_group_pasid().
Finally, apply these iommu_dev_reset_prepare/done() functions in the PCI
reset functions.
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommu_dev_reset-v3
Changelog
v3
* Add Reviewed-by from Jason
* [iommu] Add a fast return in iommu_deferred_attach()
* [iommu] Update kdocs, inline comments, and commit logs
* [iommu] Use group->blocking_domain v.s. ops->blocked_domain
* [iommu] Drop require_direct, iommu_group_get(), and xa_lock()
* [iommu] Set the pending_reset flag after RID/PASID domain setups
* [iommu] Do not bypass PASID domains when RID domain is already the
blocking_domain
* [iommu] Add iommu_get_domain_for_dev_locked to correctly return the
blocking_domain
v2
https://lore.kernel.org/all/cover.1751096303.git.nicolinc@nvidia.com/
* [iommu] Update kdocs, inline comments, and commit logs
* [iommu] Replace long-holding group->mutex with a pending_reset flag
* [pci] Abort reset routines if iommu_dev_reset_prepare() fails
* [pci] Apply the same vulnerability fix to other reset functions
v1
https://lore.kernel.org/all/cover.1749494161.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (5):
iommu: Lock group->mutex in iommu_deferred_attach
iommu: Pass in gdev to __iommu_device_set_domain
iommu: Add iommu_get_domain_for_dev_locked() helper
iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
pci: Suspend iommu function prior to resetting a device
include/linux/iommu.h | 13 ++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +-
drivers/iommu/arm/arm-smmu/qcom_iommu.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/iommu/fsl_pamu_domain.c | 2 +-
drivers/iommu/iommu.c | 205 +++++++++++++++++-
drivers/iommu/ipmmu-vmsa.c | 2 +-
drivers/iommu/msm_iommu.c | 2 +-
drivers/iommu/mtk_iommu.c | 2 +-
drivers/iommu/omap-iommu.c | 2 +-
drivers/iommu/tegra-smmu.c | 2 +-
drivers/pci/pci-acpi.c | 17 +-
drivers/pci/pci.c | 68 +++++-
drivers/pci/quirks.c | 23 +-
15 files changed, 323 insertions(+), 30 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-11 22:59 [PATCH v3 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
@ 2025-08-11 22:59 ` Nicolin Chen
2025-08-15 5:09 ` Baolu Lu
2025-08-15 8:24 ` Tian, Kevin
2025-08-11 22:59 ` [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain Nicolin Chen
` (3 subsequent siblings)
4 siblings, 2 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-11 22:59 UTC (permalink / raw)
To: robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, baolu.lu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
iommu_deferred_attach() is an asynchronous function called at runtime by
the iommu-dma code, and it can race against other attach functions if it
accesses anything in dev->iommu_group.
So, grab group->mutex to guard __iommu_attach_device() like other callers do.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 060ebe330ee16..1e0116bce0762 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2144,10 +2144,17 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
{
- if (dev->iommu && dev->iommu->attach_deferred)
- return __iommu_attach_device(domain, dev);
+ /*
+ * This is called on the dma mapping fast path so avoid locking. This is
+ * racy, but we have an expectation that the driver will setup its DMAs
+ * inside probe while being single threaded to avoid racing.
+ */
+ if (!dev->iommu || !dev->iommu->attach_deferred)
+ return 0;
- return 0;
+ guard(mutex)(&dev->iommu_group->mutex);
+
+ return __iommu_attach_device(domain, dev);
}
void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
--
2.43.0
* [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain
2025-08-11 22:59 [PATCH v3 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach Nicolin Chen
@ 2025-08-11 22:59 ` Nicolin Chen
2025-08-15 5:18 ` Baolu Lu
2025-08-15 8:29 ` Tian, Kevin
2025-08-11 22:59 ` [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper Nicolin Chen
` (2 subsequent siblings)
4 siblings, 2 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-11 22:59 UTC (permalink / raw)
To: robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, baolu.lu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
A device under reset will be attached to a blocked domain without updating
the group->domain pointer. So there needs to be a per-device flag indicating
the reset state, which other iommu core functions can check so as not to
shift the attached domain during the reset.
The regular device pointer can't store any private iommu flag, so the flag
has to live in the gdev structure.
Pass in the gdev pointer instead to the functions that will check this
per-device flag.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommu.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1e0116bce0762..e6a66dacce1b8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -112,7 +112,7 @@ enum {
};
static int __iommu_device_set_domain(struct iommu_group *group,
- struct device *dev,
+ struct group_device *gdev,
struct iommu_domain *new_domain,
unsigned int flags);
static int __iommu_group_set_domain_internal(struct iommu_group *group,
@@ -602,7 +602,7 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
if (group->default_domain)
iommu_create_device_direct_mappings(group->default_domain, dev);
if (group->domain) {
- ret = __iommu_device_set_domain(group, dev, group->domain, 0);
+ ret = __iommu_device_set_domain(group, gdev, group->domain, 0);
if (ret)
goto err_remove_gdev;
} else if (!group->default_domain && !group_list) {
@@ -2263,10 +2263,11 @@ int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
EXPORT_SYMBOL_GPL(iommu_attach_group);
static int __iommu_device_set_domain(struct iommu_group *group,
- struct device *dev,
+ struct group_device *gdev,
struct iommu_domain *new_domain,
unsigned int flags)
{
+ struct device *dev = gdev->dev;
int ret;
/*
@@ -2346,8 +2347,7 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
*/
result = 0;
for_each_group_device(group, gdev) {
- ret = __iommu_device_set_domain(group, gdev->dev, new_domain,
- flags);
+ ret = __iommu_device_set_domain(group, gdev, new_domain, flags);
if (ret) {
result = ret;
/*
@@ -2379,7 +2379,7 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
*/
if (group->domain)
WARN_ON(__iommu_device_set_domain(
- group, gdev->dev, group->domain,
+ group, gdev, group->domain,
IOMMU_SET_DOMAIN_MUST_SUCCEED));
if (gdev == last_gdev)
break;
--
2.43.0
* [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-11 22:59 [PATCH v3 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain Nicolin Chen
@ 2025-08-11 22:59 ` Nicolin Chen
2025-08-15 5:28 ` Baolu Lu
` (2 more replies)
2025-08-11 22:59 ` [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
4 siblings, 3 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-11 22:59 UTC (permalink / raw)
To: robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, baolu.lu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
There is a need to attach a PCI device that is under reset temporarily to
the blocked domain (i.e. detach it from its previously attached domain),
and then to reattach it back to its previous domain (i.e. detach it from
the blocked domain) after the reset.
During the reset stage, there can be races with other attach/detach
operations. To solve this, a per-gdev reset flag will be introduced so that
all the attach functions bypass the driver-level attach_dev callbacks and
only update the group->domain pointer. The reset recovery procedure will
then attach directly to the cached pointer, so things go back to normal.
On the other hand, the iommu_get_domain_for_dev() API always returns the
group->domain pointer, and several IOMMU drivers call this API in their
attach_dev callback functions to get the currently attached domain for a
device, which would be broken in the recovery case mentioned above:
1. core asks the driver to attach dev from blocked to group->domain
2. driver attaches dev from group->domain to group->domain
So, iommu_get_domain_for_dev() should check the gdev flag and return the
blocked domain if the flag is set. But a caller of this API may or may not
already hold the group->mutex, making it difficult to add the lock there.
Introduce a new iommu_get_domain_for_dev_locked() helper to be used by
drivers in contexts that are already under the protection of group->mutex,
e.g. those attach_dev callback functions, and roll out the new helper to
all the existing IOMMU drivers.
Add a lockdep_assert_not_held to the existing iommu_get_domain_for_dev()
to enforce that it is only used outside the group->mutex.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 1 +
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +++++----
drivers/iommu/arm/arm-smmu/qcom_iommu.c | 2 +-
drivers/iommu/dma-iommu.c | 2 +-
drivers/iommu/fsl_pamu_domain.c | 2 +-
drivers/iommu/iommu.c | 14 ++++++++++++++
drivers/iommu/ipmmu-vmsa.c | 2 +-
drivers/iommu/msm_iommu.c | 2 +-
drivers/iommu/mtk_iommu.c | 2 +-
drivers/iommu/omap-iommu.c | 2 +-
drivers/iommu/tegra-smmu.c | 2 +-
12 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473d..61b17883cb0cb 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -909,6 +909,7 @@ extern int iommu_attach_device(struct iommu_domain *domain,
extern void iommu_detach_device(struct iommu_domain *domain,
struct device *dev);
extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
+struct iommu_domain *iommu_get_domain_for_dev_locked(struct device *dev);
extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 8cd8929bbfdf8..6d44c97d430b4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -145,7 +145,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
struct arm_smmu_master *master = dev_iommu_priv_get(dev);
struct arm_smmu_attach_state state = {
.master = master,
- .old_domain = iommu_get_domain_for_dev(dev),
+ .old_domain = iommu_get_domain_for_dev_locked(dev),
.ssid = IOMMU_NO_PASID,
};
struct arm_smmu_ste ste;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5968043ac8023..3efe51ae37edb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3010,7 +3010,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
struct arm_smmu_device *smmu;
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct arm_smmu_attach_state state = {
- .old_domain = iommu_get_domain_for_dev(dev),
+ .old_domain = iommu_get_domain_for_dev_locked(dev),
.ssid = IOMMU_NO_PASID,
};
struct arm_smmu_master *master;
@@ -3124,7 +3124,8 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
struct arm_smmu_cd *cd, struct iommu_domain *old)
{
- struct iommu_domain *sid_domain = iommu_get_domain_for_dev(master->dev);
+ struct iommu_domain *sid_domain =
+ iommu_get_domain_for_dev_locked(master->dev);
struct arm_smmu_attach_state state = {
.master = master,
.ssid = pasid,
@@ -3190,7 +3191,7 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
*/
if (!arm_smmu_ssids_in_use(&master->cd_table)) {
struct iommu_domain *sid_domain =
- iommu_get_domain_for_dev(master->dev);
+ iommu_get_domain_for_dev_locked(master->dev);
if (sid_domain->type == IOMMU_DOMAIN_IDENTITY ||
sid_domain->type == IOMMU_DOMAIN_BLOCKED)
@@ -3207,7 +3208,7 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
struct arm_smmu_master *master = dev_iommu_priv_get(dev);
struct arm_smmu_attach_state state = {
.master = master,
- .old_domain = iommu_get_domain_for_dev(dev),
+ .old_domain = iommu_get_domain_for_dev_locked(dev),
.ssid = IOMMU_NO_PASID,
};
diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index c5be95e560317..c82fbcd28cdde 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -390,7 +390,7 @@ static int qcom_iommu_attach_dev(struct iommu_domain *domain, struct device *dev
static int qcom_iommu_identity_attach(struct iommu_domain *identity_domain,
struct device *dev)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+ struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
struct qcom_iommu_domain *qcom_domain;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct qcom_iommu_dev *qcom_iommu = dev_iommu_priv_get(dev);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index ea2ef53bd4fef..99680cdb57265 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -2097,7 +2097,7 @@ EXPORT_SYMBOL_GPL(dma_iova_destroy);
void iommu_setup_dma_ops(struct device *dev)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+ struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
if (dev_is_pci(dev))
dev->iommu->pci_32bit_workaround = !iommu_dma_forcedac;
diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index 5f08523f97cb9..2b7395c5884ea 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -300,7 +300,7 @@ static int fsl_pamu_attach_device(struct iommu_domain *domain,
static int fsl_pamu_platform_attach(struct iommu_domain *platform_domain,
struct device *dev)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+ struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
struct fsl_dma_domain *dma_domain;
const u32 *prop;
int len;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e6a66dacce1b8..8c277cc8e9750 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2176,6 +2176,7 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
}
EXPORT_SYMBOL_GPL(iommu_detach_device);
+/* Caller can be any general/external function that isn't an IOMMU callback */
struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
{
/* Caller must be a probed driver on dev */
@@ -2184,10 +2185,23 @@ struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
if (!group)
return NULL;
+ lockdep_assert_not_held(&group->mutex);
+
return group->domain;
}
EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
+/* Caller must be an IOMMU callback/internal function that holds group->mutex */
+struct iommu_domain *iommu_get_domain_for_dev_locked(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+
+ lockdep_assert_held(&group->mutex);
+
+ return group->domain;
+}
+EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev_locked);
+
/*
* For IOMMU_DOMAIN_DMA implementations which already provide their own
* guarantees that the group and its default domain are valid and correct.
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index ffa892f657140..7e8de6e1bf78b 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -639,7 +639,7 @@ static int ipmmu_attach_device(struct iommu_domain *io_domain,
static int ipmmu_iommu_identity_attach(struct iommu_domain *identity_domain,
struct device *dev)
{
- struct iommu_domain *io_domain = iommu_get_domain_for_dev(dev);
+ struct iommu_domain *io_domain = iommu_get_domain_for_dev_locked(dev);
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct ipmmu_vmsa_domain *domain;
unsigned int i;
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 43a61ba021a51..58cc08c8ede8b 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -443,7 +443,7 @@ static int msm_iommu_attach_dev(struct iommu_domain *domain, struct device *dev)
static int msm_iommu_identity_attach(struct iommu_domain *identity_domain,
struct device *dev)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+ struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
struct msm_priv *priv;
unsigned long flags;
struct msm_iommu_dev *iommu;
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 0e0285348d2b8..31b29bcfc7569 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -775,7 +775,7 @@ static int mtk_iommu_attach_device(struct iommu_domain *domain,
static int mtk_iommu_identity_attach(struct iommu_domain *identity_domain,
struct device *dev)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+ struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
struct mtk_iommu_data *data = dev_iommu_priv_get(dev);
if (domain == identity_domain || !domain)
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 6fb93927bdb98..881fa5d243a15 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1538,7 +1538,7 @@ static void _omap_iommu_detach_dev(struct omap_iommu_domain *omap_domain,
static int omap_iommu_identity_attach(struct iommu_domain *identity_domain,
struct device *dev)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+ struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
struct omap_iommu_domain *omap_domain;
if (domain == identity_domain || !domain)
diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 36cdd5fbab077..fdec0439bd683 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -526,7 +526,7 @@ static int tegra_smmu_attach_dev(struct iommu_domain *domain,
static int tegra_smmu_identity_attach(struct iommu_domain *identity_domain,
struct device *dev)
{
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+ struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct tegra_smmu_as *as;
struct tegra_smmu *smmu;
--
2.43.0
* [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-08-11 22:59 [PATCH v3 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
` (2 preceding siblings ...)
2025-08-11 22:59 ` [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper Nicolin Chen
@ 2025-08-11 22:59 ` Nicolin Chen
2025-08-15 5:49 ` Baolu Lu
2025-08-15 9:14 ` Tian, Kevin
2025-08-11 22:59 ` [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
4 siblings, 2 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-11 22:59 UTC (permalink / raw)
To: robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, baolu.lu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS, where an ATS invalidation
command will time out: e.g. an SVA domain has no coordination with a reset
event and can racily issue ATS invalidations to a resetting device.
The OS should do something to mitigate this, as we do not want production
systems reporting critical ATS failures, especially in a hypervisor
environment. Broadly, the OS could arrange to ignore the timeouts, block
page table mutations to prevent invalidations, or disable and block ATS.
The IMPLEMENTATION NOTE in sec 10.3.1 of the PCIe spec recommends disabling
and blocking ATS before initiating a Function Level Reset, and notes that
other reset methods can have the same vulnerability as well.
Provide a callback from the PCI subsystem that will enclose the reset and
have the iommu core temporarily change all the attached domains to BLOCKED.
After attaching a BLOCKED domain, IOMMU drivers should fence any incoming
ATS queries, synchronously stop issuing new ATS invalidations, and wait
for all ATS invalidations to complete. This avoids any ATS invalidation
timeouts.
However, if a domain attachment/replacement happens during an ongoing
reset, ATS routines may be re-activated between the two function calls.
So, introduce a new pending_reset flag in group_device to defer attachments
during a reset, allowing the iommu core to cache the target domains at the
SW level while bypassing the driver. iommu_dev_reset_done() will re-attach
these soft-attached domains once the device reset is finished.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 12 +++
drivers/iommu/iommu.c | 166 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 178 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 61b17883cb0cb..35181d4d8f302 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1168,6 +1168,9 @@ void dev_iommu_priv_set(struct device *dev, void *priv);
extern struct mutex iommu_probe_device_lock;
int iommu_probe_device(struct device *dev);
+int iommu_dev_reset_prepare(struct device *dev);
+void iommu_dev_reset_done(struct device *dev);
+
int iommu_device_use_default_domain(struct device *dev);
void iommu_device_unuse_default_domain(struct device *dev);
@@ -1452,6 +1455,15 @@ static inline int iommu_fwspec_add_ids(struct device *dev, u32 *ids,
return -ENODEV;
}
+static inline int iommu_dev_reset_prepare(struct device *dev)
+{
+ return 0;
+}
+
+static inline void iommu_dev_reset_done(struct device *dev)
+{
+}
+
static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
{
return NULL;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8c277cc8e9750..c1f8aa5d79f8e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -71,12 +71,29 @@ struct group_device {
struct list_head list;
struct device *dev;
char *name;
+ bool pending_reset : 1;
};
/* Iterate over each struct group_device in a struct iommu_group */
#define for_each_group_device(group, pos) \
list_for_each_entry(pos, &(group)->devices, list)
+/* Callers must hold the dev->iommu_group->mutex */
+static struct group_device *device_to_group_device(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ struct group_device *gdev;
+
+ lockdep_assert_held(&group->mutex);
+
+ /* gdev must be in the list */
+ for_each_group_device(group, gdev) {
+ if (gdev->dev == dev)
+ break;
+ }
+ return gdev;
+}
+
struct iommu_group_attribute {
struct attribute attr;
ssize_t (*show)(struct iommu_group *group, char *buf);
@@ -2154,6 +2171,12 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
guard(mutex)(&dev->iommu_group->mutex);
+ /*
+ * A concurrent attach can occur while the device is resetting. Defer it
+ * until iommu_dev_reset_done() attaches the device to group->domain.
+ */
+ if (device_to_group_device(dev)->pending_reset)
+ return 0;
return __iommu_attach_device(domain, dev);
}
@@ -2198,6 +2221,16 @@ struct iommu_domain *iommu_get_domain_for_dev_locked(struct device *dev)
lockdep_assert_held(&group->mutex);
+ /*
+ * The driver handles the low-level __iommu_attach_device(), including
+ * the one invoked by iommu_dev_reset_done(). In that case the driver
+ * must see the blocking domain rather than group->domain, which caches
+ * the domain from before iommu_dev_reset_prepare(), so that it doesn't
+ * attach the device from group->domain (old) to group->domain (new).
+ */
+ if (device_to_group_device(dev)->pending_reset)
+ return group->blocking_domain;
+
return group->domain;
}
EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev_locked);
@@ -2305,6 +2338,13 @@ static int __iommu_device_set_domain(struct iommu_group *group,
dev->iommu->attach_deferred = 0;
}
+ /*
+ * A concurrent attach can occur while the device is resetting. Defer it
+ * until iommu_dev_reset_done() attaches the device to group->domain.
+ */
+ if (gdev->pending_reset)
+ return 0;
+
ret = __iommu_attach_device(new_domain, dev);
if (ret) {
/*
@@ -3388,6 +3428,13 @@ static int __iommu_set_group_pasid(struct iommu_domain *domain,
int ret;
for_each_group_device(group, device) {
+ /*
+ * A concurrent attach can occur while the device is resetting.
+ * Defer it until iommu_dev_reset_done() attaches the device to
+ * group->domain.
+ */
+ if (device->pending_reset)
+ continue;
if (device->dev->iommu->max_pasids > 0) {
ret = domain->ops->set_dev_pasid(domain, device->dev,
pasid, old);
@@ -3809,6 +3856,125 @@ int iommu_replace_group_handle(struct iommu_group *group,
}
EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+/**
+ * iommu_dev_reset_prepare() - Block IOMMU to prepare for a device reset
+ * @dev: device that is going to enter a reset routine
+ *
+ * When a device is entering a reset routine, it wants to block any IOMMU
+ * activity during the reset. This includes blocking any translation as
+ * well as cache invalidation (especially the device cache).
+ *
+ * This function attaches all RIDs/PASIDs of the device to IOMMU_DOMAIN_BLOCKED,
+ * allowing any blocked-domain-supporting IOMMU driver to pause translation and
+ * cache invalidation, but leaves the software domain pointers intact so that
+ * iommu_dev_reset_done() can restore everything later.
+ *
+ * Return: 0 on success or negative error code if the preparation failed.
+ *
+ * Caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done() together
+ * before/after the core-level reset routine, to clear the pending_reset flag.
+ *
+ * These two functions are designed to be used by PCI reset functions, which do
+ * not invoke any racy iommu_release_device() since the PCI sysfs node gets
+ * removed before it notifies with a BUS_NOTIFY_REMOVED_DEVICE. When using them
+ * in other cases, callers must ensure there will be no racy
+ * iommu_release_device() call, which would otherwise UAF dev->iommu_group.
+ */
+int iommu_dev_reset_prepare(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ unsigned long pasid;
+ void *entry;
+ int ret = 0;
+
+ if (!dev_has_iommu(dev))
+ return 0;
+
+ mutex_lock(&group->mutex);
+
+ ret = __iommu_group_alloc_blocking_domain(group);
+ if (ret)
+ goto unlock;
+
+ /* Dock RID domain to blocking_domain while retaining group->domain */
+ if (group->domain != group->blocking_domain) {
+ ret = __iommu_attach_device(group->blocking_domain, dev);
+ if (ret)
+ goto unlock;
+ }
+
+ /*
+ * Dock PASID domains to blocking_domain while retaining pasid_array.
+ *
+ * The pasid_array is mostly fenced by group->mutex, except one reader
+ * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+ */
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+ iommu_remove_dev_pasid(dev, pasid,
+ pasid_array_entry_to_domain(entry));
+
+ device_to_group_device(dev)->pending_reset = true;
+
+unlock:
+ mutex_unlock(&group->mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
+
+/**
+ * iommu_dev_reset_done() - Restore IOMMU after a device reset is finished
+ * @dev: device that has finished a reset routine
+ *
+ * When a device has finished a reset routine, it wants to restore its IOMMU
+ * activity, including new translation as well as cache invalidation, by
+ * re-attaching all RIDs/PASIDs of the device back to the domains retained in
+ * the core-level structure.
+ *
+ * Caller must pair it with a successfully returned iommu_dev_reset_prepare().
+ *
+ * Note that, although unlikely, there is a risk that re-attaching domains
+ * might fail due to something unexpected such as OOM.
+ */
+void iommu_dev_reset_done(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ struct group_device *gdev;
+ unsigned long pasid;
+ void *entry;
+
+ if (!dev_has_iommu(dev))
+ return;
+
+ mutex_lock(&group->mutex);
+
+ gdev = device_to_group_device(dev);
+
+ /* iommu_dev_reset_prepare() was not successfully called */
+ if (WARN_ON(!group->blocking_domain || !gdev->pending_reset)) {
+ mutex_unlock(&group->mutex);
+ return;
+ }
+
+ /* Shift RID domain back to group->domain */
+ if (group->domain != group->blocking_domain)
+ WARN_ON(__iommu_attach_device(group->domain, dev));
+
+ /*
+ * Shift PASID domains back to domains retained in pasid_array.
+ *
+ * The pasid_array is mostly fenced by group->mutex, except one reader
+ * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+ */
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+ WARN_ON(__iommu_set_group_pasid(
+ pasid_array_entry_to_domain(entry), group, pasid,
+ group->blocking_domain));
+
+ gdev->pending_reset = false;
+ mutex_unlock(&group->mutex);
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
+
#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
/**
* iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
--
2.43.0
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-11 22:59 [PATCH v3 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
` (3 preceding siblings ...)
2025-08-11 22:59 ` [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
@ 2025-08-11 22:59 ` Nicolin Chen
2025-08-19 14:12 ` Ethan Zhao
4 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-11 22:59 UTC (permalink / raw)
To: robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, baolu.lu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
PCIe permits a device to ignore ATS invalidation TLPs while processing a
reset. This creates a problem visible to the OS, where an ATS invalidation
command will time out: e.g. an SVA domain has no coordination with a reset
event and can racily issue ATS invalidations to a resetting device.
The PCIe spec, in the sec 10.3.1 IMPLEMENTATION NOTE, recommends disabling and
blocking ATS before initiating a Function Level Reset. It also mentions that
other reset methods could have the same vulnerability.
Now that iommu_dev_reset_prepare/done() helpers are introduced for this matter,
use them in all the existing reset functions. They attach the device to an
IOMMU_DOMAIN_BLOCKED during a reset, so as to allow the IOMMU driver to:
- invoke pci_disable_ats() and pci_enable_ats(), if necessary
- wait for all ATS invalidations to complete
- stop issuing new ATS invalidations
- fence any incoming ATS queries
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/pci/pci-acpi.c | 17 +++++++++--
drivers/pci/pci.c | 68 ++++++++++++++++++++++++++++++++++++++----
drivers/pci/quirks.c | 23 +++++++++++++-
3 files changed, 100 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index ddb25960ea47d..adaf46422c05d 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -9,6 +9,7 @@
#include <linux/delay.h>
#include <linux/init.h>
+#include <linux/iommu.h>
#include <linux/irqdomain.h>
#include <linux/pci.h>
#include <linux/msi.h>
@@ -969,6 +970,7 @@ void pci_set_acpi_fwnode(struct pci_dev *dev)
int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
{
acpi_handle handle = ACPI_HANDLE(&dev->dev);
+ int ret = 0;
if (!handle || !acpi_has_method(handle, "_RST"))
return -ENOTTY;
@@ -976,12 +978,23 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
if (probe)
return 0;
+ /*
+ * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
+ * before initiating a reset. Notify the IOMMU driver, which may have
+ * enabled ATS.
+ */
+ ret = iommu_dev_reset_prepare(&dev->dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
if (ACPI_FAILURE(acpi_evaluate_object(handle, "_RST", NULL, NULL))) {
pci_warn(dev, "ACPI _RST failed\n");
- return -ENOTTY;
+ ret = -ENOTTY;
}
- return 0;
+ iommu_dev_reset_done(&dev->dev);
+ return ret;
}
bool acpi_pci_power_manageable(struct pci_dev *dev)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b0f4d98036cdd..d6d87e22d81b3 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -13,6 +13,7 @@
#include <linux/delay.h>
#include <linux/dmi.h>
#include <linux/init.h>
+#include <linux/iommu.h>
#include <linux/msi.h>
#include <linux/of.h>
#include <linux/pci.h>
@@ -4529,13 +4530,26 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
*/
int pcie_flr(struct pci_dev *dev)
{
+ int ret = 0;
+
if (!pci_wait_for_pending_transaction(dev))
pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
+ /*
+ * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
+ * before initiating a reset. Notify the IOMMU driver, which may have
+ * enabled ATS. This must be called after waiting for pending DMA
+ * transactions.
+ */
+ ret = iommu_dev_reset_prepare(&dev->dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
if (dev->imm_ready)
- return 0;
+ goto done;
/*
* Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within
@@ -4544,7 +4558,11 @@ int pcie_flr(struct pci_dev *dev)
*/
msleep(100);
- return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
+
+done:
+ iommu_dev_reset_done(&dev->dev);
+ return ret;
}
EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4572,6 +4590,7 @@ EXPORT_SYMBOL_GPL(pcie_reset_flr);
static int pci_af_flr(struct pci_dev *dev, bool probe)
{
+ int ret = 0;
int pos;
u8 cap;
@@ -4598,10 +4617,21 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
PCI_AF_STATUS_TP << 8))
pci_err(dev, "timed out waiting for pending transaction; performing AF function level reset anyway\n");
+ /*
+ * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
+ * before initiating a reset. Notify the IOMMU driver, which may have
+ * enabled ATS. This must be called after waiting for pending DMA
+ * transactions.
+ */
+ ret = iommu_dev_reset_prepare(&dev->dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
if (dev->imm_ready)
- return 0;
+ goto done;
/*
* Per Advanced Capabilities for Conventional PCI ECN, 13 April 2006,
@@ -4611,7 +4641,11 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
*/
msleep(100);
- return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
+
+done:
+ iommu_dev_reset_done(&dev->dev);
+ return ret;
}
/**
@@ -4632,6 +4666,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
static int pci_pm_reset(struct pci_dev *dev, bool probe)
{
u16 csr;
+ int ret;
if (!dev->pm_cap || dev->dev_flags & PCI_DEV_FLAGS_NO_PM_RESET)
return -ENOTTY;
@@ -4646,6 +4681,16 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
if (dev->current_state != PCI_D0)
return -EINVAL;
+ /*
+ * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
+ * before initiating a reset. Notify the IOMMU driver, which may have
+ * enabled ATS.
+ */
+ ret = iommu_dev_reset_prepare(&dev->dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
csr &= ~PCI_PM_CTRL_STATE_MASK;
csr |= PCI_D3hot;
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
@@ -4656,7 +4701,9 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
pci_dev_d3_sleep(dev);
- return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
+ ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
+ iommu_dev_reset_done(&dev->dev);
+ return ret;
}
/**
@@ -5111,6 +5158,16 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
if (rc)
return -ENOTTY;
+ /*
+ * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
+ * before initiating a reset. Notify the IOMMU driver, which may have
+ * enabled ATS.
+ */
+ rc = iommu_dev_reset_prepare(&dev->dev);
+ if (rc) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return rc;
+ }
+
if (reg & PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR) {
val = reg;
} else {
@@ -5125,6 +5182,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
reg);
+ iommu_dev_reset_done(&dev->dev);
return rc;
}
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index d97335a401930..6157c6c02bdb0 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -21,6 +21,7 @@
#include <linux/pci.h>
#include <linux/isa-dma.h> /* isa_dma_bridge_buggy */
#include <linux/init.h>
+#include <linux/iommu.h>
#include <linux/delay.h>
#include <linux/acpi.h>
#include <linux/dmi.h>
@@ -4225,6 +4226,26 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ 0 }
};
+static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
+ const struct pci_dev_reset_methods *i)
+{
+ int ret;
+
+ /*
+ * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
+ * before initiating a reset. Notify the IOMMU driver, which may have
+ * enabled ATS.
+ */
+ ret = iommu_dev_reset_prepare(&dev->dev);
+ if (ret) {
+ pci_err(dev, "failed to stop IOMMU\n");
+ return ret;
+ }
+
+ ret = i->reset(dev, probe);
+ iommu_dev_reset_done(&dev->dev);
+ return ret;
+}
+
/*
* These device-specific reset methods are here rather than in a driver
* because when a host assigns a device to a guest VM, the host may need
@@ -4239,7 +4260,7 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
i->vendor == (u16)PCI_ANY_ID) &&
(i->device == dev->device ||
i->device == (u16)PCI_ANY_ID))
- return i->reset(dev, probe);
+ return __pci_dev_specific_reset(dev, probe, i);
}
return -ENOTTY;
--
2.43.0
* Re: [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-11 22:59 ` [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach Nicolin Chen
@ 2025-08-15 5:09 ` Baolu Lu
2025-08-15 8:27 ` Tian, Kevin
2025-08-15 8:24 ` Tian, Kevin
1 sibling, 1 reply; 55+ messages in thread
From: Baolu Lu @ 2025-08-15 5:09 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, linux-arm-kernel, iommu,
linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On 8/12/25 06:59, Nicolin Chen wrote:
> The iommu_deferred_attach() is a runtime asynchronous function called by
> iommu-dma function, which could race against other attach functions if it
> accesses something in the dev->iommu_group.
>
> So, grab the mutex to guard __iommu_attach_device() like other callers.
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommu.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 060ebe330ee16..1e0116bce0762 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2144,10 +2144,17 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
>
> int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
> {
> - if (dev->iommu && dev->iommu->attach_deferred)
> - return __iommu_attach_device(domain, dev);
> + /*
> + * This is called on the dma mapping fast path so avoid locking. This is
> + * racy, but we have an expectation that the driver will setup its DMAs
Why not make it like this,
int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
{
/* Caller must be a probed driver on dev */
if (!dev->iommu_group)
return 0;
guard(mutex)(&dev->iommu_group->mutex);
if (!dev->iommu->attach_deferred)
return 0;
return __iommu_attach_device(domain, dev);
}
?
> + * inside probe while being single threaded to avoid racing.
> + */
> + if (!dev->iommu || !dev->iommu->attach_deferred)
> + return 0;
>
> - return 0;
> + guard(mutex)(&dev->iommu_group->mutex);
> +
> + return __iommu_attach_device(domain, dev);
> }
>
> void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
Thanks,
baolu
* Re: [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain
2025-08-11 22:59 ` [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain Nicolin Chen
@ 2025-08-15 5:18 ` Baolu Lu
2025-08-15 8:29 ` Tian, Kevin
1 sibling, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2025-08-15 5:18 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, linux-arm-kernel, iommu,
linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On 8/12/25 06:59, Nicolin Chen wrote:
> The device under the reset will be attached to a blocked domain, while not
> updating the group->domain pointer. So there needs to be a per-device flag
> to indicate the reset state, for other iommu core functions to check so as
> not to shift the attached domain during the reset state.
>
> The regular device pointer can't store any private any private iommu flag.
> So, the flag has to be in the gdev structure.
>
> Pass in the gdev pointer instead to the functions that will check that per
> device flag.
>
> Reviewed-by: Jason Gunthorpe<jgg@nvidia.com>
> Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-11 22:59 ` [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper Nicolin Chen
@ 2025-08-15 5:28 ` Baolu Lu
2025-08-15 18:48 ` Nicolin Chen
2025-08-18 14:40 ` Jason Gunthorpe
2025-08-15 8:55 ` Tian, Kevin
2025-08-18 14:39 ` Jason Gunthorpe
2 siblings, 2 replies; 55+ messages in thread
From: Baolu Lu @ 2025-08-15 5:28 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, linux-arm-kernel, iommu,
linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On 8/12/25 06:59, Nicolin Chen wrote:
> There is a need to attach a PCI device that's under a reset temporarily to
> the blocked domain (i.e. detach it from its previously attached domain),
> and then to reattach it back to its previous domain (i.e. detach it from
> the blocked domain) after reset.
>
> During the reset stage, there can be races from other attach/detachment.
> To solve this, a per-gdev reset flag will be introduced so that all the
> attach functions will bypass the driver-level attach_dev callbacks, but
> only update the group->domain pointer. The reset recovery procedure will
> attach directly to the cached pointer so things will be back to normal.
>
> On the other hand, the iommu_get_domain_for_dev() API always returns the
> group->domain pointer, and several IOMMU drivers call this API in their
> attach_dev callback functions to get the currently attached domain for a
> device, which will be broken for the recovery case mentioned above:
> 1. core asks the driver to attach dev from blocked to group->domain
> 2. driver attaches dev from group->domain to group->domain
>
> So, iommu_get_domain_for_dev() should check the gdev flag and return the
> blocked domain if the flag is set. But the caller of this API could hold
> the group->mutex already or not, making it difficult to add the lock.
>
> Introduce a new iommu_get_domain_for_dev_locked() helper to be used by
> those drivers in a context that is already under the protection of the
> group->mutex, e.g. those attach_dev callback functions. And roll out the
> new helper to all the existing IOMMU drivers.
Given that iommu_group->mutex is transparent to the iommu driver, how
about
/*
* Called only by iommu drivers in the callback context where
* group->mutex has already been held by the core.
*/
struct iommu_domain *iommu_get_domain_for_dev_internal(struct device *dev)
{
...
lockdep_assert_held(&group->mutex);
...
}
?
>
> Add a lockdep_assert_not_held to the existing iommu_get_domain_for_dev()
> to note that it would be only used outside the group->mutex.
>
> Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
Thanks,
baolu
* Re: [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-08-11 22:59 ` [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
@ 2025-08-15 5:49 ` Baolu Lu
2025-08-15 20:10 ` Nicolin Chen
2025-08-15 9:14 ` Tian, Kevin
1 sibling, 1 reply; 55+ messages in thread
From: Baolu Lu @ 2025-08-15 5:49 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, linux-arm-kernel, iommu,
linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On 8/12/25 06:59, Nicolin Chen wrote:
> PCIe permits a device to ignore ATS invalidation TLPs, while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out: e.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.
>
> The OS should do something to mitigate this as we do not want production
> systems to be reporting critical ATS failures, especially in a hypervisor
> environment. Broadly, OS could arrange to ignore the timeouts, block page
> table mutations to prevent invalidations, or disable and block ATS.
>
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
>
> Provide a callback from the PCI subsystem that will enclose the reset and
> have the iommu core temporarily change all the attached domain to BLOCKED.
> After attaching a BLOCKED domain, IOMMU drivers should fence any incoming
Nit, my understanding is that it's not the "IOMMU drivers" but the
"IOMMU hardware" that fences any further incoming translation requests,
right?
> ATS queries, synchronously stop issuing new ATS invalidations, and wait
> for all ATS invalidations to complete. This can avoid any ATS invalidation
> timeouts.
>
> However, if there is a domain attachment/replacement happening during an
> ongoing reset, ATS routines may be re-activated between the two function
> calls. So, introduce a new pending_reset flag in group_device to defer an
> attachment during a reset, allowing iommu core to cache target domains in
> the SW level while bypassing the driver. The iommu_dev_reset_done() will
> re-attach these soft-attached domains, once the device reset is finished.
>
> Signed-off-by: Nicolin Chen<nicolinc@nvidia.com>
The code looks good to me:
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
* RE: [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-11 22:59 ` [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach Nicolin Chen
2025-08-15 5:09 ` Baolu Lu
@ 2025-08-15 8:24 ` Tian, Kevin
2025-08-15 19:26 ` Nicolin Chen
2025-08-18 14:17 ` Jason Gunthorpe
1 sibling, 2 replies; 55+ messages in thread
From: Tian, Kevin @ 2025-08-15 8:24 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, jgg@nvidia.com
Cc: will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, August 12, 2025 6:59 AM
>
> The iommu_deferred_attach() is a runtime asynchronous function called by
> iommu-dma function, which could race against other attach functions if it
> accesses something in the dev->iommu_group.
Is there a real racing scenario being observed, or is this more theoretical?
If the former, this may need a Fixes tag.
>
> So, grab the mutex to guard __iommu_attach_device() like other callers.
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommu.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 060ebe330ee16..1e0116bce0762 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2144,10 +2144,17 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
>
> int iommu_deferred_attach(struct device *dev, struct iommu_domain
> *domain)
> {
> - if (dev->iommu && dev->iommu->attach_deferred)
> - return __iommu_attach_device(domain, dev);
> + /*
> + * This is called on the dma mapping fast path so avoid locking. This is
> + * racy, but we have an expectation that the driver will setup its
> DMAs
> + * inside probe while being single threaded to avoid racing.
> + */
> + if (!dev->iommu || !dev->iommu->attach_deferred)
> + return 0;
Is there any way to detect a driver breaking the expectation?
>
> - return 0;
> + guard(mutex)(&dev->iommu_group->mutex);
> +
> + return __iommu_attach_device(domain, dev);
> }
>
> void iommu_detach_device(struct iommu_domain *domain, struct device
> *dev)
> --
> 2.43.0
* RE: [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-15 5:09 ` Baolu Lu
@ 2025-08-15 8:27 ` Tian, Kevin
0 siblings, 0 replies; 55+ messages in thread
From: Tian, Kevin @ 2025-08-15 8:27 UTC (permalink / raw)
To: Baolu Lu, Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, jgg@nvidia.com
Cc: will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mediatek@lists.infradead.org,
linux-tegra@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, patches@lists.linux.dev,
Jaroszynski, Piotr, Sethi, Vikram, helgaas@kernel.org,
etzhao1900@gmail.com
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Friday, August 15, 2025 1:10 PM
>
> On 8/12/25 06:59, Nicolin Chen wrote:
> >
> > + /*
> > + * This is called on the dma mapping fast path so avoid locking. This is
> > + * racy, but we have an expectation that the driver will setup its
> DMAs
>
> Why not making it like this,
>
> int iommu_deferred_attach(struct device *dev, struct iommu_domain
> *domain)
> {
> /* Caller must be a probed driver on dev */
> if (!dev->iommu_group)
> return 0;
>
> guard(mutex)(&dev->iommu_group->mutex);
> if (!dev->iommu->attach_deferred)
> return 0;
>
> return __iommu_attach_device(domain, dev);
> }
>
As the comment says, it's the fast path so locking is avoided. Your way
implies that every map call needs to acquire the group mutex.
* RE: [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain
2025-08-11 22:59 ` [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain Nicolin Chen
2025-08-15 5:18 ` Baolu Lu
@ 2025-08-15 8:29 ` Tian, Kevin
1 sibling, 0 replies; 55+ messages in thread
From: Tian, Kevin @ 2025-08-15 8:29 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, jgg@nvidia.com
Cc: will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, August 12, 2025 6:59 AM
>
> The device under the reset will be attached to a blocked domain, while not
> updating the group->domain pointer. So there needs to be a per-device flag
> to indicate the reset state, for other iommu core functions to check so as
> not to shift the attached domain during the reset state.
>
> The regular device pointer can't store any private any private iommu flag.
> So, the flag has to be in the gdev structure.
duplicated "any private"
>
> Pass in the gdev pointer instead to the functions that will check that per
> device flag.
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* RE: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-11 22:59 ` [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper Nicolin Chen
2025-08-15 5:28 ` Baolu Lu
@ 2025-08-15 8:55 ` Tian, Kevin
2025-08-15 18:45 ` Nicolin Chen
2025-08-18 14:39 ` Jason Gunthorpe
2 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2025-08-15 8:55 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, jgg@nvidia.com
Cc: will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, August 12, 2025 6:59 AM
>
> There is a need to attach a PCI device that's under a reset temporarily to
> the blocked domain (i.e. detach it from its previously attached domain),
> and then to reattach it back to its previous domain (i.e. detach it from
> the blocked domain) after reset.
>
> During the reset stage, there can be races from other attach/detachment.
> To solve this, a per-gdev reset flag will be introduced so that all the
> attach functions will bypass the driver-level attach_dev callbacks, but
> only update the group->domain pointer. The reset recovery procedure will
> attach directly to the cached pointer so things will be back to normal.
>
> On the other hand, the iommu_get_domain_for_dev() API always returns the
> group->domain pointer, and several IOMMU drivers call this API in their
> attach_dev callback functions to get the currently attached domain for a
> device, which will be broken for the recovery case mentioned above:
> 1. core asks the driver to attach dev from blocked to group->domain
> 2. driver attaches dev from group->domain to group->domain
the 2nd bullet implies that a driver may skip the operation by noting that
the old domain is the same as the new one?
>
> So, iommu_get_domain_for_dev() should check the gdev flag and return the
> blocked domain if the flag is set. But the caller of this API could hold
> the group->mutex already or not, making it difficult to add the lock.
>
> Introduce a new iommu_get_domain_for_dev_locked() helper to be used by
> those drivers in a context that is already under the protection of the
> group->mutex, e.g. those attach_dev callback functions. And roll out the
> new helper to all the existing IOMMU drivers.
iommu_get_domain_for_dev() is also called outside of attach_dev
callback functions, e.g. malidp_get_pgsize_bitmap(). and the returned
info according to the attached domain might be saved in static
structures, e.g.:
ms->mmu_prefetch_pgsize = malidp_get_pgsize_bitmap(mp);
would that cause weird issues when blocking domain is returned
though one may not expect reset to happen at that point?
> +/* Caller can be any general/external function that isn't an IOMMU callback
> */
> struct iommu_domain *iommu_get_domain_for_dev(struct device *dev)
s/can/must/ ?
* RE: [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-08-11 22:59 ` [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-08-15 5:49 ` Baolu Lu
@ 2025-08-15 9:14 ` Tian, Kevin
2025-08-15 19:45 ` Nicolin Chen
1 sibling, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2025-08-15 9:14 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, jgg@nvidia.com
Cc: will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, August 12, 2025 6:59 AM
>
[...]
> However, if there is a domain attachment/replacement happening during an
> ongoing reset, ATS routines may be re-activated between the two function
> calls. So, introduce a new pending_reset flag in group_device to defer an
> attachment during a reset, allowing iommu core to cache target domains in
> the SW level while bypassing the driver. The iommu_dev_reset_done() will
> re-attach these soft-attached domains, once the device reset is finished.
An attach could fail, e.g. when the device and domain are incompatible. This
deferred attach (until reset done) makes the compatibility check impossible in
the resetting window, given that the driver's attach_dev callback is not called
in the normal attach path.
Worse..
> + /* Shift RID domain back to group->domain */
> + if (group->domain != group->blocking_domain)
> + WARN_ON(__iommu_attach_device(group->domain, dev));
..means that a user could trigger the WARN_ON() easily if he knows an attach
would fail.
IMHO we may just fail the attach request in the resetting window; then the
WARN_ON above makes sense, as it's shifting back to a domain which was
originally attached to the resetting device.
Not sure there is any valid case where one would want to attach/replace a
domain for a resetting device...
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-15 8:55 ` Tian, Kevin
@ 2025-08-15 18:45 ` Nicolin Chen
2025-08-21 8:07 ` Tian, Kevin
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-15 18:45 UTC (permalink / raw)
To: Tian, Kevin
Cc: robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
jgg@nvidia.com, will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Fri, Aug 15, 2025 at 08:55:28AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, August 12, 2025 6:59 AM
> > On the other hand, the iommu_get_domain_for_dev() API always returns the
> > group->domain pointer, and several IOMMU drivers call this API in their
> > attach_dev callback functions to get the currently attached domain for a
> > device, which will be broken for the recovery case mentioned above:
> > 1. core asks the driver to attach dev from blocked to group->domain
> > 2. driver attaches dev from group->domain to group->domain
>
> the 2nd bullet implies that a driver may skip the operation by noting that
> old domain is same as the new one?
Drivers use iommu_get_domain_for_dev() to get the "old domain".
But during a reset it doesn't return the actual old domain, which
should be the blocked domain; it returns group->domain instead.
The driver needs the actual (blocked) domain in that case because it
handles the requests from iommu_dev_reset_prepare/done().
> > So, iommu_get_domain_for_dev() should check the gdev flag and return the
> > blocked domain if the flag is set. But the caller of this API could hold
> > the group->mutex already or not, making it difficult to add the lock.
> >
> > Introduce a new iommu_get_domain_for_dev_locked() helper to be used by
> > those drivers in a context that is already under the protection of the
> > group->mutex, e.g. those attach_dev callback functions. And roll out the
> > new helper to all the existing IOMMU drivers.
>
> iommu_get_domain_for_dev() is also called outside of attach_dev
> callback functions, e.g. malidp_get_pgsize_bitmap(). and the returned
> info according to the attached domain might be saved in static
> structures, e.g.:
>
> ms->mmu_prefetch_pgsize = malidp_get_pgsize_bitmap(mp);
>
> would that cause weird issues when blocking domain is returned
> though one may not expect reset to happen at that point?
We aren't changing iommu_get_domain_for_dev(). So, it should be
used exclusively by functions that run outside group->mutex,
like this one, i.e. they should still get group->domain rather
than the blocked domain.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-15 5:28 ` Baolu Lu
@ 2025-08-15 18:48 ` Nicolin Chen
2025-08-18 14:40 ` Jason Gunthorpe
1 sibling, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-15 18:48 UTC (permalink / raw)
To: Baolu Lu
Cc: robin.murphy, joro, bhelgaas, jgg, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On Fri, Aug 15, 2025 at 01:28:02PM +0800, Baolu Lu wrote:
> On 8/12/25 06:59, Nicolin Chen wrote:
> Given that iommu_group->mutex is transparent to the iommu driver, how
> about
>
> /*
> * Called only by iommu drivers in the callback context where
> * group->mutex has already been held by the core.
> */
> struct iommu_domain *iommu_get_domain_for_dev_internal(struct device *dev)
> {
> ...
> lockdep_assert_held(&group->mutex);
> ...
> }
I thought about the "_internal" tag, but didn't pick that because
we seem to use it for internal functions inside iommu.c?
I don't have a problem using that though, as it can be noted like
your writing suggests.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-15 8:24 ` Tian, Kevin
@ 2025-08-15 19:26 ` Nicolin Chen
2025-08-18 14:17 ` Jason Gunthorpe
1 sibling, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-15 19:26 UTC (permalink / raw)
To: Tian, Kevin
Cc: robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
jgg@nvidia.com, will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Fri, Aug 15, 2025 at 08:24:57AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, August 12, 2025 6:59 AM
> >
> > The iommu_deferred_attach() is a runtime asynchronous function called by
> > iommu-dma function, which could race against other attach functions if it
> > accesses something in the dev->iommu_group.
>
> Is there a real racing scenario being observed or more theoretical?
>
> If the former may need a Fix tag.
Theoretical. I will highlight that in next version.
> > int iommu_deferred_attach(struct device *dev, struct iommu_domain
> > *domain)
> > {
> > - if (dev->iommu && dev->iommu->attach_deferred)
> > - return __iommu_attach_device(domain, dev);
> > + /*
> > + * This is called on the dma mapping fast path so avoid locking. This is
> > + * racy, but we have an expectation that the driver will setup its
> > DMAs
> > + * inside probe while being single threaded to avoid racing.
> > + */
> > + if (!dev->iommu || !dev->iommu->attach_deferred)
> > + return 0;
>
> Is there any way to detect a driver breaking the expectation?
Hmm, I am not sure.. that would sound tricky if we really want to
have a 2nd flag to protect this one..
And this patch doesn't change the race situation, if it does ever
exist..
If we really want to play safe, moving the flag under the lock as
Baolu suggested is probably the answer?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-08-15 9:14 ` Tian, Kevin
@ 2025-08-15 19:45 ` Nicolin Chen
0 siblings, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-15 19:45 UTC (permalink / raw)
To: Tian, Kevin
Cc: robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
jgg@nvidia.com, will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Fri, Aug 15, 2025 at 09:14:28AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, August 12, 2025 6:59 AM
> >
> [...]
> > However, if there is a domain attachment/replacement happening during an
> > ongoing reset, ATS routines may be re-activated between the two function
> > calls. So, introduce a new pending_reset flag in group_device to defer an
> > attachment during a reset, allowing iommu core to cache target domains in
> > the SW level while bypassing the driver. The iommu_dev_reset_done() will
> > re-attach these soft-attached domains, once the device reset is finished.
>
> attach could fail e.g. when device and domain are incompatible. This
> deferred attach (until reset done) makes compatibility check impossible in
> the resetting window, given the driver attach_dev callback is not called
> in the normal attach path.
>
> Worse..
Ah, that's a good point! I missed that one..
> > + /* Shift RID domain back to group->domain */
> > + if (group->domain != group->blocking_domain)
> > + WARN_ON(__iommu_attach_device(group->domain, dev));
>
> ..means that an user could trigger WARN_ON() easily if he knows an attach
> would fail.
>
> IMHO we may just fail attach request in the resetting window then
> WARN_ON above makes sense as it's shifting back to a domain which was
> originally attached to the resetting device.
>
> Not sure any valid case where one would want to attach/replace domain
> for a resetting device...
It would feel odd to me if that could happen. So, failing it with
-EBUSY sounds plausible to me.
Thanks!
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
2025-08-15 5:49 ` Baolu Lu
@ 2025-08-15 20:10 ` Nicolin Chen
0 siblings, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-15 20:10 UTC (permalink / raw)
To: Baolu Lu
Cc: robin.murphy, joro, bhelgaas, jgg, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas,
etzhao1900
On Fri, Aug 15, 2025 at 01:49:55PM +0800, Baolu Lu wrote:
> On 8/12/25 06:59, Nicolin Chen wrote:
> > Provide a callback from the PCI subsystem that will enclose the reset and
> > have the iommu core temporarily change all the attached domain to BLOCKED.
> > After attaching a BLOCKED domain, IOMMU drivers should fence any incoming
>
> Nit, my understanding is that it's not the "IOMMU drivers" but the
> "IOMMU hardware" that fences any further incoming translation requests,
> right?
Yes. I will change this to:
Provide a callback from the PCI subsystem that will enclose the reset and
have the iommu core temporarily change all the attached domains to BLOCKED.
After attaching a BLOCKED domain, the IOMMU hardware would fence any incoming
ATS queries. And IOMMU drivers should also synchronously stop issuing new
ATS invalidations and wait for all ATS invalidations to complete. This can
avoid any ATS invalidation timeouts.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-15 8:24 ` Tian, Kevin
2025-08-15 19:26 ` Nicolin Chen
@ 2025-08-18 14:17 ` Jason Gunthorpe
2025-08-18 17:45 ` Nicolin Chen
1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-18 14:17 UTC (permalink / raw)
To: Tian, Kevin
Cc: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, will@kernel.org,
robin.clark@oss.qualcomm.com, yong.wu@mediatek.com,
matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
rafael@kernel.org, lenb@kernel.org, Liu, Yi L,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mediatek@lists.infradead.org,
linux-tegra@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, patches@lists.linux.dev,
Jaroszynski, Piotr, Sethi, Vikram, helgaas@kernel.org,
etzhao1900@gmail.com
On Fri, Aug 15, 2025 at 08:24:57AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, August 12, 2025 6:59 AM
> >
> > The iommu_deferred_attach() is a runtime asynchronous function called by
> > iommu-dma function, which could race against other attach functions if it
> > accesses something in the dev->iommu_group.
>
> Is there a real racing scenario being observed or more theoretical?
I think the commit message should explain the actual reason this is
being done, which AFAICT is that the new lockdep assertions added in
the following patches would otherwise fail on this path.
Jason
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-11 22:59 ` [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper Nicolin Chen
2025-08-15 5:28 ` Baolu Lu
2025-08-15 8:55 ` Tian, Kevin
@ 2025-08-18 14:39 ` Jason Gunthorpe
2025-08-18 17:22 ` Nicolin Chen
2 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-18 14:39 UTC (permalink / raw)
To: Nicolin Chen
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Mon, Aug 11, 2025 at 03:59:10PM -0700, Nicolin Chen wrote:
> Introduce a new iommu_get_domain_for_dev_locked() helper to be used by
> those drivers in a context that is already under the protection of the
> group->mutex, e.g. those attach_dev callback functions. And roll out the
> new helper to all the existing IOMMU drivers.
I still don't much care for this, I think it would be better to
approach this problem by passing the old domain to the driver callback:
> @@ -390,7 +390,7 @@ static int qcom_iommu_attach_dev(struct iommu_domain *domain, struct device *dev
> static int qcom_iommu_identity_attach(struct iommu_domain *identity_domain,
> struct device *dev)
> {
> - struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> + struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
Because this is a very common pattern in drivers.
Once that is done we can see what calls to iommu_get_domain_for_dev()
are even left, arguably we should be trying to eliminate this badly
locked thing...
Jason
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-15 5:28 ` Baolu Lu
2025-08-15 18:48 ` Nicolin Chen
@ 2025-08-18 14:40 ` Jason Gunthorpe
2025-08-19 2:09 ` Baolu Lu
1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-18 14:40 UTC (permalink / raw)
To: Baolu Lu
Cc: Nicolin Chen, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Fri, Aug 15, 2025 at 01:28:02PM +0800, Baolu Lu wrote:
> Given that iommu_group->mutex is transparent to the iommu driver, how
> about
It is not actually transparent; a lot of drivers implicitly assume
that the core single-threads their struct device for a lot of the
ops callbacks, and we exposed the lockdep test to the drivers to
help document this.
Jason
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-18 14:39 ` Jason Gunthorpe
@ 2025-08-18 17:22 ` Nicolin Chen
2025-08-18 23:42 ` Jason Gunthorpe
2025-08-21 8:11 ` Tian, Kevin
0 siblings, 2 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-18 17:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Mon, Aug 18, 2025 at 11:39:49AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 11, 2025 at 03:59:10PM -0700, Nicolin Chen wrote:
>
> > Introduce a new iommu_get_domain_for_dev_locked() helper to be used by
> > those drivers in a context that is already under the protection of the
> > group->mutex, e.g. those attach_dev callback functions. And roll out the
> > new helper to all the existing IOMMU drivers.
>
> I still don't much care for this, I think it would be better to
> approach this problem by passing the old domain to the driver callback:
Hmm, that was my first attempt before making this patch until...
> > @@ -390,7 +390,7 @@ static int qcom_iommu_attach_dev(struct iommu_domain *domain, struct device *dev
> > static int qcom_iommu_identity_attach(struct iommu_domain *identity_domain,
> > struct device *dev)
> > {
> > - struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> > + struct iommu_domain *domain = iommu_get_domain_for_dev_locked(dev);
>
> Because this is a very common pattern in drivers.
>
> Once that is done we can see what calls to iommu_get_domain_for_dev()
> are even left,
... I found that in the SMMUv3 driver, iommu_get_domain_for_dev() is
used to get the RID domain for an SVA domain:
arm_smmu_set_pasid()
arm_smmu_blocking_set_dev_pasid()
These two are already given an "old" (SVA) domain pointer, FWIW.
So, we may change to passing in the old domain as you suggested,
yet we still have to fix iommu_get_domain_for_dev() to reflect
the RID domain correctly for any driver that calls it (even
potentially) in a group->mutex-locked context where the RID
domain might not be naturally passed in.
> arguably we should be trying to eliminate this badly
> locked thing...
Any suggestion?
It seems inevitable that such a per-device iommu-specific flag
would have to take the lock..
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-18 14:17 ` Jason Gunthorpe
@ 2025-08-18 17:45 ` Nicolin Chen
2025-08-18 18:09 ` Jason Gunthorpe
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-18 17:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tian, Kevin, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, will@kernel.org,
robin.clark@oss.qualcomm.com, yong.wu@mediatek.com,
matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
rafael@kernel.org, lenb@kernel.org, Liu, Yi L,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mediatek@lists.infradead.org,
linux-tegra@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, patches@lists.linux.dev,
Jaroszynski, Piotr, Sethi, Vikram, helgaas@kernel.org,
etzhao1900@gmail.com
On Mon, Aug 18, 2025 at 11:17:51AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 15, 2025 at 08:24:57AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, August 12, 2025 6:59 AM
> > >
> > > The iommu_deferred_attach() is a runtime asynchronous function called by
> > > iommu-dma function, which could race against other attach functions if it
> > > accesses something in the dev->iommu_group.
> >
> > Is there a real racing scenario being observed or more theoretical?
>
> I think the commit message should explain the actual reason this is
> being done, which AFAICT because the new lockdeps added in following
> patches will fail on this path otherwise.
Hmm, I can mention that. But I think that's just part of the
reason. It still doesn't seem correct to invoke an attach_dev
callback without the lock, since iommu_group_mutex_assert() may
be used in the path?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-18 17:45 ` Nicolin Chen
@ 2025-08-18 18:09 ` Jason Gunthorpe
2025-08-18 18:29 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-18 18:09 UTC (permalink / raw)
To: Nicolin Chen
Cc: Tian, Kevin, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, will@kernel.org,
robin.clark@oss.qualcomm.com, yong.wu@mediatek.com,
matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
rafael@kernel.org, lenb@kernel.org, Liu, Yi L,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mediatek@lists.infradead.org,
linux-tegra@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, patches@lists.linux.dev,
Jaroszynski, Piotr, Sethi, Vikram, helgaas@kernel.org,
etzhao1900@gmail.com
On Mon, Aug 18, 2025 at 10:45:07AM -0700, Nicolin Chen wrote:
> On Mon, Aug 18, 2025 at 11:17:51AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 15, 2025 at 08:24:57AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Tuesday, August 12, 2025 6:59 AM
> > > >
> > > > The iommu_deferred_attach() is a runtime asynchronous function called by
> > > > iommu-dma function, which could race against other attach functions if it
> > > > accesses something in the dev->iommu_group.
> > >
> > > Is there a real racing scenario being observed or more theoretical?
> >
> > I think the commit message should explain the actual reason this is
> > being done, which AFAICT because the new lockdeps added in following
> > patches will fail on this path otherwise.
>
> Hmm, I can mention that. But I think that's just a part of the
> reason. It still doesn't seem correct to invoke an attach_dev
> function without the lock since iommu_group_mutex_assert() may
> be used in the path?
Last time this was brought up, there was a bit of an argument that it
couldn't happen in parallel with anything anyhow, so it doesn't
technically need locking. But I think we should not make such
arguments and should be strict about our locking. It is too hard to
reason about the system's correctness otherwise.
Jason
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach
2025-08-18 18:09 ` Jason Gunthorpe
@ 2025-08-18 18:29 ` Nicolin Chen
0 siblings, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-18 18:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tian, Kevin, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, will@kernel.org,
robin.clark@oss.qualcomm.com, yong.wu@mediatek.com,
matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
rafael@kernel.org, lenb@kernel.org, Liu, Yi L,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mediatek@lists.infradead.org,
linux-tegra@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, patches@lists.linux.dev,
Jaroszynski, Piotr, Sethi, Vikram, helgaas@kernel.org,
etzhao1900@gmail.com
On Mon, Aug 18, 2025 at 03:09:20PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 18, 2025 at 10:45:07AM -0700, Nicolin Chen wrote:
> > On Mon, Aug 18, 2025 at 11:17:51AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 15, 2025 at 08:24:57AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Tuesday, August 12, 2025 6:59 AM
> > > > >
> > > > > The iommu_deferred_attach() is a runtime asynchronous function called by
> > > > > iommu-dma function, which could race against other attach functions if it
> > > > > accesses something in the dev->iommu_group.
> > > >
> > > > Is there a real racing scenario being observed or more theoretical?
> > >
> > > I think the commit message should explain the actual reason this is
> > > being done, which AFAICT because the new lockdeps added in following
> > > patches will fail on this path otherwise.
> >
> > Hmm, I can mention that. But I think that's just a part of the
> > reason. It still doesn't seem correct to invoke an attach_dev
> > function without the lock since iommu_group_mutex_assert() may
> > be used in the path?
>
> Last time this was brought up there was a bit of an argument that it
> couldn't happen in parallel with anything anyhow so it doesn't
> technically need locking. But I think we should not make such
> arguments and be strict about our locking. It is too hard to
> understand the system correctness otherwise.
Yes. I will make the commit message clear that this isn't a
bug fix, but aligns with the locking policy at the attach_dev
callback and (more importantly) fences gdev for the new flag.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-18 17:22 ` Nicolin Chen
@ 2025-08-18 23:42 ` Jason Gunthorpe
2025-08-19 5:09 ` Nicolin Chen
2025-08-21 8:11 ` Tian, Kevin
1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-18 23:42 UTC (permalink / raw)
To: Nicolin Chen
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Mon, Aug 18, 2025 at 10:22:52AM -0700, Nicolin Chen wrote:
> > Because this is a very common pattern in drivers.
> >
> > Once that is done we can see what calls to iommu_get_domain_for_dev()
> > are even left,
>
> ... I found that in SMMUv3 driver, iommu_get_domain_for_dev() is
> used to get the RID domain for an SVA domain:
> arm_smmu_set_pasid()
> arm_smmu_blocking_set_dev_pasid()
>
> These two are already given an "old" (SVA) domain pointer, FWIW.
>
> So, we may change to passing in the old domain as you suggested,
> yet we still have to fix the iommu_get_domain_for_dev() in order
> to reflect the RID domain correctly for the driver that calls it
> (or even potentially) in some group->mutex locked context where
> the RID domain might not be naturally passed in.
It could probably be avoided by keeping track of more information in
the master, but also it is not so bad to use a _locked version here.
> > arguably we should be trying to eliminate this badly
> > locked thing...
>
> Any suggestion?
Bit by bit.. I counted 58 by grep
Changing attach will get rid of alot of them
Then there is stuff like this:
domain = iommu_get_domain_for_dev(emu->card->dev);
if (!domain || domain->type == IOMMU_DOMAIN_IDENTITY)
return;
Which should be more like
if (iommu_get_translation_mode(dev) == IDENTITY)
With sensible internal locking
So that is another bunch. Not sure what will be left after.
Not saying to do all that here, just prefer we move toward that direction.
Jason
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-18 14:40 ` Jason Gunthorpe
@ 2025-08-19 2:09 ` Baolu Lu
0 siblings, 0 replies; 55+ messages in thread
From: Baolu Lu @ 2025-08-19 2:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On 8/18/25 22:40, Jason Gunthorpe wrote:
> On Fri, Aug 15, 2025 at 01:28:02PM +0800, Baolu Lu wrote:
>
>> Given that iommu_group->mutex is transparent to the iommu driver, how
>> about
> It is not actually transparent, alot of drivers are implicitly
> assuming that the core single threads their struct device for alot of
> the ops callbacks and we exposed the lockdep test to the drivers to
> help the document this.
Fair enough. It's the role of iommu_group_mutex_assert().
Thanks,
baolu
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-18 23:42 ` Jason Gunthorpe
@ 2025-08-19 5:09 ` Nicolin Chen
2025-08-19 12:52 ` Jason Gunthorpe
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-19 5:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Mon, Aug 18, 2025 at 08:42:41PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 18, 2025 at 10:22:52AM -0700, Nicolin Chen wrote:
> > > Because this is a very common pattern in drivers.
> > >
> > > Once that is done we can see what calls to iommu_get_domain_for_dev()
> > > are even left,
> >
> > ... I found that in SMMUv3 driver, iommu_get_domain_for_dev() is
> > used to get the RID domain for an SVA domain:
> > arm_smmu_set_pasid()
> > arm_smmu_blocking_set_dev_pasid()
> >
> > These two are already given an "old" (SVA) domain pointer, FWIW.
> >
> > So, we may change to passing in the old domain as you suggested,
> > yet we still have to fix the iommu_get_domain_for_dev() in order
> > to reflect the RID domain correctly for the driver that calls it
> > (or even potentially) in some group->mutex locked context where
> > the RID domain might not be naturally passed in.
>
> It could probably be avoided by keeping track of more information in
> the master, but also it is not so bad to use a _locked version here.
Yes, I've thought about that. The concern is that some other place
someday may want to use iommu_get_domain_for_dev() in similar cases
but would find that it doesn't work. So it would have to duplicate
the domain pointer in its "master" structure.
Overall, having a _locked version feels cleaner to me.
> > > arguably we should be trying to eliminate this badly
> > > locked thing...
> >
> > Any suggestion?
>
> Bit by bit.. I counted 58 by grep
>
> Changing attach will get rid of alot of them
>
> Then there is stuff like this:
>
> domain = iommu_get_domain_for_dev(emu->card->dev);
> if (!domain || domain->type == IOMMU_DOMAIN_IDENTITY)
> return;
>
> Which should be more like
> if (iommu_get_translation_mode(dev) == IDENTITY)
>
> With sensible internal locking
Hmm, I feel this iommu_get_translation_mode() is somewhat the same
as the current iommu_get_domain_for_dev(). It would just return
group->domain->type vs. group->domain, right?
This doesn't have any UAF concern though.
> So that is another bunch. Not sure what will be left after.
I recall that some of the drivers manage their own domains, e.g.
drivers/gpu/drm/tegra/drm.c
So, they would want more out of the domain pointer than just the type.
> Not saying to do all that here, just prefer we move toward that direction.
Yea.. I also think it's a bit difficult to justify the changes in
the non-iommu callers, since they are not affected by any patch in
this series.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-19 5:09 ` Nicolin Chen
@ 2025-08-19 12:52 ` Jason Gunthorpe
2025-08-19 17:22 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-19 12:52 UTC (permalink / raw)
To: Nicolin Chen
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Mon, Aug 18, 2025 at 10:09:11PM -0700, Nicolin Chen wrote:
> Yes, I've thought about that. The concern is that some other place
> someday may want to use iommu_get_domain_for_dev() in similar cases
> but would find that it doesn't work. So it would have to duplicate
> the domain pointer in its "master" structure.
>
> Overall, having a _locked version feels cleaner to me.
We probably need the locked version, but it just shouldn't be called very
much..
> > With sensible internal locking
>
> Hmm, I feel this iommu_get_translation_mode() is somewhat the same
> as the current iommu_get_domain_for_dev(). It would just return the
> group->domain->type vs. group->domain, right?
>
> This doesn't have any UAF concern though.
Yes, no UAF concern is the point
> > So that is another bunch. Not sure what will be left after.
>
> I recall that some of the drivers manage their own domains, e.g.
> drivers/gpu/drm/tegra/drm.c
>
> So, they would want more out of the domain pointer than just type.
This looks like it wants an 'is currently attached' operation
Jason
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-11 22:59 ` [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
@ 2025-08-19 14:12 ` Ethan Zhao
2025-08-19 21:59 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Ethan Zhao @ 2025-08-19 14:12 UTC (permalink / raw)
To: Nicolin Chen, robin.murphy, joro, bhelgaas, jgg
Cc: will, robin.clark, yong.wu, matthias.bgg,
angelogioacchino.delregno, thierry.reding, vdumpa, jonathanh,
rafael, lenb, kevin.tian, yi.l.liu, baolu.lu, linux-arm-kernel,
iommu, linux-kernel, linux-arm-msm, linux-mediatek, linux-tegra,
linux-acpi, linux-pci, patches, pjaroszynski, vsethi, helgaas
On 8/12/2025 6:59 AM, Nicolin Chen wrote:
> PCIe permits a device to ignore ATS invalidation TLPs, while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out: e.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.
>
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
>
> Now iommu_dev_reset_prepare/done() helpers are introduced for this matter.
> Use them in all the existing reset functions, which will attach the device
> to an IOMMU_DOMAIN_BLOCKED during a reset, so as to allow IOMMU driver to:
> - invoke pci_disable_ats() and pci_enable_ats(), if necessary
> - wait for all ATS invalidations to complete
> - stop issuing new ATS invalidations
> - fence any incoming ATS queries
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/pci/pci-acpi.c | 17 +++++++++--
> drivers/pci/pci.c | 68 ++++++++++++++++++++++++++++++++++++++----
> drivers/pci/quirks.c | 23 +++++++++++++-
> 3 files changed, 100 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
> index ddb25960ea47d..adaf46422c05d 100644
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -9,6 +9,7 @@
>
> #include <linux/delay.h>
> #include <linux/init.h>
> +#include <linux/iommu.h>
> #include <linux/irqdomain.h>
> #include <linux/pci.h>
> #include <linux/msi.h>
> @@ -969,6 +970,7 @@ void pci_set_acpi_fwnode(struct pci_dev *dev)
> int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> {
> acpi_handle handle = ACPI_HANDLE(&dev->dev);
> + int ret = 0;
>
> if (!handle || !acpi_has_method(handle, "_RST"))
> return -ENOTTY;
> @@ -976,12 +978,23 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> if (probe)
> return 0;
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> if (ACPI_FAILURE(acpi_evaluate_object(handle, "_RST", NULL, NULL))) {
> pci_warn(dev, "ACPI _RST failed\n");
> - return -ENOTTY;
> + ret = -ENOTTY;
> }
>
> - return 0;
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> }
>
> bool acpi_pci_power_manageable(struct pci_dev *dev)
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b0f4d98036cdd..d6d87e22d81b3 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -13,6 +13,7 @@
> #include <linux/delay.h>
> #include <linux/dmi.h>
> #include <linux/init.h>
> +#include <linux/iommu.h>
> #include <linux/msi.h>
> #include <linux/of.h>
> #include <linux/pci.h>
> @@ -4529,13 +4530,26 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
> */
> int pcie_flr(struct pci_dev *dev)
> {
> + int ret = 0;
> +
> if (!pci_wait_for_pending_transaction(dev))
> pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + * Have to call it after waiting for pending DMA transaction.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
If we don't consider the association between the IOMMU and devices in
FLR(), it can be understood that the more complex processing logic
resides outside this function. However, if the FLR() function already
synchronizes and handles the association with the IOMMU like this
(disabling ATS by attaching the device to the blocking domain), then
how would the following scenarios behave?
1. Reset one of PCIe alias devices.
2. Reset PF when its VFs are active.
....
Thanks,
Ethan
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
>
> if (dev->imm_ready)
> - return 0;
> + goto done;
>
> /*
> * Per PCIe r4.0, sec 6.6.2, a device must complete an FLR within
> @@ -4544,7 +4558,11 @@ int pcie_flr(struct pci_dev *dev)
> */
> msleep(100);
>
> - return pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
> + ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
> +
> +done:
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> }
> EXPORT_SYMBOL_GPL(pcie_flr);
>
> @@ -4572,6 +4590,7 @@ EXPORT_SYMBOL_GPL(pcie_reset_flr);
>
> static int pci_af_flr(struct pci_dev *dev, bool probe)
> {
> + int ret = 0;
> int pos;
> u8 cap;
>
> @@ -4598,10 +4617,21 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
> PCI_AF_STATUS_TP << 8))
> pci_err(dev, "timed out waiting for pending transaction; performing AF function level reset anyway\n");
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + * Have to call it after waiting for pending DMA transaction.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> pci_write_config_byte(dev, pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
>
> if (dev->imm_ready)
> - return 0;
> + goto done;
>
> /*
> * Per Advanced Capabilities for Conventional PCI ECN, 13 April 2006,
> @@ -4611,7 +4641,11 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
> */
> msleep(100);
>
> - return pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
> + ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
> +
> +done:
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> }
>
> /**
> @@ -4632,6 +4666,7 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
> static int pci_pm_reset(struct pci_dev *dev, bool probe)
> {
> u16 csr;
> + int ret;
>
> if (!dev->pm_cap || dev->dev_flags & PCI_DEV_FLAGS_NO_PM_RESET)
> return -ENOTTY;
> @@ -4646,6 +4681,16 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
> if (dev->current_state != PCI_D0)
> return -EINVAL;
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> csr &= ~PCI_PM_CTRL_STATE_MASK;
> csr |= PCI_D3hot;
> pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
> @@ -4656,7 +4701,9 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
> pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, csr);
> pci_dev_d3_sleep(dev);
>
> - return pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
> + ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> }
>
> /**
> @@ -5111,6 +5158,16 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
> if (rc)
> return -ENOTTY;
>
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> + rc = iommu_dev_reset_prepare(&dev->dev);
> + if (rc) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return rc;
> + }
> +
> if (reg & PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR) {
> val = reg;
> } else {
> @@ -5125,6 +5182,7 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
> pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
> reg);
>
> + iommu_dev_reset_done(&dev->dev);
> return rc;
> }
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index d97335a401930..6157c6c02bdb0 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -21,6 +21,7 @@
> #include <linux/pci.h>
> #include <linux/isa-dma.h> /* isa_dma_bridge_buggy */
> #include <linux/init.h>
> +#include <linux/iommu.h>
> #include <linux/delay.h>
> #include <linux/acpi.h>
> #include <linux/dmi.h>
> @@ -4225,6 +4226,26 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
> { 0 }
> };
>
> +static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
> + const struct pci_dev_reset_methods *i)
> +{
> + int ret;
> +
> + /*
> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> + * before initiating a reset. Notify the iommu driver that enabled ATS.
> + */
> + ret = iommu_dev_reset_prepare(&dev->dev);
> + if (ret) {
> + pci_err(dev, "failed to stop IOMMU\n");
> + return ret;
> + }
> +
> + ret = i->reset(dev, probe);
> + iommu_dev_reset_done(&dev->dev);
> + return ret;
> +}
> +
> /*
> * These device-specific reset methods are here rather than in a driver
> * because when a host assigns a device to a guest VM, the host may need
> @@ -4239,7 +4260,7 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
> i->vendor == (u16)PCI_ANY_ID) &&
> (i->device == dev->device ||
> i->device == (u16)PCI_ANY_ID))
> - return i->reset(dev, probe);
> + return __pci_dev_specific_reset(dev, probe, i);
> }
>
> return -ENOTTY;
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-19 12:52 ` Jason Gunthorpe
@ 2025-08-19 17:22 ` Nicolin Chen
2025-08-21 13:13 ` Jason Gunthorpe
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-19 17:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Tue, Aug 19, 2025 at 09:52:49AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 18, 2025 at 10:09:11PM -0700, Nicolin Chen wrote:
> > Yes, I've thought about that. The concern is that some other place
> > someday may want to use iommu_get_domain_for_dev() in similar cases
> > but would find that it doesn't work. So it would have to duplicate
> > the domain pointer in its "master" structure.
> >
> > Overall, having a _locked version feels cleaner to me.
>
> We probably need the locked version, but it just shouldn't be called very
> much..
OK. Let's have one patch upgrading attach_dev to pass in the old
domain pointer (aligning with the SVA version of attach_dev), and
another patch adding the _locked version, which will then have very
limited callers.
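For reference, the _locked variant could be as small as the sketch below. This is a userspace mock with invented `rmock_*` names, not the real kernel code; the kernel version would use lockdep_assert_held(&group->mutex) rather than a pthread mutex, and the pending_reset flag is the one this series introduces. The point is that, while a reset is pending, the hardware is really attached to the blocking domain, so that is what the helper reports:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel structures (names invented). */
struct rmock_domain { int type; };

struct rmock_group {
	pthread_mutex_t mutex;			/* stands in for group->mutex */
	struct rmock_domain *domain;		/* what software attached */
	struct rmock_domain *blocking_domain;	/* used while resetting */
	int pending_reset;			/* set by reset_prepare() */
};

struct rmock_device { struct rmock_group *group; };

/*
 * Caller must already hold group->mutex (the kernel version would
 * assert this with lockdep_assert_held()).  During a reset, report
 * the blocking domain, which is what is actually attached.
 */
static struct rmock_domain *
rmock_get_domain_for_dev_locked(struct rmock_device *dev)
{
	struct rmock_group *group = dev->group;

	if (group->pending_reset)
		return group->blocking_domain;
	return group->domain;
}
```

A driver's attach_dev path, already running under group->mutex, would call this instead of the unlocked lookup.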
> > > With sensible internal locking
> >
> > Hmm, I feel this iommu_get_translation_mode() is somewhat the same
> > as the current iommu_get_domain_for_dev(). It would just return the
> > group->domain->type vs. group->domain, right?
> >
> > This doesn't have any UAF concern though.
>
> Yes, no UAF concern is the point
I see.
> > > So that is another bunch. Not sure what will be left after.
> >
> > I recall that some of the drivers manage their own domains, e.g.
> > drivers/gpu/drm/tegra/drm.c
> >
> > So, they would want more out of the domain pointer than just type.
>
> This looks like it wants an 'is currently attached' operation
That's certainly one of the common use cases. And I think we could
have an IOMMU_DOMAIN_NONE to fit that into the type helper.
Yet, I also see some other cases that cannot be helped with the
type function. Just listing a few:
1) domain matching (and type)
drivers/gpu/drm/tegra/drm.c:965: if (domain && domain->type != IOMMU_DOMAIN_IDENTITY &&
drivers/gpu/drm/tegra/drm.c:966: domain != tegra->domain)
drivers/gpu/drm/tegra/drm.c-967- return 0;
2) page size
drivers/gpu/drm/arm/malidp_planes.c:307: mmu_dom = iommu_get_domain_for_dev(mp->base.dev->dev);
drivers/gpu/drm/arm/malidp_planes.c-308- if (mmu_dom)
drivers/gpu/drm/arm/malidp_planes.c-309- return mmu_dom->pgsize_bitmap;
3) iommu_iova_to_phys
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:2597: dom = iommu_get_domain_for_dev(adev->dev);
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2598-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2599- while (size) {
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2600- phys_addr_t addr = *pos & PAGE_MASK;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2601- loff_t off = *pos & ~PAGE_MASK;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2602- size_t bytes = PAGE_SIZE - off;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2603- unsigned long pfn;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2604- struct page *p;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2605- void *ptr;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2606-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2607- bytes = min(bytes, size);
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2608-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:2609: addr = dom ? iommu_iova_to_phys(dom, addr) : addr;
4) map/unmap
drivers/net/ipa/ipa_mem.c:465: domain = iommu_get_domain_for_dev(dev);
drivers/net/ipa/ipa_mem.c-466- if (!domain) {
drivers/net/ipa/ipa_mem.c-467- dev_err(dev, "no IOMMU domain found for IMEM\n");
drivers/net/ipa/ipa_mem.c-468- return -EINVAL;
drivers/net/ipa/ipa_mem.c-469- }
drivers/net/ipa/ipa_mem.c-470-
drivers/net/ipa/ipa_mem.c-471- /* Align the address down and the size up to page boundaries */
drivers/net/ipa/ipa_mem.c-472- phys = addr & PAGE_MASK;
drivers/net/ipa/ipa_mem.c-473- size = PAGE_ALIGN(size + addr - phys);
drivers/net/ipa/ipa_mem.c-474- iova = phys; /* We just want a direct mapping */
drivers/net/ipa/ipa_mem.c-475-
drivers/net/ipa/ipa_mem.c-476- ret = iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE,
...
drivers/net/ipa/ipa_mem.c:495: domain = iommu_get_domain_for_dev(dev);
drivers/net/ipa/ipa_mem.c-496- if (domain) {
drivers/net/ipa/ipa_mem.c-497- size_t size;
drivers/net/ipa/ipa_mem.c-498-
drivers/net/ipa/ipa_mem.c-499- size = iommu_unmap(domain, ipa->imem_iova, ipa->imem_size);
We could probably invent something for cases 1-3. But for case 4,
it feels like iommu_get_domain_for_dev() is the only helper..
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-19 14:12 ` Ethan Zhao
@ 2025-08-19 21:59 ` Nicolin Chen
2025-08-20 3:18 ` Ethan Zhao
2025-08-21 13:07 ` Jason Gunthorpe
0 siblings, 2 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-19 21:59 UTC (permalink / raw)
To: Ethan Zhao
Cc: robin.murphy, joro, bhelgaas, jgg, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Tue, Aug 19, 2025 at 10:12:41PM +0800, Ethan Zhao wrote:
> On 8/12/2025 6:59 AM, Nicolin Chen wrote:
> > @@ -4529,13 +4530,26 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
> > */
> > int pcie_flr(struct pci_dev *dev)
> > {
> > + int ret = 0;
> > +
> > if (!pci_wait_for_pending_transaction(dev))
> > pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
> > + /*
> > + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
> > + * before initiating a reset. Notify the iommu driver that enabled ATS.
> > + * Have to call it after waiting for pending DMA transaction.
> > + */
> > + ret = iommu_dev_reset_prepare(&dev->dev);
> If we don't consider the association between the IOMMU and devices in
> FLR(), it can be understood that the more complex processing logic
> resides outside this function. However, if the FLR() function already
> synchronizes and handles the association with the IOMMU like this
> (disabling ATS by attaching the device to the blocking domain), then
> how would the following scenarios behave?
That's a good point. The iommu-level reset is per struct device.
So, basically it'll match with the FLR per pci_dev. Yet, the RID
isolation between siblings might be a concern:
> 1. Reset one of PCIe alias devices.
IIRC, an alias device might have:
a) one pci_dev; multiple RIDs
In this case, neither FLR nor IOMMU isolates between RIDs.
So, both FLR and IOMMU blocking will reset all RIDs. There
should be no issue resulting from the IOMMU blocking.
b) multiple pci_devs; single RID
In this case, FLR only resets one device, while the IOMMU-
level reset will block the entire RID (i.e. all devices),
since they share the single translation tunnel. This could
break the siblings, if they aren't also being reset along.
> 2. Reset PF when its VFs are active.
c) multiple pci_devs with their own RIDs
In this case, either FLR or IOMMU only resets the PF. That
being said, VFs might be affected since PF is resetting?
If there is an issue, I don't see it coming from the IOMMU-
level reset..
Thus, case b might be breaking. So, perhaps we should add a few
conditions when calling iommu_dev_reset_prepare/done():
+ Make sure that the pci_dev has ATS capability
+ Make sure no sibling pci_dev(s) sharing the same RID
+ Any others?
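The conditions above could be sketched as a small guard, roughly like this. The fields and names (`cmock_*`) are illustrative only, not the real struct pci_dev; the kernel could derive them from pci_ats_supported() and its existing RID-alias tracking instead:

```c
#include <assert.h>

/* Illustrative stand-in for struct pci_dev (fields invented). */
struct cmock_pci_dev {
	int ats_cap;	/* non-zero if the device exposes an ATS capability */
	int shares_rid;	/* true if sibling pci_devs alias the same RID */
};

/*
 * Only bracket the reset with iommu_dev_reset_prepare()/done() when
 * it is both useful (ATS is possible) and safe (no live sibling
 * shares the RID whose translation we would block).
 */
static int cmock_need_iommu_reset_hooks(const struct cmock_pci_dev *pdev)
{
	if (!pdev->ats_cap)
		return 0;	/* no ATS: nothing for the IOMMU to fence */
	if (pdev->shares_rid)
		return 0;	/* blocking the RID would break live siblings */
	return 1;
}
```

The reset paths in patch 5/5 would then call prepare()/done() only when this guard says so.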
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-19 21:59 ` Nicolin Chen
@ 2025-08-20 3:18 ` Ethan Zhao
2025-08-22 6:14 ` Nicolin Chen
2025-08-21 13:07 ` Jason Gunthorpe
1 sibling, 1 reply; 55+ messages in thread
From: Ethan Zhao @ 2025-08-20 3:18 UTC (permalink / raw)
To: Nicolin Chen
Cc: robin.murphy, joro, bhelgaas, jgg, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On 8/20/2025 5:59 AM, Nicolin Chen wrote:
> On Tue, Aug 19, 2025 at 10:12:41PM +0800, Ethan Zhao wrote:
>> On 8/12/2025 6:59 AM, Nicolin Chen wrote:
>>> @@ -4529,13 +4530,26 @@ EXPORT_SYMBOL(pci_wait_for_pending_transaction);
>>> */
>>> int pcie_flr(struct pci_dev *dev)
>>> {
>>> + int ret = 0;
>>> +
>>> if (!pci_wait_for_pending_transaction(dev))
>>> pci_err(dev, "timed out waiting for pending transaction; performing function level reset anyway\n");
>>> + /*
>>> + * Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software disables ATS
>>> + * before initiating a reset. Notify the iommu driver that enabled ATS.
>>> + * Have to call it after waiting for pending DMA transaction.
>>> + */
>>> + ret = iommu_dev_reset_prepare(&dev->dev);
>
>> If we don't consider the association between the IOMMU and devices in
>> FLR(), it can be understood that the more complex processing logic
>> resides outside this function. However, if the FLR() function already
>> synchronizes and handles the association with the IOMMU like this
>> (disabling ATS by attaching the device to the blocking domain), then
>> how would the following scenarios behave?
>
> That's a good point. The iommu-level reset is per struct device.
> So, basically it'll match with the FLR per pci_dev. Yet, the RID
> isolation between siblings might be a concern:
>
>> 1. Reset one of PCIe alias devices.
>
> IIRC, an alias device might have:
>
> a) one pci_dev; multiple RIDs
>
> In this case, neither FLR nor IOMMU isolates between RIDs.
> So, both FLR and IOMMU blocking will reset all RIDs. There
> should be no issue resulting from the IOMMU blocking.
>
> b) multiple pci_devs; single RID
>
> In this case, FLR only resets one device, while the IOMMU-
> level reset will block the entire RID (i.e. all devices),
> since they share the single translation tunnel. This could
> break the siblings, if they aren't also being reset along.
Yup, such alias devices might not have the ATS cap, because they are
PCI devices or they share the RID (BDF), so checking the ATS cap
might be a useful condition here to skip prepare()/done().
> >> 2. Reset PF when its VFs are active.
>
> c) multiple pci_devs with their own RIDs
>
> In this case, either FLR or IOMMU only resets the PF. That
> being said, VFs might be affected since PF is resetting?
> If there is an issue, I don't see it coming from the IOMMU-
> level reset..
Each of the PF and its VFs has its own RID (BDF), but a VF's lifetime
depends on its PF being alive: resetting the PF means all of its VFs
are lost.
There is no processing logic for a PF and its VFs in FLR() yet; my
understanding is that the upper-layer callers should consider the
complexity of such a case.
By introducing the IOMMU/device connection into FLR(), it seems we
have brought some of that logic from the outside to the inside.
One option might be to not handle the PF either, by explicitly
checking whether any VF configuration exists and skipping
prepare()/done(), until we have much clearer handling logic for it.
Thanks,
Ethan
> Thus, case b might be breaking. So, perhaps we should add a few
> conditions when calling iommu_dev_reset_prepare/done():
> + Make sure that the pci_dev has ATS capability
> + Make sure no sibling pci_dev(s) sharing the same RID
> + Any others?
>
> Thanks
> Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-15 18:45 ` Nicolin Chen
@ 2025-08-21 8:07 ` Tian, Kevin
2025-08-21 14:41 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2025-08-21 8:07 UTC (permalink / raw)
To: Nicolin Chen
Cc: robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
jgg@nvidia.com, will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, August 16, 2025 2:45 AM
>
> On Fri, Aug 15, 2025 at 08:55:28AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, August 12, 2025 6:59 AM
> > > So, iommu_get_domain_for_dev() should check the gdev flag and return
> the
> > > blocked domain if the flag is set. But the caller of this API could hold
> > > the group->mutex already or not, making it difficult to add the lock.
> > >
> > > Introduce a new iommu_get_domain_for_dev_locked() helper to be used
> by
> > > those drivers in a context that is already under the protection of the
> > > group->mutex, e.g. those attach_dev callback functions. And roll out the
> > > new helper to all the existing IOMMU drivers.
> >
> > iommu_get_domain_for_dev() is also called outside of attach_dev
> > callback functions, e.g. malidp_get_pgsize_bitmap(). and the returned
> > info according to the attached domain might be saved in static
> > structures, e.g.:
> >
> > ms->mmu_prefetch_pgsize = malidp_get_pgsize_bitmap(mp);
> >
> > would that cause weird issues when blocking domain is returned
> > though one may not expect reset to happen at that point?
>
> We aren't changing the iommu_get_domain_for_dev(). So, it should
> be used exclusively for functions that are outside group->mutex,
> like this one, i.e. they should still get the group->domain v.s.
> the blocked domain.
>
Usually the difference between func() and func_locked() is only about
whether the caller holds a lock. If they mean to return different
domains, we may need a better name (though I don't have a good option
now)...
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-18 17:22 ` Nicolin Chen
2025-08-18 23:42 ` Jason Gunthorpe
@ 2025-08-21 8:11 ` Tian, Kevin
2025-08-21 13:14 ` Jason Gunthorpe
1 sibling, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2025-08-21 8:11 UTC (permalink / raw)
To: Nicolin Chen, Jason Gunthorpe
Cc: robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, August 19, 2025 1:23 AM
>
> ... I found that in SMMUv3 driver, iommu_get_domain_for_dev() is
> used to get the RID domain for an SVA domain:
> arm_smmu_set_pasid()
> arm_smmu_blocking_set_dev_pasid()
>
> These two are already given an "old" (SVA) domain pointer, FWIW.
>
> So, we may change to passing in the old domain as you suggested,
> yet we still have to fix the iommu_get_domain_for_dev() in order
> to reflect the RID domain correctly for the driver that calls it
> (or even potentially) in some group->mutex locked context where
> the RID domain might not be naturally passed in.
>
Out of curiosity.
arm_smmu_blocking_set_dev_pasid()
/*
* When the last user of the CD table goes away downgrade the STE back
* to a non-cd_table one.
*/
if (!arm_smmu_ssids_in_use(&master->cd_table)) {
struct iommu_domain *sid_domain =
iommu_get_domain_for_dev(master->dev);
if (sid_domain->type == IOMMU_DOMAIN_IDENTITY ||
sid_domain->type == IOMMU_DOMAIN_BLOCKED)
sid_domain->ops->attach_dev(sid_domain, dev);
}
Why can't the downgrade apply to the case where the RID is attached to
a DMA domain?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-19 21:59 ` Nicolin Chen
2025-08-20 3:18 ` Ethan Zhao
@ 2025-08-21 13:07 ` Jason Gunthorpe
2025-08-22 6:35 ` Nicolin Chen
1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-21 13:07 UTC (permalink / raw)
To: Nicolin Chen
Cc: Ethan Zhao, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Tue, Aug 19, 2025 at 02:59:07PM -0700, Nicolin Chen wrote:
> c) multiple pci_devs with their own RIDs
>
> In this case, either FLR or IOMMU only resets the PF. That
> being said, VFs might be affected since PF is resetting?
> If there is an issue, I don't see it coming from the IOMMU-
> level reset..
It would still allow the ATS issue from the VF side. The VF could be
pushing an invalidation during the PF reset that will get clobbered.
I haven't fully checked, but I think Linux doesn't really (easily?)
allow resetting a PF while a VF is present...
Arguably if the PF is reset the VFs should have their translations
blocked too.
Jason
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-19 17:22 ` Nicolin Chen
@ 2025-08-21 13:13 ` Jason Gunthorpe
2025-08-21 15:22 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-21 13:13 UTC (permalink / raw)
To: Nicolin Chen
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Tue, Aug 19, 2025 at 10:22:20AM -0700, Nicolin Chen wrote:
> Yet, I also see some other cases that cannot be helped with the
> type function. Just listing a few:
Probably several query functions are needed that can be lock-safe
> 1) domain matching (and type)
> drivers/gpu/drm/tegra/drm.c:965: if (domain && domain->type != IOMMU_DOMAIN_IDENTITY &&
> drivers/gpu/drm/tegra/drm.c:966: domain != tegra->domain)
> drivers/gpu/drm/tegra/drm.c-967- return 0;
is attached
> 2) page size
> drivers/gpu/drm/arm/malidp_planes.c:307: mmu_dom = iommu_get_domain_for_dev(mp->base.dev->dev);
> drivers/gpu/drm/arm/malidp_planes.c-308- if (mmu_dom)
> drivers/gpu/drm/arm/malidp_planes.c-309- return mmu_dom->pgsize_bitmap;
return page size bitmap
> 3) iommu_iova_to_phys
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:2597: dom = iommu_get_domain_for_dev(adev->dev);
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2598-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2599- while (size) {
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2600- phys_addr_t addr = *pos & PAGE_MASK;
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2601- loff_t off = *pos & ~PAGE_MASK;
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2602- size_t bytes = PAGE_SIZE - off;
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2603- unsigned long pfn;
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2604- struct page *p;
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2605- void *ptr;
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2606-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2607- bytes = min(bytes, size);
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c-2608-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:2609: addr = dom ? iommu_iova_to_phys(dom, addr) : addr;
safe iova to phys directly from a struct device
> 4) map/unmap
> drivers/net/ipa/ipa_mem.c:465: domain = iommu_get_domain_for_dev(dev);
> drivers/net/ipa/ipa_mem.c-466- if (!domain) {
> drivers/net/ipa/ipa_mem.c-467- dev_err(dev, "no IOMMU domain found for IMEM\n");
> drivers/net/ipa/ipa_mem.c-468- return -EINVAL;
> drivers/net/ipa/ipa_mem.c-469- }
> drivers/net/ipa/ipa_mem.c-470-
> drivers/net/ipa/ipa_mem.c-471- /* Align the address down and the size up to page boundaries */
> drivers/net/ipa/ipa_mem.c-472- phys = addr & PAGE_MASK;
> drivers/net/ipa/ipa_mem.c-473- size = PAGE_ALIGN(size + addr - phys);
> drivers/net/ipa/ipa_mem.c-474- iova = phys; /* We just want a direct mapping */
> drivers/net/ipa/ipa_mem.c-475-
> drivers/net/ipa/ipa_mem.c-476- ret = iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE,
> ...
> drivers/net/ipa/ipa_mem.c:495: domain = iommu_get_domain_for_dev(dev);
> drivers/net/ipa/ipa_mem.c-496- if (domain) {
> drivers/net/ipa/ipa_mem.c-497- size_t size;
> drivers/net/ipa/ipa_mem.c-498-
> drivers/net/ipa/ipa_mem.c-499- size = iommu_unmap(domain, ipa->imem_iova, ipa->imem_size);
Broken! Illegal to call iommu_map on a DMA API domain.
This is exactly the sort of abuse I would like to see made impossible :(
If it really needs something like this then it needs a proper dma api
interface to do it and properly reserve the iova from the allocator.
Jason
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-21 8:11 ` Tian, Kevin
@ 2025-08-21 13:14 ` Jason Gunthorpe
2025-08-22 9:45 ` Tian, Kevin
0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-21 13:14 UTC (permalink / raw)
To: Tian, Kevin
Cc: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, will@kernel.org,
robin.clark@oss.qualcomm.com, yong.wu@mediatek.com,
matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
rafael@kernel.org, lenb@kernel.org, Liu, Yi L,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mediatek@lists.infradead.org,
linux-tegra@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, patches@lists.linux.dev,
Jaroszynski, Piotr, Sethi, Vikram, helgaas@kernel.org,
etzhao1900@gmail.com
On Thu, Aug 21, 2025 at 08:11:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, August 19, 2025 1:23 AM
> >
> > ... I found that in SMMUv3 driver, iommu_get_domain_for_dev() is
> > used to get the RID domain for an SVA domain:
> > arm_smmu_set_pasid()
> > arm_smmu_blocking_set_dev_pasid()
> >
> > These two are already given an "old" (SVA) domain pointer, FWIW.
> >
> > So, we may change to passing in the old domain as you suggested,
> > yet we still have to fix iommu_get_domain_for_dev() in order
> > to reflect the RID domain correctly for the driver that calls it,
> > even potentially in some group->mutex-locked context where
> > the RID domain might not be naturally passed in.
> >
>
> Out of curiosity.
>
> arm_smmu_blocking_set_dev_pasid()
>
> /*
> * When the last user of the CD table goes away downgrade the STE back
> * to a non-cd_table one.
> */
> if (!arm_smmu_ssids_in_use(&master->cd_table)) {
> struct iommu_domain *sid_domain =
> iommu_get_domain_for_dev(master->dev);
>
> if (sid_domain->type == IOMMU_DOMAIN_IDENTITY ||
> sid_domain->type == IOMMU_DOMAIN_BLOCKED)
> sid_domain->ops->attach_dev(sid_domain, dev);
> }
>
> why cannot downgrade apply to the case where the RID is attached to
> a DMA domain?
If the RID is a PAGING domain then it must be an S1 paging domain and
there is no downgrade possible.
It is impossible for the RID to be an S2 paging domain while SSIDs are
in use.
Jason
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-21 8:07 ` Tian, Kevin
@ 2025-08-21 14:41 ` Nicolin Chen
2025-08-31 23:24 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-21 14:41 UTC (permalink / raw)
To: Tian, Kevin
Cc: robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
jgg@nvidia.com, will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
On Thu, Aug 21, 2025 at 08:07:01AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, August 16, 2025 2:45 AM
> >
> > On Fri, Aug 15, 2025 at 08:55:28AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Tuesday, August 12, 2025 6:59 AM
> > > > So, iommu_get_domain_for_dev() should check the gdev flag and return the
> > > > blocked domain if the flag is set. But the caller of this API could hold
> > > > the group->mutex already or not, making it difficult to add the lock.
> > > >
> > > > Introduce a new iommu_get_domain_for_dev_locked() helper to be used by
> > > > those drivers in a context that is already under the protection of the
> > > > group->mutex, e.g. those attach_dev callback functions. And roll out the
> > > > new helper to all the existing IOMMU drivers.
> > >
> > > iommu_get_domain_for_dev() is also called outside of attach_dev
> > > callback functions, e.g. malidp_get_pgsize_bitmap(). and the returned
> > > info according to the attached domain might be saved in static
> > > structures, e.g.:
> > >
> > > ms->mmu_prefetch_pgsize = malidp_get_pgsize_bitmap(mp);
> > >
> > > would that cause weird issues when blocking domain is returned
> > > though one may not expect reset to happen at that point?
> >
> > We aren't changing the iommu_get_domain_for_dev(). So, it should
> > be used exclusively for functions that are outside group->mutex,
> > like this one, i.e. they should still get the group->domain vs.
> > the blocked domain.
> >
>
> Usually the difference between func() and func_locked() is only about
> whether the caller holds a lock. If they mean to return different
> domains, we may need better naming (though I don't have a good option
> now)...
Yea, naming is tricky here :)
Perhaps it could follow a different pattern:
iommu_dev_get_domain_for_driver(struct device *dev);
With a simple note highlighting that it is to be used by IOMMU drivers
in the iommu callback functions like attach_dev.
Nicolin
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-21 13:13 ` Jason Gunthorpe
@ 2025-08-21 15:22 ` Nicolin Chen
2025-08-21 18:37 ` Jason Gunthorpe
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-21 15:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Thu, Aug 21, 2025 at 10:13:04AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 19, 2025 at 10:22:20AM -0700, Nicolin Chen wrote:
>
> > Yet, I also see some other cases that cannot be helped with the
> > type function. Just listing a few:
>
> Probably several query functions are needed that can be lock safe
>
> > 1) domain matching (and type)
> > drivers/gpu/drm/tegra/drm.c:965: if (domain && domain->type != IOMMU_DOMAIN_IDENTITY &&
> > drivers/gpu/drm/tegra/drm.c:966: domain != tegra->domain)
> > drivers/gpu/drm/tegra/drm.c-967- return 0;
>
> is attached
I should have pasted the full piece:
drivers/gpu/drm/tegra/drm.c-960- /*
drivers/gpu/drm/tegra/drm.c:961: * If the host1x client is already attached to an IOMMU domain that is
drivers/gpu/drm/tegra/drm.c-962- * not the shared IOMMU domain, don't try to attach it to a different
drivers/gpu/drm/tegra/drm.c-963- * domain. This allows using the IOMMU-backed DMA API.
drivers/gpu/drm/tegra/drm.c-964- */
drivers/gpu/drm/tegra/drm.c-965- if (domain && domain->type != IOMMU_DOMAIN_IDENTITY &&
drivers/gpu/drm/tegra/drm.c-966- domain != tegra->domain)
So, the check is two-fold:
1) is attached
2) is the shared IOMMU domain (tegra->domain?)
> > 4) map/unmap
> > drivers/net/ipa/ipa_mem.c:465: domain = iommu_get_domain_for_dev(dev);
> > drivers/net/ipa/ipa_mem.c-466- if (!domain) {
> > drivers/net/ipa/ipa_mem.c-467- dev_err(dev, "no IOMMU domain found for IMEM\n");
> > drivers/net/ipa/ipa_mem.c-468- return -EINVAL;
> > drivers/net/ipa/ipa_mem.c-469- }
> > drivers/net/ipa/ipa_mem.c-470-
> > drivers/net/ipa/ipa_mem.c-471- /* Align the address down and the size up to page boundaries */
> > drivers/net/ipa/ipa_mem.c-472- phys = addr & PAGE_MASK;
> > drivers/net/ipa/ipa_mem.c-473- size = PAGE_ALIGN(size + addr - phys);
> > drivers/net/ipa/ipa_mem.c-474- iova = phys; /* We just want a direct mapping */
> > drivers/net/ipa/ipa_mem.c-475-
> > drivers/net/ipa/ipa_mem.c-476- ret = iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE,
> > ...
> > drivers/net/ipa/ipa_mem.c:495: domain = iommu_get_domain_for_dev(dev);
> > drivers/net/ipa/ipa_mem.c-496- if (domain) {
> > drivers/net/ipa/ipa_mem.c-497- size_t size;
> > drivers/net/ipa/ipa_mem.c-498-
> > drivers/net/ipa/ipa_mem.c-499- size = iommu_unmap(domain, ipa->imem_iova, ipa->imem_size);
>
> Broken! Illegal to call iommu_map on a DMA API domain.
>
> This is exactly the sort of abuse I would like to see made impossible :(
>
> If it really needs something like this then it needs a proper dma api
> interface to do it and properly reserve the iova from the allocator.
Yea. This particular case is forcing a direct mapping for a small
piece of memory. So it should probably be described in the Device
Tree vs. the of_match_table data in the driver, so that the OF core
would allocate an IOMMU_RESV_DIRECT.
Overall, I feel this would be a big project yet arguably for a low
reward..
Nicolin
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-21 15:22 ` Nicolin Chen
@ 2025-08-21 18:37 ` Jason Gunthorpe
2025-08-22 5:11 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-21 18:37 UTC (permalink / raw)
To: Nicolin Chen
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Thu, Aug 21, 2025 at 08:22:06AM -0700, Nicolin Chen wrote:
> I should have pasted the full piece:
> drivers/gpu/drm/tegra/drm.c-960- /*
> drivers/gpu/drm/tegra/drm.c:961: * If the host1x client is already attached to an IOMMU domain that is
> drivers/gpu/drm/tegra/drm.c-962- * not the shared IOMMU domain, don't try to attach it to a different
> drivers/gpu/drm/tegra/drm.c-963- * domain. This allows using the IOMMU-backed DMA API.
> drivers/gpu/drm/tegra/drm.c-964- */
> drivers/gpu/drm/tegra/drm.c-965- if (domain && domain->type != IOMMU_DOMAIN_IDENTITY &&
> drivers/gpu/drm/tegra/drm.c-966- domain != tegra->domain)
>
> So, the check is two-fold:
> 1) is attached
> 2) is the shared IOMMU domain (tegra->domain?)
Yea
iommu_is_domain_currently_attached(dev, tegra->domain)
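Roughly, as a userspace model of the idea (the helper name comes from the suggestion above; the stand-in types and everything else here are assumptions, not existing kernel API, and the real version would take group->mutex around the comparison):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types modeling the kernel structures for illustration only. */
struct iommu_domain { int type; };
struct iommu_group { struct iommu_domain *domain; };
struct device { struct iommu_group *iommu_group; };

/*
 * Hypothetical helper: report whether @domain is the domain currently
 * attached to @dev's group. The real kernel version would hold
 * group->mutex around the comparison; locking is elided in this model.
 */
static bool iommu_is_domain_currently_attached(struct device *dev,
					       struct iommu_domain *domain)
{
	struct iommu_group *group = dev->iommu_group;

	if (!group)
		return false;
	return group->domain == domain;
}
```

A caller like tegra/drm.c could then ask the one question it actually cares about, instead of fetching a raw domain pointer and comparing it outside any locking.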
> Overall, I feel this would be a big project yet arguably for a low
> reward..
Indeed, we can drop a FIXME comment there and leave it as the last
user or something perhaps
Jason
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-21 18:37 ` Jason Gunthorpe
@ 2025-08-22 5:11 ` Nicolin Chen
0 siblings, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-22 5:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: robin.murphy, joro, bhelgaas, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas, etzhao1900
On Thu, Aug 21, 2025 at 03:37:04PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 21, 2025 at 08:22:06AM -0700, Nicolin Chen wrote:
> > I should have pasted the full piece:
> > drivers/gpu/drm/tegra/drm.c-960- /*
> > drivers/gpu/drm/tegra/drm.c:961: * If the host1x client is already attached to an IOMMU domain that is
> > drivers/gpu/drm/tegra/drm.c-962- * not the shared IOMMU domain, don't try to attach it to a different
> > drivers/gpu/drm/tegra/drm.c-963- * domain. This allows using the IOMMU-backed DMA API.
> > drivers/gpu/drm/tegra/drm.c-964- */
> > drivers/gpu/drm/tegra/drm.c-965- if (domain && domain->type != IOMMU_DOMAIN_IDENTITY &&
> > drivers/gpu/drm/tegra/drm.c-966- domain != tegra->domain)
> >
> > So, the check is two-fold:
> > 1) is attached
> > 2) is the shared IOMMU domain (tegra->domain?)
>
> Yea
>
> iommu_is_domain_currently_attached(dev, tegra->domain)
Ah, yea.
> > Overall, I feel this would be a big project yet arguably for a low
> > reward..
>
> Indeed, we can drop a FIXME comment there and leave it as the last
> user or something perhaps
I see. We could keep it in the library but discourage new callers.
I will start with the internal cleanup as we discussed, as a part
of this PCI reset series. Then, I will try adding those new helper
functions that we've listed here, as a separate series.
Thanks
Nicolin
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-20 3:18 ` Ethan Zhao
@ 2025-08-22 6:14 ` Nicolin Chen
0 siblings, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-22 6:14 UTC (permalink / raw)
To: Ethan Zhao
Cc: robin.murphy, joro, bhelgaas, jgg, will, robin.clark, yong.wu,
matthias.bgg, angelogioacchino.delregno, thierry.reding, vdumpa,
jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Wed, Aug 20, 2025 at 11:18:52AM +0800, Ethan Zhao wrote:
> On 8/20/2025 5:59 AM, Nicolin Chen wrote:
> > b) multiple pci_devs; single RID
> >
> > In this case, FLR only resets one device, while the IOMMU-
> > level reset will block the entire RID (i.e. all devices),
> > since they share the single translation tunnel. This could
> > break the siblings, if they aren't also being reset along.
> Yup, such alias devices might not have the ATS cap, because they
> are PCI devices or they share the RID (BDF), so checking the ATS cap
> condition might be useful here to skip prepare()/done().
Yea, I agree, yet I think we need it to be "sure" rather than "might"?
So perhaps we should check alias too. Given that all alias devices
in this case share the same RID and reside in the same iommu_group,
we could iterate the group devices for pci_devs_are_dma_aliases().
> > > 2. Reset PF when its VFs are active.
> >
> > c) multiple pci_devs with their own RIDs
> >
> > In this case, either FLR or IOMMU only resets the PF. That
> > being said, VFs might be affected since PF is resetting?
> > If there is an issue, I don't see it coming from the IOMMU-
> > level reset..
> Each of the PF and its VFs has its own RID (BDF), but the VFs'
> lifetime depends on the PF being alive; resetting the PF means all
> its VFs are lost.
>
> There is no processing logic about the PF and its VFs in FLR() yet.
> My understanding is that the upper-layer callers should handle the
> complexity of such a case.
>
> While introducing the connection between the IOMMU and the device
> in FLR(), it seems we have brought some of that logic from the
> outside to the inside.
>
> One option might be to not handle the PF either, by explicitly
> checking whether its VF configuration exists and skipping
> prepare()/done(), until we have much clearer handling logic for it.
That sounds like a good one to start with.
The prepare()/done() functions can internally bypass such devices:
if (!pci_ats_supported(pci_dev) || pci_sriov_get_totalvfs(pci_dev))
return 0;
/* And check alias too */
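As a userspace model of that combined bypass (pci_ats_supported(), pci_sriov_get_totalvfs() and pci_devs_are_dma_aliases() are real kernel helpers, but the struct, the predicate and the way the three conditions are folded together are all assumptions sketched here just to show the shape of the check):

```c
#include <assert.h>
#include <stdbool.h>

/* Stubbed model of a pci_dev carrying only the bits this check reads. */
struct pci_dev {
	bool ats_supported;	/* pci_ats_supported() in the kernel */
	int total_vfs;		/* pci_sriov_get_totalvfs() */
	bool has_dma_alias;	/* group scan + pci_devs_are_dma_aliases() */
};

/*
 * Sketch of the internal bypass in iommu_dev_reset_prepare()/_done():
 * skip devices with no ATS (nothing to fence), SRIOV PFs (blocking the
 * RID would disturb live VFs), and DMA-alias siblings (the RID is
 * shared, so blocking it would break devices that are not being reset).
 */
static bool iommu_dev_reset_should_bypass(struct pci_dev *pdev)
{
	return !pdev->ats_supported ||
	       pdev->total_vfs > 0 ||
	       pdev->has_dma_alias;
}
```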
Thanks
Nicolin
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-21 13:07 ` Jason Gunthorpe
@ 2025-08-22 6:35 ` Nicolin Chen
2025-08-22 14:08 ` Jason Gunthorpe
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-22 6:35 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ethan Zhao, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Thu, Aug 21, 2025 at 10:07:41AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 19, 2025 at 02:59:07PM -0700, Nicolin Chen wrote:
> > c) multiple pci_devs with their own RIDs
> >
> > In this case, either FLR or IOMMU only resets the PF. That
> > being said, VFs might be affected since PF is resetting?
> > If there is an issue, I don't see it coming from the IOMMU-
> > level reset..
>
> It would still allow the ATS issue from the VF side. The VF could be
> pushing an invalidation during the PF reset that will get clobbered.
>
> I haven't fully checked but I think Linux doesn't really (easily?)
> allow resetting a PF while a VF is present...
Hmm, what if the PF encountered some fault? Does Linux have a choice
not to reset PF?
> Arguably if the PF is reset the VFs should have their translations
> blocked too.
Yea, that sounds plausible to me. But, prior to that (an IOMMU-level
reset), should VFs be first reset at the PCI level?
Thanks
Nicolin
* RE: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-21 13:14 ` Jason Gunthorpe
@ 2025-08-22 9:45 ` Tian, Kevin
2025-08-22 13:21 ` Jason Gunthorpe
0 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2025-08-22 9:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, will@kernel.org,
robin.clark@oss.qualcomm.com, yong.wu@mediatek.com,
matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
rafael@kernel.org, lenb@kernel.org, Liu, Yi L,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mediatek@lists.infradead.org,
linux-tegra@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, patches@lists.linux.dev,
Jaroszynski, Piotr, Sethi, Vikram, helgaas@kernel.org,
etzhao1900@gmail.com
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, August 21, 2025 9:14 PM
>
> On Thu, Aug 21, 2025 at 08:11:05AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Tuesday, August 19, 2025 1:23 AM
> > >
> > > ... I found that in SMMUv3 driver, iommu_get_domain_for_dev() is
> > > used to get the RID domain for an SVA domain:
> > > arm_smmu_set_pasid()
> > > arm_smmu_blocking_set_dev_pasid()
> > >
> > > These two are already given an "old" (SVA) domain pointer, FWIW.
> > >
> > > So, we may change to passing in the old domain as you suggested,
> > > yet we still have to fix iommu_get_domain_for_dev() in order
> > > to reflect the RID domain correctly for the driver that calls it,
> > > even potentially in some group->mutex-locked context where
> > > the RID domain might not be naturally passed in.
> > >
> >
> > Out of curiosity.
> >
> > arm_smmu_blocking_set_dev_pasid()
> >
> > /*
> > * When the last user of the CD table goes away downgrade the STE back
> > * to a non-cd_table one.
> > */
> > if (!arm_smmu_ssids_in_use(&master->cd_table)) {
> > struct iommu_domain *sid_domain =
> > iommu_get_domain_for_dev(master->dev);
> >
> > if (sid_domain->type == IOMMU_DOMAIN_IDENTITY ||
> > sid_domain->type == IOMMU_DOMAIN_BLOCKED)
> > sid_domain->ops->attach_dev(sid_domain, dev);
> > }
> >
> > why cannot downgrade apply to the case where the RID is attached to
> > a DMA domain?
>
> If the RID is a PAGING domain then it must be an S1 paging domain and
> there is no downgrade possible.
>
> It is impossible for the RID to be an S2 paging domain while SSIDs are
> in use.
>
Thanks for the background. btw is it technically impossible or just
not worth the extra software complexity for no value? e.g. if
maintaining two page tables (S1/S2 formats) with exact same mappings,
does SMMU allow smooth transition between two modes w/o breaking
in-flight DMAs? but probably keeping two page tables in-sync in transition
is already a problem w/o proper locking in map/unmap...
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-22 9:45 ` Tian, Kevin
@ 2025-08-22 13:21 ` Jason Gunthorpe
0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-22 13:21 UTC (permalink / raw)
To: Tian, Kevin
Cc: Nicolin Chen, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, will@kernel.org,
robin.clark@oss.qualcomm.com, yong.wu@mediatek.com,
matthias.bgg@gmail.com, angelogioacchino.delregno@collabora.com,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
rafael@kernel.org, lenb@kernel.org, Liu, Yi L,
baolu.lu@linux.intel.com, linux-arm-kernel@lists.infradead.org,
iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-arm-msm@vger.kernel.org, linux-mediatek@lists.infradead.org,
linux-tegra@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, patches@lists.linux.dev,
Jaroszynski, Piotr, Sethi, Vikram, helgaas@kernel.org,
etzhao1900@gmail.com
On Fri, Aug 22, 2025 at 09:45:28AM +0000, Tian, Kevin wrote:
> Thanks for the background. btw is it technically impossible or just
> not worth the extra software complexity?
Unsupported by HW. It cannot mix S2 and S1 formats on
SSIDs. If SSID is in use then all must be S1.
> e.g. if maintaining two page tables (S1/S2 formats) with exact same
> mappings, does SMMU allow smooth transition between two modes w/o
> breaking in-flight DMAs?
It could and there is SW support for that..
> but probably keeping two page tables in-sync in transition is
> already a problem w/o proper locking in map/unmap...
Yes, plus doubling the memory. It is not worthwhile; there is really
very little advantage to the S2 format.
Jason
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-22 6:35 ` Nicolin Chen
@ 2025-08-22 14:08 ` Jason Gunthorpe
2025-08-22 18:50 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-22 14:08 UTC (permalink / raw)
To: Nicolin Chen
Cc: Ethan Zhao, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Thu, Aug 21, 2025 at 11:35:27PM -0700, Nicolin Chen wrote:
> On Thu, Aug 21, 2025 at 10:07:41AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 19, 2025 at 02:59:07PM -0700, Nicolin Chen wrote:
> > > c) multiple pci_devs with their own RIDs
> > >
> > > In this case, either FLR or IOMMU only resets the PF. That
> > > being said, VFs might be affected since PF is resetting?
> > > If there is an issue, I don't see it coming from the IOMMU-
> > > level reset..
> >
> > It would still allow the ATS issue from the VF side. The VF could be
> > pushing an invalidation during the PF reset that will get clobbered.
> >
> > I haven't fully checked but I think Linux doesn't really (easily?)
> > allow resetting a PF while a VF is present...
>
> Hmm, what if the PF encountered some fault? Does Linux have a choice
> not to reset PF?
Upon more reflection, I guess outside VFIO I seem to remember the SRIOV
reset to the PF will clobber the VFs too and then restore the SRIOV
configuration in config space to bring them back.
> > Arguably if the PF is reset the VFs should have their translations
> > blocked too.
>
> Yea, that sounds plausible to me. But, prior to that (an IOMMU-level
> reset), should VFs be first reset at the PCI level?
PF reset of a SRIOV PF disables the VFs and effectively resets them
already.
But reaching out to mangle the translation of the VFs means you do
have to take care not to disrupt anything else the VF owning driver is
doing since it is fully unaware of this. Ie it could be reattaching to
something else concurrently.
Jason
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-22 14:08 ` Jason Gunthorpe
@ 2025-08-22 18:50 ` Nicolin Chen
2025-08-28 12:51 ` Jason Gunthorpe
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-22 18:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ethan Zhao, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Fri, Aug 22, 2025 at 11:08:21AM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 21, 2025 at 11:35:27PM -0700, Nicolin Chen wrote:
> > On Thu, Aug 21, 2025 at 10:07:41AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 19, 2025 at 02:59:07PM -0700, Nicolin Chen wrote:
> > > > c) multiple pci_devs with their own RIDs
> > > >
> > > > In this case, either FLR or IOMMU only resets the PF. That
> > > > being said, VFs might be affected since PF is resetting?
> > > > If there is an issue, I don't see it coming from the IOMMU-
> > > > level reset..
> > >
> > > It would still allow the ATS issue from the VF side. The VF could be
> > > pushing an invalidation during the PF reset that will get clobbered.
> > >
> > > I haven't fully checked but I think Linux doesn't really (easily?)
> > > allow resetting a PF while a VF is present...
> >
> > Hmm, what if the PF encountered some fault? Does Linux have a choice
> > not to reset PF?
>
> Upon more reflection, I guess outside VFIO I seem to remember the SRIOV
> reset to the PF will clobber the VFs too and then restore the SRIOV
> configuration in config space to bring them back.
Yea, I see pci_restore_iov_state() called in pci_restore_state().
> > > Arguably if the PF is reset the VFs should have their translations
> > > blocked too.
> >
> > Yea, that sounds plausible to me. But, prior to that (an IOMMU-level
> > reset), should VFs be first reset at the PCI level?
>
> PF reset of a SRIOV PF disables the VFs and effectively resets them
> already.
Yea, I was expecting something like a SW reset routine, for each VF,
which would invoke iommu_dev_reset_prepare/_done() individually.
Without that, iommu_dev_reset_prepare/_done() has to iterate all VFs
internally and block them.
> But reaching out to mangle the translation of the VFs means you do
> have to take care not to disrupt anything else the VF owning driver is
> doing since it is fully unaware of this. Ie it could be reattaching to
> something else concurrently.
Hmm, and this is tricky now..
The current version allows deferring the concurrent attach during a
reset. But, as Kevin pointed out, we may have no choice but to fail
any concurrent attach with -EBUSY, because a deferred attach might
fail due to incompatibility, triggering a WARN_ON only in done().
This isn't likely a problem for PF, as we can expect its driver not
to do an insane concurrent attach during a reset. But it would be a
very sane case for a VF. So if its driver doesn't retry or defer an
EBUSY-ed attach properly, it would not be restored successfully..
It feels like we need a no-fail re-attach operation, or at least an
unlikely-to-fail one. I recall years ago we tried a can_attach op
to test the compatibility but it didn't get merged. Maybe we'd need
it so that a concurrent attach can test compatibility, allowing the
re-attach in iommu_dev_reset_done() to more likely succeed.
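A userspace model of that dry-run idea (the can_attach check, the page-size compatibility criterion, and the error-handling shape are all assumptions for illustration; nothing like this exists in the iommu ops today):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct device { unsigned long supported_pgsizes; };

/* Model of a domain whose attach can fail on incompatibility. */
struct iommu_domain {
	unsigned long pgsize_bitmap;
	struct device *attached;
};

/* Hypothetical op: answer "would attach_dev() succeed?" without side
 * effects, so compatibility can be tested before committing anything. */
static bool can_attach(struct iommu_domain *d, struct device *dev)
{
	return (d->pgsize_bitmap & dev->supported_pgsizes) != 0;
}

/*
 * Core-side sketch: a concurrent attach during reset dry-runs the
 * compatibility check and fails early, so the deferred re-attach in
 * iommu_dev_reset_done() only ever sees requests already known to be
 * compatible and is unlikely to hit the WARN_ON path.
 */
static int iommu_attach_during_reset(struct iommu_domain *d,
				     struct device *dev)
{
	if (!can_attach(d, dev))
		return -EBUSY;
	d->attached = dev;	/* deferred: committed in reset_done() */
	return 0;
}
```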
Thanks
Nicolin
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-22 18:50 ` Nicolin Chen
@ 2025-08-28 12:51 ` Jason Gunthorpe
2025-08-28 15:08 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-28 12:51 UTC (permalink / raw)
To: Nicolin Chen
Cc: Ethan Zhao, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Fri, Aug 22, 2025 at 11:50:58AM -0700, Nicolin Chen wrote:
> It feels like we need a no-fail re-attach operation, or at least an
> unlikely-to-fail one. I recall years ago we tried a can_attach op
> to test the compatibility but it didn't get merged. Maybe we'd need
> it so that a concurrent attach can test compatibility, allowing the
> re-attach in iommu_dev_reset_done() to more likely succeed.
This is probably the cleanest option to split these things
Jason
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-28 12:51 ` Jason Gunthorpe
@ 2025-08-28 15:08 ` Nicolin Chen
2025-08-28 18:46 ` Jason Gunthorpe
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-08-28 15:08 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ethan Zhao, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Thu, Aug 28, 2025 at 09:51:49AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 22, 2025 at 11:50:58AM -0700, Nicolin Chen wrote:
>
> > It feels like we need a no-fail re-attach operation, or at least an
> > unlikely-to-fail one. I recall years ago we tried a can_attach op
> > to test the compatibility but it didn't get merged. Maybe we'd need
> > it so that a concurrent attach can test compatibility, allowing the
> > re-attach in iommu_dev_reset_done() to more likely succeed.
>
> This is probably the cleanest option to split these things
Yea, that could avoid failing a concurrent attach_dev during FLR
unless the dryrun fails, helping non-SRIOV cases too.
So, next version could have some new preparatory patches:
- Pass in old domain to attach_dev
- Add a can_attach_dev op
Thanks
Nicolin
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-28 15:08 ` Nicolin Chen
@ 2025-08-28 18:46 ` Jason Gunthorpe
2025-08-28 19:35 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2025-08-28 18:46 UTC (permalink / raw)
To: Nicolin Chen
Cc: Ethan Zhao, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Thu, Aug 28, 2025 at 08:08:13AM -0700, Nicolin Chen wrote:
> On Thu, Aug 28, 2025 at 09:51:49AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 22, 2025 at 11:50:58AM -0700, Nicolin Chen wrote:
> >
> > > It feels like we need a no-fail re-attach operation, or at least an
> > > unlikely-to-fail one. I recall years ago we tried a can_attach op
> > > to test the compatibility but it didn't get merged. Maybe we'd need
> > > it so that a concurrent attach can test compatibility, allowing the
> > > re-attach in iommu_dev_reset_done() to more likely succeed.
> >
> > This is probably the cleanest option to split these things
>
> Yea, that could avoid failing a concurrent attach_dev during FLR
> unless the dryrun fails, helping non-SRIOV cases too.
>
> So, next version could have some new preparatory patches:
> - Pass in old domain to attach_dev
> - Add a can_attach_dev op
I wouldn't make this more complicated, just focus on the single device
case here then we move on from there
Just adding can_attach_dev is a big series on its own
Jason
* Re: [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device
2025-08-28 18:46 ` Jason Gunthorpe
@ 2025-08-28 19:35 ` Nicolin Chen
0 siblings, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-28 19:35 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ethan Zhao, robin.murphy, joro, bhelgaas, will, robin.clark,
yong.wu, matthias.bgg, angelogioacchino.delregno, thierry.reding,
vdumpa, jonathanh, rafael, lenb, kevin.tian, yi.l.liu, baolu.lu,
linux-arm-kernel, iommu, linux-kernel, linux-arm-msm,
linux-mediatek, linux-tegra, linux-acpi, linux-pci, patches,
pjaroszynski, vsethi, helgaas
On Thu, Aug 28, 2025 at 03:46:08PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 28, 2025 at 08:08:13AM -0700, Nicolin Chen wrote:
> > On Thu, Aug 28, 2025 at 09:51:49AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 22, 2025 at 11:50:58AM -0700, Nicolin Chen wrote:
> > >
> > > > It feels like we need a no-fail re-attach operation, or at least an
> > > > unlikely-to-fail one. I recall years ago we tried a can_attach op
> > > > to test the compatibility but it didn't get merged. Maybe we'd need
> > > > it so that a concurrent attach can test compatibility, allowing the
> > > > re-attach in iommu_dev_reset_done() to more likely succeed.
> > >
> > > This is probably the cleanest option to split these things
> >
> > Yea, that could avoid failing a concurrent attach_dev during FLR
> > unless the dryrun fails, helping non-SRIOV cases too.
> >
> > So, next version could have some new preparatory patches:
> > - Pass in old domain to attach_dev
> > - Add a can_attach_dev op
>
> I wouldn't make this more complicated, just focus on the single device
> case here then we move on from there
>
> Just adding can_attach_dev is a big series on its own
OK. I suppose a concurrent attach on a single device will be rare,
so failing it won't have much impact and thus can be Part 1.
Then, for Part 2, we will do can_attach_dev and support SR-IOV.
Thanks
Nicolin
* Re: [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper
2025-08-21 14:41 ` Nicolin Chen
@ 2025-08-31 23:24 ` Nicolin Chen
0 siblings, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-08-31 23:24 UTC (permalink / raw)
To: Tian, Kevin
Cc: robin.murphy@arm.com, joro@8bytes.org, bhelgaas@google.com,
jgg@nvidia.com, will@kernel.org, robin.clark@oss.qualcomm.com,
yong.wu@mediatek.com, matthias.bgg@gmail.com,
angelogioacchino.delregno@collabora.com, thierry.reding@gmail.com,
vdumpa@nvidia.com, jonathanh@nvidia.com, rafael@kernel.org,
lenb@kernel.org, Liu, Yi L, baolu.lu@linux.intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org,
linux-mediatek@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org,
patches@lists.linux.dev, Jaroszynski, Piotr, Sethi, Vikram,
helgaas@kernel.org, etzhao1900@gmail.com
Hi Kevin,
On Thu, Aug 21, 2025 at 07:41:57AM -0700, Nicolin Chen wrote:
> On Thu, Aug 21, 2025 at 08:07:01AM +0000, Tian, Kevin wrote:
> > Usually the difference between func() and func_locked() is only about
> > whether the caller holds a lock. If they mean to return different
> > domains, may need better naming (though I don't have a good option
> > now)...
>
> Yea, naming is tricky here :)
>
> Perhaps it could follow a different pattern:
> iommu_dev_get_domain_for_driver(struct device *dev);
>
> With a simple note highlighting that it is meant to be used by IOMMU
> drivers in iommu callback functions like attach_dev.
I am keeping this "_locked" naming in v4, as there are only two
places calling this new helper and only one of them is a driver.
Please check v4 and see if you still prefer a different name.
Thanks
Nic
end of thread, other threads:[~2025-08-31 23:24 UTC | newest]
Thread overview: 55+ messages
2025-08-11 22:59 [PATCH v3 0/5] Disable ATS via iommu during PCI resets Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 1/5] iommu: Lock group->mutex in iommu_deferred_attach Nicolin Chen
2025-08-15 5:09 ` Baolu Lu
2025-08-15 8:27 ` Tian, Kevin
2025-08-15 8:24 ` Tian, Kevin
2025-08-15 19:26 ` Nicolin Chen
2025-08-18 14:17 ` Jason Gunthorpe
2025-08-18 17:45 ` Nicolin Chen
2025-08-18 18:09 ` Jason Gunthorpe
2025-08-18 18:29 ` Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 2/5] iommu: Pass in gdev to __iommu_device_set_domain Nicolin Chen
2025-08-15 5:18 ` Baolu Lu
2025-08-15 8:29 ` Tian, Kevin
2025-08-11 22:59 ` [PATCH v3 3/5] iommu: Add iommu_get_domain_for_dev_locked() helper Nicolin Chen
2025-08-15 5:28 ` Baolu Lu
2025-08-15 18:48 ` Nicolin Chen
2025-08-18 14:40 ` Jason Gunthorpe
2025-08-19 2:09 ` Baolu Lu
2025-08-15 8:55 ` Tian, Kevin
2025-08-15 18:45 ` Nicolin Chen
2025-08-21 8:07 ` Tian, Kevin
2025-08-21 14:41 ` Nicolin Chen
2025-08-31 23:24 ` Nicolin Chen
2025-08-18 14:39 ` Jason Gunthorpe
2025-08-18 17:22 ` Nicolin Chen
2025-08-18 23:42 ` Jason Gunthorpe
2025-08-19 5:09 ` Nicolin Chen
2025-08-19 12:52 ` Jason Gunthorpe
2025-08-19 17:22 ` Nicolin Chen
2025-08-21 13:13 ` Jason Gunthorpe
2025-08-21 15:22 ` Nicolin Chen
2025-08-21 18:37 ` Jason Gunthorpe
2025-08-22 5:11 ` Nicolin Chen
2025-08-21 8:11 ` Tian, Kevin
2025-08-21 13:14 ` Jason Gunthorpe
2025-08-22 9:45 ` Tian, Kevin
2025-08-22 13:21 ` Jason Gunthorpe
2025-08-11 22:59 ` [PATCH v3 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Nicolin Chen
2025-08-15 5:49 ` Baolu Lu
2025-08-15 20:10 ` Nicolin Chen
2025-08-15 9:14 ` Tian, Kevin
2025-08-15 19:45 ` Nicolin Chen
2025-08-11 22:59 ` [PATCH v3 5/5] pci: Suspend iommu function prior to resetting a device Nicolin Chen
2025-08-19 14:12 ` Ethan Zhao
2025-08-19 21:59 ` Nicolin Chen
2025-08-20 3:18 ` Ethan Zhao
2025-08-22 6:14 ` Nicolin Chen
2025-08-21 13:07 ` Jason Gunthorpe
2025-08-22 6:35 ` Nicolin Chen
2025-08-22 14:08 ` Jason Gunthorpe
2025-08-22 18:50 ` Nicolin Chen
2025-08-28 12:51 ` Jason Gunthorpe
2025-08-28 15:08 ` Nicolin Chen
2025-08-28 18:46 ` Jason Gunthorpe
2025-08-28 19:35 ` Nicolin Chen