From: Nicolin Chen <nicolinc@nvidia.com>
To: <jgg@nvidia.com>, <joro@8bytes.org>, <will@kernel.org>,
	<robin.murphy@arm.com>, <bhelgaas@google.com>
Cc: <iommu@lists.linux.dev>, <linux-kernel@vger.kernel.org>,
	<linux-pci@vger.kernel.org>, <patches@lists.linux.dev>,
	<pjaroszynski@nvidia.com>, <vsethi@nvidia.com>
Subject: [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
Date: Mon, 9 Jun 2025 11:45:13 -0700
Message-ID: <4153fb7131dda901b13a2e90654232fe059c8f09.1749494161.git.nicolinc@nvidia.com>
In-Reply-To: <cover.1749494161.git.nicolinc@nvidia.com>

Per PCIe r6.3, sec 10.3.1 IMPLEMENTATION NOTE, software should disable ATS
before initiating a Function Level Reset, and ensure that no invalidation
requests are issued to the device while its ATS capability is disabled.

Since pci_enable_ats() and pci_disable_ats() are called by an IOMMU driver,
while an unsolicited FLR can happen at any time in the PCI layer, the PCI code
has to notify the IOMMU subsystem about an ongoing FLR. Add a pair of new
IOMMU APIs that the PCI reset functions will call before and after the reset
routines.

However, if a domain attachment/replacement happens during an ongoing reset,
ATS might be re-enabled between the two function calls. So
iommu_dev_reset_prepare() has to hold the group mutex, to avoid this race
condition, until iommu_dev_reset_done() is finished. Thus, these two functions
are a strong pair that must be used together.
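
For illustration, the calling convention in a PCI reset path would look
roughly like the sketch below (a hypothetical caller, not part of this patch;
the foo_ function name is made up and error handling is trimmed):

	static int foo_reset_with_iommu(struct pci_dev *pdev)
	{
		int ret;

		/* Block ATS and take the group mutex across the reset */
		ret = iommu_dev_reset_prepare(&pdev->dev);
		if (ret)
			return ret;

		/* ... run the FLR or other core-level reset routine ... */

		/* Re-attach the retained domains and release the mutex */
		iommu_dev_reset_done(&pdev->dev);
		return 0;
	}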

Inside the mutex, these two functions dock all RID and PASID domains to the
IOMMU_DOMAIN_BLOCKED domain. This further disables ATS in two ways: an IOMMU
driver should disable ATS in its control bits (e.g. SMMU's STE.EATS), and it
should call pci_disable_ats() as well.

Notes:
 - This only works for IOMMU drivers that implement ops->blocked_domain
   correctly, i.e. calling pci_disable_ats().
 - This only works for IOMMU drivers that will not issue ATS invalidation
   requests to the device after it is docked at ops->blocked_domain.
Drivers should be fixed to align with the notes above; see the sketch below.
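
For reference, a driver-side blocked-domain attach handler that satisfies both
notes might look roughly like this (a hedged sketch; the foo_* names and the
per-device "master" structure are illustrative, not taken from any existing
driver):

	static int foo_attach_dev_blocked(struct iommu_domain *domain,
					  struct device *dev)
	{
		struct foo_master *master = dev_iommu_priv_get(dev);

		/* Clear the HW ATS-enable control bit (e.g. SMMU STE.EATS) */
		foo_disable_ats_in_hw(master);

		/* Also disable the ATS capability on the PCI function */
		if (dev_is_pci(dev) && master->ats_enabled) {
			pci_disable_ats(to_pci_dev(dev));
			master->ats_enabled = false;
		}
		return 0;
	}

Once docked at such a blocked_domain, the driver must also stop queueing ATS
invalidations for the device, which covers the second note above.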

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommu.h |  12 +++++
 drivers/iommu/iommu.c | 106 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 156732807994..a17161b8625a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1123,6 +1123,9 @@ void dev_iommu_priv_set(struct device *dev, void *priv);
 extern struct mutex iommu_probe_device_lock;
 int iommu_probe_device(struct device *dev);
 
+int iommu_dev_reset_prepare(struct device *dev);
+void iommu_dev_reset_done(struct device *dev);
+
 int iommu_device_use_default_domain(struct device *dev);
 void iommu_device_unuse_default_domain(struct device *dev);
 
@@ -1407,6 +1410,15 @@ static inline int iommu_fwspec_add_ids(struct device *dev, u32 *ids,
 	return -ENODEV;
 }
 
+static inline int iommu_dev_reset_prepare(struct device *dev)
+{
+	return 0;
+}
+
+static inline void iommu_dev_reset_done(struct device *dev)
+{
+}
+
 static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
 {
 	return NULL;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index a4b606c591da..3c1854c5e55e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3795,6 +3795,112 @@ int iommu_replace_group_handle(struct iommu_group *group,
 }
 EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
 
+/*
+ * Deadlock Alert
+ *
+ * Caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done() together
+ * before/after the core-level reset routine, as iommu_dev_reset_prepare() holds
+ * the group->mutex that will only be released in iommu_dev_reset_done().
+ */
+int iommu_dev_reset_prepare(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	const struct iommu_ops *ops;
+	unsigned long pasid;
+	void *entry;
+	int ret;
+
+	/* Before locking */
+	if (!dev_has_iommu(dev))
+		return 0;
+
+	if (dev->iommu->require_direct) {
+		dev_warn(dev,
+			 "Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.\n");
+		return -EINVAL;
+	}
+
+	ops = dev_iommu_ops(dev);
+	if (!ops->blocked_domain) {
+		dev_warn(dev,
+			 "IOMMU driver doesn't support IOMMU_DOMAIN_BLOCKED\n");
+		return -EOPNOTSUPP;
+	}
+
+	/*
+	 * group->mutex starts
+	 *
+	 * This has to hold the group mutex until the reset is done, to prevent
+	 * any RID or PASID domain attachment/replacement, which otherwise might
+	 * re-enable the ATS during the reset cycle.
+	 */
+	mutex_lock(&group->mutex);
+
+	/* Device is already attached to the blocked_domain. Nothing to do */
+	if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
+		return 0;
+
+	/* Dock RID domain to blocked_domain while retaining group->domain */
+	ret = __iommu_attach_device(ops->blocked_domain, dev);
+	if (ret)
+		return ret;
+
+	/* Dock PASID domains to blocked_domain while retaining pasid_array */
+	xa_lock(&group->pasid_array);
+	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+		iommu_remove_dev_pasid(dev, pasid,
+				       pasid_array_entry_to_domain(entry));
+	xa_unlock(&group->pasid_array);
+
+	/* group->mutex is held. Caller must invoke iommu_dev_reset_done() */
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
+
+/*
+ * This is the resume routine of iommu_dev_reset_prepare(). It unlocks the
+ * group mutex at the end, after all RID/PASID domains are re-attached.
+ *
+ * Note that, although unlikely, there is a risk that re-attaching a domain
+ * might fail due to an unexpected event such as OOM.
+ */
+void iommu_dev_reset_done(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	const struct iommu_ops *ops;
+	unsigned long pasid;
+	void *entry;
+
+	/* Previously unlocked */
+	if (!dev_has_iommu(dev))
+		return;
+	ops = dev_iommu_ops(dev);
+	if (!ops->blocked_domain)
+		return;
+
+	/* group->mutex held in iommu_dev_reset_prepare() continues from here */
+	WARN_ON(!lockdep_is_held(&group->mutex));
+
+	if (group->domain->type == IOMMU_DOMAIN_BLOCKED)
+		goto unlock;
+
+	/* Shift RID domain back to group->domain */
+	WARN_ON(__iommu_attach_device(group->domain, dev));
+
+	/* Shift PASID domains back to domains retained in pasid_array */
+	xa_lock(&group->pasid_array);
+	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+		WARN_ON(__iommu_set_group_pasid(
+			pasid_array_entry_to_domain(entry), group, pasid,
+			ops->blocked_domain));
+	xa_unlock(&group->pasid_array);
+
+unlock:
+	mutex_unlock(&group->mutex);
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
+
 #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
 /**
  * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
-- 
2.43.0


Thread overview: 21+ messages
2025-06-09 18:45 [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Nicolin Chen
2025-06-09 18:45 ` Nicolin Chen [this message]
2025-06-10  4:26   ` [PATCH RFC v1 1/2] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done() Baolu Lu
2025-06-10  7:07     ` Nicolin Chen
2025-06-10 13:04       ` Jason Gunthorpe
2025-06-10 14:40         ` Robin Murphy
2025-06-10 15:36           ` Jason Gunthorpe
2025-06-10 16:31             ` Robin Murphy
2025-06-10 16:43               ` Jason Gunthorpe
2025-06-10 20:19                 ` Nicolin Chen
2025-06-10 23:41                   ` Jason Gunthorpe
2025-06-09 18:45 ` [PATCH RFC v1 2/2] pci: Suspend ATS before doing FLR Nicolin Chen
2025-06-10  4:27   ` Baolu Lu
2025-06-10  6:55     ` Nicolin Chen
2025-06-10 15:37 ` [PATCH RFC v1 0/2] iommu&pci: Disable ATS during FLR resets Robin Murphy
2025-06-10 16:30   ` Jason Gunthorpe
2025-06-10 20:36     ` Nicolin Chen
2025-06-10 23:43       ` Jason Gunthorpe
2025-06-13 19:27     ` Bjorn Helgaas
2025-06-13 21:10       ` Nicolin Chen
2025-06-16 13:09       ` Jason Gunthorpe
