[RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
@ 2025-03-19  0:42 Pranjal Shrivastava
  2025-03-19  0:42 ` [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
                   ` (6 more replies)
  0 siblings, 7 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-19  0:42 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

As arm-smmu-v3 rapidly finds its way into SoCs designed for hand-held
devices, power management capabilities, similar to its predecessors, are
crucial for these applications. This series introduces power management
support for the arm-smmu-v3 driver.

Design
=======
The arm-smmu-v3 primarily operates with in-memory data structures
through HW registers pointing to these data structures in some fashion.
The proposed design tries to make use of this fact for implementing the
suspend and resume ops.

1. Suspend / Resume
An initial idea for the "suspend" op is to wait till the command queue
finishes all commands before disabling the SMMU through CR0. In order to
avoid mis-use / spurious transactions (b/w SMMU disable -> power-down),
the GBPA register is configured to abort all transactions.

The resume operation uses the `arm_smmu_device_reset` function which
re-initializes the HW using the SW-copies maintained by the driver. For
example, prod/cons for queues, base addresses for queues & tables. The
arm_smmu_device_reset also clears the TLBs.

2. Interrupt Re-config
a. Wired irqs: the series refactors the `arm_smmu_setup_irqs` to be
able to enable/disable irqs and install their handlers separately to
help with the re-initialization of the interrupts correctly.

b. MSIs: The thought was of 2 approaches to teardown & re-config MSIs:

 1. Free MSIs on suspend and re-alloc on resume (implemented in series)
 2. Meddle with the msi_desc and use get_cached_msi_msg to resume MSIs.

The first approach (implemented in this series) may be potentially
slower, whereas the second one could reduce the time taken to resume.
However, the second one feels hacky as it forces us to cache the msi_msg
in `arm_smmu_write_msg` or in the core code (`irq_chip_write_msi_msg`)
as suggested in [1].

3. Invoking runtime_pm_get/put
Given that most of the configuration done by arm-smmu-v3 is stored in
memory, the initial idea is to focus on areas where the driver accesses
the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.

Instead of wrapping every exposed op with an rpm_get/put, the idea is to
follow through till a logical common point (lowest common ancestor in
the call hierarchies) and wrap those with the rpm_get/put calls.

Future Work
===========
- Elide TLBIs when the SMMU is powered down

Call for Review
================
Any insights/comments on the proposed changes are appreciated,
especially in areas related to locking, atomic contexts, early resume
etc. or any potential optimization.

Note: The series isn't tested with MSIs as I couldn't find a platform
that supports MSIs. Any help in testing MSIs is appreciated.

References
===========
[1]
https://lore.kernel.org/lkml/20210721013350.17664-1-cuibixuan@huawei.com/T/#m06bf92e2306ba52c106f2d4d24765921d4e9781e

[2]
https://lore.kernel.org/all/20180830144541.17740-1-vivek.gautam@codeaurora.org/

Pranjal Shrivastava (5):
  iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
  iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks
  iommu/arm-smmu-v3: Invoke pm_runtime before hw access

 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     |  22 +-
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  24 ++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 297 ++++++++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   5 +
 4 files changed, 326 insertions(+), 22 deletions(-)

-- 
2.49.0.rc1.451.g8f38331e32-goog

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-03-19  0:42 [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
@ 2025-03-19  0:42 ` Pranjal Shrivastava
  2025-03-19  4:50   ` Nicolin Chen
                     ` (2 more replies)
  2025-03-19  0:42 ` [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains Pranjal Shrivastava
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-19  0:42 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Refactor arm_smmu_setup_irqs by splitting it into two parts, one for
registering interrupt handlers and the other one for enabling interrupt
generation in the hardware. This refactor helps in re-initialization of
hardware interrupts as part of a subsequent patch that enables runtime
power management for the arm-smmu-v3 driver.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++++++++++++++++-----
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 358072b4e293..80362d9f293b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3985,12 +3985,24 @@ static void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu)
 	}
 }
 
-static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
+static void arm_smmu_enable_irqs(struct arm_smmu_device *smmu)
 {
-	int ret, irq;
+	int ret;
 	u32 irqen_flags = IRQ_CTRL_EVTQ_IRQEN | IRQ_CTRL_GERROR_IRQEN;
 
-	/* Disable IRQs first */
+	if (smmu->features & ARM_SMMU_FEAT_PRI)
+		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
+
+	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
+				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
+	if (ret)
+		dev_warn(smmu->dev, "failed to enable irqs\n");
+}
+
+static int arm_smmu_disable_irqs(struct arm_smmu_device *smmu)
+{
+	int ret;
+
 	ret = arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
 				      ARM_SMMU_IRQ_CTRLACK);
 	if (ret) {
@@ -3998,6 +4010,19 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
 		return ret;
 	}
 
+	return 0;
+}
+
+static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
+{
+	int ret, irq;
+
+	/* Disable IRQs first */
+	ret = arm_smmu_disable_irqs(smmu);
+
+	if (ret)
+		return ret;
+
 	irq = smmu->combined_irq;
 	if (irq) {
 		/*
@@ -4014,15 +4039,6 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
 	} else
 		arm_smmu_setup_unique_irqs(smmu);
 
-	if (smmu->features & ARM_SMMU_FEAT_PRI)
-		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
-
-	/* Enable interrupt generation on the SMMU */
-	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
-				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
-	if (ret)
-		dev_warn(smmu->dev, "failed to enable irqs\n");
-
 	return 0;
 }
 
@@ -4171,6 +4187,9 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
 		return ret;
 	}
 
+	/* Enable interrupt generation on the SMMU */
+	arm_smmu_enable_irqs(smmu);
+
 	if (is_kdump_kernel())
 		enables &= ~(CR0_EVTQEN | CR0_PRIQEN);
 
@@ -4754,6 +4773,13 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 	/* Check for RMRs and install bypass STEs if any */
 	arm_smmu_rmr_install_bypass_ste(smmu);
 
+	/* Setup interrupt handlers */
+	ret = arm_smmu_setup_irqs(smmu);
+	if (ret) {
+		dev_err(smmu->dev, "failed to setup irqs\n");
+		return ret;
+	}
+
 	/* Reset the device */
 	ret = arm_smmu_device_reset(smmu);
 	if (ret)
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
  2025-03-19  0:42 [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
  2025-03-19  0:42 ` [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
@ 2025-03-19  0:42 ` Pranjal Shrivastava
  2025-03-20 22:30   ` Mostafa Saleh
                     ` (2 more replies)
  2025-03-19  0:42 ` [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-19  0:42 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Before we suspend SMMU, we want to ensure that all commands have been
consumed by the SMMU's command queue, i.e. the cmdq is empty. Add a
helper function that waits till the cmdq is empty using `queue_empty`.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 80362d9f293b..f65d9bca0392 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -754,6 +754,30 @@ static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 	return __arm_smmu_cmdq_poll_until_consumed(smmu, cmdq, llq);
 }
 
+/*
+ * Wait until the SMMU cmdq is empty
+ * Must be called with the cmdq lock held in some capacity.
+ */
+static int arm_smmu_cmdq_poll_until_empty(struct arm_smmu_device *smmu,
+					  struct arm_smmu_cmdq *cmdq,
+					  struct arm_smmu_ll_queue *llq)
+{
+	struct arm_smmu_queue_poll qp;
+	int ret = 0;
+
+	queue_poll_init(smmu, &qp);
+	llq->val = READ_ONCE(cmdq->q.llq.val);
+	do {
+		if (queue_empty(llq))
+			break;
+
+		ret = queue_poll(&qp);
+		llq->cons = readl(cmdq->q.cons_reg);
+	} while (!ret);
+
+	return ret;
+}
+
 static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
 					u32 prod, int n)
 {
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-03-19  0:42 [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
  2025-03-19  0:42 ` [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
  2025-03-19  0:42 ` [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains Pranjal Shrivastava
@ 2025-03-19  0:42 ` Pranjal Shrivastava
  2025-03-20 22:33   ` Mostafa Saleh
                     ` (2 more replies)
  2025-03-19  0:42 ` [RFC PATCH 4/5] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-19  0:42 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Implement pm_runtime and system sleep ops for arm-smmu-v3. The smmu is
disabled as part of the suspend callbacks after ensuring the completion
of all pending commands and is configured to abort any transactions that
happen after disabling the smmu. The smmu shall be reinitialized in the
resume callback by invoking the `arm_smmu_device_reset` helper.

The MSIs are freed as part of the suspend and are re-allocated during
the resume operation.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 106 ++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   2 +
 2 files changed, 108 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f65d9bca0392..caf750470772 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -26,6 +26,7 @@
 #include <linux/pci.h>
 #include <linux/pci-ats.h>
 #include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
 #include <linux/string_choices.h>
 #include <kunit/visibility.h>
 #include <uapi/linux/iommufd.h>
@@ -108,6 +109,32 @@ static const char * const event_class_str[] = {
 
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
 
+static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
+{
+	int ret;
+
+	if (pm_runtime_enabled(smmu->dev)) {
+		ret = pm_runtime_resume_and_get(smmu->dev);
+		if (ret < 0) {
+			dev_err(smmu->dev, "Failed to resume device: %d\n", ret);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
+{
+	int ret;
+
+	if (pm_runtime_enabled(smmu->dev)) {
+		ret = pm_runtime_put_autosuspend(smmu->dev);
+		if (ret < 0)
+			dev_err(smmu->dev, "Failed to suspend device: %d\n", ret);
+	}
+}
+
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
 	int i = 0;
@@ -4850,6 +4877,84 @@ static void arm_smmu_device_shutdown(struct platform_device *pdev)
 	arm_smmu_device_disable(smmu);
 }
 
+
+static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
+{
+	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+
+	/* We might get the vcmdq */
+	struct arm_smmu_cmdq_ent cmd = {
+		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
+			CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
+	};
+
+	struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
+	struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
+
+	/*
+	 * Since suspend is invoked when all clients have been
+	 * we don't expect more commands to be added to the cmdq.
+	 * Thus, wait for all existing commands to complete.
+	 */
+	arm_smmu_cmdq_shared_lock(cmdq);
+	arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
+	arm_smmu_cmdq_shared_unlock(cmdq);
+
+	/* Disable all queues */
+	arm_smmu_device_disable(smmu);
+
+	/* Abort all transactions to avoid spurious bypass */
+	arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
+
+	/* Free all MSIs (re-allocated on resume) */
+	arm_smmu_free_msis(dev);
+
+	dev_dbg(dev, "Suspending smmu\n");
+	return 0;
+}
+
+static int __maybe_unused arm_smmu_runtime_resume(struct device *dev)
+{
+	int ret;
+	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+
+	dev_dbg(dev, "Resuming device\n");
+
+	/*
+	 * The reset will re-initialize all the queues with the base addr,
+	 * prod and cons maintained within struct arm_smmu_device as well as
+	 * re-wire the IRQs and/or enable MSIs and install relevant handlers.
+	 */
+	ret = arm_smmu_device_reset(smmu);
+
+	if (ret)
+		dev_err(dev, "Failed to reset during resume operation: %d\n", ret);
+
+	return ret;
+}
+
+static int __maybe_unused arm_smmu_pm_suspend(struct device *dev)
+{
+	if (pm_runtime_suspended(dev))
+		return 0;
+
+	return arm_smmu_runtime_suspend(dev);
+}
+
+static int __maybe_unused arm_smmu_pm_resume(struct device *dev)
+{
+	if (pm_runtime_suspended(dev))
+		return 0;
+
+	return arm_smmu_runtime_resume(dev);
+}
+
+static const struct dev_pm_ops arm_smmu_pm_ops = {
+	SET_SYSTEM_SLEEP_PM_OPS(arm_smmu_pm_suspend, arm_smmu_pm_resume)
+	SET_RUNTIME_PM_OPS(arm_smmu_runtime_suspend,
+			   arm_smmu_runtime_resume, NULL)
+};
+
 static const struct of_device_id arm_smmu_of_match[] = {
 	{ .compatible = "arm,smmu-v3", },
 	{ },
@@ -4866,6 +4971,7 @@ static struct platform_driver arm_smmu_driver = {
 	.driver	= {
 		.name			= "arm-smmu-v3",
 		.of_match_table		= arm_smmu_of_match,
+		.pm                     = &arm_smmu_pm_ops,
 		.suppress_bind_attrs	= true,
 	},
 	.probe	= arm_smmu_device_probe,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index bd9d7c85576a..6bac83a511e2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -505,6 +505,8 @@ static inline unsigned int arm_smmu_cdtab_l2_idx(unsigned int ssid)
 #define MSI_IOVA_BASE			0x8000000
 #define MSI_IOVA_LENGTH			0x100000
 
+#define RPM_AUTOSUSPEND_DELAY_MS	15
+
 enum pri_resp {
 	PRI_RESP_DENY = 0,
 	PRI_RESP_FAIL = 1,
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 4/5] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks
  2025-03-19  0:42 [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (2 preceding siblings ...)
  2025-03-19  0:42 ` [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
@ 2025-03-19  0:42 ` Pranjal Shrivastava
  2025-03-20 22:34   ` Mostafa Saleh
  2025-03-19  0:42 ` [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-19  0:42 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Enable PM runtime for SMMUs having a power-domain during smmu probe.
Add a devlink between the clients and SMMU device. The absence of a
power domain effectively disables runtime power management.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index caf750470772..602784e84e84 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3515,6 +3515,9 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 		pci_prepare_ats(to_pci_dev(dev), stu);
 	}
 
+	device_link_add(dev, smmu->dev,
+			DL_FLAG_PM_RUNTIME | DL_FLAG_AUTOREMOVE_SUPPLIER);
+
 	return &smmu->iommu;
 
 err_free_master:
@@ -4836,11 +4839,18 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 	if (ret)
 		goto err_disable;
 
+	if (dev->pm_domain) {
+		pm_runtime_set_active(dev);
+		pm_runtime_use_autosuspend(dev);
+		pm_runtime_set_autosuspend_delay(dev, RPM_AUTOSUSPEND_DELAY_MS);
+		pm_runtime_enable(dev);
+	}
+
 	/* And we're up. Go go go! */
 	ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
 				     "smmu3.%pa", &ioaddr);
 	if (ret)
-		goto err_disable;
+		goto err_rpm_disable;
 
 	ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
 	if (ret) {
@@ -4852,6 +4862,8 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 
 err_free_sysfs:
 	iommu_device_sysfs_remove(&smmu->iommu);
+err_rpm_disable:
+	pm_runtime_disable(dev);
 err_disable:
 	arm_smmu_device_disable(smmu);
 err_free_iopf:
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access
  2025-03-19  0:42 [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (3 preceding siblings ...)
  2025-03-19  0:42 ` [RFC PATCH 4/5] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
@ 2025-03-19  0:42 ` Pranjal Shrivastava
  2025-03-19 12:04   ` Jason Gunthorpe
  2025-03-20 22:36   ` Mostafa Saleh
  2025-03-19 11:57 ` [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Jason Gunthorpe
  2025-03-19 18:22 ` Robin Murphy
  6 siblings, 2 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-19  0:42 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Invoke the pm_runtime helpers at all places before accessing the hw.
The idea is to invoke runtime_pm helpers at common points which are
used by exposed ops or interrupt handlers. For cases without such a
common point, simply wrap the exposed op/handler with rpm_get/put calls.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     |  22 +++-
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  24 ++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 107 ++++++++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   3 +
 4 files changed, 145 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 5aa2e7af58b4..1d423fc2cf18 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -13,10 +13,17 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
 	struct iommu_hw_info_arm_smmuv3 *info;
 	u32 __iomem *base_idr;
 	unsigned int i;
+	int ret;
+
+	ret = arm_smmu_rpm_get(master->smmu);
+	if (ret < 0)
+		return ERR_PTR(-EIO);
 
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
-	if (!info)
+	if (!info) {
+		arm_smmu_rpm_put(master->smmu);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	base_idr = master->smmu->base + ARM_SMMU_IDR0;
 	for (i = 0; i <= 5; i++)
@@ -27,6 +34,7 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
 	*length = sizeof(*info);
 	*type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
 
+	arm_smmu_rpm_put(master->smmu);
 	return info;
 }
 
@@ -96,6 +104,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
 		.old_domain = iommu_get_domain_for_dev(dev),
 		.ssid = IOMMU_NO_PASID,
 	};
+	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_ste ste;
 	int ret;
 
@@ -104,6 +113,10 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
 	if (arm_smmu_ssids_in_use(&master->cd_table))
 		return -EBUSY;
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return ret;
+
 	mutex_lock(&arm_smmu_asid_lock);
 	/*
 	 * The VM has to control the actual ATS state at the PCI device because
@@ -117,6 +130,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
 	ret = arm_smmu_attach_prepare(&state, domain);
 	if (ret) {
 		mutex_unlock(&arm_smmu_asid_lock);
+		arm_smmu_rpm_put(smmu);
 		return ret;
 	}
 
@@ -125,6 +139,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
 	arm_smmu_install_ste_for_dev(master, &ste);
 	arm_smmu_attach_commit(&state);
 	mutex_unlock(&arm_smmu_asid_lock);
+	arm_smmu_rpm_put(smmu);
 	return 0;
 }
 
@@ -301,6 +316,10 @@ static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
 	cur = cmds;
 	end = cmds + array->entry_num;
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return ret;
+
 	static_assert(sizeof(*cmds) == 2 * sizeof(u64));
 	ret = iommu_copy_struct_from_full_user_array(
 		cmds, sizeof(*cmds), array,
@@ -329,6 +348,7 @@ static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
 		last = cur;
 	}
 out:
+	arm_smmu_rpm_put(smmu);
 	array->entry_num = cur - cmds;
 	kfree(cmds);
 	return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 9ba596430e7c..8f6d2533c493 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -140,6 +140,7 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
 {
 	struct arm_smmu_domain *smmu_domain =
 		container_of(mn, struct arm_smmu_domain, mmu_notifier);
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	size_t size;
 
 	/*
@@ -156,6 +157,10 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
 			size = 0;
 	}
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return;
+
 	if (!size)
 		arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
 	else
@@ -163,6 +168,7 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
 					    PAGE_SIZE, false, smmu_domain);
 
 	arm_smmu_atc_inv_domain(smmu_domain, start, size);
+	arm_smmu_rpm_put(smmu);
 }
 
 static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
@@ -170,8 +176,13 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	struct arm_smmu_domain *smmu_domain =
 		container_of(mn, struct arm_smmu_domain, mmu_notifier);
 	struct arm_smmu_master_domain *master_domain;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	unsigned long flags;
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return;
+
 	/*
 	 * DMA may still be running. Keep the cd valid to avoid C_BAD_CD events,
 	 * but disable translation.
@@ -195,6 +206,7 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
 	arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
+	arm_smmu_rpm_put(smmu);
 }
 
 static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn)
@@ -371,12 +383,24 @@ static int arm_smmu_sva_set_dev_pasid(struct iommu_domain *domain,
 static void arm_smmu_sva_domain_free(struct iommu_domain *domain)
 {
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	int ret;
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		/*
+		 * Even if the rpm_get fails, the TLBs were invalidated
+		 * with suspend, continue with the free operation.
+		 */
+		goto free_asid;
 
 	/*
 	 * Ensure the ASID is empty in the iommu cache before allowing reuse.
 	 */
 	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
+	arm_smmu_rpm_put(smmu);
 
+free_asid:
 	/*
 	 * Notice that the arm_smmu_mm_arch_invalidate_secondary_tlbs op can
 	 * still be called/running at this point. We allow the ASID to be
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 602784e84e84..f8d74ccaed6a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -109,7 +109,7 @@ static const char * const event_class_str[] = {
 
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
 
-static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
+int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
 {
 	int ret;
 
@@ -124,7 +124,7 @@ static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
 	return 0;
 }
 
-static void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
+void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
 {
 	int ret;
 
@@ -1037,7 +1037,9 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
 {
 	struct arm_smmu_cmdq_ent cmd = {0};
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct arm_smmu_device *smmu = master->smmu;
 	int sid = master->streams[0].id;
+	int ret;
 
 	if (WARN_ON(!master->stall_enabled))
 		return;
@@ -1057,6 +1059,10 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
 		break;
 	}
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return;
+
 	arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
 	/*
 	 * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP.
@@ -1064,6 +1070,7 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
 	 * terminated... at some point in the future. PRI_RESP is fire and
 	 * forget.
 	 */
+	arm_smmu_rpm_put(smmu);
 }
 
 /* Context descriptor manipulation functions */
@@ -1984,6 +1991,7 @@ static void arm_smmu_dump_event(struct arm_smmu_device *smmu, u64 *raw,
 
 static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 {
+	int ret;
 	u64 evt[EVTQ_ENT_DWORDS];
 	struct arm_smmu_event event = {0};
 	struct arm_smmu_device *smmu = dev;
@@ -1992,6 +2000,10 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return IRQ_NONE;
+
 	do {
 		while (!queue_remove_raw(q, evt)) {
 			arm_smmu_decode_event(smmu, evt, &event);
@@ -2012,6 +2024,7 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 
 	/* Sync our overflow flag, as we believe we're up to speed */
 	queue_sync_cons_ovf(q);
+	arm_smmu_rpm_put(smmu);
 	return IRQ_HANDLED;
 }
 
@@ -2059,6 +2072,11 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
 	struct arm_smmu_queue *q = &smmu->priq.q;
 	struct arm_smmu_ll_queue *llq = &q->llq;
 	u64 evt[PRIQ_ENT_DWORDS];
+	int ret;
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return IRQ_NONE;
 
 	do {
 		while (!queue_remove_raw(q, evt))
@@ -2070,6 +2088,7 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
 
 	/* Sync our overflow flag, as we believe we're up to speed */
 	queue_sync_cons_ovf(q);
+	arm_smmu_rpm_put(smmu);
 	return IRQ_HANDLED;
 }
 
@@ -2079,13 +2098,24 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 {
 	u32 gerror, gerrorn, active;
 	struct arm_smmu_device *smmu = dev;
+	int ret;
+
+	if (pm_runtime_enabled(smmu->dev)) {
+		ret = pm_runtime_get_if_active(smmu->dev);
+		if (ret == 0) {
+			dev_err(smmu->dev, "Ignoring gerror interrupt because device isn't rpm active\n");
+			return IRQ_NONE;
+		}
+	}
 
 	gerror = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
 	gerrorn = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
 
 	active = gerror ^ gerrorn;
-	if (!(active & GERROR_ERR_MASK))
+	if (!(active & GERROR_ERR_MASK)) {
+		arm_smmu_rpm_put(smmu);
 		return IRQ_NONE; /* No errors pending */
+	}
 
 	dev_warn(smmu->dev,
 		 "unexpected global error reported (0x%08x), this could be serious\n",
@@ -2118,6 +2148,7 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 		arm_smmu_cmdq_skip_err(smmu);
 
 	writel(gerror, smmu->base + ARM_SMMU_GERRORN);
+	arm_smmu_rpm_put(smmu);
 	return IRQ_HANDLED;
 }
 
@@ -2291,6 +2322,11 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 	struct arm_smmu_domain *smmu_domain = cookie;
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_cmdq_ent cmd;
+	int ret;
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return;
 
 	/*
 	 * NOTE: when io-pgtable is in non-strict mode, we may get here with
@@ -2307,6 +2343,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 		arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 	}
 	arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
+	arm_smmu_rpm_put(smmu);
 }
 
 static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
@@ -2384,6 +2421,8 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 					  size_t granule, bool leaf,
 					  struct arm_smmu_domain *smmu_domain)
 {
+	int ret;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_cmdq_ent cmd = {
 		.tlbi = {
 			.leaf	= leaf,
@@ -2398,6 +2437,11 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 		cmd.opcode	= CMDQ_OP_TLBI_S2_IPA;
 		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
 	}
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return;
+
 	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
 
 	if (smmu_domain->nest_parent) {
@@ -2414,6 +2458,7 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 	 * zapped an entire table.
 	 */
 	arm_smmu_atc_inv_domain(smmu_domain, iova, size);
+	arm_smmu_rpm_put(smmu);
 }
 
 void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
@@ -2985,6 +3030,10 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	if (smmu_domain->smmu != smmu)
 		return ret;
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return ret;
+
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
 		cdptr = arm_smmu_alloc_cd_ptr(master, IOMMU_NO_PASID);
 		if (!cdptr)
@@ -3028,6 +3077,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 	arm_smmu_attach_commit(&state);
 	mutex_unlock(&arm_smmu_asid_lock);
+	arm_smmu_rpm_put(smmu);
 	return 0;
 }
 
@@ -3090,6 +3140,7 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
 		.ssid = pasid,
 		.old_domain = old,
 	};
+	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_cd *cdptr;
 	int ret;
 
@@ -3103,9 +3154,15 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
 	    sid_domain->type != IOMMU_DOMAIN_BLOCKED)
 		return -EINVAL;
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return ret;
+
 	cdptr = arm_smmu_alloc_cd_ptr(master, pasid);
-	if (!cdptr)
+	if (!cdptr) {
+		arm_smmu_rpm_put(smmu);
 		return -ENOMEM;
+	}
 
 	mutex_lock(&arm_smmu_asid_lock);
 	ret = arm_smmu_attach_prepare(&state, &smmu_domain->domain);
@@ -3127,6 +3184,7 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
 
 out_unlock:
 	mutex_unlock(&arm_smmu_asid_lock);
+	arm_smmu_rpm_put(smmu);
 	return ret;
 }
 
@@ -3136,6 +3194,12 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
 {
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(old_domain);
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct arm_smmu_device *smmu = master->smmu;
+	int ret;
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return ret;
 
 	mutex_lock(&arm_smmu_asid_lock);
 	arm_smmu_clear_cd(master, pasid);
@@ -3156,21 +3220,28 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
 		    sid_domain->type == IOMMU_DOMAIN_BLOCKED)
 			sid_domain->ops->attach_dev(sid_domain, dev);
 	}
+
+	arm_smmu_rpm_put(smmu);
 	return 0;
 }
 
-static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
+static int arm_smmu_attach_dev_ste(struct iommu_domain *domain,
 				    struct device *dev,
 				    struct arm_smmu_ste *ste,
 				    unsigned int s1dss)
 {
+	int ret;
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_attach_state state = {
 		.master = master,
 		.old_domain = iommu_get_domain_for_dev(dev),
 		.ssid = IOMMU_NO_PASID,
 	};
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return ret;
 	/*
 	 * Do not allow any ASID to be changed while are working on the STE,
 	 * otherwise we could miss invalidations.
@@ -3205,6 +3276,8 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
 	 * descriptor from arm_smmu_share_asid().
 	 */
 	arm_smmu_clear_cd(master, IOMMU_NO_PASID);
+	arm_smmu_rpm_put(smmu);
+	return 0;
 }
 
 static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
@@ -3214,8 +3287,8 @@ static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
 
 	arm_smmu_make_bypass_ste(master->smmu, &ste);
-	arm_smmu_attach_dev_ste(domain, dev, &ste, STRTAB_STE_1_S1DSS_BYPASS);
-	return 0;
+	return arm_smmu_attach_dev_ste(domain, dev, &ste,
+				       STRTAB_STE_1_S1DSS_BYPASS);
 }
 
 static const struct iommu_domain_ops arm_smmu_identity_ops = {
@@ -3233,9 +3306,8 @@ static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
 	struct arm_smmu_ste ste;
 
 	arm_smmu_make_abort_ste(&ste);
-	arm_smmu_attach_dev_ste(domain, dev, &ste,
-				STRTAB_STE_1_S1DSS_TERMINATE);
-	return 0;
+	return arm_smmu_attach_dev_ste(domain, dev, &ste,
+				       STRTAB_STE_1_S1DSS_TERMINATE);
 }
 
 static const struct iommu_domain_ops arm_smmu_blocked_ops = {
@@ -4874,10 +4946,19 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 static void arm_smmu_device_remove(struct platform_device *pdev)
 {
 	struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
+	int ret;
 
 	iommu_device_unregister(&smmu->iommu);
 	iommu_device_sysfs_remove(&smmu->iommu);
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret <  0)
+		goto free_iopf;
+
 	arm_smmu_device_disable(smmu);
+	arm_smmu_rpm_put(smmu);
+
+free_iopf:
 	iopf_queue_free(smmu->evtq.iopf);
 	ida_destroy(&smmu->vmid_map);
 }
@@ -4885,8 +4966,14 @@ static void arm_smmu_device_remove(struct platform_device *pdev)
 static void arm_smmu_device_shutdown(struct platform_device *pdev)
 {
 	struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
+	int ret;
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return;
 
 	arm_smmu_device_disable(smmu);
+	arm_smmu_rpm_put(smmu);
 }
 
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 6bac83a511e2..b7c2e28e903b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -987,6 +987,9 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 				struct arm_smmu_cmdq *cmdq, u64 *cmds, int n,
 				bool sync);
 
+int arm_smmu_rpm_get(struct arm_smmu_device *smmu);
+void arm_smmu_rpm_put(struct arm_smmu_device *smmu);
+
 #ifdef CONFIG_ARM_SMMU_V3_SVA
 bool arm_smmu_sva_supported(struct arm_smmu_device *smmu);
 bool arm_smmu_master_sva_supported(struct arm_smmu_master *master);
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-03-19  0:42 ` [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
@ 2025-03-19  4:50   ` Nicolin Chen
  2025-03-19  7:43     ` Pranjal Shrivastava
  2025-03-20 22:29   ` Mostafa Saleh
  2025-03-25 16:19   ` Daniel Mentz
  2 siblings, 1 reply; 71+ messages in thread
From: Nicolin Chen @ 2025-03-19  4:50 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:42:50AM +0000, Pranjal Shrivastava wrote:
> Refactor arm_smmu_setup_irqs by splitting it into two parts, one for
> registering interrupt handlers and the other one for enabling interrupt
> generation in the hardware. This refactor helps in re-initialization of
> hardware interrupts as part of a subsequent patch that enables runtime
> power management for the arm-smmu-v3 driver.
> 
> Signed-off-by: Pranjal Shrivastava <praan@google.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++++++++++++++++-----
>  1 file changed, 38 insertions(+), 12 deletions(-)

> +static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> +{
> +	int ret, irq;
> +
> +	/* Disable IRQs first */
> +	ret = arm_smmu_disable_irqs(smmu);

Why do we need to touch HW in this SW handler setting-up function?

> @@ -4171,6 +4187,9 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
>  		return ret;
>  	}
>  
> +	/* Enable interrupt generation on the SMMU */
> +	arm_smmu_enable_irqs(smmu);
> +

Should this replace arm_smmu_setup_irqs() in this function? I see
arm_smmu_setup_irqs() still remain in arm_smmu_device_reset(), so
a normal probe routine would call it twice?

In my view, arm_smmu_setup_irqs() can be done just once, moving
out of arm_smmu_device_reset() and up to arm_smmu_device_probe()?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-03-19  4:50   ` Nicolin Chen
@ 2025-03-19  7:43     ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-19  7:43 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Tue, Mar 18, 2025 at 09:50:16PM -0700, Nicolin Chen wrote:
> On Wed, Mar 19, 2025 at 12:42:50AM +0000, Pranjal Shrivastava wrote:
> > Refactor arm_smmu_setup_irqs by splitting it into two parts, one for
> > registering interrupt handlers and the other one for enabling interrupt
> > generation in the hardware. This refactor helps in re-initialization of
> > hardware interrupts as part of a subsequent patch that enables runtime
> > power management for the arm-smmu-v3 driver.
> > 
> > Signed-off-by: Pranjal Shrivastava <praan@google.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++++++++++++++++-----
> >  1 file changed, 38 insertions(+), 12 deletions(-)
> 
> > +static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> > +{
> > +	int ret, irq;
> > +
> > +	/* Disable IRQs first */
> > +	ret = arm_smmu_disable_irqs(smmu);
> 
> Why do we need to touch HW in this SW handler setting-up function?
> 

This is just to maintain status quo, if you look at the current
`arm_smmu_setup_irqs` we still:

	/* Disable IRQs first */
	ret = arm_smmu_write_reg_sync(smmu, 0,ARM_SMMU_IRQ_CTRL,
				      ARM_SMMU_IRQ_CTRLACK);


> > @@ -4171,6 +4187,9 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
> >  		return ret;
> >  	}
> >  
> > +	/* Enable interrupt generation on the SMMU */
> > +	arm_smmu_enable_irqs(smmu);
> > +
> 
> Should this replace arm_smmu_setup_irqs() in this function? I see
> arm_smmu_setup_irqs() still remain in arm_smmu_device_reset(), so
> a normal probe routine would call it twice?
> 
> In my view, arm_smmu_setup_irqs() can be done just once, moving
> out of arm_smmu_device_reset() and up to arm_smmu_device_probe()?
> 

That makes sense, I kept the `arm_smmu_setup_irqs()` here to re-init the
MSIs during resume but I guess it's better to call `arm_smmu_setup_irqs`
once during probe and maybe re-work to call `arm_smmu_setup_msis` during
resume to re-alloc the MSIs. I'll add this in the next version.

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-19  0:42 [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (4 preceding siblings ...)
  2025-03-19  0:42 ` [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
@ 2025-03-19 11:57 ` Jason Gunthorpe
  2025-03-19 16:07   ` Robin Murphy
  2025-03-19 18:22 ` Robin Murphy
  6 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-19 11:57 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:

> 3. Invoking runtime_pm_get/put
> Given that most of the configuration done by arm-smmu-v3 is stored in
> memory, the initial idea is to focus on areas where the driver accesses
> the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.

This seems weird, if the SMMU is suspended doesn't it also fail DMA
transactions? Why would ops like flush even be called if the HW is
disabled?

flush is performance path stuff, so it doesn't seem great to be adding
extra calls there.

> Instead of wrapping every exposed op with an rpm_get/put, the idea is to
> follow through till a logical common point (lowest common ancestor in
> the call hierarchies) and wrap those with the rpm_get/put calls.

Should the iommu core be doing this instead of hidden in drivers? It
seems like there might be better information there about when we
expect the IOMMU to be powered up and working and when it is OK to
have it shutdown?

And maybe in this direction we can do things like remove the
translation before doing a suspend so that flush ops are not a
problem.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access
  2025-03-19  0:42 ` [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
@ 2025-03-19 12:04   ` Jason Gunthorpe
  2025-03-20  7:25     ` Pranjal Shrivastava
  2025-03-20 22:36   ` Mostafa Saleh
  1 sibling, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-19 12:04 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:42:54AM +0000, Pranjal Shrivastava wrote:
> -static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
> +static int arm_smmu_attach_dev_ste(struct iommu_domain *domain,
>  				    struct device *dev,
>  				    struct arm_smmu_ste *ste,
>  				    unsigned int s1dss)
>  {

This can't be changed into something that fails! That would mess up
all the error handling if it ever blows up.

> @@ -3214,8 +3287,8 @@ static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
>  	struct arm_smmu_master master = dev_iommu_priv_get(dev);
>  
>  	arm_smmu_make_bypass_ste(master->smmu, &ste);
> -	arm_smmu_attach_dev_ste(domain, dev, &ste, STRTAB_STE_1_S1DSS_BYPASS);
> -	return 0;
> +	return arm_smmu_attach_dev_ste(domain, dev, &ste,
> +				       STRTAB_STE_1_S1DSS_BYPASS);
>  }
>  
>  static const struct iommu_domain_ops arm_smmu_identity_ops = {
> @@ -3233,9 +3306,8 @@ static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
>  	struct arm_smmu_ste ste;
>  
>  	arm_smmu_make_abort_ste(&ste);
> -	arm_smmu_attach_dev_ste(domain, dev, &ste,
> -				STRTAB_STE_1_S1DSS_TERMINATE);
> -	return 0;
> +	return arm_smmu_attach_dev_ste(domain, dev, &ste,
> +				       STRTAB_STE_1_S1DSS_TERMINATE);
>  }

Because these functions are required to be non-failing.

This feels kind of wonky, if we need to update a STE but the PM is off
then we just need to update the STE memory with some synchornization
and do nothing else. We don't need to wake up the HW and issue a cache
flush if the powered off cache could never have held a copy anyhow.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-19 11:57 ` [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Jason Gunthorpe
@ 2025-03-19 16:07   ` Robin Murphy
  2025-03-20 22:25     ` Mostafa Saleh
  0 siblings, 1 reply; 71+ messages in thread
From: Robin Murphy @ 2025-03-19 16:07 UTC (permalink / raw)
  To: Jason Gunthorpe, Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Nicolin Chen, Mostafa Saleh,
	Daniel Mentz, iommu

On 19/03/2025 11:57 am, Jason Gunthorpe wrote:
> On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:
> 
>> 3. Invoking runtime_pm_get/put
>> Given that most of the configuration done by arm-smmu-v3 is stored in
>> memory, the initial idea is to focus on areas where the driver accesses
>> the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> 
> This seems weird, if the SMMU is suspended doesn't it also fail DMA
> transactions? Why would ops like flush even be called if the HW is
> disabled?

Because once the device has finished its operation, its driver is free 
to call rpm_put() before calling dma_unmap(), so by the time that gets 
as far as TLB maintenance, the SMMU may already be asleep as well if 
that device was the only thing keeping it awake.

For direct IOMMU API users, pagetable update may be even more 
asynchronous from device activity, e.g. a GPU buffer might only be 
unmapped once userspace closes the last file handle referencing it, long 
after the GPU itself has moved on to other things.

> flush is performance path stuff, so it doesn't seem great to be adding
> extra calls there.

That much is true - this really wants to be using 
pm_runtime_get_if_in_use() nearly everywhere such that at most it's just 
juggling refcounts. There's no point waking the SMMU up just to issue a 
CFGI or TLBI, if the act of doing so is inherently going to do a full 
arm_smmu_reset() and thus invalidate everything anyway.

>> Instead of wrapping every exposed op with an rpm_get/put, the idea is to
>> follow through till a logical common point (lowest common ancestor in
>> the call hierarchies) and wrap those with the rpm_get/put calls.
> 
> Should the iommu core be doing this instead of hidden in drivers? It
> seems like there might be better information there about when we
> expect the IOMMU to be powered up and working and when it is OK to
> have it shutdown?

On the contrary, I would say it's very much driver-level knowledge as to 
which operations need the hardware accessible or not, and trying to do 
it at the core level would inevitably end up very pessimistic and 
inefficient, or at best just horribly complex.

> And maybe in this direction we can do things like remove the
> translation before doing a suspend so that flush ops are not a
> problem.

Case in point: that would be a load of extra work to do, make 
suspend/resume even slower, and even then for no benefit, since 
map_pages/umnmap_pages can still be called on the unattached domain 
(especially a default DMA domain), wherein drivers may still end up 
trying to touch their hardware for various reasons in ways invisible to 
the core API. Easy example: consider a driver calling 
dma_alloc_coherent() to prepare a buffer *before* it wakes up its client 
device (which would then also wake up the SMMU via the device link), and 
deep within that call, arm_lpae_init_pte() ends up calling back into 
arm_smmu_tlb_inv_walk() because it's laying down a block mapping where 
an old empty table PTE happens to be...

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-19  0:42 [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (5 preceding siblings ...)
  2025-03-19 11:57 ` [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Jason Gunthorpe
@ 2025-03-19 18:22 ` Robin Murphy
  2025-03-19 19:46   ` Jason Gunthorpe
  2025-03-20 14:13   ` Pranjal Shrivastava
  6 siblings, 2 replies; 71+ messages in thread
From: Robin Murphy @ 2025-03-19 18:22 UTC (permalink / raw)
  To: Pranjal Shrivastava, Joerg Roedel, Will Deacon, Jason Gunthorpe
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu

On 19/03/2025 12:42 am, Pranjal Shrivastava wrote:
> As arm-smmu-v3 rapidly finds its way into SoCs designed for hand-held
> devices, power management capabilities, similar to its predecessors, are
> crucial for these applications. This series introduces power management
> support for the arm-smmu-v3 driver.
> 
> Design
> =======
> The arm-smmu-v3 primarily operates with in-memory data structures
> through HW registers pointing to these data structures in some fashion.
> The proposed design tries to make use of this fact for implementing the
> suspend and resume ops.
> 
> 1. Suspend / Resume
> An initial idea for the "suspend" op is to wait till the command queue
> finishes all commands before disabling the SMMU through CR0. In order to
> avoid mis-use / spurious transactions (b/w SMMU disable -> power-down),
> the GBPA register is configured to abort all transactions.

But why disable SMMUEN at all? We can't expect the SMMU to come back in 
the same configuration we left it anyway (*including* GBPA, note), so 
what's the benefit of spending time and effort to potentially change its 
behaviour for a handful of microseconds before it all goes dark anyway?

Draining the command queue shouldn't strictly be necessary it only 
contains invalidations and syncs, since we know all the cached state 
they could refer to is going to be thrown away by the time the SMMU has 
come back anyway. If anything, making sure there are no outstanding IOPF 
handlers and the event queue and PRI queue are clear (as expected) seems 
more important, because the potential to lose hardware tokens or leave a 
stalled transaction across the interconnect would be A Bad Thing.

> The resume operation uses the `arm_smmu_device_reset` function which
> re-initializes the HW using the SW-copies maintained by the driver. For
> example, prod/cons for queues, base addresses for queues & tables. The
> arm_smmu_device_reset also clears the TLBs.
> 
> 2. Interrupt Re-config
> a. Wired irqs: the series refactors the `arm_smmu_setup_irqs` to be
> able to enable/disable irqs and install their handlers separately to
> help with the re-initialization of the interrupts correctly.
> 
> b. MSIs: The thought was of 2 approaches to teardown & re-config MSIs:
> 
>   1. Free MSIs on suspend and re-alloc on resume (implemented in series)
>   2. Meddle with the msi_desc and use get_cached_msi_msg to resume MSIs.

3: Just explicitly save/restore the IRQ_CFG{0,1} registers in the PM 
callbacks.
4: Have arm_smmu_write_msi_msg() stash the values, and defer the actual 
IRQ_CFG register writes until arm_smmu_setup_irqs(), alongside IRQ_CTRL.

If the other options aren't sufficiently clean or straightforward, 
another 36 bytes per arm_smmu_device instance isn't the end of the world...

> The first approach (implemented in this series) may be potentially
> slower, whereas the second one could reduce the time taken to resume.
> However, the second one feels hacky as it forces us to cache the msi_msg
> in `arm_smmu_write_msg` or in the core code (`irq_chip_write_msi_msg`)
> as suggested in [1].
>    
> 3. Invoking runtime_pm_get/put
> Given that most of the configuration done by arm-smmu-v3 is stored in
> memory, the initial idea is to focus on areas where the driver accesses
> the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> 
> Instead of wrapping every exposed op with an rpm_get/put, the idea is to
> follow through till a logical common point (lowest common ancestor in
> the call hierarchies) and wrap those with the rpm_get/put calls.

As mentioned already, I think there should be very few, if any, places 
where we actually need an rpm_get() to explicitly resume a suspended 
SMMU. All the initial hardware probing and reset can (and should) be 
done before PM is enabled; IRQ handlers can reasonably assume that any 
call while the SMMU is suspended must be shared or spurious, so don't 
need to look (even more so if we do still clear IRQ_CTRL on suspend); 
any commands issued while suspended are either safe to ignore 
(CFGI/TLBI/PREFETCH) or nonsensical (RESUME/PRI_RESP) - the only one I'm 
not entirely sure about off the top of my head is ATC_INV, which might 
depend on whether the PCI code can guarantee that an endpoint will 
resume with a clean ATC.

I don't see SVA and VFIO needing any special considerations, so it might 
just be the new vIOMMU stuff which may need to hold PM enabled for the 
lifetime of things exposed to userspace.

Thanks,
Robin.

> Future Work
> ===========
> - Elide TLBIs when the SMMU is powered down
> 
> Call for Review
> ================
> Any insights/comments on the proposed changes are appreciated,
> especially in areas related to locking, atomic contexts, early resume
> etc. or any potential optimization.
> 
> Note: The series isn't tested with MSIs as I couldn't find a platform
> that supports MSIs. Any help in testing MSIs is appreciated.
> 
> References
> ===========
> [1]
> https://lore.kernel.org/lkml/20210721013350.17664-1-cuibixuan@huawei.com/T/#m06bf92e2306ba52c106f2d4d24765921d4e9781e
> 
> [2]
> https://lore.kernel.org/all/20180830144541.17740-1-vivek.gautam@codeaurora.org/
> 
> Pranjal Shrivastava (5):
>    iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
>    iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
>    iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
>    iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks
>    iommu/arm-smmu-v3: Invoke pm_runtime before hw access
> 
>   .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     |  22 +-
>   .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  24 ++
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 297 ++++++++++++++++--
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   5 +
>   4 files changed, 326 insertions(+), 22 deletions(-)
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-19 18:22 ` Robin Murphy
@ 2025-03-19 19:46   ` Jason Gunthorpe
  2025-03-20 21:00     ` Pranjal Shrivastava
  2025-03-20 22:28     ` Mostafa Saleh
  2025-03-20 14:13   ` Pranjal Shrivastava
  1 sibling, 2 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-19 19:46 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Pranjal Shrivastava, Joerg Roedel, Will Deacon, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 06:22:30PM +0000, Robin Murphy wrote:
> I don't see SVA and VFIO needing any special considerations, so it might
> just be the new vIOMMU stuff which may need to hold PM enabled for the
> lifetime of things exposed to userspace.

Are you expecting that the VFIO driver, and whatever driver is
managing the SVA will keep the power turned on as necessary then?

VFIO is not going to work if the SMMU power goes off unexpectedly
while the VFIO FD is open.

I suppose VFIO also can power off on-demand via userspace asking
it. In that case the VM would be co-operating.

vcmdq probably needs some suspend resume logic like the normal
commandq.

viommu I think will only uses STE fields so it should be OK?

SVA.. Is similar to VFIO, if the power goes off while the SVA is
working that is bad. The SVA requesting driver would have to manage
this. HW with userspace command submission probably can't allow the
power to go off while command submission is possible.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access
  2025-03-19 12:04   ` Jason Gunthorpe
@ 2025-03-20  7:25     ` Pranjal Shrivastava
  2025-03-20 12:54       ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-20  7:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 09:04:04AM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 19, 2025 at 12:42:54AM +0000, Pranjal Shrivastava wrote:
> > -static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
> > +static int arm_smmu_attach_dev_ste(struct iommu_domain *domain,
> >  				    struct device *dev,
> >  				    struct arm_smmu_ste *ste,
> >  				    unsigned int s1dss)
> >  {
> 
> This can't be changed into something that fails! That would mess up
> all the error handling if it ever blows up.
> 
> > @@ -3214,8 +3287,8 @@ static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
> >  	struct arm_smmu_master master = dev_iommu_priv_get(dev);
> >  
> >  	arm_smmu_make_bypass_ste(master->smmu, &ste);
> > -	arm_smmu_attach_dev_ste(domain, dev, &ste, STRTAB_STE_1_S1DSS_BYPASS);
> > -	return 0;
> > +	return arm_smmu_attach_dev_ste(domain, dev, &ste,
> > +				       STRTAB_STE_1_S1DSS_BYPASS);
> >  }
> >  
> >  static const struct iommu_domain_ops arm_smmu_identity_ops = {
> > @@ -3233,9 +3306,8 @@ static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
> >  	struct arm_smmu_ste ste;
> >  
> >  	arm_smmu_make_abort_ste(&ste);
> > -	arm_smmu_attach_dev_ste(domain, dev, &ste,
> > -				STRTAB_STE_1_S1DSS_TERMINATE);
> > -	return 0;
> > +	return arm_smmu_attach_dev_ste(domain, dev, &ste,
> > +				       STRTAB_STE_1_S1DSS_TERMINATE);
> >  }
> 
> Because these functions are required to be non-failing.
> 
> This feels kind of wonky, if we need to update a STE but the PM is off
> then we just need to update the STE memory with some synchornization
> and do nothing else. We don't need to wake up the HW and issue a cache
> flush if the powered off cache could never have held a copy anyhow.

I agree, that we can elide TLBIs if the smmu is asleep (powered-off),
the main reason to wake up the hw is because `arm_smmu_write_ste` issues
a prefetch CMD. Again, it is possible that the smmu sleeps after this op
and the prefetch has no real benefit.

Maybe having something like rpm_get_if_in_use could potentially help
here, wherein we can check if the smmu is powered ON before going ahead
with issuing all the commands. Although, in this case we might lose
slight performance due to wake-ups caused by other clients, for example:

1. SMMU is powered off, client1 attempts to attach to a domain
2. We elide the prefetch CMD since SMMU is powered off
3. Now, another client2 served by the same SMMU, powers on the SMMU

I'm not sure how much of a performance loss this is and also on systems
with IO-coherent SMMUs (where the SMMU can snoop into CPU caches) this
may not be a huge drop, but I'd like to discuss the possibility..

Anyway, having *any* rpm implementation that causes us to lose TLB
content, will cause a hit on the performance, so maybe this
micro-optimization isn't worth at all?

Also, while we're at it, should we have a feature switch for users who'd
like to disable runtime PM for performance reasons? I don't think that
relying on the `power-domain` being present alone is sufficient.

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access
  2025-03-20  7:25     ` Pranjal Shrivastava
@ 2025-03-20 12:54       ` Jason Gunthorpe
  2025-03-20 13:22         ` Robin Murphy
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 12:54 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 07:25:20AM +0000, Pranjal Shrivastava wrote:

> I agree, that we can elide TLBIs if the smmu is asleep (powered-off),
> the main reason to wake up the hw is because `arm_smmu_write_ste` issues
> a prefetch CMD. Again, it is possible that the smmu sleeps after this op
> and the prefetch has no real benefit.

Does any HW caching survive the power off? I was assuming no.

So it is pointless to issue a prefetch and then immediately power off
the SMMU and throw away the STE cache.

IOW if you don't know that the SMMU will stay powered up after the
attach there is no reason to do a prefetch.

> Anyway, having *any* rpm implementation that causes us to lose TLB
> content, will cause a hit on the performance, so maybe this
> micro-optimization isn't worth at all?

Right, it was my assumption that you loose the caches. I would guess
keeping all the cache SRAM powered is a big part of the power budget...

> Also, while we're at it, should we have a feature switch for users who'd
> like to disable runtime PM for performance reasons? I don't think that
> relying on the `power-domain` being present alone is sufficient.

I'm worried about how much this will hurt server workloads. Optimizing
invalidation is a worry and you've gone and put in some more atomic
write traffic on shared cache lines. Not great :\

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access
  2025-03-20 12:54       ` Jason Gunthorpe
@ 2025-03-20 13:22         ` Robin Murphy
  2025-03-20 14:21           ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Robin Murphy @ 2025-03-20 13:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Nicolin Chen, Mostafa Saleh,
	Daniel Mentz, iommu

On 20/03/2025 12:54 pm, Jason Gunthorpe wrote:
> On Thu, Mar 20, 2025 at 07:25:20AM +0000, Pranjal Shrivastava wrote:
> 
>> I agree, that we can elide TLBIs if the smmu is asleep (powered-off),
>> the main reason to wake up the hw is because `arm_smmu_write_ste` issues
>> a prefetch CMD. Again, it is possible that the smmu sleeps after this op
>> and the prefetch has no real benefit.
> 
> Does any HW caching survive the power off? I was assuming no.
> 
> So it is pointless to issue a prefetch and then immediately power off
> the SMMU and throw away the STE cache.
> 
> IOW if you don't know that the SMMU will stay powered up after the
> attach there is no reason to do a prefetch.

And even if the caches don't lose state in hardware, we'll explicitly 
clear them on resume anyway. Furthermore I'm not aware of any current 
SMMU implementations where the prefetch command does anything meaningful 
anyway, so right now this is absolutely a non-issue.

>> Anyway, having *any* rpm implementation that causes us to lose TLB
>> content, will cause a hit on the performance, so maybe this
>> micro-optimization isn't worth at all?
> 
> Right, it was my assumption that you loose the caches. I would guess
> keeping all the cache SRAM powered is a big part of the power budget...
> 
>> Also, while we're at it, should we have a feature switch for users who'd
>> like to disable runtime PM for performance reasons? I don't think that
>> relying on the `power-domain` being present alone is sufficient.
> 
> I'm worried about how much this will hurt server workloads. Optimizing
> invalidation is a worry and you've gone and put in some more atomic
> write traffic on shared cache lines. Not great :\

Indeed, that was the reason for wrapping the RPM calls in explicit 
pm_runtime_enabled() checks in SMMUv2 - see the comment in 
arm_smmu_device_probe() - on the assumption that servers wouldn't 
advertise a power domain for the SMMU (not least because we've no 
support for such a thing under ACPI anyway).

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-19 18:22 ` Robin Murphy
  2025-03-19 19:46   ` Jason Gunthorpe
@ 2025-03-20 14:13   ` Pranjal Shrivastava
  2025-03-20 14:54     ` Jason Gunthorpe
  1 sibling, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-20 14:13 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, Will Deacon, Jason Gunthorpe, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 06:22:30PM +0000, Robin Murphy wrote:
> On 19/03/2025 12:42 am, Pranjal Shrivastava wrote:
> > As arm-smmu-v3 rapidly finds its way into SoCs designed for hand-held
> > devices, power management capabilities, similar to its predecessors, are
> > crucial for these applications. This series introduces power management
> > support for the arm-smmu-v3 driver.
> > 
> > Design
> > =======
> > The arm-smmu-v3 primarily operates with in-memory data structures
> > through HW registers pointing to these data structures in some fashion.
> > The proposed design tries to make use of this fact for implementing the
> > suspend and resume ops.
> > 
> > 1. Suspend / Resume
> > An initial idea for the "suspend" op is to wait till the command queue
> > finishes all commands before disabling the SMMU through CR0. In order to
> > avoid mis-use / spurious transactions (b/w SMMU disable -> power-down),
> > the GBPA register is configured to abort all transactions.
> 
> But why disable SMMUEN at all? We can't expect the SMMU to come back in the
> same configuration we left it anyway (*including* GBPA, note), so what's the
> benefit of spending time and effort to potentially change its behaviour for
> a handful of microseconds before it all goes dark anyway?

Just for thwarting any potential security issues, since the clients
*may* be running their own firmware (something like a GPU maybe?) and in
case __that__ firmware is compromised, windows like this could
potentially be a way to inject an attack vector into the RAM.

Since, GBPA.Abort bit resets to an "IMPLEMENTATION DEFINED" value, there
might be implementations that take care of setting the ABORT bit with
Power-ON Reset.

I may be thinking too much here, let me know if this needs to be dropped?

> 
> Draining the command queue shouldn't strictly be necessary it only contains
> invalidations and syncs, since we know all the cached state they could refer
> to is going to be thrown away by the time the SMMU has come back anyway. If

More than just SMMU's TLBIs, I think we'd still need the ATC
invalidations to go through, right? As the ATC is present in the PCIE_EP
and if a translation is cached in the ATC, AFAIK, the EP gets to bypass
the SMMU, which could potentially lead to mis-translations...

> anything, making sure there are no outstanding IOPF handlers and the event
> queue and PRI queue are clear (as expected) seems more important, because
> the potential to lose hardware tokens or leave a stalled transaction across
> the interconnect would be A Bad Thing.

Right, agreed. I'll drain the eventq and priq in the next version.
But I think we'll still need to drain the CMDQ, since there might be a
RESUME_OP or PRI_RESP pending to be consumed? Or are you suggesting no
client driver would drop a ref if a transaction is stalled?

Also, we'd need to maintain the prod / cons for all the queues, since
the arm_smmu_device_reset will re-init the HW regs from values stored in
llq->prod/cons, if those values aren't same, it may trigger consumption.

> 
> > The resume operation uses the `arm_smmu_device_reset` function which
> > re-initializes the HW using the SW-copies maintained by the driver. For
> > example, prod/cons for queues, base addresses for queues & tables. The
> > arm_smmu_device_reset also clears the TLBs.
> > 
> > 2. Interrupt Re-config
> > a. Wired irqs: the series refactors the `arm_smmu_setup_irqs` to be
> > able to enable/disable irqs and install their handlers separately to
> > help with the re-initialization of the interrupts correctly.
> > 
> > b. MSIs: The thought was of 2 approaches to teardown & re-config MSIs:
> > 
> >   1. Free MSIs on suspend and re-alloc on resume (implemented in series)
> >   2. Meddle with the msi_desc and use get_cached_msi_msg to resume MSIs.
> 
> 3: Just explicitly save/restore the IRQ_CFG{0,1} registers in the PM
> callbacks.
> 4: Have arm_smmu_write_msi_msg() stash the values, and defer the actual
> IRQ_CFG register writes until arm_smmu_setup_irqs(), alongside IRQ_CTRL.
> 
> If the other options aren't sufficiently clean or straightforward, another
> 36 bytes per arm_smmu_device instance isn't the end of the world...
> 

Ack. I had thought of something like desc->msg = *msg; in
arm_smmu_write_msi_msg, and then get_cached_msi_msg() in resume but it
felt hacky.. If we have consensus, then I'd like to use option 4. and
maybe cache the data in the arm_smmu_write_msi_msg() itself. Something
like:

irq_chip_write_msi_msg() 
  -> platform_msi_write_msi_msg
    -> arm_smmu_write_msi_msg()
      (desc->msg = *msg;)

then get_cached_msi_msg -> call arm_smmu_write_msi_msg in resume cb.
(cooked up a diff and attached at the end).

> > The first approach (implemented in this series) may be potentially
> > slower, whereas the second one could reduce the time taken to resume.
> > However, the second one feels hacky as it forces us to cache the msi_msg
> > in `arm_smmu_write_msg` or in the core code (`irq_chip_write_msi_msg`)
> > as suggested in [1].
> > 3. Invoking runtime_pm_get/put
> > Given that most of the configuration done by arm-smmu-v3 is stored in
> > memory, the initial idea is to focus on areas where the driver accesses
> > the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> > 
> > Instead of wrapping every exposed op with an rpm_get/put, the idea is to
> > follow through till a logical common point (lowest common ancestor in
> > the call hierarchies) and wrap those with the rpm_get/put calls.
> 
> As mentioned already, I think there should be very few, if any, places where
> we actually need an rpm_get() to explicitly resume a suspended SMMU. All the
> initial hardware probing and reset can (and should) be done before PM is
> enabled; IRQ handlers can reasonably assume that any call while the SMMU is
> suspended must be shared or spurious, so don't need to look (even more so if
> we do still clear IRQ_CTRL on suspend); any commands issued while suspended
> are either safe to ignore (CFGI/TLBI/PREFETCH) or nonsensical

Ack. I think we can convert all IRQ handlers to use rpm_get_if_active
because there's a chance where we get an event, the threaded_irq is
scheduled and before it's executed to completion all the refs are
dropped and the SMMU is powered off.

Regarding TLBI/CFGI/Prefetch, I agree that it makes sense to go ahead
with them only if the SMMU is powered ON, otherwise elide everything.

> (RESUME/PRI_RESP) - the only one I'm not entirely sure about off the top of
> my head is ATC_INV, which might depend on whether the PCI code can guarantee
> that an endpoint will resume with a clean ATC.
> 

Exactly, and are we sure that the EP will be *really* suspended with the
SMMU? Even if the kernel assumes that a PCIe EP is down, it may still be
partially active or not even have a *real* ATC but just a firmware
implementation... for such cases, I believe we should explicitly wake up
for ATC invalidations atleast.

> I don't see SVA and VFIO needing any special considerations, so it might
> just be the new vIOMMU stuff which may need to hold PM enabled for the
> lifetime of things exposed to userspace.
> 

Hmm..so maybe just hold a ref till it's destroyed? 

Thanks,
Praan

==============================================================================
The proposed MSI diff:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f8d74ccaed6a..58eb8af289fe 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -4020,6 +4020,9 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
 	phys_addr_t *cfg = arm_smmu_msi_cfg[desc->msi_index];
 
+	/* Save the msi_msg for resume */
+	desc->msg = *msg;
+
 	doorbell = (((u64)msg->address_hi) << 32) | msg->address_lo;
 	doorbell &= MSI_CFG0_ADDR_MASK;
 
@@ -4028,6 +4031,38 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 	writel_relaxed(ARM_SMMU_MEMATTR_DEVICE_nGnRE, smmu->base + cfg[2]);
 }
 
+static void arm_smmu_resume_msis(struct arm_smmu_device *smmu)
+{
+	struct msi_desc *desc;
+	struct msi_msg msg;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_MSI))
+		return;
+
+	if (!smmu->dev->msi.domain)
+		return;
+
+	desc = irq_get_msi_desc(smmu->evtq.q.irq);
+	if (desc) {
+		get_cached_msi_msg(smmu->evtq.q.irq, &msg);
+		arm_smmu_write_msi_msg(desc, &msg);
+	}
+
+	desc = irq_get_msi_desc(smmu->gerr_irq);
+	if (desc) {
+		get_cached_msi_msg(smmu->gerr_irq, &msg);
+		arm_smmu_write_msi_msg(desc, &msg);
+	}
+
+	if (smmu->features & ARM_SMMU_FEAT_PRI) {
+		desc = irq_get_msi_desc(smmu->priq.q.irq);
+		if (desc) {
+			get_cached_msi_msg(smmu->priq.q.irq, &msg);
+			arm_smmu_write_msi_msg(desc, &msg);
+		}
+	}
+}
+
 static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
 {
 	int ret, nvec = ARM_SMMU_MAX_MSIS;
@@ -5018,6 +5053,7 @@ static int __maybe_unused arm_smmu_runtime_resume(struct device *dev)
 	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
 
 	dev_dbg(dev, "Resuming device\n");
+	arm_smmu_resume_msis(smmu);
 
 	/*
 	 * The reset will re-initialize all the queues with the base addr,


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access
  2025-03-20 13:22         ` Robin Murphy
@ 2025-03-20 14:21           ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-20 14:21 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 01:22:07PM +0000, Robin Murphy wrote:
> On 20/03/2025 12:54 pm, Jason Gunthorpe wrote:
> > On Thu, Mar 20, 2025 at 07:25:20AM +0000, Pranjal Shrivastava wrote:
> > 
> > > I agree, that we can elide TLBIs if the smmu is asleep (powered-off),
> > > the main reason to wake up the hw is because `arm_smmu_write_ste` issues
> > > a prefetch CMD. Again, it is possible that the smmu sleeps after this op
> > > and the prefetch has no real benefit.
> > 
> > Does any HW caching survive the power off? I was assuming no.
> > 
> > So it is pointless to issue a prefetch and then immediately power off
> > the SMMU and throw away the STE cache.
> > 
> > IOW if you don't know that the SMMU will stay powered up after the
> > attach there is no reason to do a prefetch.
> 
> And even if the caches don't lose state in hardware, we'll explicitly clear
> them on resume anyway. Furthermore I'm not aware of any current SMMU
> implementations where the prefetch command does anything meaningful anyway,
> so right now this is absolutely a non-issue.
> 

Ack. I agree, for the next version, I'll use rpm_get_if_active for
any TLBIs / CFGIs and Prefetches and otherwise elide it.

> > > Anyway, having *any* rpm implementation that causes us to lose TLB
> > > content, will cause a hit on the performance, so maybe this
> > > micro-optimization isn't worth at all?
> > 
> > Right, it was my assumption that you loose the caches. I would guess
> > keeping all the cache SRAM powered is a big part of the power budget...
> > 

Right. I think it's fair to assume that SRAMs aren't retained, even if
they are, we anyway clean everything on resume.

> > > Also, while we're at it, should we have a feature switch for users who'd
> > > like to disable runtime PM for performance reasons? I don't think that
> > > relying on the `power-domain` being present alone is sufficient.
> > 
> > I'm worried about how much this will hurt server workloads. Optimizing
> > invalidation is a worry and you've gone and put in some more atomic
> > write traffic on shared cache lines. Not great :\
> 
> Indeed, that was the reason for wrapping the RPM calls in explicit
> pm_runtime_enabled() checks in SMMUv2 - see the comment in
> arm_smmu_device_probe() - on the assumption that servers wouldn't advertise
> a power domain for the SMMU (not least because we've no support for such a
> thing under ACPI anyway).
> 

Yes, SMMUv2 is what I referred to for this. So..just checking for
pm_runtime_enabled based on dev->power-domain should suffice then?

> Thanks,
> Robin.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-20 14:13   ` Pranjal Shrivastava
@ 2025-03-20 14:54     ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 14:54 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Robin Murphy, Joerg Roedel, Will Deacon, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 02:13:08PM +0000, Pranjal Shrivastava wrote:

> I may be thinking too much here, let me know if this needs to be dropped?

IMHO you are correct to worry about this for security. During power
management the translation should not change to bypass. Either abort
or the attached domain only. I would document this in comments as it
is an important detail.

Eventually this may be operated from a VM and we do not want to see
any hole where the VM could abuse the power management and somehow be
able to DMA into a bypass as a race condition.

> > Draining the command queue shouldn't strictly be necessary it only contains
> > invalidations and syncs, since we know all the cached state they could refer
> > to is going to be thrown away by the time the SMMU has come back anyway. If
> 
> More than just SMMU's TLBIs, I think we'd still need the ATC
> invalidations to go through, right? As the ATC is present in the PCIE_EP
> and if a translation is cached in the ATC, AFAIK, the EP gets to bypass
> the SMMU, which could potentially lead to mis-translations...

Yes, you cannot just loose ATC invalidations unless you have a way to
guarentee that the power management cleaned the PCI device cache.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-19 19:46   ` Jason Gunthorpe
@ 2025-03-20 21:00     ` Pranjal Shrivastava
  2025-03-20 23:08       ` Jason Gunthorpe
  2025-03-20 22:28     ` Mostafa Saleh
  1 sibling, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-20 21:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Joerg Roedel, Will Deacon, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 04:46:09PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 19, 2025 at 06:22:30PM +0000, Robin Murphy wrote:
> > I don't see SVA and VFIO needing any special considerations, so it might
> > just be the new vIOMMU stuff which may need to hold PM enabled for the
> > lifetime of things exposed to userspace.
> 
> Are you expecting that the VFIO driver, and whatever driver is
> managing the SVA will keep the power turned on as necessary then?
> 
> VFIO is not going to work if the SMMU power goes off unexpectedly
> while the VFIO FD is open.
> 

In that case, why not hold a ref till the FD is closed? When all the
refs are dropped, we can safely power it down?

> I suppose VFIO also can power off on-demand via userspace asking
> it. In that case the VM would be co-operating.
> 

I'm wondering if user-space / VMM would be willing to control runtime
power at this point? Maybe we could have n VMMs vote when they need
power and drop the votes later and we can simply rpm_get if votes > 0

> vcmdq probably needs some suspend resume logic like the normal
> commandq.
> 

Ack.

> viommu I think will only uses STE fields so it should be OK?
> 

What about IOMMUFD stuff like the Get HW info IOCTL?

> SVA.. Is similar to VFIO, if the power goes off while the SVA is
> working that is bad. The SVA requesting driver would have to manage
> this. HW with userspace command submission probably can't allow the
> power to go off while command submission is possible.
> 

Hmm.. maybe then rpm_get() 

> Jason
Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-19 16:07   ` Robin Murphy
@ 2025-03-20 22:25     ` Mostafa Saleh
  2025-03-21 14:18       ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-20 22:25 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jason Gunthorpe, Pranjal Shrivastava, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 04:07:57PM +0000, Robin Murphy wrote:
> On 19/03/2025 11:57 am, Jason Gunthorpe wrote:
> > On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:
> > 
> > > 3. Invoking runtime_pm_get/put
> > > Given that most of the configuration done by arm-smmu-v3 is stored in
> > > memory, the initial idea is to focus on areas where the driver accesses
> > > the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> > 
> > This seems weird, if the SMMU is suspended doesn't it also fail DMA
> > transactions? Why would ops like flush even be called if the HW is
> > disabled?
> 
> Because once the device has finished its operation, its driver is free to
> call rpm_put() before calling dma_unmap(), so by the time that gets as far
> as TLB maintenance, the SMMU may already be asleep as well if that device
> was the only thing keeping it awake.
> 
> For direct IOMMU API users, pagetable update may be even more asynchronous
> from device activity, e.g. a GPU buffer might only be unmapped once
> userspace closes the last file handle referencing it, long after the GPU
> itself has moved on to other things.
> 
> > flush is performance path stuff, so it doesn't seem great to be adding
> > extra calls there.
> 
> That much is true - this really wants to be using pm_runtime_get_if_in_use()
> nearly everywhere such that at most it's just juggling refcounts. There's no
> point waking the SMMU up just to issue a CFGI or TLBI, if the act of doing
> so is inherently going to do a full arm_smmu_reset() and thus invalidate
> everything anyway.

AFAICT, there is no guarantees that caches are clean on system resume,
but as we do invalidate everything that should be fine, but I am not sure
how that works with distributed SMMUs where the TBU can still be powered
with some TLBs that can be invalid?

Thanks,
Mostafa
> 
> > > Instead of wrapping every exposed op with an rpm_get/put, the idea is to
> > > follow through till a logical common point (lowest common ancestor in
> > > the call hierarchies) and wrap those with the rpm_get/put calls.
> > 
> > Should the iommu core be doing this instead of hidden in drivers? It
> > seems like there might be better information there about when we
> > expect the IOMMU to be powered up and working and when it is OK to
> > have it shutdown?
> 
> On the contrary, I would say it's very much driver-level knowledge as to
> which operations need the hardware accessible or not, and trying to do it at
> the core level would inevitably end up very pessimistic and inefficient, or
> at best just horribly complex.
> 
> > And maybe in this direction we can do things like remove the
> > translation before doing a suspend so that flush ops are not a
> > problem.
> 
> Case in point: that would be a load of extra work to do, make suspend/resume
> even slower, and even then for no benefit, since map_pages/umnmap_pages can
> still be called on the unattached domain (especially a default DMA domain),
> wherein drivers may still end up trying to touch their hardware for various
> reasons in ways invisible to the core API. Easy example: consider a driver
> calling dma_alloc_coherent() to prepare a buffer *before* it wakes up its
> client device (which would then also wake up the SMMU via the device link),
> and deep within that call, arm_lpae_init_pte() ends up calling back into
> arm_smmu_tlb_inv_walk() because it's laying down a block mapping where an
> old empty table PTE happens to be...
> 
> Thanks,
> Robin.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-19 19:46   ` Jason Gunthorpe
  2025-03-20 21:00     ` Pranjal Shrivastava
@ 2025-03-20 22:28     ` Mostafa Saleh
  2025-03-20 23:05       ` Jason Gunthorpe
  1 sibling, 1 reply; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-20 22:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Pranjal Shrivastava, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 04:46:09PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 19, 2025 at 06:22:30PM +0000, Robin Murphy wrote:
> > I don't see SVA and VFIO needing any special considerations, so it might
> > just be the new vIOMMU stuff which may need to hold PM enabled for the
> > lifetime of things exposed to userspace.
> 
> Are you expecting that the VFIO driver, and whatever driver is
> managing the SVA will keep the power turned on as necessary then?
> 
> VFIO is not going to work if the SMMU power goes off unexpectedly
> while the VFIO FD is open.
> 
> I suppose VFIO also can power off on-demand via userspace asking
> it. In that case the VM would be co-operating.

I guess VFIO is tricky as there is a security boundary. In case the
SMMUv3 wakes up with caches not clean, there is a window of time where
the device is accessible from userspace with old (or garbage) TLBs.
I see VFIO-platform always take a PM reference for the device at
open(), but I am not sure how that works with VFIO-pci.

Thanks,
Mostafa

> 
> vcmdq probably needs some suspend resume logic like the normal
> commandq.
> 
> viommu I think will only uses STE fields so it should be OK?
> 
> SVA.. Is similar to VFIO, if the power goes off while the SVA is
> working that is bad. The SVA requesting driver would have to manage
> this. HW with userspace command submission probably can't allow the
> power to go off while command submission is possible.
> 
> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-03-19  0:42 ` [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
  2025-03-19  4:50   ` Nicolin Chen
@ 2025-03-20 22:29   ` Mostafa Saleh
  2025-03-21  7:26     ` Pranjal Shrivastava
  2025-03-25 16:19   ` Daniel Mentz
  2 siblings, 1 reply; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-20 22:29 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:42:50AM +0000, Pranjal Shrivastava wrote:
> Refactor arm_smmu_setup_irqs by splitting it into two parts, one for
> registering interrupt handlers and the other one for enabling interrupt
> generation in the hardware. This refactor helps in re-initialization of
> hardware interrupts as part of a subsequent patch that enables runtime
> power management for the arm-smmu-v3 driver.
> 
> Signed-off-by: Pranjal Shrivastava <praan@google.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++++++++++++++++-----
>  1 file changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 358072b4e293..80362d9f293b 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3985,12 +3985,24 @@ static void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu)
>  	}
>  }
>  
> -static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> +static void arm_smmu_enable_irqs(struct arm_smmu_device *smmu)
>  {
> -	int ret, irq;
> +	int ret;
>  	u32 irqen_flags = IRQ_CTRL_EVTQ_IRQEN | IRQ_CTRL_GERROR_IRQEN;
>  
> -	/* Disable IRQs first */
> +	if (smmu->features & ARM_SMMU_FEAT_PRI)
> +		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
> +
> +	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
> +				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
> +	if (ret)
> +		dev_warn(smmu->dev, "failed to enable irqs\n");
> +}
> +
> +static int arm_smmu_disable_irqs(struct arm_smmu_device *smmu)
> +{
> +	int ret;
> +
>  	ret = arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
>  				      ARM_SMMU_IRQ_CTRLACK);
>  	if (ret) {
> @@ -3998,6 +4010,19 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
>  		return ret;
>  	}
>  
> +	return 0;
> +}
> +
> +static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> +{
> +	int ret, irq;
> +
> +	/* Disable IRQs first */
> +	ret = arm_smmu_disable_irqs(smmu);
> +
> +	if (ret)
> +		return ret;
> +
>  	irq = smmu->combined_irq;
>  	if (irq) {
>  		/*
> @@ -4014,15 +4039,6 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
>  	} else
>  		arm_smmu_setup_unique_irqs(smmu);
>  
> -	if (smmu->features & ARM_SMMU_FEAT_PRI)
> -		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
> -
> -	/* Enable interrupt generation on the SMMU */
> -	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
> -				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
> -	if (ret)
> -		dev_warn(smmu->dev, "failed to enable irqs\n");
> -
>  	return 0;
>  }
>  
> @@ -4171,6 +4187,9 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
>  		return ret;
>  	}
>  
> +	/* Enable interrupt generation on the SMMU */
> +	arm_smmu_enable_irqs(smmu);
> +
>  	if (is_kdump_kernel())
>  		enables &= ~(CR0_EVTQEN | CR0_PRIQEN);
>  
> @@ -4754,6 +4773,13 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
>  	/* Check for RMRs and install bypass STEs if any */
>  	arm_smmu_rmr_install_bypass_ste(smmu);
>  
> +	/* Setup interrupt handlers */
> +	ret = arm_smmu_setup_irqs(smmu);
> +	if (ret) {
> +		dev_err(smmu->dev, "failed to setup irqs\n");

Every path that can fail in this function has a print already, I’d say
remove the ones inside and keep this one, that also simplifies
arm_smmu_disable_irqs() to one line.

Otherwise, besides Nicolin’s comment, it looks good.

Thanks,
Mostafa

> +		return ret;
> +	}
> +
>  	/* Reset the device */
>  	ret = arm_smmu_device_reset(smmu);
>  	if (ret)
> -- 
> 2.49.0.rc1.451.g8f38331e32-goog
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
  2025-03-19  0:42 ` [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains Pranjal Shrivastava
@ 2025-03-20 22:30   ` Mostafa Saleh
  2025-03-21  8:09     ` Pranjal Shrivastava
  2025-03-25 17:50   ` Daniel Mentz
  2025-03-26  4:51   ` Daniel Mentz
  2 siblings, 1 reply; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-20 22:30 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:42:51AM +0000, Pranjal Shrivastava wrote:
> Before we suspend SMMU, we want to ensure that all commands have been
> consumed by the SMMU's command queue, i.e. the cmdq is empty. Add a
> helper function that waits till the cmdq is empty using `queue_empty`.
> 
> Signed-off-by: Pranjal Shrivastava <praan@google.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 80362d9f293b..f65d9bca0392 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -754,6 +754,30 @@ static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
>  	return __arm_smmu_cmdq_poll_until_consumed(smmu, cmdq, llq);
>  }
>  
> +/*
> + * Wait until the SMMU cmdq is empty
> + * Must be called with the cmdq lock held in some capacity.
> + */
> +static int arm_smmu_cmdq_poll_until_empty(struct arm_smmu_device *smmu,
> +					  struct arm_smmu_cmdq *cmdq,
> +					  struct arm_smmu_ll_queue *llq)
> +{
> +	struct arm_smmu_queue_poll qp;
> +	int ret = 0;
> +
> +	queue_poll_init(smmu, &qp);
> +	llq->val = READ_ONCE(cmdq->q.llq.val);
> +	do {
> +		if (queue_empty(llq))
> +			break;
> +
> +		ret = queue_poll(&qp);
> +		llq->cons = readl(cmdq->q.cons_reg);

If the command queue is not empty, that means some other thread is already
polling it, so I guess there is no need to do any MMIO here, just poll
the SW view of prod/cons, something similar to arm_smmu_cmdq_poll_until_not_full()

Thanks,
Mostafa


> +	} while (!ret);
> +
> +	return ret;
> +}
> +
>  static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
>  					u32 prod, int n)
>  {
> -- 
> 2.49.0.rc1.451.g8f38331e32-goog
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-03-19  0:42 ` [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
@ 2025-03-20 22:33   ` Mostafa Saleh
  2025-03-21  8:13     ` Pranjal Shrivastava
  2025-03-26  4:52   ` Daniel Mentz
  2025-04-14 17:57   ` Nicolin Chen
  2 siblings, 1 reply; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-20 22:33 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:43 AM Pranjal Shrivastava <praan@google.com> wrote:
>
> Implement pm_runtime and system sleep ops for arm-smmu-v3. The smmu is
> disabled as part of the suspend callbacks after ensuring the completion
> of all pending commands and is configured to abort any transactions that
> happen after disabling the smmu. The smmu shall be reinitialized in the
> resume callback by invoking the `arm_smmu_device_reset` helper.
>
> The MSIs are freed as part of the suspend and are re-allocated during
> the resume operation.
>
> Signed-off-by: Pranjal Shrivastava <praan@google.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 106 ++++++++++++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   2 +
>  2 files changed, 108 insertions(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index f65d9bca0392..caf750470772 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -26,6 +26,7 @@
>  #include <linux/pci.h>
>  #include <linux/pci-ats.h>
>  #include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/string_choices.h>
>  #include <kunit/visibility.h>
>  #include <uapi/linux/iommufd.h>
> @@ -108,6 +109,32 @@ static const char * const event_class_str[] = {
>
>  static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
>
> +static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
> +{
> +       int ret;
> +
> +       if (pm_runtime_enabled(smmu->dev)) {
> +               ret = pm_runtime_resume_and_get(smmu->dev);
> +               if (ret < 0) {
> +                       dev_err(smmu->dev, "Failed to resume device: %d\n", ret);
> +                       return ret;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
> +{
> +       int ret;
> +
> +       if (pm_runtime_enabled(smmu->dev)) {
> +               ret = pm_runtime_put_autosuspend(smmu->dev);
> +               if (ret < 0)
> +                       dev_err(smmu->dev, "Failed to suspend device: %d\n", ret);
> +       }
> +}
> +
>  static void parse_driver_options(struct arm_smmu_device *smmu)
>  {
>         int i = 0;
> @@ -4850,6 +4877,84 @@ static void arm_smmu_device_shutdown(struct platform_device *pdev)
>         arm_smmu_device_disable(smmu);
>  }
>
> +
> +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> +{
> +       struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> +
> +       /* We might get the vcmdq */
> +       struct arm_smmu_cmdq_ent cmd = {
> +               .opcode = smmu->features & ARM_SMMU_FEAT_E2H ?
> +                       CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> +       };
> +
> +       struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> +       struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> +
> +       /*
> +        * Since suspend is invoked when all clients have been
> +        * we don't expect more commands to be added to the cmdq.
> +        * Thus, wait for all existing commands to complete.
> +        */
> +       arm_smmu_cmdq_shared_lock(cmdq);
> +       arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> +       arm_smmu_cmdq_shared_unlock(cmdq);
> +
> +       /* Disable all queues */
> +       arm_smmu_device_disable(smmu);
> +
> +       /* Abort all transactions to avoid spurious bypass */
> +       arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
> +
> +       /* Free all MSIs (re-allocated on resume) */
> +       arm_smmu_free_msis(dev);
> +
> +       dev_dbg(dev, "Suspending smmu\n");
> +       return 0;
> +}
> +
> +static int __maybe_unused arm_smmu_runtime_resume(struct device *dev)
> +{
> +       int ret;
> +       struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> +
> +       dev_dbg(dev, "Resuming device\n");
> +
> +       /*
> +        * The reset will re-initialize all the queues with the base addr,
> +        * prod and cons maintained within struct arm_smmu_device as well as
> +        * re-wire the IRQs and/or enable MSIs and install relevant handlers.
> +        */
> +       ret = arm_smmu_device_reset(smmu);
> +
> +       if (ret)
> +               dev_err(dev, "Failed to reset during resume operation: %d\n", ret);
> +
> +       return ret;
> +}
> +
> +static int __maybe_unused arm_smmu_pm_suspend(struct device *dev)
> +{
> +       if (pm_runtime_suspended(dev))
> +               return 0;
> +
> +       return arm_smmu_runtime_suspend(dev);
> +}
> +
> +static int __maybe_unused arm_smmu_pm_resume(struct device *dev)
> +{
> +       if (pm_runtime_suspended(dev))
> +               return 0;
> +
> +       return arm_smmu_runtime_resume(dev);
> +}
> +
> +static const struct dev_pm_ops arm_smmu_pm_ops = {
> +       SET_SYSTEM_SLEEP_PM_OPS(arm_smmu_pm_suspend, arm_smmu_pm_resume)

I guess, as there is no specific logic in the system suspend/resume handlers,
SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
would be enough?

Thanks,
Mostafa

> +       SET_RUNTIME_PM_OPS(arm_smmu_runtime_suspend,
> +                          arm_smmu_runtime_resume, NULL)
> +};
> +
>  static const struct of_device_id arm_smmu_of_match[] = {
>         { .compatible = "arm,smmu-v3", },
>         { },
> @@ -4866,6 +4971,7 @@ static struct platform_driver arm_smmu_driver = {
>         .driver = {
>                 .name                   = "arm-smmu-v3",
>                 .of_match_table         = arm_smmu_of_match,
> +               .pm                     = &arm_smmu_pm_ops,
>                 .suppress_bind_attrs    = true,
>         },
>         .probe  = arm_smmu_device_probe,
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index bd9d7c85576a..6bac83a511e2 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -505,6 +505,8 @@ static inline unsigned int arm_smmu_cdtab_l2_idx(unsigned int ssid)
>  #define MSI_IOVA_BASE                  0x8000000
>  #define MSI_IOVA_LENGTH                        0x100000
>
> +#define RPM_AUTOSUSPEND_DELAY_MS       15
> +
>  enum pri_resp {
>         PRI_RESP_DENY = 0,
>         PRI_RESP_FAIL = 1,
> --
> 2.49.0.rc1.451.g8f38331e32-goog
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 4/5] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks
  2025-03-19  0:42 ` [RFC PATCH 4/5] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
@ 2025-03-20 22:34   ` Mostafa Saleh
  0 siblings, 0 replies; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-20 22:34 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:42:53AM +0000, Pranjal Shrivastava wrote:
> Enable PM runtime for SMMUs having a power-domain during smmu probe.
> Add a devlink between the clients and SMMU device. The absence of a
> power domain effectively disables runtime power management.
> 
> Signed-off-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>

> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index caf750470772..602784e84e84 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3515,6 +3515,9 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
>  		pci_prepare_ats(to_pci_dev(dev), stu);
>  	}
>  
> +	device_link_add(dev, smmu->dev,
> +			DL_FLAG_PM_RUNTIME | DL_FLAG_AUTOREMOVE_SUPPLIER);
> +
>  	return &smmu->iommu;
>  
>  err_free_master:
> @@ -4836,11 +4839,18 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
>  	if (ret)
>  		goto err_disable;
>  
> +	if (dev->pm_domain) {
> +		pm_runtime_set_active(dev);
> +		pm_runtime_use_autosuspend(dev);
> +		pm_runtime_set_autosuspend_delay(dev, RPM_AUTOSUSPEND_DELAY_MS);
> +		pm_runtime_enable(dev);
> +	}
> +
>  	/* And we're up. Go go go! */
>  	ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
>  				     "smmu3.%pa", &ioaddr);
>  	if (ret)
> -		goto err_disable;
> +		goto err_rpm_disable;
>  
>  	ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
>  	if (ret) {
> @@ -4852,6 +4862,8 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
>  
>  err_free_sysfs:
>  	iommu_device_sysfs_remove(&smmu->iommu);
> +err_rpm_disable:
> +	pm_runtime_disable(dev);
>  err_disable:
>  	arm_smmu_device_disable(smmu);
>  err_free_iopf:
> -- 
> 2.49.0.rc1.451.g8f38331e32-goog
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access
  2025-03-19  0:42 ` [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
  2025-03-19 12:04   ` Jason Gunthorpe
@ 2025-03-20 22:36   ` Mostafa Saleh
  1 sibling, 0 replies; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-20 22:36 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:42:54AM +0000, Pranjal Shrivastava wrote:
> Invoke the pm_runtime helpers at all places before accessing the hw.
> The idea is to invoke runtime_pm helpers at common points which are
> used by exposed ops or interrupt handlers. For cases without such a
> common point, simply wrap the exposed op/handler with rpm_get/put calls.
> 

Wouldn’t it be better to call this before/after map/unmap? Instead
of having it in low level invalidation functions, that would save
calling it a bunch of times. (although it’s not common to hit a tlb
invalidation on map)
This is also similar to the SMMUv2 driver.

Thanks,
Mostafa

> Signed-off-by: Pranjal Shrivastava <praan@google.com>
> ---
>  .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     |  22 +++-
>  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  24 ++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 107 ++++++++++++++++--
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   3 +
>  4 files changed, 145 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> index 5aa2e7af58b4..1d423fc2cf18 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> @@ -13,10 +13,17 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
>  	struct iommu_hw_info_arm_smmuv3 *info;
>  	u32 __iomem *base_idr;
>  	unsigned int i;
> +	int ret;
> +
> +	ret = arm_smmu_rpm_get(master->smmu);
> +	if (ret < 0)
> +		return ERR_PTR(-EIO);
>  
>  	info = kzalloc(sizeof(*info), GFP_KERNEL);
> -	if (!info)
> +	if (!info) {
> +		arm_smmu_rpm_put(master->smmu);
>  		return ERR_PTR(-ENOMEM);
> +	}
>  
>  	base_idr = master->smmu->base + ARM_SMMU_IDR0;
>  	for (i = 0; i <= 5; i++)
> @@ -27,6 +34,7 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
>  	*length = sizeof(*info);
>  	*type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
>  
> +	arm_smmu_rpm_put(master->smmu);
>  	return info;
>  }
>  
> @@ -96,6 +104,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
>  		.old_domain = iommu_get_domain_for_dev(dev),
>  		.ssid = IOMMU_NO_PASID,
>  	};
> +	struct arm_smmu_device *smmu = master->smmu;
>  	struct arm_smmu_ste ste;
>  	int ret;
>  
> @@ -104,6 +113,10 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
>  	if (arm_smmu_ssids_in_use(&master->cd_table))
>  		return -EBUSY;
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return ret;
> +
>  	mutex_lock(&arm_smmu_asid_lock);
>  	/*
>  	 * The VM has to control the actual ATS state at the PCI device because
> @@ -117,6 +130,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
>  	ret = arm_smmu_attach_prepare(&state, domain);
>  	if (ret) {
>  		mutex_unlock(&arm_smmu_asid_lock);
> +		arm_smmu_rpm_put(smmu);
>  		return ret;
>  	}
>  
> @@ -125,6 +139,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
>  	arm_smmu_install_ste_for_dev(master, &ste);
>  	arm_smmu_attach_commit(&state);
>  	mutex_unlock(&arm_smmu_asid_lock);
> +	arm_smmu_rpm_put(smmu);
>  	return 0;
>  }
>  
> @@ -301,6 +316,10 @@ static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
>  	cur = cmds;
>  	end = cmds + array->entry_num;
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return ret;
> +
>  	static_assert(sizeof(*cmds) == 2 * sizeof(u64));
>  	ret = iommu_copy_struct_from_full_user_array(
>  		cmds, sizeof(*cmds), array,
> @@ -329,6 +348,7 @@ static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
>  		last = cur;
>  	}
>  out:
> +	arm_smmu_rpm_put(smmu);
>  	array->entry_num = cur - cmds;
>  	kfree(cmds);
>  	return ret;
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> index 9ba596430e7c..8f6d2533c493 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> @@ -140,6 +140,7 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
>  {
>  	struct arm_smmu_domain *smmu_domain =
>  		container_of(mn, struct arm_smmu_domain, mmu_notifier);
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
>  	size_t size;
>  
>  	/*
> @@ -156,6 +157,10 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
>  			size = 0;
>  	}
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return;
> +
>  	if (!size)
>  		arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
>  	else
> @@ -163,6 +168,7 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
>  					    PAGE_SIZE, false, smmu_domain);
>  
>  	arm_smmu_atc_inv_domain(smmu_domain, start, size);
> +	arm_smmu_rpm_put(smmu);
>  }
>  
>  static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> @@ -170,8 +176,13 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	struct arm_smmu_domain *smmu_domain =
>  		container_of(mn, struct arm_smmu_domain, mmu_notifier);
>  	struct arm_smmu_master_domain *master_domain;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
>  	unsigned long flags;
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return;
> +
>  	/*
>  	 * DMA may still be running. Keep the cd valid to avoid C_BAD_CD events,
>  	 * but disable translation.
> @@ -195,6 +206,7 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  
>  	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
>  	arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
> +	arm_smmu_rpm_put(smmu);
>  }
>  
>  static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn)
> @@ -371,12 +383,24 @@ static int arm_smmu_sva_set_dev_pasid(struct iommu_domain *domain,
>  static void arm_smmu_sva_domain_free(struct iommu_domain *domain)
>  {
>  	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +	int ret;
> +
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		/*
> +		 * Even if the rpm_get fails, the TLBs were invalidated
> +		 * with suspend, continue with the free operation.
> +		 */
> +		goto free_asid;
>  
>  	/*
>  	 * Ensure the ASID is empty in the iommu cache before allowing reuse.
>  	 */
>  	arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
> +	arm_smmu_rpm_put(smmu);
>  
> +free_asid:
>  	/*
>  	 * Notice that the arm_smmu_mm_arch_invalidate_secondary_tlbs op can
>  	 * still be called/running at this point. We allow the ASID to be
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 602784e84e84..f8d74ccaed6a 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -109,7 +109,7 @@ static const char * const event_class_str[] = {
>  
>  static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
>  
> -static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
> +int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
>  {
>  	int ret;
>  
> @@ -124,7 +124,7 @@ static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
>  	return 0;
>  }
>  
> -static void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
> +void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
>  {
>  	int ret;
>  
> @@ -1037,7 +1037,9 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
>  {
>  	struct arm_smmu_cmdq_ent cmd = {0};
>  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct arm_smmu_device *smmu = master->smmu;
>  	int sid = master->streams[0].id;
> +	int ret;
>  
>  	if (WARN_ON(!master->stall_enabled))
>  		return;
> @@ -1057,6 +1059,10 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
>  		break;
>  	}
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return;
> +
>  	arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
>  	/*
>  	 * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP.
> @@ -1064,6 +1070,7 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
>  	 * terminated... at some point in the future. PRI_RESP is fire and
>  	 * forget.
>  	 */
> +	arm_smmu_rpm_put(smmu);
>  }
>  
>  /* Context descriptor manipulation functions */
> @@ -1984,6 +1991,7 @@ static void arm_smmu_dump_event(struct arm_smmu_device *smmu, u64 *raw,
>  
>  static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
>  {
> +	int ret;
>  	u64 evt[EVTQ_ENT_DWORDS];
>  	struct arm_smmu_event event = {0};
>  	struct arm_smmu_device *smmu = dev;
> @@ -1992,6 +2000,10 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
>  	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
>  				      DEFAULT_RATELIMIT_BURST);
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return IRQ_NONE;
> +
>  	do {
>  		while (!queue_remove_raw(q, evt)) {
>  			arm_smmu_decode_event(smmu, evt, &event);
> @@ -2012,6 +2024,7 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
>  
>  	/* Sync our overflow flag, as we believe we're up to speed */
>  	queue_sync_cons_ovf(q);
> +	arm_smmu_rpm_put(smmu);
>  	return IRQ_HANDLED;
>  }
>  
> @@ -2059,6 +2072,11 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
>  	struct arm_smmu_queue *q = &smmu->priq.q;
>  	struct arm_smmu_ll_queue *llq = &q->llq;
>  	u64 evt[PRIQ_ENT_DWORDS];
> +	int ret;
> +
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return IRQ_NONE;
>  
>  	do {
>  		while (!queue_remove_raw(q, evt))
> @@ -2070,6 +2088,7 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
>  
>  	/* Sync our overflow flag, as we believe we're up to speed */
>  	queue_sync_cons_ovf(q);
> +	arm_smmu_rpm_put(smmu);
>  	return IRQ_HANDLED;
>  }
>  
> @@ -2079,13 +2098,24 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
>  {
>  	u32 gerror, gerrorn, active;
>  	struct arm_smmu_device *smmu = dev;
> +	int ret;
> +
> +	if (pm_runtime_enabled(smmu->dev)) {
> +		ret = pm_runtime_get_if_active(smmu->dev);
> +		if (ret == 0) {
> +			dev_err(smmu->dev, "Ignoring gerror interrupt because device isn't rpm active\n");
> +			return IRQ_NONE;
> +		}
> +	}
>  
>  	gerror = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
>  	gerrorn = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
>  
>  	active = gerror ^ gerrorn;
> -	if (!(active & GERROR_ERR_MASK))
> +	if (!(active & GERROR_ERR_MASK)) {
> +		arm_smmu_rpm_put(smmu);
>  		return IRQ_NONE; /* No errors pending */
> +	}
>  
>  	dev_warn(smmu->dev,
>  		 "unexpected global error reported (0x%08x), this could be serious\n",
> @@ -2118,6 +2148,7 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
>  		arm_smmu_cmdq_skip_err(smmu);
>  
>  	writel(gerror, smmu->base + ARM_SMMU_GERRORN);
> +	arm_smmu_rpm_put(smmu);
>  	return IRQ_HANDLED;
>  }
>  
> @@ -2291,6 +2322,11 @@ static void arm_smmu_tlb_inv_context(void *cookie)
>  	struct arm_smmu_domain *smmu_domain = cookie;
>  	struct arm_smmu_device *smmu = smmu_domain->smmu;
>  	struct arm_smmu_cmdq_ent cmd;
> +	int ret;
> +
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return;
>  
>  	/*
>  	 * NOTE: when io-pgtable is in non-strict mode, we may get here with
> @@ -2307,6 +2343,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
>  		arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  	}
>  	arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
> +	arm_smmu_rpm_put(smmu);
>  }
>  
>  static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> @@ -2384,6 +2421,8 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>  					  size_t granule, bool leaf,
>  					  struct arm_smmu_domain *smmu_domain)
>  {
> +	int ret;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
>  	struct arm_smmu_cmdq_ent cmd = {
>  		.tlbi = {
>  			.leaf	= leaf,
> @@ -2398,6 +2437,11 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>  		cmd.opcode	= CMDQ_OP_TLBI_S2_IPA;
>  		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
>  	}
> +
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return;
> +
>  	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
>  
>  	if (smmu_domain->nest_parent) {
> @@ -2414,6 +2458,7 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>  	 * zapped an entire table.
>  	 */
>  	arm_smmu_atc_inv_domain(smmu_domain, iova, size);
> +	arm_smmu_rpm_put(smmu);
>  }
>  
>  void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
> @@ -2985,6 +3030,10 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  	if (smmu_domain->smmu != smmu)
>  		return ret;
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return ret;
> +
>  	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
>  		cdptr = arm_smmu_alloc_cd_ptr(master, IOMMU_NO_PASID);
>  		if (!cdptr)
> @@ -3028,6 +3077,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  
>  	arm_smmu_attach_commit(&state);
>  	mutex_unlock(&arm_smmu_asid_lock);
> +	arm_smmu_rpm_put(smmu);
>  	return 0;
>  }
>  
> @@ -3090,6 +3140,7 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
>  		.ssid = pasid,
>  		.old_domain = old,
>  	};
> +	struct arm_smmu_device *smmu = master->smmu;
>  	struct arm_smmu_cd *cdptr;
>  	int ret;
>  
> @@ -3103,9 +3154,15 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
>  	    sid_domain->type != IOMMU_DOMAIN_BLOCKED)
>  		return -EINVAL;
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return ret;
> +
>  	cdptr = arm_smmu_alloc_cd_ptr(master, pasid);
> -	if (!cdptr)
> +	if (!cdptr) {
> +		arm_smmu_rpm_put(smmu);
>  		return -ENOMEM;
> +	}
>  
>  	mutex_lock(&arm_smmu_asid_lock);
>  	ret = arm_smmu_attach_prepare(&state, &smmu_domain->domain);
> @@ -3127,6 +3184,7 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
>  
>  out_unlock:
>  	mutex_unlock(&arm_smmu_asid_lock);
> +	arm_smmu_rpm_put(smmu);
>  	return ret;
>  }
>  
> @@ -3136,6 +3194,12 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
>  {
>  	struct arm_smmu_domain *smmu_domain = to_smmu_domain(old_domain);
>  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct arm_smmu_device *smmu = master->smmu;
> +	int ret;
> +
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return ret;
>  
>  	mutex_lock(&arm_smmu_asid_lock);
>  	arm_smmu_clear_cd(master, pasid);
> @@ -3156,21 +3220,28 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
>  		    sid_domain->type == IOMMU_DOMAIN_BLOCKED)
>  			sid_domain->ops->attach_dev(sid_domain, dev);
>  	}
> +
> +	arm_smmu_rpm_put(smmu);
>  	return 0;
>  }
>  
> -static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
> +static int arm_smmu_attach_dev_ste(struct iommu_domain *domain,
>  				    struct device *dev,
>  				    struct arm_smmu_ste *ste,
>  				    unsigned int s1dss)
>  {
> +	int ret;
>  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct arm_smmu_device *smmu = master->smmu;
>  	struct arm_smmu_attach_state state = {
>  		.master = master,
>  		.old_domain = iommu_get_domain_for_dev(dev),
>  		.ssid = IOMMU_NO_PASID,
>  	};
>  
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return ret;
>  	/*
>  	 * Do not allow any ASID to be changed while are working on the STE,
>  	 * otherwise we could miss invalidations.
> @@ -3205,6 +3276,8 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
>  	 * descriptor from arm_smmu_share_asid().
>  	 */
>  	arm_smmu_clear_cd(master, IOMMU_NO_PASID);
> +	arm_smmu_rpm_put(smmu);
> +	return 0;
>  }
>  
>  static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
> @@ -3214,8 +3287,8 @@ static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
>  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
>  
>  	arm_smmu_make_bypass_ste(master->smmu, &ste);
> -	arm_smmu_attach_dev_ste(domain, dev, &ste, STRTAB_STE_1_S1DSS_BYPASS);
> -	return 0;
> +	return arm_smmu_attach_dev_ste(domain, dev, &ste,
> +				       STRTAB_STE_1_S1DSS_BYPASS);
>  }
>  
>  static const struct iommu_domain_ops arm_smmu_identity_ops = {
> @@ -3233,9 +3306,8 @@ static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
>  	struct arm_smmu_ste ste;
>  
>  	arm_smmu_make_abort_ste(&ste);
> -	arm_smmu_attach_dev_ste(domain, dev, &ste,
> -				STRTAB_STE_1_S1DSS_TERMINATE);
> -	return 0;
> +	return arm_smmu_attach_dev_ste(domain, dev, &ste,
> +				       STRTAB_STE_1_S1DSS_TERMINATE);
>  }
>  
>  static const struct iommu_domain_ops arm_smmu_blocked_ops = {
> @@ -4874,10 +4946,19 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
>  static void arm_smmu_device_remove(struct platform_device *pdev)
>  {
>  	struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
> +	int ret;
>  
>  	iommu_device_unregister(&smmu->iommu);
>  	iommu_device_sysfs_remove(&smmu->iommu);
> +
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret <  0)
> +		goto free_iopf;
> +
>  	arm_smmu_device_disable(smmu);
> +	arm_smmu_rpm_put(smmu);
> +
> +free_iopf:
>  	iopf_queue_free(smmu->evtq.iopf);
>  	ida_destroy(&smmu->vmid_map);
>  }
> @@ -4885,8 +4966,14 @@ static void arm_smmu_device_remove(struct platform_device *pdev)
>  static void arm_smmu_device_shutdown(struct platform_device *pdev)
>  {
>  	struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
> +	int ret;
> +
> +	ret = arm_smmu_rpm_get(smmu);
> +	if (ret < 0)
> +		return;
>  
>  	arm_smmu_device_disable(smmu);
> +	arm_smmu_rpm_put(smmu);
>  }
>  
>  
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 6bac83a511e2..b7c2e28e903b 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -987,6 +987,9 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>  				struct arm_smmu_cmdq *cmdq, u64 *cmds, int n,
>  				bool sync);
>  
> +int arm_smmu_rpm_get(struct arm_smmu_device *smmu);
> +void arm_smmu_rpm_put(struct arm_smmu_device *smmu);
> +
>  #ifdef CONFIG_ARM_SMMU_V3_SVA
>  bool arm_smmu_sva_supported(struct arm_smmu_device *smmu);
>  bool arm_smmu_master_sva_supported(struct arm_smmu_master *master);
> -- 
> 2.49.0.rc1.451.g8f38331e32-goog
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-20 22:28     ` Mostafa Saleh
@ 2025-03-20 23:05       ` Jason Gunthorpe
  2025-03-21 14:44         ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 23:05 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Robin Murphy, Pranjal Shrivastava, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 10:28:44PM +0000, Mostafa Saleh wrote:
> On Wed, Mar 19, 2025 at 04:46:09PM -0300, Jason Gunthorpe wrote:
> > On Wed, Mar 19, 2025 at 06:22:30PM +0000, Robin Murphy wrote:
> > > I don't see SVA and VFIO needing any special considerations, so it might
> > > just be the new vIOMMU stuff which may need to hold PM enabled for the
> > > lifetime of things exposed to userspace.
> > 
> > Are you expecting that the VFIO driver, and whatever driver is
> > managing the SVA will keep the power turned on as necessary then?
> > 
> > VFIO is not going to work if the SMMU power goes off unexpectedly
> > while the VFIO FD is open.
> > 
> > I suppose VFIO also can power off on-demand via userspace asking
> > it. In that case the VM would be co-operating.
> 
> I guess VFIO is tricky as there is a security boundary. In case the
> SMMUv3 wakes up with caches not clean, there is a window of time where
> the device is accessible from userspace with old (or garbage) TLBs.
> I see VFIO-platform always take a PM reference for the device at
> open(), but I am not sure how that works with VFIO-pci.

I think that is my prior remark, PM must either abort or translate
exactly as the domain contains. Never something weird/old/bypass.

If the HW isn't guaranteeing that then we have a bigger discussion
here about how to deal with the security issues that creates.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-20 21:00     ` Pranjal Shrivastava
@ 2025-03-20 23:08       ` Jason Gunthorpe
  2025-03-21 14:36         ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 23:08 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Robin Murphy, Joerg Roedel, Will Deacon, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 09:00:57PM +0000, Pranjal Shrivastava wrote:
> On Wed, Mar 19, 2025 at 04:46:09PM -0300, Jason Gunthorpe wrote:
> > On Wed, Mar 19, 2025 at 06:22:30PM +0000, Robin Murphy wrote:
> > > I don't see SVA and VFIO needing any special considerations, so it might
> > > just be the new vIOMMU stuff which may need to hold PM enabled for the
> > > lifetime of things exposed to userspace.
> > 
> > Are you expecting that the VFIO driver, and whatever driver is
> > managing the SVA will keep the power turned on as necessary then?
> > 
> > VFIO is not going to work if the SMMU power goes off unexpectedly
> > while the VFIO FD is open.
> > 
> 
> In that case, why not hold a ref till the FD is closed? When all the
> refs are dropped, we can safely power it down?

I assume VFIO is already doing that?? It must be keeping the PCI
device powered up, doesn't that automatically keep the SMMU powered
too?


> > I suppose VFIO also can power off on-demand via userspace asking
> > it. In that case the VM would be co-operating.
> > 
> 
> I'm wondering if user-space / VMM would be willing to control runtime
> power at this point? Maybe we could have n VMMs vote when they need
> power and drop the votes later and we can simply rpm_get if votes > 0

Yeah, I imagine that is how it would have to work. 

> > viommu I think will only uses STE fields so it should be OK?
> > 
> 
> What about IOMMUFD stuff like the Get HW info IOCTL?

Isn't that all from SW caches?
 
> > SVA.. Is similar to VFIO, if the power goes off while the SVA is
> > working that is bad. The SVA requesting driver would have to manage
> > this. HW with userspace command submission probably can't allow the
> > power to go off while command submission is possible.
> 
> Hmm.. maybe then rpm_get() 

Again mostly assuming the SVA using driver is already doing this
somehow?

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-03-20 22:29   ` Mostafa Saleh
@ 2025-03-21  7:26     ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-21  7:26 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 10:29:42PM +0000, Mostafa Saleh wrote:
> On Wed, Mar 19, 2025 at 12:42:50AM +0000, Pranjal Shrivastava wrote:
> > Refactor arm_smmu_setup_irqs by splitting it into two parts, one for
> > registering interrupt handlers and the other one for enabling interrupt
> > generation in the hardware. This refactor helps in re-initialization of
> > hardware interrupts as part of a subsequent patch that enables runtime
> > power management for the arm-smmu-v3 driver.
> > 
> > Signed-off-by: Pranjal Shrivastava <praan@google.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 ++++++++++++++++-----
> >  1 file changed, 38 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 358072b4e293..80362d9f293b 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -3985,12 +3985,24 @@ static void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu)
> >  	}
> >  }
> >  
> > -static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> > +static void arm_smmu_enable_irqs(struct arm_smmu_device *smmu)
> >  {
> > -	int ret, irq;
> > +	int ret;
> >  	u32 irqen_flags = IRQ_CTRL_EVTQ_IRQEN | IRQ_CTRL_GERROR_IRQEN;
> >  
> > -	/* Disable IRQs first */
> > +	if (smmu->features & ARM_SMMU_FEAT_PRI)
> > +		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
> > +
> > +	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
> > +				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
> > +	if (ret)
> > +		dev_warn(smmu->dev, "failed to enable irqs\n");
> > +}
> > +
> > +static int arm_smmu_disable_irqs(struct arm_smmu_device *smmu)
> > +{
> > +	int ret;
> > +
> >  	ret = arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
> >  				      ARM_SMMU_IRQ_CTRLACK);
> >  	if (ret) {
> > @@ -3998,6 +4010,19 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> >  		return ret;
> >  	}
> >  
> > +	return 0;
> > +}
> > +
> > +static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> > +{
> > +	int ret, irq;
> > +
> > +	/* Disable IRQs first */
> > +	ret = arm_smmu_disable_irqs(smmu);
> > +
> > +	if (ret)
> > +		return ret;
> > +
> >  	irq = smmu->combined_irq;
> >  	if (irq) {
> >  		/*
> > @@ -4014,15 +4039,6 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> >  	} else
> >  		arm_smmu_setup_unique_irqs(smmu);
> >  
> > -	if (smmu->features & ARM_SMMU_FEAT_PRI)
> > -		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
> > -
> > -	/* Enable interrupt generation on the SMMU */
> > -	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
> > -				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
> > -	if (ret)
> > -		dev_warn(smmu->dev, "failed to enable irqs\n");
> > -
> >  	return 0;
> >  }
> >  
> > @@ -4171,6 +4187,9 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
> >  		return ret;
> >  	}
> >  
> > +	/* Enable interrupt generation on the SMMU */
> > +	arm_smmu_enable_irqs(smmu);
> > +
> >  	if (is_kdump_kernel())
> >  		enables &= ~(CR0_EVTQEN | CR0_PRIQEN);
> >  
> > @@ -4754,6 +4773,13 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
> >  	/* Check for RMRs and install bypass STEs if any */
> >  	arm_smmu_rmr_install_bypass_ste(smmu);
> >  
> > +	/* Setup interrupt handlers */
> > +	ret = arm_smmu_setup_irqs(smmu);
> > +	if (ret) {
> > +		dev_err(smmu->dev, "failed to setup irqs\n");
> 
> Every path that can fail in this function has a print already, I’d say
> remove the ones inside and keep this one, that also simplifies
> arm_smmu_disable_irqs() to one line.

I guess as per Nicolin's comment, this call would go away and only be
called during probe. But yes the enable_irqs() will be here. I guess I
could move the print statement out of enable irqs and put it here?

The arm_smmu_disable_irqs will be called from the setup_irqs mainly,
during probe, so I think we could leave it's print as is..

> 
> Otherwise, besides Nicolin’s comment, it looks good.
> 
> Thanks,
> Mostafa

Thanks,
Praan


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
  2025-03-20 22:30   ` Mostafa Saleh
@ 2025-03-21  8:09     ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-21  8:09 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 10:30:23PM +0000, Mostafa Saleh wrote:
> On Wed, Mar 19, 2025 at 12:42:51AM +0000, Pranjal Shrivastava wrote:
> > Before we suspend SMMU, we want to ensure that all commands have been
> > consumed by the SMMU's command queue, i.e. the cmdq is empty. Add a
> > helper function that waits till the cmdq is empty using `queue_empty`.
> > 
> > Signed-off-by: Pranjal Shrivastava <praan@google.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> > 
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 80362d9f293b..f65d9bca0392 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -754,6 +754,30 @@ static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
> >  	return __arm_smmu_cmdq_poll_until_consumed(smmu, cmdq, llq);
> >  }
> >  
> > +/*
> > + * Wait until the SMMU cmdq is empty
> > + * Must be called with the cmdq lock held in some capacity.
> > + */
> > +static int arm_smmu_cmdq_poll_until_empty(struct arm_smmu_device *smmu,
> > +					  struct arm_smmu_cmdq *cmdq,
> > +					  struct arm_smmu_ll_queue *llq)
> > +{
> > +	struct arm_smmu_queue_poll qp;
> > +	int ret = 0;
> > +
> > +	queue_poll_init(smmu, &qp);
> > +	llq->val = READ_ONCE(cmdq->q.llq.val);
> > +	do {
> > +		if (queue_empty(llq))
> > +			break;
> > +
> > +		ret = queue_poll(&qp);
> > +		llq->cons = readl(cmdq->q.cons_reg);
> 
> If the command queue is not empty, that means some other thread is already
> polling it, so I guess there is no need to do any MMIO here, just poll
> the SW view of prod/cons, something similar to arm_smmu_cmdq_poll_until_not_full()
> 

That's a good insight. But, I don't understand how we can avoid MMIO
here? The commands that are inserted without (sync == true) won't update
the llq.cons... in which case if we keep polling on the SW copies of
prod and cons we'd never know if the queue was drained? I believe we
will have to poll on the cons register.

> Thanks,
> Mostafa

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-03-20 22:33   ` Mostafa Saleh
@ 2025-03-21  8:13     ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-21  8:13 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 10:33:41PM +0000, Mostafa Saleh wrote:
> On Wed, Mar 19, 2025 at 12:43 AM Pranjal Shrivastava <praan@google.com> wrote:
> >
> > Implement pm_runtime and system sleep ops for arm-smmu-v3. The smmu is
> > disabled as part of the suspend callbacks after ensuring the completion
> > of all pending commands and is configured to abort any transactions that
> > happen after disabling the smmu. The smmu shall be reinitialized in the
> > resume callback by invoking the `arm_smmu_device_reset` helper.
> >
> > The MSIs are freed as part of the suspend and are re-allocated during
> > the resume operation.
> >
> > Signed-off-by: Pranjal Shrivastava <praan@google.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 106 ++++++++++++++++++++
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   2 +
> >  2 files changed, 108 insertions(+)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index f65d9bca0392..caf750470772 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -26,6 +26,7 @@
> >  #include <linux/pci.h>
> >  #include <linux/pci-ats.h>
> >  #include <linux/platform_device.h>
> > +#include <linux/pm_runtime.h>
> >  #include <linux/string_choices.h>
> >  #include <kunit/visibility.h>
> >  #include <uapi/linux/iommufd.h>
> > @@ -108,6 +109,32 @@ static const char * const event_class_str[] = {
> >
> >  static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
> >
> > +static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
> > +{
> > +       int ret;
> > +
> > +       if (pm_runtime_enabled(smmu->dev)) {
> > +               ret = pm_runtime_resume_and_get(smmu->dev);
> > +               if (ret < 0) {
> > +                       dev_err(smmu->dev, "Failed to resume device: %d\n", ret);
> > +                       return ret;
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
> > +{
> > +       int ret;
> > +
> > +       if (pm_runtime_enabled(smmu->dev)) {
> > +               ret = pm_runtime_put_autosuspend(smmu->dev);
> > +               if (ret < 0)
> > +                       dev_err(smmu->dev, "Failed to suspend device: %d\n", ret);
> > +       }
> > +}
> > +
> >  static void parse_driver_options(struct arm_smmu_device *smmu)
> >  {
> >         int i = 0;
> > @@ -4850,6 +4877,84 @@ static void arm_smmu_device_shutdown(struct platform_device *pdev)
> >         arm_smmu_device_disable(smmu);
> >  }
> >
> > +
> > +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> > +{
> > +       struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> > +
> > +       /* We might get the vcmdq */
> > +       struct arm_smmu_cmdq_ent cmd = {
> > +               .opcode = smmu->features & ARM_SMMU_FEAT_E2H ?
> > +                       CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> > +       };
> > +
> > +       struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> > +       struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> > +
> > +       /*
> > +        * Since suspend is invoked when all clients have been
> > +        * we don't expect more commands to be added to the cmdq.
> > +        * Thus, wait for all existing commands to complete.
> > +        */
> > +       arm_smmu_cmdq_shared_lock(cmdq);
> > +       arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> > +       arm_smmu_cmdq_shared_unlock(cmdq);
> > +
> > +       /* Disable all queues */
> > +       arm_smmu_device_disable(smmu);
> > +
> > +       /* Abort all transactions to avoid spurious bypass */
> > +       arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
> > +
> > +       /* Free all MSIs (re-allocated on resume) */
> > +       arm_smmu_free_msis(dev);
> > +
> > +       dev_dbg(dev, "Suspending smmu\n");
> > +       return 0;
> > +}
> > +
> > +static int __maybe_unused arm_smmu_runtime_resume(struct device *dev)
> > +{
> > +       int ret;
> > +       struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> > +
> > +       dev_dbg(dev, "Resuming device\n");
> > +
> > +       /*
> > +        * The reset will re-initialize all the queues with the base addr,
> > +        * prod and cons maintained within struct arm_smmu_device as well as
> > +        * re-wire the IRQs and/or enable MSIs and install relevant handlers.
> > +        */
> > +       ret = arm_smmu_device_reset(smmu);
> > +
> > +       if (ret)
> > +               dev_err(dev, "Failed to reset during resume operation: %d\n", ret);
> > +
> > +       return ret;
> > +}
> > +
> > +static int __maybe_unused arm_smmu_pm_suspend(struct device *dev)
> > +{
> > +       if (pm_runtime_suspended(dev))
> > +               return 0;
> > +
> > +       return arm_smmu_runtime_suspend(dev);
> > +}
> > +
> > +static int __maybe_unused arm_smmu_pm_resume(struct device *dev)
> > +{
> > +       if (pm_runtime_suspended(dev))
> > +               return 0;
> > +
> > +       return arm_smmu_runtime_resume(dev);
> > +}
> > +
> > +static const struct dev_pm_ops arm_smmu_pm_ops = {
> > +       SET_SYSTEM_SLEEP_PM_OPS(arm_smmu_pm_suspend, arm_smmu_pm_resume)
> 
> I guess, as there is no specific logic in the system suspend/resume handlers,
> SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume)
> would be enough?

Ack. Thanks for this! :)

> 
> Thanks,
> Mostafa
> 

-Praan

> > +       SET_RUNTIME_PM_OPS(arm_smmu_runtime_suspend,
> > +                          arm_smmu_runtime_resume, NULL)
> > +};
> > +
> >  static const struct of_device_id arm_smmu_of_match[] = {
> >         { .compatible = "arm,smmu-v3", },
> >         { },
> > @@ -4866,6 +4971,7 @@ static struct platform_driver arm_smmu_driver = {
> >         .driver = {
> >                 .name                   = "arm-smmu-v3",
> >                 .of_match_table         = arm_smmu_of_match,
> > +               .pm                     = &arm_smmu_pm_ops,
> >                 .suppress_bind_attrs    = true,
> >         },
> >         .probe  = arm_smmu_device_probe,
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> > index bd9d7c85576a..6bac83a511e2 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> > @@ -505,6 +505,8 @@ static inline unsigned int arm_smmu_cdtab_l2_idx(unsigned int ssid)
> >  #define MSI_IOVA_BASE                  0x8000000
> >  #define MSI_IOVA_LENGTH                        0x100000
> >
> > +#define RPM_AUTOSUSPEND_DELAY_MS       15
> > +
> >  enum pri_resp {
> >         PRI_RESP_DENY = 0,
> >         PRI_RESP_FAIL = 1,
> > --
> > 2.49.0.rc1.451.g8f38331e32-goog
> >

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-20 22:25     ` Mostafa Saleh
@ 2025-03-21 14:18       ` Pranjal Shrivastava
  2025-03-21 17:35         ` Robin Murphy
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-21 14:18 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Robin Murphy, Jason Gunthorpe, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 10:25:52PM +0000, Mostafa Saleh wrote:
> On Wed, Mar 19, 2025 at 04:07:57PM +0000, Robin Murphy wrote:
> > On 19/03/2025 11:57 am, Jason Gunthorpe wrote:
> > > On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:
> > > 
> > > > 3. Invoking runtime_pm_get/put
> > > > Given that most of the configuration done by arm-smmu-v3 is stored in
> > > > memory, the initial idea is to focus on areas where the driver accesses
> > > > the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> > > 
> > > This seems weird, if the SMMU is suspended doesn't it also fail DMA
> > > transactions? Why would ops like flush even be called if the HW is
> > > disabled?
> > 
> > Because once the device has finished its operation, its driver is free to
> > call rpm_put() before calling dma_unmap(), so by the time that gets as far
> > as TLB maintenance, the SMMU may already be asleep as well if that device
> > was the only thing keeping it awake.
> > 
> > For direct IOMMU API users, pagetable update may be even more asynchronous
> > from device activity, e.g. a GPU buffer might only be unmapped once
> > userspace closes the last file handle referencing it, long after the GPU
> > itself has moved on to other things.
> > 
> > > flush is performance path stuff, so it doesn't seem great to be adding
> > > extra calls there.
> > 
> > That much is true - this really wants to be using pm_runtime_get_if_in_use()
> > nearly everywhere such that at most it's just juggling refcounts. There's no
> > point waking the SMMU up just to issue a CFGI or TLBI, if the act of doing
> > so is inherently going to do a full arm_smmu_reset() and thus invalidate
> > everything anyway.
> 
> AFAICT, there is no guarantees that caches are clean on system resume,
> but as we do invalidate everything that should be fine, but I am not sure

I mean we do set GBPA.Abort = 1 right before suspending, I'd want to
assume that doing so would ensure that TLB hits don't occur anymore. Let
me dig into the spec to see if I can find something regarding TLB
behavior when GBPA.Abort = 1

> how that works with distributed SMMUs where the TBU can still be powered
> with some TLBs that can be invalid?
> 

Hmm.. do you mean some situation like:

 |-----------------------|    |-------------------|
 | |-------|   |-------| |    |-------------------|
 | | Dev X |   | Dev Y | |    |                   |
 | | (TBU) |   |_______| |    |       SMMUv3      |
 | |-------|             |    |  (TCU -> TBU_X)   |
 |     Power_Domain A    |    |  Power_Domain B   |
 |-----------------------|    |-------------------|

Now if Dev Y isn't an SMMU client and Dev X drops the ref_count,
Power_Domain A would still remain ON whereas SMMUv3 might assume all
it's clients are down and try to suspend (power off Power_Domain B)
while the TLBs withing TBU_X are still powered up?

To avoid such a case maybe we should invalidate everything during
suspend? I still believe that when CR0.SMMUEN=0, it should broadcast
something to all it's TBUs asking them to invalidate all TLBs otherwise
this is a miss in the arch (unlikely for Arm) as TLB hits should never
occur if the SMMU is disabled. I guess I need to go through the SMMUv3
spec (or maybe the MMU-700 TRM) for confirming this..

> Thanks,
> Mostafa
> > 

Thanks,
Praan

> > > > Instead of wrapping every exposed op with an rpm_get/put, the idea is to
> > > > follow through till a logical common point (lowest common ancestor in
> > > > the call hierarchies) and wrap those with the rpm_get/put calls.
> > > 
> > > Should the iommu core be doing this instead of hidden in drivers? It
> > > seems like there might be better information there about when we
> > > expect the IOMMU to be powered up and working and when it is OK to
> > > have it shutdown?
> > 
> > On the contrary, I would say it's very much driver-level knowledge as to
> > which operations need the hardware accessible or not, and trying to do it at
> > the core level would inevitably end up very pessimistic and inefficient, or
> > at best just horribly complex.
> > 
> > > And maybe in this direction we can do things like remove the
> > > translation before doing a suspend so that flush ops are not a
> > > problem.
> > 
> > Case in point: that would be a load of extra work to do, make suspend/resume
> > even slower, and even then for no benefit, since map_pages/umnmap_pages can
> > still be called on the unattached domain (especially a default DMA domain),
> > wherein drivers may still end up trying to touch their hardware for various
> > reasons in ways invisible to the core API. Easy example: consider a driver
> > calling dma_alloc_coherent() to prepare a buffer *before* it wakes up its
> > client device (which would then also wake up the SMMU via the device link),
> > and deep within that call, arm_lpae_init_pte() ends up calling back into
> > arm_smmu_tlb_inv_walk() because it's laying down a block mapping where an
> > old empty table PTE happens to be...
> > 
> > Thanks,
> > Robin.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-20 23:08       ` Jason Gunthorpe
@ 2025-03-21 14:36         ` Pranjal Shrivastava
  2025-03-22  0:00           ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-21 14:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Joerg Roedel, Will Deacon, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 08:08:13PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 20, 2025 at 09:00:57PM +0000, Pranjal Shrivastava wrote:
> > On Wed, Mar 19, 2025 at 04:46:09PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Mar 19, 2025 at 06:22:30PM +0000, Robin Murphy wrote:
> > > > I don't see SVA and VFIO needing any special considerations, so it might
> > > > just be the new vIOMMU stuff which may need to hold PM enabled for the
> > > > lifetime of things exposed to userspace.
> > > 
> > > Are you expecting that the VFIO driver, and whatever driver is
> > > managing the SVA will keep the power turned on as necessary then?
> > > 
> > > VFIO is not going to work if the SMMU power goes off unexpectedly
> > > while the VFIO FD is open.
> > > 
> > 
> > In that case, why not hold a ref till the FD is closed? When all the
> > refs are dropped, we can safely power it down?
> 
> I assume VFIO is already doing that?? It must be keeping the PCI
> device powered up, doesn't that automatically keep the SMMU powered
> too?
> 

Yes it should if the devlinks are setup correctly.

> 
> > > I suppose VFIO also can power off on-demand via userspace asking
> > > it. In that case the VM would be co-operating.
> > > 
> > 
> > I'm wondering if user-space / VMM would be willing to control runtime
> > power at this point? Maybe we could have n VMMs vote when they need
> > power and drop the votes later and we can simply rpm_get if votes > 0
> 
> Yeah, I imagine that is how it would have to work. 
> 
> > > viommu I think will only uses STE fields so it should be OK?
> > > 
> > 
> > What about IOMMUFD stuff like the Get HW info IOCTL?
> 
> Isn't that all from SW caches?
>  

Not really? I see arm_smmu_hw_info reads all the IDR registers..
The call flow I could trace is like:

iommufd_get_hw_info()
 -> iommu_ops->hw_info()
   -> arm_smmu_hw_info()

Am I missing something here?

> > > SVA.. Is similar to VFIO, if the power goes off while the SVA is
> > > working that is bad. The SVA requesting driver would have to manage
> > > this. HW with userspace command submission probably can't allow the
> > > power to go off while command submission is possible.
> > 
> > Hmm.. maybe then rpm_get() 
> 
> Again mostly assuming the SVA using driver is already doing this
> somehow?
> 

Hmm.. yea it should happen through devlinks too.. Let me re-check if
there's a surprise somewhere.

> Jason

Thanks
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-20 23:05       ` Jason Gunthorpe
@ 2025-03-21 14:44         ` Pranjal Shrivastava
  2025-03-21 15:30           ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-21 14:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mostafa Saleh, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Thu, Mar 20, 2025 at 08:05:51PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 20, 2025 at 10:28:44PM +0000, Mostafa Saleh wrote:
> > On Wed, Mar 19, 2025 at 04:46:09PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Mar 19, 2025 at 06:22:30PM +0000, Robin Murphy wrote:
> > > > I don't see SVA and VFIO needing any special considerations, so it might
> > > > just be the new vIOMMU stuff which may need to hold PM enabled for the
> > > > lifetime of things exposed to userspace.
> > > 
> > > Are you expecting that the VFIO driver, and whatever driver is
> > > managing the SVA will keep the power turned on as necessary then?
> > > 
> > > VFIO is not going to work if the SMMU power goes off unexpectedly
> > > while the VFIO FD is open.
> > > 
> > > I suppose VFIO also can power off on-demand via userspace asking
> > > it. In that case the VM would be co-operating.
> > 
> > I guess VFIO is tricky as there is a security boundary. In case the
> > SMMUv3 wakes up with caches not clean, there is a window of time where
> > the device is accessible from userspace with old (or garbage) TLBs.
> > I see VFIO-platform always take a PM reference for the device at
> > open(), but I am not sure how that works with VFIO-pci.
> 
> I think that is my prior remark, PM must either abort or translate
> exactly as the domain contains. Never something weird/old/bypass.
> 
> If the HW isn't guaranteeing that then we have a bigger discussion
> here about how to deal with the security issues that creates.
> 

+1. I strongly believe that the disable that happens through
CR0.SMMUEN=0 would (and should) clear all TLB entries or *atleast*
ensure all transactions henceforth cause a TLB miss. I'll go through the
spec once to confirm this. If that's correct and we configure 
GBPA.Abort=1, all transactions should be aborted.

However, still there's a worry about the reset value of GBPA.Abort as
pointed out by Robin earlier. Since the reset value of GBPA.Abort is 
implementation defined.. there's a chance that after a power cycle the
SMMU wakes up with GBPA configured to bypass.. in such a case, I don't
think the kernel should be responsible to ensure security..

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-21 14:44         ` Pranjal Shrivastava
@ 2025-03-21 15:30           ` Jason Gunthorpe
  2025-03-24 17:53             ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-21 15:30 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Mostafa Saleh, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Fri, Mar 21, 2025 at 02:44:09PM +0000, Pranjal Shrivastava wrote:

> However, still there's a worry about the reset value of GBPA.Abort as
> pointed out by Robin earlier. Since the reset value of GBPA.Abort is 
> implementation defined.. there's a chance that after a power cycle the
> SMMU wakes up with GBPA configured to bypass.. in such a case, I don't
> think the kernel should be responsible to ensure security..

The kernel should be responsible to operate that HW in a secure
way. If the spec doesn't guarentee security then you will need a
ACPI/DT flag to indicate if specific implementations are secure or
not (ie if the implementation preserves GBPA.Abort).

Otherwise we'd have to architect around the insecurity somehow and
prevent the SMMU from unpowering if there is any security sensitive
attachment..

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-21 14:18       ` Pranjal Shrivastava
@ 2025-03-21 17:35         ` Robin Murphy
  2025-03-24 17:36           ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Robin Murphy @ 2025-03-21 17:35 UTC (permalink / raw)
  To: Pranjal Shrivastava, Mostafa Saleh
  Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Nicolin Chen,
	Daniel Mentz, iommu

On 21/03/2025 2:18 pm, Pranjal Shrivastava wrote:
> On Thu, Mar 20, 2025 at 10:25:52PM +0000, Mostafa Saleh wrote:
>> On Wed, Mar 19, 2025 at 04:07:57PM +0000, Robin Murphy wrote:
>>> On 19/03/2025 11:57 am, Jason Gunthorpe wrote:
>>>> On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:
>>>>
>>>>> 3. Invoking runtime_pm_get/put
>>>>> Given that most of the configuration done by arm-smmu-v3 is stored in
>>>>> memory, the initial idea is to focus on areas where the driver accesses
>>>>> the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
>>>>
>>>> This seems weird, if the SMMU is suspended doesn't it also fail DMA
>>>> transactions? Why would ops like flush even be called if the HW is
>>>> disabled?
>>>
>>> Because once the device has finished its operation, its driver is free to
>>> call rpm_put() before calling dma_unmap(), so by the time that gets as far
>>> as TLB maintenance, the SMMU may already be asleep as well if that device
>>> was the only thing keeping it awake.
>>>
>>> For direct IOMMU API users, pagetable update may be even more asynchronous
>>> from device activity, e.g. a GPU buffer might only be unmapped once
>>> userspace closes the last file handle referencing it, long after the GPU
>>> itself has moved on to other things.
>>>
>>>> flush is performance path stuff, so it doesn't seem great to be adding
>>>> extra calls there.
>>>
>>> That much is true - this really wants to be using pm_runtime_get_if_in_use()
>>> nearly everywhere such that at most it's just juggling refcounts. There's no
>>> point waking the SMMU up just to issue a CFGI or TLBI, if the act of doing
>>> so is inherently going to do a full arm_smmu_reset() and thus invalidate
>>> everything anyway.
>>
>> AFAICT, there is no guarantees that caches are clean on system resume,
>> but as we do invalidate everything that should be fine, but I am not sure

That was the point - we're definitely going to do a full software 
invalidation *because* we can't make any assumptions about the hardware 
state, i.e. it may come back full of valid-looking nonsense.

> I mean we do set GBPA.Abort = 1 right before suspending, I'd want to
> assume that doing so would ensure that TLB hits don't occur anymore. Let
> me dig into the spec to see if I can find something regarding TLB
> behavior when GBPA.Abort = 1

GBPA doesn't matter here, it's about the CR0.SMMUEN=0 behaviour (see 
6.3.9.6). That says "Incoming transactions [...] do not undergo 
translation," so although TLB entries are allowed to remain present, 
they must not be *used* - i.e. SMMUEN is not permitted to be cached in a 
TLB.

>> how that works with distributed SMMUs where the TBU can still be powered
>> with some TLBs that can be invalid?
>>
> 
> Hmm.. do you mean some situation like:
> 
>   |-----------------------|    |-------------------|
>   | |-------|   |-------| |    |-------------------|
>   | | Dev X |   | Dev Y | |    |                   |
>   | | (TBU) |   |_______| |    |       SMMUv3      |
>   | |-------|             |    |  (TCU -> TBU_X)   |
>   |     Power_Domain A    |    |  Power_Domain B   |
>   |-----------------------|    |-------------------|
> 
> Now if Dev Y isn't an SMMU client and Dev X drops the ref_count,
> Power_Domain A would still remain ON whereas SMMUv3 might assume all
> it's clients are down and try to suspend (power off Power_Domain B)
> while the TLBs withing TBU_X are still powered up?
> 
> To avoid such a case maybe we should invalidate everything during
> suspend? I still believe that when CR0.SMMUEN=0, it should broadcast
> something to all it's TBUs asking them to invalidate all TLBs otherwise
> this is a miss in the arch (unlikely for Arm) as TLB hits should never
> occur if the SMMU is disabled. I guess I need to go through the SMMUv3
> spec (or maybe the MMU-700 TRM) for confirming this..

I don't think this is allowed to be an issue in practice since TBUs are 
not architecturally visible. Certainly in terms of Arm's 
implementations, for a TBU to be powered off or externally clock gated 
it would have to do a full DTI disconnect (otherwise it would hang 
CMD_SYNC), and DTI requires that it must subsequently come back clean:

"The TBU must invalidate its caches before entering CONNECTED state."

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-21 14:36         ` Pranjal Shrivastava
@ 2025-03-22  0:00           ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-22  0:00 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Robin Murphy, Joerg Roedel, Will Deacon, Nicolin Chen,
	Mostafa Saleh, Daniel Mentz, iommu

On Fri, Mar 21, 2025 at 02:36:41PM +0000, Pranjal Shrivastava wrote:
> > > What about IOMMUFD stuff like the Get HW info IOCTL?
> > 
> > Isn't that all from SW caches?
> >  
> 
> Not really? I see arm_smmu_hw_info reads all the IDR registers..
> The call flow I could trace is like:
> 
> iommufd_get_hw_info()
>  -> iommu_ops->hw_info()
>    -> arm_smmu_hw_info()
> 
> Am I missing something here?

Oh, I see what you mean, even the IDR won't be readable. Yeah, that
should either be copied to memory during probe or we need to wake it
up to read the IDRs..

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-21 17:35         ` Robin Murphy
@ 2025-03-24 17:36           ` Pranjal Shrivastava
  2025-03-27 17:27             ` Mostafa Saleh
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-24 17:36 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Mostafa Saleh, Jason Gunthorpe, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Fri, Mar 21, 2025 at 05:35:11PM +0000, Robin Murphy wrote:
> On 21/03/2025 2:18 pm, Pranjal Shrivastava wrote:
> > On Thu, Mar 20, 2025 at 10:25:52PM +0000, Mostafa Saleh wrote:
> > > On Wed, Mar 19, 2025 at 04:07:57PM +0000, Robin Murphy wrote:
> > > > On 19/03/2025 11:57 am, Jason Gunthorpe wrote:
> > > > > On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:
> > > > > 
> > > > > > 3. Invoking runtime_pm_get/put
> > > > > > Given that most of the configuration done by arm-smmu-v3 is stored in
> > > > > > memory, the initial idea is to focus on areas where the driver accesses
> > > > > > the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> > > > > 
> > > > > This seems weird, if the SMMU is suspended doesn't it also fail DMA
> > > > > transactions? Why would ops like flush even be called if the HW is
> > > > > disabled?
> > > > 
> > > > Because once the device has finished its operation, its driver is free to
> > > > call rpm_put() before calling dma_unmap(), so by the time that gets as far
> > > > as TLB maintenance, the SMMU may already be asleep as well if that device
> > > > was the only thing keeping it awake.
> > > > 
> > > > For direct IOMMU API users, pagetable update may be even more asynchronous
> > > > from device activity, e.g. a GPU buffer might only be unmapped once
> > > > userspace closes the last file handle referencing it, long after the GPU
> > > > itself has moved on to other things.
> > > > 
> > > > > flush is performance path stuff, so it doesn't seem great to be adding
> > > > > extra calls there.
> > > > 
> > > > That much is true - this really wants to be using pm_runtime_get_if_in_use()
> > > > nearly everywhere such that at most it's just juggling refcounts. There's no
> > > > point waking the SMMU up just to issue a CFGI or TLBI, if the act of doing
> > > > so is inherently going to do a full arm_smmu_reset() and thus invalidate
> > > > everything anyway.
> > > 
> > > AFAICT, there is no guarantees that caches are clean on system resume,
> > > but as we do invalidate everything that should be fine, but I am not sure
> 
> That was the point - we're definitely going to do a full software
> invalidation *because* we can't make any assumptions about the hardware
> state, i.e. it may come back full of valid-looking nonsense.
> 
> > I mean we do set GBPA.Abort = 1 right before suspending, I'd want to
> > assume that doing so would ensure that TLB hits don't occur anymore. Let
> > me dig into the spec to see if I can find something regarding TLB
> > behavior when GBPA.Abort = 1
> 
> GBPA doesn't matter here, it's about the CR0.SMMUEN=0 behaviour (see
> 6.3.9.6). That says "Incoming transactions [...] do not undergo
> translation," so although TLB entries are allowed to remain present, they
> must not be *used* - i.e. SMMUEN is not permitted to be cached in a TLB.
> 

Thanks for pointing me to the right section. It also mentions: 
When SMMU_(*_)CR0.SMMUEN == 0:
"Translation and configuration cache entries are not inserted or
modified, except for invalidation by maintenance commands or broadcast
operations."

So, it looks like we don't need to worry much about a disabled
programming interface causing changes to the TLB. 

However, another statement in the same section:

"Note: The ‘other’ Security state might still have SMMUEN == 1 and
therefore be inserting cache entries for that Security state. As these
entries are not visible to or affected by the Non-secure programming
interface, this is only a consideration for the Secure programming
interface which can maintain Non-secure cache entries."

Makes me think of situations where we might elide a TLB invalidate if
the SMMU is SUSPENDED but the secure world gets a hit to the 
invalidated TLB entry. The TLBI command could be a result of a simple
non-driver kernel module unmapping a page based on it's communication
with the secure world. In this case, devlinks may NOT save the day...

I know that the above situation is a burden on the SW designer or
implementer, I just want to discuss is if we have something
like the above case, that would not want us to elide TLBIs while
suspended? (I'm not able to see any case where we share pages with
the secure world at this time).


> > > how that works with distributed SMMUs where the TBU can still be powered
> > > with some TLBs that can be invalid?
> > > 
> > 
> > Hmm.. do you mean some situation like:
> > 
> >   |-----------------------|    |-------------------|
> >   | |-------|   |-------| |    |-------------------|
> >   | | Dev X |   | Dev Y | |    |                   |
> >   | | (TBU) |   |_______| |    |       SMMUv3      |
> >   | |-------|             |    |  (TCU -> TBU_X)   |
> >   |     Power_Domain A    |    |  Power_Domain B   |
> >   |-----------------------|    |-------------------|
> > 
> > Now if Dev Y isn't an SMMU client and Dev X drops the ref_count,
> > Power_Domain A would still remain ON whereas SMMUv3 might assume all
> > it's clients are down and try to suspend (power off Power_Domain B)
> > while the TLBs withing TBU_X are still powered up?
> > 
> > To avoid such a case maybe we should invalidate everything during
> > suspend? I still believe that when CR0.SMMUEN=0, it should broadcast
> > something to all it's TBUs asking them to invalidate all TLBs otherwise
> > this is a miss in the arch (unlikely for Arm) as TLB hits should never
> > occur if the SMMU is disabled. I guess I need to go through the SMMUv3
> > spec (or maybe the MMU-700 TRM) for confirming this..
> 
> I don't think this is allowed to be an issue in practice since TBUs are not
> architecturally visible. Certainly in terms of Arm's implementations, for a
> TBU to be powered off or externally clock gated it would have to do a full
> DTI disconnect (otherwise it would hang CMD_SYNC), and DTI requires that it
> must subsequently come back clean:
> 
> "The TBU must invalidate its caches before entering CONNECTED state."
> 

Interesting! Thanks for clarifying.

> Thanks,
> Robin.

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-21 15:30           ` Jason Gunthorpe
@ 2025-03-24 17:53             ` Pranjal Shrivastava
  2025-03-25 13:55               ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-24 17:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mostafa Saleh, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Fri, Mar 21, 2025 at 12:30:34PM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 21, 2025 at 02:44:09PM +0000, Pranjal Shrivastava wrote:
> 
> > However, still there's a worry about the reset value of GBPA.Abort as
> > pointed out by Robin earlier. Since the reset value of GBPA.Abort is 
> > implementation defined.. there's a chance that after a power cycle the
> > SMMU wakes up with GBPA configured to bypass.. in such a case, I don't
> > think the kernel should be responsible to ensure security..
> 
> The kernel should be responsible to operate that HW in a secure
> way. If the spec doesn't guarentee security then you will need a
> ACPI/DT flag to indicate if specific implementations are secure or
> not (ie if the implementation preserves GBPA.Abort).
> 
> Otherwise we'd have to architect around the insecurity somehow and
> prevent the SMMU from unpowering if there is any security sensitive
> attachment..

I see.. and what shall we do based on that ACPI/DT flag? 
Disable pm-runtime for security reasons? That way, the implementations
would get to chose if they *deliberately* want to enable runtime pm
despite the security issues. It can also act as an additional switch for
the pm feature.

> 
> Jason

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-24 17:53             ` Pranjal Shrivastava
@ 2025-03-25 13:55               ` Jason Gunthorpe
  2025-03-27 17:39                 ` Mostafa Saleh
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-25 13:55 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Mostafa Saleh, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Mon, Mar 24, 2025 at 05:53:32PM +0000, Pranjal Shrivastava wrote:
> On Fri, Mar 21, 2025 at 12:30:34PM -0300, Jason Gunthorpe wrote:
> > On Fri, Mar 21, 2025 at 02:44:09PM +0000, Pranjal Shrivastava wrote:
> > 
> > > However, still there's a worry about the reset value of GBPA.Abort as
> > > pointed out by Robin earlier. Since the reset value of GBPA.Abort is 
> > > implementation defined.. there's a chance that after a power cycle the
> > > SMMU wakes up with GBPA configured to bypass.. in such a case, I don't
> > > think the kernel should be responsible to ensure security..
> > 
> > The kernel should be responsible to operate that HW in a secure
> > way. If the spec doesn't guarentee security then you will need a
> > ACPI/DT flag to indicate if specific implementations are secure or
> > not (ie if the implementation preserves GBPA.Abort).
> > 
> > Otherwise we'd have to architect around the insecurity somehow and
> > prevent the SMMU from unpowering if there is any security sensitive
> > attachment..
> 
> I see.. and what shall we do based on that ACPI/DT flag? 
> Disable pm-runtime for security reasons? That way, the implementations
> would get to chose if they *deliberately* want to enable runtime pm
> despite the security issues. It can also act as an additional switch for
> the pm feature.

If this is a real problem I would probably figure out a way to mark
security sensitive attaches (like untrusted, vfio, etc) and only those
cases would prevent unpowering the smmu if the HW can't be made
secure.

You could also disable PM (or rather the flag would enable PM since we
have to be backwards compat)

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-03-19  0:42 ` [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
  2025-03-19  4:50   ` Nicolin Chen
  2025-03-20 22:29   ` Mostafa Saleh
@ 2025-03-25 16:19   ` Daniel Mentz
  2025-03-26 19:35     ` Pranjal Shrivastava
  2 siblings, 1 reply; 71+ messages in thread
From: Daniel Mentz @ 2025-03-25 16:19 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, iommu

On Tue, Mar 18, 2025 at 5:43 PM Pranjal Shrivastava <praan@google.com> wrote:
> @@ -4754,6 +4773,13 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
>         /* Check for RMRs and install bypass STEs if any */
>         arm_smmu_rmr_install_bypass_ste(smmu);
>
> +       /* Setup interrupt handlers */
> +       ret = arm_smmu_setup_irqs(smmu);
> +       if (ret) {
> +               dev_err(smmu->dev, "failed to setup irqs\n");
> +               return ret;

Do you need

goto err_free_iopf;

here instead of return ret?

> +       }
> +
>         /* Reset the device */
>         ret = arm_smmu_device_reset(smmu);
>         if (ret)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
  2025-03-19  0:42 ` [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains Pranjal Shrivastava
  2025-03-20 22:30   ` Mostafa Saleh
@ 2025-03-25 17:50   ` Daniel Mentz
  2025-03-26 19:36     ` Pranjal Shrivastava
  2025-03-26  4:51   ` Daniel Mentz
  2 siblings, 1 reply; 71+ messages in thread
From: Daniel Mentz @ 2025-03-25 17:50 UTC (permalink / raw)
  To: Pranjal Shrivastava; +Cc: Mostafa Saleh, iommu

On Tue, Mar 18, 2025 at 5:43 PM Pranjal Shrivastava <praan@google.com> wrote:
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 80362d9f293b..f65d9bca0392 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -754,6 +754,30 @@ static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
>         return __arm_smmu_cmdq_poll_until_consumed(smmu, cmdq, llq);
>  }
>
> +/*
> + * Wait until the SMMU cmdq is empty

nit: There is an existing comment that reads "/* Wait for the command
queue to become non-full */". Align with that and write
"/* Wait for the command queue to become empty */"

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
  2025-03-19  0:42 ` [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains Pranjal Shrivastava
  2025-03-20 22:30   ` Mostafa Saleh
  2025-03-25 17:50   ` Daniel Mentz
@ 2025-03-26  4:51   ` Daniel Mentz
  2025-03-26 20:10     ` Pranjal Shrivastava
  2 siblings, 1 reply; 71+ messages in thread
From: Daniel Mentz @ 2025-03-26  4:51 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, iommu

On Tue, Mar 18, 2025 at 5:43 PM Pranjal Shrivastava <praan@google.com> wrote:
> +/*
> + * Wait until the SMMU cmdq is empty
> + * Must be called with the cmdq lock held in some capacity.
> + */
> +static int arm_smmu_cmdq_poll_until_empty(struct arm_smmu_device *smmu,
> +                                         struct arm_smmu_cmdq *cmdq,
> +                                         struct arm_smmu_ll_queue *llq)
> +{
> +       struct arm_smmu_queue_poll qp;
> +       int ret = 0;
> +
> +       queue_poll_init(smmu, &qp);
> +       llq->val = READ_ONCE(cmdq->q.llq.val);

In a later patch, I believe you're calling this function with llq
pointing to cmdq->q.llq in which case this assignment would be
redundant.

As an alternative approach, I'm wondering if you can reuse the
function __arm_smmu_cmdq_poll_until_consumed. Just take the current
producer index and wait for it to get consumed.



> +       do {
> +               if (queue_empty(llq))
> +                       break;
> +
> +               ret = queue_poll(&qp);
> +               llq->cons = readl(cmdq->q.cons_reg);
> +       } while (!ret);
> +
> +       return ret;
> +}
> +
>  static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
>                                         u32 prod, int n)
>  {
> --
> 2.49.0.rc1.451.g8f38331e32-goog
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-03-19  0:42 ` [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
  2025-03-20 22:33   ` Mostafa Saleh
@ 2025-03-26  4:52   ` Daniel Mentz
  2025-03-28  7:47     ` Pranjal Shrivastava
  2025-04-14 17:57   ` Nicolin Chen
  2 siblings, 1 reply; 71+ messages in thread
From: Daniel Mentz @ 2025-03-26  4:52 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, iommu

On Tue, Mar 18, 2025 at 5:43 PM Pranjal Shrivastava <praan@google.com> wrote:
> +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> +{
> +       struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> +
> +       /* We might get the vcmdq */
> +       struct arm_smmu_cmdq_ent cmd = {
> +               .opcode = smmu->features & ARM_SMMU_FEAT_E2H ?
> +                       CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> +       };
> +
> +       struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> +       struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> +
> +       /*
> +        * Since suspend is invoked when all clients have been
> +        * we don't expect more commands to be added to the cmdq.
> +        * Thus, wait for all existing commands to complete.
> +        */
> +       arm_smmu_cmdq_shared_lock(cmdq);
> +       arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);

llq is pointing to cmdq->q.llq which means that
arm_smmu_cmdq_poll_until_empty will write to cmdq->q.llq.cons which I
understand is not allowed unless you are holding the exclusive lock or
are the only thread holding a shared lock.

How about the following:

1. Call arm_smmu_cmdq_exclusive_trylock_irqsave(). If this fails,
print an error message.
2. Call __arm_smmu_cmdq_poll_until_consumed on the current producer index.
3. Update cmdq->q.llq.cons
4. Call arm_smmu_cmdq_exclusive_unlock_irqrestore()


> +       arm_smmu_cmdq_shared_unlock(cmdq);
> +
> +       /* Disable all queues */
> +       arm_smmu_device_disable(smmu);
> +
> +       /* Abort all transactions to avoid spurious bypass */
> +       arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
> +
> +       /* Free all MSIs (re-allocated on resume) */
> +       arm_smmu_free_msis(dev);
> +
> +       dev_dbg(dev, "Suspending smmu\n");
> +       return 0;
> +}

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-03-25 16:19   ` Daniel Mentz
@ 2025-03-26 19:35     ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-26 19:35 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, iommu

On Tue, Mar 25, 2025 at 09:19:36AM -0700, Daniel Mentz wrote:
> On Tue, Mar 18, 2025 at 5:43 PM Pranjal Shrivastava <praan@google.com> wrote:
> > @@ -4754,6 +4773,13 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
> >         /* Check for RMRs and install bypass STEs if any */
> >         arm_smmu_rmr_install_bypass_ste(smmu);
> >
> > +       /* Setup interrupt handlers */
> > +       ret = arm_smmu_setup_irqs(smmu);
> > +       if (ret) {
> > +               dev_err(smmu->dev, "failed to setup irqs\n");
> > +               return ret;
> 
> Do you need
> 
> goto err_free_iopf;
> 
> here instead of return ret?
> 

Yes! Great catch! Thanks!

> > +       }
> > +
> >         /* Reset the device */
> >         ret = arm_smmu_device_reset(smmu);
> >         if (ret)

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
  2025-03-25 17:50   ` Daniel Mentz
@ 2025-03-26 19:36     ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-26 19:36 UTC (permalink / raw)
  To: Daniel Mentz; +Cc: Mostafa Saleh, iommu

On Tue, Mar 25, 2025 at 10:50:34AM -0700, Daniel Mentz wrote:
> On Tue, Mar 18, 2025 at 5:43 PM Pranjal Shrivastava <praan@google.com> wrote:
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 80362d9f293b..f65d9bca0392 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -754,6 +754,30 @@ static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
> >         return __arm_smmu_cmdq_poll_until_consumed(smmu, cmdq, llq);
> >  }
> >
> > +/*
> > + * Wait until the SMMU cmdq is empty
> 
> nit: There is an existing comment that reads "/* Wait for the command
> queue to become non-full */". Align with that and write
> "/* Wait for the command queue to become empty */"

Ack.

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains
  2025-03-26  4:51   ` Daniel Mentz
@ 2025-03-26 20:10     ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-26 20:10 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, iommu

On Tue, Mar 25, 2025 at 09:51:52PM -0700, Daniel Mentz wrote:
> On Tue, Mar 18, 2025 at 5:43 PM Pranjal Shrivastava <praan@google.com> wrote:
> > +/*
> > + * Wait until the SMMU cmdq is empty
> > + * Must be called with the cmdq lock held in some capacity.
> > + */
> > +static int arm_smmu_cmdq_poll_until_empty(struct arm_smmu_device *smmu,
> > +                                         struct arm_smmu_cmdq *cmdq,
> > +                                         struct arm_smmu_ll_queue *llq)
> > +{
> > +       struct arm_smmu_queue_poll qp;
> > +       int ret = 0;
> > +
> > +       queue_poll_init(smmu, &qp);
> > +       llq->val = READ_ONCE(cmdq->q.llq.val);
> 
> In a later patch, I believe you're calling this function with llq
> pointing to cmdq->q.llq in which case this assignment would be
> redundant.
> 

Ack. Will fix it.

> As an alternative approach, I'm wondering if you can reuse the
> function __arm_smmu_cmdq_poll_until_consumed. Just take the current
> producer index and wait for it to get consumed.
>

Actually, that could cause some trouble because the function,
__arm_smmu_cmdq_poll_until_consumed, waits till the cons __crosses__ the
prod by relying on `queue_consumed`. This could cause trouble as, when
the queue is empty, the cons won't cross prod and hence we'll keep
waiting in suspend... 

Instead, the `arm_smmu_cmdq_poll_until_empty` relies on `queue_empty` to
confirm that the queue has been drained before suspending..

>
> 
> > +       do {
> > +               if (queue_empty(llq))
> > +                       break;
> > +
> > +               ret = queue_poll(&qp);
> > +               llq->cons = readl(cmdq->q.cons_reg);
> > +       } while (!ret);
> > +
> > +       return ret;
> > +}
> > +
> >  static void arm_smmu_cmdq_write_entries(struct arm_smmu_cmdq *cmdq, u64 *cmds,
> >                                         u32 prod, int n)
> >  {
> > --
> > 2.49.0.rc1.451.g8f38331e32-goog
> >

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-24 17:36           ` Pranjal Shrivastava
@ 2025-03-27 17:27             ` Mostafa Saleh
  2025-03-28  9:13               ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-27 17:27 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Robin Murphy, Jason Gunthorpe, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Mon, Mar 24, 2025 at 05:36:43PM +0000, Pranjal Shrivastava wrote:
> On Fri, Mar 21, 2025 at 05:35:11PM +0000, Robin Murphy wrote:
> > On 21/03/2025 2:18 pm, Pranjal Shrivastava wrote:
> > > On Thu, Mar 20, 2025 at 10:25:52PM +0000, Mostafa Saleh wrote:
> > > > On Wed, Mar 19, 2025 at 04:07:57PM +0000, Robin Murphy wrote:
> > > > > On 19/03/2025 11:57 am, Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:
> > > > > > 
> > > > > > > 3. Invoking runtime_pm_get/put
> > > > > > > Given that most of the configuration done by arm-smmu-v3 is stored in
> > > > > > > memory, the initial idea is to focus on areas where the driver accesses
> > > > > > > the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> > > > > > 
> > > > > > This seems weird, if the SMMU is suspended doesn't it also fail DMA
> > > > > > transactions? Why would ops like flush even be called if the HW is
> > > > > > disabled?
> > > > > 
> > > > > Because once the device has finished its operation, its driver is free to
> > > > > call rpm_put() before calling dma_unmap(), so by the time that gets as far
> > > > > as TLB maintenance, the SMMU may already be asleep as well if that device
> > > > > was the only thing keeping it awake.
> > > > > 
> > > > > For direct IOMMU API users, pagetable update may be even more asynchronous
> > > > > from device activity, e.g. a GPU buffer might only be unmapped once
> > > > > userspace closes the last file handle referencing it, long after the GPU
> > > > > itself has moved on to other things.
> > > > > 
> > > > > > flush is performance path stuff, so it doesn't seem great to be adding
> > > > > > extra calls there.
> > > > > 
> > > > > That much is true - this really wants to be using pm_runtime_get_if_in_use()
> > > > > nearly everywhere such that at most it's just juggling refcounts. There's no
> > > > > point waking the SMMU up just to issue a CFGI or TLBI, if the act of doing
> > > > > so is inherently going to do a full arm_smmu_reset() and thus invalidate
> > > > > everything anyway.
> > > > 
> > > > AFAICT, there is no guarantees that caches are clean on system resume,
> > > > but as we do invalidate everything that should be fine, but I am not sure
> > 
> > That was the point - we're definitely going to do a full software
> > invalidation *because* we can't make any assumptions about the hardware
> > state, i.e. it may come back full of valid-looking nonsense.
> > 
> > > I mean we do set GBPA.Abort = 1 right before suspending, I'd want to
> > > assume that doing so would ensure that TLB hits don't occur anymore. Let
> > > me dig into the spec to see if I can find something regarding TLB
> > > behavior when GBPA.Abort = 1
> > 
> > GBPA doesn't matter here, it's about the CR0.SMMUEN=0 behaviour (see
> > 6.3.9.6). That says "Incoming transactions [...] do not undergo
> > translation," so although TLB entries are allowed to remain present, they
> > must not be *used* - i.e. SMMUEN is not permitted to be cached in a TLB.
> > 
> 
> Thanks for pointing me to the right section. It also mentions: 
> When SMMU_(*_)CR0.SMMUEN == 0:
> "Translation and configuration cache entries are not inserted or
> modified, except for invalidation by maintenance commands or broadcast
> operations."
> 
> So, it looks like we don't need to worry much about a disabled
> programming interface causing changes to the TLB. 

I am not sure I am following, but we can’t  guarantee that GBPA.abort
is 1 after resume nor SMMUEN is 1

> 
> However, another statement in the same section:
> 
> "Note: The ‘other’ Security state might still have SMMUEN == 1 and
> therefore be inserting cache entries for that Security state. As these
> entries are not visible to or affected by the Non-secure programming
> interface, this is only a consideration for the Secure programming
> interface which can maintain Non-secure cache entries."
> 
> Makes me think of situations where we might elide a TLB invalidate if
> the SMMU is SUSPENDED but the secure world gets a hit to the 
> invalidated TLB entry. The TLBI command could be a result of a simple
> non-driver kernel module unmapping a page based on it's communication
> with the secure world. In this case, devlinks may NOT save the day...
> 
> I know that the above situation is a burden on the SW designer or
> implementer, I just want to discuss is if we have something
> like the above case, that would not want us to elide TLBIs while
> suspended? (I'm not able to see any case where we share pages with
> the secure world at this time).

I don’t think we care about secure world, TLB would be tagged with the
NS bit, and at the moment there is nothing in the driver that interacts
with TZ on sharing NS page tables (we change IOVAs all the time and
there is no way TZ knows that)

Thanks,
Mostafa

> 
> 
> > > > how that works with distributed SMMUs where the TBU can still be powered
> > > > with some TLBs that can be invalid?
> > > > 
> > > 
> > > Hmm.. do you mean some situation like:
> > > 
> > >   |-----------------------|    |-------------------|
> > >   | |-------|   |-------| |    |-------------------|
> > >   | | Dev X |   | Dev Y | |    |                   |
> > >   | | (TBU) |   |_______| |    |       SMMUv3      |
> > >   | |-------|             |    |  (TCU -> TBU_X)   |
> > >   |     Power_Domain A    |    |  Power_Domain B   |
> > >   |-----------------------|    |-------------------|
> > > 
> > > Now if Dev Y isn't an SMMU client and Dev X drops the ref_count,
> > > Power_Domain A would still remain ON whereas SMMUv3 might assume all
> > > it's clients are down and try to suspend (power off Power_Domain B)
> > > while the TLBs withing TBU_X are still powered up?
> > > 
> > > To avoid such a case maybe we should invalidate everything during
> > > suspend? I still believe that when CR0.SMMUEN=0, it should broadcast
> > > something to all it's TBUs asking them to invalidate all TLBs otherwise
> > > this is a miss in the arch (unlikely for Arm) as TLB hits should never
> > > occur if the SMMU is disabled. I guess I need to go through the SMMUv3
> > > spec (or maybe the MMU-700 TRM) for confirming this..
> > 
> > I don't think this is allowed to be an issue in practice since TBUs are not
> > architecturally visible. Certainly in terms of Arm's implementations, for a
> > TBU to be powered off or externally clock gated it would have to do a full
> > DTI disconnect (otherwise it would hang CMD_SYNC), and DTI requires that it
> > must subsequently come back clean:
> > 
> > "The TBU must invalidate its caches before entering CONNECTED state."
> > 
> 
> Interesting! Thanks for clarifying.
> 
> > Thanks,
> > Robin.
> 
> Thanks,
> Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-25 13:55               ` Jason Gunthorpe
@ 2025-03-27 17:39                 ` Mostafa Saleh
  2025-03-28 13:21                   ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Mostafa Saleh @ 2025-03-27 17:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pranjal Shrivastava, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Tue, Mar 25, 2025 at 10:55:12AM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 24, 2025 at 05:53:32PM +0000, Pranjal Shrivastava wrote:
> > On Fri, Mar 21, 2025 at 12:30:34PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Mar 21, 2025 at 02:44:09PM +0000, Pranjal Shrivastava wrote:
> > > 
> > > > However, still there's a worry about the reset value of GBPA.Abort as
> > > > pointed out by Robin earlier. Since the reset value of GBPA.Abort is 
> > > > implementation defined.. there's a chance that after a power cycle the
> > > > SMMU wakes up with GBPA configured to bypass.. in such a case, I don't
> > > > think the kernel should be responsible to ensure security..
> > > 
> > > The kernel should be responsible to operate that HW in a secure
> > > way. If the spec doesn't guarentee security then you will need a
> > > ACPI/DT flag to indicate if specific implementations are secure or
> > > not (ie if the implementation preserves GBPA.Abort).
> > > 
> > > Otherwise we'd have to architect around the insecurity somehow and
> > > prevent the SMMU from unpowering if there is any security sensitive
> > > attachment..
> > 
> > I see.. and what shall we do based on that ACPI/DT flag? 
> > Disable pm-runtime for security reasons? That way, the implementations
> > would get to chose if they *deliberately* want to enable runtime pm
> > despite the security issues. It can also act as an additional switch for
> > the pm feature.
> 
> If this is a real problem I would probably figure out a way to mark
> security sensitive attaches (like untrusted, vfio, etc) and only those
> cases would prevent unpowering the smmu if the HW can't be made
> secure.
> 
> You could also disable PM (or rather the flag would enable PM since we
> have to be backwards compat)

I have been thinking more about this.
For platform devices, we don’t do rpm and VFIO gets a PM reference at
open for the whole period of the fd.

For PCI, I see it provides IOCTLs which interacts with RPM
VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT

Ideally, linux RPM would handle ordering due to consumer/supplier links,
but in case both the SMMU and the device share the same power domain,
there is this window of time where a malicious userspace can program the
device while there is an unclear SMMU state.

I agree, I don’t like the approach of adding a (yet another) flag for
firmware to populate, I guess we can either:
- Prevent SMMUv3 suspend, and that’s already the status quo as it
  doesn’t support RPM, instead of doing that from the driver by
  detecting non-secure attaches, we can just make VFIO takes a PM
  reference on the IOMMU.

- Disable the PCI device, or unmap it from userspace from VFIO suspend
  handler, and retrieve it’s state on the VFIO resume handler which would
  be ordered after the SMMUv3.

I don't know much about VFIO-pci(and very little about PCI), but I wonder
what happens if userspace powerdown the device and issued a read/write system
call, wouldn’t that cause the kernel to crash? It seems we need to prevent
userspace from interacting with the device while powered off anyway.

Thanks,
Mostafa

> 
> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-03-26  4:52   ` Daniel Mentz
@ 2025-03-28  7:47     ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-28  7:47 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, iommu

On Tue, Mar 25, 2025 at 09:52:37PM -0700, Daniel Mentz wrote:
> On Tue, Mar 18, 2025 at 5:43 PM Pranjal Shrivastava <praan@google.com> wrote:
> > +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> > +{
> > +       struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> > +
> > +       /* We might get the vcmdq */
> > +       struct arm_smmu_cmdq_ent cmd = {
> > +               .opcode = smmu->features & ARM_SMMU_FEAT_E2H ?
> > +                       CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> > +       };
> > +
> > +       struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> > +       struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> > +
> > +       /*
> > +        * Since suspend is invoked when all clients have been
> > +        * we don't expect more commands to be added to the cmdq.
> > +        * Thus, wait for all existing commands to complete.
> > +        */
> > +       arm_smmu_cmdq_shared_lock(cmdq);
> > +       arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> 
> llq is pointing to cmdq->q.llq which means that
> arm_smmu_cmdq_poll_until_empty will write to cmdq->q.llq.cons which I
> understand is not allowed unless you are holding the exclusive lock or
> are the only thread holding a shared lock.
> 
> How about the following:
> 
> 1. Call arm_smmu_cmdq_exclusive_trylock_irqsave(). If this fails,
> print an error message.
> 2. Call __arm_smmu_cmdq_poll_until_consumed on the current producer index.
> 3. Update cmdq->q.llq.cons
> 4. Call arm_smmu_cmdq_exclusive_unlock_irqrestore()
> 

As mentioned in the other thread, __arm_smmu_cmdq_poll_until_consumed
relies on `queue_consumed` which waits for the prod to cross cons idx.
We could potentially call poll_until_consumed on `prod - 1 but it feels`
hacky..relying on `queue_empty` should be better.

About locking, that is right, but I was assuming since we are in the
suspend callback here, no other client is operating the SMMU. Due to the
devlinks, I don't see a path that could be potentially racy here but if
we *somehow* fail to get the lock, I guess we should stop the suspend
along with logging the error message.

Do you see a case where we might not be the only owner of the cmdq while
suspending?

> 
> > +       arm_smmu_cmdq_shared_unlock(cmdq);
> > +
> > +       /* Disable all queues */
> > +       arm_smmu_device_disable(smmu);
> > +
> > +       /* Abort all transactions to avoid spurious bypass */
> > +       arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
> > +
> > +       /* Free all MSIs (re-allocated on resume) */
> > +       arm_smmu_free_msis(dev);
> > +
> > +       dev_dbg(dev, "Suspending smmu\n");
> > +       return 0;
> > +}

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-27 17:27             ` Mostafa Saleh
@ 2025-03-28  9:13               ` Pranjal Shrivastava
  2025-03-28  9:19                 ` Pranjal Shrivastava
  2025-03-28 13:18                 ` Jason Gunthorpe
  0 siblings, 2 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-28  9:13 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Robin Murphy, Jason Gunthorpe, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Thu, Mar 27, 2025 at 05:27:49PM +0000, Mostafa Saleh wrote:
> On Mon, Mar 24, 2025 at 05:36:43PM +0000, Pranjal Shrivastava wrote:
> > On Fri, Mar 21, 2025 at 05:35:11PM +0000, Robin Murphy wrote:
> > > On 21/03/2025 2:18 pm, Pranjal Shrivastava wrote:
> > > > On Thu, Mar 20, 2025 at 10:25:52PM +0000, Mostafa Saleh wrote:
> > > > > On Wed, Mar 19, 2025 at 04:07:57PM +0000, Robin Murphy wrote:
> > > > > > On 19/03/2025 11:57 am, Jason Gunthorpe wrote:
> > > > > > > On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:
> > > > > > > 
> > > > > > > > 3. Invoking runtime_pm_get/put
> > > > > > > > Given that most of the configuration done by arm-smmu-v3 is stored in
> > > > > > > > memory, the initial idea is to focus on areas where the driver accesses
> > > > > > > > the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> > > > > > > 
> > > > > > > This seems weird, if the SMMU is suspended doesn't it also fail DMA
> > > > > > > transactions? Why would ops like flush even be called if the HW is
> > > > > > > disabled?
> > > > > > 
> > > > > > Because once the device has finished its operation, its driver is free to
> > > > > > call rpm_put() before calling dma_unmap(), so by the time that gets as far
> > > > > > as TLB maintenance, the SMMU may already be asleep as well if that device
> > > > > > was the only thing keeping it awake.
> > > > > > 
> > > > > > For direct IOMMU API users, pagetable update may be even more asynchronous
> > > > > > from device activity, e.g. a GPU buffer might only be unmapped once
> > > > > > userspace closes the last file handle referencing it, long after the GPU
> > > > > > itself has moved on to other things.
> > > > > > 
> > > > > > > flush is performance path stuff, so it doesn't seem great to be adding
> > > > > > > extra calls there.
> > > > > > 
> > > > > > That much is true - this really wants to be using pm_runtime_get_if_in_use()
> > > > > > nearly everywhere such that at most it's just juggling refcounts. There's no
> > > > > > point waking the SMMU up just to issue a CFGI or TLBI, if the act of doing
> > > > > > so is inherently going to do a full arm_smmu_reset() and thus invalidate
> > > > > > everything anyway.
> > > > > 
> > > > > AFAICT, there is no guarantees that caches are clean on system resume,
> > > > > but as we do invalidate everything that should be fine, but I am not sure
> > > 
> > > That was the point - we're definitely going to do a full software
> > > invalidation *because* we can't make any assumptions about the hardware
> > > state, i.e. it may come back full of valid-looking nonsense.
> > > 
> > > > I mean we do set GBPA.Abort = 1 right before suspending, I'd want to
> > > > assume that doing so would ensure that TLB hits don't occur anymore. Let
> > > > me dig into the spec to see if I can find something regarding TLB
> > > > behavior when GBPA.Abort = 1
> > > 
> > > GBPA doesn't matter here, it's about the CR0.SMMUEN=0 behaviour (see
> > > 6.3.9.6). That says "Incoming transactions [...] do not undergo
> > > translation," so although TLB entries are allowed to remain present, they
> > > must not be *used* - i.e. SMMUEN is not permitted to be cached in a TLB.
> > > 
> > 
> > Thanks for pointing me to the right section. It also mentions: 
> > When SMMU_(*_)CR0.SMMUEN == 0:
> > "Translation and configuration cache entries are not inserted or
> > modified, except for invalidation by maintenance commands or broadcast
> > operations."
> > 
> > So, it looks like we don't need to worry much about a disabled
> > programming interface causing changes to the TLB. 
> 
> I am not sure I am following, but we can’t  guarantee that GBPA.abort
> is 1 after resume nor SMMUEN is 1
> 

Correct. For GBPA, I think Jason suggested to have implementations
declare if an implementation resets with GBPA.Abort=1 in a DT/ACPI flag.
Based on that we can take calls, for e.g. don't suspend when there's
some security sensitive attachement. Although, we need to define what
would a "security sensitive attachment" be.

However, as per the spec, SMMUEN resets with 0 (section 6.3.9).
And if SMMUEN=0, as per the spec, "the Configuration or translation
structures are not accessed". Thus, when SMMUEN=0, the config cache &
TLB, both shall not be accessed.. However, the spec mentions that
invalidation of cached entries is supported when SMMUEN=0 & CMDQEN=1
(which is what the driver does in arm_smmu_device_reset).

Other than that, the MMU-700 implementation spec[1] mentions that 
certain revisions might access config and translation structures while
SMMUEN=0 (through an impl defined prefetch), and one of the provided SW
workarounds is to issue invalidate all with SMMUEN=0 after the SW has
finished all modifications to config/translation structures, which seems
to be happening anyway with arm_smmu_device_reset.

> > 
> > However, another statement in the same section:
> > 
> > "Note: The ‘other’ Security state might still have SMMUEN == 1 and
> > therefore be inserting cache entries for that Security state. As these
> > entries are not visible to or affected by the Non-secure programming
> > interface, this is only a consideration for the Secure programming
> > interface which can maintain Non-secure cache entries."
> > 
> > Makes me think of situations where we might elide a TLB invalidate if
> > the SMMU is SUSPENDED but the secure world gets a hit to the 
> > invalidated TLB entry. The TLBI command could be a result of a simple
> > non-driver kernel module unmapping a page based on it's communication
> > with the secure world. In this case, devlinks may NOT save the day...
> > 
> > I know that the above situation is a burden on the SW designer or
> > implementer, I just want to discuss is if we have something
> > like the above case, that would not want us to elide TLBIs while
> > suspended? (I'm not able to see any case where we share pages with
> > the secure world at this time).
> 
> I don’t think we care about secure world, TLB would be tagged with the
> NS bit, and at the moment there is nothing in the driver that interacts
> with TZ on sharing NS page tables (we change IOVAs all the time and
> there is no way TZ knows that)
> 

Ack. If we don't care about the secure world, I think we are covered in
terms of TLB / config cache invalidations. On suspend, we set SMMUEN=0
and on resume (in arm_smmu_device_reset), we invalidate the caches/TLBs
before setting SMMUEN=1.

The only thing that remains is GBPA resetting to bypass, where I think
Jason's suggestion for a flag could potentially help.

> Thanks,
> Mostafa
> 

Thanks,
Praan

> > 
> > 
> > > > > how that works with distributed SMMUs where the TBU can still be powered
> > > > > with some TLBs that can be invalid?
> > > > > 
> > > > 
> > > > Hmm.. do you mean some situation like:
> > > > 
> > > >   |-----------------------|    |-------------------|
> > > >   | |-------|   |-------| |    |-------------------|
> > > >   | | Dev X |   | Dev Y | |    |                   |
> > > >   | | (TBU) |   |_______| |    |       SMMUv3      |
> > > >   | |-------|             |    |  (TCU -> TBU_X)   |
> > > >   |     Power_Domain A    |    |  Power_Domain B   |
> > > >   |-----------------------|    |-------------------|
> > > > 
> > > > Now if Dev Y isn't an SMMU client and Dev X drops the ref_count,
> > > > Power_Domain A would still remain ON whereas SMMUv3 might assume all
> > > > it's clients are down and try to suspend (power off Power_Domain B)
> > > > while the TLBs withing TBU_X are still powered up?
> > > > 
> > > > To avoid such a case maybe we should invalidate everything during
> > > > suspend? I still believe that when CR0.SMMUEN=0, it should broadcast
> > > > something to all it's TBUs asking them to invalidate all TLBs otherwise
> > > > this is a miss in the arch (unlikely for Arm) as TLB hits should never
> > > > occur if the SMMU is disabled. I guess I need to go through the SMMUv3
> > > > spec (or maybe the MMU-700 TRM) for confirming this..
> > > 
> > > I don't think this is allowed to be an issue in practice since TBUs are not
> > > architecturally visible. Certainly in terms of Arm's implementations, for a
> > > TBU to be powered off or externally clock gated it would have to do a full
> > > DTI disconnect (otherwise it would hang CMD_SYNC), and DTI requires that it
> > > must subsequently come back clean:
> > > 
> > > "The TBU must invalidate its caches before entering CONNECTED state."
> > > 
> > 
> > Interesting! Thanks for clarifying.
> > 
> > > Thanks,
> > > Robin.
> > 
> > Thanks,
> > Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-28  9:13               ` Pranjal Shrivastava
@ 2025-03-28  9:19                 ` Pranjal Shrivastava
  2025-03-28 13:18                 ` Jason Gunthorpe
  1 sibling, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-28  9:19 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Robin Murphy, Jason Gunthorpe, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Fri, Mar 28, 2025 at 09:13:01AM +0000, Pranjal Shrivastava wrote:
> On Thu, Mar 27, 2025 at 05:27:49PM +0000, Mostafa Saleh wrote:
> > On Mon, Mar 24, 2025 at 05:36:43PM +0000, Pranjal Shrivastava wrote:
> > > On Fri, Mar 21, 2025 at 05:35:11PM +0000, Robin Murphy wrote:
> > > > On 21/03/2025 2:18 pm, Pranjal Shrivastava wrote:
> > > > > On Thu, Mar 20, 2025 at 10:25:52PM +0000, Mostafa Saleh wrote:
> > > > > > On Wed, Mar 19, 2025 at 04:07:57PM +0000, Robin Murphy wrote:
> > > > > > > On 19/03/2025 11:57 am, Jason Gunthorpe wrote:
> > > > > > > > On Wed, Mar 19, 2025 at 12:42:49AM +0000, Pranjal Shrivastava wrote:
> > > > > > > > 
> > > > > > > > > 3. Invoking runtime_pm_get/put
> > > > > > > > > Given that most of the configuration done by arm-smmu-v3 is stored in
> > > > > > > > > memory, the initial idea is to focus on areas where the driver accesses
> > > > > > > > > the hw via exposed ops, like iommmu_ops, iommu_flush_ops, sva_ops etc.
> > > > > > > > 
> > > > > > > > This seems weird, if the SMMU is suspended doesn't it also fail DMA
> > > > > > > > transactions? Why would ops like flush even be called if the HW is
> > > > > > > > disabled?
> > > > > > > 
> > > > > > > Because once the device has finished its operation, its driver is free to
> > > > > > > call rpm_put() before calling dma_unmap(), so by the time that gets as far
> > > > > > > as TLB maintenance, the SMMU may already be asleep as well if that device
> > > > > > > was the only thing keeping it awake.
> > > > > > > 
> > > > > > > For direct IOMMU API users, pagetable update may be even more asynchronous
> > > > > > > from device activity, e.g. a GPU buffer might only be unmapped once
> > > > > > > userspace closes the last file handle referencing it, long after the GPU
> > > > > > > itself has moved on to other things.
> > > > > > > 
> > > > > > > > flush is performance path stuff, so it doesn't seem great to be adding
> > > > > > > > extra calls there.
> > > > > > > 
> > > > > > > That much is true - this really wants to be using pm_runtime_get_if_in_use()
> > > > > > > nearly everywhere such that at most it's just juggling refcounts. There's no
> > > > > > > point waking the SMMU up just to issue a CFGI or TLBI, if the act of doing
> > > > > > > so is inherently going to do a full arm_smmu_reset() and thus invalidate
> > > > > > > everything anyway.
> > > > > > 
> > > > > > AFAICT, there is no guarantees that caches are clean on system resume,
> > > > > > but as we do invalidate everything that should be fine, but I am not sure
> > > > 
> > > > That was the point - we're definitely going to do a full software
> > > > invalidation *because* we can't make any assumptions about the hardware
> > > > state, i.e. it may come back full of valid-looking nonsense.
> > > > 
> > > > > I mean we do set GBPA.Abort = 1 right before suspending, I'd want to
> > > > > assume that doing so would ensure that TLB hits don't occur anymore. Let
> > > > > me dig into the spec to see if I can find something regarding TLB
> > > > > behavior when GBPA.Abort = 1
> > > > 
> > > > GBPA doesn't matter here, it's about the CR0.SMMUEN=0 behaviour (see
> > > > 6.3.9.6). That says "Incoming transactions [...] do not undergo
> > > > translation," so although TLB entries are allowed to remain present, they
> > > > must not be *used* - i.e. SMMUEN is not permitted to be cached in a TLB.
> > > > 
> > > 
> > > Thanks for pointing me to the right section. It also mentions: 
> > > When SMMU_(*_)CR0.SMMUEN == 0:
> > > "Translation and configuration cache entries are not inserted or
> > > modified, except for invalidation by maintenance commands or broadcast
> > > operations."
> > > 
> > > So, it looks like we don't need to worry much about a disabled
> > > programming interface causing changes to the TLB. 
> > 
> > I am not sure I am following, but we can’t  guarantee that GBPA.abort
> > is 1 after resume nor SMMUEN is 1
> > 
> 
> Correct. For GBPA, I think Jason suggested to have implementations
> declare if an implementation resets with GBPA.Abort=1 in a DT/ACPI flag.
> Based on that we can take calls, for e.g. don't suspend when there's
> some security sensitive attachement. Although, we need to define what
> would a "security sensitive attachment" be.
> 
> However, as per the spec, SMMUEN resets with 0 (section 6.3.9).
> And if SMMUEN=0, as per the spec, "the Configuration or translation
> structures are not accessed". Thus, when SMMUEN=0, the config cache &
> TLB, both shall not be accessed.. However, the spec mentions that
> invalidation of cached entries is supported when SMMUEN=0 & CMDQEN=1
> (which is what the driver does in arm_smmu_device_reset).
> 
> Other than that, the MMU-700 implementation spec[1] mentions that 
> certain revisions might access config and translation structures while
> SMMUEN=0 (through an impl defined prefetch), and one of the provided SW
> workarounds is to issue invalidate all with SMMUEN=0 after the SW has
> finished all modifications to config/translation structures, which seems
> to be happening anyway with arm_smmu_device_reset.
> 
> > > 
> > > However, another statement in the same section:
> > > 
> > > "Note: The ‘other’ Security state might still have SMMUEN == 1 and
> > > therefore be inserting cache entries for that Security state. As these
> > > entries are not visible to or affected by the Non-secure programming
> > > interface, this is only a consideration for the Secure programming
> > > interface which can maintain Non-secure cache entries."
> > > 
> > > Makes me think of situations where we might elide a TLB invalidate if
> > > the SMMU is SUSPENDED but the secure world gets a hit to the 
> > > invalidated TLB entry. The TLBI command could be a result of a simple
> > > non-driver kernel module unmapping a page based on it's communication
> > > with the secure world. In this case, devlinks may NOT save the day...
> > > 
> > > I know that the above situation is a burden on the SW designer or
> > > implementer, I just want to discuss is if we have something
> > > like the above case, that would not want us to elide TLBIs while
> > > suspended? (I'm not able to see any case where we share pages with
> > > the secure world at this time).
> > 
> > I don’t think we care about secure world, TLB would be tagged with the
> > NS bit, and at the moment there is nothing in the driver that interacts
> > with TZ on sharing NS page tables (we change IOVAs all the time and
> > there is no way TZ knows that)
> > 
> 
> Ack. If we don't care about the secure world, I think we are covered in
> terms of TLB / config cache invalidations. On suspend, we set SMMUEN=0
> and on resume (in arm_smmu_device_reset), we invalidate the caches/TLBs
> before setting SMMUEN=1.
> 
> The only thing that remains is GBPA resetting to bypass, where I think
> Jason's suggestion for a flag could potentially help.
> 
> > Thanks,
> > Mostafa
> > 
> 
> Thanks,
> Praan
> 

Forgot the attachment:
[1] https://developer.arm.com/documentation/SDEN-1786925/latest/

> > > 
> > > 
> > > > > > how that works with distributed SMMUs where the TBU can still be powered
> > > > > > with some TLBs that can be invalid?
> > > > > > 
> > > > > 
> > > > > Hmm.. do you mean some situation like:
> > > > > 
> > > > >   |-----------------------|    |-------------------|
> > > > >   | |-------|   |-------| |    |-------------------|
> > > > >   | | Dev X |   | Dev Y | |    |                   |
> > > > >   | | (TBU) |   |_______| |    |       SMMUv3      |
> > > > >   | |-------|             |    |  (TCU -> TBU_X)   |
> > > > >   |     Power_Domain A    |    |  Power_Domain B   |
> > > > >   |-----------------------|    |-------------------|
> > > > > 
> > > > > Now if Dev Y isn't an SMMU client and Dev X drops the ref_count,
> > > > > Power_Domain A would still remain ON whereas SMMUv3 might assume all
> > > > > it's clients are down and try to suspend (power off Power_Domain B)
> > > > > while the TLBs withing TBU_X are still powered up?
> > > > > 
> > > > > To avoid such a case maybe we should invalidate everything during
> > > > > suspend? I still believe that when CR0.SMMUEN=0, it should broadcast
> > > > > something to all it's TBUs asking them to invalidate all TLBs otherwise
> > > > > this is a miss in the arch (unlikely for Arm) as TLB hits should never
> > > > > occur if the SMMU is disabled. I guess I need to go through the SMMUv3
> > > > > spec (or maybe the MMU-700 TRM) for confirming this..
> > > > 
> > > > I don't think this is allowed to be an issue in practice since TBUs are not
> > > > architecturally visible. Certainly in terms of Arm's implementations, for a
> > > > TBU to be powered off or externally clock gated it would have to do a full
> > > > DTI disconnect (otherwise it would hang CMD_SYNC), and DTI requires that it
> > > > must subsequently come back clean:
> > > > 
> > > > "The TBU must invalidate its caches before entering CONNECTED state."
> > > > 
> > > 
> > > Interesting! Thanks for clarifying.
> > > 
> > > > Thanks,
> > > > Robin.
> > > 
> > > Thanks,
> > > Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-28  9:13               ` Pranjal Shrivastava
  2025-03-28  9:19                 ` Pranjal Shrivastava
@ 2025-03-28 13:18                 ` Jason Gunthorpe
  2025-03-28 15:08                   ` Pranjal Shrivastava
  1 sibling, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-28 13:18 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Mostafa Saleh, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Fri, Mar 28, 2025 at 09:13:01AM +0000, Pranjal Shrivastava wrote:

> Other than that, the MMU-700 implementation spec[1] mentions that 
> certain revisions might access config and translation structures while
> SMMUEN=0 (through an impl defined prefetch), and one of the provided SW
> workarounds is to issue invalidate all with SMMUEN=0 after the SW has
> finished all modifications to config/translation structures, which seems
> to be happening anyway with arm_smmu_device_reset.

That starts to become another security problem though, if the SMMU is
not in forced abort and it somehow loads invalid data into the cache
because it hasn't been fully setup yet.. That can't be allowed with
insecure attachments :\

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-27 17:39                 ` Mostafa Saleh
@ 2025-03-28 13:21                   ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-28 13:21 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Pranjal Shrivastava, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Thu, Mar 27, 2025 at 05:39:26PM +0000, Mostafa Saleh wrote:

> I agree, I don’t like the approach of adding a (yet another) flag for
> firmware to populate, I guess we can either:
> - Prevent SMMUv3 suspend, and that’s already the status quo as it
>   doesn’t support RPM, instead of doing that from the driver by
>   detecting non-secure attaches, we can just make VFIO takes a PM
>   reference on the IOMMU.

That would break runtime PM support through VFIO on other platforms..
 
> - Disable the PCI device, or unmap it from userspace from VFIO suspend
>   handler, and retrieve it’s state on the VFIO resume handler which would
>   be ordered after the SMMUv3.

I don't think that works, the PCI device could have already been
programmed to do a hostile DMA prior to any unmap.

I think the only thing that makes sense is for the SMMU to not do
power management if it can't obey the API requirement around domains.

iommus are expected to always translate according to their attached
domain, never something else. The only possible exception would be
abort-all.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-28 13:18                 ` Jason Gunthorpe
@ 2025-03-28 15:08                   ` Pranjal Shrivastava
  2025-03-28 18:21                     ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-03-28 15:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mostafa Saleh, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Fri, Mar 28, 2025 at 10:18:44AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 28, 2025 at 09:13:01AM +0000, Pranjal Shrivastava wrote:
> 
> > Other than that, the MMU-700 implementation spec[1] mentions that 
> > certain revisions might access config and translation structures while
> > SMMUEN=0 (through an impl defined prefetch), and one of the provided SW
> > workarounds is to issue invalidate all with SMMUEN=0 after the SW has
> > finished all modifications to config/translation structures, which seems
> > to be happening anyway with arm_smmu_device_reset.
> 
> That starts to become another security problem though, if the SMMU is
> not in forced abort and it somehow loads invalid data into the cache
> because it hasn't been fully setup yet.. That can't be allowed with
> insecure attachments :\
> 

Ack. This is just to address the caching of invalid entries. GBPA.Abort
is a different issue.. All I meant was that when SMMUEN=0, the TLB isn't
supposed to be used for transactions (as per the spec) even if
GBPA.Abort = 0. 

We'd invalidate all entries in the resume callback before setting
SMMUEN=1 (as part of arm_smmu_device_reset).

GBPA.Abort resetting with 0 (i.e. bypass) is a different issue that we'll
need to address. I kinda agree with Robin there... we can't control what
we get when the resume callback is invoked... FWIW, there could be some
firmware that touches the SMMU before kernel does..


I like the idea of a flag telling us if the implementation boots with
GBPA.Abort = 0 but I'm unsure of how the iommu driver would understand
what attachments are insecure (or even if we'd wanna add that kind of
intelligence in the drivers)

> Jason

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
  2025-03-28 15:08                   ` Pranjal Shrivastava
@ 2025-03-28 18:21                     ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2025-03-28 18:21 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Mostafa Saleh, Robin Murphy, Joerg Roedel, Will Deacon,
	Nicolin Chen, Daniel Mentz, iommu

On Fri, Mar 28, 2025 at 03:08:36PM +0000, Pranjal Shrivastava wrote:

> I like the idea of a flag telling us if the implementation boots with
> GBPA.Abort = 0 but I'm unsure of how the iommu driver would understand
> what attachments are insecure (or even if we'd wanna add that kind of
> intelligence in the drivers)

I suspect we can reasonably add a domain flag that is set by vfio and
the pci untrusted paths.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-03-19  0:42 ` [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
  2025-03-20 22:33   ` Mostafa Saleh
  2025-03-26  4:52   ` Daniel Mentz
@ 2025-04-14 17:57   ` Nicolin Chen
  2025-04-14 21:26     ` Nicolin Chen
  2025-04-15 20:37     ` Pranjal Shrivastava
  2 siblings, 2 replies; 71+ messages in thread
From: Nicolin Chen @ 2025-04-14 17:57 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Mar 19, 2025 at 12:42:52AM +0000, Pranjal Shrivastava wrote:
> +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> +{
> +	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> +
> +	/* We might get the vcmdq */
> +	struct arm_smmu_cmdq_ent cmd = {
> +		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
> +			CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> +	};
> +
> +	struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> +	struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> +
> +	/*
> +	 * Since suspend is invoked when all clients have been
> +	 * we don't expect more commands to be added to the cmdq.
> +	 * Thus, wait for all existing commands to complete.
> +	 */
> +	arm_smmu_cmdq_shared_lock(cmdq);
> +	arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> +	arm_smmu_cmdq_shared_unlock(cmdq);

Hmm, I just realized this: with an SMMU having multiple CMDQs
(currently with vCMDQs and potentially with ECMDQs), should we
make sure all cmdqs (not only the cmdq picked in this function
by the arm_smmu_get_cmdq call above) to be empty?

On a system with vCMDQs, there are currently one standard SMMU
CMDQ and two vCMDQs, i.e. totally 3 cmdqs that could be picked
in this context. Perhaps SMMU might need a list of cmdqs that
any new allocated cmdq must be added to, so we can iterate all
the cmdqs in the list?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-14 17:57   ` Nicolin Chen
@ 2025-04-14 21:26     ` Nicolin Chen
  2025-04-15 20:47       ` Pranjal Shrivastava
  2025-04-15 20:37     ` Pranjal Shrivastava
  1 sibling, 1 reply; 71+ messages in thread
From: Nicolin Chen @ 2025-04-14 21:26 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Mon, Apr 14, 2025 at 10:57:34AM -0700, Nicolin Chen wrote:
> On Wed, Mar 19, 2025 at 12:42:52AM +0000, Pranjal Shrivastava wrote:
> > +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> > +{
> > +	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> > +
> > +	/* We might get the vcmdq */
> > +	struct arm_smmu_cmdq_ent cmd = {
> > +		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
> > +			CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> > +	};
> > +
> > +	struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> > +	struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> > +
> > +	/*
> > +	 * Since suspend is invoked when all clients have been
> > +	 * we don't expect more commands to be added to the cmdq.
> > +	 * Thus, wait for all existing commands to complete.
> > +	 */
> > +	arm_smmu_cmdq_shared_lock(cmdq);
> > +	arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> > +	arm_smmu_cmdq_shared_unlock(cmdq);
> 
> Hmm, I just realized this: with an SMMU having multiple CMDQs
> (currently with vCMDQs and potentially with ECMDQs), should we
> make sure all cmdqs (not only the cmdq picked in this function
> by the arm_smmu_get_cmdq call above) to be empty?
> 
> On a system with vCMDQs, there are currently one standard SMMU
> CMDQ and two vCMDQs, i.e. totally 3 cmdqs that could be picked
> in this context. Perhaps SMMU might need a list of cmdqs that
> any new allocated cmdq must be added to, so we can iterate all
> the cmdqs in the list?

Another tricky thing: if a vcmdq is assigned to a guest VM, it
won't be added to the "list" above. Should we wait for them or
simply rely on VMM sending a suspend signal to all its VMs?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-14 17:57   ` Nicolin Chen
  2025-04-14 21:26     ` Nicolin Chen
@ 2025-04-15 20:37     ` Pranjal Shrivastava
  2025-04-15 22:13       ` Nicolin Chen
  1 sibling, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-04-15 20:37 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Mon, Apr 14, 2025 at 10:57:32AM -0700, Nicolin Chen wrote:
> On Wed, Mar 19, 2025 at 12:42:52AM +0000, Pranjal Shrivastava wrote:
> > +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> > +{
> > +	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> > +
> > +	/* We might get the vcmdq */
> > +	struct arm_smmu_cmdq_ent cmd = {
> > +		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
> > +			CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> > +	};
> > +
> > +	struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> > +	struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> > +
> > +	/*
> > +	 * Since suspend is invoked when all clients have been
> > +	 * we don't expect more commands to be added to the cmdq.
> > +	 * Thus, wait for all existing commands to complete.
> > +	 */
> > +	arm_smmu_cmdq_shared_lock(cmdq);
> > +	arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> > +	arm_smmu_cmdq_shared_unlock(cmdq);
> 
> Hmm, I just realized this: with an SMMU having multiple CMDQs
> (currently with vCMDQs and potentially with ECMDQs), should we
> make sure all cmdqs (not only the cmdq picked in this function
> by the arm_smmu_get_cmdq call above) to be empty?
> 
> On a system with vCMDQs, there are currently one standard SMMU
> CMDQ and two vCMDQs, i.e. totally 3 cmdqs that could be picked
> in this context. Perhaps SMMU might need a list of cmdqs that
> any new allocated cmdq must be added to, so we can iterate all
> the cmdqs in the list?
> 

Well.. I was thinking to only drain the cmdq if we have ats/pri enabled
because the only case where we'd wanna drain the cmdq is when:

1. There's a pending ATC invalidation
2. There's a pending CMD_RESUME / PRI_RESP

Since for any other invalidations, we'll clear the TLBs & config caches
on resume. Does the same hold true for the commands supported by
tegra-vcmdq?

Is there a spec I could read about tegra vCMDQ and the commands
supported by it?

> Thanks
> Nicolin

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-14 21:26     ` Nicolin Chen
@ 2025-04-15 20:47       ` Pranjal Shrivastava
  2025-04-15 22:28         ` Nicolin Chen
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-04-15 20:47 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Mon, Apr 14, 2025 at 02:26:19PM -0700, Nicolin Chen wrote:
> On Mon, Apr 14, 2025 at 10:57:34AM -0700, Nicolin Chen wrote:
> > On Wed, Mar 19, 2025 at 12:42:52AM +0000, Pranjal Shrivastava wrote:
> > > +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> > > +{
> > > +	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> > > +
> > > +	/* We might get the vcmdq */
> > > +	struct arm_smmu_cmdq_ent cmd = {
> > > +		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
> > > +			CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> > > +	};
> > > +
> > > +	struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> > > +	struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> > > +
> > > +	/*
> > > +	 * Since suspend is invoked when all clients have been
> > > +	 * we don't expect more commands to be added to the cmdq.
> > > +	 * Thus, wait for all existing commands to complete.
> > > +	 */
> > > +	arm_smmu_cmdq_shared_lock(cmdq);
> > > +	arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> > > +	arm_smmu_cmdq_shared_unlock(cmdq);
> > 
> > Hmm, I just realized this: with an SMMU having multiple CMDQs
> > (currently with vCMDQs and potentially with ECMDQs), should we
> > make sure all cmdqs (not only the cmdq picked in this function
> > by the arm_smmu_get_cmdq call above) to be empty?
> > 
> > On a system with vCMDQs, there are currently one standard SMMU
> > CMDQ and two vCMDQs, i.e. totally 3 cmdqs that could be picked
> > in this context. Perhaps SMMU might need a list of cmdqs that
> > any new allocated cmdq must be added to, so we can iterate all
> > the cmdqs in the list?
> 
> Another tricky thing: if a vcmdq is assigned to a guest VM, it
> won't be added to the "list" above. Should we wait for them or
> simply rely on VMM sending a suspend signal to all its VMs?
> 

Hmm... I haven't looked at the vCMDQ series (of the vIOMMU) yet but I'm
assuming that's what we're talking about here?

I'm not sure if sending a signal to the VMM would help if we have some
stalled transactions? I guess we'll need to somehow figure out if
there's some PCI device assigned that's using that vIOMMU/vCMDQ ? 

Maybe we can smartly flush only the PCIe ATC inv and pri resp commands?

> Thanks
> Nicolin

Thanks
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-15 20:37     ` Pranjal Shrivastava
@ 2025-04-15 22:13       ` Nicolin Chen
  2025-04-16  8:29         ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Nicolin Chen @ 2025-04-15 22:13 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Tue, Apr 15, 2025 at 08:37:45PM +0000, Pranjal Shrivastava wrote:
> On Mon, Apr 14, 2025 at 10:57:32AM -0700, Nicolin Chen wrote:
> > On Wed, Mar 19, 2025 at 12:42:52AM +0000, Pranjal Shrivastava wrote:
> > > +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> > > +{
> > > +	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> > > +
> > > +	/* We might get the vcmdq */
> > > +	struct arm_smmu_cmdq_ent cmd = {
> > > +		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
> > > +			CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> > > +	};
> > > +
> > > +	struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> > > +	struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> > > +
> > > +	/*
> > > +	 * Since suspend is invoked when all clients have been
> > > +	 * we don't expect more commands to be added to the cmdq.
> > > +	 * Thus, wait for all existing commands to complete.
> > > +	 */
> > > +	arm_smmu_cmdq_shared_lock(cmdq);
> > > +	arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> > > +	arm_smmu_cmdq_shared_unlock(cmdq);
> > 
> > Hmm, I just realized this: with an SMMU having multiple CMDQs
> > (currently with vCMDQs and potentially with ECMDQs), should we
> > make sure all cmdqs (not only the cmdq picked in this function
> > by the arm_smmu_get_cmdq call above) to be empty?
> > 
> > On a system with vCMDQs, there are currently one standard SMMU
> > CMDQ and two vCMDQs, i.e. totally 3 cmdqs that could be picked
> > in this context. Perhaps SMMU might need a list of cmdqs that
> > any new allocated cmdq must be added to, so we can iterate all
> > the cmdqs in the list?
> > 
> 
> Well.. I was thinking to only drain the cmdq if we have ats/pri enabled
> because the only case where we'd wanna drain the cmdq is when:
> 
> 1. There's a pending ATC invalidation
> 2. There's a pending CMD_RESUME / PRI_RESP
> 
> Since for any other invalidations, we'll clear the TLBs & config caches
> on resume. Does the same hold true for the commands supported by
> tegra-vcmdq?
> 
> Is there a spec I could read about tegra vCMDQ and the commands
> supported by it?

I don't have a link on hand. But the concept is very simple.

Think of vCMDQs as additional CPU cores to CPU0. They will be
able to execute the same ATC command as the main SMMU CMDQ.

But by doing arm_smmu_get_cmdq(), it would only get one of the
CMDQs, which can be a vcmdq (out of several) or the SMMU CMDQ.
And that specific CMDQ might not have a command pending, while
one of the other vcmdqs or the SMMU CMDQ is executing commands.

So, the point here is to iterate all the host-managed CMDQs to
wait for all of their completions of ATC/RESUME/PRI or etc.

We can do a list of cmdqs, and add each to the list in the
arm_smmu_init_one_queue(). Alternatively, we could add a new
impl op to defer to the vcmdq driver for a proper waiting,
if we do need to handle guest-owned cmdqs as well.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-15 20:47       ` Pranjal Shrivastava
@ 2025-04-15 22:28         ` Nicolin Chen
  2025-04-16 10:24           ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Nicolin Chen @ 2025-04-15 22:28 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Tue, Apr 15, 2025 at 08:47:03PM +0000, Pranjal Shrivastava wrote:
> On Mon, Apr 14, 2025 at 02:26:19PM -0700, Nicolin Chen wrote:
> > > On a system with vCMDQs, there are currently one standard SMMU
> > > CMDQ and two vCMDQs, i.e. totally 3 cmdqs that could be picked
> > > in this context. Perhaps SMMU might need a list of cmdqs that
> > > any new allocated cmdq must be added to, so we can iterate all
> > > the cmdqs in the list?
> > 
> > Another tricky thing: if a vcmdq is assigned to a guest VM, it
> > won't be added to the "list" above. Should we wait for them or
> > simply rely on VMM sending a suspend signal to all its VMs?
> > 
> 
> Hmm... I haven't looked at the vCMDQ series (of the vIOMMU) yet but I'm
> assuming that's what we're talking about here?
> 
> I'm not sure if sending a signal to the VMM would help if we have some
> stalled transactions? I guess we'll need to somehow figure out if
> there's some PCI device assigned that's using that vIOMMU/vCMDQ ? 

If all VMs go into suspend prior to the host machine and all
guest-owned vcmdqs are ensured to be empty, we would be fine.

But this is difficult to tell, since a VMM might not signal
a suspend to all guests VMs. And even if it does, not all the
guest kernels might have implemented a suspend routine in the
guest vcmdq driver draining all guest-owned vcmdqs.

> Maybe we can smartly flush only the PCIe ATC inv and pri resp commands?

Yes, I suspect the host driver will have to do something like
that for the reason that I mentioned.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-15 22:13       ` Nicolin Chen
@ 2025-04-16  8:29         ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-04-16  8:29 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Tue, Apr 15, 2025 at 03:13:20PM -0700, Nicolin Chen wrote:
> On Tue, Apr 15, 2025 at 08:37:45PM +0000, Pranjal Shrivastava wrote:
> > On Mon, Apr 14, 2025 at 10:57:32AM -0700, Nicolin Chen wrote:
> > > On Wed, Mar 19, 2025 at 12:42:52AM +0000, Pranjal Shrivastava wrote:
> > > > +static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
> > > > +{
> > > > +	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> > > > +
> > > > +	/* We might get the vcmdq */
> > > > +	struct arm_smmu_cmdq_ent cmd = {
> > > > +		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
> > > > +			CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> > > > +	};
> > > > +
> > > > +	struct arm_smmu_cmdq *cmdq = arm_smmu_get_cmdq(smmu, &cmd);
> > > > +	struct arm_smmu_ll_queue *llq = &cmdq->q.llq;
> > > > +
> > > > +	/*
> > > > +	 * Since suspend is invoked when all clients have been
> > > > +	 * we don't expect more commands to be added to the cmdq.
> > > > +	 * Thus, wait for all existing commands to complete.
> > > > +	 */
> > > > +	arm_smmu_cmdq_shared_lock(cmdq);
> > > > +	arm_smmu_cmdq_poll_until_empty(smmu, cmdq, llq);
> > > > +	arm_smmu_cmdq_shared_unlock(cmdq);
> > > 
> > > Hmm, I just realized this: with an SMMU having multiple CMDQs
> > > (currently with vCMDQs and potentially with ECMDQs), should we
> > > make sure all cmdqs (not only the cmdq picked in this function
> > > by the arm_smmu_get_cmdq call above) to be empty?
> > > 
> > > On a system with vCMDQs, there are currently one standard SMMU
> > > CMDQ and two vCMDQs, i.e. totally 3 cmdqs that could be picked
> > > in this context. Perhaps SMMU might need a list of cmdqs that
> > > any new allocated cmdq must be added to, so we can iterate all
> > > the cmdqs in the list?
> > > 
> > 
> > Well.. I was thinking to only drain the cmdq if we have ats/pri enabled
> > because the only case where we'd wanna drain the cmdq is when:
> > 
> > 1. There's a pending ATC invalidation
> > 2. There's a pending CMD_RESUME / PRI_RESP
> > 
> > Since for any other invalidations, we'll clear the TLBs & config caches
> > on resume. Does the same hold true for the commands supported by
> > tegra-vcmdq?
> > 
> > Is there a spec I could read about tegra vCMDQ and the commands
> > supported by it?
> 
> I don't have a link on hand. But the concept is very simple.
> 
> Think of vCMDQs as additional CPU cores to CPU0. They will be
> able to execute the same ATC command as the main SMMU CMDQ.
> 
> But by doing arm_smmu_get_cmdq(), it would only get one of the
> CMDQs, which can be a vcmdq (out of several) or the SMMU CMDQ.
> And that specific CMDQ might not have a command pending, while
> one of the other vcmdqs or the SMMU CMDQ is executing commands.
> 

Ack. So, the commands are still the same it's just a diff queue.

> So, the point here is to iterate all the host-managed CMDQs to
> wait for all of their completions of ATC/RESUME/PRI or etc.
> 
> We can do a list of cmdqs, and add each to the list in the
> arm_smmu_init_one_queue(). Alternatively, we could add a new
> impl op to defer to the vcmdq driver for a proper waiting,
> if we do need to handle guest-owned cmdqs as well.
> 

Ack. I'd prefer adding a new impl op for the vcmdq driver to handle it

> Thanks
> Nicolin

Thanks,
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-15 22:28         ` Nicolin Chen
@ 2025-04-16 10:24           ` Pranjal Shrivastava
  2025-04-16 12:02             ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-04-16 10:24 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Mostafa Saleh, Daniel Mentz, iommu

On Tue, Apr 15, 2025 at 03:28:58PM -0700, Nicolin Chen wrote:
> On Tue, Apr 15, 2025 at 08:47:03PM +0000, Pranjal Shrivastava wrote:
> > On Mon, Apr 14, 2025 at 02:26:19PM -0700, Nicolin Chen wrote:
> > > > On a system with vCMDQs, there are currently one standard SMMU
> > > > CMDQ and two vCMDQs, i.e. totally 3 cmdqs that could be picked
> > > > in this context. Perhaps SMMU might need a list of cmdqs that
> > > > any new allocated cmdq must be added to, so we can iterate all
> > > > the cmdqs in the list?
> > > 
> > > Another tricky thing: if a vcmdq is assigned to a guest VM, it
> > > won't be added to the "list" above. Should we wait for them or
> > > simply rely on VMM sending a suspend signal to all its VMs?
> > > 
> > 
> > Hmm... I haven't looked at the vCMDQ series (of the vIOMMU) yet but I'm
> > assuming that's what we're talking about here?
> > 
> > I'm not sure if sending a signal to the VMM would help if we have some
> > stalled transactions? I guess we'll need to somehow figure out if
> > there's some PCI device assigned that's using that vIOMMU/vCMDQ ? 
> 
> If all VMs go into suspend prior to the host machine and all
> guest-owned vcmdqs are ensured to be empty, we would be fine.
> 
> But this is difficult to tell, since a VMM might not signal
> a suspend to all guests VMs. And even if it does, not all the
> guest kernels might have implemented a suspend routine in the
> guest vcmdq driver draining all guest-owned vcmdqs.
> 

I skimmed through the vCMDQ series and I have some context now. (Btw,
nicely written cover letter :) )

The issue is that we may have dedicated CMDQs (like the Tegra vCMDQ)
assigned to the guest and we may not know what's their status..and the
whole point of this is to avoid VM_exits for cache invalidations..

Since we're mmap-ing an mmio region for passing it through to the guest,
I asssume the host has access to the prod/cons registers for the vCMDQ?
If yes, then we could have a notifier tell the host driver that a vCMDQ
has been assigned to a VM and the host driver could simply poll on those
indices till it's drained?

Also, this would mean that we'll have to take care if the Guest Kernel
ends up touching the mmio region while the SMMU is suspended?

Let me go through the entire series and the tegra-vcmdq driver once.. if
at all the Guest kernel ends up calling the host-kernel vCMDQ driver, in
that case I think we should be able to handle this well... otherwise
we'd need to disable power management for cases where vCMDQs are
assigned..

I'll just go through the vCMDQ series as well..

> > Maybe we can smartly flush only the PCIe ATC inv and pri resp commands?
> 
> Yes, I suspect the host driver will have to do something like
> that for the reason that I mentioned.
> 
> Thanks
> Nicolin

Thanks
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-16 10:24           ` Pranjal Shrivastava
@ 2025-04-16 12:02             ` Jason Gunthorpe
  2025-04-16 12:29               ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-04-16 12:02 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Nicolin Chen, Joerg Roedel, Will Deacon, Robin Murphy,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Apr 16, 2025 at 10:24:52AM +0000, Pranjal Shrivastava wrote:

> Also, this would mean that we'll have to take care if the Guest Kernel
> ends up touching the mmio region while the SMMU is suspended?

I think this is approaching it backwards.

If power management is supported then the power should be on unless
the VM has activated its own virtual power management. VM virtual
power mangement would flush the cmdq from the VM side then signal the
host that it is OK to power down the SMMU, which the host may or may
not do

It doesn't make alot of sense to power down the SMMU while a VM is
running...

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-16 12:02             ` Jason Gunthorpe
@ 2025-04-16 12:29               ` Pranjal Shrivastava
  2025-04-16 12:42                 ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-04-16 12:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, Joerg Roedel, Will Deacon, Robin Murphy,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Apr 16, 2025 at 09:02:51AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 16, 2025 at 10:24:52AM +0000, Pranjal Shrivastava wrote:
> 
> > Also, this would mean that we'll have to take care if the Guest Kernel
> > ends up touching the mmio region while the SMMU is suspended?
> 
> I think this is approaching it backwards.
> 
> If power management is supported then the power should be on unless
> the VM has activated its own virtual power management. VM virtual
> power mangement would flush the cmdq from the VM side then signal the
> host that it is OK to power down the SMMU, which the host may or may
> not do
> 
> It doesn't make alot of sense to power down the SMMU while a VM is
> running...
> 

Exactly, that's what I meant to convey..
So.. can't we simply take a ref as soon as a VM is created (IOMMU is
assigned to it), maybe from somewhere like vfio_change_dma_owner and
keep things powered on? That way we keep the IOMMU powered ON if it's
controlled by the userspace.

This would also allow us to only care about the host-managed queues
while suspening because we'll be sure that if we are suspending the
IOMMU, no VM is active.

> Jason

Thanks
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-16 12:29               ` Pranjal Shrivastava
@ 2025-04-16 12:42                 ` Jason Gunthorpe
  2025-04-16 12:52                   ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-04-16 12:42 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Nicolin Chen, Joerg Roedel, Will Deacon, Robin Murphy,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Apr 16, 2025 at 12:29:05PM +0000, Pranjal Shrivastava wrote:
> On Wed, Apr 16, 2025 at 09:02:51AM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 16, 2025 at 10:24:52AM +0000, Pranjal Shrivastava wrote:
> > 
> > > Also, this would mean that we'll have to take care if the Guest Kernel
> > > ends up touching the mmio region while the SMMU is suspended?
> > 
> > I think this is approaching it backwards.
> > 
> > If power management is supported then the power should be on unless
> > the VM has activated its own virtual power management. VM virtual
> > power mangement would flush the cmdq from the VM side then signal the
> > host that it is OK to power down the SMMU, which the host may or may
> > not do
> > 
> > It doesn't make alot of sense to power down the SMMU while a VM is
> > running...
> > 
> 
> Exactly, that's what I meant to convey..
> So.. can't we simply take a ref as soon as a VM is created (IOMMU is
> assigned to it), maybe from somewhere like vfio_change_dma_owner and
> keep things powered on? That way we keep the IOMMU powered ON if it's
> controlled by the userspace.
> 
> This would also allow us to only care about the host-managed queues
> while suspening because we'll be sure that if we are suspending the
> IOMMU, no VM is active.

I thought we agreed VFIO already would need to do something to keep
the power on as it can do DMA at any time?

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-16 12:42                 ` Jason Gunthorpe
@ 2025-04-16 12:52                   ` Pranjal Shrivastava
  2025-04-16 13:07                     ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-04-16 12:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, Joerg Roedel, Will Deacon, Robin Murphy,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Apr 16, 2025 at 09:42:52AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 16, 2025 at 12:29:05PM +0000, Pranjal Shrivastava wrote:
> > On Wed, Apr 16, 2025 at 09:02:51AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 16, 2025 at 10:24:52AM +0000, Pranjal Shrivastava wrote:
> > > 
> > > > Also, this would mean that we'll have to take care if the Guest Kernel
> > > > ends up touching the mmio region while the SMMU is suspended?
> > > 
> > > I think this is approaching it backwards.
> > > 
> > > If power management is supported then the power should be on unless
> > > the VM has activated its own virtual power management. VM virtual
> > > power mangement would flush the cmdq from the VM side then signal the
> > > host that it is OK to power down the SMMU, which the host may or may
> > > not do
> > > 
> > > It doesn't make alot of sense to power down the SMMU while a VM is
> > > running...
> > > 
> > 
> > Exactly, that's what I meant to convey..
> > So.. can't we simply take a ref as soon as a VM is created (IOMMU is
> > assigned to it), maybe from somewhere like vfio_change_dma_owner and
> > keep things powered on? That way we keep the IOMMU powered ON if it's
> > controlled by the userspace.
> > 
> > This would also allow us to only care about the host-managed queues
> > while suspening because we'll be sure that if we are suspending the
> > IOMMU, no VM is active.
> 
> I thought we agreed VFIO already would need to do something to keep
> the power on as it can do DMA at any time?
> 

Yes.. but if that's the case why are the Guest-assigned vCMDQs coming
into the picture? 

I assumed that there's a situation where the VMM doesn't use
VFIO and only uses iommufd to assign/configure the IOMMU, maybe we'd
need to get a pm ref there as well? Is that the case?

Otherwise, if we have any user-space entity controlling a device behind
the IOMMU, we'd grab a pm ref and not power down. And hence, we won't
need any special handling for Guest-assigned vCMDQs because the IOMMU is
powered on.

Please tell me if I'm missing something? 

> Jason

Thanks
Praan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-16 12:52                   ` Pranjal Shrivastava
@ 2025-04-16 13:07                     ` Jason Gunthorpe
  2025-04-16 14:32                       ` Pranjal Shrivastava
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2025-04-16 13:07 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Nicolin Chen, Joerg Roedel, Will Deacon, Robin Murphy,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Apr 16, 2025 at 12:52:28PM +0000, Pranjal Shrivastava wrote:
> On Wed, Apr 16, 2025 at 09:42:52AM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 16, 2025 at 12:29:05PM +0000, Pranjal Shrivastava wrote:
> > > On Wed, Apr 16, 2025 at 09:02:51AM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Apr 16, 2025 at 10:24:52AM +0000, Pranjal Shrivastava wrote:
> > > > 
> > > > > Also, this would mean that we'll have to take care if the Guest Kernel
> > > > > ends up touching the mmio region while the SMMU is suspended?
> > > > 
> > > > I think this is approaching it backwards.
> > > > 
> > > > If power management is supported then the power should be on unless
> > > > the VM has activated its own virtual power management. VM virtual
> > > > power mangement would flush the cmdq from the VM side then signal the
> > > > host that it is OK to power down the SMMU, which the host may or may
> > > > not do
> > > > 
> > > > It doesn't make alot of sense to power down the SMMU while a VM is
> > > > running...
> > > > 
> > > 
> > > Exactly, that's what I meant to convey..
> > > So.. can't we simply take a ref as soon as a VM is created (IOMMU is
> > > assigned to it), maybe from somewhere like vfio_change_dma_owner and
> > > keep things powered on? That way we keep the IOMMU powered ON if it's
> > > controlled by the userspace.
> > > 
> > > This would also allow us to only care about the host-managed queues
> > > while suspening because we'll be sure that if we are suspending the
> > > IOMMU, no VM is active.
> > 
> > I thought we agreed VFIO already would need to do something to keep
> > the power on as it can do DMA at any time?
> > 
> 
> Yes.. but if that's the case why are the Guest-assigned vCMDQs coming
> into the picture?

Don't know.

But the vcmdq's are used by the host too..

> I assumed that there's a situation where the VMM doesn't use
> VFIO and only uses iommufd to assign/configure the IOMMU, maybe we'd
> need to get a pm ref there as well? Is that the case?

Right now you can't reach the iommu HW without a VFIO, so that path is
closed off.

> Please tell me if I'm missing something? 

I think the issue here is the host operation of the vcmdq's
itself. They need to be quieted just like the normal command queue

cmdqs that are assigned to userspace can be ignored, but the upstream
kernel does not support this yet.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-04-16 13:07                     ` Jason Gunthorpe
@ 2025-04-16 14:32                       ` Pranjal Shrivastava
  0 siblings, 0 replies; 71+ messages in thread
From: Pranjal Shrivastava @ 2025-04-16 14:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, Joerg Roedel, Will Deacon, Robin Murphy,
	Mostafa Saleh, Daniel Mentz, iommu

On Wed, Apr 16, 2025 at 10:07:11AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 16, 2025 at 12:52:28PM +0000, Pranjal Shrivastava wrote:
> > On Wed, Apr 16, 2025 at 09:42:52AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Apr 16, 2025 at 12:29:05PM +0000, Pranjal Shrivastava wrote:
> > > > On Wed, Apr 16, 2025 at 09:02:51AM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Apr 16, 2025 at 10:24:52AM +0000, Pranjal Shrivastava wrote:
> > > > > 
> > > > > > Also, this would mean that we'll have to take care if the Guest Kernel
> > > > > > ends up touching the mmio region while the SMMU is suspended?
> > > > > 
> > > > > I think this is approaching it backwards.
> > > > > 
> > > > > If power management is supported then the power should be on unless
> > > > > the VM has activated its own virtual power management. VM virtual
> > > > > power mangement would flush the cmdq from the VM side then signal the
> > > > > host that it is OK to power down the SMMU, which the host may or may
> > > > > not do
> > > > > 
> > > > > It doesn't make alot of sense to power down the SMMU while a VM is
> > > > > running...
> > > > > 
> > > > 
> > > > Exactly, that's what I meant to convey..
> > > > So.. can't we simply take a ref as soon as a VM is created (IOMMU is
> > > > assigned to it), maybe from somewhere like vfio_change_dma_owner and
> > > > keep things powered on? That way we keep the IOMMU powered ON if it's
> > > > controlled by the userspace.
> > > > 
> > > > This would also allow us to only care about the host-managed queues
> > > > while suspening because we'll be sure that if we are suspending the
> > > > IOMMU, no VM is active.
> > > 
> > > I thought we agreed VFIO already would need to do something to keep
> > > the power on as it can do DMA at any time?
> > > 
> > 
> > Yes.. but if that's the case why are the Guest-assigned vCMDQs coming
> > into the picture?
> 
> Don't know.
> 
> But the vcmdq's are used by the host too..
> 

Yea..I agree that the host-owned ones would need to be drained.

> > I assumed that there's a situation where the VMM doesn't use
> > VFIO and only uses iommufd to assign/configure the IOMMU, maybe we'd
> > need to get a pm ref there as well? Is that the case?
> 
> Right now you can't reach the iommu HW without a VFIO, so that path is
> closed off.
> 

Ack. 

> > Please tell me if I'm missing something? 
> 
> I think the issue here is the host operation of the vcmdq's
> itself. They need to be quieted just like the normal command queue
> 

YEs, I agree we'll need to handle them as well.

> cmdqs that are assigned to userspace can be ignored, but the upstream
> kernel does not support this yet.

Ack.

> 
> Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2025-04-16 14:32 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-19  0:42 [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
2025-03-19  0:42 ` [RFC PATCH 1/5] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
2025-03-19  4:50   ` Nicolin Chen
2025-03-19  7:43     ` Pranjal Shrivastava
2025-03-20 22:29   ` Mostafa Saleh
2025-03-21  7:26     ` Pranjal Shrivastava
2025-03-25 16:19   ` Daniel Mentz
2025-03-26 19:35     ` Pranjal Shrivastava
2025-03-19  0:42 ` [RFC PATCH 2/5] iommu/arm-smmu-v3: Add a helper to wait till cmdq drains Pranjal Shrivastava
2025-03-20 22:30   ` Mostafa Saleh
2025-03-21  8:09     ` Pranjal Shrivastava
2025-03-25 17:50   ` Daniel Mentz
2025-03-26 19:36     ` Pranjal Shrivastava
2025-03-26  4:51   ` Daniel Mentz
2025-03-26 20:10     ` Pranjal Shrivastava
2025-03-19  0:42 ` [RFC PATCH 3/5] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
2025-03-20 22:33   ` Mostafa Saleh
2025-03-21  8:13     ` Pranjal Shrivastava
2025-03-26  4:52   ` Daniel Mentz
2025-03-28  7:47     ` Pranjal Shrivastava
2025-04-14 17:57   ` Nicolin Chen
2025-04-14 21:26     ` Nicolin Chen
2025-04-15 20:47       ` Pranjal Shrivastava
2025-04-15 22:28         ` Nicolin Chen
2025-04-16 10:24           ` Pranjal Shrivastava
2025-04-16 12:02             ` Jason Gunthorpe
2025-04-16 12:29               ` Pranjal Shrivastava
2025-04-16 12:42                 ` Jason Gunthorpe
2025-04-16 12:52                   ` Pranjal Shrivastava
2025-04-16 13:07                     ` Jason Gunthorpe
2025-04-16 14:32                       ` Pranjal Shrivastava
2025-04-15 20:37     ` Pranjal Shrivastava
2025-04-15 22:13       ` Nicolin Chen
2025-04-16  8:29         ` Pranjal Shrivastava
2025-03-19  0:42 ` [RFC PATCH 4/5] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
2025-03-20 22:34   ` Mostafa Saleh
2025-03-19  0:42 ` [RFC PATCH 5/5] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
2025-03-19 12:04   ` Jason Gunthorpe
2025-03-20  7:25     ` Pranjal Shrivastava
2025-03-20 12:54       ` Jason Gunthorpe
2025-03-20 13:22         ` Robin Murphy
2025-03-20 14:21           ` Pranjal Shrivastava
2025-03-20 22:36   ` Mostafa Saleh
2025-03-19 11:57 ` [RFC PATCH 0/5] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Jason Gunthorpe
2025-03-19 16:07   ` Robin Murphy
2025-03-20 22:25     ` Mostafa Saleh
2025-03-21 14:18       ` Pranjal Shrivastava
2025-03-21 17:35         ` Robin Murphy
2025-03-24 17:36           ` Pranjal Shrivastava
2025-03-27 17:27             ` Mostafa Saleh
2025-03-28  9:13               ` Pranjal Shrivastava
2025-03-28  9:19                 ` Pranjal Shrivastava
2025-03-28 13:18                 ` Jason Gunthorpe
2025-03-28 15:08                   ` Pranjal Shrivastava
2025-03-28 18:21                     ` Jason Gunthorpe
2025-03-19 18:22 ` Robin Murphy
2025-03-19 19:46   ` Jason Gunthorpe
2025-03-20 21:00     ` Pranjal Shrivastava
2025-03-20 23:08       ` Jason Gunthorpe
2025-03-21 14:36         ` Pranjal Shrivastava
2025-03-22  0:00           ` Jason Gunthorpe
2025-03-20 22:28     ` Mostafa Saleh
2025-03-20 23:05       ` Jason Gunthorpe
2025-03-21 14:44         ` Pranjal Shrivastava
2025-03-21 15:30           ` Jason Gunthorpe
2025-03-24 17:53             ` Pranjal Shrivastava
2025-03-25 13:55               ` Jason Gunthorpe
2025-03-27 17:39                 ` Mostafa Saleh
2025-03-28 13:21                   ` Jason Gunthorpe
2025-03-20 14:13   ` Pranjal Shrivastava
2025-03-20 14:54     ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.