[RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops
@ 2025-06-16 20:31 Pranjal Shrivastava
  2025-06-16 20:31 ` [RFC PATCH v3 1/8] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

As arm-smmu-v3 rapidly finds its way into SoCs designed for hand-held
devices, power management capabilities, similar to its predecessors, are
crucial for these applications. This series introduces power management
support for the arm-smmu-v3 driver.

Design
=======
The arm-smmu-v3 primarily operates with in-memory data structures
through HW registers pointing to these data structures in some fashion.
The proposed design tries to make use of this fact for implementing the
suspend and resume ops.

1. Suspend / Resume
The idea for the "suspend" op is to wait till the cmd queues are flushed
before disabling the SMMU through CR0. Note that waiting for the other 
queues like the event queue and PRI queue is not required in this
version as by design their bottom-halves hold a pm reference ensuring
that if the suspend callback is invoked, both the queues have been
drained. In order to avoid mis-use or spurious transactions 
(b/w SMMU disable -> power-down), the GBPA register is configured to
abort all transactions.

The resume operation uses the `arm_smmu_device_reset` function which
re-initializes the HW using the SW-copies maintained by the driver. For
example, prod/cons for queues, base addresses for queues & tables. The
arm_smmu_device_reset also clears the TLBs and config caches.

2. Interrupt Re-config
a. Wired irqs: The series refactors the `arm_smmu_setup_irqs` to be
able to enable/disable irqs and install their handlers separately to
help with the re-initialization of the interrupts correctly.

b. MSIs: The series relies on caching the msi_msg and retrieving it
through a newly introduced helper `arm_smmu_resume_msis()` which
re-configures the *_IRQ_CFGn registers via writing back cached msi_msgs.

3. Eliding TLBIs and CFGIs
The existing pm_runtime helpers like pm_runtime_get_if_active() and
pm_runtime_get_if_in_use(), are too strict for certain use cases. They
fail if the device is in a transient state like RPM_SUSPENDING, which
can lead to drivers making incorrect assumptions about the dev's state.

These helpers don't suffice for cases where one wishes to elide HW TLB
invalidations if the device is powered off. As discussed in the previous
versions, it is wasteful to wake up the SMMU just to issue a TLBI since
we anyway clear the TLBs and Config cache on resume. The existing APIs
like pm_runtime_get_if_active or pm_runtime_get_if_in_use fail to help
us acheive this. Consider the following sequence of operations:

a. The SMMU is in `RPM_SUSPENDING` state
b. The SMMU driver calls pm_runtime_get_if_active/in_use
c. Depending on these API, the driver elides a HW clean-up op like:

if (pm_runtime_get_if_in_active(dev))
	invalidate_tlb(dev);
else
	// Skip flush, assuming device will fully suspend

d. Now, a client wakes up, causing the SMMU's state to bounce from
   RPM_SUSPENDING to RPM_ACTIVE without invoking any rpm callbacks
   preventing the resume callback from clearing all TLBs.

e. The SMMU continues operate with stale TLB entries.

In order to avoid the above situation, this series introduces a new
helper function, pm_runtime_get_if_not_suspended(), The new APU, 
pm_runtime_get_if_not_suspended() increments the device's runtime 
PM reference only if it's rpm state is NOT RPM_SUSPENDED and it cancels
any pending autosuspend timers. This allows us to reliably clear TLBs or
Config cache if the SMMU is NOT suspended, ensuring correct operation.

4. Invoking runtime_pm_get/put
Given that most of the configuration done by arm-smmu-v3 is stored in
memory, the idea in this version of the series is to elide all TLBIs and
CFGIs if the SMMU is suspended and only go ahead with ATC invalidations.

Thus, for most calls, the SMMU driver would make the required changes to
the in-memory data structures, but elide all TLBIs, CFGIs and prefetches
This is done by introducing another runtime PM helper based on the newly
introduced pm_runtime_get_if_not_suspended.

Only places where the driver does a pm_runtime_resume_and_get is where
touching the HW is unavoidable and important commands like ATC_INV and
other commands resuming stalled transactions.

Future Work / Potential Improvements
Ensuring a pm reference is held for user-space controlled IOMMUs.
As per a discussion with Will off-list, it was decided to remove the
additional code for tracking user-owned IOMMUs / insecure attachments
from the driver.

Call for review
Any insights/comments on the proposed changes are appreciated,
especially in areas related to locking, atomic contexts, early resume, 
PCIe-related considerations etc. or any other potential optimizations.

Note: The series isn't tested with MSIs and weakly tested for PCIe
clients. The same holds true for tegra241_cmdv changes. Any help in
reviewing and testing these parts is much appreciated.

Changelog:

[v3]
 - Introduced `pm_runtime_get_if_not_suspended` API to avoid races due
   to bouncing RPM states while eliding TLBIs as pointed out by Daniel.
 - Addressed Nicolin's comments regarding msi_resume and CMDQV flush
 - Addressed Daniel's comments about CMDQ locking and draining
 - Addressed issues related to draining the evtq and priq
 - Dropped the code to identify and track user-space attachments

[v2]
 - https://lore.kernel.org/all/20250418233409.3926715-1-praan@google.com/
 - Introduced `arm_smmu_rpm_get_if_active` for eliding TLBIs & CFGIs
 - Updated the rpm helper invocation strategy.
 - Drained all queues in suspend callback (including tegra241-cmdv)
 - Cache and restore msi_msg instead of free-ing realloc-ing on resume
 - Added support to identify and track user-space attachments
 - Fixed the setup_irqs as per Nicolin & Mostafa's suggestions
 - Used force_runtime_suspend/resume instead as per Mostafa's suggestion.
 - Added "Reviewed-by" line from Mostafa on an unchanged patch

[v1]
 - https://lore.kernel.org/all/20250319004254.2547950-1-praan@google.com/

Pranjal Shrivastava (8):
  iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  iommu/arm-smmu-v3: Add a helper to drain cmd queues
  iommu/tegra241-cmdqv: Add a helper to drain VCMDQs
  iommu/arm-smmu-v3: Cache and restore MSI config
  pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks
  iommu/arm-smmu-v3: Invoke pm_runtime before hw access

 drivers/base/power/runtime.c                  |  41 ++
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     |  11 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 378 ++++++++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   9 +
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c    |  27 ++
 include/linux/pm_runtime.h                    |   5 +
 6 files changed, 438 insertions(+), 33 deletions(-)

-- 
2.50.0.rc2.692.g299adb8693-goog

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 1/8] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
@ 2025-06-16 20:31 ` Pranjal Shrivastava
  2025-07-08 15:15   ` Mostafa Saleh
  2025-06-16 20:31 ` [RFC PATCH v3 2/8] iommu/arm-smmu-v3: Add a helper to drain cmd queues Pranjal Shrivastava
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Refactor arm_smmu_setup_irqs by splitting it into two parts, one for
registering interrupt handlers and the other one for enabling interrupt
generation in the hardware. This refactor helps in re-initialization of
hardware interrupts as part of a subsequent patch that enables runtime
power management for the arm-smmu-v3 driver.

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 53 ++++++++++++++-------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f50c26596416..f19a77708d2c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -4009,12 +4009,24 @@ static void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu)
 	}
 }
 
-static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
+static void arm_smmu_enable_irqs(struct arm_smmu_device *smmu)
 {
-	int ret, irq;
+	int ret;
 	u32 irqen_flags = IRQ_CTRL_EVTQ_IRQEN | IRQ_CTRL_GERROR_IRQEN;
 
-	/* Disable IRQs first */
+	if (smmu->features & ARM_SMMU_FEAT_PRI)
+		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
+
+	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
+				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
+	if (ret)
+		dev_warn(smmu->dev, "failed to enable irqs\n");
+}
+
+static int arm_smmu_disable_irqs(struct arm_smmu_device *smmu)
+{
+	int ret;
+
 	ret = arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
 				      ARM_SMMU_IRQ_CTRLACK);
 	if (ret) {
@@ -4022,6 +4034,18 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
 		return ret;
 	}
 
+	return 0;
+}
+
+static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
+{
+	int ret, irq;
+
+	/* Disable IRQs first */
+	ret = arm_smmu_disable_irqs(smmu);
+	if (ret)
+		return ret;
+
 	irq = smmu->combined_irq;
 	if (irq) {
 		/*
@@ -4038,15 +4062,6 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
 	} else
 		arm_smmu_setup_unique_irqs(smmu);
 
-	if (smmu->features & ARM_SMMU_FEAT_PRI)
-		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
-
-	/* Enable interrupt generation on the SMMU */
-	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
-				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
-	if (ret)
-		dev_warn(smmu->dev, "failed to enable irqs\n");
-
 	return 0;
 }
 
@@ -4189,11 +4204,8 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
 		}
 	}
 
-	ret = arm_smmu_setup_irqs(smmu);
-	if (ret) {
-		dev_err(smmu->dev, "failed to setup irqs\n");
-		return ret;
-	}
+	/* Enable interrupt generation on the SMMU */
+	arm_smmu_enable_irqs(smmu);
 
 	if (is_kdump_kernel())
 		enables &= ~(CR0_EVTQEN | CR0_PRIQEN);
@@ -4778,6 +4790,13 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 	/* Check for RMRs and install bypass STEs if any */
 	arm_smmu_rmr_install_bypass_ste(smmu);
 
+	/* Setup interrupt handlers */
+	ret = arm_smmu_setup_irqs(smmu);
+	if (ret) {
+		dev_err(smmu->dev, "failed to setup irqs\n");
+		goto err_free_iopf;
+	}
+
 	/* Reset the device */
 	ret = arm_smmu_device_reset(smmu);
 	if (ret)
-- 
2.50.0.rc2.692.g299adb8693-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 1/8] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs
  2025-06-16 20:31 ` [RFC PATCH v3 1/8] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
@ 2025-07-08 15:15   ` Mostafa Saleh
  0 siblings, 0 replies; 28+ messages in thread
From: Mostafa Saleh @ 2025-07-08 15:15 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki, Nicolin Chen, Daniel Mentz, iommu

On Mon, Jun 16, 2025 at 08:31:42PM +0000, Pranjal Shrivastava wrote:
> Refactor arm_smmu_setup_irqs by splitting it into two parts, one for
> registering interrupt handlers and the other one for enabling interrupt
> generation in the hardware. This refactor helps in re-initialization of
> hardware interrupts as part of a subsequent patch that enables runtime
> power management for the arm-smmu-v3 driver.
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Pranjal Shrivastava <praan@google.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 53 ++++++++++++++-------
>  1 file changed, 36 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index f50c26596416..f19a77708d2c 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -4009,12 +4009,24 @@ static void arm_smmu_setup_unique_irqs(struct arm_smmu_device *smmu)
>  	}
>  }
>  
> -static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> +static void arm_smmu_enable_irqs(struct arm_smmu_device *smmu)
>  {
> -	int ret, irq;
> +	int ret;
>  	u32 irqen_flags = IRQ_CTRL_EVTQ_IRQEN | IRQ_CTRL_GERROR_IRQEN;
>  
> -	/* Disable IRQs first */
> +	if (smmu->features & ARM_SMMU_FEAT_PRI)
> +		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
> +
> +	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
> +				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
> +	if (ret)
> +		dev_warn(smmu->dev, "failed to enable irqs\n");
> +}
> +
> +static int arm_smmu_disable_irqs(struct arm_smmu_device *smmu)
> +{
> +	int ret;
> +
>  	ret = arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
>  				      ARM_SMMU_IRQ_CTRLACK);
>  	if (ret) {
> @@ -4022,6 +4034,18 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
>  		return ret;
>  	}
>  
> +	return 0;
> +}

nit: it feels like making the function just as:

static int arm_smmu_disable_irqs(struct arm_smmu_device *smmu)
{
	return arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
				      ARM_SMMU_IRQ_CTRLACK);
}

And print from the caller is less code and cleaner, but no strong opinion.

Reviewed-by: Mostafa Saleh <smostafa@google.com>

> +
> +static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
> +{
> +	int ret, irq;
> +
> +	/* Disable IRQs first */
> +	ret = arm_smmu_disable_irqs(smmu);
> +	if (ret)
> +		return ret;
> +
>  	irq = smmu->combined_irq;
>  	if (irq) {
>  		/*
> @@ -4038,15 +4062,6 @@ static int arm_smmu_setup_irqs(struct arm_smmu_device *smmu)
>  	} else
>  		arm_smmu_setup_unique_irqs(smmu);
>  
> -	if (smmu->features & ARM_SMMU_FEAT_PRI)
> -		irqen_flags |= IRQ_CTRL_PRIQ_IRQEN;
> -
> -	/* Enable interrupt generation on the SMMU */
> -	ret = arm_smmu_write_reg_sync(smmu, irqen_flags,
> -				      ARM_SMMU_IRQ_CTRL, ARM_SMMU_IRQ_CTRLACK);
> -	if (ret)
> -		dev_warn(smmu->dev, "failed to enable irqs\n");
> -
>  	return 0;
>  }
>  
> @@ -4189,11 +4204,8 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu)
>  		}
>  	}
>  
> -	ret = arm_smmu_setup_irqs(smmu);
> -	if (ret) {
> -		dev_err(smmu->dev, "failed to setup irqs\n");
> -		return ret;
> -	}
> +	/* Enable interrupt generation on the SMMU */
> +	arm_smmu_enable_irqs(smmu);
>  
>  	if (is_kdump_kernel())
>  		enables &= ~(CR0_EVTQEN | CR0_PRIQEN);
> @@ -4778,6 +4790,13 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
>  	/* Check for RMRs and install bypass STEs if any */
>  	arm_smmu_rmr_install_bypass_ste(smmu);
>  
> +	/* Setup interrupt handlers */
> +	ret = arm_smmu_setup_irqs(smmu);
> +	if (ret) {
> +		dev_err(smmu->dev, "failed to setup irqs\n");
> +		goto err_free_iopf;
> +	}
> +
>  	/* Reset the device */
>  	ret = arm_smmu_device_reset(smmu);
>  	if (ret)
> -- 
> 2.50.0.rc2.692.g299adb8693-goog
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 2/8] iommu/arm-smmu-v3: Add a helper to drain cmd queues
  2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
  2025-06-16 20:31 ` [RFC PATCH v3 1/8] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
@ 2025-06-16 20:31 ` Pranjal Shrivastava
  2025-07-08 15:32   ` Mostafa Saleh
  2025-06-16 20:31 ` [RFC PATCH v3 3/8] iommu/tegra241-cmdqv: Add a helper to drain VCMDQs Pranjal Shrivastava
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Before we suspend SMMU, we want to ensure that all commands have been
flushed by all the queues i.e. the queues are empty. Add a helper
function that polls till the queues are dained.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 40 +++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 ++
 2 files changed, 43 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f19a77708d2c..0df1b17a09cc 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -981,6 +981,46 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
 					   cmds->num, true);
 }
 
+int arm_smmu_queue_poll_until_empty(struct arm_smmu_device *smmu,
+				    struct arm_smmu_queue *q)
+{
+	struct arm_smmu_queue_poll qp;
+	struct arm_smmu_ll_queue *llq = &q->llq;
+	int ret = 0;
+
+	queue_poll_init(smmu, &qp);
+	do {
+		if (queue_empty(llq))
+			break;
+
+		ret = queue_poll(&qp);
+		WRITE_ONCE(llq->cons, readl_relaxed(q->cons_reg));
+
+	} while (!ret);
+
+	return ret;
+}
+
+static int arm_smmu_drain_queues(struct arm_smmu_device *smmu)
+{
+	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
+	unsigned long flags;
+	int ret;
+
+	/*
+	 * Since this is only called from the suspend callback, we
+	 * should be able to acquire the exclusive lock without failing.
+	 */
+	arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags);
+	ret = arm_smmu_queue_poll_until_empty(smmu, &smmu->cmdq.q);
+	arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
+
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
 static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
 				   struct iommu_page_response *resp)
 {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index dd1ad56ce863..461f1958e574 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -965,6 +965,9 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
 		       struct arm_smmu_cmdq *cmdq);
 
+int arm_smmu_queue_poll_until_empty(struct arm_smmu_device *smmu,
+				    struct arm_smmu_queue *q);
+
 static inline bool arm_smmu_master_canwbs(struct arm_smmu_master *master)
 {
 	return dev_iommu_fwspec_get(master->dev)->flags &
-- 
2.50.0.rc2.692.g299adb8693-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 2/8] iommu/arm-smmu-v3: Add a helper to drain cmd queues
  2025-06-16 20:31 ` [RFC PATCH v3 2/8] iommu/arm-smmu-v3: Add a helper to drain cmd queues Pranjal Shrivastava
@ 2025-07-08 15:32   ` Mostafa Saleh
  2025-07-14  9:24     ` Pranjal Shrivastava
  0 siblings, 1 reply; 28+ messages in thread
From: Mostafa Saleh @ 2025-07-08 15:32 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki, Nicolin Chen, Daniel Mentz, iommu

On Mon, Jun 16, 2025 at 08:31:43PM +0000, Pranjal Shrivastava wrote:
> Before we suspend SMMU, we want to ensure that all commands have been
> flushed by all the queues i.e. the queues are empty. Add a helper
> function that polls till the queues are dained.
> 
> Signed-off-by: Pranjal Shrivastava <praan@google.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 40 +++++++++++++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 ++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index f19a77708d2c..0df1b17a09cc 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -981,6 +981,46 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
>  					   cmds->num, true);
>  }
>  
> +int arm_smmu_queue_poll_until_empty(struct arm_smmu_device *smmu,
> +				    struct arm_smmu_queue *q)
> +{
> +	struct arm_smmu_queue_poll qp;
> +	struct arm_smmu_ll_queue *llq = &q->llq;
> +	int ret = 0;
> +
> +	queue_poll_init(smmu, &qp);
> +	do {
> +		if (queue_empty(llq))
> +			break;
> +
> +		ret = queue_poll(&qp);
> +		WRITE_ONCE(llq->cons, readl_relaxed(q->cons_reg));
> +
> +	} while (!ret);
> +
> +	return ret;
> +}
> +
> +static int arm_smmu_drain_queues(struct arm_smmu_device *smmu)
> +{
> +	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
> +	unsigned long flags;
> +	int ret;
> +
> +	/*
> +	 * Since this is only called from the suspend callback, we
> +	 * should be able to acquire the exclusive lock without failing.
> +	 */

I don’t understand the comment, can’t the command queue be busy at this
point form another unmap context?

> +	arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags);
> +	ret = arm_smmu_queue_poll_until_empty(smmu, &smmu->cmdq.q);
> +	arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
> +
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +

I wonder if that’s enough, I see that if arm_smmu_cmdq_issue_cmdlist()
is called with sync=false, it won't touch the exclusive lock, so what
happens if another driver keeps unmapping, that means that this thread
can loop forever as prod keeps growing?

Thinking out loud about this, if the aim of draining the command queue
is not TLB invalidation but ATC, maybe we need higher-level
synchronization between those 2 and keep the command queue out of this.
(this also applies for the next patch for tegra)

Thanks,
Mostafa

>  static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
>  				   struct iommu_page_response *resp)
>  {
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index dd1ad56ce863..461f1958e574 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -965,6 +965,9 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
>  int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
>  		       struct arm_smmu_cmdq *cmdq);
>  
> +int arm_smmu_queue_poll_until_empty(struct arm_smmu_device *smmu,
> +				    struct arm_smmu_queue *q);
> +
>  static inline bool arm_smmu_master_canwbs(struct arm_smmu_master *master)
>  {
>  	return dev_iommu_fwspec_get(master->dev)->flags &
> -- 
> 2.50.0.rc2.692.g299adb8693-goog
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 2/8] iommu/arm-smmu-v3: Add a helper to drain cmd queues
  2025-07-08 15:32   ` Mostafa Saleh
@ 2025-07-14  9:24     ` Pranjal Shrivastava
  0 siblings, 0 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-07-14  9:24 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki, Nicolin Chen, Daniel Mentz, iommu

On Tue, Jul 08, 2025 at 03:32:08PM +0000, Mostafa Saleh wrote:
> On Mon, Jun 16, 2025 at 08:31:43PM +0000, Pranjal Shrivastava wrote:
> > Before we suspend SMMU, we want to ensure that all commands have been
> > flushed by all the queues i.e. the queues are empty. Add a helper
> > function that polls till the queues are dained.
> > 
> > Signed-off-by: Pranjal Shrivastava <praan@google.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 40 +++++++++++++++++++++
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 ++
> >  2 files changed, 43 insertions(+)
> > 
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index f19a77708d2c..0df1b17a09cc 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -981,6 +981,46 @@ static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
> >  					   cmds->num, true);
> >  }
> >  
> > +int arm_smmu_queue_poll_until_empty(struct arm_smmu_device *smmu,
> > +				    struct arm_smmu_queue *q)
> > +{
> > +	struct arm_smmu_queue_poll qp;
> > +	struct arm_smmu_ll_queue *llq = &q->llq;
> > +	int ret = 0;
> > +
> > +	queue_poll_init(smmu, &qp);
> > +	do {
> > +		if (queue_empty(llq))
> > +			break;
> > +
> > +		ret = queue_poll(&qp);
> > +		WRITE_ONCE(llq->cons, readl_relaxed(q->cons_reg));
> > +
> > +	} while (!ret);
> > +
> > +	return ret;
> > +}
> > +
> > +static int arm_smmu_drain_queues(struct arm_smmu_device *smmu)
> > +{
> > +	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	/*
> > +	 * Since this is only called from the suspend callback, we
> > +	 * should be able to acquire the exclusive lock without failing.
> > +	 */
> 
> I don’t understand the comment, can’t the command queue be busy at this
> point form another unmap context?
> 

The expectation would be that if we entered the suspend callback, all
clients are down.. if there's an interrupt I think we're holding the
power lock during the entirety of this callback, thus the client should
sleep or spin at rpm_get() for the dev?

> > +	arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags);
> > +	ret = arm_smmu_queue_poll_until_empty(smmu, &smmu->cmdq.q);
> > +	arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
> > +
> > +	if (ret)
> > +		return ret;
> > +
> > +	return 0;
> > +}
> > +
> 
> I wonder if that’s enough, I see that if arm_smmu_cmdq_issue_cmdlist()
> is called with sync=false, it won't touch the exclusive lock, so what
> happens if another driver keeps unmapping, that means that this thread
> can loop forever as prod keeps growing?
> 

I'm not sure how can we enter suspend while a client is awake? I'd
expect the devlinks to ensure that suspend isn't called if a client is
awake. If it wakes up while we are *in* the suspend callback, it will
have to wait for the entire callback to execute, i.e. it will spin/sleep
in the rpm_get() call for the client device.

> Thinking out loud about this, if the aim of draining the command queue
> is not TLB invalidation but ATC, maybe we need higher-level
> synchronization between those 2 and keep the command queue out of this.
> (this also applies for the next patch for tegra)
> 

Hmm.. I guess that would mean issuing all ATC invalidations synchronously
We already hold a PM ref while issuing the ATC inv commands, but I think
we'll have to issue all of those with sync (wait till CMD_SYNC is done)

> Thanks,
> Mostafa
> 
> >  static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
> >  				   struct iommu_page_response *resp)
> >  {
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> > index dd1ad56ce863..461f1958e574 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> > @@ -965,6 +965,9 @@ int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
> >  int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
> >  		       struct arm_smmu_cmdq *cmdq);
> >  
> > +int arm_smmu_queue_poll_until_empty(struct arm_smmu_device *smmu,
> > +				    struct arm_smmu_queue *q);
> > +
> >  static inline bool arm_smmu_master_canwbs(struct arm_smmu_master *master)
> >  {
> >  	return dev_iommu_fwspec_get(master->dev)->flags &
> > -- 
> > 2.50.0.rc2.692.g299adb8693-goog
> > 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 3/8] iommu/tegra241-cmdqv: Add a helper to drain VCMDQs
  2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
  2025-06-16 20:31 ` [RFC PATCH v3 1/8] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
  2025-06-16 20:31 ` [RFC PATCH v3 2/8] iommu/arm-smmu-v3: Add a helper to drain cmd queues Pranjal Shrivastava
@ 2025-06-16 20:31 ` Pranjal Shrivastava
  2025-06-16 20:31 ` [RFC PATCH v3 4/8] iommu/arm-smmu-v3: Cache and restore MSI config Pranjal Shrivastava
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

The tegra241-cmdqv driver supports vCMDQs which need to be drained
before suspending the SMMU. The current driver implementation only uses
VINTF0 for vCMDQs owned by the kernel which need to be drained. Add a
helper that drains all the enabled vCMDQs under VINTF0.

Add another function ptr to arm_smmu_impl_ops to drain implementation
specified queues and call it within `arm_smmu_drain_queues`

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  6 ++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  1 +
 .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c    | 27 +++++++++++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0df1b17a09cc..9ce00c959034 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1018,7 +1018,11 @@ static int arm_smmu_drain_queues(struct arm_smmu_device *smmu)
 	if (ret)
 		return ret;
 
-	return 0;
+	/* Drain all implementation-specific queues */
+	if (smmu->impl_ops && smmu->impl_ops->drain_queues)
+		ret = smmu->impl_ops->drain_queues(smmu);
+
+	return ret;
 }
 
 static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 461f1958e574..a950da2e7ed9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -720,6 +720,7 @@ struct arm_smmu_impl_ops {
 	int (*init_structures)(struct arm_smmu_device *smmu);
 	struct arm_smmu_cmdq *(*get_secondary_cmdq)(
 		struct arm_smmu_device *smmu, struct arm_smmu_cmdq_ent *ent);
+	int (*drain_queues)(struct arm_smmu_device *smmu);
 };
 
 /* An SMMUv3 instance */
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index dd7d030d2e89..95d34a8460b5 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -349,6 +349,32 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
 	return &vcmdq->cmdq;
 }
 
+static int tegra241_cmdqv_drain_vintf0_lvcmdqs(struct arm_smmu_device *smmu)
+{
+	struct tegra241_cmdqv *cmdqv =
+		container_of(smmu, struct tegra241_cmdqv, smmu);
+	struct tegra241_vintf *vintf = cmdqv->vintfs[0];
+	int ret = 0;
+	u16 lidx;
+
+	/* Kernel only uses VINTF0. Return if it's disabled */
+	if (!READ_ONCE(vintf->enabled))
+		return 0;
+
+	for (lidx = 0; lidx < cmdqv->num_lvcmdqs_per_vintf; lidx++) {
+		struct tegra241_vcmdq *vcmdq = vintf->lvcmdqs[lidx];
+
+		if (!vcmdq || !READ_ONCE(vcmdq->enabled))
+			continue;
+
+		ret = arm_smmu_queue_poll_until_empty(smmu, &vcmdq->cmdq.q);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
 /* HW Reset Functions */
 
 static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
@@ -681,6 +707,7 @@ static struct arm_smmu_impl_ops tegra241_cmdqv_impl_ops = {
 	.get_secondary_cmdq = tegra241_cmdqv_get_cmdq,
 	.device_reset = tegra241_cmdqv_hw_reset,
 	.device_remove = tegra241_cmdqv_remove,
+	.drain_queues = tegra241_cmdqv_drain_vintf0_lvcmdqs,
 };
 
 /* Probe Functions */
-- 
2.50.0.rc2.692.g299adb8693-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 4/8] iommu/arm-smmu-v3: Cache and restore MSI config
  2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (2 preceding siblings ...)
  2025-06-16 20:31 ` [RFC PATCH v3 3/8] iommu/tegra241-cmdqv: Add a helper to drain VCMDQs Pranjal Shrivastava
@ 2025-06-16 20:31 ` Pranjal Shrivastava
  2025-07-08 15:34   ` Mostafa Saleh
  2025-06-16 20:31 ` [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended() Pranjal Shrivastava
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

The SMMU's MSI configuration registers (*_IRQ_CFGn) containing target
address, data and memory attributes lose their state when the SMMU is
powered down. We'll need to cache and restore their contents to ensure
that MSIs work after the system resumes.

To address this, cache the original `msi_msg` within the `msi_desc`
when the configuration is first written by `arm_smmu_write_msi_msg`.
This primarily includes the target address and data since the memory
attributes are fixed.

Introduce a new helper `arm_smmu_resume_msis` which will later be called
during the driver's resume callback. The helper would retrieve the
cached MSI message for each relevant interrupt (evtq, gerr, priq) via
get_cached_msi_msg & re-config the registers via arm_smmu_write_msi_msg

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 45 +++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9ce00c959034..4699d3294d3e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3962,6 +3962,9 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
 	phys_addr_t *cfg = arm_smmu_msi_cfg[desc->msi_index];
 
+	/* Cache the msi_msg for resume */
+	desc->msg = *msg;
+
 	doorbell = (((u64)msg->address_hi) << 32) | msg->address_lo;
 	doorbell &= MSI_CFG0_ADDR_MASK;
 
@@ -3970,6 +3973,48 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
 	writel_relaxed(ARM_SMMU_MEMATTR_DEVICE_nGnRE, smmu->base + cfg[2]);
 }
 
+static void arm_smmu_resume_msis(struct arm_smmu_device *smmu)
+{
+	unsigned int irq;
+	struct msi_desc *desc;
+	struct msi_msg msg;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_MSI))
+		return;
+
+	if (!smmu->dev->msi.domain)
+		return;
+
+	irq = smmu->evtq.q.irq;
+	desc = irq ? irq_get_msi_desc(irq) : NULL;
+	if (desc) {
+		get_cached_msi_msg(irq, &msg);
+		arm_smmu_write_msi_msg(desc, &msg);
+	} else {
+		dev_err(smmu->dev, "Failed to resume evtq msi");
+	}
+
+	irq = smmu->gerr_irq;
+	desc = irq ? irq_get_msi_desc(irq) : NULL;
+	if (desc) {
+		get_cached_msi_msg(irq, &msg);
+		arm_smmu_write_msi_msg(desc, &msg);
+	} else {
+		dev_err(smmu->dev, "Failed to resume gerror msi");
+	}
+
+	if (smmu->features & ARM_SMMU_FEAT_PRI) {
+		irq = smmu->priq.q.irq;
+		desc = irq ? irq_get_msi_desc(irq) : NULL;
+		if (desc) {
+			get_cached_msi_msg(smmu->priq.q.irq, &msg);
+			arm_smmu_write_msi_msg(desc, &msg);
+		} else {
+			dev_err(smmu->dev, "Failed to resume priq msi");
+		}
+	}
+}
+
 static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
 {
 	int ret, nvec = ARM_SMMU_MAX_MSIS;
-- 
2.50.0.rc2.692.g299adb8693-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 4/8] iommu/arm-smmu-v3: Cache and restore MSI config
  2025-06-16 20:31 ` [RFC PATCH v3 4/8] iommu/arm-smmu-v3: Cache and restore MSI config Pranjal Shrivastava
@ 2025-07-08 15:34   ` Mostafa Saleh
  2025-07-14  9:01     ` Pranjal Shrivastava
  0 siblings, 1 reply; 28+ messages in thread
From: Mostafa Saleh @ 2025-07-08 15:34 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki, Nicolin Chen, Daniel Mentz, iommu

On Mon, Jun 16, 2025 at 08:31:45PM +0000, Pranjal Shrivastava wrote:
> The SMMU's MSI configuration registers (*_IRQ_CFGn) containing target
> address, data and memory attributes lose their state when the SMMU is
> powered down. We'll need to cache and restore their contents to ensure
> that MSIs work after the system resumes.
> 
> To address this, cache the original `msi_msg` within the `msi_desc`
> when the configuration is first written by `arm_smmu_write_msi_msg`.
> This primarily includes the target address and data since the memory
> attributes are fixed.
> 
> Introduce a new helper `arm_smmu_resume_msis` which will later be called
> during the driver's resume callback. The helper would retrieve the
> cached MSI message for each relevant interrupt (evtq, gerr, priq) via
> get_cached_msi_msg & re-config the registers via arm_smmu_write_msi_msg
> 
> Signed-off-by: Pranjal Shrivastava <praan@google.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 45 +++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 9ce00c959034..4699d3294d3e 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3962,6 +3962,9 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
>  	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
>  	phys_addr_t *cfg = arm_smmu_msi_cfg[desc->msi_index];
>  
> +	/* Cache the msi_msg for resume */
> +	desc->msg = *msg;

All callers to arm_smmu_write_msi_msg() get the msg from
get_cached_msi_msg() which already returns msi_desc::msg of this irq,
so that seems redundant.


> +
>  	doorbell = (((u64)msg->address_hi) << 32) | msg->address_lo;
>  	doorbell &= MSI_CFG0_ADDR_MASK;
>  
> @@ -3970,6 +3973,48 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
>  	writel_relaxed(ARM_SMMU_MEMATTR_DEVICE_nGnRE, smmu->base + cfg[2]);
>  }
>  
> +static void arm_smmu_resume_msis(struct arm_smmu_device *smmu)
> +{
> +	unsigned int irq;
> +	struct msi_desc *desc;
> +	struct msi_msg msg;
> +
> +	if (!(smmu->features & ARM_SMMU_FEAT_MSI))
> +		return;
> +
> +	if (!smmu->dev->msi.domain)
> +		return;
Is there a reason that is needed?

I’d expect irq_get_msi_desc() to return NULL if MSI is not used.
Otherwise, I think dev_get_msi_domain() would be more suitable to check?

> +
> +	irq = smmu->evtq.q.irq;
> +	desc = irq ? irq_get_msi_desc(irq) : NULL;
> +	if (desc) {
> +		get_cached_msi_msg(irq, &msg);
> +		arm_smmu_write_msi_msg(desc, &msg);
> +	} else {
> +		dev_err(smmu->dev, "Failed to resume evtq msi");
> +	}
> +
> +	irq = smmu->gerr_irq;
> +	desc = irq ? irq_get_msi_desc(irq) : NULL;
> +	if (desc) {
> +		get_cached_msi_msg(irq, &msg);
> +		arm_smmu_write_msi_msg(desc, &msg);
> +	} else {
> +		dev_err(smmu->dev, "Failed to resume gerror msi");
> +	}
> +
> +	if (smmu->features & ARM_SMMU_FEAT_PRI) {
> +		irq = smmu->priq.q.irq;
> +		desc = irq ? irq_get_msi_desc(irq) : NULL;
> +		if (desc) {
> +			get_cached_msi_msg(smmu->priq.q.irq, &msg);
> +			arm_smmu_write_msi_msg(desc, &msg);
> +		} else {
> +			dev_err(smmu->dev, "Failed to resume priq msi");
> +		}
> +	}
> +}
> +

This pattern seems to repeat, maybe it’s better in a macro

Something as:
#define smmu_resume_msi(smmu, _irq)  {	\
	unsigned int irq;	\
	struct msi_desc *desc; \
	struct msi_msg msg; \
	irq = smmu->_irq; \
	desc = irq ? irq_get_msi_desc(irq) : NULL; \
	if (desc) { \
		get_cached_msi_msg(irq, &msg); \
		arm_smmu_write_msi_msg(desc, &msg); \
	} else {	\
		dev_err(smmu->dev, "Failed to resume %s\n", #_irq);	\
	} \
}

static void arm_smmu_resume_msis(struct arm_smmu_device *smmu)
{
	if (!(smmu->features & ARM_SMMU_FEAT_MSI))
		return;

	if (!smmu->dev->msi.domain)
		return;

	smmu_resume_msi(smmu, evtq.q.irq);
	smmu_resume_msi(smmu, gerr_irq);
	if (smmu->features & ARM_SMMU_FEAT_PRI)
		smmu_resume_msi(smmu, priq.q.irq);
}


Thanks,
Mostafa

>  static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
>  {
>  	int ret, nvec = ARM_SMMU_MAX_MSIS;
> -- 
> 2.50.0.rc2.692.g299adb8693-goog
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 4/8] iommu/arm-smmu-v3: Cache and restore MSI config
  2025-07-08 15:34   ` Mostafa Saleh
@ 2025-07-14  9:01     ` Pranjal Shrivastava
  0 siblings, 0 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-07-14  9:01 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki, Nicolin Chen, Daniel Mentz, iommu

On Tue, Jul 08, 2025 at 03:34:32PM +0000, Mostafa Saleh wrote:
> On Mon, Jun 16, 2025 at 08:31:45PM +0000, Pranjal Shrivastava wrote:
> > The SMMU's MSI configuration registers (*_IRQ_CFGn) containing target
> > address, data and memory attributes lose their state when the SMMU is
> > powered down. We'll need to cache and restore their contents to ensure
> > that MSIs work after the system resumes.
> > 
> > To address this, cache the original `msi_msg` within the `msi_desc`
> > when the configuration is first written by `arm_smmu_write_msi_msg`.
> > This primarily includes the target address and data since the memory
> > attributes are fixed.
> > 
> > Introduce a new helper `arm_smmu_resume_msis` which will later be called
> > during the driver's resume callback. The helper would retrieve the
> > cached MSI message for each relevant interrupt (evtq, gerr, priq) via
> > get_cached_msi_msg & re-config the registers via arm_smmu_write_msi_msg
> > 
> > Signed-off-by: Pranjal Shrivastava <praan@google.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 45 +++++++++++++++++++++
> >  1 file changed, 45 insertions(+)
> > 
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 9ce00c959034..4699d3294d3e 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -3962,6 +3962,9 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
> >  	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
> >  	phys_addr_t *cfg = arm_smmu_msi_cfg[desc->msi_index];
> >  
> > +	/* Cache the msi_msg for resume */
> > +	desc->msg = *msg;
> 
> All callers to arm_smmu_write_msi_msg() get the msg from
> get_cached_msi_msg() which already returns msi_desc::msg of this irq,
> so that seems redundant.
> 

Can you point me to the caller? I quick grep shows me that only vfio,
dw-dma and some spi drivers call get_cached_msi_msg().. I can't also
find a place where the assignment desc->msg = msg happens except for in
PCIe bus.

> 
> > +
> >  	doorbell = (((u64)msg->address_hi) << 32) | msg->address_lo;
> >  	doorbell &= MSI_CFG0_ADDR_MASK;
> >  
> > @@ -3970,6 +3973,48 @@ static void arm_smmu_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
> >  	writel_relaxed(ARM_SMMU_MEMATTR_DEVICE_nGnRE, smmu->base + cfg[2]);
> >  }
> >  
> > +static void arm_smmu_resume_msis(struct arm_smmu_device *smmu)
> > +{
> > +	unsigned int irq;
> > +	struct msi_desc *desc;
> > +	struct msi_msg msg;
> > +
> > +	if (!(smmu->features & ARM_SMMU_FEAT_MSI))
> > +		return;
> > +
> > +	if (!smmu->dev->msi.domain)
> > +		return;
> Is there a reason that is needed?
> 
> I’d expect irq_get_msi_desc() to return NULL if MSI is not used.

Do you mean that this check is redundant as moving ahead we anyway seem
to check irq_get_msi_desc for each irq?

> Otherwise, I think dev_get_msi_domain() would be more suitable to check?

Ack.

> 
> > +
> > +	irq = smmu->evtq.q.irq;
> > +	desc = irq ? irq_get_msi_desc(irq) : NULL;
> > +	if (desc) {
> > +		get_cached_msi_msg(irq, &msg);
> > +		arm_smmu_write_msi_msg(desc, &msg);
> > +	} else {
> > +		dev_err(smmu->dev, "Failed to resume evtq msi");
> > +	}
> > +
> > +	irq = smmu->gerr_irq;
> > +	desc = irq ? irq_get_msi_desc(irq) : NULL;
> > +	if (desc) {
> > +		get_cached_msi_msg(irq, &msg);
> > +		arm_smmu_write_msi_msg(desc, &msg);
> > +	} else {
> > +		dev_err(smmu->dev, "Failed to resume gerror msi");
> > +	}
> > +
> > +	if (smmu->features & ARM_SMMU_FEAT_PRI) {
> > +		irq = smmu->priq.q.irq;
> > +		desc = irq ? irq_get_msi_desc(irq) : NULL;
> > +		if (desc) {
> > +			get_cached_msi_msg(smmu->priq.q.irq, &msg);
> > +			arm_smmu_write_msi_msg(desc, &msg);
> > +		} else {
> > +			dev_err(smmu->dev, "Failed to resume priq msi");
> > +		}
> > +	}
> > +}
> > +
> 
> This pattern seems to repeat, maybe it’s better in a macro
> 
> Something as:
> #define smmu_resume_msi(smmu, _irq)  {	\
> 	unsigned int irq;	\
> 	struct msi_desc *desc; \
> 	struct msi_msg msg; \
> 	irq = smmu->_irq; \
> 	desc = irq ? irq_get_msi_desc(irq) : NULL; \
> 	if (desc) { \
> 		get_cached_msi_msg(irq, &msg); \
> 		arm_smmu_write_msi_msg(desc, &msg); \
> 	} else {	\
> 		dev_err(smmu->dev, "Failed to resume %s\n", #_irq);	\
> 	} \
> }
> 

Yup, that looks good. Thanks!

> static void arm_smmu_resume_msis(struct arm_smmu_device *smmu)
> {
> 	if (!(smmu->features & ARM_SMMU_FEAT_MSI))
> 		return;
> 
> 	if (!smmu->dev->msi.domain)
> 		return;
> 
> 	smmu_resume_msi(smmu, evtq.q.irq);
> 	smmu_resume_msi(smmu, gerr_irq);
> 	if (smmu->features & ARM_SMMU_FEAT_PRI)
> 		smmu_resume_msi(smmu, priq.q.irq);
> }
> 
> 
> Thanks,
> Mostafa
> 
> >  static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
> >  {
> >  	int ret, nvec = ARM_SMMU_MAX_MSIS;
> > -- 
> > 2.50.0.rc2.692.g299adb8693-goog
> > 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (3 preceding siblings ...)
  2025-06-16 20:31 ` [RFC PATCH v3 4/8] iommu/arm-smmu-v3: Cache and restore MSI config Pranjal Shrivastava
@ 2025-06-16 20:31 ` Pranjal Shrivastava
  2025-06-17  9:19   ` kernel test robot
                     ` (2 more replies)
  2025-06-16 20:31 ` [RFC PATCH v3 6/8] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

The existing opportunistic helpers like pm_runtime_get_if_active() and
pm_runtime_get_if_in_use(), are too strict for certain use cases. They
fail if the device is in a transient state like RPM_SUSPENDING, which
can lead to drivers making incorrect assumptions about the dev's state.

These helpers don't suffice for cases where one wishes to elide HW clean
up like queue flushes or TLB invalidations if the device is powered off.
It is wasteful to wake up the device in cases where the resume callback
resets the device well OR if the HW resets in a clean state. Thus, if a
device is powered off, it is preferred to elide any clean-up HW ops like
queue flushes / TLB invalidations when the device will resume afresh.

Consider the following sequence of operations:

1. The device is in `RPM_SUSPENDING` state
2. The driver calls pm_runtime_get_if_active/in_use
3. Depending on these API, the driver elides a HW clean-up op like:

if (pm_runtime_get_if_in_active(dev))
	invalidate_tlb(dev);
else
	// Skip flush, assuming device will fully suspend

4. Now, another rpm dev-linked device wakes up, causing the device's
   state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
   any rpm callbacks, preventing them from resetting the dev correctly.

5. The device continues to be active with an invalid HW state (e.g.
   stale TLB entries, unflushed packets/commands etc.)

Introduce a new helper function, pm_runtime_get_if_not_suspended(),
to address the problem. The pm_runtime_get_if_not_suspended() increments
the device's runtime PM reference only if it's rpm state is NOT
RPM_SUSPENDED and it cancels any pending autosuspend timers.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/base/power/runtime.c | 41 ++++++++++++++++++++++++++++++++++++
 include/linux/pm_runtime.h   |  5 +++++
 2 files changed, 46 insertions(+)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 0e127b0329c0..f1c81b6723cc 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -1260,6 +1260,47 @@ int pm_runtime_get_if_in_use(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(pm_runtime_get_if_in_use);
 
+/**
+ * pm_runtime_get_if_not_suspended - Conditionally bump up runtime PM usage count.
+ * @dev: Target device.
+ *
+ * Increment the runtime PM usage counter of @dev if its runtime PM status is NOT
+ * %RPM_SUSPENDED, in which case it returns 1. If the device state is %RPM_SUSPENDED
+ * 0 is returned. -EINVAL is returned if runtime PM is disabled for the device, in
+ * which case also the usage_count will remain unmodified.
+ *
+ * NOTE: This also cancels any pending autosuspend request.
+ */
+int pm_runtime_get_if_not_suspended(struct device *dev)
+{
+	unsigned long flags;
+	int retval;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+	if (dev->power.disable_depth > 0) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	if (dev->power.runtime_status == RPM_SUSPENDED) {
+		retval = 0;
+		goto out;
+	}
+
+	/* Cancel the auto-suspend timer */
+	pm_runtime_cancel_pending(dev);
+	atomic_inc(&dev->power.usage_count);
+	__update_runtime_status(dev, RPM_ACTIVE);
+	retval = 1;
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	return retval;
+out:
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+	return retval;
+}
+EXPORT_SYMBOL_GPL(pm_runtime_get_if_not_suspended);
+
 /**
  * __pm_runtime_set_status - Set runtime PM status of a device.
  * @dev: Device to handle.
diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h
index 7fb5a459847e..d07bca2b6a83 100644
--- a/include/linux/pm_runtime.h
+++ b/include/linux/pm_runtime.h
@@ -75,6 +75,7 @@ extern int __pm_runtime_suspend(struct device *dev, int rpmflags);
 extern int __pm_runtime_resume(struct device *dev, int rpmflags);
 extern int pm_runtime_get_if_active(struct device *dev);
 extern int pm_runtime_get_if_in_use(struct device *dev);
+extern int pm_runtime_get_if_not_suspended(struct device *dev);
 extern int pm_schedule_suspend(struct device *dev, unsigned int delay);
 extern int __pm_runtime_set_status(struct device *dev, unsigned int status);
 extern int pm_runtime_barrier(struct device *dev);
@@ -283,6 +284,10 @@ static inline int pm_runtime_get_if_active(struct device *dev)
 {
 	return -EINVAL;
 }
+static int pm_runtime_get_if_not_suspended(struct device *dev)
+{
+	return -EINVAL;
+}
 static inline int __pm_runtime_set_status(struct device *dev,
 					    unsigned int status) { return 0; }
 static inline int pm_runtime_barrier(struct device *dev) { return 0; }
-- 
2.50.0.rc2.692.g299adb8693-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-06-16 20:31 ` [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended() Pranjal Shrivastava
@ 2025-06-17  9:19   ` kernel test robot
  2025-07-08 22:00   ` Nicolin Chen
  2025-07-09  6:44   ` Rafael J. Wysocki
  2 siblings, 0 replies; 28+ messages in thread
From: kernel test robot @ 2025-06-17  9:19 UTC (permalink / raw)
  To: Pranjal Shrivastava; +Cc: oe-kbuild-all

Hi Pranjal,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on arm-perf/for-next/perf]
[also build test WARNING on linus/master v6.16-rc2 next-20250617]
[cannot apply to rafael-pm/linux-next rafael-pm/bleeding-edge]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Pranjal-Shrivastava/iommu-arm-smmu-v3-Refactor-arm_smmu_setup_irqs/20250617-043418
base:   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git for-next/perf
patch link:    https://lore.kernel.org/r/20250616203149.2649118-6-praan%40google.com
patch subject: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
config: i386-buildonly-randconfig-004-20250617 (https://download.01.org/0day-ci/archive/20250617/202506171624.6rjwnug7-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250617/202506171624.6rjwnug7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506171624.6rjwnug7-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from block/blk-pm.h:6,
                    from block/blk-mq.c:39:
>> include/linux/pm_runtime.h:287:12: warning: 'pm_runtime_get_if_not_suspended' defined but not used [-Wunused-function]
     287 | static int pm_runtime_get_if_not_suspended(struct device *dev)
         |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/pm_runtime_get_if_not_suspended +287 include/linux/pm_runtime.h

   262	
   263	static inline int __pm_runtime_idle(struct device *dev, int rpmflags)
   264	{
   265		return -ENOSYS;
   266	}
   267	static inline int __pm_runtime_suspend(struct device *dev, int rpmflags)
   268	{
   269		return -ENOSYS;
   270	}
   271	static inline int __pm_runtime_resume(struct device *dev, int rpmflags)
   272	{
   273		return 1;
   274	}
   275	static inline int pm_schedule_suspend(struct device *dev, unsigned int delay)
   276	{
   277		return -ENOSYS;
   278	}
   279	static inline int pm_runtime_get_if_in_use(struct device *dev)
   280	{
   281		return -EINVAL;
   282	}
   283	static inline int pm_runtime_get_if_active(struct device *dev)
   284	{
   285		return -EINVAL;
   286	}
 > 287	static int pm_runtime_get_if_not_suspended(struct device *dev)
   288	{
   289		return -EINVAL;
   290	}
   291	static inline int __pm_runtime_set_status(struct device *dev,
   292						    unsigned int status) { return 0; }
   293	static inline int pm_runtime_barrier(struct device *dev) { return 0; }
   294	static inline bool pm_runtime_block_if_disabled(struct device *dev) { return true; }
   295	static inline void pm_runtime_unblock(struct device *dev) {}
   296	static inline void pm_runtime_enable(struct device *dev) {}
   297	static inline void __pm_runtime_disable(struct device *dev, bool c) {}
   298	static inline bool pm_runtime_blocked(struct device *dev) { return true; }
   299	static inline void pm_runtime_allow(struct device *dev) {}
   300	static inline void pm_runtime_forbid(struct device *dev) {}
   301	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-06-16 20:31 ` [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended() Pranjal Shrivastava
  2025-06-17  9:19   ` kernel test robot
@ 2025-07-08 22:00   ` Nicolin Chen
  2025-07-09 15:51     ` Pranjal Shrivastava
  2025-07-09  6:44   ` Rafael J. Wysocki
  2 siblings, 1 reply; 28+ messages in thread
From: Nicolin Chen @ 2025-07-08 22:00 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki, Mostafa Saleh, Daniel Mentz, iommu

On Mon, Jun 16, 2025 at 08:31:46PM +0000, Pranjal Shrivastava wrote:
> +int pm_runtime_get_if_not_suspended(struct device *dev)
> +{
> +	unsigned long flags;
> +	int retval;
> +
> +	spin_lock_irqsave(&dev->power.lock, flags);
> +	if (dev->power.disable_depth > 0) {
> +		retval = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (dev->power.runtime_status == RPM_SUSPENDED) {
> +		retval = 0;
> +		goto out;
> +	}
> +
> +	/* Cancel the auto-suspend timer */
> +	pm_runtime_cancel_pending(dev);
> +	atomic_inc(&dev->power.usage_count);
> +	__update_runtime_status(dev, RPM_ACTIVE);
> +	retval = 1;
[...]
> +	spin_unlock_irqrestore(&dev->power.lock, flags);
> +
> +	return retval;
> +out:
> +	spin_unlock_irqrestore(&dev->power.lock, flags);
> +	return retval;
> +}
 
One pair is enough.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-08 22:00   ` Nicolin Chen
@ 2025-07-09 15:51     ` Pranjal Shrivastava
  0 siblings, 0 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-07-09 15:51 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki, Mostafa Saleh, Daniel Mentz, iommu

On Tue, Jul 08, 2025 at 03:00:07PM -0700, Nicolin Chen wrote:
> On Mon, Jun 16, 2025 at 08:31:46PM +0000, Pranjal Shrivastava wrote:
> > +int pm_runtime_get_if_not_suspended(struct device *dev)
> > +{
> > +	unsigned long flags;
> > +	int retval;
> > +
> > +	spin_lock_irqsave(&dev->power.lock, flags);
> > +	if (dev->power.disable_depth > 0) {
> > +		retval = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if (dev->power.runtime_status == RPM_SUSPENDED) {
> > +		retval = 0;
> > +		goto out;
> > +	}
> > +
> > +	/* Cancel the auto-suspend timer */
> > +	pm_runtime_cancel_pending(dev);
> > +	atomic_inc(&dev->power.usage_count);
> > +	__update_runtime_status(dev, RPM_ACTIVE);
> > +	retval = 1;
> [...]
> > +	spin_unlock_irqrestore(&dev->power.lock, flags);
> > +
> > +	return retval;
> > +out:
> > +	spin_unlock_irqrestore(&dev->power.lock, flags);
> > +	return retval;
> > +}
>  
> One pair is enough.

Ack. Will fix this.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-06-16 20:31 ` [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended() Pranjal Shrivastava
  2025-06-17  9:19   ` kernel test robot
  2025-07-08 22:00   ` Nicolin Chen
@ 2025-07-09  6:44   ` Rafael J. Wysocki
  2025-07-09 15:51     ` Pranjal Shrivastava
  2 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2025-07-09  6:44 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki, Nicolin Chen, Mostafa Saleh, Daniel Mentz,
	iommu

On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
>
> The existing opportunistic helpers like pm_runtime_get_if_active() and
> pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> fail if the device is in a transient state like RPM_SUSPENDING, which
> can lead to drivers making incorrect assumptions about the dev's state.
>
> These helpers don't suffice for cases where one wishes to elide HW clean
> up like queue flushes or TLB invalidations if the device is powered off.
> It is wasteful to wake up the device in cases where the resume callback
> resets the device well OR if the HW resets in a clean state. Thus, if a
> device is powered off, it is preferred to elide any clean-up HW ops like
> queue flushes / TLB invalidations when the device will resume afresh.
>
> Consider the following sequence of operations:
>
> 1. The device is in `RPM_SUSPENDING` state
> 2. The driver calls pm_runtime_get_if_active/in_use
> 3. Depending on these API, the driver elides a HW clean-up op like:
>
> if (pm_runtime_get_if_in_active(dev))
>         invalidate_tlb(dev);
> else
>         // Skip flush, assuming device will fully suspend
>
> 4. Now, another rpm dev-linked device wakes up, causing the device's
>    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
>    any rpm callbacks, preventing them from resetting the dev correctly.

This never happens.

If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-09  6:44   ` Rafael J. Wysocki
@ 2025-07-09 15:51     ` Pranjal Shrivastava
  2025-07-09 16:35       ` Rafael J. Wysocki
  0 siblings, 1 reply; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-07-09 15:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu

On Wed, Jul 09, 2025 at 08:44:06AM +0200, Rafael J. Wysocki wrote:
> On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
> >
> > The existing opportunistic helpers like pm_runtime_get_if_active() and
> > pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> > fail if the device is in a transient state like RPM_SUSPENDING, which
> > can lead to drivers making incorrect assumptions about the dev's state.
> >
> > These helpers don't suffice for cases where one wishes to elide HW clean
> > up like queue flushes or TLB invalidations if the device is powered off.
> > It is wasteful to wake up the device in cases where the resume callback
> > resets the device well OR if the HW resets in a clean state. Thus, if a
>ppy to hear your feedback or any alternative ideas you might have > device is powered off, it is preferred to elide any clean-up HW ops like
> > queue flushes / TLB invalidations when the device will resume afresh.
> >
> > Consider the following sequence of operations:
> >
> > 1. The device is in `RPM_SUSPENDING` state
> > 2. The driver calls pm_runtime_get_if_active/in_use
> > 3. Depending on these API, the driver elides a HW clean-up op like:
> >
> > if (pm_runtime_get_if_in_active(dev))
> >         invalidate_tlb(dev);
> > else
> >         // Skip flush, assuming device will fully suspend
> >
> > 4. Now, another rpm dev-linked device wakes up, causing the device's
> >    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
> >    any rpm callbacks, preventing them from resetting the dev correctly.
> 
> This never happens.
> 
> If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.

Ack, I believe we're talking about the check[1] in rpm_resume here:

	if (dev->power.runtime_status == RPM_RESUMING ||
	    dev->power.runtime_status == RPM_SUSPENDING) {

	    [...]
	    /* Wait for the operation carried out in parallel with us. */
	    [...]
	    finish_wait(&dev->power.wait_queue, &wait);
	    goto repeat;

However, my worry is about the following situation:

  rpm_suspend			   rpm_resume

  rpm_status = RPM_SUSPENDING 	
  				   if (RPM_SUSPENDING)
				   prepare_to wait(...)

  // suspend fails
  retval = rpm_callback(...);
  goto fail;
  [...]
fail:
  rpm_status = RPM_ACTIVE;
  				   finish_wait(...);
				   goto repeat;
				   repeat:
				   if (rpm_status == RPM_ACTIVE) {
					retval = 1;
					goto out;
				   }
				out:
				   [ ... ] // put_parent if one

				   // we return without resume cb();
				   return retval;

Now, if we rely on APIs based on pm_runtime_get_conditional(), which
might return 0 if (dev->power.runtime_status != RPM_ACTIVE), we might
end up eliding some TLB invalidations in the period where the status was
RPM_SUSPENDING till it's back to RPM_ACTIVE due to a failed suspend.
This can cause some severe address-aliasing/ghost hit issues since the
TLB still has some stale entries..

Instead, we'd like to ensure that the status IS suspended to make a
decision to elide the TLBI or not.

What are your thoughts on this? I'm open to approaching it differently.
Happy to hear your feedback or any alternative ideas you might have

Thanks
Praan

[1] https://elixir.bootlin.com/linux/v6.16-rc5/source/drivers/base/power/runtime.c#L808 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-09 15:51     ` Pranjal Shrivastava
@ 2025-07-09 16:35       ` Rafael J. Wysocki
  2025-07-09 17:06         ` Pranjal Shrivastava
  0 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2025-07-09 16:35 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Rafael J. Wysocki, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe, Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu

On Wed, Jul 9, 2025 at 5:51 PM Pranjal Shrivastava <praan@google.com> wrote:
>
> On Wed, Jul 09, 2025 at 08:44:06AM +0200, Rafael J. Wysocki wrote:
> > On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
> > >
> > > The existing opportunistic helpers like pm_runtime_get_if_active() and
> > > pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> > > fail if the device is in a transient state like RPM_SUSPENDING, which
> > > can lead to drivers making incorrect assumptions about the dev's state.
> > >
> > > These helpers don't suffice for cases where one wishes to elide HW clean
> > > up like queue flushes or TLB invalidations if the device is powered off.
> > > It is wasteful to wake up the device in cases where the resume callback
> > > resets the device well OR if the HW resets in a clean state. Thus, if a
> >ppy to hear your feedback or any alternative ideas you might have > device is powered off, it is preferred to elide any clean-up HW ops like
> > > queue flushes / TLB invalidations when the device will resume afresh.
> > >
> > > Consider the following sequence of operations:
> > >
> > > 1. The device is in `RPM_SUSPENDING` state
> > > 2. The driver calls pm_runtime_get_if_active/in_use
> > > 3. Depending on these API, the driver elides a HW clean-up op like:
> > >
> > > if (pm_runtime_get_if_in_active(dev))
> > >         invalidate_tlb(dev);
> > > else
> > >         // Skip flush, assuming device will fully suspend
> > >
> > > 4. Now, another rpm dev-linked device wakes up, causing the device's
> > >    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
> > >    any rpm callbacks, preventing them from resetting the dev correctly.
> >
> > This never happens.
> >
> > If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.
>
> Ack, I believe we're talking about the check[1] in rpm_resume here:

No, I'm talking about the fact that the status is changed to
RPM_SUSPENDING right before invoking the callback.

>         if (dev->power.runtime_status == RPM_RESUMING ||
>             dev->power.runtime_status == RPM_SUSPENDING) {
>
>             [...]
>             /* Wait for the operation carried out in parallel with us. */
>             [...]
>             finish_wait(&dev->power.wait_queue, &wait);
>             goto repeat;
>
> However, my worry is about the following situation:
>
>   rpm_suspend                      rpm_resume
>
>   rpm_status = RPM_SUSPENDING
>                                    if (RPM_SUSPENDING)
>                                    prepare_to wait(...)
>
>   // suspend fails
>   retval = rpm_callback(...);

You're talking about the callback returning an error and your
changelog is talking about a different situation.

>   goto fail;
>   [...]
> fail:
>   rpm_status = RPM_ACTIVE;
>                                    finish_wait(...);
>                                    goto repeat;
>                                    repeat:
>                                    if (rpm_status == RPM_ACTIVE) {
>                                         retval = 1;
>                                         goto out;
>                                    }
>                                 out:
>                                    [ ... ] // put_parent if one
>
>                                    // we return without resume cb();
>                                    return retval;
>
> Now, if we rely on APIs based on pm_runtime_get_conditional(), which
> might return 0 if (dev->power.runtime_status != RPM_ACTIVE), we might
> end up eliding some TLB invalidations in the period where the status was
> RPM_SUSPENDING till it's back to RPM_ACTIVE due to a failed suspend.

It actually depends on the reason why the callback returned an error.

> This can cause some severe address-aliasing/ghost hit issues since the
> TLB still has some stale entries..

I'm not quite sure what you mean.

> Instead, we'd like to ensure that the status IS suspended

But you can't.

That's the difference between RPM_ACTIVE and RPM_SUSPENDED.  If you
bump up the runtime PM usage counter while the device is RPM_ACTIVE
(and while holding its power.spinlock), it will not suspend.  There's
no way to prevent a suspended device from resuming (other than
disabling runtime PM for it).

> to make a decision to elide the TLBI or not.
>
> What are your thoughts on this? I'm open to approaching it differently.

IMV this is all misguided, sorry.

> Happy to hear your feedback or any alternative ideas you might have

Disable runtime PM for the device and check its runtime PM status,
which cannot be transient at that point.  If it is RPM_ACTIVE, you
need to flush the TLB.

> [1] https://elixir.bootlin.com/linux/v6.16-rc5/source/drivers/base/power/runtime.c#L808

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-09 16:35       ` Rafael J. Wysocki
@ 2025-07-09 17:06         ` Pranjal Shrivastava
  2025-07-09 19:37           ` Rafael J. Wysocki
  0 siblings, 1 reply; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-07-09 17:06 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu

On Wed, Jul 09, 2025 at 06:35:41PM +0200, Rafael J. Wysocki wrote:
> On Wed, Jul 9, 2025 at 5:51 PM Pranjal Shrivastava <praan@google.com> wrote:
> >
> > On Wed, Jul 09, 2025 at 08:44:06AM +0200, Rafael J. Wysocki wrote:
> > > On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > >
> > > > The existing opportunistic helpers like pm_runtime_get_if_active() and
> > > > pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> > > > fail if the device is in a transient state like RPM_SUSPENDING, which
> > > > can lead to drivers making incorrect assumptions about the dev's state.
> > > >
> > > > These helpers don't suffice for cases where one wishes to elide HW clean
> > > > up like queue flushes or TLB invalidations if the device is powered off.
> > > > It is wasteful to wake up the device in cases where the resume callback
> > > > resets the device well OR if the HW resets in a clean state. Thus, if a
> > >ppy to hear your feedback or any alternative ideas you might have > device is powered off, it is preferred to elide any clean-up HW ops like
> > > > queue flushes / TLB invalidations when the device will resume afresh.
> > > >
> > > > Consider the following sequence of operations:
> > > >
> > > > 1. The device is in `RPM_SUSPENDING` state
> > > > 2. The driver calls pm_runtime_get_if_active/in_use
> > > > 3. Depending on these API, the driver elides a HW clean-up op like:
> > > >
> > > > if (pm_runtime_get_if_in_active(dev))
> > > >         invalidate_tlb(dev);
> > > > else
> > > >         // Skip flush, assuming device will fully suspend
> > > >
> > > > 4. Now, another rpm dev-linked device wakes up, causing the device's
> > > >    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
> > > >    any rpm callbacks, preventing them from resetting the dev correctly.
> > >
> > > This never happens.
> > >
> > > If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.
> >
> > Ack, I believe we're talking about the check[1] in rpm_resume here:
> 
> No, I'm talking about the fact that the status is changed to
> RPM_SUSPENDING right before invoking the callback.
> 

Right, but if the callback returns -EAGAIN or something, in
rpm_suspend() we go ahead and set the status to RPM_ACTIVE again
on jumping to the `fail` label[1].

> >         if (dev->power.runtime_status == RPM_RESUMING ||
> >             dev->power.runtime_status == RPM_SUSPENDING) {
> >
> >             [...]
> >             /* Wait for the operation carried out in parallel with us. */
> >             [...]
> >             finish_wait(&dev->power.wait_queue, &wait);
> >             goto repeat;
> >
> > However, my worry is about the following situation:
> >
> >   rpm_suspend                      rpm_resume
> >
> >   rpm_status = RPM_SUSPENDING
> >                                    if (RPM_SUSPENDING)
> >                                    prepare_to wait(...)
> >
> >   // suspend fails
> >   retval = rpm_callback(...);
> 
> You're talking about the callback returning an error and your
> changelog is talking about a different situation.
> 

Apologies, I should've been more clear. What I meant was for a situation
where due to *any* reason (except disabling runtime PM) we bounce back
from RPM_SUSPENDING to RPM_ACTIVE *without* invoking the resume callback

> >   goto fail;
> >   [...]
> > fail:
> >   rpm_status = RPM_ACTIVE;
> >                                    finish_wait(...);
> >                                    goto repeat;
> >                                    repeat:
> >                                    if (rpm_status == RPM_ACTIVE) {
> >                                         retval = 1;
> >                                         goto out;
> >                                    }
> >                                 out:
> >                                    [ ... ] // put_parent if one
> >
> >                                    // we return without resume cb();
> >                                    return retval;
> >
> > Now, if we rely on APIs based on pm_runtime_get_conditional(), which
> > might return 0 if (dev->power.runtime_status != RPM_ACTIVE), we might
> > end up eliding some TLB invalidations in the period where the status was
> > RPM_SUSPENDING till it's back to RPM_ACTIVE due to a failed suspend.
> 
> It actually depends on the reason why the callback returned an error.
> 

I'm not sure if I follow.. as per rpm_suspend[1] I see that upon failing
(which also happens on getting a non-zero retval from the suspend_cb) we
set the runtime status to RPM_ACTIVE.

> > This can cause some severe address-aliasing/ghost hit issues since the
> > TLB still has some stale entries..
> 
> I'm not quite sure what you mean.
> 

I meant if the TLB invalidations were elided (i.e. TLB still had stale
entries) in hope that the IOMMU would be suspended, but due to some
reason the suspend failed and the status gets back to RPM_ACTIVE while
exiting the rpm_suspend call and a client wakes up and performs a DMA
(IOMMU transaction), the TLB entry might hit for an address which was
supposed to be invalidated by now.

> > Instead, we'd like to ensure that the status IS suspended
> 
> But you can't.
> 
> That's the difference between RPM_ACTIVE and RPM_SUSPENDED.  If you
> bump up the runtime PM usage counter while the device is RPM_ACTIVE
> (and while holding its power.spinlock), it will not suspend.  There's
> no way to prevent a suspended device from resuming (other than
> disabling runtime PM for it).
> 

The problem is avoiding bumping up the count while the device is in a
transient state like RPM_SUSPENDING and then bouncing back to the
RPM_ACTIVE state without invoking RPM resume callback, which seems to be
possible as per the rpm_suspend implementation [1].

> > to make a decision to elide the TLBI or not.
> >
> > What are your thoughts on this? I'm open to approaching it differently.
> 
> IMV this is all misguided, sorry.
> 

Ack. But I'd want to understand better here. Are you saying that it
isn't at all possible that RPM_SUSPENDING bounces back to RPM_ACTIVE
without invoking the rpm_resume ever? Because per the following snippet,
it seems likely:

	__update_runtime_status(dev, RPM_SUSPENDING);

	callback = RPM_GET_CALLBACK(dev, runtime_suspend);

	dev_pm_enable_wake_irq_check(dev, true);
	retval = rpm_callback(callback, dev);
	if (retval)
		goto fail;

	[...]

fail:
	dev_pm_disable_wake_irq_check(dev, true);
	__update_runtime_status(dev, RPM_ACTIVE);
	dev->power.deferred_resume = false;
	wake_up_all(&dev->power.wait_queue);
	[...]

Sorry if I'm missing something here, but it does look like we can bounce
back to RPM_ACTIVE without invoking the resume callback, which is a
behaviour we'd like to avoid.

> > Happy to hear your feedback or any alternative ideas you might have
> 
> Disable runtime PM for the device and check its runtime PM status,
> which cannot be transient at that point.  If it is RPM_ACTIVE, you
> need to flush the TLB.
> 

I see... let me see how or if we can safely do that.. we might be in
interrupt context while flushing the TLBs..

Thanks,
Praan

[1] https://elixir.bootlin.com/linux/v6.16-rc5/source/drivers/base/power/runtime.c#L678 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-09 17:06         ` Pranjal Shrivastava
@ 2025-07-09 19:37           ` Rafael J. Wysocki
  2025-07-10  8:59             ` Pranjal Shrivastava
  0 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2025-07-09 19:37 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Rafael J. Wysocki, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe, Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu

On Wed, Jul 9, 2025 at 7:06 PM Pranjal Shrivastava <praan@google.com> wrote:
>
> On Wed, Jul 09, 2025 at 06:35:41PM +0200, Rafael J. Wysocki wrote:
> > On Wed, Jul 9, 2025 at 5:51 PM Pranjal Shrivastava <praan@google.com> wrote:
> > >
> > > On Wed, Jul 09, 2025 at 08:44:06AM +0200, Rafael J. Wysocki wrote:
> > > > On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > > >
> > > > > The existing opportunistic helpers like pm_runtime_get_if_active() and
> > > > > pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> > > > > fail if the device is in a transient state like RPM_SUSPENDING, which
> > > > > can lead to drivers making incorrect assumptions about the dev's state.
> > > > >
> > > > > These helpers don't suffice for cases where one wishes to elide HW clean
> > > > > up like queue flushes or TLB invalidations if the device is powered off.
> > > > > It is wasteful to wake up the device in cases where the resume callback
> > > > > resets the device well OR if the HW resets in a clean state. Thus, if a
> > > >ppy to hear your feedback or any alternative ideas you might have > device is powered off, it is preferred to elide any clean-up HW ops like
> > > > > queue flushes / TLB invalidations when the device will resume afresh.
> > > > >
> > > > > Consider the following sequence of operations:
> > > > >
> > > > > 1. The device is in `RPM_SUSPENDING` state
> > > > > 2. The driver calls pm_runtime_get_if_active/in_use
> > > > > 3. Depending on these API, the driver elides a HW clean-up op like:
> > > > >
> > > > > if (pm_runtime_get_if_in_active(dev))
> > > > >         invalidate_tlb(dev);
> > > > > else
> > > > >         // Skip flush, assuming device will fully suspend
> > > > >
> > > > > 4. Now, another rpm dev-linked device wakes up, causing the device's
> > > > >    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
> > > > >    any rpm callbacks, preventing them from resetting the dev correctly.
> > > >
> > > > This never happens.
> > > >
> > > > If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.
> > >
> > > Ack, I believe we're talking about the check[1] in rpm_resume here:
> >
> > No, I'm talking about the fact that the status is changed to
> > RPM_SUSPENDING right before invoking the callback.
> >
>
> Right, but if the callback returns -EAGAIN or something, in
> rpm_suspend() we go ahead and set the status to RPM_ACTIVE again
> on jumping to the `fail` label[1].
>
> > >         if (dev->power.runtime_status == RPM_RESUMING ||
> > >             dev->power.runtime_status == RPM_SUSPENDING) {
> > >
> > >             [...]
> > >             /* Wait for the operation carried out in parallel with us. */
> > >             [...]
> > >             finish_wait(&dev->power.wait_queue, &wait);
> > >             goto repeat;
> > >
> > > However, my worry is about the following situation:
> > >
> > >   rpm_suspend                      rpm_resume
> > >
> > >   rpm_status = RPM_SUSPENDING
> > >                                    if (RPM_SUSPENDING)
> > >                                    prepare_to wait(...)
> > >
> > >   // suspend fails
> > >   retval = rpm_callback(...);
> >
> > You're talking about the callback returning an error and your
> > changelog is talking about a different situation.
>
> Apologies, I should've been more clear. What I meant was for a situation
> where due to *any* reason (except disabling runtime PM) we bounce back
> from RPM_SUSPENDING to RPM_ACTIVE *without* invoking the resume callback

This only can happen if the suspend callback returns an error, but I'm
not sure why and how this matters.

pm_runtime_get_if_active() does not guarantee anything in the case
when 0 is returned anyway.

> > >   goto fail;
> > >   [...]
> > > fail:
> > >   rpm_status = RPM_ACTIVE;
> > >                                    finish_wait(...);
> > >                                    goto repeat;
> > >                                    repeat:
> > >                                    if (rpm_status == RPM_ACTIVE) {
> > >                                         retval = 1;
> > >                                         goto out;
> > >                                    }
> > >                                 out:
> > >                                    [ ... ] // put_parent if one
> > >
> > >                                    // we return without resume cb();
> > >                                    return retval;
> > >
> > > Now, if we rely on APIs based on pm_runtime_get_conditional(), which
> > > might return 0 if (dev->power.runtime_status != RPM_ACTIVE), we might
> > > end up eliding some TLB invalidations in the period where the status was
> > > RPM_SUSPENDING till it's back to RPM_ACTIVE due to a failed suspend.
> >
> > It actually depends on the reason why the callback returned an error.
> >
>
> I'm not sure if I follow.. as per rpm_suspend[1] I see that upon failing
> (which also happens on getting a non-zero retval from the suspend_cb) we
> set the runtime status to RPM_ACTIVE.

Yes, it just goes back to the status from before the failure because
when the error code is -EAGAIN or -EBUSY, the status should be still
RPM_ACTIVE and otherwise runtime_error is set and the device likely
requires some help.

> > > This can cause some severe address-aliasing/ghost hit issues since the
> > > TLB still has some stale entries..
> >
> > I'm not quite sure what you mean.
> >
>
> I meant if the TLB invalidations were elided (i.e. TLB still had stale
> entries) in hope that the IOMMU would be suspended, but due to some
> reason the suspend failed and the status gets back to RPM_ACTIVE while
> exiting the rpm_suspend call and a client wakes up and performs a DMA
> (IOMMU transaction), the TLB entry might hit for an address which was
> supposed to be invalidated by now.

So as I said this is just one reason why you cannot rely on runtime PM
to guarantee that the device will remain suspended, or in fact whether
or not it will be suspended at all.

> > > Instead, we'd like to ensure that the status IS suspended
> >
> > But you can't.
> >
> > That's the difference between RPM_ACTIVE and RPM_SUSPENDED.  If you
> > bump up the runtime PM usage counter while the device is RPM_ACTIVE
> > (and while holding its power.spinlock), it will not suspend.  There's
> > no way to prevent a suspended device from resuming (other than
> > disabling runtime PM for it).
> >
>
> The problem is avoiding bumping up the count while the device is in a
> transient state like RPM_SUSPENDING and then bouncing back to the
> RPM_ACTIVE state without invoking RPM resume callback, which seems to be
> possible as per the rpm_suspend implementation [1].

I'm not sure what you mean here.

The usage counter is bumped up by pm_runtime_get_if_active() only if
the device is RPM_ACTIVE in which case it will guarantee that the
runtime PM status will not change.  There is no way to provide a
similar guarantee on the RPM_SUSPENDED side short of disabling runtime
PM.

> > > to make a decision to elide the TLBI or not.
> > >
> > > What are your thoughts on this? I'm open to approaching it differently.
> >
> > IMV this is all misguided, sorry.
> >
>
> Ack. But I'd want to understand better here. Are you saying that it
> isn't at all possible that RPM_SUSPENDING bounces back to RPM_ACTIVE
> without invoking the rpm_resume ever? Because per the following snippet,
> it seems likely:
>
>         __update_runtime_status(dev, RPM_SUSPENDING);
>
>         callback = RPM_GET_CALLBACK(dev, runtime_suspend);
>
>         dev_pm_enable_wake_irq_check(dev, true);
>         retval = rpm_callback(callback, dev);
>         if (retval)
>                 goto fail;
>
>         [...]
>
> fail:
>         dev_pm_disable_wake_irq_check(dev, true);
>         __update_runtime_status(dev, RPM_ACTIVE);
>         dev->power.deferred_resume = false;
>         wake_up_all(&dev->power.wait_queue);
>         [...]
>
> Sorry if I'm missing something here, but it does look like we can bounce
> back to RPM_ACTIVE without invoking the resume callback, which is a
> behaviour we'd like to avoid.

So let me repeat: This only happens if the suspend callback returns an
error and in that case it is not clear whether or not the resume
callback needs to be invoked (either it doesn't or the device can be
assumed to be in a state in which invoking that callback will not help
either way).

> > > Happy to hear your feedback or any alternative ideas you might have
> >
> > Disable runtime PM for the device and check its runtime PM status,
> > which cannot be transient at that point.  If it is RPM_ACTIVE, you
> > need to flush the TLB.
> >
>
> I see... let me see how or if we can safely do that.. we might be in
> interrupt context while flushing the TLBs..

There is no way known to me in which interrupt context can prevent a
suspended device from resuming, sorry.

>
> [1] https://elixir.bootlin.com/linux/v6.16-rc5/source/drivers/base/power/runtime.c#L678

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-09 19:37           ` Rafael J. Wysocki
@ 2025-07-10  8:59             ` Pranjal Shrivastava
  2025-07-10 10:29               ` Rafael J. Wysocki
  0 siblings, 1 reply; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-07-10  8:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu

On Wed, Jul 09, 2025 at 09:37:58PM +0200, Rafael J. Wysocki wrote:
> On Wed, Jul 9, 2025 at 7:06 PM Pranjal Shrivastava <praan@google.com> wrote:
> >
> > On Wed, Jul 09, 2025 at 06:35:41PM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jul 9, 2025 at 5:51 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > >
> > > > On Wed, Jul 09, 2025 at 08:44:06AM +0200, Rafael J. Wysocki wrote:
> > > > > On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > > > >
> > > > > > The existing opportunistic helpers like pm_runtime_get_if_active() and
> > > > > > pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> > > > > > fail if the device is in a transient state like RPM_SUSPENDING, which
> > > > > > can lead to drivers making incorrect assumptions about the dev's state.
> > > > > >
> > > > > > These helpers don't suffice for cases where one wishes to elide HW clean
> > > > > > up like queue flushes or TLB invalidations if the device is powered off.
> > > > > > It is wasteful to wake up the device in cases where the resume callback
> > > > > > resets the device well OR if the HW resets in a clean state. Thus, if a
> > > > >ppy to hear your feedback or any alternative ideas you might have > device is powered off, it is preferred to elide any clean-up HW ops like
> > > > > > queue flushes / TLB invalidations when the device will resume afresh.
> > > > > >
> > > > > > Consider the following sequence of operations:
> > > > > >
> > > > > > 1. The device is in `RPM_SUSPENDING` state
> > > > > > 2. The driver calls pm_runtime_get_if_active/in_use
> > > > > > 3. Depending on these API, the driver elides a HW clean-up op like:
> > > > > >
> > > > > > if (pm_runtime_get_if_in_active(dev))
> > > > > >         invalidate_tlb(dev);
> > > > > > else
> > > > > >         // Skip flush, assuming device will fully suspend
> > > > > >
> > > > > > 4. Now, another rpm dev-linked device wakes up, causing the device's
> > > > > >    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
> > > > > >    any rpm callbacks, preventing them from resetting the dev correctly.
> > > > >
> > > > > This never happens.
> > > > >
> > > > > If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.
> > > >
> > > > Ack, I believe we're talking about the check[1] in rpm_resume here:
> > >
> > > No, I'm talking about the fact that the status is changed to
> > > RPM_SUSPENDING right before invoking the callback.
> > >
> >
> > Right, but if the callback returns -EAGAIN or something, in
> > rpm_suspend() we go ahead and set the status to RPM_ACTIVE again
> > on jumping to the `fail` label[1].
> >
> > > >         if (dev->power.runtime_status == RPM_RESUMING ||
> > > >             dev->power.runtime_status == RPM_SUSPENDING) {
> > > >
> > > >             [...]
> > > >             /* Wait for the operation carried out in parallel with us. */
> > > >             [...]
> > > >             finish_wait(&dev->power.wait_queue, &wait);
> > > >             goto repeat;
> > > >
> > > > However, my worry is about the following situation:
> > > >
> > > >   rpm_suspend                      rpm_resume
> > > >
> > > >   rpm_status = RPM_SUSPENDING
> > > >                                    if (RPM_SUSPENDING)
> > > >                                    prepare_to wait(...)
> > > >
> > > >   // suspend fails
> > > >   retval = rpm_callback(...);
> > >
> > > You're talking about the callback returning an error and your
> > > changelog is talking about a different situation.
> >
> > Apologies, I should've been more clear. What I meant was for a situation
> > where due to *any* reason (except disabling runtime PM) we bounce back
> > from RPM_SUSPENDING to RPM_ACTIVE *without* invoking the resume callback
> 
> This only can happen if the suspend callback returns an error, but I'm
> not sure why and how this matters.
> 
> pm_runtime_get_if_active() does not guarantee anything in the case
> when 0 is returned anyway.
> 
> > > >   goto fail;
> > > >   [...]
> > > > fail:
> > > >   rpm_status = RPM_ACTIVE;
> > > >                                    finish_wait(...);
> > > >                                    goto repeat;
> > > >                                    repeat:
> > > >                                    if (rpm_status == RPM_ACTIVE) {
> > > >                                         retval = 1;
> > > >                                         goto out;
> > > >                                    }
> > > >                                 out:
> > > >                                    [ ... ] // put_parent if one
> > > >
> > > >                                    // we return without resume cb();
> > > >                                    return retval;
> > > >
> > > > Now, if we rely on APIs based on pm_runtime_get_conditional(), which
> > > > might return 0 if (dev->power.runtime_status != RPM_ACTIVE), we might
> > > > end up eliding some TLB invalidations in the period where the status was
> > > > RPM_SUSPENDING till it's back to RPM_ACTIVE due to a failed suspend.
> > >
> > > It actually depends on the reason why the callback returned an error.
> > >
> >
> > I'm not sure if I follow.. as per rpm_suspend[1] I see that upon failing
> > (which also happens on getting a non-zero retval from the suspend_cb) we
> > set the runtime status to RPM_ACTIVE.
> 
> Yes, it just goes back to the status from before the failure because
> when the error code is -EAGAIN or -EBUSY, the status should be still
> RPM_ACTIVE and otherwise runtime_error is set and the device likely
> requires some help.
> 
> > > > This can cause some severe address-aliasing/ghost hit issues since the
> > > > TLB still has some stale entries..
> > >
> > > I'm not quite sure what you mean.
> > >
> >
> > I meant if the TLB invalidations were elided (i.e. TLB still had stale
> > entries) in hope that the IOMMU would be suspended, but due to some
> > reason the suspend failed and the status gets back to RPM_ACTIVE while
> > exiting the rpm_suspend call and a client wakes up and performs a DMA
> > (IOMMU transaction), the TLB entry might hit for an address which was
> > supposed to be invalidated by now.
> 
> So as I said this is just one reason why you cannot rely on runtime PM
> to guarantee that the device will remain suspended, or in fact whether
> or not it will be suspended at all.
> 
> > > > Instead, we'd like to ensure that the status IS suspended
> > >
> > > But you can't.
> > >
> > > That's the difference between RPM_ACTIVE and RPM_SUSPENDED.  If you
> > > bump up the runtime PM usage counter while the device is RPM_ACTIVE
> > > (and while holding its power.spinlock), it will not suspend.  There's
> > > no way to prevent a suspended device from resuming (other than
> > > disabling runtime PM for it).
> > >
> >
> > The problem is avoiding bumping up the count while the device is in a
> > transient state like RPM_SUSPENDING and then bouncing back to the
> > RPM_ACTIVE state without invoking RPM resume callback, which seems to be
> > possible as per the rpm_suspend implementation [1].
> 
> I'm not sure what you mean here.
> 
> The usage counter is bumped up by pm_runtime_get_if_active() only if
> the device is RPM_ACTIVE in which case it will guarantee that the
> runtime PM status will not change.  There is no way to provide a
> similar guarantee on the RPM_SUSPENDED side short of disabling runtime
> PM.
> 
> > > > to make a decision to elide the TLBI or not.
> > > >
> > > > What are your thoughts on this? I'm open to approaching it differently.
> > >
> > > IMV this is all misguided, sorry.
> > >
> >
> > Ack. But I'd want to understand better here. Are you saying that it
> > isn't at all possible that RPM_SUSPENDING bounces back to RPM_ACTIVE
> > without invoking the rpm_resume ever? Because per the following snippet,
> > it seems likely:
> >
> >         __update_runtime_status(dev, RPM_SUSPENDING);
> >
> >         callback = RPM_GET_CALLBACK(dev, runtime_suspend);
> >
> >         dev_pm_enable_wake_irq_check(dev, true);
> >         retval = rpm_callback(callback, dev);
> >         if (retval)
> >                 goto fail;
> >
> >         [...]
> >
> > fail:
> >         dev_pm_disable_wake_irq_check(dev, true);
> >         __update_runtime_status(dev, RPM_ACTIVE);
> >         dev->power.deferred_resume = false;
> >         wake_up_all(&dev->power.wait_queue);
> >         [...]
> >
> > Sorry if I'm missing something here, but it does look like we can bounce
> > back to RPM_ACTIVE without invoking the resume callback, which is a
> > behaviour we'd like to avoid.
> 
> So let me repeat: This only happens if the suspend callback returns an
> error and in that case it is not clear whether or not the resume
> callback needs to be invoked (either it doesn't or the device can be
> assumed to be in a state in which invoking that callback will not help
> either way).
> 

Alright. My goal here is to ensure that TLB invalidations are elided
only when the device (IOMMU) is suspended. For that I believed we 
could do either of the following:

1. Ensure RPM_SUSPENDING -> RPM_ACTIVE transition *always* invokes the
resume callback (this is because the resume cb flushes the entire TLB).

2. Ensure that we are able to reliably *get* a PM ref if the device is
NOT suspended, i.e. even if it was in a transient state.

But I guess that maybe the new API (option 2) isn't the right way to go.

I'm assuming you mean that if we design the suspend callback in a way
that it doesn't return error, we can reliably ensure that the resume
callback will be called before transitioning to the RPM_ACTIVE state? 

[...]

Thanks,
Praan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-10  8:59             ` Pranjal Shrivastava
@ 2025-07-10 10:29               ` Rafael J. Wysocki
  2025-07-11 10:20                 ` Pranjal Shrivastava
  0 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2025-07-10 10:29 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Rafael J. Wysocki, Joerg Roedel, Will Deacon, Robin Murphy,
	Jason Gunthorpe, Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu

On Thu, Jul 10, 2025 at 11:00 AM Pranjal Shrivastava <praan@google.com> wrote:
>
> On Wed, Jul 09, 2025 at 09:37:58PM +0200, Rafael J. Wysocki wrote:
> > On Wed, Jul 9, 2025 at 7:06 PM Pranjal Shrivastava <praan@google.com> wrote:
> > >
> > > On Wed, Jul 09, 2025 at 06:35:41PM +0200, Rafael J. Wysocki wrote:
> > > > On Wed, Jul 9, 2025 at 5:51 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > > >
> > > > > On Wed, Jul 09, 2025 at 08:44:06AM +0200, Rafael J. Wysocki wrote:
> > > > > > On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > > > > >
> > > > > > > The existing opportunistic helpers like pm_runtime_get_if_active() and
> > > > > > > pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> > > > > > > fail if the device is in a transient state like RPM_SUSPENDING, which
> > > > > > > can lead to drivers making incorrect assumptions about the dev's state.
> > > > > > >
> > > > > > > These helpers don't suffice for cases where one wishes to elide HW clean
> > > > > > > up like queue flushes or TLB invalidations if the device is powered off.
> > > > > > > It is wasteful to wake up the device in cases where the resume callback
> > > > > > > resets the device well OR if the HW resets in a clean state. Thus, if a
> > > > > >ppy to hear your feedback or any alternative ideas you might have > device is powered off, it is preferred to elide any clean-up HW ops like
> > > > > > > queue flushes / TLB invalidations when the device will resume afresh.
> > > > > > >
> > > > > > > Consider the following sequence of operations:
> > > > > > >
> > > > > > > 1. The device is in `RPM_SUSPENDING` state
> > > > > > > 2. The driver calls pm_runtime_get_if_active/in_use
> > > > > > > 3. Depending on these API, the driver elides a HW clean-up op like:
> > > > > > >
> > > > > > > if (pm_runtime_get_if_in_active(dev))
> > > > > > >         invalidate_tlb(dev);
> > > > > > > else
> > > > > > >         // Skip flush, assuming device will fully suspend
> > > > > > >
> > > > > > > 4. Now, another rpm dev-linked device wakes up, causing the device's
> > > > > > >    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
> > > > > > >    any rpm callbacks, preventing them from resetting the dev correctly.
> > > > > >
> > > > > > This never happens.
> > > > > >
> > > > > > If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.
> > > > >
> > > > > Ack, I believe we're talking about the check[1] in rpm_resume here:
> > > >
> > > > No, I'm talking about the fact that the status is changed to
> > > > RPM_SUSPENDING right before invoking the callback.
> > > >
> > >
> > > Right, but if the callback returns -EAGAIN or something, in
> > > rpm_suspend() we go ahead and set the status to RPM_ACTIVE again
> > > on jumping to the `fail` label[1].
> > >
> > > > >         if (dev->power.runtime_status == RPM_RESUMING ||
> > > > >             dev->power.runtime_status == RPM_SUSPENDING) {
> > > > >
> > > > >             [...]
> > > > >             /* Wait for the operation carried out in parallel with us. */
> > > > >             [...]
> > > > >             finish_wait(&dev->power.wait_queue, &wait);
> > > > >             goto repeat;
> > > > >
> > > > > However, my worry is about the following situation:
> > > > >
> > > > >   rpm_suspend                      rpm_resume
> > > > >
> > > > >   rpm_status = RPM_SUSPENDING
> > > > >                                    if (RPM_SUSPENDING)
> > > > >                                    prepare_to wait(...)
> > > > >
> > > > >   // suspend fails
> > > > >   retval = rpm_callback(...);
> > > >
> > > > You're talking about the callback returning an error and your
> > > > changelog is talking about a different situation.
> > >
> > > Apologies, I should've been more clear. What I meant was for a situation
> > > where due to *any* reason (except disabling runtime PM) we bounce back
> > > from RPM_SUSPENDING to RPM_ACTIVE *without* invoking the resume callback
> >
> > This only can happen if the suspend callback returns an error, but I'm
> > not sure why and how this matters.
> >
> > pm_runtime_get_if_active() does not guarantee anything in the case
> > when 0 is returned anyway.
> >
> > > > >   goto fail;
> > > > >   [...]
> > > > > fail:
> > > > >   rpm_status = RPM_ACTIVE;
> > > > >                                    finish_wait(...);
> > > > >                                    goto repeat;
> > > > >                                    repeat:
> > > > >                                    if (rpm_status == RPM_ACTIVE) {
> > > > >                                         retval = 1;
> > > > >                                         goto out;
> > > > >                                    }
> > > > >                                 out:
> > > > >                                    [ ... ] // put_parent if one
> > > > >
> > > > >                                    // we return without resume cb();
> > > > >                                    return retval;
> > > > >
> > > > > Now, if we rely on APIs based on pm_runtime_get_conditional(), which
> > > > > might return 0 if (dev->power.runtime_status != RPM_ACTIVE), we might
> > > > > end up eliding some TLB invalidations in the period where the status was
> > > > > RPM_SUSPENDING till it's back to RPM_ACTIVE due to a failed suspend.
> > > >
> > > > It actually depends on the reason why the callback returned an error.
> > > >
> > >
> > > I'm not sure if I follow.. as per rpm_suspend[1] I see that upon failing
> > > (which also happens on getting a non-zero retval from the suspend_cb) we
> > > set the runtime status to RPM_ACTIVE.
> >
> > Yes, it just goes back to the status from before the failure because
> > when the error code is -EAGAIN or -EBUSY, the status should be still
> > RPM_ACTIVE and otherwise runtime_error is set and the device likely
> > requires some help.
> >
> > > > > This can cause some severe address-aliasing/ghost hit issues since the
> > > > > TLB still has some stale entries..
> > > >
> > > > I'm not quite sure what you mean.
> > > >
> > >
> > > I meant if the TLB invalidations were elided (i.e. TLB still had stale
> > > entries) in hope that the IOMMU would be suspended, but due to some
> > > reason the suspend failed and the status gets back to RPM_ACTIVE while
> > > exiting the rpm_suspend call and a client wakes up and performs a DMA
> > > (IOMMU transaction), the TLB entry might hit for an address which was
> > > supposed to be invalidated by now.
> >
> > So as I said this is just one reason why you cannot rely on runtime PM
> > to guarantee that the device will remain suspended, or in fact whether
> > or not it will be suspended at all.
> >
> > > > > Instead, we'd like to ensure that the status IS suspended
> > > >
> > > > But you can't.
> > > >
> > > > That's the difference between RPM_ACTIVE and RPM_SUSPENDED.  If you
> > > > bump up the runtime PM usage counter while the device is RPM_ACTIVE
> > > > (and while holding its power.spinlock), it will not suspend.  There's
> > > > no way to prevent a suspended device from resuming (other than
> > > > disabling runtime PM for it).
> > > >
> > >
> > > The problem is avoiding bumping up the count while the device is in a
> > > transient state like RPM_SUSPENDING and then bouncing back to the
> > > RPM_ACTIVE state without invoking RPM resume callback, which seems to be
> > > possible as per the rpm_suspend implementation [1].
> >
> > I'm not sure what you mean here.
> >
> > The usage counter is bumped up by pm_runtime_get_if_active() only if
> > the device is RPM_ACTIVE in which case it will guarantee that the
> > runtime PM status will not change.  There is no way to provide a
> > similar guarantee on the RPM_SUSPENDED side short of disabling runtime
> > PM.
> >
> > > > > to make a decision to elide the TLBI or not.
> > > > >
> > > > > What are your thoughts on this? I'm open to approaching it differently.
> > > >
> > > > IMV this is all misguided, sorry.
> > > >
> > >
> > > Ack. But I'd want to understand better here. Are you saying that it
> > > isn't at all possible that RPM_SUSPENDING bounces back to RPM_ACTIVE
> > > without invoking the rpm_resume ever? Because per the following snippet,
> > > it seems likely:
> > >
> > >         __update_runtime_status(dev, RPM_SUSPENDING);
> > >
> > >         callback = RPM_GET_CALLBACK(dev, runtime_suspend);
> > >
> > >         dev_pm_enable_wake_irq_check(dev, true);
> > >         retval = rpm_callback(callback, dev);
> > >         if (retval)
> > >                 goto fail;
> > >
> > >         [...]
> > >
> > > fail:
> > >         dev_pm_disable_wake_irq_check(dev, true);
> > >         __update_runtime_status(dev, RPM_ACTIVE);
> > >         dev->power.deferred_resume = false;
> > >         wake_up_all(&dev->power.wait_queue);
> > >         [...]
> > >
> > > Sorry if I'm missing something here, but it does look like we can bounce
> > > back to RPM_ACTIVE without invoking the resume callback, which is a
> > > behaviour we'd like to avoid.
> >
> > So let me repeat: This only happens if the suspend callback returns an
> > error and in that case it is not clear whether or not the resume
> > callback needs to be invoked (either it doesn't or the device can be
> > assumed to be in a state in which invoking that callback will not help
> > either way).
> >
>
> Alright. My goal here is to ensure that TLB invalidations are elided
> only when the device (IOMMU) is suspended. For that I believed we
> could do either of the following:
>
> 1. Ensure RPM_SUSPENDING -> RPM_ACTIVE transition *always* invokes the
> resume callback (this is because the resume cb flushes the entire TLB).

That would be incorrect at least in some cases.

> 2. Ensure that we are able to reliably *get* a PM ref if the device is
> NOT suspended, i.e. even if it was in a transient state.

A transient state means a point of no return, so I don't think this
can work the way you want.

> But I guess that maybe the new API (option 2) isn't the right way to go.
>
> I'm assuming you mean that if we design the suspend callback in a way
> that it doesn't return error, we can reliably ensure that the resume
> callback will be called before transitioning to the RPM_ACTIVE state?

Yes overall, but note that driver callbacks are usually invoked by bus
type or PM domain callbacks that can fail.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-10 10:29               ` Rafael J. Wysocki
@ 2025-07-11 10:20                 ` Pranjal Shrivastava
  2025-07-15 23:52                   ` Daniel Mentz
  0 siblings, 1 reply; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-07-11 10:20 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu

On Thu, Jul 10, 2025 at 12:29:03PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jul 10, 2025 at 11:00 AM Pranjal Shrivastava <praan@google.com> wrote:
> >
> > On Wed, Jul 09, 2025 at 09:37:58PM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jul 9, 2025 at 7:06 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > >
> > > > On Wed, Jul 09, 2025 at 06:35:41PM +0200, Rafael J. Wysocki wrote:
> > > > > On Wed, Jul 9, 2025 at 5:51 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > > > >
> > > > > > On Wed, Jul 09, 2025 at 08:44:06AM +0200, Rafael J. Wysocki wrote:
> > > > > > > On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > > > > > >
> > > > > > > > The existing opportunistic helpers like pm_runtime_get_if_active() and
> > > > > > > > pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> > > > > > > > fail if the device is in a transient state like RPM_SUSPENDING, which
> > > > > > > > can lead to drivers making incorrect assumptions about the dev's state.
> > > > > > > >
> > > > > > > > These helpers don't suffice for cases where one wishes to elide HW clean
> > > > > > > > up like queue flushes or TLB invalidations if the device is powered off.
> > > > > > > > It is wasteful to wake up the device in cases where the resume callback
> > > > > > > > resets the device well OR if the HW resets in a clean state. Thus, if a
> > > > > > >ppy to hear your feedback or any alternative ideas you might have > device is powered off, it is preferred to elide any clean-up HW ops like
> > > > > > > > queue flushes / TLB invalidations when the device will resume afresh.
> > > > > > > >
> > > > > > > > Consider the following sequence of operations:
> > > > > > > >
> > > > > > > > 1. The device is in `RPM_SUSPENDING` state
> > > > > > > > 2. The driver calls pm_runtime_get_if_active/in_use
> > > > > > > > 3. Depending on these API, the driver elides a HW clean-up op like:
> > > > > > > >
> > > > > > > > if (pm_runtime_get_if_in_active(dev))
> > > > > > > >         invalidate_tlb(dev);
> > > > > > > > else
> > > > > > > >         // Skip flush, assuming device will fully suspend
> > > > > > > >
> > > > > > > > 4. Now, another rpm dev-linked device wakes up, causing the device's
> > > > > > > >    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
> > > > > > > >    any rpm callbacks, preventing them from resetting the dev correctly.
> > > > > > >
> > > > > > > This never happens.
> > > > > > >
> > > > > > > If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.
> > > > > >
> > > > > > Ack, I believe we're talking about the check[1] in rpm_resume here:
> > > > >
> > > > > No, I'm talking about the fact that the status is changed to
> > > > > RPM_SUSPENDING right before invoking the callback.
> > > > >
> > > >
> > > > Right, but if the callback returns -EAGAIN or something, in
> > > > rpm_suspend() we go ahead and set the status to RPM_ACTIVE again
> > > > on jumping to the `fail` label[1].
> > > >
> > > > > >         if (dev->power.runtime_status == RPM_RESUMING ||
> > > > > >             dev->power.runtime_status == RPM_SUSPENDING) {
> > > > > >
> > > > > >             [...]
> > > > > >             /* Wait for the operation carried out in parallel with us. */
> > > > > >             [...]
> > > > > >             finish_wait(&dev->power.wait_queue, &wait);
> > > > > >             goto repeat;
> > > > > >
> > > > > > However, my worry is about the following situation:
> > > > > >
> > > > > >   rpm_suspend                      rpm_resume
> > > > > >
> > > > > >   rpm_status = RPM_SUSPENDING
> > > > > >                                    if (RPM_SUSPENDING)
> > > > > >                                    prepare_to wait(...)
> > > > > >
> > > > > >   // suspend fails
> > > > > >   retval = rpm_callback(...);
> > > > >
> > > > > You're talking about the callback returning an error and your
> > > > > changelog is talking about a different situation.
> > > >
> > > > Apologies, I should've been more clear. What I meant was for a situation
> > > > where due to *any* reason (except disabling runtime PM) we bounce back
> > > > from RPM_SUSPENDING to RPM_ACTIVE *without* invoking the resume callback
> > >
> > > This only can happen if the suspend callback returns an error, but I'm
> > > not sure why and how this matters.
> > >
> > > pm_runtime_get_if_active() does not guarantee anything in the case
> > > when 0 is returned anyway.
> > >
> > > > > >   goto fail;
> > > > > >   [...]
> > > > > > fail:
> > > > > >   rpm_status = RPM_ACTIVE;
> > > > > >                                    finish_wait(...);
> > > > > >                                    goto repeat;
> > > > > >                                    repeat:
> > > > > >                                    if (rpm_status == RPM_ACTIVE) {
> > > > > >                                         retval = 1;
> > > > > >                                         goto out;
> > > > > >                                    }
> > > > > >                                 out:
> > > > > >                                    [ ... ] // put_parent if one
> > > > > >
> > > > > >                                    // we return without resume cb();
> > > > > >                                    return retval;
> > > > > >
> > > > > > Now, if we rely on APIs based on pm_runtime_get_conditional(), which
> > > > > > might return 0 if (dev->power.runtime_status != RPM_ACTIVE), we might
> > > > > > end up eliding some TLB invalidations in the period where the status was
> > > > > > RPM_SUSPENDING till it's back to RPM_ACTIVE due to a failed suspend.
> > > > >
> > > > > It actually depends on the reason why the callback returned an error.
> > > > >
> > > >
> > > > I'm not sure if I follow.. as per rpm_suspend[1] I see that upon failing
> > > > (which also happens on getting a non-zero retval from the suspend_cb) we
> > > > set the runtime status to RPM_ACTIVE.
> > >
> > > Yes, it just goes back to the status from before the failure because
> > > when the error code is -EAGAIN or -EBUSY, the status should be still
> > > RPM_ACTIVE and otherwise runtime_error is set and the device likely
> > > requires some help.
> > >
> > > > > > This can cause some severe address-aliasing/ghost hit issues since the
> > > > > > TLB still has some stale entries..
> > > > >
> > > > > I'm not quite sure what you mean.
> > > > >
> > > >
> > > > I meant if the TLB invalidations were elided (i.e. TLB still had stale
> > > > entries) in hope that the IOMMU would be suspended, but due to some
> > > > reason the suspend failed and the status gets back to RPM_ACTIVE while
> > > > exiting the rpm_suspend call and a client wakes up and performs a DMA
> > > > (IOMMU transaction), the TLB entry might hit for an address which was
> > > > supposed to be invalidated by now.
> > >
> > > So as I said this is just one reason why you cannot rely on runtime PM
> > > to guarantee that the device will remain suspended, or in fact whether
> > > or not it will be suspended at all.
> > >
> > > > > > Instead, we'd like to ensure that the status IS suspended
> > > > >
> > > > > But you can't.
> > > > >
> > > > > That's the difference between RPM_ACTIVE and RPM_SUSPENDED.  If you
> > > > > bump up the runtime PM usage counter while the device is RPM_ACTIVE
> > > > > (and while holding its power.spinlock), it will not suspend.  There's
> > > > > no way to prevent a suspended device from resuming (other than
> > > > > disabling runtime PM for it).
> > > > >
> > > >
> > > > The problem is avoiding bumping up the count while the device is in a
> > > > transient state like RPM_SUSPENDING and then bouncing back to the
> > > > RPM_ACTIVE state without invoking RPM resume callback, which seems to be
> > > > possible as per the rpm_suspend implementation [1].
> > >
> > > I'm not sure what you mean here.
> > >
> > > The usage counter is bumped up by pm_runtime_get_if_active() only if
> > > the device is RPM_ACTIVE in which case it will guarantee that the
> > > runtime PM status will not change.  There is no way to provide a
> > > similar guarantee on the RPM_SUSPENDED side short of disabling runtime
> > > PM.
> > >
> > > > > > to make a decision to elide the TLBI or not.
> > > > > >
> > > > > > What are your thoughts on this? I'm open to approaching it differently.
> > > > >
> > > > > IMV this is all misguided, sorry.
> > > > >
> > > >
> > > > Ack. But I'd want to understand better here. Are you saying that it
> > > > isn't at all possible that RPM_SUSPENDING bounces back to RPM_ACTIVE
> > > > without invoking the rpm_resume ever? Because per the following snippet,
> > > > it seems likely:
> > > >
> > > >         __update_runtime_status(dev, RPM_SUSPENDING);
> > > >
> > > >         callback = RPM_GET_CALLBACK(dev, runtime_suspend);
> > > >
> > > >         dev_pm_enable_wake_irq_check(dev, true);
> > > >         retval = rpm_callback(callback, dev);
> > > >         if (retval)
> > > >                 goto fail;
> > > >
> > > >         [...]
> > > >
> > > > fail:
> > > >         dev_pm_disable_wake_irq_check(dev, true);
> > > >         __update_runtime_status(dev, RPM_ACTIVE);
> > > >         dev->power.deferred_resume = false;
> > > >         wake_up_all(&dev->power.wait_queue);
> > > >         [...]
> > > >
> > > > Sorry if I'm missing something here, but it does look like we can bounce
> > > > back to RPM_ACTIVE without invoking the resume callback, which is a
> > > > behaviour we'd like to avoid.
> > >
> > > So let me repeat: This only happens if the suspend callback returns an
> > > error and in that case it is not clear whether or not the resume
> > > callback needs to be invoked (either it doesn't or the device can be
> > > assumed to be in a state in which invoking that callback will not help
> > > either way).
> > >
> >
> > Alright. My goal here is to ensure that TLB invalidations are elided
> > only when the device (IOMMU) is suspended. For that I believed we
> > could do either of the following:
> >
> > 1. Ensure RPM_SUSPENDING -> RPM_ACTIVE transition *always* invokes the
> > resume callback (this is because the resume cb flushes the entire TLB).
> 
> That would be incorrect at least in some cases.
> 

Right.. those incorrect cases is what we'd like to avoid..

> > 2. Ensure that we are able to reliably *get* a PM ref if the device is
> > NOT suspended, i.e. even if it was in a transient state.
> 
> A transient state means a point of no return, so I don't think this
> can work the way you want.
> 
> > But I guess that maybe the new API (option 2) isn't the right way to go.
> >
> > I'm assuming you mean that if we design the suspend callback in a way
> > that it doesn't return error, we can reliably ensure that the resume
> > callback will be called before transitioning to the RPM_ACTIVE state?
> 
> Yes overall, but note that driver callbacks are usually invoked by bus
> type or PM domain callbacks that can fail.

Ack. Can you confirm that these failures happen *before* invoking the
rpm_suspend callback.. because after rpm_suspend callback is invoked I
guess the rpm_resume will be invoked?

I could potentially have a design where the suspend_callback flushes 
the TLB iff we're about to return an error. But if you say that a 
failure happening before the rpm_suspend can make a transition from a
NON-ACTIVE to ACTIVE state, then I guess we'll need something from the
PM framework..

Looking at the code, I don't see the rpm status changing from ACTIVE
to rpm status != RPM_ACTIVE before the rpm_suspend functions call.
Thus, I suppose, we are *always* RPM_ACTIVE till we enter rpm_suspend
and only bounce back from RPM_SUSPENDING to RPM_ACTIVE (without invoking
resume callback), when the suspend callback returns ERROR. Is that the
right understanding?

Thanks,
Praan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-11 10:20                 ` Pranjal Shrivastava
@ 2025-07-15 23:52                   ` Daniel Mentz
  2025-07-16 12:53                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Mentz @ 2025-07-15 23:52 UTC (permalink / raw)
  To: Pranjal Shrivastava, Rafael J. Wysocki
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Nicolin Chen, Mostafa Saleh, iommu

I think the underlying problem is that we need certainty as to whether
SMMU is enabled or disabled, and if we find it in either
RPM_SUSPENDING or RPM_RESUMING state, then we don't have this
certainty. Creating certainty would require waiting for an in-progress
rpm suspend or resume operation to finish which we can't do, because
iommu_map and iommu_unmap can be called from atomic context where we
wouldn't be able to wait.

Before discussing suitable approaches, I think we need to agree on the
requirements. After our discussion with Will, I believe the
requirements are as follows:

We can elide i.e. skip TLB invalidations only when SMMU is either
disabled or powered off.

At any time and even in the absence of any client transactions, SMMU
can perform speculative translation table walks and cache the results.
In particular, Will provided the following example (I hope I'm
depicting this correctly here):

In __arm_lpae_unmap() where it says "/* Also flush any partial walks
*/", we skip over io_pgtable_tlb_flush_walk() and subsequently free
the page table in __arm_lpae_free_pgtable(). The memory location that
previously contained the page table is then re-allocated for some
different and unrelated purpose while the SMMU hardware still stores a
pointer to this memory location in its walk cache due to the driver
previously skipping the TLBI command. SMMU might then perform a
speculative translation table walk and misinterpret the new data as a
table descriptor which includes a pointer to a next level translation
table (i.e. use-after-free). SMMU could then follow that garbage
pointer and read from an unpredictable location. If this unpredictable
location turns out to be read sensitive like an interrupt acknowledge
register, we might potentially end up with very subtle types of
misbehavior.

Due to these requirements, I believe the following approach will *not* work:
Elide TLBI commands when device is in RPM_RESUMING state: While in
this state, SMMU could already be enabled and perform speculative
translation table walks.
Elide TLBI commands when device is in RPM_SUSPENDING state: While in
this state, SMMU could still be enabled, perform speculative
translation table walks and run in to the use-after-free situation
described earlier

I propose a custom synchronization mechanism where we use a spinlock
protected variable that determines whether SMMU is enabled or
disabled. In the suspend/resume callbacks, we disable or enable SMMU
and update this variable while holding the spinlock. In the map/unmap
path we briefly grab the spinlock while examining this new variable.
Only if we find that SMMU is disabled, we elide TLBI commands.
My concern with this, though, is that the additional spinlock might
spoil the performance gains of the lockless command queue insertion
algorithm.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-15 23:52                   ` Daniel Mentz
@ 2025-07-16 12:53                     ` Rafael J. Wysocki
  2025-07-21 12:44                       ` Will Deacon
  0 siblings, 1 reply; 28+ messages in thread
From: Rafael J. Wysocki @ 2025-07-16 12:53 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: Pranjal Shrivastava, Rafael J. Wysocki, Joerg Roedel, Will Deacon,
	Robin Murphy, Jason Gunthorpe, Nicolin Chen, Mostafa Saleh, iommu

On Wed, Jul 16, 2025 at 1:52 AM Daniel Mentz <danielmentz@google.com> wrote:
>
> I think the underlying problem is that we need certainty as to whether
> SMMU is enabled or disabled, and if we find it in either
> RPM_SUSPENDING or RPM_RESUMING state, then we don't have this
> certainty.

Right.

Holding power.lock at that point can only defer the subsequent runtime
PM status change, but the actual state of the device is not actually
known.

> Creating certainty would require waiting for an in-progress
> rpm suspend or resume operation to finish which we can't do, because
> iommu_map and iommu_unmap can be called from atomic context where we
> wouldn't be able to wait.
>
> Before discussing suitable approaches, I think we need to agree on the
> requirements. After our discussion with Will, I believe the
> requirements are as follows:
>
> We can elide i.e. skip TLB invalidations only when SMMU is either
> disabled or powered off.
>
> At any time and even in the absence of any client transactions, SMMU
> can perform speculative translation table walks and cache the results.
> In particular, Will provided the following example (I hope I'm
> depicting this correctly here):
>
> In __arm_lpae_unmap() where it says "/* Also flush any partial walks
> */", we skip over io_pgtable_tlb_flush_walk() and subsequently free
> the page table in __arm_lpae_free_pgtable(). The memory location that
> previously contained the page table is then re-allocated for some
> different and unrelated purpose while the SMMU hardware still stores a
> pointer to this memory location in its walk cache due to the driver
> previously skipping the TLBI command. SMMU might then perform a
> speculative translation table walk and misinterpret the new data as a
> table descriptor which includes a pointer to a next level translation
> table (i.e. use-after-free). SMMU could then follow that garbage
> pointer and read from an unpredictable location. If this unpredictable
> location turns out to be read sensitive like an interrupt acknowledge
> register, we might potentially end up with very subtle types of
> misbehavior.
>
> Due to these requirements, I believe the following approach will *not* work:
> Elide TLBI commands when device is in RPM_RESUMING state: While in
> this state, SMMU could already be enabled and perform speculative
> translation table walks.
> Elide TLBI commands when device is in RPM_SUSPENDING state: While in
> this state, SMMU could still be enabled, perform speculative
> translation table walks and run in to the use-after-free situation
> described earlier
>
> I propose a custom synchronization mechanism where we use a spinlock
> protected variable that determines whether SMMU is enabled or
> disabled. In the suspend/resume callbacks, we disable or enable SMMU
> and update this variable while holding the spinlock. In the map/unmap
> path we briefly grab the spinlock while examining this new variable.
> Only if we find that SMMU is disabled, we elide TLBI commands.

That sounds reasonable to me in principle.

> My concern with this, though, is that the additional spinlock might
> spoil the performance gains of the lockless command queue insertion
> algorithm.

It might, but it is hard to say upfront.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
  2025-07-16 12:53                     ` Rafael J. Wysocki
@ 2025-07-21 12:44                       ` Will Deacon
  0 siblings, 0 replies; 28+ messages in thread
From: Will Deacon @ 2025-07-21 12:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Daniel Mentz, Pranjal Shrivastava, Joerg Roedel, Robin Murphy,
	Jason Gunthorpe, Nicolin Chen, Mostafa Saleh, iommu

On Wed, Jul 16, 2025 at 02:53:13PM +0200, Rafael J. Wysocki wrote:
> On Wed, Jul 16, 2025 at 1:52 AM Daniel Mentz <danielmentz@google.com> wrote:
> > I propose a custom synchronization mechanism where we use a spinlock
> > protected variable that determines whether SMMU is enabled or
> > disabled. In the suspend/resume callbacks, we disable or enable SMMU
> > and update this variable while holding the spinlock. In the map/unmap
> > path we briefly grab the spinlock while examining this new variable.
> > Only if we find that SMMU is disabled, we elide TLBI commands.
> 
> That sounds reasonable to me in principle.
> 
> > My concern with this, though, is that the additional spinlock might
> > spoil the performance gains of the lockless command queue insertion
> > algorithm.
> 
> It might, but it is hard to say upfront.

I'm definitely not going to take a solution that unconditionally
serialises unmap().

Instead, I think that we can probably check whether or not the SMMU
is suspended in a few places:

  1. After zapping a PTE but before issuing TLBI

  2. When trying to allocate space in the cmdq

  3. Right before moving PROD in the cmdq

Only (1) is necessary for correctness, but (3) narrows the race window
and (2) is then required to avoid getting stuck trying to queue commands
while the SMMU is powered down.

There's still a small window for a race if the SMMU is powered down
immediately after we checked the flag in (3), but in that case we'll end
up serialising to power it back up. Adding additional states might
mitigate that somewhat (we could poll on the suspend operation if we
know it's in progress) but I'm not sure that it's worth it.

So I would start with (1) and the state tracking machinery, as that's
fiddly enough with the memory ordering and suspend/resume sequences.
Then (2) and (3) could come as an optimisation, after which we could
assess whether or not we need to go further.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 6/8] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops
  2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (4 preceding siblings ...)
  2025-06-16 20:31 ` [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended() Pranjal Shrivastava
@ 2025-06-16 20:31 ` Pranjal Shrivastava
  2025-06-16 20:31 ` [RFC PATCH v3 7/8] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
  2025-06-16 20:31 ` [RFC PATCH v3 8/8] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
  7 siblings, 0 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Implement pm_runtime and system sleep ops for arm-smmu-v3. The smmu is
disabled as part of the suspend callbacks after ensuring the completion
of all pending commands and is configured to abort any transactions that
happen after disabling the smmu. The smmu shall be reinitialized in the
resume callback by invoking the `arm_smmu_device_reset` helper.

The MSIs are freed as part of the suspend and are re-allocated during
the resume operation.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 93 +++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 +
 2 files changed, 95 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 4699d3294d3e..5736f08e3af7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -26,6 +26,7 @@
 #include <linux/pci.h>
 #include <linux/pci-ats.h>
 #include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
 #include <linux/string_choices.h>
 #include <kunit/visibility.h>
 #include <uapi/linux/iommufd.h>
@@ -108,6 +109,41 @@ static const char * const event_class_str[] = {
 
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
 
+/* Runtime PM helpers */
+static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
+{
+	int ret;
+
+	if (pm_runtime_enabled(smmu->dev)) {
+		ret = pm_runtime_resume_and_get(smmu->dev);
+		if (ret < 0) {
+			dev_err(smmu->dev, "Failed to resume device: %d\n", ret);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static int arm_smmu_rpm_get_if_not_suspended(struct arm_smmu_device *smmu)
+{
+	if (pm_runtime_enabled(smmu->dev))
+		return pm_runtime_get_if_not_suspended(smmu->dev);
+
+	return 0;
+}
+
+static void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
+{
+	int ret;
+
+	if (pm_runtime_enabled(smmu->dev)) {
+		ret = pm_runtime_put_autosuspend(smmu->dev);
+		if (ret < 0)
+			dev_err(smmu->dev, "Failed to suspend device: %d\n", ret);
+	}
+}
+
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
 	int i = 0;
@@ -4932,6 +4968,62 @@ static void arm_smmu_device_shutdown(struct platform_device *pdev)
 	arm_smmu_device_disable(smmu);
 }
 
+static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
+{
+	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+	int ret;
+
+	/*
+	 * Since suspend is invoked when all clients have been suspended,
+	 * we don't expect more cmds or events to be added to the queues.
+	 * Wait for all queues to be drained.
+	 */
+	ret = arm_smmu_drain_queues(smmu);
+	if (ret) {
+		dev_err(smmu->dev, "Draining queues timed-out.. retry later\n");
+		return -EAGAIN;
+	}
+
+	/* Disable all queues */
+	arm_smmu_device_disable(smmu);
+
+	/* Abort all transactions to avoid spurious bypass */
+	arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
+
+	dev_dbg(dev, "Suspending smmu\n");
+	return 0;
+}
+
+static int __maybe_unused arm_smmu_runtime_resume(struct device *dev)
+{
+	int ret;
+	struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+
+	dev_dbg(dev, "Resuming device\n");
+
+	/* Re-configure MSIs */
+	arm_smmu_resume_msis(smmu);
+
+	/*
+	 * The reset will re-initialize all the base addresses, queues,
+	 * prod and cons maintained within struct arm_smmu_device as well as
+	 * re-enable the interrupts.
+	 */
+	ret = arm_smmu_device_reset(smmu);
+
+	if (ret)
+		dev_err(dev, "Failed to reset during resume operation: %d\n", ret);
+
+	return ret;
+}
+
+static const struct dev_pm_ops arm_smmu_pm_ops = {
+	SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
+				pm_runtime_force_resume)
+	SET_RUNTIME_PM_OPS(arm_smmu_runtime_suspend,
+			   arm_smmu_runtime_resume, NULL)
+};
+
 static const struct of_device_id arm_smmu_of_match[] = {
 	{ .compatible = "arm,smmu-v3", },
 	{ },
@@ -4948,6 +5040,7 @@ static struct platform_driver arm_smmu_driver = {
 	.driver	= {
 		.name			= "arm-smmu-v3",
 		.of_match_table		= arm_smmu_of_match,
+		.pm                     = &arm_smmu_pm_ops,
 		.suppress_bind_attrs	= true,
 	},
 	.probe	= arm_smmu_device_probe,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index a950da2e7ed9..d4ac6adc6a30 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -506,6 +506,8 @@ static inline unsigned int arm_smmu_cdtab_l2_idx(unsigned int ssid)
 #define MSI_IOVA_BASE			0x8000000
 #define MSI_IOVA_LENGTH			0x100000
 
+#define RPM_AUTOSUSPEND_DELAY_MS	15
+
 enum pri_resp {
 	PRI_RESP_DENY = 0,
 	PRI_RESP_FAIL = 1,
-- 
2.50.0.rc2.692.g299adb8693-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 7/8] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks
  2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (5 preceding siblings ...)
  2025-06-16 20:31 ` [RFC PATCH v3 6/8] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
@ 2025-06-16 20:31 ` Pranjal Shrivastava
  2025-06-16 20:31 ` [RFC PATCH v3 8/8] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
  7 siblings, 0 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Enable PM runtime for SMMUs having a power-domain during smmu probe.
Add a devlink between the clients and SMMU device. The absence of a
power domain effectively disables runtime power management.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5736f08e3af7..80928e5a2d60 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3568,6 +3568,9 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 		pci_prepare_ats(to_pci_dev(dev), stu);
 	}
 
+	device_link_add(dev, smmu->dev,
+			DL_FLAG_PM_RUNTIME | DL_FLAG_AUTOREMOVE_SUPPLIER);
+
 	return &smmu->iommu;
 
 err_free_master:
@@ -4927,11 +4930,18 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 	if (ret)
 		goto err_disable;
 
+	if (dev->pm_domain) {
+		pm_runtime_set_active(dev);
+		pm_runtime_use_autosuspend(dev);
+		pm_runtime_set_autosuspend_delay(dev, RPM_AUTOSUSPEND_DELAY_MS);
+		pm_runtime_enable(dev);
+	}
+
 	/* And we're up. Go go go! */
 	ret = iommu_device_sysfs_add(&smmu->iommu, dev, NULL,
 				     "smmu3.%pa", &ioaddr);
 	if (ret)
-		goto err_disable;
+		goto err_rpm_disable;
 
 	ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
 	if (ret) {
@@ -4943,6 +4953,8 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 
 err_free_sysfs:
 	iommu_device_sysfs_remove(&smmu->iommu);
+err_rpm_disable:
+	pm_runtime_disable(dev);
 err_disable:
 	arm_smmu_device_disable(smmu);
 err_free_iopf:
-- 
2.50.0.rc2.692.g299adb8693-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 8/8] iommu/arm-smmu-v3: Invoke pm_runtime before hw access
  2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
                   ` (6 preceding siblings ...)
  2025-06-16 20:31 ` [RFC PATCH v3 7/8] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
@ 2025-06-16 20:31 ` Pranjal Shrivastava
  7 siblings, 0 replies; 28+ messages in thread
From: Pranjal Shrivastava @ 2025-06-16 20:31 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
	Rafael J. Wysocki
  Cc: Nicolin Chen, Mostafa Saleh, Daniel Mentz, iommu,
	Pranjal Shrivastava

Invoke the pm_runtime helpers at all places before accessing the hw.
The idea is to invoke runtime_pm helpers at common points which are
used by exposed ops or interrupt handlers. Elide all TLB and CFG
invalidations if the smmu is suspended but not  ATC invalidations.

Signed-off-by: Pranjal Shrivastava <praan@google.com>
---
 .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c     |  11 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 133 +++++++++++++++---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   3 +
 3 files changed, 130 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index e4fd8d522af8..817d384ae8e4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -13,10 +13,17 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
 	struct iommu_hw_info_arm_smmuv3 *info;
 	u32 __iomem *base_idr;
 	unsigned int i;
+	int ret;
+
+	ret = arm_smmu_rpm_get(master->smmu);
+	if (ret < 0)
+		return ERR_PTR(-EIO);
 
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
-	if (!info)
+	if (!info) {
+		arm_smmu_rpm_put(master->smmu);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	base_idr = master->smmu->base + ARM_SMMU_IDR0;
 	for (i = 0; i <= 5; i++)
@@ -27,6 +34,7 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
 	*length = sizeof(*info);
 	*type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
 
+	arm_smmu_rpm_put(master->smmu);
 	return info;
 }
 
@@ -139,6 +147,7 @@ static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
 		.old_domain = iommu_get_domain_for_dev(dev),
 		.ssid = IOMMU_NO_PASID,
 	};
+	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_ste ste;
 	int ret;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 80928e5a2d60..87ee2e22aa33 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -110,7 +110,7 @@ static const char * const event_class_str[] = {
 static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master);
 
 /* Runtime PM helpers */
-static int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
+int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
 {
 	int ret;
 
@@ -133,7 +133,7 @@ static int arm_smmu_rpm_get_if_not_suspended(struct arm_smmu_device *smmu)
 	return 0;
 }
 
-static void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
+void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
 {
 	int ret;
 
@@ -1066,7 +1066,9 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
 {
 	struct arm_smmu_cmdq_ent cmd = {0};
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct arm_smmu_device *smmu = master->smmu;
 	int sid = master->streams[0].id;
+	int ret;
 
 	if (WARN_ON(!master->stall_enabled))
 		return;
@@ -1086,6 +1088,10 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
 		break;
 	}
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return;
+
 	arm_smmu_cmdq_issue_cmd(master->smmu, &cmd);
 	/*
 	 * Don't send a SYNC, it doesn't do anything for RESUME or PRI_RESP.
@@ -1093,11 +1099,16 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
 	 * terminated... at some point in the future. PRI_RESP is fire and
 	 * forget.
 	 */
+	arm_smmu_rpm_put(smmu);
 }
 
 /* Context descriptor manipulation functions */
 void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 {
+	/* No need to invalidate TLBs if smmu is suspended */
+	if (!arm_smmu_rpm_get_if_not_suspended(smmu))
+		return;
+
 	struct arm_smmu_cmdq_ent cmd = {
 		.opcode	= smmu->features & ARM_SMMU_FEAT_E2H ?
 			CMDQ_OP_TLBI_EL2_ASID : CMDQ_OP_TLBI_NH_ASID,
@@ -1105,6 +1116,7 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 	};
 
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
+	arm_smmu_rpm_put(smmu);
 }
 
 /*
@@ -1311,6 +1323,10 @@ static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 		},
 	};
 
+	/* No need to invalidate if smmu is suspended */
+	if (!arm_smmu_rpm_get_if_not_suspended(smmu))
+		return;
+
 	arm_smmu_cmdq_batch_init(smmu, &cmds, &cmd);
 	for (i = 0; i < master->num_streams; i++) {
 		cmd.cfgi.sid = master->streams[i].id;
@@ -1318,6 +1334,7 @@ static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 	}
 
 	arm_smmu_cmdq_batch_submit(smmu, &cmds);
+	arm_smmu_rpm_put(smmu);
 }
 
 static void arm_smmu_write_cd_l1_desc(struct arm_smmu_cdtab_l1 *dst,
@@ -1607,6 +1624,10 @@ struct arm_smmu_ste_writer {
 
 static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer *writer)
 {
+	/* No need to invalidate if smmu is suspended */
+	if (!arm_smmu_rpm_get_if_not_suspended(writer->master->smmu))
+		return;
+
 	struct arm_smmu_ste_writer *ste_writer =
 		container_of(writer, struct arm_smmu_ste_writer, writer);
 	struct arm_smmu_cmdq_ent cmd = {
@@ -1618,6 +1639,7 @@ static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer *writer)
 	};
 
 	arm_smmu_cmdq_issue_cmd_with_sync(writer->master->smmu, &cmd);
+	arm_smmu_rpm_put(writer->master->smmu);
 }
 
 static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
@@ -1642,6 +1664,10 @@ static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
 
 	/* It's likely that we'll want to use the new STE soon */
 	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
+		/* No need to prefech if smmu is suspended */
+		if (!arm_smmu_rpm_get_if_not_suspended(smmu))
+			return;
+
 		struct arm_smmu_cmdq_ent
 			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
 					 .prefetch = {
@@ -1649,6 +1675,7 @@ static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
 					 } };
 
 		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+		arm_smmu_rpm_put(smmu);
 	}
 }
 
@@ -2021,6 +2048,7 @@ static void arm_smmu_dump_event(struct arm_smmu_device *smmu, u64 *raw,
 
 static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 {
+	int ret;
 	u64 evt[EVTQ_ENT_DWORDS];
 	struct arm_smmu_event event = {0};
 	struct arm_smmu_device *smmu = dev;
@@ -2029,6 +2057,10 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return IRQ_NONE;
+
 	do {
 		while (!queue_remove_raw(q, evt)) {
 			arm_smmu_decode_event(smmu, evt, &event);
@@ -2049,6 +2081,7 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
 
 	/* Sync our overflow flag, as we believe we're up to speed */
 	queue_sync_cons_ovf(q);
+	arm_smmu_rpm_put(smmu);
 	return IRQ_HANDLED;
 }
 
@@ -2096,6 +2129,11 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
 	struct arm_smmu_queue *q = &smmu->priq.q;
 	struct arm_smmu_ll_queue *llq = &q->llq;
 	u64 evt[PRIQ_ENT_DWORDS];
+	int ret;
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return IRQ_NONE;
 
 	do {
 		while (!queue_remove_raw(q, evt))
@@ -2107,6 +2145,7 @@ static irqreturn_t arm_smmu_priq_thread(int irq, void *dev)
 
 	/* Sync our overflow flag, as we believe we're up to speed */
 	queue_sync_cons_ovf(q);
+	arm_smmu_rpm_put(smmu);
 	return IRQ_HANDLED;
 }
 
@@ -2116,13 +2155,24 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 {
 	u32 gerror, gerrorn, active;
 	struct arm_smmu_device *smmu = dev;
+	int ret;
+
+	if (pm_runtime_enabled(smmu->dev)) {
+		ret = pm_runtime_get_if_active(smmu->dev);
+		if (ret == 0) {
+			dev_err(smmu->dev, "Ignoring gerror interrupt because device isn't rpm active\n");
+			return IRQ_NONE;
+		}
+	}
 
 	gerror = readl_relaxed(smmu->base + ARM_SMMU_GERROR);
 	gerrorn = readl_relaxed(smmu->base + ARM_SMMU_GERRORN);
 
 	active = gerror ^ gerrorn;
-	if (!(active & GERROR_ERR_MASK))
+	if (!(active & GERROR_ERR_MASK)) {
+		arm_smmu_rpm_put(smmu);
 		return IRQ_NONE; /* No errors pending */
+	}
 
 	dev_warn(smmu->dev,
 		 "unexpected global error reported (0x%08x), this could be serious\n",
@@ -2155,6 +2205,7 @@ static irqreturn_t arm_smmu_gerror_handler(int irq, void *dev)
 		arm_smmu_cmdq_skip_err(smmu);
 
 	writel(gerror, smmu->base + ARM_SMMU_GERRORN);
+	arm_smmu_rpm_put(smmu);
 	return IRQ_HANDLED;
 }
 
@@ -2245,26 +2296,33 @@ arm_smmu_atc_inv_to_cmd(int ssid, unsigned long iova, size_t size,
 static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
 				   ioasid_t ssid)
 {
-	int i;
+	int i, ret;
 	struct arm_smmu_cmdq_ent cmd;
 	struct arm_smmu_cmdq_batch cmds;
 
 	arm_smmu_atc_inv_to_cmd(ssid, 0, 0, &cmd);
 
+	/* ATC invalidations shouldn't be elided */
+	ret = arm_smmu_rpm_get(master->smmu);
+	if (ret < 0)
+		return ret;
+
 	arm_smmu_cmdq_batch_init(master->smmu, &cmds, &cmd);
 	for (i = 0; i < master->num_streams; i++) {
 		cmd.atc.sid = master->streams[i].id;
 		arm_smmu_cmdq_batch_add(master->smmu, &cmds, &cmd);
 	}
 
-	return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+	ret = arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
+	arm_smmu_rpm_put(master->smmu);
+	return ret;
 }
 
 int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
 			    unsigned long iova, size_t size)
 {
 	struct arm_smmu_master_domain *master_domain;
-	int i;
+	int i, ret;
 	unsigned long flags;
 	struct arm_smmu_cmdq_ent cmd = {
 		.opcode = CMDQ_OP_ATC_INV,
@@ -2291,6 +2349,11 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
 	if (!atomic_read(&smmu_domain->nr_ats_masters))
 		return 0;
 
+	/* ATC invalidations shouldn't be elided */
+	ret = arm_smmu_rpm_get(smmu_domain->smmu);
+	if (ret < 0)
+		return ret;
+
 	arm_smmu_cmdq_batch_init(smmu_domain->smmu, &cmds, &cmd);
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
@@ -2319,7 +2382,9 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
 	}
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
-	return arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
+	ret = arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
+	arm_smmu_rpm_put(smmu_domain->smmu);
+	return ret;
 }
 
 /* IO_PGTABLE API */
@@ -2341,8 +2406,15 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 	} else {
 		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
 		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
+
+		/* No need to invalidate TLBs if smmu is suspended */
+		if (!arm_smmu_rpm_get_if_not_suspended(smmu))
+			goto atc_inv;
+
 		arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
+		arm_smmu_rpm_put(smmu);
 	}
+atc_inv:
 	arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
 }
 
@@ -2359,6 +2431,10 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
 	if (!size)
 		return;
 
+	/* No need to invalidate TLBs if smmu is suspended */
+	if (!arm_smmu_rpm_get_if_not_suspended(smmu))
+		return;
+
 	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
 		/* Get the leaf page size */
 		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
@@ -2415,18 +2491,24 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
 		iova += inv_range;
 	}
 	arm_smmu_cmdq_batch_submit(smmu, &cmds);
+	arm_smmu_rpm_put(smmu);
 }
 
 static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 					  size_t granule, bool leaf,
 					  struct arm_smmu_domain *smmu_domain)
 {
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_cmdq_ent cmd = {
 		.tlbi = {
 			.leaf	= leaf,
 		},
 	};
 
+	/* No need to invalidate TLBs if smmu is suspended */
+	if (!arm_smmu_rpm_get_if_not_suspended(smmu))
+		goto atc_inv;
+
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
 		cmd.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
 				  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
@@ -2435,6 +2517,7 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 		cmd.opcode	= CMDQ_OP_TLBI_S2_IPA;
 		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
 	}
+
 	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
 
 	if (smmu_domain->nest_parent) {
@@ -2446,6 +2529,8 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 		arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd);
 	}
 
+	arm_smmu_rpm_put(smmu);
+atc_inv:
 	/*
 	 * Unfortunately, this can't be leaf-only since we may have
 	 * zapped an entire table.
@@ -3206,13 +3291,14 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
 		    sid_domain->type == IOMMU_DOMAIN_BLOCKED)
 			sid_domain->ops->attach_dev(sid_domain, dev);
 	}
+
 	return 0;
 }
 
-static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
-				    struct device *dev,
-				    struct arm_smmu_ste *ste,
-				    unsigned int s1dss)
+static int arm_smmu_attach_dev_ste(struct iommu_domain *domain,
+				   struct device *dev,
+				   struct arm_smmu_ste *ste,
+				   unsigned int s1dss)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
 	struct arm_smmu_attach_state state = {
@@ -3255,6 +3341,7 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
 	 * descriptor from arm_smmu_share_asid().
 	 */
 	arm_smmu_clear_cd(master, IOMMU_NO_PASID);
+	return 0;
 }
 
 static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
@@ -3265,8 +3352,8 @@ static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
 
 	arm_smmu_master_clear_vmaster(master);
 	arm_smmu_make_bypass_ste(master->smmu, &ste);
-	arm_smmu_attach_dev_ste(domain, dev, &ste, STRTAB_STE_1_S1DSS_BYPASS);
-	return 0;
+	return arm_smmu_attach_dev_ste(domain, dev, &ste,
+				       STRTAB_STE_1_S1DSS_BYPASS);
 }
 
 static const struct iommu_domain_ops arm_smmu_identity_ops = {
@@ -3286,9 +3373,8 @@ static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
 
 	arm_smmu_master_clear_vmaster(master);
 	arm_smmu_make_abort_ste(&ste);
-	arm_smmu_attach_dev_ste(domain, dev, &ste,
-				STRTAB_STE_1_S1DSS_TERMINATE);
-	return 0;
+	return arm_smmu_attach_dev_ste(domain, dev, &ste,
+				       STRTAB_STE_1_S1DSS_TERMINATE);
 }
 
 static const struct iommu_domain_ops arm_smmu_blocked_ops = {
@@ -4965,10 +5051,19 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 static void arm_smmu_device_remove(struct platform_device *pdev)
 {
 	struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
+	int ret;
 
 	iommu_device_unregister(&smmu->iommu);
 	iommu_device_sysfs_remove(&smmu->iommu);
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret <  0)
+		goto free_iopf;
+
 	arm_smmu_device_disable(smmu);
+	arm_smmu_rpm_put(smmu);
+
+free_iopf:
 	iopf_queue_free(smmu->evtq.iopf);
 	ida_destroy(&smmu->vmid_map);
 }
@@ -4976,8 +5071,14 @@ static void arm_smmu_device_remove(struct platform_device *pdev)
 static void arm_smmu_device_shutdown(struct platform_device *pdev)
 {
 	struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
+	int ret;
+
+	ret = arm_smmu_rpm_get(smmu);
+	if (ret < 0)
+		return;
 
 	arm_smmu_device_disable(smmu);
+	arm_smmu_rpm_put(smmu);
 }
 
 static int __maybe_unused arm_smmu_runtime_suspend(struct device *dev)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index d4ac6adc6a30..283e72924054 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -999,6 +999,9 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 				struct arm_smmu_cmdq *cmdq, u64 *cmds, int n,
 				bool sync);
 
+int arm_smmu_rpm_get(struct arm_smmu_device *smmu);
+void arm_smmu_rpm_put(struct arm_smmu_device *smmu);
+
 #ifdef CONFIG_ARM_SMMU_V3_SVA
 bool arm_smmu_sva_supported(struct arm_smmu_device *smmu);
 bool arm_smmu_master_sva_supported(struct arm_smmu_master *master);
-- 
2.50.0.rc2.692.g299adb8693-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-07-21 12:44 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 1/8] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
2025-07-08 15:15   ` Mostafa Saleh
2025-06-16 20:31 ` [RFC PATCH v3 2/8] iommu/arm-smmu-v3: Add a helper to drain cmd queues Pranjal Shrivastava
2025-07-08 15:32   ` Mostafa Saleh
2025-07-14  9:24     ` Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 3/8] iommu/tegra241-cmdqv: Add a helper to drain VCMDQs Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 4/8] iommu/arm-smmu-v3: Cache and restore MSI config Pranjal Shrivastava
2025-07-08 15:34   ` Mostafa Saleh
2025-07-14  9:01     ` Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended() Pranjal Shrivastava
2025-06-17  9:19   ` kernel test robot
2025-07-08 22:00   ` Nicolin Chen
2025-07-09 15:51     ` Pranjal Shrivastava
2025-07-09  6:44   ` Rafael J. Wysocki
2025-07-09 15:51     ` Pranjal Shrivastava
2025-07-09 16:35       ` Rafael J. Wysocki
2025-07-09 17:06         ` Pranjal Shrivastava
2025-07-09 19:37           ` Rafael J. Wysocki
2025-07-10  8:59             ` Pranjal Shrivastava
2025-07-10 10:29               ` Rafael J. Wysocki
2025-07-11 10:20                 ` Pranjal Shrivastava
2025-07-15 23:52                   ` Daniel Mentz
2025-07-16 12:53                     ` Rafael J. Wysocki
2025-07-21 12:44                       ` Will Deacon
2025-06-16 20:31 ` [RFC PATCH v3 6/8] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 7/8] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 8/8] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.