[PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path


* [PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path
@ 2024-11-28 11:02 Boris Brezillon
  2024-11-28 11:02 ` [PATCH v2 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Boris Brezillon @ 2024-11-28 11:02 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

Hello,

Here's a collection of patches improving robustness to failures in
the device resume/suspend path. These failures are pretty hard to
reproduce (they happen once in a while on a deqp-vk run), so I used a
mechanism to fake them.

Faking a FW boot failure is kinda tricky though, which means the
last patch has only been partially tested:
- the fast reset path is well tested because that's the default on
  a device suspend
- the slow reset has been tested with a hack replacing fast resets
  with slow resets
- the fast -> slow reset fallback has been tested by faking boot
  failures after a fast reset, but these are not real, which means
  we can't really validate whether the MCU recovers fine after a slow
  reset

On the other hand, this implementation doesn't look like it could
do more harm than the current one (the only difference is the
extra GPU soft-reset that happens between the fast and slow FW
boot).

Nothing major changed in v2. Each patch contains a changelog, if
you're interested.

Regards,

Boris

Boris Brezillon (5):
  drm/panthor: Preserve the result returned by panthor_fw_resume()
  drm/panthor: Be robust against runtime PM resume failures in the
    suspend path
  drm/panthor: Ignore devfreq_{suspend,resume}_device() failures
  drm/panthor: Be robust against resume failures
  drm/panthor: Fix the fast-reset logic

 drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++--
 drivers/gpu/drm/panthor/panthor_devfreq.h |  4 +-
 drivers/gpu/drm/panthor/panthor_device.c  | 68 ++++++++++-------------
 drivers/gpu/drm/panthor/panthor_device.h  | 37 ++++++++++++
 drivers/gpu/drm/panthor/panthor_drv.c     |  2 +-
 drivers/gpu/drm/panthor/panthor_fw.c      | 68 +++++++----------------
 drivers/gpu/drm/panthor/panthor_gpu.c     | 14 +++--
 drivers/gpu/drm/panthor/panthor_mmu.c     |  3 +-
 drivers/gpu/drm/panthor/panthor_sched.c   |  4 +-
 9 files changed, 107 insertions(+), 105 deletions(-)

-- 
2.46.2



* [PATCH v2 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume()
  2024-11-28 11:02 [PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
@ 2024-11-28 11:02 ` Boris Brezillon
  2024-11-29 13:11   ` Adrián Larumbe
  2024-11-28 11:02 ` [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Boris Brezillon @ 2024-11-28 11:02 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

WARN() will return true if the condition is true, false otherwise.
If we store the return of drm_WARN_ON() in ret, we lose the actual
error code.

v2:
- Add R-b

Fixes: 5fe909cae118 ("drm/panthor: Add the device logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
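A minimal user-space sketch of the pitfall described in the commit
message above. warn_on(), fake_fw_resume(), resume_old() and
resume_new() are hypothetical stand-ins for illustration only; they are
not the kernel macros or panthor functions, and the -ETIMEDOUT value is
an arbitrary example error.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for drm_WARN_ON(): logs, then returns a boolean. */
static int warn_on(int cond)
{
	if (cond)
		fprintf(stderr, "WARNING: condition hit (%d)\n", cond);
	return !!cond;
}

/* Hypothetical stand-in for a panthor_fw_resume() call that fails. */
static int fake_fw_resume(void)
{
	return -ETIMEDOUT;
}

/* Old pattern: ret receives the boolean from warn_on(), not the errno. */
static int resume_old(void)
{
	int ret = warn_on(fake_fw_resume());

	return ret;
}

/* New pattern: store the errno first, then warn on the stored value. */
static int resume_new(void)
{
	int ret = fake_fw_resume();

	if (warn_on(ret))
		return ret;

	return 0;
}
```

With this model, resume_old() reports 1 to its caller while resume_new()
reports -ETIMEDOUT, which is the difference the patch is after.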
 drivers/gpu/drm/panthor/panthor_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 984615f4ed27..e701e605d013 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -461,8 +461,8 @@ int panthor_device_resume(struct device *dev)
 	    drm_dev_enter(&ptdev->base, &cookie)) {
 		panthor_gpu_resume(ptdev);
 		panthor_mmu_resume(ptdev);
-		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
-		if (!ret) {
+		ret = panthor_fw_resume(ptdev);
+		if (!drm_WARN_ON(&ptdev->base, ret)) {
 			panthor_sched_resume(ptdev);
 		} else {
 			panthor_mmu_suspend(ptdev);
-- 
2.46.2



* [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-28 11:02 [PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
  2024-11-28 11:02 ` [PATCH v2 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
@ 2024-11-28 11:02 ` Boris Brezillon
  2024-11-29 13:14   ` Adrián Larumbe
  2024-11-29 15:21   ` Steven Price
  2024-11-28 11:02 ` [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 14+ messages in thread
From: Boris Brezillon @ 2024-11-28 11:02 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

The runtime PM resume operation is not guaranteed to succeed, but if it
fails, the device should be in a suspended state. Make sure we're robust
to resume failures in the unplug path.

v2:
- Move the bit that belonged in the next commit
- Drop the panthor_device_unplug() changes

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
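The guard added in the unplug helpers below can be sketched in user
space as follows. struct mock_dev and its fields are assumptions made
for this illustration: pm_enabled stands in for IS_ENABLED(CONFIG_PM)
and runtime_active for the result of pm_runtime_active(); the counters
only record which teardown steps ran.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the unplug path looks at. */
struct mock_dev {
	bool pm_enabled;      /* stands in for IS_ENABLED(CONFIG_PM) */
	bool runtime_active;  /* stands in for pm_runtime_active() */
	int hw_accesses;      /* IRQ suspend, FW stop, L2 power-off */
	int sw_teardowns;     /* freeing BOs, dropping the VM, ... */
};

/* Same shape as the condition used in the patch. */
static bool hw_is_accessible(const struct mock_dev *dev)
{
	return !dev->pm_enabled || dev->runtime_active;
}

static void mock_unplug(struct mock_dev *dev)
{
	/* Only touch registers when the device is known to be powered. */
	if (hw_is_accessible(dev))
		dev->hw_accesses++;

	/* Software-only cleanup runs regardless of the power state. */
	dev->sw_teardowns++;
}
```

In this model, a device that failed to resume (pm_enabled set,
runtime_active clear) still gets its software state torn down, but no
register access is attempted.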
 drivers/gpu/drm/panthor/panthor_fw.c  | 14 +++++++++-----
 drivers/gpu/drm/panthor/panthor_gpu.c |  3 ++-
 drivers/gpu/drm/panthor/panthor_mmu.c |  3 ++-
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index ebf8980ca9a3..f3d3d8fbe13d 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -12,6 +12,7 @@
 #include <linux/iosys-map.h>
 #include <linux/mutex.h>
 #include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
 
 #include <drm/drm_drv.h>
 #include <drm/drm_managed.h>
@@ -1190,11 +1191,13 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
 
 	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
 
-	/* Make sure the IRQ handler can be called after that point. */
-	if (ptdev->fw->irq.irq)
-		panthor_job_irq_suspend(&ptdev->fw->irq);
+	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
+		/* Make sure the IRQ handler can be called after that point. */
+		if (ptdev->fw->irq.irq)
+			panthor_job_irq_suspend(&ptdev->fw->irq);
 
-	panthor_fw_stop(ptdev);
+		panthor_fw_stop(ptdev);
+	}
 
 	list_for_each_entry(section, &ptdev->fw->sections, node)
 		panthor_kernel_bo_destroy(section->mem);
@@ -1207,7 +1210,8 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
 	panthor_vm_put(ptdev->fw->vm);
 	ptdev->fw->vm = NULL;
 
-	panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
+	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+		panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
 }
 
 /**
diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
index 0f3cac6ec88e..ee85a371bc38 100644
--- a/drivers/gpu/drm/panthor/panthor_gpu.c
+++ b/drivers/gpu/drm/panthor/panthor_gpu.c
@@ -180,7 +180,8 @@ void panthor_gpu_unplug(struct panthor_device *ptdev)
 	unsigned long flags;
 
 	/* Make sure the IRQ handler is not running after that point. */
-	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
+	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+		panthor_gpu_irq_suspend(&ptdev->gpu->irq);
 
 	/* Wake-up all waiters. */
 	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index 9478ee2093d1..6716463903bc 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -2681,7 +2681,8 @@ int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm
  */
 void panthor_mmu_unplug(struct panthor_device *ptdev)
 {
-	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
+	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+		panthor_mmu_irq_suspend(&ptdev->mmu->irq);
 
 	mutex_lock(&ptdev->mmu->as.slots_lock);
 	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
-- 
2.46.2



* [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures
  2024-11-28 11:02 [PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
  2024-11-28 11:02 ` [PATCH v2 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
  2024-11-28 11:02 ` [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
@ 2024-11-28 11:02 ` Boris Brezillon
  2024-11-29 13:46   ` [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend,resume}_device() failures Adrián Larumbe
  2024-11-28 11:02 ` [PATCH v2 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
  2024-11-28 11:02 ` [PATCH v2 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
  4 siblings, 1 reply; 14+ messages in thread
From: Boris Brezillon @ 2024-11-28 11:02 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

devfreq_{resume,suspend}_device() don't bother undoing the suspend_count
modifications if something fails, so either they assume failures are
harmless, or they're super fragile/buggy. In either case it's not something
we can address at the driver level, so let's just assume failures are
harmless for now, like panfrost does.

v2:
- Add R-b

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
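A small user-space sketch of the resulting behaviour, under stated
assumptions: struct mock_dev, mock_devfreq_suspend_device() and the
other mock_* helpers are invented for this example and do not model
devfreq internals. The point is only that the wrapper now warns and
returns void, so the suspend path no longer needs a rollback branch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical device model for this illustration. */
struct mock_dev {
	bool devfreq_fails;  /* make the devfreq call report an error */
	bool clocks_on;
	int warnings;
};

/* Stand-in for devfreq_suspend_device(); the error value is arbitrary. */
static int mock_devfreq_suspend_device(const struct mock_dev *dev)
{
	return dev->devfreq_fails ? -EINVAL : 0;
}

/* Wrapper in the style of the patch: warn, but don't propagate. */
static void mock_devfreq_suspend(struct mock_dev *dev)
{
	int ret = mock_devfreq_suspend_device(dev);

	if (ret) {
		dev->warnings++;
		fprintf(stderr, "WARNING: devfreq suspend failed (%d)\n", ret);
	}
}

/* Suspend path without any error unwinding for devfreq. */
static int mock_device_suspend(struct mock_dev *dev)
{
	mock_devfreq_suspend(dev);
	dev->clocks_on = false;
	return 0;
}
```

Even when the mocked devfreq call fails, mock_device_suspend() still
turns the clocks off and returns 0, leaving one warning behind.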
 drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++++----
 drivers/gpu/drm/panthor/panthor_devfreq.h |  4 +--
 drivers/gpu/drm/panthor/panthor_device.c  | 35 ++---------------------
 3 files changed, 11 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
index ecc7a52bd688..3686515d368d 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.c
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
@@ -243,26 +243,26 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
 	return 0;
 }
 
-int panthor_devfreq_resume(struct panthor_device *ptdev)
+void panthor_devfreq_resume(struct panthor_device *ptdev)
 {
 	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
 
 	if (!pdevfreq->devfreq)
-		return 0;
+		return;
 
 	panthor_devfreq_reset(pdevfreq);
 
-	return devfreq_resume_device(pdevfreq->devfreq);
+	drm_WARN_ON(&ptdev->base, devfreq_resume_device(pdevfreq->devfreq));
 }
 
-int panthor_devfreq_suspend(struct panthor_device *ptdev)
+void panthor_devfreq_suspend(struct panthor_device *ptdev)
 {
 	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
 
 	if (!pdevfreq->devfreq)
-		return 0;
+		return;
 
-	return devfreq_suspend_device(pdevfreq->devfreq);
+	drm_WARN_ON(&ptdev->base, devfreq_suspend_device(pdevfreq->devfreq));
 }
 
 void panthor_devfreq_record_busy(struct panthor_device *ptdev)
diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.h b/drivers/gpu/drm/panthor/panthor_devfreq.h
index 83a5c9522493..b7631de695f7 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.h
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.h
@@ -12,8 +12,8 @@ struct panthor_devfreq;
 
 int panthor_devfreq_init(struct panthor_device *ptdev);
 
-int panthor_devfreq_resume(struct panthor_device *ptdev);
-int panthor_devfreq_suspend(struct panthor_device *ptdev);
+void panthor_devfreq_resume(struct panthor_device *ptdev);
+void panthor_devfreq_suspend(struct panthor_device *ptdev);
 
 void panthor_devfreq_record_busy(struct panthor_device *ptdev);
 void panthor_devfreq_record_idle(struct panthor_device *ptdev);
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index e701e605d013..e3b22107b268 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -453,9 +453,7 @@ int panthor_device_resume(struct device *dev)
 	if (ret)
 		goto err_disable_stacks_clk;
 
-	ret = panthor_devfreq_resume(ptdev);
-	if (ret)
-		goto err_disable_coregroup_clk;
+	panthor_devfreq_resume(ptdev);
 
 	if (panthor_device_is_initialized(ptdev) &&
 	    drm_dev_enter(&ptdev->base, &cookie)) {
@@ -492,8 +490,6 @@ int panthor_device_resume(struct device *dev)
 
 err_suspend_devfreq:
 	panthor_devfreq_suspend(ptdev);
-
-err_disable_coregroup_clk:
 	clk_disable_unprepare(ptdev->clks.coregroup);
 
 err_disable_stacks_clk:
@@ -510,7 +506,7 @@ int panthor_device_resume(struct device *dev)
 int panthor_device_suspend(struct device *dev)
 {
 	struct panthor_device *ptdev = dev_get_drvdata(dev);
-	int ret, cookie;
+	int cookie;
 
 	if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
 		return -EINVAL;
@@ -542,36 +538,11 @@ int panthor_device_suspend(struct device *dev)
 		drm_dev_exit(cookie);
 	}
 
-	ret = panthor_devfreq_suspend(ptdev);
-	if (ret) {
-		if (panthor_device_is_initialized(ptdev) &&
-		    drm_dev_enter(&ptdev->base, &cookie)) {
-			panthor_gpu_resume(ptdev);
-			panthor_mmu_resume(ptdev);
-			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
-			panthor_sched_resume(ptdev);
-			drm_dev_exit(cookie);
-		}
-
-		goto err_set_active;
-	}
+	panthor_devfreq_suspend(ptdev);
 
 	clk_disable_unprepare(ptdev->clks.coregroup);
 	clk_disable_unprepare(ptdev->clks.stacks);
 	clk_disable_unprepare(ptdev->clks.core);
 	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
 	return 0;
-
-err_set_active:
-	/* If something failed and we have to revert back to an
-	 * active state, we also need to clear the MMIO userspace
-	 * mappings, so any dumb pages that were mapped while we
-	 * were trying to suspend gets invalidated.
-	 */
-	mutex_lock(&ptdev->pm.mmio_lock);
-	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
-	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
-			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
-	mutex_unlock(&ptdev->pm.mmio_lock);
-	return ret;
 }
-- 
2.46.2



* [PATCH v2 4/5] drm/panthor: Be robust against resume failures
  2024-11-28 11:02 [PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
                   ` (2 preceding siblings ...)
  2024-11-28 11:02 ` [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
@ 2024-11-28 11:02 ` Boris Brezillon
  2024-11-29 13:59   ` Adrián Larumbe
  2024-11-29 15:21   ` Steven Price
  2024-11-28 11:02 ` [PATCH v2 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
  4 siblings, 2 replies; 14+ messages in thread
From: Boris Brezillon @ 2024-11-28 11:02 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

When the runtime PM resume callback returns an error, it puts the device
in a state where it can't be resumed anymore. Make sure we can recover
from such transient failures by calling pm_runtime_set_suspended()
explicitly after a pm_runtime_resume_and_get() failure.

v2:
- Add a comment explaining potential races in
  panthor_device_resume_and_get()

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
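The recovery scheme described in the commit message above can be
sketched in user space with C11 atomics. Everything here is a model
built on assumptions: struct mock_dev and the mock_* functions are
hypothetical, runtime_error stands in for the runtime PM core's sticky
error state, and the -EIO/-EINVAL values are only illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the relevant device state. */
struct mock_dev {
	bool runtime_error;          /* sticky error kept by the RPM core */
	bool callback_fails;         /* make the resume callback fail */
	atomic_int recovery_needed;  /* mirrors pm.recovery_needed */
	int set_suspended_calls;
};

/* Models pm_runtime_resume_and_get(): refuses while the error is sticky. */
static int mock_rpm_resume_and_get(struct mock_dev *dev)
{
	if (dev->runtime_error)
		return -EINVAL;

	if (dev->callback_fails) {
		/* The resume callback flags the failure before returning. */
		atomic_store(&dev->recovery_needed, 1);
		dev->runtime_error = true;
		return -EIO;
	}

	return 0;
}

/* Models pm_runtime_set_suspended(): clears the sticky error. */
static void mock_rpm_set_suspended(struct mock_dev *dev)
{
	dev->runtime_error = false;
	dev->set_suspended_calls++;
}

/* Same shape as panthor_device_resume_and_get() in the patch. */
static int mock_resume_and_get(struct mock_dev *dev)
{
	int ret = mock_rpm_resume_and_get(dev);
	int expected = 1;

	/* Only the caller that flips 1 -> 0 resets the RPM state. */
	if (ret && atomic_compare_exchange_strong(&dev->recovery_needed,
						  &expected, 0))
		mock_rpm_set_suspended(dev);

	return ret;
}
```

In this model the failing call still returns its error, but it leaves
the device in a state where a later attempt can succeed, and the
compare-and-exchange ensures the state is reset once per failure.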
 drivers/gpu/drm/panthor/panthor_device.c |  1 +
 drivers/gpu/drm/panthor/panthor_device.h | 26 ++++++++++++++++++++++++
 drivers/gpu/drm/panthor/panthor_drv.c    |  2 +-
 drivers/gpu/drm/panthor/panthor_sched.c  |  4 ++--
 4 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index e3b22107b268..0362101ea896 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -500,6 +500,7 @@ int panthor_device_resume(struct device *dev)
 
 err_set_suspended:
 	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
+	atomic_set(&ptdev->pm.recovery_needed, 1);
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 0e68f5a70d20..b6c4f25a5d6e 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -9,6 +9,7 @@
 #include <linux/atomic.h>
 #include <linux/io-pgtable.h>
 #include <linux/regulator/consumer.h>
+#include <linux/pm_runtime.h>
 #include <linux/sched.h>
 #include <linux/spinlock.h>
 
@@ -180,6 +181,9 @@ struct panthor_device {
 		 * is suspended.
 		 */
 		struct page *dummy_latest_flush;
+
+		/** @recovery_needed: True when a resume attempt failed. */
+		atomic_t recovery_needed;
 	} pm;
 
 	/** @profile_mask: User-set profiling flags for job accounting. */
@@ -243,6 +247,28 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
 int panthor_device_resume(struct device *dev);
 int panthor_device_suspend(struct device *dev);
 
+static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
+{
+	int ret = pm_runtime_resume_and_get(ptdev->base.dev);
+
+	/* If the resume failed, we need to clear the runtime_error, which
+	 * can be done by forcing the RPM state to suspended. If multiple
+	 * threads called panthor_device_resume_and_get(), we only want
+	 * one of them to update the state, hence the cmpxchg. Note that a
+	 * thread might enter panthor_device_resume_and_get() and call
+	 * pm_runtime_resume_and_get() after another thread had attempted
+	 * to resume and failed. This means we will end up with an error
+	 * without even attempting a resume ourselves. The only risk here
+	 * is to report an error when the second resume attempt might have
+	 * succeeded. Given resume errors are not expected, this is probably
+	 * something we can live with.
+	 */
+	if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
+		pm_runtime_set_suspended(ptdev->base.dev);
+
+	return ret;
+}
+
 enum drm_panthor_exception_type {
 	DRM_PANTHOR_EXCEPTION_OK = 0x00,
 	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 1498c97b4b85..b7a9adc918e3 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
 {
 	int ret;
 
-	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	ret = panthor_device_resume_and_get(ptdev);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 97ed5fe5a191..77b184c3fb0c 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
 	if (!drm_dev_enter(&ptdev->base, &cookie))
 		return;
 
-	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	ret = panthor_device_resume_and_get(ptdev);
 	if (drm_WARN_ON(&ptdev->base, ret))
 		goto out_dev_exit;
 
@@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
 		return dma_fence_get(job->done_fence);
 	}
 
-	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	ret = panthor_device_resume_and_get(ptdev);
 	if (drm_WARN_ON(&ptdev->base, ret))
 		return ERR_PTR(ret);
 
-- 
2.46.2



* [PATCH v2 5/5] drm/panthor: Fix the fast-reset logic
  2024-11-28 11:02 [PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
                   ` (3 preceding siblings ...)
  2024-11-28 11:02 ` [PATCH v2 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
@ 2024-11-28 11:02 ` Boris Brezillon
  4 siblings, 0 replies; 14+ messages in thread
From: Boris Brezillon @ 2024-11-28 11:02 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

If we do a GPU soft-reset, that's no longer a fast reset. This also means
the slow reset fallback doesn't work because the MCU state is only reset
after a GPU soft-reset.

Let's move the retry logic to panthor_device_resume() to issue a
soft-reset between the fast and slow attempts, and patch
panthor_gpu_suspend() to only power-off the L2 when a fast reset is
requested.

v2:
- Add R-b

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
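A user-space sketch of the retry control flow that this patch moves to
panthor_device_resume(). It only mirrors the shape of the code in the
diff below; struct mock_dev and the mock_* functions are hypothetical,
the GPU/MMU steps are left out, and hardware behaviour (including
whether the MCU really recovers) is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state involved in the fallback. */
struct mock_dev {
	bool fast;             /* mirrors ptdev->reset.fast */
	bool fast_boot_fails;  /* fake a FW boot failure on the fast path */
	int fw_section_reloads;
	int fw_boot_attempts;
};

/* Slow path reloads the FW sections; fast path boots them as-is. */
static int mock_fw_resume(struct mock_dev *dev)
{
	dev->fw_boot_attempts++;

	if (!dev->fast) {
		dev->fw_section_reloads++;
		return 0;
	}

	return dev->fast_boot_fails ? -ETIMEDOUT : 0;
}

/* One resume attempt; the real helper also resumes the GPU and MMU. */
static int mock_resume_hw_components(struct mock_dev *dev)
{
	return mock_fw_resume(dev);
}

/* Try the fast path first, then fall back to a slow reset once. */
static int mock_device_resume(struct mock_dev *dev)
{
	int ret = mock_resume_hw_components(dev);

	if (ret && dev->fast) {
		dev->fast = false;
		ret = mock_resume_hw_components(dev);
	}

	return ret;
}
```

With a faked fast-boot failure, the model makes two boot attempts, the
second one with the FW sections reloaded, and ends with the fast flag
cleared.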
 drivers/gpu/drm/panthor/panthor_device.c | 32 ++++++++++----
 drivers/gpu/drm/panthor/panthor_device.h | 11 +++++
 drivers/gpu/drm/panthor/panthor_fw.c     | 54 ++++++------------------
 drivers/gpu/drm/panthor/panthor_gpu.c    | 11 ++---
 4 files changed, 53 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 0362101ea896..2c817e65e6be 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -431,6 +431,22 @@ int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *
 	return 0;
 }
 
+static int panthor_device_resume_hw_components(struct panthor_device *ptdev)
+{
+	int ret;
+
+	panthor_gpu_resume(ptdev);
+	panthor_mmu_resume(ptdev);
+
+	ret = panthor_fw_resume(ptdev);
+	if (!ret)
+		return 0;
+
+	panthor_mmu_suspend(ptdev);
+	panthor_gpu_suspend(ptdev);
+	return ret;
+}
+
 int panthor_device_resume(struct device *dev)
 {
 	struct panthor_device *ptdev = dev_get_drvdata(dev);
@@ -457,16 +473,16 @@ int panthor_device_resume(struct device *dev)
 
 	if (panthor_device_is_initialized(ptdev) &&
 	    drm_dev_enter(&ptdev->base, &cookie)) {
-		panthor_gpu_resume(ptdev);
-		panthor_mmu_resume(ptdev);
-		ret = panthor_fw_resume(ptdev);
-		if (!drm_WARN_ON(&ptdev->base, ret)) {
-			panthor_sched_resume(ptdev);
-		} else {
-			panthor_mmu_suspend(ptdev);
-			panthor_gpu_suspend(ptdev);
+		ret = panthor_device_resume_hw_components(ptdev);
+		if (ret && ptdev->reset.fast) {
+			drm_err(&ptdev->base, "Fast reset failed, trying a slow reset");
+			ptdev->reset.fast = false;
+			ret = panthor_device_resume_hw_components(ptdev);
 		}
 
+		if (!ret)
+			panthor_sched_resume(ptdev);
+
 		drm_dev_exit(cookie);
 
 		if (ret)
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index b6c4f25a5d6e..da6574021664 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -157,6 +157,17 @@ struct panthor_device {
 
 		/** @pending: Set to true if a reset is pending. */
 		atomic_t pending;
+
+		/**
+		 * @fast: True if the post_reset logic can proceed with a fast reset.
+		 *
+		 * A fast reset is just a reset where the driver doesn't reload the FW sections.
+		 *
+		 * Any time the firmware is properly suspended, a fast reset can take place.
+		 * On the other hand, if the halt operation failed, the driver will reload
+		 * all FW sections to make sure we start from a fresh state.
+		 */
+		bool fast;
 	} reset;
 
 	/** @pm: Power management related data. */
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index f3d3d8fbe13d..a3d11d32b71c 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -263,17 +263,6 @@ struct panthor_fw {
 	/** @booted: True is the FW is booted */
 	bool booted;
 
-	/**
-	 * @fast_reset: True if the post_reset logic can proceed with a fast reset.
-	 *
-	 * A fast reset is just a reset where the driver doesn't reload the FW sections.
-	 *
-	 * Any time the firmware is properly suspended, a fast reset can take place.
-	 * On the other hand, if the halt operation failed, the driver will reload
-	 * all sections to make sure we start from a fresh state.
-	 */
-	bool fast_reset;
-
 	/** @irq: Job irq data. */
 	struct panthor_irq irq;
 };
@@ -1090,7 +1079,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
 	/* Make sure we won't be woken up by a ping. */
 	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
 
-	ptdev->fw->fast_reset = false;
+	ptdev->reset.fast = false;
 
 	if (!on_hang) {
 		struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
@@ -1100,7 +1089,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
 		gpu_write(ptdev, CSF_DOORBELL(CSF_GLB_DOORBELL_ID), 1);
 		if (!readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
 					status == MCU_STATUS_HALT, 10, 100000)) {
-			ptdev->fw->fast_reset = true;
+			ptdev->reset.fast = true;
 		} else {
 			drm_warn(&ptdev->base, "Failed to cleanly suspend MCU");
 		}
@@ -1125,49 +1114,30 @@ int panthor_fw_post_reset(struct panthor_device *ptdev)
 	if (ret)
 		return ret;
 
-	/* If this is a fast reset, try to start the MCU without reloading
-	 * the FW sections. If it fails, go for a full reset.
-	 */
-	if (ptdev->fw->fast_reset) {
+	if (!ptdev->reset.fast) {
+		/* On a slow reset, reload all sections, including RO ones.
+		 * We're not supposed to end up here anyway, let's just assume
+		 * the overhead of reloading everything is acceptable.
+		 */
+		panthor_reload_fw_sections(ptdev, true);
+	} else {
 		/* The FW detects 0 -> 1 transitions. Make sure we reset
 		 * the HALT bit before the FW is rebooted.
 		 * This is not needed on a slow reset because FW sections are
 		 * re-initialized.
 		 */
 		struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
+
 		panthor_fw_update_reqs(glb_iface, req, 0, GLB_HALT);
-
-		ret = panthor_fw_start(ptdev);
-		if (!ret)
-			goto out;
-
-		/* Forcibly reset the MCU and force a slow reset, so we get a
-		 * fresh boot on the next panthor_fw_start() call.
-		 */
-		panthor_fw_stop(ptdev);
-		ptdev->fw->fast_reset = false;
-		drm_err(&ptdev->base, "FW fast reset failed, trying a slow reset");
-
-		ret = panthor_vm_flush_all(ptdev->fw->vm);
-		if (ret) {
-			drm_err(&ptdev->base, "FW slow reset failed (couldn't flush FW's AS l2cache)");
-			return ret;
-		}
 	}
 
-	/* Reload all sections, including RO ones. We're not supposed
-	 * to end up here anyway, let's just assume the overhead of
-	 * reloading everything is acceptable.
-	 */
-	panthor_reload_fw_sections(ptdev, true);
-
 	ret = panthor_fw_start(ptdev);
 	if (ret) {
-		drm_err(&ptdev->base, "FW slow reset failed (couldn't start the FW )");
+		drm_err(&ptdev->base, "FW %s reset failed",
+			ptdev->reset.fast ?  "fast" : "slow");
 		return ret;
 	}
 
-out:
 	/* We must re-initialize the global interface even on fast-reset. */
 	panthor_fw_init_global_iface(ptdev);
 	return 0;
diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
index ee85a371bc38..671049020afa 100644
--- a/drivers/gpu/drm/panthor/panthor_gpu.c
+++ b/drivers/gpu/drm/panthor/panthor_gpu.c
@@ -470,11 +470,12 @@ int panthor_gpu_soft_reset(struct panthor_device *ptdev)
  */
 void panthor_gpu_suspend(struct panthor_device *ptdev)
 {
-	/*
-	 * It may be preferable to simply power down the L2, but for now just
-	 * soft-reset which will leave the L2 powered down.
-	 */
-	panthor_gpu_soft_reset(ptdev);
+	/* On a fast reset, simply power down the L2. */
+	if (!ptdev->reset.fast)
+		panthor_gpu_soft_reset(ptdev);
+	else
+		panthor_gpu_power_off(ptdev, L2, 1, 20000);
+
 	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
 }
 
-- 
2.46.2



* Re: [PATCH v2 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume()
  2024-11-28 11:02 ` [PATCH v2 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
@ 2024-11-29 13:11   ` Adrián Larumbe
  0 siblings, 0 replies; 14+ messages in thread
From: Adrián Larumbe @ 2024-11-29 13:11 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Steven Price, Liviu Dudau, dri-devel, kernel

Reviewed-by: Adrian Larumbe <adrian.larumbe@collabora.com>

On 28.11.2024 12:02, Boris Brezillon wrote:
> WARN() will return true if the condition is true, false otherwise.
> If we store the return of drm_WARN_ON() in ret, we lose the actual
> error code.
> 
> v2:
> - Add R-b
> 
> Fixes: 5fe909cae118 ("drm/panthor: Add the device logical block")
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Steven Price <steven.price@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 984615f4ed27..e701e605d013 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -461,8 +461,8 @@ int panthor_device_resume(struct device *dev)
>  	    drm_dev_enter(&ptdev->base, &cookie)) {
>  		panthor_gpu_resume(ptdev);
>  		panthor_mmu_resume(ptdev);
> -		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> -		if (!ret) {
> +		ret = panthor_fw_resume(ptdev);
> +		if (!drm_WARN_ON(&ptdev->base, ret)) {
>  			panthor_sched_resume(ptdev);
>  		} else {
>  			panthor_mmu_suspend(ptdev);
> -- 
> 2.46.2



* Re: [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-28 11:02 ` [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
@ 2024-11-29 13:14   ` Adrián Larumbe
  2024-11-29 14:45     ` Boris Brezillon
  2024-11-29 15:21   ` Steven Price
  1 sibling, 1 reply; 14+ messages in thread
From: Adrián Larumbe @ 2024-11-29 13:14 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Steven Price, Liviu Dudau, dri-devel, kernel

Reviewed-by: Adrian Larumbe <adrian.larumbe@collabora.com>

On 28.11.2024 12:02, Boris Brezillon wrote:
> The runtime PM resume operation is not guaranteed to succeed, but if it
> fails, the device should be in a suspended state. Make sure we're robust
> to resume failures in the unplug path.
> 
> v2:
> - Move the bit that belonged in the next commit
> - Drop the panthor_device_unplug() changes
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_fw.c  | 14 +++++++++-----
>  drivers/gpu/drm/panthor/panthor_gpu.c |  3 ++-
>  drivers/gpu/drm/panthor/panthor_mmu.c |  3 ++-
>  3 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index ebf8980ca9a3..f3d3d8fbe13d 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -12,6 +12,7 @@
>  #include <linux/iosys-map.h>
>  #include <linux/mutex.h>
>  #include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
>  
>  #include <drm/drm_drv.h>
>  #include <drm/drm_managed.h>
> @@ -1190,11 +1191,13 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
>  
>  	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
>  
> -	/* Make sure the IRQ handler can be called after that point. */
> -	if (ptdev->fw->irq.irq)
> -		panthor_job_irq_suspend(&ptdev->fw->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
> +		/* Make sure the IRQ handler can be called after that point. */

Did you mean 'cannot be called'?

> +		if (ptdev->fw->irq.irq)
> +			panthor_job_irq_suspend(&ptdev->fw->irq);
>  
> -	panthor_fw_stop(ptdev);
> +		panthor_fw_stop(ptdev);
> +	}
>  
>  	list_for_each_entry(section, &ptdev->fw->sections, node)
>  		panthor_kernel_bo_destroy(section->mem);
> @@ -1207,7 +1210,8 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
>  	panthor_vm_put(ptdev->fw->vm);
>  	ptdev->fw->vm = NULL;
>  
> -	panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
> index 0f3cac6ec88e..ee85a371bc38 100644
> --- a/drivers/gpu/drm/panthor/panthor_gpu.c
> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
> @@ -180,7 +180,8 @@ void panthor_gpu_unplug(struct panthor_device *ptdev)
>  	unsigned long flags;
>  
>  	/* Make sure the IRQ handler is not running after that point. */
> -	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_gpu_irq_suspend(&ptdev->gpu->irq);
>  
>  	/* Wake-up all waiters. */
>  	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> index 9478ee2093d1..6716463903bc 100644
> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -2681,7 +2681,8 @@ int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm
>   */
>  void panthor_mmu_unplug(struct panthor_device *ptdev)
>  {
> -	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_mmu_irq_suspend(&ptdev->mmu->irq);
>  
>  	mutex_lock(&ptdev->mmu->as.slots_lock);
>  	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
> -- 
> 2.46.2

* Re: [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend,resume}_device() failures
  2024-11-28 11:02 ` [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
@ 2024-11-29 13:46   ` Adrián Larumbe
  0 siblings, 0 replies; 14+ messages in thread
From: Adrián Larumbe @ 2024-11-29 13:46 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Steven Price, Liviu Dudau, dri-devel, kernel

Reviewed-by: Adrian Larumbe <adrian.larumbe@collabora.com>

On 28.11.2024 12:02, Boris Brezillon wrote:
> devfreq_{resume,suspend}_device() don't bother undoing the suspend_count
> modifications if something fails, so either it assumes failures are
> harmless, or it's super fragile/buggy. In either case it's not something
> we can address at the driver level, so let's just assume failures are
> harmless for now, like is done in panfrost.

In my experience, when devfreq_suspend_device() fails in the PM suspend path,
FW resumption always fails afterwards, even after a slow reset, although I guess
the latest patch in this series already addresses that.

> v2:
> - Add R-b
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Steven Price <steven.price@arm.com>
> ---
>  drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++++----
>  drivers/gpu/drm/panthor/panthor_devfreq.h |  4 +--
>  drivers/gpu/drm/panthor/panthor_device.c  | 35 ++---------------------
>  3 files changed, 11 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
> index ecc7a52bd688..3686515d368d 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.c
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
> @@ -243,26 +243,26 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>  	return 0;
>  }
>  
> -int panthor_devfreq_resume(struct panthor_device *ptdev)
> +void panthor_devfreq_resume(struct panthor_device *ptdev)
>  {
>  	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
>  
>  	if (!pdevfreq->devfreq)
> -		return 0;
> +		return;
>  
>  	panthor_devfreq_reset(pdevfreq);
>  
> -	return devfreq_resume_device(pdevfreq->devfreq);
> +	drm_WARN_ON(&ptdev->base, devfreq_resume_device(pdevfreq->devfreq));
>  }
>  
> -int panthor_devfreq_suspend(struct panthor_device *ptdev)
> +void panthor_devfreq_suspend(struct panthor_device *ptdev)
>  {
>  	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
>  
>  	if (!pdevfreq->devfreq)
> -		return 0;
> +		return;
>  
> -	return devfreq_suspend_device(pdevfreq->devfreq);
> +	drm_WARN_ON(&ptdev->base, devfreq_suspend_device(pdevfreq->devfreq));
>  }
>  
>  void panthor_devfreq_record_busy(struct panthor_device *ptdev)
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.h b/drivers/gpu/drm/panthor/panthor_devfreq.h
> index 83a5c9522493..b7631de695f7 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.h
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.h
> @@ -12,8 +12,8 @@ struct panthor_devfreq;
>  
>  int panthor_devfreq_init(struct panthor_device *ptdev);
>  
> -int panthor_devfreq_resume(struct panthor_device *ptdev);
> -int panthor_devfreq_suspend(struct panthor_device *ptdev);
> +void panthor_devfreq_resume(struct panthor_device *ptdev);
> +void panthor_devfreq_suspend(struct panthor_device *ptdev);
>  
>  void panthor_devfreq_record_busy(struct panthor_device *ptdev);
>  void panthor_devfreq_record_idle(struct panthor_device *ptdev);
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index e701e605d013..e3b22107b268 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -453,9 +453,7 @@ int panthor_device_resume(struct device *dev)
>  	if (ret)
>  		goto err_disable_stacks_clk;
>  
> -	ret = panthor_devfreq_resume(ptdev);
> -	if (ret)
> -		goto err_disable_coregroup_clk;
> +	panthor_devfreq_resume(ptdev);
>  
>  	if (panthor_device_is_initialized(ptdev) &&
>  	    drm_dev_enter(&ptdev->base, &cookie)) {
> @@ -492,8 +490,6 @@ int panthor_device_resume(struct device *dev)
>  
>  err_suspend_devfreq:
>  	panthor_devfreq_suspend(ptdev);
> -
> -err_disable_coregroup_clk:
>  	clk_disable_unprepare(ptdev->clks.coregroup);
>  
>  err_disable_stacks_clk:
> @@ -510,7 +506,7 @@ int panthor_device_resume(struct device *dev)
>  int panthor_device_suspend(struct device *dev)
>  {
>  	struct panthor_device *ptdev = dev_get_drvdata(dev);
> -	int ret, cookie;
> +	int cookie;
>  
>  	if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
>  		return -EINVAL;
> @@ -542,36 +538,11 @@ int panthor_device_suspend(struct device *dev)
>  		drm_dev_exit(cookie);
>  	}
>  
> -	ret = panthor_devfreq_suspend(ptdev);
> -	if (ret) {
> -		if (panthor_device_is_initialized(ptdev) &&
> -		    drm_dev_enter(&ptdev->base, &cookie)) {
> -			panthor_gpu_resume(ptdev);
> -			panthor_mmu_resume(ptdev);
> -			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> -			panthor_sched_resume(ptdev);
> -			drm_dev_exit(cookie);
> -		}
> -
> -		goto err_set_active;
> -	}
> +	panthor_devfreq_suspend(ptdev);
>  
>  	clk_disable_unprepare(ptdev->clks.coregroup);
>  	clk_disable_unprepare(ptdev->clks.stacks);
>  	clk_disable_unprepare(ptdev->clks.core);
>  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
>  	return 0;
> -
> -err_set_active:
> -	/* If something failed and we have to revert back to an
> -	 * active state, we also need to clear the MMIO userspace
> -	 * mappings, so any dumb pages that were mapped while we
> -	 * were trying to suspend gets invalidated.
> -	 */
> -	mutex_lock(&ptdev->pm.mmio_lock);
> -	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
> -	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> -			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> -	mutex_unlock(&ptdev->pm.mmio_lock);
> -	return ret;
>  }
> -- 
> 2.46.2

* Re: [PATCH v2 4/5] drm/panthor: Be robust against resume failures
  2024-11-28 11:02 ` [PATCH v2 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
@ 2024-11-29 13:59   ` Adrián Larumbe
  2024-11-29 14:44     ` Boris Brezillon
  2024-11-29 15:21   ` Steven Price
  1 sibling, 1 reply; 14+ messages in thread
From: Adrián Larumbe @ 2024-11-29 13:59 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Steven Price, Liviu Dudau, dri-devel, kernel

Reviewed-by: Adrian Larumbe <adrian.larumbe@collabora.com>

On 28.11.2024 12:02, Boris Brezillon wrote:
> When the runtime PM resume callback returns an error, it puts the device
> in a state where it can't be resumed anymore. Make sure we can recover
> from such transient failures by calling pm_runtime_set_suspended()
> explicitly after a pm_runtime_resume_and_get() failure.
> 
> v2:
> - Add a comment explaining potential races in
>   panthor_device_resume_and_get()
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c |  1 +
>  drivers/gpu/drm/panthor/panthor_device.h | 26 ++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_drv.c    |  2 +-
>  drivers/gpu/drm/panthor/panthor_sched.c  |  4 ++--
>  4 files changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index e3b22107b268..0362101ea896 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -500,6 +500,7 @@ int panthor_device_resume(struct device *dev)
>  
>  err_set_suspended:
>  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> +	atomic_set(&ptdev->pm.recovery_needed, 1);

I think it might be the case that if PM resume fails, then ptdev->base.dev->power.runtime_error
would be set to '1' and then you might use this state variable in panthor_device_resume_and_get()
rather than encoding it explicitly into the panthor driver struct?

>  	return ret;
>  }
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 0e68f5a70d20..b6c4f25a5d6e 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -9,6 +9,7 @@
>  #include <linux/atomic.h>
>  #include <linux/io-pgtable.h>
>  #include <linux/regulator/consumer.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/sched.h>
>  #include <linux/spinlock.h>
>  
> @@ -180,6 +181,9 @@ struct panthor_device {
>  		 * is suspended.
>  		 */
>  		struct page *dummy_latest_flush;
> +
> +		/** @recovery_needed: True when a resume attempt failed. */
> +		atomic_t recovery_needed;
>  	} pm;
>  
>  	/** @profile_mask: User-set profiling flags for job accounting. */
> @@ -243,6 +247,28 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
>  int panthor_device_resume(struct device *dev);
>  int panthor_device_suspend(struct device *dev);
>  
> +static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
> +{
> +	int ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +
> +	/* If the resume failed, we need to clear the runtime_error, which
> +	 * can be done by forcing the RPM state to suspended. If multiple
> +	 * threads called panthor_device_resume_and_get(), we only want
> +	 * one of them to update the state, hence the cmpxchg. Note that a
> +	 * thread might enter panthor_device_resume_and_get() and call
> +	 * pm_runtime_resume_and_get() after another thread had attempted
> +	 * to resume and failed. This means we will end up with an error
> +	 * without even attempting a resume ourselves. The only risk here
> +	 * is to report an error when the second resume attempt might have
> +	 * succeeded. Given resume errors are not expected, this is probably
> +	 * something we can live with.
> +	 */
> +	if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
> +		pm_runtime_set_suspended(ptdev->base.dev);
> +
> +	return ret;
> +}
> +
>  enum drm_panthor_exception_type {
>  	DRM_PANTHOR_EXCEPTION_OK = 0x00,
>  	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 1498c97b4b85..b7a9adc918e3 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
>  {
>  	int ret;
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (ret)
>  		return ret;
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 97ed5fe5a191..77b184c3fb0c 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
>  	if (!drm_dev_enter(&ptdev->base, &cookie))
>  		return;
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (drm_WARN_ON(&ptdev->base, ret))
>  		goto out_dev_exit;
>  
> @@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
>  		return dma_fence_get(job->done_fence);
>  	}
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (drm_WARN_ON(&ptdev->base, ret))
>  		return ERR_PTR(ret);
>  
> -- 
> 2.46.2

* Re: [PATCH v2 4/5] drm/panthor: Be robust against resume failures
  2024-11-29 13:59   ` Adrián Larumbe
@ 2024-11-29 14:44     ` Boris Brezillon
  0 siblings, 0 replies; 14+ messages in thread
From: Boris Brezillon @ 2024-11-29 14:44 UTC (permalink / raw)
  To: Adrián Larumbe; +Cc: Steven Price, Liviu Dudau, dri-devel, kernel

On Fri, 29 Nov 2024 13:59:13 +0000
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:

> Reviewed-by: Adrian Larumbe <adrian.larumbe@collabora.com>
> 
> On 28.11.2024 12:02, Boris Brezillon wrote:
> > When the runtime PM resume callback returns an error, it puts the device
> > in a state where it can't be resumed anymore. Make sure we can recover
> > from such transient failures by calling pm_runtime_set_suspended()
> > explicitly after a pm_runtime_resume_and_get() failure.
> > 
> > v2:
> > - Add a comment explaining potential races in
> >   panthor_device_resume_and_get()
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_device.c |  1 +
> >  drivers/gpu/drm/panthor/panthor_device.h | 26 ++++++++++++++++++++++++
> >  drivers/gpu/drm/panthor/panthor_drv.c    |  2 +-
> >  drivers/gpu/drm/panthor/panthor_sched.c  |  4 ++--
> >  4 files changed, 30 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > index e3b22107b268..0362101ea896 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.c
> > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > @@ -500,6 +500,7 @@ int panthor_device_resume(struct device *dev)
> >  
> >  err_set_suspended:
> >  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > +	atomic_set(&ptdev->pm.recovery_needed, 1);  
> 
> I think it might be the case that if PM resume fails, then ptdev->base.dev->power.runtime_error
> would be set to '1' and then you might use this state variable in panthor_device_resume_and_get()
> rather than encoding it explicitly into the panthor driver struct?

So, there are two reasons for not using
ptdev->base.dev->power.runtime_error directly here:

1. I hate accessing a subsystem's internal objects directly, and if
there's no helper to check whether a runtime error is pending, I suspect
there's a good reason.

2. We need an atomic variable to ensure only one thread clears the
runtime_error (see the comment in panthor_device_resume_and_get()).
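
To make point 2 concrete, here is a minimal user-space sketch of the
"only one caller clears the error" pattern, using C11 atomics in place
of the kernel's atomic_cmpxchg(). All names (fake_pm, fake_resume_failed,
fake_claim_recovery, fake_resume_and_get) are hypothetical and exist only
for illustration; the counter stands in for the call to
pm_runtime_set_suspended().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's pm.recovery_needed flag. */
struct fake_pm {
	atomic_int recovery_needed;
	atomic_int rpm_resets; /* counts how often the "set suspended" step ran */
};

/* Called by the (simulated) resume callback when it fails. */
static void fake_resume_failed(struct fake_pm *pm)
{
	atomic_store(&pm->recovery_needed, 1);
}

/* Returns true for exactly one caller per recorded failure. */
static bool fake_claim_recovery(struct fake_pm *pm)
{
	int expected = 1;

	/* Swap 1 -> 0; only the caller that observes 1 wins the claim. */
	return atomic_compare_exchange_strong(&pm->recovery_needed,
					      &expected, 0);
}

/* Mirrors the shape of the wrapper: pass the error through, recover once. */
static int fake_resume_and_get(struct fake_pm *pm, int resume_ret)
{
	if (resume_ret && fake_claim_recovery(pm))
		atomic_fetch_add(&pm->rpm_resets, 1);

	return resume_ret;
}
```

If two callers both see a failed resume after a single failure was
recorded, both get the error back, but only one of them runs the
recovery step. A plain (non-atomic) flag would let both of them pass
the check.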

> 
> >  	return ret;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > index 0e68f5a70d20..b6c4f25a5d6e 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -9,6 +9,7 @@
> >  #include <linux/atomic.h>
> >  #include <linux/io-pgtable.h>
> >  #include <linux/regulator/consumer.h>
> > +#include <linux/pm_runtime.h>
> >  #include <linux/sched.h>
> >  #include <linux/spinlock.h>
> >  
> > @@ -180,6 +181,9 @@ struct panthor_device {
> >  		 * is suspended.
> >  		 */
> >  		struct page *dummy_latest_flush;
> > +
> > +		/** @recovery_needed: True when a resume attempt failed. */
> > +		atomic_t recovery_needed;
> >  	} pm;
> >  
> >  	/** @profile_mask: User-set profiling flags for job accounting. */
> > @@ -243,6 +247,28 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
> >  int panthor_device_resume(struct device *dev);
> >  int panthor_device_suspend(struct device *dev);
> >  
> > +static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
> > +{
> > +	int ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > +
> > +	/* If the resume failed, we need to clear the runtime_error, which
> > > +	 * can be done by forcing the RPM state to suspended. If multiple
> > +	 * threads called panthor_device_resume_and_get(), we only want
> > +	 * one of them to update the state, hence the cmpxchg. Note that a
> > +	 * thread might enter panthor_device_resume_and_get() and call
> > +	 * pm_runtime_resume_and_get() after another thread had attempted
> > +	 * to resume and failed. This means we will end up with an error
> > +	 * without even attempting a resume ourselves. The only risk here
> > +	 * is to report an error when the second resume attempt might have
> > +	 * succeeded. Given resume errors are not expected, this is probably
> > +	 * something we can live with.
> > +	 */
> > +	if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
> > +		pm_runtime_set_suspended(ptdev->base.dev);
> > +
> > +	return ret;
> > +}
> > +
> >  enum drm_panthor_exception_type {
> >  	DRM_PANTHOR_EXCEPTION_OK = 0x00,
> >  	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
> > diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> > index 1498c97b4b85..b7a9adc918e3 100644
> > --- a/drivers/gpu/drm/panthor/panthor_drv.c
> > +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> > @@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
> >  {
> >  	int ret;
> >  
> > -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > +	ret = panthor_device_resume_and_get(ptdev);
> >  	if (ret)
> >  		return ret;
> >  
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index 97ed5fe5a191..77b184c3fb0c 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
> >  	if (!drm_dev_enter(&ptdev->base, &cookie))
> >  		return;
> >  
> > -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > +	ret = panthor_device_resume_and_get(ptdev);
> >  	if (drm_WARN_ON(&ptdev->base, ret))
> >  		goto out_dev_exit;
> >  
> > @@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
> >  		return dma_fence_get(job->done_fence);
> >  	}
> >  
> > -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > +	ret = panthor_device_resume_and_get(ptdev);
> >  	if (drm_WARN_ON(&ptdev->base, ret))
> >  		return ERR_PTR(ret);
> >  
> > -- 
> > 2.46.2  


* Re: [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-29 13:14   ` Adrián Larumbe
@ 2024-11-29 14:45     ` Boris Brezillon
  0 siblings, 0 replies; 14+ messages in thread
From: Boris Brezillon @ 2024-11-29 14:45 UTC (permalink / raw)
  To: Adrián Larumbe; +Cc: Steven Price, dri-devel, kernel

On Fri, 29 Nov 2024 13:14:58 +0000
Adrián Larumbe <adrian.larumbe@collabora.com> wrote:

> Reviewed-by: Adrian Larumbe <adrian.larumbe@collabora.com>
> 
> On 28.11.2024 12:02, Boris Brezillon wrote:
> > The runtime PM resume operation is not guaranteed to succeed, but if it
> > fails, the device should be in a suspended state. Make sure we're robust
> > to resume failures in the unplug path.
> > 
> > v2:
> > - Move the bit that belonged in the next commit
> > - Drop the panthor_device_unplug() changes
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_fw.c  | 14 +++++++++-----
> >  drivers/gpu/drm/panthor/panthor_gpu.c |  3 ++-
> >  drivers/gpu/drm/panthor/panthor_mmu.c |  3 ++-
> >  3 files changed, 13 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> > index ebf8980ca9a3..f3d3d8fbe13d 100644
> > --- a/drivers/gpu/drm/panthor/panthor_fw.c
> > +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/iosys-map.h>
> >  #include <linux/mutex.h>
> >  #include <linux/platform_device.h>
> > +#include <linux/pm_runtime.h>
> >  
> >  #include <drm/drm_drv.h>
> >  #include <drm/drm_managed.h>
> > @@ -1190,11 +1191,13 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
> >  
> >  	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
> >  
> > -	/* Make sure the IRQ handler can be called after that point. */
> > -	if (ptdev->fw->irq.irq)
> > -		panthor_job_irq_suspend(&ptdev->fw->irq);
> > +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
> > +		/* Make sure the IRQ handler can be called after that point. */  
> 
> Did you mean 'cannot be called' ?

Oops, indeed I meant 'cannot'.


* Re: [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-28 11:02 ` [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
  2024-11-29 13:14   ` Adrián Larumbe
@ 2024-11-29 15:21   ` Steven Price
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Price @ 2024-11-29 15:21 UTC (permalink / raw)
  To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel

On 28/11/2024 11:02, Boris Brezillon wrote:
> The runtime PM resume operation is not guaranteed to succeed, but if it
> fails, the device should be in a suspended state. Make sure we're robust
> to resume failures in the unplug path.
> 
> v2:
> - Move the bit that belonged in the next commit
> - Drop the panthor_device_unplug() changes
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

With the comment fix:

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_fw.c  | 14 +++++++++-----
>  drivers/gpu/drm/panthor/panthor_gpu.c |  3 ++-
>  drivers/gpu/drm/panthor/panthor_mmu.c |  3 ++-
>  3 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index ebf8980ca9a3..f3d3d8fbe13d 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -12,6 +12,7 @@
>  #include <linux/iosys-map.h>
>  #include <linux/mutex.h>
>  #include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
>  
>  #include <drm/drm_drv.h>
>  #include <drm/drm_managed.h>
> @@ -1190,11 +1191,13 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
>  
>  	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
>  
> -	/* Make sure the IRQ handler can be called after that point. */
> -	if (ptdev->fw->irq.irq)
> -		panthor_job_irq_suspend(&ptdev->fw->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
> +		/* Make sure the IRQ handler can be called after that point. */
> +		if (ptdev->fw->irq.irq)
> +			panthor_job_irq_suspend(&ptdev->fw->irq);
>  
> -	panthor_fw_stop(ptdev);
> +		panthor_fw_stop(ptdev);
> +	}
>  
>  	list_for_each_entry(section, &ptdev->fw->sections, node)
>  		panthor_kernel_bo_destroy(section->mem);
> @@ -1207,7 +1210,8 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
>  	panthor_vm_put(ptdev->fw->vm);
>  	ptdev->fw->vm = NULL;
>  
> -	panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
> index 0f3cac6ec88e..ee85a371bc38 100644
> --- a/drivers/gpu/drm/panthor/panthor_gpu.c
> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
> @@ -180,7 +180,8 @@ void panthor_gpu_unplug(struct panthor_device *ptdev)
>  	unsigned long flags;
>  
>  	/* Make sure the IRQ handler is not running after that point. */
> -	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_gpu_irq_suspend(&ptdev->gpu->irq);
>  
>  	/* Wake-up all waiters. */
>  	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> index 9478ee2093d1..6716463903bc 100644
> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -2681,7 +2681,8 @@ int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm
>   */
>  void panthor_mmu_unplug(struct panthor_device *ptdev)
>  {
> -	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_mmu_irq_suspend(&ptdev->mmu->irq);
>  
>  	mutex_lock(&ptdev->mmu->as.slots_lock);
>  	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {


* Re: [PATCH v2 4/5] drm/panthor: Be robust against resume failures
  2024-11-28 11:02 ` [PATCH v2 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
  2024-11-29 13:59   ` Adrián Larumbe
@ 2024-11-29 15:21   ` Steven Price
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Price @ 2024-11-29 15:21 UTC (permalink / raw)
  To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel

On 28/11/2024 11:02, Boris Brezillon wrote:
> When the runtime PM resume callback returns an error, it puts the device
> in a state where it can't be resumed anymore. Make sure we can recover
> from such transient failures by calling pm_runtime_set_suspended()
> explicitly after a pm_runtime_resume_and_get() failure.
> 
> v2:
> - Add a comment explaining potential races in
>   panthor_device_resume_and_get()

Thanks for the comment, see below.

> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c |  1 +
>  drivers/gpu/drm/panthor/panthor_device.h | 26 ++++++++++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_drv.c    |  2 +-
>  drivers/gpu/drm/panthor/panthor_sched.c  |  4 ++--
>  4 files changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index e3b22107b268..0362101ea896 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -500,6 +500,7 @@ int panthor_device_resume(struct device *dev)
>  
>  err_set_suspended:
>  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> +	atomic_set(&ptdev->pm.recovery_needed, 1);
>  	return ret;
>  }
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 0e68f5a70d20..b6c4f25a5d6e 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -9,6 +9,7 @@
>  #include <linux/atomic.h>
>  #include <linux/io-pgtable.h>
>  #include <linux/regulator/consumer.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/sched.h>
>  #include <linux/spinlock.h>
>  
> @@ -180,6 +181,9 @@ struct panthor_device {
>  		 * is suspended.
>  		 */
>  		struct page *dummy_latest_flush;
> +
> +		/** @recovery_needed: True when a resume attempt failed. */
> +		atomic_t recovery_needed;
>  	} pm;
>  
>  	/** @profile_mask: User-set profiling flags for job accounting. */
> @@ -243,6 +247,28 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
>  int panthor_device_resume(struct device *dev);
>  int panthor_device_suspend(struct device *dev);
>  
> +static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
> +{
> +	int ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +
> +	/* If the resume failed, we need to clear the runtime_error, which
> +	 * can be done by forcing the RPM state to suspended. If multiple
> +	 * threads called panthor_device_resume_and_get(), we only want
> +	 * one of them to update the state, hence the cmpxchg. Note that a
> +	 * thread might enter panthor_device_resume_and_get() and call
> +	 * pm_runtime_resume_and_get() after another thread had attempted
> +	 * to resume and failed. This means we will end up with an error
> +	 * without even attempting a resume ourselves. The only risk here
> +	 * is to report an error when the second resume attempt might have
> +	 * succeeded. Given resume errors are not expected, this is probably
> +	 * something we can live with.

I agree this is "something we can live with", and the comment at least
explains the logic here - so hopefully it won't confuse me in the
future. But it still seems like this is the wrong solution because we've
got a known race.

On the other hand it's a clear improvement over the broken state before
(and I'm afraid I don't have time at the moment to look at it in
detail), so feel free to merge it for now:

Acked-by: Steven Price <steven.price@arm.com>
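
For readers following along, the race discussed here can be replayed
deterministically with a small user-space model. This is only a sketch
under one assumption taken from the commit message: once the resume
callback has failed, the RPM core refuses further resume attempts until
the error is cleared. The struct and function names (fake_rpm,
fake_rpm_resume, fake_recover) are made up for the example and are not
kernel APIs.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified model of an RPM core with a sticky error. */
struct fake_rpm {
	int runtime_error;     /* sticky: set when the resume callback fails */
	bool recovery_needed;  /* models pm.recovery_needed (atomic in the patch) */
	int callback_calls;    /* how many times the resume callback really ran */
	int next_callback_ret; /* what the simulated callback returns next */
};

/* Models the core's resume entry point. */
static int fake_rpm_resume(struct fake_rpm *rpm)
{
	int ret;

	/* A pending error short-circuits: the callback is not attempted. */
	if (rpm->runtime_error)
		return -EINVAL;

	rpm->callback_calls++;
	ret = rpm->next_callback_ret;
	if (ret) {
		rpm->runtime_error = ret;
		rpm->recovery_needed = true;
	}

	return ret;
}

/* Models the recovery half of the wrapper: clear the sticky error once. */
static void fake_recover(struct fake_rpm *rpm)
{
	if (rpm->recovery_needed) {
		rpm->recovery_needed = false;
		rpm->runtime_error = 0;
	}
}
```

Replaying "A fails, B tries before A recovers, then someone recovers"
against this model shows B getting an error even though the underlying
failure was transient, while the next attempt after recovery reaches
the callback again. That is the spurious error the comment accepts.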

> +	 */
> +	if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
> +		pm_runtime_set_suspended(ptdev->base.dev);
> +
> +	return ret;
> +}
> +
>  enum drm_panthor_exception_type {
>  	DRM_PANTHOR_EXCEPTION_OK = 0x00,
>  	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 1498c97b4b85..b7a9adc918e3 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
>  {
>  	int ret;
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (ret)
>  		return ret;
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 97ed5fe5a191..77b184c3fb0c 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
>  	if (!drm_dev_enter(&ptdev->base, &cookie))
>  		return;
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (drm_WARN_ON(&ptdev->base, ret))
>  		goto out_dev_exit;
>  
> @@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
>  		return dma_fence_get(job->done_fence);
>  	}
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (drm_WARN_ON(&ptdev->base, ret))
>  		return ERR_PTR(ret);
>  


end of thread, other threads:[~2024-11-29 15:22 UTC | newest]

Thread overview: 14+ messages
2024-11-28 11:02 [PATCH v2 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
2024-11-28 11:02 ` [PATCH v2 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
2024-11-29 13:11   ` Adrián Larumbe
2024-11-28 11:02 ` [PATCH v2 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
2024-11-29 13:14   ` Adrián Larumbe
2024-11-29 14:45     ` Boris Brezillon
2024-11-29 15:21   ` Steven Price
2024-11-28 11:02 ` [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
2024-11-29 13:46   ` [PATCH v2 3/5] drm/panthor: Ignore devfreq_{suspend,resume}_device() failures Adrián Larumbe
2024-11-28 11:02 ` [PATCH v2 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
2024-11-29 13:59   ` Adrián Larumbe
2024-11-29 14:44     ` Boris Brezillon
2024-11-29 15:21   ` Steven Price
2024-11-28 11:02 ` [PATCH v2 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
