All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] drm/panthor: Be robust against failures in the resume path
@ 2024-11-13 15:42 Boris Brezillon
  2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

Hello,

Here's a collection of patches improving robustness to failures in
the device resume/suspend path. Those failures are pretty hard to
reproduce (happens once in a while on a deqp-vk run), so I used a
mechanism to fake them.

Faking a FW boot failure is kinda tricky though, which means the
last patch has only been partially tested:
- the fast reset path is well tested because that's the default on
  a device suspend
- the slow reset has been tested with a hack replacing fast resets
  by slow resets
- the fast -> slow reset fallback has been tested by faking boot
  failures after a fast reset, but these are not real, which means
  we can't really validate if the MCU recovers fine after a slow
  reset

On the other hand, this implementation doesn't look like it could
do more harm than the current one (the only difference is the
extra GPU soft-reset that happens between the fast and slow FW
boot).

Regards,

Boris

Boris Brezillon (5):
  drm/panthor: Preserve the result returned by panthor_fw_resume()
  drm/panthor: Be robust against runtime PM resume failures in the
    suspend path
  drm/panthor: Ignore devfreq_{suspend,resume}_device() failures
  drm/panthor: Be robust against resume failures
  drm/panthor: Fix the fast-reset logic

 drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++--
 drivers/gpu/drm/panthor/panthor_devfreq.h |  4 +-
 drivers/gpu/drm/panthor/panthor_device.c  | 78 +++++++++++------------
 drivers/gpu/drm/panthor/panthor_device.h  | 28 ++++++++
 drivers/gpu/drm/panthor/panthor_drv.c     |  2 +-
 drivers/gpu/drm/panthor/panthor_fw.c      | 62 +++++-------------
 drivers/gpu/drm/panthor/panthor_gpu.c     | 14 ++--
 drivers/gpu/drm/panthor/panthor_mmu.c     |  3 +-
 drivers/gpu/drm/panthor/panthor_sched.c   |  4 +-
 9 files changed, 102 insertions(+), 105 deletions(-)

-- 
2.46.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume()
  2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
  2024-11-14 10:50   ` Steven Price
  2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

WARN() will return true if the condition is true, false otherwise.
If we store the return of drm_WARN_ON() in ret, we lose the actual
error code.

Fixes: 5fe909cae118 ("drm/panthor: Add the device logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 4082c8f2951d..db7ba40f771d 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -441,8 +441,8 @@ int panthor_device_resume(struct device *dev)
 	    drm_dev_enter(&ptdev->base, &cookie)) {
 		panthor_gpu_resume(ptdev);
 		panthor_mmu_resume(ptdev);
-		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
-		if (!ret) {
+		ret = panthor_fw_resume(ptdev);
+		if (!drm_WARN_ON(&ptdev->base, ret)) {
 			panthor_sched_resume(ptdev);
 		} else {
 			panthor_mmu_suspend(ptdev);
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
  2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
  2024-11-13 15:54   ` Boris Brezillon
                     ` (2 more replies)
  2024-11-13 15:42 ` [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

The runtime PM resume operation is not guaranteed to succeed, but if it
fails, the device should be in a suspended state. Make sure we're robust
to resume failures in the unplug path.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_device.c | 23 ++++++++---------------
 drivers/gpu/drm/panthor/panthor_fw.c     | 14 +++++++++-----
 drivers/gpu/drm/panthor/panthor_gpu.c    |  3 ++-
 drivers/gpu/drm/panthor/panthor_mmu.c    |  3 ++-
 4 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index db7ba40f771d..8b5d54b2bbb4 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -48,6 +48,8 @@ static int panthor_clk_init(struct panthor_device *ptdev)
 
 void panthor_device_unplug(struct panthor_device *ptdev)
 {
+	int ret;
+
 	/* This function can be called from two different path: the reset work
 	 * and the platform device remove callback. drm_dev_unplug() doesn't
 	 * deal with concurrent callers, so we have to protect drm_dev_unplug()
@@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
 	 */
 	mutex_unlock(&ptdev->unplug.lock);
 
-	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
+	ret = pm_runtime_get_sync(ptdev->base.dev);
+	drm_WARN_ON(&ptdev->base, ret < 0);
 
 	/* Now, try to cleanly shutdown the GPU before the device resources
 	 * get reclaimed.
@@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
 	panthor_gpu_unplug(ptdev);
 
 	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
-	pm_runtime_put_sync_suspend(ptdev->base.dev);
+
+	/* If the resume failed, we don't need to suspend here. */
+	if (!ret)
+		pm_runtime_put_sync_suspend(ptdev->base.dev);
 
 	/* If PM is disabled, we need to call the suspend handler manually. */
 	if (!IS_ENABLED(CONFIG_PM))
@@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
 	clk_disable_unprepare(ptdev->clks.core);
 	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
 	return 0;
-
-err_set_active:
-	/* If something failed and we have to revert back to an
-	 * active state, we also need to clear the MMIO userspace
-	 * mappings, so any dumb pages that were mapped while we
-	 * were trying to suspend gets invalidated.
-	 */
-	mutex_lock(&ptdev->pm.mmio_lock);
-	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
-	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
-			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
-	mutex_unlock(&ptdev->pm.mmio_lock);
-	return ret;
 }
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index 2060085cc9f3..df74750cf1b7 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -12,6 +12,7 @@
 #include <linux/iosys-map.h>
 #include <linux/mutex.h>
 #include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
 
 #include <drm/drm_drv.h>
 #include <drm/drm_managed.h>
@@ -1188,11 +1189,13 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
 
 	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
 
-	/* Make sure the IRQ handler can be called after that point. */
-	if (ptdev->fw->irq.irq)
-		panthor_job_irq_suspend(&ptdev->fw->irq);
+	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
+		/* Make sure the IRQ handler can be called after that point. */
+		if (ptdev->fw->irq.irq)
+			panthor_job_irq_suspend(&ptdev->fw->irq);
 
-	panthor_fw_stop(ptdev);
+		panthor_fw_stop(ptdev);
+	}
 
 	list_for_each_entry(section, &ptdev->fw->sections, node)
 		panthor_kernel_bo_destroy(section->mem);
@@ -1205,7 +1208,8 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
 	panthor_vm_put(ptdev->fw->vm);
 	ptdev->fw->vm = NULL;
 
-	panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
+	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+		panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
 }
 
 /**
diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
index 2d3529a0b156..27702bc62dd6 100644
--- a/drivers/gpu/drm/panthor/panthor_gpu.c
+++ b/drivers/gpu/drm/panthor/panthor_gpu.c
@@ -174,7 +174,8 @@ void panthor_gpu_unplug(struct panthor_device *ptdev)
 	unsigned long flags;
 
 	/* Make sure the IRQ handler is not running after that point. */
-	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
+	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+		panthor_gpu_irq_suspend(&ptdev->gpu->irq);
 
 	/* Wake-up all waiters. */
 	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index c2262c72e8e2..fee6c7d9fe0a 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -2674,7 +2674,8 @@ int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm
  */
 void panthor_mmu_unplug(struct panthor_device *ptdev)
 {
-	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
+	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+		panthor_mmu_irq_suspend(&ptdev->mmu->irq);
 
 	mutex_lock(&ptdev->mmu->as.slots_lock);
 	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures
  2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
  2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
  2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
  2024-11-14 10:50   ` Steven Price
  2024-11-13 15:42 ` [PATCH 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
  2024-11-13 15:42 ` [PATCH 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
  4 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

devfreq_{resume,suspend}_device() don't bother undoing the suspend_count
modifications if something fails, so either it assumes failures are
harmless, or it's super fragile/buggy. In either case it's not something
we can address at the driver level, so let's just assume failures are
harmless for now, like is done in panfrost.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++++++------
 drivers/gpu/drm/panthor/panthor_devfreq.h |  4 ++--
 drivers/gpu/drm/panthor/panthor_device.c  | 22 +++-------------------
 3 files changed, 11 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
index 9d0f891b9b53..fadc2edb26fe 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.c
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
@@ -244,26 +244,26 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
 	return 0;
 }
 
-int panthor_devfreq_resume(struct panthor_device *ptdev)
+void panthor_devfreq_resume(struct panthor_device *ptdev)
 {
 	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
 
 	if (!pdevfreq->devfreq)
-		return 0;
+		return;
 
 	panthor_devfreq_reset(pdevfreq);
 
-	return devfreq_resume_device(pdevfreq->devfreq);
+	drm_WARN_ON(&ptdev->base, devfreq_resume_device(pdevfreq->devfreq));
 }
 
-int panthor_devfreq_suspend(struct panthor_device *ptdev)
+void panthor_devfreq_suspend(struct panthor_device *ptdev)
 {
 	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
 
 	if (!pdevfreq->devfreq)
-		return 0;
+		return;
 
-	return devfreq_suspend_device(pdevfreq->devfreq);
+	drm_WARN_ON(&ptdev->base, devfreq_suspend_device(pdevfreq->devfreq));
 }
 
 void panthor_devfreq_record_busy(struct panthor_device *ptdev)
diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.h b/drivers/gpu/drm/panthor/panthor_devfreq.h
index 83a5c9522493..b7631de695f7 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.h
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.h
@@ -12,8 +12,8 @@ struct panthor_devfreq;
 
 int panthor_devfreq_init(struct panthor_device *ptdev);
 
-int panthor_devfreq_resume(struct panthor_device *ptdev);
-int panthor_devfreq_suspend(struct panthor_device *ptdev);
+void panthor_devfreq_resume(struct panthor_device *ptdev);
+void panthor_devfreq_suspend(struct panthor_device *ptdev);
 
 void panthor_devfreq_record_busy(struct panthor_device *ptdev);
 void panthor_devfreq_record_idle(struct panthor_device *ptdev);
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 8b5d54b2bbb4..353f3aabef42 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -439,9 +439,7 @@ int panthor_device_resume(struct device *dev)
 	if (ret)
 		goto err_disable_stacks_clk;
 
-	ret = panthor_devfreq_resume(ptdev);
-	if (ret)
-		goto err_disable_coregroup_clk;
+	panthor_devfreq_resume(ptdev);
 
 	if (panthor_device_is_initialized(ptdev) &&
 	    drm_dev_enter(&ptdev->base, &cookie)) {
@@ -478,8 +476,6 @@ int panthor_device_resume(struct device *dev)
 
 err_suspend_devfreq:
 	panthor_devfreq_suspend(ptdev);
-
-err_disable_coregroup_clk:
 	clk_disable_unprepare(ptdev->clks.coregroup);
 
 err_disable_stacks_clk:
@@ -496,7 +492,7 @@ int panthor_device_resume(struct device *dev)
 int panthor_device_suspend(struct device *dev)
 {
 	struct panthor_device *ptdev = dev_get_drvdata(dev);
-	int ret, cookie;
+	int cookie;
 
 	if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
 		return -EINVAL;
@@ -528,19 +524,7 @@ int panthor_device_suspend(struct device *dev)
 		drm_dev_exit(cookie);
 	}
 
-	ret = panthor_devfreq_suspend(ptdev);
-	if (ret) {
-		if (panthor_device_is_initialized(ptdev) &&
-		    drm_dev_enter(&ptdev->base, &cookie)) {
-			panthor_gpu_resume(ptdev);
-			panthor_mmu_resume(ptdev);
-			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
-			panthor_sched_resume(ptdev);
-			drm_dev_exit(cookie);
-		}
-
-		goto err_set_active;
-	}
+	panthor_devfreq_suspend(ptdev);
 
 	clk_disable_unprepare(ptdev->clks.coregroup);
 	clk_disable_unprepare(ptdev->clks.stacks);
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] drm/panthor: Be robust against resume failures
  2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
                   ` (2 preceding siblings ...)
  2024-11-13 15:42 ` [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
  2024-11-14 10:51   ` Steven Price
  2024-11-13 15:42 ` [PATCH 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
  4 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

When the runtime PM resume callback returns an error, it puts the device
in a state where it can't be resumed anymore. Make sure we can recover
from such transient failures by calling pm_runtime_set_suspended()
explicitly after a pm_runtime_resume_and_get() failure.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_device.c |  1 +
 drivers/gpu/drm/panthor/panthor_device.h | 17 +++++++++++++++++
 drivers/gpu/drm/panthor/panthor_drv.c    |  2 +-
 drivers/gpu/drm/panthor/panthor_sched.c  |  4 ++--
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 353f3aabef42..d3276b936141 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -486,6 +486,7 @@ int panthor_device_resume(struct device *dev)
 
 err_set_suspended:
 	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
+	atomic_set(&ptdev->pm.recovery_needed, 1);
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 0e68f5a70d20..cc74e99e53f9 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -9,6 +9,7 @@
 #include <linux/atomic.h>
 #include <linux/io-pgtable.h>
 #include <linux/regulator/consumer.h>
+#include <linux/pm_runtime.h>
 #include <linux/sched.h>
 #include <linux/spinlock.h>
 
@@ -180,6 +181,9 @@ struct panthor_device {
 		 * is suspended.
 		 */
 		struct page *dummy_latest_flush;
+
+		/** @recovery_needed: True when a resume attempt failed. */
+		atomic_t recovery_needed;
 	} pm;
 
 	/** @profile_mask: User-set profiling flags for job accounting. */
@@ -243,6 +247,19 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
 int panthor_device_resume(struct device *dev);
 int panthor_device_suspend(struct device *dev);
 
+static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
+{
+	int ret = pm_runtime_resume_and_get(ptdev->base.dev);
+
+	/* If the resume failed, we need to clear the runtime_error, which we
+	 * can done by forcing the RPM state to suspended.
+	 */
+	if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
+		pm_runtime_set_suspended(ptdev->base.dev);
+
+	return ret;
+}
+
 enum drm_panthor_exception_type {
 	DRM_PANTHOR_EXCEPTION_OK = 0x00,
 	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 1498c97b4b85..b7a9adc918e3 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
 {
 	int ret;
 
-	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	ret = panthor_device_resume_and_get(ptdev);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 97ed5fe5a191..77b184c3fb0c 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
 	if (!drm_dev_enter(&ptdev->base, &cookie))
 		return;
 
-	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	ret = panthor_device_resume_and_get(ptdev);
 	if (drm_WARN_ON(&ptdev->base, ret))
 		goto out_dev_exit;
 
@@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
 		return dma_fence_get(job->done_fence);
 	}
 
-	ret = pm_runtime_resume_and_get(ptdev->base.dev);
+	ret = panthor_device_resume_and_get(ptdev);
 	if (drm_WARN_ON(&ptdev->base, ret))
 		return ERR_PTR(ret);
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/5] drm/panthor: Fix the fast-reset logic
  2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
                   ` (3 preceding siblings ...)
  2024-11-13 15:42 ` [PATCH 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
  2024-11-14 10:51   ` Steven Price
  4 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

If we do a GPU soft-reset, that's no longer fast reset. This also means
the slow reset fallback doesn't work because the MCU state is only reset
after a GPU soft-reset.

Let's move the retry logic to panthor_device_resume() to issue a
soft-reset between the fast and slow attempts, and patch
panthor_gpu_suspend() to only power-off the L2 when a fast reset is
requested.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_device.c | 32 ++++++++++++----
 drivers/gpu/drm/panthor/panthor_device.h | 11 ++++++
 drivers/gpu/drm/panthor/panthor_fw.c     | 48 ++++--------------------
 drivers/gpu/drm/panthor/panthor_gpu.c    | 11 +++---
 4 files changed, 49 insertions(+), 53 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index d3276b936141..2b93c56e1319 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -417,6 +417,22 @@ int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *
 	return 0;
 }
 
+static int panthor_device_resume_hw_components(struct panthor_device *ptdev)
+{
+	int ret;
+
+	panthor_gpu_resume(ptdev);
+	panthor_mmu_resume(ptdev);
+
+	ret = panthor_fw_resume(ptdev);
+	if (!ret)
+		return 0;
+
+	panthor_mmu_suspend(ptdev);
+	panthor_gpu_suspend(ptdev);
+	return ret;
+}
+
 int panthor_device_resume(struct device *dev)
 {
 	struct panthor_device *ptdev = dev_get_drvdata(dev);
@@ -443,16 +459,16 @@ int panthor_device_resume(struct device *dev)
 
 	if (panthor_device_is_initialized(ptdev) &&
 	    drm_dev_enter(&ptdev->base, &cookie)) {
-		panthor_gpu_resume(ptdev);
-		panthor_mmu_resume(ptdev);
-		ret = panthor_fw_resume(ptdev);
-		if (!drm_WARN_ON(&ptdev->base, ret)) {
-			panthor_sched_resume(ptdev);
-		} else {
-			panthor_mmu_suspend(ptdev);
-			panthor_gpu_suspend(ptdev);
+		ret = panthor_device_resume_hw_components(ptdev);
+		if (ret && ptdev->reset.fast) {
+			drm_err(&ptdev->base, "Fast reset failed, trying a slow reset");
+			ptdev->reset.fast = false;
+			ret = panthor_device_resume_hw_components(ptdev);
 		}
 
+		if (!ret)
+			panthor_sched_resume(ptdev);
+
 		drm_dev_exit(cookie);
 
 		if (ret)
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index cc74e99e53f9..cab7ce740ce3 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -157,6 +157,17 @@ struct panthor_device {
 
 		/** @pending: Set to true if a reset is pending. */
 		atomic_t pending;
+
+		/**
+		 * @fast: True if the post_reset logic can proceed with a fast reset.
+		 *
+		 * A fast reset is just a reset where the driver doesn't reload the FW sections.
+		 *
+		 * Any time the firmware is properly suspended, a fast reset can take place.
+		 * On the other hand, if the halt operation failed, the driver will reload
+		 * all FW sections to make sure we start from a fresh state.
+		 */
+		bool fast;
 	} reset;
 
 	/** @pm: Power management related data. */
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index df74750cf1b7..b4fe0de547d1 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -263,17 +263,6 @@ struct panthor_fw {
 	/** @booted: True is the FW is booted */
 	bool booted;
 
-	/**
-	 * @fast_reset: True if the post_reset logic can proceed with a fast reset.
-	 *
-	 * A fast reset is just a reset where the driver doesn't reload the FW sections.
-	 *
-	 * Any time the firmware is properly suspended, a fast reset can take place.
-	 * On the other hand, if the halt operation failed, the driver will reload
-	 * all sections to make sure we start from a fresh state.
-	 */
-	bool fast_reset;
-
 	/** @irq: Job irq data. */
 	struct panthor_irq irq;
 };
@@ -1090,7 +1079,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
 	/* Make sure we won't be woken up by a ping. */
 	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
 
-	ptdev->fw->fast_reset = false;
+	ptdev->reset.fast = false;
 
 	if (!on_hang) {
 		struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
@@ -1101,7 +1090,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
 		if (!readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
 					status == MCU_STATUS_HALT, 10, 100000) &&
 		    glb_iface->output->halt_status == PANTHOR_FW_HALT_OK) {
-			ptdev->fw->fast_reset = true;
+			ptdev->reset.fast = true;
 		} else {
 			drm_warn(&ptdev->base, "Failed to cleanly suspend MCU");
 		}
@@ -1131,41 +1120,20 @@ int panthor_fw_post_reset(struct panthor_device *ptdev)
 	if (ret)
 		return ret;
 
-	/* If this is a fast reset, try to start the MCU without reloading
-	 * the FW sections. If it fails, go for a full reset.
-	 */
-	if (ptdev->fw->fast_reset) {
-		ret = panthor_fw_start(ptdev);
-		if (!ret)
-			goto out;
-
-		/* Forcibly reset the MCU and force a slow reset, so we get a
-		 * fresh boot on the next panthor_fw_start() call.
-		 */
-		panthor_fw_stop(ptdev);
-		ptdev->fw->fast_reset = false;
-		drm_err(&ptdev->base, "FW fast reset failed, trying a slow reset");
-
-		ret = panthor_vm_flush_all(ptdev->fw->vm);
-		if (ret) {
-			drm_err(&ptdev->base, "FW slow reset failed (couldn't flush FW's AS l2cache)");
-			return ret;
-		}
-	}
-
-	/* Reload all sections, including RO ones. We're not supposed
-	 * to end up here anyway, let's just assume the overhead of
+	/* On a slow reset, reload all sections, including RO ones. We're not
+	 * supposed to end up here anyway, let's just assume the overhead of
 	 * reloading everything is acceptable.
 	 */
-	panthor_reload_fw_sections(ptdev, true);
+	if (!ptdev->reset.fast)
+		panthor_reload_fw_sections(ptdev, true);
 
 	ret = panthor_fw_start(ptdev);
 	if (ret) {
-		drm_err(&ptdev->base, "FW slow reset failed (couldn't start the FW )");
+		drm_err(&ptdev->base, "FW %s reset failed",
+			ptdev->reset.fast ?  "fast" : "slow");
 		return ret;
 	}
 
-out:
 	/* We must re-initialize the global interface even on fast-reset. */
 	panthor_fw_init_global_iface(ptdev);
 	return 0;
diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
index 27702bc62dd6..ce3b68124a0a 100644
--- a/drivers/gpu/drm/panthor/panthor_gpu.c
+++ b/drivers/gpu/drm/panthor/panthor_gpu.c
@@ -461,11 +461,12 @@ int panthor_gpu_soft_reset(struct panthor_device *ptdev)
  */
 void panthor_gpu_suspend(struct panthor_device *ptdev)
 {
-	/*
-	 * It may be preferable to simply power down the L2, but for now just
-	 * soft-reset which will leave the L2 powered down.
-	 */
-	panthor_gpu_soft_reset(ptdev);
+	/* On a fast reset, simply power down the L2. */
+	if (!ptdev->reset.fast)
+		panthor_gpu_soft_reset(ptdev);
+	else
+		panthor_gpu_power_off(ptdev, L2, 1, 20000);
+
 	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
 }
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
@ 2024-11-13 15:54   ` Boris Brezillon
  2024-11-13 15:59   ` Boris Brezillon
  2024-11-14 11:13   ` Liviu Dudau
  2 siblings, 0 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:54 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

On Wed, 13 Nov 2024 16:42:54 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> @@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
>  	 */
>  	mutex_unlock(&ptdev->unplug.lock);
>  
> -	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> +	ret = pm_runtime_get_sync(ptdev->base.dev);
> +	drm_WARN_ON(&ptdev->base, ret < 0);
>  
>  	/* Now, try to cleanly shutdown the GPU before the device resources
>  	 * get reclaimed.
> @@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
>  	panthor_gpu_unplug(ptdev);
>  
>  	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> -	pm_runtime_put_sync_suspend(ptdev->base.dev);
> +
> +	/* If the resume failed, we don't need to suspend here. */
> +	if (!ret)
> +		pm_runtime_put_sync_suspend(ptdev->base.dev);

Okay, I always get confused by pm_runtime_get_sync(). Turns out the
refcount is incremented even if pm_runtime_get_sync() fails, so we
should call pm_runtime_put_sync_suspend() unconditionally here.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
  2024-11-13 15:54   ` Boris Brezillon
@ 2024-11-13 15:59   ` Boris Brezillon
  2024-11-14 11:13   ` Liviu Dudau
  2 siblings, 0 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:59 UTC (permalink / raw)
  To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
  Cc: dri-devel, kernel

On Wed, 13 Nov 2024 16:42:54 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> @@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
>  	clk_disable_unprepare(ptdev->clks.core);
>  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
>  	return 0;
> -
> -err_set_active:
> -	/* If something failed and we have to revert back to an
> -	 * active state, we also need to clear the MMIO userspace
> -	 * mappings, so any dumb pages that were mapped while we
> -	 * were trying to suspend gets invalidated.
> -	 */
> -	mutex_lock(&ptdev->pm.mmio_lock);
> -	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
> -	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> -			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> -	mutex_unlock(&ptdev->pm.mmio_lock);
> -	return ret;
>  }

This chunk should be in patch 3.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume()
  2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
@ 2024-11-14 10:50   ` Steven Price
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Price @ 2024-11-14 10:50 UTC (permalink / raw)
  To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel

On 13/11/2024 15:42, Boris Brezillon wrote:
> WARN() will return true if the condition is true, false otherwise.
> If we store the return of drm_WARN_ON() in ret, we lose the actual
> error code.
> 
> Fixes: 5fe909cae118 ("drm/panthor: Add the device logical block")
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 4082c8f2951d..db7ba40f771d 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -441,8 +441,8 @@ int panthor_device_resume(struct device *dev)
>  	    drm_dev_enter(&ptdev->base, &cookie)) {
>  		panthor_gpu_resume(ptdev);
>  		panthor_mmu_resume(ptdev);
> -		ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> -		if (!ret) {
> +		ret = panthor_fw_resume(ptdev);
> +		if (!drm_WARN_ON(&ptdev->base, ret)) {
>  			panthor_sched_resume(ptdev);
>  		} else {
>  			panthor_mmu_suspend(ptdev);


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures
  2024-11-13 15:42 ` [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
@ 2024-11-14 10:50   ` Steven Price
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Price @ 2024-11-14 10:50 UTC (permalink / raw)
  To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel

On 13/11/2024 15:42, Boris Brezillon wrote:
> devfreq_{resume,suspend}_device() don't bother undoing the suspend_count
> modifications if something fails, so either it assumes failures are
> harmless, or it's super fragile/buggy. In either case it's not something
> we can address at the driver level, so let's just assume failures are
> harmless for now, like is done in panfrost.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

With the missing chunk from patch 2:

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++++++------
>  drivers/gpu/drm/panthor/panthor_devfreq.h |  4 ++--
>  drivers/gpu/drm/panthor/panthor_device.c  | 22 +++-------------------
>  3 files changed, 11 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
> index 9d0f891b9b53..fadc2edb26fe 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.c
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
> @@ -244,26 +244,26 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>  	return 0;
>  }
>  
> -int panthor_devfreq_resume(struct panthor_device *ptdev)
> +void panthor_devfreq_resume(struct panthor_device *ptdev)
>  {
>  	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
>  
>  	if (!pdevfreq->devfreq)
> -		return 0;
> +		return;
>  
>  	panthor_devfreq_reset(pdevfreq);
>  
> -	return devfreq_resume_device(pdevfreq->devfreq);
> +	drm_WARN_ON(&ptdev->base, devfreq_resume_device(pdevfreq->devfreq));
>  }
>  
> -int panthor_devfreq_suspend(struct panthor_device *ptdev)
> +void panthor_devfreq_suspend(struct panthor_device *ptdev)
>  {
>  	struct panthor_devfreq *pdevfreq = ptdev->devfreq;
>  
>  	if (!pdevfreq->devfreq)
> -		return 0;
> +		return;
>  
> -	return devfreq_suspend_device(pdevfreq->devfreq);
> +	drm_WARN_ON(&ptdev->base, devfreq_suspend_device(pdevfreq->devfreq));
>  }
>  
>  void panthor_devfreq_record_busy(struct panthor_device *ptdev)
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.h b/drivers/gpu/drm/panthor/panthor_devfreq.h
> index 83a5c9522493..b7631de695f7 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.h
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.h
> @@ -12,8 +12,8 @@ struct panthor_devfreq;
>  
>  int panthor_devfreq_init(struct panthor_device *ptdev);
>  
> -int panthor_devfreq_resume(struct panthor_device *ptdev);
> -int panthor_devfreq_suspend(struct panthor_device *ptdev);
> +void panthor_devfreq_resume(struct panthor_device *ptdev);
> +void panthor_devfreq_suspend(struct panthor_device *ptdev);
>  
>  void panthor_devfreq_record_busy(struct panthor_device *ptdev);
>  void panthor_devfreq_record_idle(struct panthor_device *ptdev);
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 8b5d54b2bbb4..353f3aabef42 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -439,9 +439,7 @@ int panthor_device_resume(struct device *dev)
>  	if (ret)
>  		goto err_disable_stacks_clk;
>  
> -	ret = panthor_devfreq_resume(ptdev);
> -	if (ret)
> -		goto err_disable_coregroup_clk;
> +	panthor_devfreq_resume(ptdev);
>  
>  	if (panthor_device_is_initialized(ptdev) &&
>  	    drm_dev_enter(&ptdev->base, &cookie)) {
> @@ -478,8 +476,6 @@ int panthor_device_resume(struct device *dev)
>  
>  err_suspend_devfreq:
>  	panthor_devfreq_suspend(ptdev);
> -
> -err_disable_coregroup_clk:
>  	clk_disable_unprepare(ptdev->clks.coregroup);
>  
>  err_disable_stacks_clk:
> @@ -496,7 +492,7 @@ int panthor_device_resume(struct device *dev)
>  int panthor_device_suspend(struct device *dev)
>  {
>  	struct panthor_device *ptdev = dev_get_drvdata(dev);
> -	int ret, cookie;
> +	int cookie;
>  
>  	if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
>  		return -EINVAL;
> @@ -528,19 +524,7 @@ int panthor_device_suspend(struct device *dev)
>  		drm_dev_exit(cookie);
>  	}
>  
> -	ret = panthor_devfreq_suspend(ptdev);
> -	if (ret) {
> -		if (panthor_device_is_initialized(ptdev) &&
> -		    drm_dev_enter(&ptdev->base, &cookie)) {
> -			panthor_gpu_resume(ptdev);
> -			panthor_mmu_resume(ptdev);
> -			drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> -			panthor_sched_resume(ptdev);
> -			drm_dev_exit(cookie);
> -		}
> -
> -		goto err_set_active;
> -	}
> +	panthor_devfreq_suspend(ptdev);
>  
>  	clk_disable_unprepare(ptdev->clks.coregroup);
>  	clk_disable_unprepare(ptdev->clks.stacks);


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] drm/panthor: Be robust against resume failures
  2024-11-13 15:42 ` [PATCH 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
@ 2024-11-14 10:51   ` Steven Price
  2024-11-14 11:24     ` Boris Brezillon
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Price @ 2024-11-14 10:51 UTC (permalink / raw)
  To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel

On 13/11/2024 15:42, Boris Brezillon wrote:
> When the runtime PM resume callback returns an error, it puts the device
> in a state where it can't be resumed anymore. Make sure we can recover
> from such transient failures by calling pm_runtime_set_suspended()
> explicitly after a pm_runtime_resume_and_get() failure.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c |  1 +
>  drivers/gpu/drm/panthor/panthor_device.h | 17 +++++++++++++++++
>  drivers/gpu/drm/panthor/panthor_drv.c    |  2 +-
>  drivers/gpu/drm/panthor/panthor_sched.c  |  4 ++--
>  4 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 353f3aabef42..d3276b936141 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -486,6 +486,7 @@ int panthor_device_resume(struct device *dev)
>  
>  err_set_suspended:
>  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> +	atomic_set(&ptdev->pm.recovery_needed, 1);
>  	return ret;
>  }
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 0e68f5a70d20..cc74e99e53f9 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -9,6 +9,7 @@
>  #include <linux/atomic.h>
>  #include <linux/io-pgtable.h>
>  #include <linux/regulator/consumer.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/sched.h>
>  #include <linux/spinlock.h>
>  
> @@ -180,6 +181,9 @@ struct panthor_device {
>  		 * is suspended.
>  		 */
>  		struct page *dummy_latest_flush;
> +
> +		/** @recovery_needed: True when a resume attempt failed. */
> +		atomic_t recovery_needed;
>  	} pm;
>  
>  	/** @profile_mask: User-set profiling flags for job accounting. */
> @@ -243,6 +247,19 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
>  int panthor_device_resume(struct device *dev);
>  int panthor_device_suspend(struct device *dev);
>  
> +static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
> +{
> +	int ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +
> +	/* If the resume failed, we need to clear the runtime_error, which we
> +	 * can done by forcing the RPM state to suspended.
> +	 */
> +	if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)

I'm unclear what this atomic achieves. At first glance it appears
pointless: with this change if panthor_device_resume() fails then
recovery_needed is set to 1. So logically if ret != 0 then also
recovery_needed == 1.

My second thought was is this to avoid races? If multiple threads are
calling this then only one will win the cmpxchg and call
pm_runtime_set_suspended. But it's not safe - it's quite possible for
the first thread to get past the cmpxchg and be suspended before the
second thread comes along and reaches the same point - leading to
multiple calls to pm_runtime_set_suspended().

So I'm afraid I don't understand this atomic - what am I missing?

Thanks,
Steve

> +		pm_runtime_set_suspended(ptdev->base.dev);
> +
> +	return ret;
> +}
> +
>  enum drm_panthor_exception_type {
>  	DRM_PANTHOR_EXCEPTION_OK = 0x00,
>  	DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 1498c97b4b85..b7a9adc918e3 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
>  {
>  	int ret;
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (ret)
>  		return ret;
>  
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 97ed5fe5a191..77b184c3fb0c 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
>  	if (!drm_dev_enter(&ptdev->base, &cookie))
>  		return;
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (drm_WARN_ON(&ptdev->base, ret))
>  		goto out_dev_exit;
>  
> @@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
>  		return dma_fence_get(job->done_fence);
>  	}
>  
> -	ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +	ret = panthor_device_resume_and_get(ptdev);
>  	if (drm_WARN_ON(&ptdev->base, ret))
>  		return ERR_PTR(ret);
>  


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/5] drm/panthor: Fix the fast-reset logic
  2024-11-13 15:42 ` [PATCH 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
@ 2024-11-14 10:51   ` Steven Price
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Price @ 2024-11-14 10:51 UTC (permalink / raw)
  To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel

On 13/11/2024 15:42, Boris Brezillon wrote:
> If we do a GPU soft-reset, that's no longer fast reset. This also means
> the slow reset fallback doesn't work because the MCU state is only reset
> after a GPU soft-reset.
> 
> Let's move the retry logic to panthor_device_resume() to issue a
> soft-reset between the fast and slow attempts, and patch
> panthor_gpu_suspend() to only power-off the L2 when a fast reset is
> requested.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

I also don't have a good way of testing this, but it looks right and it
doesn't blow up in the "normal" case, so:

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panthor/panthor_device.c | 32 ++++++++++++----
>  drivers/gpu/drm/panthor/panthor_device.h | 11 ++++++
>  drivers/gpu/drm/panthor/panthor_fw.c     | 48 ++++--------------------
>  drivers/gpu/drm/panthor/panthor_gpu.c    | 11 +++---
>  4 files changed, 49 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index d3276b936141..2b93c56e1319 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -417,6 +417,22 @@ int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *
>  	return 0;
>  }
>  
> +static int panthor_device_resume_hw_components(struct panthor_device *ptdev)
> +{
> +	int ret;
> +
> +	panthor_gpu_resume(ptdev);
> +	panthor_mmu_resume(ptdev);
> +
> +	ret = panthor_fw_resume(ptdev);
> +	if (!ret)
> +		return 0;
> +
> +	panthor_mmu_suspend(ptdev);
> +	panthor_gpu_suspend(ptdev);
> +	return ret;
> +}
> +
>  int panthor_device_resume(struct device *dev)
>  {
>  	struct panthor_device *ptdev = dev_get_drvdata(dev);
> @@ -443,16 +459,16 @@ int panthor_device_resume(struct device *dev)
>  
>  	if (panthor_device_is_initialized(ptdev) &&
>  	    drm_dev_enter(&ptdev->base, &cookie)) {
> -		panthor_gpu_resume(ptdev);
> -		panthor_mmu_resume(ptdev);
> -		ret = panthor_fw_resume(ptdev);
> -		if (!drm_WARN_ON(&ptdev->base, ret)) {
> -			panthor_sched_resume(ptdev);
> -		} else {
> -			panthor_mmu_suspend(ptdev);
> -			panthor_gpu_suspend(ptdev);
> +		ret = panthor_device_resume_hw_components(ptdev);
> +		if (ret && ptdev->reset.fast) {
> +			drm_err(&ptdev->base, "Fast reset failed, trying a slow reset");
> +			ptdev->reset.fast = false;
> +			ret = panthor_device_resume_hw_components(ptdev);
>  		}
>  
> +		if (!ret)
> +			panthor_sched_resume(ptdev);
> +
>  		drm_dev_exit(cookie);
>  
>  		if (ret)
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index cc74e99e53f9..cab7ce740ce3 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -157,6 +157,17 @@ struct panthor_device {
>  
>  		/** @pending: Set to true if a reset is pending. */
>  		atomic_t pending;
> +
> +		/**
> +		 * @fast: True if the post_reset logic can proceed with a fast reset.
> +		 *
> +		 * A fast reset is just a reset where the driver doesn't reload the FW sections.
> +		 *
> +		 * Any time the firmware is properly suspended, a fast reset can take place.
> +		 * On the other hand, if the halt operation failed, the driver will reload
> +		 * all FW sections to make sure we start from a fresh state.
> +		 */
> +		bool fast;
>  	} reset;
>  
>  	/** @pm: Power management related data. */
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index df74750cf1b7..b4fe0de547d1 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -263,17 +263,6 @@ struct panthor_fw {
>  	/** @booted: True is the FW is booted */
>  	bool booted;
>  
> -	/**
> -	 * @fast_reset: True if the post_reset logic can proceed with a fast reset.
> -	 *
> -	 * A fast reset is just a reset where the driver doesn't reload the FW sections.
> -	 *
> -	 * Any time the firmware is properly suspended, a fast reset can take place.
> -	 * On the other hand, if the halt operation failed, the driver will reload
> -	 * all sections to make sure we start from a fresh state.
> -	 */
> -	bool fast_reset;
> -
>  	/** @irq: Job irq data. */
>  	struct panthor_irq irq;
>  };
> @@ -1090,7 +1079,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
>  	/* Make sure we won't be woken up by a ping. */
>  	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
>  
> -	ptdev->fw->fast_reset = false;
> +	ptdev->reset.fast = false;
>  
>  	if (!on_hang) {
>  		struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> @@ -1101,7 +1090,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
>  		if (!readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
>  					status == MCU_STATUS_HALT, 10, 100000) &&
>  		    glb_iface->output->halt_status == PANTHOR_FW_HALT_OK) {
> -			ptdev->fw->fast_reset = true;
> +			ptdev->reset.fast = true;
>  		} else {
>  			drm_warn(&ptdev->base, "Failed to cleanly suspend MCU");
>  		}
> @@ -1131,41 +1120,20 @@ int panthor_fw_post_reset(struct panthor_device *ptdev)
>  	if (ret)
>  		return ret;
>  
> -	/* If this is a fast reset, try to start the MCU without reloading
> -	 * the FW sections. If it fails, go for a full reset.
> -	 */
> -	if (ptdev->fw->fast_reset) {
> -		ret = panthor_fw_start(ptdev);
> -		if (!ret)
> -			goto out;
> -
> -		/* Forcibly reset the MCU and force a slow reset, so we get a
> -		 * fresh boot on the next panthor_fw_start() call.
> -		 */
> -		panthor_fw_stop(ptdev);
> -		ptdev->fw->fast_reset = false;
> -		drm_err(&ptdev->base, "FW fast reset failed, trying a slow reset");
> -
> -		ret = panthor_vm_flush_all(ptdev->fw->vm);
> -		if (ret) {
> -			drm_err(&ptdev->base, "FW slow reset failed (couldn't flush FW's AS l2cache)");
> -			return ret;
> -		}
> -	}
> -
> -	/* Reload all sections, including RO ones. We're not supposed
> -	 * to end up here anyway, let's just assume the overhead of
> +	/* On a slow reset, reload all sections, including RO ones. We're not
> +	 * supposed to end up here anyway, let's just assume the overhead of
>  	 * reloading everything is acceptable.
>  	 */
> -	panthor_reload_fw_sections(ptdev, true);
> +	if (!ptdev->reset.fast)
> +		panthor_reload_fw_sections(ptdev, true);
>  
>  	ret = panthor_fw_start(ptdev);
>  	if (ret) {
> -		drm_err(&ptdev->base, "FW slow reset failed (couldn't start the FW )");
> +		drm_err(&ptdev->base, "FW %s reset failed",
> +			ptdev->reset.fast ?  "fast" : "slow");
>  		return ret;
>  	}
>  
> -out:
>  	/* We must re-initialize the global interface even on fast-reset. */
>  	panthor_fw_init_global_iface(ptdev);
>  	return 0;
> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
> index 27702bc62dd6..ce3b68124a0a 100644
> --- a/drivers/gpu/drm/panthor/panthor_gpu.c
> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
> @@ -461,11 +461,12 @@ int panthor_gpu_soft_reset(struct panthor_device *ptdev)
>   */
>  void panthor_gpu_suspend(struct panthor_device *ptdev)
>  {
> -	/*
> -	 * It may be preferable to simply power down the L2, but for now just
> -	 * soft-reset which will leave the L2 powered down.
> -	 */
> -	panthor_gpu_soft_reset(ptdev);
> +	/* On a fast reset, simply power down the L2. */
> +	if (!ptdev->reset.fast)
> +		panthor_gpu_soft_reset(ptdev);
> +	else
> +		panthor_gpu_power_off(ptdev, L2, 1, 20000);
> +
>  	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
>  }
>  


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
  2024-11-13 15:54   ` Boris Brezillon
  2024-11-13 15:59   ` Boris Brezillon
@ 2024-11-14 11:13   ` Liviu Dudau
  2024-11-14 11:27     ` Boris Brezillon
  2 siblings, 1 reply; 16+ messages in thread
From: Liviu Dudau @ 2024-11-14 11:13 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Steven Price, Adrián Larumbe, dri-devel, kernel

On Wed, Nov 13, 2024 at 04:42:54PM +0100, Boris Brezillon wrote:
> The runtime PM resume operation is not guaranteed to succeed, but if it
> fails, the device should be in a suspended state. Make sure we're robust
> to resume failures in the unplug path.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panthor/panthor_device.c | 23 ++++++++---------------
>  drivers/gpu/drm/panthor/panthor_fw.c     | 14 +++++++++-----
>  drivers/gpu/drm/panthor/panthor_gpu.c    |  3 ++-
>  drivers/gpu/drm/panthor/panthor_mmu.c    |  3 ++-
>  4 files changed, 21 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index db7ba40f771d..8b5d54b2bbb4 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -48,6 +48,8 @@ static int panthor_clk_init(struct panthor_device *ptdev)
>  
>  void panthor_device_unplug(struct panthor_device *ptdev)
>  {
> +	int ret;
> +
>  	/* This function can be called from two different path: the reset work
>  	 * and the platform device remove callback. drm_dev_unplug() doesn't
>  	 * deal with concurrent callers, so we have to protect drm_dev_unplug()
> @@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
>  	 */
>  	mutex_unlock(&ptdev->unplug.lock);
>  
> -	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> +	ret = pm_runtime_get_sync(ptdev->base.dev);
> +	drm_WARN_ON(&ptdev->base, ret < 0);
>  
>  	/* Now, try to cleanly shutdown the GPU before the device resources
>  	 * get reclaimed.
> @@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
>  	panthor_gpu_unplug(ptdev);
>  
>  	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> -	pm_runtime_put_sync_suspend(ptdev->base.dev);
> +
> +	/* If the resume failed, we don't need to suspend here. */
> +	if (!ret)
> +		pm_runtime_put_sync_suspend(ptdev->base.dev);
>  
>  	/* If PM is disabled, we need to call the suspend handler manually. */
>  	if (!IS_ENABLED(CONFIG_PM))
> @@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
>  	clk_disable_unprepare(ptdev->clks.core);
>  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
>  	return 0;
> -
> -err_set_active:

Label is being deleted here, but I don't see where the `goto err_set_active` is being removed.

Best regards,
Liviu

> -	/* If something failed and we have to revert back to an
> -	 * active state, we also need to clear the MMIO userspace
> -	 * mappings, so any dumb pages that were mapped while we
> -	 * were trying to suspend gets invalidated.
> -	 */
> -	mutex_lock(&ptdev->pm.mmio_lock);
> -	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
> -	unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> -			    DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> -	mutex_unlock(&ptdev->pm.mmio_lock);
> -	return ret;
>  }
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index 2060085cc9f3..df74750cf1b7 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -12,6 +12,7 @@
>  #include <linux/iosys-map.h>
>  #include <linux/mutex.h>
>  #include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
>  
>  #include <drm/drm_drv.h>
>  #include <drm/drm_managed.h>
> @@ -1188,11 +1189,13 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
>  
>  	cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
>  
> -	/* Make sure the IRQ handler can be called after that point. */
> -	if (ptdev->fw->irq.irq)
> -		panthor_job_irq_suspend(&ptdev->fw->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
> +		/* Make sure the IRQ handler can be called after that point. */
> +		if (ptdev->fw->irq.irq)
> +			panthor_job_irq_suspend(&ptdev->fw->irq);
>  
> -	panthor_fw_stop(ptdev);
> +		panthor_fw_stop(ptdev);
> +	}
>  
>  	list_for_each_entry(section, &ptdev->fw->sections, node)
>  		panthor_kernel_bo_destroy(section->mem);
> @@ -1205,7 +1208,8 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
>  	panthor_vm_put(ptdev->fw->vm);
>  	ptdev->fw->vm = NULL;
>  
> -	panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
> index 2d3529a0b156..27702bc62dd6 100644
> --- a/drivers/gpu/drm/panthor/panthor_gpu.c
> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
> @@ -174,7 +174,8 @@ void panthor_gpu_unplug(struct panthor_device *ptdev)
>  	unsigned long flags;
>  
>  	/* Make sure the IRQ handler is not running after that point. */
> -	panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_gpu_irq_suspend(&ptdev->gpu->irq);
>  
>  	/* Wake-up all waiters. */
>  	spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> index c2262c72e8e2..fee6c7d9fe0a 100644
> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -2674,7 +2674,8 @@ int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm
>   */
>  void panthor_mmu_unplug(struct panthor_device *ptdev)
>  {
> -	panthor_mmu_irq_suspend(&ptdev->mmu->irq);
> +	if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> +		panthor_mmu_irq_suspend(&ptdev->mmu->irq);
>  
>  	mutex_lock(&ptdev->mmu->as.slots_lock);
>  	for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
> -- 
> 2.46.2
> 

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] drm/panthor: Be robust against resume failures
  2024-11-14 10:51   ` Steven Price
@ 2024-11-14 11:24     ` Boris Brezillon
  0 siblings, 0 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-14 11:24 UTC (permalink / raw)
  To: Steven Price; +Cc: Liviu Dudau, Adrián Larumbe, dri-devel, kernel

On Thu, 14 Nov 2024 10:51:01 +0000
Steven Price <steven.price@arm.com> wrote:

> On 13/11/2024 15:42, Boris Brezillon wrote:
> > When the runtime PM resume callback returns an error, it puts the device
> > in a state where it can't be resumed anymore. Make sure we can recover
> > from such transient failures by calling pm_runtime_set_suspended()
> > explicitly after a pm_runtime_resume_and_get() failure.
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_device.c |  1 +
> >  drivers/gpu/drm/panthor/panthor_device.h | 17 +++++++++++++++++
> >  drivers/gpu/drm/panthor/panthor_drv.c    |  2 +-
> >  drivers/gpu/drm/panthor/panthor_sched.c  |  4 ++--
> >  4 files changed, 21 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > index 353f3aabef42..d3276b936141 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.c
> > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > @@ -486,6 +486,7 @@ int panthor_device_resume(struct device *dev)
> >  
> >  err_set_suspended:
> >  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > +	atomic_set(&ptdev->pm.recovery_needed, 1);
> >  	return ret;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > index 0e68f5a70d20..cc74e99e53f9 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -9,6 +9,7 @@
> >  #include <linux/atomic.h>
> >  #include <linux/io-pgtable.h>
> >  #include <linux/regulator/consumer.h>
> > +#include <linux/pm_runtime.h>
> >  #include <linux/sched.h>
> >  #include <linux/spinlock.h>
> >  
> > @@ -180,6 +181,9 @@ struct panthor_device {
> >  		 * is suspended.
> >  		 */
> >  		struct page *dummy_latest_flush;
> > +
> > +		/** @recovery_needed: True when a resume attempt failed. */
> > +		atomic_t recovery_needed;
> >  	} pm;
> >  
> >  	/** @profile_mask: User-set profiling flags for job accounting. */
> > @@ -243,6 +247,19 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
> >  int panthor_device_resume(struct device *dev);
> >  int panthor_device_suspend(struct device *dev);
> >  
> > +static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
> > +{
> > +	int ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > +
> > +	/* If the resume failed, we need to clear the runtime_error, which we
> > +	 * can done by forcing the RPM state to suspended.
> > +	 */
> > +	if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)  
> 
> I'm unclear what this atomic achieves. At first glance it appears
> pointless: with this change if panthor_device_resume() fails then
> recovery_needed is set to 1. So logically if ret != 0 then also
> recovery_needed == 1.
> 
> My second thought was is this to avoid races? If multiple threads are
> calling this then only one will win the cmpxchg and call
> pm_runtime_set_suspended. But it's not safe - it's quite possible for
> the first thread to get past the cmpxchg and be suspended before the
> second thread comes along and reaches the same point - leading to
> multiple calls to pm_runtime_set_suspended().

Yes, it was here to avoid the race. I don't think there's a risk of
multiple threads calling pm_runtime_set_suspended() without
actually needing such a call, because we won't reach
panthor_device_resume() until the runtime_error has been cleared
(runtime PM bails out early with a -EINVAL). So, in practice, there's no
way two threads can see a recovery_needed=1 unless the error has already
been cleared by the other thread and the second thread triggered
another resume error, in which case the second
pm_runtime_set_suspended() call is legit.

But now that you mention it, it indeed doesn't prevent the second
thread to call pm_runtime_resume_and_get() before the PM runtime_error
has been cleared, leading to potential spurious errors, so that's
annoying.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-14 11:13   ` Liviu Dudau
@ 2024-11-14 11:27     ` Boris Brezillon
  2024-11-14 11:36       ` Liviu Dudau
  0 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-14 11:27 UTC (permalink / raw)
  To: Liviu Dudau; +Cc: Steven Price, Adrián Larumbe, dri-devel, kernel

On Thu, 14 Nov 2024 11:13:29 +0000
Liviu Dudau <liviu.dudau@arm.com> wrote:

> On Wed, Nov 13, 2024 at 04:42:54PM +0100, Boris Brezillon wrote:
> > The runtime PM resume operation is not guaranteed to succeed, but if it
> > fails, the device should be in a suspended state. Make sure we're robust
> > to resume failures in the unplug path.
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panthor/panthor_device.c | 23 ++++++++---------------
> >  drivers/gpu/drm/panthor/panthor_fw.c     | 14 +++++++++-----
> >  drivers/gpu/drm/panthor/panthor_gpu.c    |  3 ++-
> >  drivers/gpu/drm/panthor/panthor_mmu.c    |  3 ++-
> >  4 files changed, 21 insertions(+), 22 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > index db7ba40f771d..8b5d54b2bbb4 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.c
> > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > @@ -48,6 +48,8 @@ static int panthor_clk_init(struct panthor_device *ptdev)
> >  
> >  void panthor_device_unplug(struct panthor_device *ptdev)
> >  {
> > +	int ret;
> > +
> >  	/* This function can be called from two different path: the reset work
> >  	 * and the platform device remove callback. drm_dev_unplug() doesn't
> >  	 * deal with concurrent callers, so we have to protect drm_dev_unplug()
> > @@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> >  	 */
> >  	mutex_unlock(&ptdev->unplug.lock);
> >  
> > -	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> > +	ret = pm_runtime_get_sync(ptdev->base.dev);
> > +	drm_WARN_ON(&ptdev->base, ret < 0);
> >  
> >  	/* Now, try to cleanly shutdown the GPU before the device resources
> >  	 * get reclaimed.
> > @@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> >  	panthor_gpu_unplug(ptdev);
> >  
> >  	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> > -	pm_runtime_put_sync_suspend(ptdev->base.dev);
> > +
> > +	/* If the resume failed, we don't need to suspend here. */
> > +	if (!ret)
> > +		pm_runtime_put_sync_suspend(ptdev->base.dev);
> >  
> >  	/* If PM is disabled, we need to call the suspend handler manually. */
> >  	if (!IS_ENABLED(CONFIG_PM))
> > @@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
> >  	clk_disable_unprepare(ptdev->clks.core);
> >  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> >  	return 0;
> > -
> > -err_set_active:  
> 
> Label is being deleted here, but I don't see where the `goto err_set_active` is being removed.

Yes, as mentioned in my own reply, this chunk should be in patch 3.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
  2024-11-14 11:27     ` Boris Brezillon
@ 2024-11-14 11:36       ` Liviu Dudau
  0 siblings, 0 replies; 16+ messages in thread
From: Liviu Dudau @ 2024-11-14 11:36 UTC (permalink / raw)
  To: Boris Brezillon; +Cc: Steven Price, Adrián Larumbe, dri-devel, kernel

On Thu, Nov 14, 2024 at 12:27:55PM +0100, Boris Brezillon wrote:
> On Thu, 14 Nov 2024 11:13:29 +0000
> Liviu Dudau <liviu.dudau@arm.com> wrote:
> 
> > On Wed, Nov 13, 2024 at 04:42:54PM +0100, Boris Brezillon wrote:
> > > The runtime PM resume operation is not guaranteed to succeed, but if it
> > > fails, the device should be in a suspended state. Make sure we're robust
> > > to resume failures in the unplug path.
> > > 
> > > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > ---
> > >  drivers/gpu/drm/panthor/panthor_device.c | 23 ++++++++---------------
> > >  drivers/gpu/drm/panthor/panthor_fw.c     | 14 +++++++++-----
> > >  drivers/gpu/drm/panthor/panthor_gpu.c    |  3 ++-
> > >  drivers/gpu/drm/panthor/panthor_mmu.c    |  3 ++-
> > >  4 files changed, 21 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > > index db7ba40f771d..8b5d54b2bbb4 100644
> > > --- a/drivers/gpu/drm/panthor/panthor_device.c
> > > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > > @@ -48,6 +48,8 @@ static int panthor_clk_init(struct panthor_device *ptdev)
> > >  
> > >  void panthor_device_unplug(struct panthor_device *ptdev)
> > >  {
> > > +	int ret;
> > > +
> > >  	/* This function can be called from two different path: the reset work
> > >  	 * and the platform device remove callback. drm_dev_unplug() doesn't
> > >  	 * deal with concurrent callers, so we have to protect drm_dev_unplug()
> > > @@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> > >  	 */
> > >  	mutex_unlock(&ptdev->unplug.lock);
> > >  
> > > -	drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> > > +	ret = pm_runtime_get_sync(ptdev->base.dev);
> > > +	drm_WARN_ON(&ptdev->base, ret < 0);
> > >  
> > >  	/* Now, try to cleanly shutdown the GPU before the device resources
> > >  	 * get reclaimed.
> > > @@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> > >  	panthor_gpu_unplug(ptdev);
> > >  
> > >  	pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> > > -	pm_runtime_put_sync_suspend(ptdev->base.dev);
> > > +
> > > +	/* If the resume failed, we don't need to suspend here. */
> > > +	if (!ret)
> > > +		pm_runtime_put_sync_suspend(ptdev->base.dev);
> > >  
> > >  	/* If PM is disabled, we need to call the suspend handler manually. */
> > >  	if (!IS_ENABLED(CONFIG_PM))
> > > @@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
> > >  	clk_disable_unprepare(ptdev->clks.core);
> > >  	atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > >  	return 0;
> > > -
> > > -err_set_active:  
> > 
> > Label is being deleted here, but I don't see where the `goto err_set_active` is being removed.
> 
> Yes, as mentioned in my own reply, this chunk should be in patch 3.

Sorry, I did look at patch 3 before sending the email but somehow missed the line with the goto being deleted.

With the change:

Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>

Best regards,
Liviu

-- 
====================
| I would like to |
| fix the world,  |
| but they're not |
| giving me the   |
 \ source code!  /
  ---------------
    ¯\_(ツ)_/¯

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-11-14 11:36 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
2024-11-14 10:50   ` Steven Price
2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
2024-11-13 15:54   ` Boris Brezillon
2024-11-13 15:59   ` Boris Brezillon
2024-11-14 11:13   ` Liviu Dudau
2024-11-14 11:27     ` Boris Brezillon
2024-11-14 11:36       ` Liviu Dudau
2024-11-13 15:42 ` [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
2024-11-14 10:50   ` Steven Price
2024-11-13 15:42 ` [PATCH 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
2024-11-14 10:51   ` Steven Price
2024-11-14 11:24     ` Boris Brezillon
2024-11-13 15:42 ` [PATCH 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
2024-11-14 10:51   ` Steven Price

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.