* [PATCH 0/5] drm/panthor: Be robust against failures in the resume path
@ 2024-11-13 15:42 Boris Brezillon
2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: dri-devel, kernel
Hello,
Here's a collection of patches improving robustness to failures in
the device resume/suspend path. Those failures are pretty hard to
reproduce (happens once in a while on a deqp-vk run), so I used a
mechanism to fake them.
Faking a FW boot failure is kinda tricky though, which means the
last patch has only been partially tested:
- the fast reset path is well tested because that's the default on
a device suspend
- the slow reset has been tested with a hack replacing fast resets
by slow resets
- the fast -> slow reset fallback has been tested by faking boot
failures after a fast reset, but these are not real, which means
we can't really validate if the MCU recovers fine after a slow
reset
On the other hand, this implementation doesn't look like it could
do more harm than the current one (the only difference is the
extra GPU soft-reset that happens between the fast and slow FW
boot).
Regards,
Boris
Boris Brezillon (5):
drm/panthor: Preserve the result returned by panthor_fw_resume()
drm/panthor: Be robust against runtime PM resume failures in the
suspend path
drm/panthor: Ignore devfreq_{suspend,resume}_device() failures
drm/panthor: Be robust against resume failures
drm/panthor: Fix the fast-reset logic
drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++--
drivers/gpu/drm/panthor/panthor_devfreq.h | 4 +-
drivers/gpu/drm/panthor/panthor_device.c | 78 +++++++++++------------
drivers/gpu/drm/panthor/panthor_device.h | 28 ++++++++
drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
drivers/gpu/drm/panthor/panthor_fw.c | 62 +++++-------------
drivers/gpu/drm/panthor/panthor_gpu.c | 14 ++--
drivers/gpu/drm/panthor/panthor_mmu.c | 3 +-
drivers/gpu/drm/panthor/panthor_sched.c | 4 +-
9 files changed, 102 insertions(+), 105 deletions(-)
--
2.46.2
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume()
2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
2024-11-14 10:50 ` Steven Price
2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
` (3 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: dri-devel, kernel
WARN() will return true if the condition is true, false otherwise.
If we store the return of drm_WARN_ON() in ret, we lose the actual
error code.
Fixes: 5fe909cae118 ("drm/panthor: Add the device logical block")
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
drivers/gpu/drm/panthor/panthor_device.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 4082c8f2951d..db7ba40f771d 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -441,8 +441,8 @@ int panthor_device_resume(struct device *dev)
drm_dev_enter(&ptdev->base, &cookie)) {
panthor_gpu_resume(ptdev);
panthor_mmu_resume(ptdev);
- ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
- if (!ret) {
+ ret = panthor_fw_resume(ptdev);
+ if (!drm_WARN_ON(&ptdev->base, ret)) {
panthor_sched_resume(ptdev);
} else {
panthor_mmu_suspend(ptdev);
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
2024-11-13 15:54 ` Boris Brezillon
` (2 more replies)
2024-11-13 15:42 ` [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
` (2 subsequent siblings)
4 siblings, 3 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: dri-devel, kernel
The runtime PM resume operation is not guaranteed to succeed, but if it
fails, the device should be in a suspended state. Make sure we're robust
to resume failures in the unplug path.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
drivers/gpu/drm/panthor/panthor_device.c | 23 ++++++++---------------
drivers/gpu/drm/panthor/panthor_fw.c | 14 +++++++++-----
drivers/gpu/drm/panthor/panthor_gpu.c | 3 ++-
drivers/gpu/drm/panthor/panthor_mmu.c | 3 ++-
4 files changed, 21 insertions(+), 22 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index db7ba40f771d..8b5d54b2bbb4 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -48,6 +48,8 @@ static int panthor_clk_init(struct panthor_device *ptdev)
void panthor_device_unplug(struct panthor_device *ptdev)
{
+ int ret;
+
/* This function can be called from two different path: the reset work
* and the platform device remove callback. drm_dev_unplug() doesn't
* deal with concurrent callers, so we have to protect drm_dev_unplug()
@@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
*/
mutex_unlock(&ptdev->unplug.lock);
- drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
+ ret = pm_runtime_get_sync(ptdev->base.dev);
+ drm_WARN_ON(&ptdev->base, ret < 0);
/* Now, try to cleanly shutdown the GPU before the device resources
* get reclaimed.
@@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
panthor_gpu_unplug(ptdev);
pm_runtime_dont_use_autosuspend(ptdev->base.dev);
- pm_runtime_put_sync_suspend(ptdev->base.dev);
+
+ /* If the resume failed, we don't need to suspend here. */
+ if (!ret)
+ pm_runtime_put_sync_suspend(ptdev->base.dev);
/* If PM is disabled, we need to call the suspend handler manually. */
if (!IS_ENABLED(CONFIG_PM))
@@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
clk_disable_unprepare(ptdev->clks.core);
atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
return 0;
-
-err_set_active:
- /* If something failed and we have to revert back to an
- * active state, we also need to clear the MMIO userspace
- * mappings, so any dumb pages that were mapped while we
- * were trying to suspend gets invalidated.
- */
- mutex_lock(&ptdev->pm.mmio_lock);
- atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
- unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
- DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
- mutex_unlock(&ptdev->pm.mmio_lock);
- return ret;
}
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index 2060085cc9f3..df74750cf1b7 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -12,6 +12,7 @@
#include <linux/iosys-map.h>
#include <linux/mutex.h>
#include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
#include <drm/drm_drv.h>
#include <drm/drm_managed.h>
@@ -1188,11 +1189,13 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
- /* Make sure the IRQ handler can be called after that point. */
- if (ptdev->fw->irq.irq)
- panthor_job_irq_suspend(&ptdev->fw->irq);
+ if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
+ /* Make sure the IRQ handler can be called after that point. */
+ if (ptdev->fw->irq.irq)
+ panthor_job_irq_suspend(&ptdev->fw->irq);
- panthor_fw_stop(ptdev);
+ panthor_fw_stop(ptdev);
+ }
list_for_each_entry(section, &ptdev->fw->sections, node)
panthor_kernel_bo_destroy(section->mem);
@@ -1205,7 +1208,8 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
panthor_vm_put(ptdev->fw->vm);
ptdev->fw->vm = NULL;
- panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
+ if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+ panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
}
/**
diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
index 2d3529a0b156..27702bc62dd6 100644
--- a/drivers/gpu/drm/panthor/panthor_gpu.c
+++ b/drivers/gpu/drm/panthor/panthor_gpu.c
@@ -174,7 +174,8 @@ void panthor_gpu_unplug(struct panthor_device *ptdev)
unsigned long flags;
/* Make sure the IRQ handler is not running after that point. */
- panthor_gpu_irq_suspend(&ptdev->gpu->irq);
+ if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+ panthor_gpu_irq_suspend(&ptdev->gpu->irq);
/* Wake-up all waiters. */
spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
index c2262c72e8e2..fee6c7d9fe0a 100644
--- a/drivers/gpu/drm/panthor/panthor_mmu.c
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -2674,7 +2674,8 @@ int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm
*/
void panthor_mmu_unplug(struct panthor_device *ptdev)
{
- panthor_mmu_irq_suspend(&ptdev->mmu->irq);
+ if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
+ panthor_mmu_irq_suspend(&ptdev->mmu->irq);
mutex_lock(&ptdev->mmu->as.slots_lock);
for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures
2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
2024-11-14 10:50 ` Steven Price
2024-11-13 15:42 ` [PATCH 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
2024-11-13 15:42 ` [PATCH 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
4 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: dri-devel, kernel
devfreq_{resume,suspend}_device() don't bother undoing the suspend_count
modifications if something fails, so either it assumes failures are
harmless, or it's super fragile/buggy. In either case it's not something
we can address at the driver level, so let's just assume failures are
harmless for now, like is done in panfrost.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++++++------
drivers/gpu/drm/panthor/panthor_devfreq.h | 4 ++--
drivers/gpu/drm/panthor/panthor_device.c | 22 +++-------------------
3 files changed, 11 insertions(+), 27 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
index 9d0f891b9b53..fadc2edb26fe 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.c
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
@@ -244,26 +244,26 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
return 0;
}
-int panthor_devfreq_resume(struct panthor_device *ptdev)
+void panthor_devfreq_resume(struct panthor_device *ptdev)
{
struct panthor_devfreq *pdevfreq = ptdev->devfreq;
if (!pdevfreq->devfreq)
- return 0;
+ return;
panthor_devfreq_reset(pdevfreq);
- return devfreq_resume_device(pdevfreq->devfreq);
+ drm_WARN_ON(&ptdev->base, devfreq_resume_device(pdevfreq->devfreq));
}
-int panthor_devfreq_suspend(struct panthor_device *ptdev)
+void panthor_devfreq_suspend(struct panthor_device *ptdev)
{
struct panthor_devfreq *pdevfreq = ptdev->devfreq;
if (!pdevfreq->devfreq)
- return 0;
+ return;
- return devfreq_suspend_device(pdevfreq->devfreq);
+ drm_WARN_ON(&ptdev->base, devfreq_suspend_device(pdevfreq->devfreq));
}
void panthor_devfreq_record_busy(struct panthor_device *ptdev)
diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.h b/drivers/gpu/drm/panthor/panthor_devfreq.h
index 83a5c9522493..b7631de695f7 100644
--- a/drivers/gpu/drm/panthor/panthor_devfreq.h
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.h
@@ -12,8 +12,8 @@ struct panthor_devfreq;
int panthor_devfreq_init(struct panthor_device *ptdev);
-int panthor_devfreq_resume(struct panthor_device *ptdev);
-int panthor_devfreq_suspend(struct panthor_device *ptdev);
+void panthor_devfreq_resume(struct panthor_device *ptdev);
+void panthor_devfreq_suspend(struct panthor_device *ptdev);
void panthor_devfreq_record_busy(struct panthor_device *ptdev);
void panthor_devfreq_record_idle(struct panthor_device *ptdev);
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 8b5d54b2bbb4..353f3aabef42 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -439,9 +439,7 @@ int panthor_device_resume(struct device *dev)
if (ret)
goto err_disable_stacks_clk;
- ret = panthor_devfreq_resume(ptdev);
- if (ret)
- goto err_disable_coregroup_clk;
+ panthor_devfreq_resume(ptdev);
if (panthor_device_is_initialized(ptdev) &&
drm_dev_enter(&ptdev->base, &cookie)) {
@@ -478,8 +476,6 @@ int panthor_device_resume(struct device *dev)
err_suspend_devfreq:
panthor_devfreq_suspend(ptdev);
-
-err_disable_coregroup_clk:
clk_disable_unprepare(ptdev->clks.coregroup);
err_disable_stacks_clk:
@@ -496,7 +492,7 @@ int panthor_device_resume(struct device *dev)
int panthor_device_suspend(struct device *dev)
{
struct panthor_device *ptdev = dev_get_drvdata(dev);
- int ret, cookie;
+ int cookie;
if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
return -EINVAL;
@@ -528,19 +524,7 @@ int panthor_device_suspend(struct device *dev)
drm_dev_exit(cookie);
}
- ret = panthor_devfreq_suspend(ptdev);
- if (ret) {
- if (panthor_device_is_initialized(ptdev) &&
- drm_dev_enter(&ptdev->base, &cookie)) {
- panthor_gpu_resume(ptdev);
- panthor_mmu_resume(ptdev);
- drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
- panthor_sched_resume(ptdev);
- drm_dev_exit(cookie);
- }
-
- goto err_set_active;
- }
+ panthor_devfreq_suspend(ptdev);
clk_disable_unprepare(ptdev->clks.coregroup);
clk_disable_unprepare(ptdev->clks.stacks);
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 4/5] drm/panthor: Be robust against resume failures
2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
` (2 preceding siblings ...)
2024-11-13 15:42 ` [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
2024-11-14 10:51 ` Steven Price
2024-11-13 15:42 ` [PATCH 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
4 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: dri-devel, kernel
When the runtime PM resume callback returns an error, it puts the device
in a state where it can't be resumed anymore. Make sure we can recover
from such transient failures by calling pm_runtime_set_suspended()
explicitly after a pm_runtime_resume_and_get() failure.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
drivers/gpu/drm/panthor/panthor_device.c | 1 +
drivers/gpu/drm/panthor/panthor_device.h | 17 +++++++++++++++++
drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
drivers/gpu/drm/panthor/panthor_sched.c | 4 ++--
4 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 353f3aabef42..d3276b936141 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -486,6 +486,7 @@ int panthor_device_resume(struct device *dev)
err_set_suspended:
atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
+ atomic_set(&ptdev->pm.recovery_needed, 1);
return ret;
}
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 0e68f5a70d20..cc74e99e53f9 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -9,6 +9,7 @@
#include <linux/atomic.h>
#include <linux/io-pgtable.h>
#include <linux/regulator/consumer.h>
+#include <linux/pm_runtime.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
@@ -180,6 +181,9 @@ struct panthor_device {
* is suspended.
*/
struct page *dummy_latest_flush;
+
+ /** @recovery_needed: True when a resume attempt failed. */
+ atomic_t recovery_needed;
} pm;
/** @profile_mask: User-set profiling flags for job accounting. */
@@ -243,6 +247,19 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
int panthor_device_resume(struct device *dev);
int panthor_device_suspend(struct device *dev);
+static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
+{
+ int ret = pm_runtime_resume_and_get(ptdev->base.dev);
+
+ /* If the resume failed, we need to clear the runtime_error, which we
+ * can done by forcing the RPM state to suspended.
+ */
+ if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
+ pm_runtime_set_suspended(ptdev->base.dev);
+
+ return ret;
+}
+
enum drm_panthor_exception_type {
DRM_PANTHOR_EXCEPTION_OK = 0x00,
DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
index 1498c97b4b85..b7a9adc918e3 100644
--- a/drivers/gpu/drm/panthor/panthor_drv.c
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
{
int ret;
- ret = pm_runtime_resume_and_get(ptdev->base.dev);
+ ret = panthor_device_resume_and_get(ptdev);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 97ed5fe5a191..77b184c3fb0c 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
if (!drm_dev_enter(&ptdev->base, &cookie))
return;
- ret = pm_runtime_resume_and_get(ptdev->base.dev);
+ ret = panthor_device_resume_and_get(ptdev);
if (drm_WARN_ON(&ptdev->base, ret))
goto out_dev_exit;
@@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
return dma_fence_get(job->done_fence);
}
- ret = pm_runtime_resume_and_get(ptdev->base.dev);
+ ret = panthor_device_resume_and_get(ptdev);
if (drm_WARN_ON(&ptdev->base, ret))
return ERR_PTR(ret);
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 5/5] drm/panthor: Fix the fast-reset logic
2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
` (3 preceding siblings ...)
2024-11-13 15:42 ` [PATCH 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
@ 2024-11-13 15:42 ` Boris Brezillon
2024-11-14 10:51 ` Steven Price
4 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:42 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: dri-devel, kernel
If we do a GPU soft-reset, that's no longer fast reset. This also means
the slow reset fallback doesn't work because the MCU state is only reset
after a GPU soft-reset.
Let's move the retry logic to panthor_device_resume() to issue a
soft-reset between the fast and slow attempts, and patch
panthor_gpu_suspend() to only power-off the L2 when a fast reset is
requested.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
drivers/gpu/drm/panthor/panthor_device.c | 32 ++++++++++++----
drivers/gpu/drm/panthor/panthor_device.h | 11 ++++++
drivers/gpu/drm/panthor/panthor_fw.c | 48 ++++--------------------
drivers/gpu/drm/panthor/panthor_gpu.c | 11 +++---
4 files changed, 49 insertions(+), 53 deletions(-)
diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index d3276b936141..2b93c56e1319 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -417,6 +417,22 @@ int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *
return 0;
}
+static int panthor_device_resume_hw_components(struct panthor_device *ptdev)
+{
+ int ret;
+
+ panthor_gpu_resume(ptdev);
+ panthor_mmu_resume(ptdev);
+
+ ret = panthor_fw_resume(ptdev);
+ if (!ret)
+ return 0;
+
+ panthor_mmu_suspend(ptdev);
+ panthor_gpu_suspend(ptdev);
+ return ret;
+}
+
int panthor_device_resume(struct device *dev)
{
struct panthor_device *ptdev = dev_get_drvdata(dev);
@@ -443,16 +459,16 @@ int panthor_device_resume(struct device *dev)
if (panthor_device_is_initialized(ptdev) &&
drm_dev_enter(&ptdev->base, &cookie)) {
- panthor_gpu_resume(ptdev);
- panthor_mmu_resume(ptdev);
- ret = panthor_fw_resume(ptdev);
- if (!drm_WARN_ON(&ptdev->base, ret)) {
- panthor_sched_resume(ptdev);
- } else {
- panthor_mmu_suspend(ptdev);
- panthor_gpu_suspend(ptdev);
+ ret = panthor_device_resume_hw_components(ptdev);
+ if (ret && ptdev->reset.fast) {
+ drm_err(&ptdev->base, "Fast reset failed, trying a slow reset");
+ ptdev->reset.fast = false;
+ ret = panthor_device_resume_hw_components(ptdev);
}
+ if (!ret)
+ panthor_sched_resume(ptdev);
+
drm_dev_exit(cookie);
if (ret)
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index cc74e99e53f9..cab7ce740ce3 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -157,6 +157,17 @@ struct panthor_device {
/** @pending: Set to true if a reset is pending. */
atomic_t pending;
+
+ /**
+ * @fast: True if the post_reset logic can proceed with a fast reset.
+ *
+ * A fast reset is just a reset where the driver doesn't reload the FW sections.
+ *
+ * Any time the firmware is properly suspended, a fast reset can take place.
+ * On the other hand, if the halt operation failed, the driver will reload
+ * all FW sections to make sure we start from a fresh state.
+ */
+ bool fast;
} reset;
/** @pm: Power management related data. */
diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index df74750cf1b7..b4fe0de547d1 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -263,17 +263,6 @@ struct panthor_fw {
/** @booted: True is the FW is booted */
bool booted;
- /**
- * @fast_reset: True if the post_reset logic can proceed with a fast reset.
- *
- * A fast reset is just a reset where the driver doesn't reload the FW sections.
- *
- * Any time the firmware is properly suspended, a fast reset can take place.
- * On the other hand, if the halt operation failed, the driver will reload
- * all sections to make sure we start from a fresh state.
- */
- bool fast_reset;
-
/** @irq: Job irq data. */
struct panthor_irq irq;
};
@@ -1090,7 +1079,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
/* Make sure we won't be woken up by a ping. */
cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
- ptdev->fw->fast_reset = false;
+ ptdev->reset.fast = false;
if (!on_hang) {
struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
@@ -1101,7 +1090,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
if (!readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
status == MCU_STATUS_HALT, 10, 100000) &&
glb_iface->output->halt_status == PANTHOR_FW_HALT_OK) {
- ptdev->fw->fast_reset = true;
+ ptdev->reset.fast = true;
} else {
drm_warn(&ptdev->base, "Failed to cleanly suspend MCU");
}
@@ -1131,41 +1120,20 @@ int panthor_fw_post_reset(struct panthor_device *ptdev)
if (ret)
return ret;
- /* If this is a fast reset, try to start the MCU without reloading
- * the FW sections. If it fails, go for a full reset.
- */
- if (ptdev->fw->fast_reset) {
- ret = panthor_fw_start(ptdev);
- if (!ret)
- goto out;
-
- /* Forcibly reset the MCU and force a slow reset, so we get a
- * fresh boot on the next panthor_fw_start() call.
- */
- panthor_fw_stop(ptdev);
- ptdev->fw->fast_reset = false;
- drm_err(&ptdev->base, "FW fast reset failed, trying a slow reset");
-
- ret = panthor_vm_flush_all(ptdev->fw->vm);
- if (ret) {
- drm_err(&ptdev->base, "FW slow reset failed (couldn't flush FW's AS l2cache)");
- return ret;
- }
- }
-
- /* Reload all sections, including RO ones. We're not supposed
- * to end up here anyway, let's just assume the overhead of
+ /* On a slow reset, reload all sections, including RO ones. We're not
+ * supposed to end up here anyway, let's just assume the overhead of
* reloading everything is acceptable.
*/
- panthor_reload_fw_sections(ptdev, true);
+ if (!ptdev->reset.fast)
+ panthor_reload_fw_sections(ptdev, true);
ret = panthor_fw_start(ptdev);
if (ret) {
- drm_err(&ptdev->base, "FW slow reset failed (couldn't start the FW )");
+ drm_err(&ptdev->base, "FW %s reset failed",
+ ptdev->reset.fast ? "fast" : "slow");
return ret;
}
-out:
/* We must re-initialize the global interface even on fast-reset. */
panthor_fw_init_global_iface(ptdev);
return 0;
diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
index 27702bc62dd6..ce3b68124a0a 100644
--- a/drivers/gpu/drm/panthor/panthor_gpu.c
+++ b/drivers/gpu/drm/panthor/panthor_gpu.c
@@ -461,11 +461,12 @@ int panthor_gpu_soft_reset(struct panthor_device *ptdev)
*/
void panthor_gpu_suspend(struct panthor_device *ptdev)
{
- /*
- * It may be preferable to simply power down the L2, but for now just
- * soft-reset which will leave the L2 powered down.
- */
- panthor_gpu_soft_reset(ptdev);
+ /* On a fast reset, simply power down the L2. */
+ if (!ptdev->reset.fast)
+ panthor_gpu_soft_reset(ptdev);
+ else
+ panthor_gpu_power_off(ptdev, L2, 1, 20000);
+
panthor_gpu_irq_suspend(&ptdev->gpu->irq);
}
--
2.46.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
@ 2024-11-13 15:54 ` Boris Brezillon
2024-11-13 15:59 ` Boris Brezillon
2024-11-14 11:13 ` Liviu Dudau
2 siblings, 0 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:54 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: dri-devel, kernel
On Wed, 13 Nov 2024 16:42:54 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:
> @@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> */
> mutex_unlock(&ptdev->unplug.lock);
>
> - drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> + ret = pm_runtime_get_sync(ptdev->base.dev);
> + drm_WARN_ON(&ptdev->base, ret < 0);
>
> /* Now, try to cleanly shutdown the GPU before the device resources
> * get reclaimed.
> @@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> panthor_gpu_unplug(ptdev);
>
> pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> - pm_runtime_put_sync_suspend(ptdev->base.dev);
> +
> + /* If the resume failed, we don't need to suspend here. */
> + if (!ret)
> + pm_runtime_put_sync_suspend(ptdev->base.dev);
Okay, I always get confused by pm_runtime_get_sync(). Turns out the
refcount is incremented even if pm_runtime_get_sync() fails, so we
should call pm_runtime_put_sync_suspend() unconditionally here.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
2024-11-13 15:54 ` Boris Brezillon
@ 2024-11-13 15:59 ` Boris Brezillon
2024-11-14 11:13 ` Liviu Dudau
2 siblings, 0 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-13 15:59 UTC (permalink / raw)
To: Boris Brezillon, Steven Price, Liviu Dudau, Adrián Larumbe
Cc: dri-devel, kernel
On Wed, 13 Nov 2024 16:42:54 +0100
Boris Brezillon <boris.brezillon@collabora.com> wrote:
> @@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
> clk_disable_unprepare(ptdev->clks.core);
> atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> return 0;
> -
> -err_set_active:
> - /* If something failed and we have to revert back to an
> - * active state, we also need to clear the MMIO userspace
> - * mappings, so any dumb pages that were mapped while we
> - * were trying to suspend gets invalidated.
> - */
> - mutex_lock(&ptdev->pm.mmio_lock);
> - atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
> - unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> - DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> - mutex_unlock(&ptdev->pm.mmio_lock);
> - return ret;
> }
This chunk should be in patch 3.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume()
2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
@ 2024-11-14 10:50 ` Steven Price
0 siblings, 0 replies; 16+ messages in thread
From: Steven Price @ 2024-11-14 10:50 UTC (permalink / raw)
To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel
On 13/11/2024 15:42, Boris Brezillon wrote:
> WARN() will return true if the condition is true, false otherwise.
> If we store the return of drm_WARN_ON() in ret, we lose the actual
> error code.
>
> Fixes: 5fe909cae118 ("drm/panthor: Add the device logical block")
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
> ---
> drivers/gpu/drm/panthor/panthor_device.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 4082c8f2951d..db7ba40f771d 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -441,8 +441,8 @@ int panthor_device_resume(struct device *dev)
> drm_dev_enter(&ptdev->base, &cookie)) {
> panthor_gpu_resume(ptdev);
> panthor_mmu_resume(ptdev);
> - ret = drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> - if (!ret) {
> + ret = panthor_fw_resume(ptdev);
> + if (!drm_WARN_ON(&ptdev->base, ret)) {
> panthor_sched_resume(ptdev);
> } else {
> panthor_mmu_suspend(ptdev);
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures
2024-11-13 15:42 ` [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
@ 2024-11-14 10:50 ` Steven Price
0 siblings, 0 replies; 16+ messages in thread
From: Steven Price @ 2024-11-14 10:50 UTC (permalink / raw)
To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel
On 13/11/2024 15:42, Boris Brezillon wrote:
> devfreq_{resume,suspend}_device() don't bother undoing the suspend_count
> modifications if something fails, so either it assumes failures are
> harmless, or it's super fragile/buggy. In either case it's not something
> we can address at the driver level, so let's just assume failures are
> harmless for now, like is done in panfrost.
>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
With the missing chunk from patch 2:
Reviewed-by: Steven Price <steven.price@arm.com>
> ---
> drivers/gpu/drm/panthor/panthor_devfreq.c | 12 ++++++------
> drivers/gpu/drm/panthor/panthor_devfreq.h | 4 ++--
> drivers/gpu/drm/panthor/panthor_device.c | 22 +++-------------------
> 3 files changed, 11 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c b/drivers/gpu/drm/panthor/panthor_devfreq.c
> index 9d0f891b9b53..fadc2edb26fe 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.c
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
> @@ -244,26 +244,26 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
> return 0;
> }
>
> -int panthor_devfreq_resume(struct panthor_device *ptdev)
> +void panthor_devfreq_resume(struct panthor_device *ptdev)
> {
> struct panthor_devfreq *pdevfreq = ptdev->devfreq;
>
> if (!pdevfreq->devfreq)
> - return 0;
> + return;
>
> panthor_devfreq_reset(pdevfreq);
>
> - return devfreq_resume_device(pdevfreq->devfreq);
> + drm_WARN_ON(&ptdev->base, devfreq_resume_device(pdevfreq->devfreq));
> }
>
> -int panthor_devfreq_suspend(struct panthor_device *ptdev)
> +void panthor_devfreq_suspend(struct panthor_device *ptdev)
> {
> struct panthor_devfreq *pdevfreq = ptdev->devfreq;
>
> if (!pdevfreq->devfreq)
> - return 0;
> + return;
>
> - return devfreq_suspend_device(pdevfreq->devfreq);
> + drm_WARN_ON(&ptdev->base, devfreq_suspend_device(pdevfreq->devfreq));
> }
>
> void panthor_devfreq_record_busy(struct panthor_device *ptdev)
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.h b/drivers/gpu/drm/panthor/panthor_devfreq.h
> index 83a5c9522493..b7631de695f7 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.h
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.h
> @@ -12,8 +12,8 @@ struct panthor_devfreq;
>
> int panthor_devfreq_init(struct panthor_device *ptdev);
>
> -int panthor_devfreq_resume(struct panthor_device *ptdev);
> -int panthor_devfreq_suspend(struct panthor_device *ptdev);
> +void panthor_devfreq_resume(struct panthor_device *ptdev);
> +void panthor_devfreq_suspend(struct panthor_device *ptdev);
>
> void panthor_devfreq_record_busy(struct panthor_device *ptdev);
> void panthor_devfreq_record_idle(struct panthor_device *ptdev);
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 8b5d54b2bbb4..353f3aabef42 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -439,9 +439,7 @@ int panthor_device_resume(struct device *dev)
> if (ret)
> goto err_disable_stacks_clk;
>
> - ret = panthor_devfreq_resume(ptdev);
> - if (ret)
> - goto err_disable_coregroup_clk;
> + panthor_devfreq_resume(ptdev);
>
> if (panthor_device_is_initialized(ptdev) &&
> drm_dev_enter(&ptdev->base, &cookie)) {
> @@ -478,8 +476,6 @@ int panthor_device_resume(struct device *dev)
>
> err_suspend_devfreq:
> panthor_devfreq_suspend(ptdev);
> -
> -err_disable_coregroup_clk:
> clk_disable_unprepare(ptdev->clks.coregroup);
>
> err_disable_stacks_clk:
> @@ -496,7 +492,7 @@ int panthor_device_resume(struct device *dev)
> int panthor_device_suspend(struct device *dev)
> {
> struct panthor_device *ptdev = dev_get_drvdata(dev);
> - int ret, cookie;
> + int cookie;
>
> if (atomic_read(&ptdev->pm.state) != PANTHOR_DEVICE_PM_STATE_ACTIVE)
> return -EINVAL;
> @@ -528,19 +524,7 @@ int panthor_device_suspend(struct device *dev)
> drm_dev_exit(cookie);
> }
>
> - ret = panthor_devfreq_suspend(ptdev);
> - if (ret) {
> - if (panthor_device_is_initialized(ptdev) &&
> - drm_dev_enter(&ptdev->base, &cookie)) {
> - panthor_gpu_resume(ptdev);
> - panthor_mmu_resume(ptdev);
> - drm_WARN_ON(&ptdev->base, panthor_fw_resume(ptdev));
> - panthor_sched_resume(ptdev);
> - drm_dev_exit(cookie);
> - }
> -
> - goto err_set_active;
> - }
> + panthor_devfreq_suspend(ptdev);
>
> clk_disable_unprepare(ptdev->clks.coregroup);
> clk_disable_unprepare(ptdev->clks.stacks);
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 4/5] drm/panthor: Be robust against resume failures
2024-11-13 15:42 ` [PATCH 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
@ 2024-11-14 10:51 ` Steven Price
2024-11-14 11:24 ` Boris Brezillon
0 siblings, 1 reply; 16+ messages in thread
From: Steven Price @ 2024-11-14 10:51 UTC (permalink / raw)
To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel
On 13/11/2024 15:42, Boris Brezillon wrote:
> When the runtime PM resume callback returns an error, it puts the device
> in a state where it can't be resumed anymore. Make sure we can recover
> from such transient failures by calling pm_runtime_set_suspended()
> explicitly after a pm_runtime_resume_and_get() failure.
>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_device.c | 1 +
> drivers/gpu/drm/panthor/panthor_device.h | 17 +++++++++++++++++
> drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
> drivers/gpu/drm/panthor/panthor_sched.c | 4 ++--
> 4 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index 353f3aabef42..d3276b936141 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -486,6 +486,7 @@ int panthor_device_resume(struct device *dev)
>
> err_set_suspended:
> atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> + atomic_set(&ptdev->pm.recovery_needed, 1);
> return ret;
> }
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index 0e68f5a70d20..cc74e99e53f9 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -9,6 +9,7 @@
> #include <linux/atomic.h>
> #include <linux/io-pgtable.h>
> #include <linux/regulator/consumer.h>
> +#include <linux/pm_runtime.h>
> #include <linux/sched.h>
> #include <linux/spinlock.h>
>
> @@ -180,6 +181,9 @@ struct panthor_device {
> * is suspended.
> */
> struct page *dummy_latest_flush;
> +
> + /** @recovery_needed: True when a resume attempt failed. */
> + atomic_t recovery_needed;
> } pm;
>
> /** @profile_mask: User-set profiling flags for job accounting. */
> @@ -243,6 +247,19 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
> int panthor_device_resume(struct device *dev);
> int panthor_device_suspend(struct device *dev);
>
> +static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
> +{
> + int ret = pm_runtime_resume_and_get(ptdev->base.dev);
> +
> + /* If the resume failed, we need to clear the runtime_error, which we
> + * can done by forcing the RPM state to suspended.
> + */
> + if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
I'm unclear what this atomic achieves. At first glance it appears
pointless: with this change if panthor_device_resume() fails then
recovery_needed is set to 1. So logically if ret != 0 then also
recovery_needed == 1.
My second thought was is this to avoid races? If multiple threads are
calling this then only one will win the cmpxchg and call
pm_runtime_set_suspended. But it's not safe - it's quite possible for
the first thread to get past the cmpxchg and be suspended before the
second thread comes along and reaches the same point - leading to
multiple calls to pm_runtime_set_suspended().
So I'm afraid I don't understand this atomic - what am I missing?
Thanks,
Steve
> + pm_runtime_set_suspended(ptdev->base.dev);
> +
> + return ret;
> +}
> +
> enum drm_panthor_exception_type {
> DRM_PANTHOR_EXCEPTION_OK = 0x00,
> DRM_PANTHOR_EXCEPTION_TERMINATED = 0x04,
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c b/drivers/gpu/drm/panthor/panthor_drv.c
> index 1498c97b4b85..b7a9adc918e3 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -763,7 +763,7 @@ static int panthor_query_timestamp_info(struct panthor_device *ptdev,
> {
> int ret;
>
> - ret = pm_runtime_resume_and_get(ptdev->base.dev);
> + ret = panthor_device_resume_and_get(ptdev);
> if (ret)
> return ret;
>
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index 97ed5fe5a191..77b184c3fb0c 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -2364,7 +2364,7 @@ static void tick_work(struct work_struct *work)
> if (!drm_dev_enter(&ptdev->base, &cookie))
> return;
>
> - ret = pm_runtime_resume_and_get(ptdev->base.dev);
> + ret = panthor_device_resume_and_get(ptdev);
> if (drm_WARN_ON(&ptdev->base, ret))
> goto out_dev_exit;
>
> @@ -3131,7 +3131,7 @@ queue_run_job(struct drm_sched_job *sched_job)
> return dma_fence_get(job->done_fence);
> }
>
> - ret = pm_runtime_resume_and_get(ptdev->base.dev);
> + ret = panthor_device_resume_and_get(ptdev);
> if (drm_WARN_ON(&ptdev->base, ret))
> return ERR_PTR(ret);
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 5/5] drm/panthor: Fix the fast-reset logic
2024-11-13 15:42 ` [PATCH 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
@ 2024-11-14 10:51 ` Steven Price
0 siblings, 0 replies; 16+ messages in thread
From: Steven Price @ 2024-11-14 10:51 UTC (permalink / raw)
To: Boris Brezillon, Liviu Dudau, Adrián Larumbe; +Cc: dri-devel, kernel
On 13/11/2024 15:42, Boris Brezillon wrote:
> If we do a GPU soft-reset, that's no longer fast reset. This also means
> the slow reset fallback doesn't work because the MCU state is only reset
> after a GPU soft-reset.
>
> Let's move the retry logic to panthor_device_resume() to issue a
> soft-reset between the fast and slow attempts, and patch
> panthor_gpu_suspend() to only power-off the L2 when a fast reset is
> requested.
>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
I also don't have a good way of testing this, but it looks right and it
doesn't blow up in the "normal" case, so:
Reviewed-by: Steven Price <steven.price@arm.com>
> ---
> drivers/gpu/drm/panthor/panthor_device.c | 32 ++++++++++++----
> drivers/gpu/drm/panthor/panthor_device.h | 11 ++++++
> drivers/gpu/drm/panthor/panthor_fw.c | 48 ++++--------------------
> drivers/gpu/drm/panthor/panthor_gpu.c | 11 +++---
> 4 files changed, 49 insertions(+), 53 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index d3276b936141..2b93c56e1319 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -417,6 +417,22 @@ int panthor_device_mmap_io(struct panthor_device *ptdev, struct vm_area_struct *
> return 0;
> }
>
> +static int panthor_device_resume_hw_components(struct panthor_device *ptdev)
> +{
> + int ret;
> +
> + panthor_gpu_resume(ptdev);
> + panthor_mmu_resume(ptdev);
> +
> + ret = panthor_fw_resume(ptdev);
> + if (!ret)
> + return 0;
> +
> + panthor_mmu_suspend(ptdev);
> + panthor_gpu_suspend(ptdev);
> + return ret;
> +}
> +
> int panthor_device_resume(struct device *dev)
> {
> struct panthor_device *ptdev = dev_get_drvdata(dev);
> @@ -443,16 +459,16 @@ int panthor_device_resume(struct device *dev)
>
> if (panthor_device_is_initialized(ptdev) &&
> drm_dev_enter(&ptdev->base, &cookie)) {
> - panthor_gpu_resume(ptdev);
> - panthor_mmu_resume(ptdev);
> - ret = panthor_fw_resume(ptdev);
> - if (!drm_WARN_ON(&ptdev->base, ret)) {
> - panthor_sched_resume(ptdev);
> - } else {
> - panthor_mmu_suspend(ptdev);
> - panthor_gpu_suspend(ptdev);
> + ret = panthor_device_resume_hw_components(ptdev);
> + if (ret && ptdev->reset.fast) {
> + drm_err(&ptdev->base, "Fast reset failed, trying a slow reset");
> + ptdev->reset.fast = false;
> + ret = panthor_device_resume_hw_components(ptdev);
> }
>
> + if (!ret)
> + panthor_sched_resume(ptdev);
> +
> drm_dev_exit(cookie);
>
> if (ret)
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> index cc74e99e53f9..cab7ce740ce3 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -157,6 +157,17 @@ struct panthor_device {
>
> /** @pending: Set to true if a reset is pending. */
> atomic_t pending;
> +
> + /**
> + * @fast: True if the post_reset logic can proceed with a fast reset.
> + *
> + * A fast reset is just a reset where the driver doesn't reload the FW sections.
> + *
> + * Any time the firmware is properly suspended, a fast reset can take place.
> + * On the other hand, if the halt operation failed, the driver will reload
> + * all FW sections to make sure we start from a fresh state.
> + */
> + bool fast;
> } reset;
>
> /** @pm: Power management related data. */
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index df74750cf1b7..b4fe0de547d1 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -263,17 +263,6 @@ struct panthor_fw {
> /** @booted: True is the FW is booted */
> bool booted;
>
> - /**
> - * @fast_reset: True if the post_reset logic can proceed with a fast reset.
> - *
> - * A fast reset is just a reset where the driver doesn't reload the FW sections.
> - *
> - * Any time the firmware is properly suspended, a fast reset can take place.
> - * On the other hand, if the halt operation failed, the driver will reload
> - * all sections to make sure we start from a fresh state.
> - */
> - bool fast_reset;
> -
> /** @irq: Job irq data. */
> struct panthor_irq irq;
> };
> @@ -1090,7 +1079,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
> /* Make sure we won't be woken up by a ping. */
> cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
>
> - ptdev->fw->fast_reset = false;
> + ptdev->reset.fast = false;
>
> if (!on_hang) {
> struct panthor_fw_global_iface *glb_iface = panthor_fw_get_glb_iface(ptdev);
> @@ -1101,7 +1090,7 @@ void panthor_fw_pre_reset(struct panthor_device *ptdev, bool on_hang)
> if (!readl_poll_timeout(ptdev->iomem + MCU_STATUS, status,
> status == MCU_STATUS_HALT, 10, 100000) &&
> glb_iface->output->halt_status == PANTHOR_FW_HALT_OK) {
> - ptdev->fw->fast_reset = true;
> + ptdev->reset.fast = true;
> } else {
> drm_warn(&ptdev->base, "Failed to cleanly suspend MCU");
> }
> @@ -1131,41 +1120,20 @@ int panthor_fw_post_reset(struct panthor_device *ptdev)
> if (ret)
> return ret;
>
> - /* If this is a fast reset, try to start the MCU without reloading
> - * the FW sections. If it fails, go for a full reset.
> - */
> - if (ptdev->fw->fast_reset) {
> - ret = panthor_fw_start(ptdev);
> - if (!ret)
> - goto out;
> -
> - /* Forcibly reset the MCU and force a slow reset, so we get a
> - * fresh boot on the next panthor_fw_start() call.
> - */
> - panthor_fw_stop(ptdev);
> - ptdev->fw->fast_reset = false;
> - drm_err(&ptdev->base, "FW fast reset failed, trying a slow reset");
> -
> - ret = panthor_vm_flush_all(ptdev->fw->vm);
> - if (ret) {
> - drm_err(&ptdev->base, "FW slow reset failed (couldn't flush FW's AS l2cache)");
> - return ret;
> - }
> - }
> -
> - /* Reload all sections, including RO ones. We're not supposed
> - * to end up here anyway, let's just assume the overhead of
> + /* On a slow reset, reload all sections, including RO ones. We're not
> + * supposed to end up here anyway, let's just assume the overhead of
> * reloading everything is acceptable.
> */
> - panthor_reload_fw_sections(ptdev, true);
> + if (!ptdev->reset.fast)
> + panthor_reload_fw_sections(ptdev, true);
>
> ret = panthor_fw_start(ptdev);
> if (ret) {
> - drm_err(&ptdev->base, "FW slow reset failed (couldn't start the FW )");
> + drm_err(&ptdev->base, "FW %s reset failed",
> + ptdev->reset.fast ? "fast" : "slow");
> return ret;
> }
>
> -out:
> /* We must re-initialize the global interface even on fast-reset. */
> panthor_fw_init_global_iface(ptdev);
> return 0;
> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
> index 27702bc62dd6..ce3b68124a0a 100644
> --- a/drivers/gpu/drm/panthor/panthor_gpu.c
> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
> @@ -461,11 +461,12 @@ int panthor_gpu_soft_reset(struct panthor_device *ptdev)
> */
> void panthor_gpu_suspend(struct panthor_device *ptdev)
> {
> - /*
> - * It may be preferable to simply power down the L2, but for now just
> - * soft-reset which will leave the L2 powered down.
> - */
> - panthor_gpu_soft_reset(ptdev);
> + /* On a fast reset, simply power down the L2. */
> + if (!ptdev->reset.fast)
> + panthor_gpu_soft_reset(ptdev);
> + else
> + panthor_gpu_power_off(ptdev, L2, 1, 20000);
> +
> panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> }
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
2024-11-13 15:54 ` Boris Brezillon
2024-11-13 15:59 ` Boris Brezillon
@ 2024-11-14 11:13 ` Liviu Dudau
2024-11-14 11:27 ` Boris Brezillon
2 siblings, 1 reply; 16+ messages in thread
From: Liviu Dudau @ 2024-11-14 11:13 UTC (permalink / raw)
To: Boris Brezillon; +Cc: Steven Price, Adrián Larumbe, dri-devel, kernel
On Wed, Nov 13, 2024 at 04:42:54PM +0100, Boris Brezillon wrote:
> The runtime PM resume operation is not guaranteed to succeed, but if it
> fails, the device should be in a suspended state. Make sure we're robust
> to resume failures in the unplug path.
>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
> drivers/gpu/drm/panthor/panthor_device.c | 23 ++++++++---------------
> drivers/gpu/drm/panthor/panthor_fw.c | 14 +++++++++-----
> drivers/gpu/drm/panthor/panthor_gpu.c | 3 ++-
> drivers/gpu/drm/panthor/panthor_mmu.c | 3 ++-
> 4 files changed, 21 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> index db7ba40f771d..8b5d54b2bbb4 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.c
> +++ b/drivers/gpu/drm/panthor/panthor_device.c
> @@ -48,6 +48,8 @@ static int panthor_clk_init(struct panthor_device *ptdev)
>
> void panthor_device_unplug(struct panthor_device *ptdev)
> {
> + int ret;
> +
> /* This function can be called from two different path: the reset work
> * and the platform device remove callback. drm_dev_unplug() doesn't
> * deal with concurrent callers, so we have to protect drm_dev_unplug()
> @@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> */
> mutex_unlock(&ptdev->unplug.lock);
>
> - drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> + ret = pm_runtime_get_sync(ptdev->base.dev);
> + drm_WARN_ON(&ptdev->base, ret < 0);
>
> /* Now, try to cleanly shutdown the GPU before the device resources
> * get reclaimed.
> @@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> panthor_gpu_unplug(ptdev);
>
> pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> - pm_runtime_put_sync_suspend(ptdev->base.dev);
> +
> + /* If the resume failed, we don't need to suspend here. */
> + if (!ret)
> + pm_runtime_put_sync_suspend(ptdev->base.dev);
>
> /* If PM is disabled, we need to call the suspend handler manually. */
> if (!IS_ENABLED(CONFIG_PM))
> @@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
> clk_disable_unprepare(ptdev->clks.core);
> atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> return 0;
> -
> -err_set_active:
Label is being deleted here, but I don't see where the `goto err_set_active` is being removed.
Best regards,
Liviu
> - /* If something failed and we have to revert back to an
> - * active state, we also need to clear the MMIO userspace
> - * mappings, so any dumb pages that were mapped while we
> - * were trying to suspend gets invalidated.
> - */
> - mutex_lock(&ptdev->pm.mmio_lock);
> - atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_ACTIVE);
> - unmap_mapping_range(ptdev->base.anon_inode->i_mapping,
> - DRM_PANTHOR_USER_MMIO_OFFSET, 0, 1);
> - mutex_unlock(&ptdev->pm.mmio_lock);
> - return ret;
> }
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index 2060085cc9f3..df74750cf1b7 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -12,6 +12,7 @@
> #include <linux/iosys-map.h>
> #include <linux/mutex.h>
> #include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
>
> #include <drm/drm_drv.h>
> #include <drm/drm_managed.h>
> @@ -1188,11 +1189,13 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
>
> cancel_delayed_work_sync(&ptdev->fw->watchdog.ping_work);
>
> - /* Make sure the IRQ handler can be called after that point. */
> - if (ptdev->fw->irq.irq)
> - panthor_job_irq_suspend(&ptdev->fw->irq);
> + if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev)) {
> + /* Make sure the IRQ handler can be called after that point. */
> + if (ptdev->fw->irq.irq)
> + panthor_job_irq_suspend(&ptdev->fw->irq);
>
> - panthor_fw_stop(ptdev);
> + panthor_fw_stop(ptdev);
> + }
>
> list_for_each_entry(section, &ptdev->fw->sections, node)
> panthor_kernel_bo_destroy(section->mem);
> @@ -1205,7 +1208,8 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
> panthor_vm_put(ptdev->fw->vm);
> ptdev->fw->vm = NULL;
>
> - panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
> + if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> + panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 20000);
> }
>
> /**
> diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c b/drivers/gpu/drm/panthor/panthor_gpu.c
> index 2d3529a0b156..27702bc62dd6 100644
> --- a/drivers/gpu/drm/panthor/panthor_gpu.c
> +++ b/drivers/gpu/drm/panthor/panthor_gpu.c
> @@ -174,7 +174,8 @@ void panthor_gpu_unplug(struct panthor_device *ptdev)
> unsigned long flags;
>
> /* Make sure the IRQ handler is not running after that point. */
> - panthor_gpu_irq_suspend(&ptdev->gpu->irq);
> + if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> + panthor_gpu_irq_suspend(&ptdev->gpu->irq);
>
> /* Wake-up all waiters. */
> spin_lock_irqsave(&ptdev->gpu->reqs_lock, flags);
> diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c b/drivers/gpu/drm/panthor/panthor_mmu.c
> index c2262c72e8e2..fee6c7d9fe0a 100644
> --- a/drivers/gpu/drm/panthor/panthor_mmu.c
> +++ b/drivers/gpu/drm/panthor/panthor_mmu.c
> @@ -2674,7 +2674,8 @@ int panthor_vm_prepare_mapped_bos_resvs(struct drm_exec *exec, struct panthor_vm
> */
> void panthor_mmu_unplug(struct panthor_device *ptdev)
> {
> - panthor_mmu_irq_suspend(&ptdev->mmu->irq);
> + if (!IS_ENABLED(CONFIG_PM) || pm_runtime_active(ptdev->base.dev))
> + panthor_mmu_irq_suspend(&ptdev->mmu->irq);
>
> mutex_lock(&ptdev->mmu->as.slots_lock);
> for (u32 i = 0; i < ARRAY_SIZE(ptdev->mmu->as.slots); i++) {
> --
> 2.46.2
>
--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 4/5] drm/panthor: Be robust against resume failures
2024-11-14 10:51 ` Steven Price
@ 2024-11-14 11:24 ` Boris Brezillon
0 siblings, 0 replies; 16+ messages in thread
From: Boris Brezillon @ 2024-11-14 11:24 UTC (permalink / raw)
To: Steven Price; +Cc: Liviu Dudau, Adrián Larumbe, dri-devel, kernel
On Thu, 14 Nov 2024 10:51:01 +0000
Steven Price <steven.price@arm.com> wrote:
> On 13/11/2024 15:42, Boris Brezillon wrote:
> > When the runtime PM resume callback returns an error, it puts the device
> > in a state where it can't be resumed anymore. Make sure we can recover
> > from such transient failures by calling pm_runtime_set_suspended()
> > explicitly after a pm_runtime_resume_and_get() failure.
> >
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> > drivers/gpu/drm/panthor/panthor_device.c | 1 +
> > drivers/gpu/drm/panthor/panthor_device.h | 17 +++++++++++++++++
> > drivers/gpu/drm/panthor/panthor_drv.c | 2 +-
> > drivers/gpu/drm/panthor/panthor_sched.c | 4 ++--
> > 4 files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > index 353f3aabef42..d3276b936141 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.c
> > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > @@ -486,6 +486,7 @@ int panthor_device_resume(struct device *dev)
> >
> > err_set_suspended:
> > atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > + atomic_set(&ptdev->pm.recovery_needed, 1);
> > return ret;
> > }
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
> > index 0e68f5a70d20..cc74e99e53f9 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.h
> > +++ b/drivers/gpu/drm/panthor/panthor_device.h
> > @@ -9,6 +9,7 @@
> > #include <linux/atomic.h>
> > #include <linux/io-pgtable.h>
> > #include <linux/regulator/consumer.h>
> > +#include <linux/pm_runtime.h>
> > #include <linux/sched.h>
> > #include <linux/spinlock.h>
> >
> > @@ -180,6 +181,9 @@ struct panthor_device {
> > * is suspended.
> > */
> > struct page *dummy_latest_flush;
> > +
> > + /** @recovery_needed: True when a resume attempt failed. */
> > + atomic_t recovery_needed;
> > } pm;
> >
> > /** @profile_mask: User-set profiling flags for job accounting. */
> > @@ -243,6 +247,19 @@ int panthor_device_mmap_io(struct panthor_device *ptdev,
> > int panthor_device_resume(struct device *dev);
> > int panthor_device_suspend(struct device *dev);
> >
> > +static inline int panthor_device_resume_and_get(struct panthor_device *ptdev)
> > +{
> > + int ret = pm_runtime_resume_and_get(ptdev->base.dev);
> > +
> > + /* If the resume failed, we need to clear the runtime_error, which we
> > + * can done by forcing the RPM state to suspended.
> > + */
> > + if (ret && atomic_cmpxchg(&ptdev->pm.recovery_needed, 1, 0) == 1)
>
> I'm unclear what this atomic achieves. At first glance it appears
> pointless: with this change if panthor_device_resume() fails then
> recovery_needed is set to 1. So logically if ret != 0 then also
> recovery_needed == 1.
>
> My second thought was is this to avoid races? If multiple threads are
> calling this then only one will win the cmpxchg and call
> pm_runtime_set_suspended. But it's not safe - it's quite possible for
> the first thread to get past the cmpxchg and be suspended before the
> second thread comes along and reaches the same point - leading to
> multiple calls to pm_runtime_set_suspended().
Yes, it was here to avoid the race. I don't think there's a risk of
multiple threads calling pm_runtime_set_suspended() without
actually needing such a call, because we won't reach
panthor_device_resume() until the runtime_error has been cleared
(runtime PM bails out early with a -EINVAL). So, in practice, there's no
way two threads can see a recovery_needed=1 unless the error has already
been cleared by the other thread and the second thread triggered
another resume error, in which case the second
pm_runtime_set_suspended() call is legit.
But now that you mention it, it indeed doesn't prevent the second
thread to call pm_runtime_resume_and_get() before the PM runtime_error
has been cleared, leading to potential spurious errors, so that's
annoying.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
2024-11-14 11:13 ` Liviu Dudau
@ 2024-11-14 11:27 ` Boris Brezillon
2024-11-14 11:36 ` Liviu Dudau
0 siblings, 1 reply; 16+ messages in thread
From: Boris Brezillon @ 2024-11-14 11:27 UTC (permalink / raw)
To: Liviu Dudau; +Cc: Steven Price, Adrián Larumbe, dri-devel, kernel
On Thu, 14 Nov 2024 11:13:29 +0000
Liviu Dudau <liviu.dudau@arm.com> wrote:
> On Wed, Nov 13, 2024 at 04:42:54PM +0100, Boris Brezillon wrote:
> > The runtime PM resume operation is not guaranteed to succeed, but if it
> > fails, the device should be in a suspended state. Make sure we're robust
> > to resume failures in the unplug path.
> >
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> > drivers/gpu/drm/panthor/panthor_device.c | 23 ++++++++---------------
> > drivers/gpu/drm/panthor/panthor_fw.c | 14 +++++++++-----
> > drivers/gpu/drm/panthor/panthor_gpu.c | 3 ++-
> > drivers/gpu/drm/panthor/panthor_mmu.c | 3 ++-
> > 4 files changed, 21 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > index db7ba40f771d..8b5d54b2bbb4 100644
> > --- a/drivers/gpu/drm/panthor/panthor_device.c
> > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > @@ -48,6 +48,8 @@ static int panthor_clk_init(struct panthor_device *ptdev)
> >
> > void panthor_device_unplug(struct panthor_device *ptdev)
> > {
> > + int ret;
> > +
> > /* This function can be called from two different path: the reset work
> > * and the platform device remove callback. drm_dev_unplug() doesn't
> > * deal with concurrent callers, so we have to protect drm_dev_unplug()
> > @@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> > */
> > mutex_unlock(&ptdev->unplug.lock);
> >
> > - drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> > + ret = pm_runtime_get_sync(ptdev->base.dev);
> > + drm_WARN_ON(&ptdev->base, ret < 0);
> >
> > /* Now, try to cleanly shutdown the GPU before the device resources
> > * get reclaimed.
> > @@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> > panthor_gpu_unplug(ptdev);
> >
> > pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> > - pm_runtime_put_sync_suspend(ptdev->base.dev);
> > +
> > + /* If the resume failed, we don't need to suspend here. */
> > + if (!ret)
> > + pm_runtime_put_sync_suspend(ptdev->base.dev);
> >
> > /* If PM is disabled, we need to call the suspend handler manually. */
> > if (!IS_ENABLED(CONFIG_PM))
> > @@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
> > clk_disable_unprepare(ptdev->clks.core);
> > atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > return 0;
> > -
> > -err_set_active:
>
> Label is being deleted here, but I don't see where the `goto err_set_active` is being removed.
Yes, as mentioned in my own reply, this chunk should be in patch 3.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path
2024-11-14 11:27 ` Boris Brezillon
@ 2024-11-14 11:36 ` Liviu Dudau
0 siblings, 0 replies; 16+ messages in thread
From: Liviu Dudau @ 2024-11-14 11:36 UTC (permalink / raw)
To: Boris Brezillon; +Cc: Steven Price, Adrián Larumbe, dri-devel, kernel
On Thu, Nov 14, 2024 at 12:27:55PM +0100, Boris Brezillon wrote:
> On Thu, 14 Nov 2024 11:13:29 +0000
> Liviu Dudau <liviu.dudau@arm.com> wrote:
>
> > On Wed, Nov 13, 2024 at 04:42:54PM +0100, Boris Brezillon wrote:
> > > The runtime PM resume operation is not guaranteed to succeed, but if it
> > > fails, the device should be in a suspended state. Make sure we're robust
> > > to resume failures in the unplug path.
> > >
> > > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > > ---
> > > drivers/gpu/drm/panthor/panthor_device.c | 23 ++++++++---------------
> > > drivers/gpu/drm/panthor/panthor_fw.c | 14 +++++++++-----
> > > drivers/gpu/drm/panthor/panthor_gpu.c | 3 ++-
> > > drivers/gpu/drm/panthor/panthor_mmu.c | 3 ++-
> > > 4 files changed, 21 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
> > > index db7ba40f771d..8b5d54b2bbb4 100644
> > > --- a/drivers/gpu/drm/panthor/panthor_device.c
> > > +++ b/drivers/gpu/drm/panthor/panthor_device.c
> > > @@ -48,6 +48,8 @@ static int panthor_clk_init(struct panthor_device *ptdev)
> > >
> > > void panthor_device_unplug(struct panthor_device *ptdev)
> > > {
> > > + int ret;
> > > +
> > > /* This function can be called from two different path: the reset work
> > > * and the platform device remove callback. drm_dev_unplug() doesn't
> > > * deal with concurrent callers, so we have to protect drm_dev_unplug()
> > > @@ -74,7 +76,8 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> > > */
> > > mutex_unlock(&ptdev->unplug.lock);
> > >
> > > - drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
> > > + ret = pm_runtime_get_sync(ptdev->base.dev);
> > > + drm_WARN_ON(&ptdev->base, ret < 0);
> > >
> > > /* Now, try to cleanly shutdown the GPU before the device resources
> > > * get reclaimed.
> > > @@ -85,7 +88,10 @@ void panthor_device_unplug(struct panthor_device *ptdev)
> > > panthor_gpu_unplug(ptdev);
> > >
> > > pm_runtime_dont_use_autosuspend(ptdev->base.dev);
> > > - pm_runtime_put_sync_suspend(ptdev->base.dev);
> > > +
> > > + /* If the resume failed, we don't need to suspend here. */
> > > + if (!ret)
> > > + pm_runtime_put_sync_suspend(ptdev->base.dev);
> > >
> > > /* If PM is disabled, we need to call the suspend handler manually. */
> > > if (!IS_ENABLED(CONFIG_PM))
> > > @@ -541,17 +547,4 @@ int panthor_device_suspend(struct device *dev)
> > > clk_disable_unprepare(ptdev->clks.core);
> > > atomic_set(&ptdev->pm.state, PANTHOR_DEVICE_PM_STATE_SUSPENDED);
> > > return 0;
> > > -
> > > -err_set_active:
> >
> > Label is being deleted here, but I don't see where the `goto err_set_active` is being removed.
>
> Yes, as mentioned in my own reply, this chunk should be in patch 3.
Sorry, I did look at patch 3 before sending the email but somehow missed the line with the goto being deleted.
With the change:
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Best regards,
Liviu
--
====================
| I would like to |
| fix the world, |
| but they're not |
| giving me the |
\ source code! /
---------------
¯\_(ツ)_/¯
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2024-11-14 11:36 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-13 15:42 [PATCH 0/5] drm/panthor: Be robust against failures in the resume path Boris Brezillon
2024-11-13 15:42 ` [PATCH 1/5] drm/panthor: Preserve the result returned by panthor_fw_resume() Boris Brezillon
2024-11-14 10:50 ` Steven Price
2024-11-13 15:42 ` [PATCH 2/5] drm/panthor: Be robust against runtime PM resume failures in the suspend path Boris Brezillon
2024-11-13 15:54 ` Boris Brezillon
2024-11-13 15:59 ` Boris Brezillon
2024-11-14 11:13 ` Liviu Dudau
2024-11-14 11:27 ` Boris Brezillon
2024-11-14 11:36 ` Liviu Dudau
2024-11-13 15:42 ` [PATCH 3/5] drm/panthor: Ignore devfreq_{suspend, resume}_device() failures Boris Brezillon
2024-11-14 10:50 ` Steven Price
2024-11-13 15:42 ` [PATCH 4/5] drm/panthor: Be robust against resume failures Boris Brezillon
2024-11-14 10:51 ` Steven Price
2024-11-14 11:24 ` Boris Brezillon
2024-11-13 15:42 ` [PATCH 5/5] drm/panthor: Fix the fast-reset logic Boris Brezillon
2024-11-14 10:51 ` Steven Price
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.