All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] Handle aborted suspend better
@ 2025-06-02  1:44 Mario Limonciello
  2025-06-02  1:44 ` [PATCH 1/3] drm/amd: Add support for a complete pmops action Mario Limonciello
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Mario Limonciello @ 2025-06-02  1:44 UTC (permalink / raw)
  To: amd-gfx; +Cc: Chris Bainbridge, Mario Limonciello

From: Mario Limonciello <mario.limonciello@amd.com>

Chris Bainbridge reported some list corruption occurring around the
suspend sequence when an aborted suspend occurs.

I couldn't reproduce this specific problem, but when I tried I found
some other issues where the cached DM state isn't properly destroyed.

This is because there isn't a complete() callback to match the prepare()
callback used by amdgpu. Normally the PM core will call complete() after
every suspend attempt (succesful or not).

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/4280

Mario Limonciello (3):
  drm/amd: Add support for a complete pmops action
  drm/amd/display: Stop storing failures into adev->dm.cached_state
  drm/amd/display: Destroy cached state in complete() callback

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  22 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |   2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 125 +++++++++++-------
 drivers/gpu/drm/amd/include/amd_shared.h      |   1 +
 5 files changed, 103 insertions(+), 48 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/3] drm/amd: Add support for a complete pmops action
  2025-06-02  1:44 [PATCH 0/3] Handle aborted suspend better Mario Limonciello
@ 2025-06-02  1:44 ` Mario Limonciello
  2025-06-02  1:44 ` [PATCH 2/3] drm/amd/display: Stop storing failures into adev->dm.cached_state Mario Limonciello
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Mario Limonciello @ 2025-06-02  1:44 UTC (permalink / raw)
  To: amd-gfx; +Cc: Chris Bainbridge, Mario Limonciello

From: Mario Limonciello <mario.limonciello@amd.com>

complete() callbacks are supposed to handle reversing anything
that occurred during prepare() callbacks.  They'll be called on every
power state transition, and will also be called if the sequence is
failed (such as an aborted suspend).

Add support for IP blocks to support this action.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  2 +-
 drivers/gpu/drm/amd/include/amd_shared.h   |  1 +
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a5ccd0ada16ab..77da367231b3a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1619,6 +1619,7 @@ void amdgpu_driver_release_kms(struct drm_device *dev);
 
 int amdgpu_device_ip_suspend(struct amdgpu_device *adev);
 int amdgpu_device_prepare(struct drm_device *dev);
+void amdgpu_device_complete(struct drm_device *dev);
 int amdgpu_device_suspend(struct drm_device *dev, bool fbcon);
 int amdgpu_device_resume(struct drm_device *dev, bool fbcon);
 u32 amdgpu_get_vblank_counter_kms(struct drm_crtc *crtc);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 72e41781afb06..abe86c7d830d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5034,6 +5034,28 @@ int amdgpu_device_prepare(struct drm_device *dev)
 	return 0;
 }
 
+/**
+ * amdgpu_device_complete - complete power state transition
+ *
+ * @dev: drm dev pointer
+ *
+ * Undo the changes from amdgpu_device_prepare. This will be
+ * called on all resume transitions, including those that failed.
+ */
+void amdgpu_device_complete(struct drm_device *dev)
+{
+	struct amdgpu_device *adev = drm_to_adev(dev);
+	int i;
+
+	for (i = 0; i < adev->num_ip_blocks; i++) {
+		if (!adev->ip_blocks[i].status.valid)
+			continue;
+		if (!adev->ip_blocks[i].version->funcs->complete)
+			continue;
+		adev->ip_blocks[i].version->funcs->complete(&adev->ip_blocks[i]);
+	}
+}
+
 /**
  * amdgpu_device_suspend - initiate device suspend
  *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4db92e0a60da7..0fc0eeedc6461 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2570,7 +2570,7 @@ static int amdgpu_pmops_prepare(struct device *dev)
 
 static void amdgpu_pmops_complete(struct device *dev)
 {
-	/* nothing to do */
+	amdgpu_device_complete(dev_get_drvdata(dev));
 }
 
 static int amdgpu_pmops_suspend(struct device *dev)
diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h
index 11374a2cbab87..a06e92b1b2ef9 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -427,6 +427,7 @@ struct amd_ip_funcs {
 	int (*prepare_suspend)(struct amdgpu_ip_block *ip_block);
 	int (*suspend)(struct amdgpu_ip_block *ip_block);
 	int (*resume)(struct amdgpu_ip_block *ip_block);
+	void (*complete)(struct amdgpu_ip_block *ip_block);
 	bool (*is_idle)(struct amdgpu_ip_block *ip_block);
 	int (*wait_for_idle)(struct amdgpu_ip_block *ip_block);
 	bool (*check_soft_reset)(struct amdgpu_ip_block *ip_block);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/3] drm/amd/display: Stop storing failures into adev->dm.cached_state
  2025-06-02  1:44 [PATCH 0/3] Handle aborted suspend better Mario Limonciello
  2025-06-02  1:44 ` [PATCH 1/3] drm/amd: Add support for a complete pmops action Mario Limonciello
@ 2025-06-02  1:44 ` Mario Limonciello
  2025-06-02  1:44 ` [PATCH 3/3] drm/amd/display: Destroy cached state in complete() callback Mario Limonciello
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Mario Limonciello @ 2025-06-02  1:44 UTC (permalink / raw)
  To: amd-gfx; +Cc: Chris Bainbridge, Mario Limonciello

From: Mario Limonciello <mario.limonciello@amd.com>

If drm_atomic_helper_suspend() has failed for any reason, it's stored
in adev->dm.cached_state.  This isn't expected because the resume
(or complete()) sequence will attempt to use the stored state to
resume.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 25 +++++++++++++------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 1797fa85fac6a..1f5894d8f2fb1 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3060,6 +3060,19 @@ static void hpd_rx_irq_work_suspend(struct amdgpu_display_manager *dm)
 	}
 }
 
+static int dm_cache_state(struct amdgpu_device *adev)
+{
+	int r;
+
+	adev->dm.cached_state = drm_atomic_helper_suspend(adev_to_drm(adev));
+	if (IS_ERR(adev->dm.cached_state)) {
+		r = PTR_ERR(adev->dm.cached_state);
+		adev->dm.cached_state = NULL;
+	}
+
+	return adev->dm.cached_state ? 0 : r;
+}
+
 static int dm_prepare_suspend(struct amdgpu_ip_block *ip_block)
 {
 	struct amdgpu_device *adev = ip_block->adev;
@@ -3068,11 +3081,8 @@ static int dm_prepare_suspend(struct amdgpu_ip_block *ip_block)
 		return 0;
 
 	WARN_ON(adev->dm.cached_state);
-	adev->dm.cached_state = drm_atomic_helper_suspend(adev_to_drm(adev));
-	if (IS_ERR(adev->dm.cached_state))
-		return PTR_ERR(adev->dm.cached_state);
 
-	return 0;
+	return dm_cache_state(adev);
 }
 
 static int dm_suspend(struct amdgpu_ip_block *ip_block)
@@ -3106,9 +3116,10 @@ static int dm_suspend(struct amdgpu_ip_block *ip_block)
 	}
 
 	if (!adev->dm.cached_state) {
-		adev->dm.cached_state = drm_atomic_helper_suspend(adev_to_drm(adev));
-		if (IS_ERR(adev->dm.cached_state))
-			return PTR_ERR(adev->dm.cached_state);
+		int r = dm_cache_state(adev);
+
+		if (r)
+			return r;
 	}
 
 	s3_handle_hdmi_cec(adev_to_drm(adev), true);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/3] drm/amd/display: Destroy cached state in complete() callback
  2025-06-02  1:44 [PATCH 0/3] Handle aborted suspend better Mario Limonciello
  2025-06-02  1:44 ` [PATCH 1/3] drm/amd: Add support for a complete pmops action Mario Limonciello
  2025-06-02  1:44 ` [PATCH 2/3] drm/amd/display: Stop storing failures into adev->dm.cached_state Mario Limonciello
@ 2025-06-02  1:44 ` Mario Limonciello
  2025-06-02 12:22 ` [PATCH 0/3] Handle aborted suspend better Chris Bainbridge
  2025-06-10 20:11 ` Alex Hung
  4 siblings, 0 replies; 6+ messages in thread
From: Mario Limonciello @ 2025-06-02  1:44 UTC (permalink / raw)
  To: amd-gfx; +Cc: Chris Bainbridge, Mario Limonciello

From: Mario Limonciello <mario.limonciello@amd.com>

[Why]
When the suspend sequence has been aborted after prepare() but
before suspend() the resume() callback never gets called. The PM core
will call complete() when this happens. As the state has been cached
in prepare() it needs to be destroyed in complete() if it's still around.

[How]
Create a helper for destroying cached state and call it both in resume()
and complete() callbacks. If resume has been called the state will be
destroyed and it's a no-op for complete().  If resume hasn't been called
(such as an aborted suspend) then destroy the state in complete().

Fixes: 50e0bae34fa6b ("drm/amd/display: Add and use new dm_prepare_suspend() callback")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 100 +++++++++++-------
 1 file changed, 60 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 1f5894d8f2fb1..94e5f07a51462 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3073,6 +3073,64 @@ static int dm_cache_state(struct amdgpu_device *adev)
 	return adev->dm.cached_state ? 0 : r;
 }
 
+static void dm_destroy_cached_state(struct amdgpu_device *adev)
+{
+	struct amdgpu_display_manager *dm = &adev->dm;
+	struct drm_device *ddev = adev_to_drm(adev);
+	struct dm_plane_state *dm_new_plane_state;
+	struct drm_plane_state *new_plane_state;
+	struct dm_crtc_state *dm_new_crtc_state;
+	struct drm_crtc_state *new_crtc_state;
+	struct drm_plane *plane;
+	struct drm_crtc *crtc;
+	int i;
+
+	if (!dm->cached_state)
+		return;
+
+	/* Force mode set in atomic commit */
+	for_each_new_crtc_in_state(dm->cached_state, crtc, new_crtc_state, i) {
+		new_crtc_state->active_changed = true;
+		dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
+		reset_freesync_config_for_crtc(dm_new_crtc_state);
+	}
+
+	/*
+	 * atomic_check is expected to create the dc states. We need to release
+	 * them here, since they were duplicated as part of the suspend
+	 * procedure.
+	 */
+	for_each_new_crtc_in_state(dm->cached_state, crtc, new_crtc_state, i) {
+		dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
+		if (dm_new_crtc_state->stream) {
+			WARN_ON(kref_read(&dm_new_crtc_state->stream->refcount) > 1);
+			dc_stream_release(dm_new_crtc_state->stream);
+			dm_new_crtc_state->stream = NULL;
+		}
+		dm_new_crtc_state->base.color_mgmt_changed = true;
+	}
+
+	for_each_new_plane_in_state(dm->cached_state, plane, new_plane_state, i) {
+		dm_new_plane_state = to_dm_plane_state(new_plane_state);
+		if (dm_new_plane_state->dc_state) {
+			WARN_ON(kref_read(&dm_new_plane_state->dc_state->refcount) > 1);
+			dc_plane_state_release(dm_new_plane_state->dc_state);
+			dm_new_plane_state->dc_state = NULL;
+		}
+	}
+
+	drm_atomic_helper_resume(ddev, dm->cached_state);
+
+	dm->cached_state = NULL;
+}
+
+static void dm_complete(struct amdgpu_ip_block *ip_block)
+{
+	struct amdgpu_device *adev = ip_block->adev;
+
+	dm_destroy_cached_state(adev);
+}
+
 static int dm_prepare_suspend(struct amdgpu_ip_block *ip_block)
 {
 	struct amdgpu_device *adev = ip_block->adev;
@@ -3306,12 +3364,6 @@ static int dm_resume(struct amdgpu_ip_block *ip_block)
 	struct amdgpu_dm_connector *aconnector;
 	struct drm_connector *connector;
 	struct drm_connector_list_iter iter;
-	struct drm_crtc *crtc;
-	struct drm_crtc_state *new_crtc_state;
-	struct dm_crtc_state *dm_new_crtc_state;
-	struct drm_plane *plane;
-	struct drm_plane_state *new_plane_state;
-	struct dm_plane_state *dm_new_plane_state;
 	struct dm_atomic_state *dm_state = to_dm_atomic_state(dm->atomic_obj.state);
 	enum dc_connection_type new_connection_type = dc_connection_none;
 	struct dc_state *dc_state;
@@ -3468,40 +3520,7 @@ static int dm_resume(struct amdgpu_ip_block *ip_block)
 	}
 	drm_connector_list_iter_end(&iter);
 
-	/* Force mode set in atomic commit */
-	for_each_new_crtc_in_state(dm->cached_state, crtc, new_crtc_state, i) {
-		new_crtc_state->active_changed = true;
-		dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
-		reset_freesync_config_for_crtc(dm_new_crtc_state);
-	}
-
-	/*
-	 * atomic_check is expected to create the dc states. We need to release
-	 * them here, since they were duplicated as part of the suspend
-	 * procedure.
-	 */
-	for_each_new_crtc_in_state(dm->cached_state, crtc, new_crtc_state, i) {
-		dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
-		if (dm_new_crtc_state->stream) {
-			WARN_ON(kref_read(&dm_new_crtc_state->stream->refcount) > 1);
-			dc_stream_release(dm_new_crtc_state->stream);
-			dm_new_crtc_state->stream = NULL;
-		}
-		dm_new_crtc_state->base.color_mgmt_changed = true;
-	}
-
-	for_each_new_plane_in_state(dm->cached_state, plane, new_plane_state, i) {
-		dm_new_plane_state = to_dm_plane_state(new_plane_state);
-		if (dm_new_plane_state->dc_state) {
-			WARN_ON(kref_read(&dm_new_plane_state->dc_state->refcount) > 1);
-			dc_plane_state_release(dm_new_plane_state->dc_state);
-			dm_new_plane_state->dc_state = NULL;
-		}
-	}
-
-	drm_atomic_helper_resume(ddev, dm->cached_state);
-
-	dm->cached_state = NULL;
+	dm_destroy_cached_state(adev);
 
 	/* Do mst topology probing after resuming cached state*/
 	drm_connector_list_iter_begin(ddev, &iter);
@@ -3550,6 +3569,7 @@ static const struct amd_ip_funcs amdgpu_dm_funcs = {
 	.prepare_suspend = dm_prepare_suspend,
 	.suspend = dm_suspend,
 	.resume = dm_resume,
+	.complete = dm_complete,
 	.is_idle = dm_is_idle,
 	.wait_for_idle = dm_wait_for_idle,
 	.check_soft_reset = dm_check_soft_reset,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/3] Handle aborted suspend better
  2025-06-02  1:44 [PATCH 0/3] Handle aborted suspend better Mario Limonciello
                   ` (2 preceding siblings ...)
  2025-06-02  1:44 ` [PATCH 3/3] drm/amd/display: Destroy cached state in complete() callback Mario Limonciello
@ 2025-06-02 12:22 ` Chris Bainbridge
  2025-06-10 20:11 ` Alex Hung
  4 siblings, 0 replies; 6+ messages in thread
From: Chris Bainbridge @ 2025-06-02 12:22 UTC (permalink / raw)
  To: Mario Limonciello; +Cc: amd-gfx, Mario Limonciello

On Sun, Jun 01, 2025 at 08:44:29PM -0500, Mario Limonciello wrote:
> From: Mario Limonciello <mario.limonciello@amd.com>
> 
> Chris Bainbridge reported some list corruption occurring around the
> suspend sequence when an aborted suspend occurs.
> 
> I couldn't reproduce this specific problem, but when I tried I found
> some other issues where the cached DM state isn't properly destroyed.
> 
> This is because there isn't a complete() callback to match the prepare()
> callback used by amdgpu. Normally the PM core will call complete() after
> every suspend attempt (succesful or not).
> 
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/4280
> 
> Mario Limonciello (3):
>   drm/amd: Add support for a complete pmops action
>   drm/amd/display: Stop storing failures into adev->dm.cached_state
>   drm/amd/display: Destroy cached state in complete() callback
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  22 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |   2 +-
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 125 +++++++++++-------
>  drivers/gpu/drm/amd/include/amd_shared.h      |   1 +
>  5 files changed, 103 insertions(+), 48 deletions(-)
> 
> -- 
> 2.43.0
> 

I tested with 30 suspends and the dm_prepare_suspend /
amdgpu_device_prepare error did not appear. The list corruption error
remain but that bisects to:

aa7a9275ab81 ("PM: sleep: Suspend async parents after suspending children").

I applied your patch series to the parent of that commit, tested, and
there were no errors. So this issue looks fixed but the other issue
remains, email sent: https://lore.kernel.org/all/aD2U3VIhf8vDkl09@debian.local/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/3] Handle aborted suspend better
  2025-06-02  1:44 [PATCH 0/3] Handle aborted suspend better Mario Limonciello
                   ` (3 preceding siblings ...)
  2025-06-02 12:22 ` [PATCH 0/3] Handle aborted suspend better Chris Bainbridge
@ 2025-06-10 20:11 ` Alex Hung
  4 siblings, 0 replies; 6+ messages in thread
From: Alex Hung @ 2025-06-10 20:11 UTC (permalink / raw)
  To: Mario Limonciello, amd-gfx; +Cc: Chris Bainbridge, Mario Limonciello

Reviewed-by: Alex Hung <alex.hung@amd.com>

This patchset passed promotion tests along with DC 3.2.337 too.

On 6/1/25 19:44, Mario Limonciello wrote:
> From: Mario Limonciello <mario.limonciello@amd.com>
> 
> Chris Bainbridge reported some list corruption occurring around the
> suspend sequence when an aborted suspend occurs.
> 
> I couldn't reproduce this specific problem, but when I tried I found
> some other issues where the cached DM state isn't properly destroyed.
> 
> This is because there isn't a complete() callback to match the prepare()
> callback used by amdgpu. Normally the PM core will call complete() after
> every suspend attempt (succesful or not).
> 
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/4280
> 
> Mario Limonciello (3):
>    drm/amd: Add support for a complete pmops action
>    drm/amd/display: Stop storing failures into adev->dm.cached_state
>    drm/amd/display: Destroy cached state in complete() callback
> 
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  22 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |   2 +-
>   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 125 +++++++++++-------
>   drivers/gpu/drm/amd/include/amd_shared.h      |   1 +
>   5 files changed, 103 insertions(+), 48 deletions(-)
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-06-10 20:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-02  1:44 [PATCH 0/3] Handle aborted suspend better Mario Limonciello
2025-06-02  1:44 ` [PATCH 1/3] drm/amd: Add support for a complete pmops action Mario Limonciello
2025-06-02  1:44 ` [PATCH 2/3] drm/amd/display: Stop storing failures into adev->dm.cached_state Mario Limonciello
2025-06-02  1:44 ` [PATCH 3/3] drm/amd/display: Destroy cached state in complete() callback Mario Limonciello
2025-06-02 12:22 ` [PATCH 0/3] Handle aborted suspend better Chris Bainbridge
2025-06-10 20:11 ` Alex Hung

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.