[RFC 0/3] Fixups for cancelled hibernate

public inbox for linux-pm@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC 0/3] Fixups for cancelled hibernate
@ 2025-10-20 16:50 Mario Limonciello (AMD)
  2025-10-20 16:50 ` [RFC 1/3] PM: Mark device as suspended if it failed to resume Mario Limonciello (AMD)
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Mario Limonciello (AMD) @ 2025-10-20 16:50 UTC (permalink / raw)
  To: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, rafael, simona
  Cc: Mario Limonciello (AMD), Muhammad Usama Anjum, amd-gfx, dri-devel,
	linux-pm

Muhammad Usama Anjun's recent series for being able to cancel
the hibernate sequence [1] exposes a bug with amdgpu handling for
skipping the thaw step.

Because the thaw step is skipped in most cases, cancelling the
hibernate means that the device is left in an inconsistent
state.

To fix this allow devices to return -EBUSY and adjust the PM
core handling of this case.

Link: https://lore.kernel.org/all/20251018142114.897445-1-usama.anjum@collabora.com/ [1]
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Mario Limonciello (3):
  PM: Mark device as suspended if it failed to resume
  PM: Don't pass up device_resume() -EBUSY errors
  drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success

 drivers/base/power/main.c               | 7 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC 1/3] PM: Mark device as suspended if it failed to resume
  2025-10-20 16:50 [RFC 0/3] Fixups for cancelled hibernate Mario Limonciello (AMD)
@ 2025-10-20 16:50 ` Mario Limonciello (AMD)
  2025-10-20 16:58   ` Muhammad Usama Anjum
  2025-10-20 16:50 ` [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors Mario Limonciello (AMD)
  2025-10-20 16:50 ` [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success Mario Limonciello (AMD)
  2 siblings, 1 reply; 23+ messages in thread
From: Mario Limonciello (AMD) @ 2025-10-20 16:50 UTC (permalink / raw)
  To: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, rafael, simona
  Cc: Muhammad Usama Anjum, Mario Limonciello, amd-gfx, dri-devel,
	linux-pm

From: Mario Limonciello <mario.limonciello@amd.com>

If a device failed to resume the PM core treats it as though it
succeeded.  This could cause state machine problems.

Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
---
 drivers/base/power/main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index e83503bdc1fdb..bf9c3d79c455f 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1104,6 +1104,9 @@ static void device_resume(struct device *dev, pm_message_t state, bool async)
 	device_unlock(dev);
 	dpm_watchdog_clear(&wd);
 
+	if (error)
+		dev->power.is_suspended = true;
+
  Complete:
 	complete_all(&dev->power.completion);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors
  2025-10-20 16:50 [RFC 0/3] Fixups for cancelled hibernate Mario Limonciello (AMD)
  2025-10-20 16:50 ` [RFC 1/3] PM: Mark device as suspended if it failed to resume Mario Limonciello (AMD)
@ 2025-10-20 16:50 ` Mario Limonciello (AMD)
  2025-10-20 16:58   ` Muhammad Usama Anjum
  2025-10-20 17:18   ` Rafael J. Wysocki
  2025-10-20 16:50 ` [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success Mario Limonciello (AMD)
  2 siblings, 2 replies; 23+ messages in thread
From: Mario Limonciello (AMD) @ 2025-10-20 16:50 UTC (permalink / raw)
  To: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, rafael, simona
  Cc: Muhammad Usama Anjum, Mario Limonciello, amd-gfx, dri-devel,
	linux-pm

From: Mario Limonciello <mario.limonciello@amd.com>

If a device resume returns -EBUSY the device resume sequence has
been skipped. Don't show errors for this or pass it up to async
resume.  If resume is run again in another stage the device should
try again.

Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
---
 drivers/base/power/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index bf9c3d79c455f..f6bc7ef9a8371 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1112,7 +1112,9 @@ static void device_resume(struct device *dev, pm_message_t state, bool async)
 
 	TRACE_RESUME(error);
 
-	if (error) {
+	if (error == -EBUSY)
+		pm_dev_dbg(dev, state, async ? " async" : "");
+	else if (error) {
 		WRITE_ONCE(async_error, error);
 		dpm_save_failed_dev(dev_name(dev));
 		pm_dev_err(dev, state, async ? " async" : "", error);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 16:50 [RFC 0/3] Fixups for cancelled hibernate Mario Limonciello (AMD)
  2025-10-20 16:50 ` [RFC 1/3] PM: Mark device as suspended if it failed to resume Mario Limonciello (AMD)
  2025-10-20 16:50 ` [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors Mario Limonciello (AMD)
@ 2025-10-20 16:50 ` Mario Limonciello (AMD)
  2025-10-20 16:59   ` Muhammad Usama Anjum
  2025-10-20 17:21   ` Rafael J. Wysocki
  2 siblings, 2 replies; 23+ messages in thread
From: Mario Limonciello (AMD) @ 2025-10-20 16:50 UTC (permalink / raw)
  To: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, rafael, simona
  Cc: Muhammad Usama Anjum, amd-gfx, dri-devel, linux-pm

From: Mario Limonciello <mario.limonciello@amd.com>

The PM core should be notified that thaw was skipped for the device
so that if it's tried to be resumed (such as an aborted hibernate)
that it gets another chance to resume.

Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 61268aa82df4d..d40af069f24dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
 
 	/* do not resume device if it's normal hibernation */
 	if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
-		return 0;
+		return -EBUSY;
 
 	return amdgpu_device_resume(drm_dev, true);
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC 1/3] PM: Mark device as suspended if it failed to resume
  2025-10-20 16:50 ` [RFC 1/3] PM: Mark device as suspended if it failed to resume Mario Limonciello (AMD)
@ 2025-10-20 16:58   ` Muhammad Usama Anjum
  2025-10-20 17:15     ` Rafael J. Wysocki
  0 siblings, 1 reply; 23+ messages in thread
From: Muhammad Usama Anjum @ 2025-10-20 16:58 UTC (permalink / raw)
  To: Mario Limonciello (AMD), mario.limonciello, airlied,
	alexander.deucher, christian.koenig, dakr, gregkh, lenb, pavel,
	rafael, simona
  Cc: usama.anjum, amd-gfx, dri-devel, linux-pm

On 10/20/25 9:50 PM, Mario Limonciello (AMD) wrote:
> From: Mario Limonciello <mario.limonciello@amd.com>
> 
> If a device failed to resume the PM core treats it as though it
> succeeded.  This could cause state machine problems.
> 
> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>

> ---
>  drivers/base/power/main.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index e83503bdc1fdb..bf9c3d79c455f 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -1104,6 +1104,9 @@ static void device_resume(struct device *dev, pm_message_t state, bool async)
>  	device_unlock(dev);
>  	dpm_watchdog_clear(&wd);
>  
> +	if (error)
> +		dev->power.is_suspended = true;
> +
>   Complete:
>  	complete_all(&dev->power.completion);
>  


-- 
---
Thanks,
Usama

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors
  2025-10-20 16:50 ` [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors Mario Limonciello (AMD)
@ 2025-10-20 16:58   ` Muhammad Usama Anjum
  2025-10-20 17:18   ` Rafael J. Wysocki
  1 sibling, 0 replies; 23+ messages in thread
From: Muhammad Usama Anjum @ 2025-10-20 16:58 UTC (permalink / raw)
  To: Mario Limonciello (AMD), mario.limonciello, airlied,
	alexander.deucher, christian.koenig, dakr, gregkh, lenb, pavel,
	rafael, simona
  Cc: usama.anjum, amd-gfx, dri-devel, linux-pm

On 10/20/25 9:50 PM, Mario Limonciello (AMD) wrote:
> From: Mario Limonciello <mario.limonciello@amd.com>
> 
> If a device resume returns -EBUSY the device resume sequence has
> been skipped. Don't show errors for this or pass it up to async
> resume.  If resume is run again in another stage the device should
> try again.
> 
> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>

> ---
>  drivers/base/power/main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index bf9c3d79c455f..f6bc7ef9a8371 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -1112,7 +1112,9 @@ static void device_resume(struct device *dev, pm_message_t state, bool async)
>  
>  	TRACE_RESUME(error);
>  
> -	if (error) {
> +	if (error == -EBUSY)
> +		pm_dev_dbg(dev, state, async ? " async" : "");
> +	else if (error) {
>  		WRITE_ONCE(async_error, error);
>  		dpm_save_failed_dev(dev_name(dev));
>  		pm_dev_err(dev, state, async ? " async" : "", error);


-- 
---
Thanks,
Usama

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 16:50 ` [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success Mario Limonciello (AMD)
@ 2025-10-20 16:59   ` Muhammad Usama Anjum
  2025-10-20 17:21   ` Rafael J. Wysocki
  1 sibling, 0 replies; 23+ messages in thread
From: Muhammad Usama Anjum @ 2025-10-20 16:59 UTC (permalink / raw)
  To: Mario Limonciello (AMD), mario.limonciello, airlied,
	alexander.deucher, christian.koenig, dakr, gregkh, lenb, pavel,
	rafael, simona
  Cc: usama.anjum, amd-gfx, dri-devel, linux-pm

On 10/20/25 9:50 PM, Mario Limonciello (AMD) wrote:
> From: Mario Limonciello <mario.limonciello@amd.com>
> 
> The PM core should be notified that thaw was skipped for the device
> so that if it's tried to be resumed (such as an aborted hibernate)
> that it gets another chance to resume.
> 
> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 61268aa82df4d..d40af069f24dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>  
>  	/* do not resume device if it's normal hibernation */
>  	if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> -		return 0;
> +		return -EBUSY;
>  
>  	return amdgpu_device_resume(drm_dev, true);
>  }


-- 
---
Thanks,
Usama

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 1/3] PM: Mark device as suspended if it failed to resume
  2025-10-20 16:58   ` Muhammad Usama Anjum
@ 2025-10-20 17:15     ` Rafael J. Wysocki
  0 siblings, 0 replies; 23+ messages in thread
From: Rafael J. Wysocki @ 2025-10-20 17:15 UTC (permalink / raw)
  To: Muhammad Usama Anjum
  Cc: Mario Limonciello (AMD), mario.limonciello, airlied,
	alexander.deucher, christian.koenig, dakr, gregkh, lenb, pavel,
	rafael, simona, amd-gfx, dri-devel, linux-pm

On Mon, Oct 20, 2025 at 6:59 PM Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> On 10/20/25 9:50 PM, Mario Limonciello (AMD) wrote:
> > From: Mario Limonciello <mario.limonciello@amd.com>
> >
> > If a device failed to resume the PM core treats it as though it
> > succeeded.  This could cause state machine problems.

You need to be very specific here as this is a significant change in behavior.

> >
> > Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> > Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>
> > ---
> >  drivers/base/power/main.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> > index e83503bdc1fdb..bf9c3d79c455f 100644
> > --- a/drivers/base/power/main.c
> > +++ b/drivers/base/power/main.c
> > @@ -1104,6 +1104,9 @@ static void device_resume(struct device *dev, pm_message_t state, bool async)
> >       device_unlock(dev);
> >       dpm_watchdog_clear(&wd);
> >
> > +     if (error)
> > +             dev->power.is_suspended = true;
> > +
> >   Complete:
> >       complete_all(&dev->power.completion);
> >
>
>
> --
> ---
> Thanks,
> Usama

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors
  2025-10-20 16:50 ` [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors Mario Limonciello (AMD)
  2025-10-20 16:58   ` Muhammad Usama Anjum
@ 2025-10-20 17:18   ` Rafael J. Wysocki
  2025-10-20 17:24     ` Mario Limonciello (AMD) (kernel.org)
  1 sibling, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2025-10-20 17:18 UTC (permalink / raw)
  To: Mario Limonciello (AMD)
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, rafael, simona, Muhammad Usama Anjum,
	amd-gfx, dri-devel, linux-pm

On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
<superm1@kernel.org> wrote:
>
> From: Mario Limonciello <mario.limonciello@amd.com>
>
> If a device resume returns -EBUSY the device resume sequence has
> been skipped.

Is this actually demonstrably true in all of the cases?

And what about -EAGAIN?

> Don't show errors for this or pass it up to async
> resume.  If resume is run again in another stage the device should
> try again.
>
> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
> ---
>  drivers/base/power/main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index bf9c3d79c455f..f6bc7ef9a8371 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -1112,7 +1112,9 @@ static void device_resume(struct device *dev, pm_message_t state, bool async)
>
>         TRACE_RESUME(error);
>
> -       if (error) {
> +       if (error == -EBUSY)
> +               pm_dev_dbg(dev, state, async ? " async" : "");
> +       else if (error) {
>                 WRITE_ONCE(async_error, error);
>                 dpm_save_failed_dev(dev_name(dev));
>                 pm_dev_err(dev, state, async ? " async" : "", error);
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 16:50 ` [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success Mario Limonciello (AMD)
  2025-10-20 16:59   ` Muhammad Usama Anjum
@ 2025-10-20 17:21   ` Rafael J. Wysocki
  2025-10-20 17:28     ` Mario Limonciello (AMD) (kernel.org)
  1 sibling, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2025-10-20 17:21 UTC (permalink / raw)
  To: Mario Limonciello (AMD)
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, rafael, simona, Muhammad Usama Anjum,
	amd-gfx, dri-devel, linux-pm

On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
<superm1@kernel.org> wrote:
>
> From: Mario Limonciello <mario.limonciello@amd.com>
>
> The PM core should be notified that thaw was skipped for the device
> so that if it's tried to be resumed (such as an aborted hibernate)
> that it gets another chance to resume.
>
> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 61268aa82df4d..d40af069f24dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>
>         /* do not resume device if it's normal hibernation */
>         if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> -               return 0;
> +               return -EBUSY;

So that's why you need the special handling of -EBUSY in the previous patch.

I think that you need to save some state in this driver and then use
it in subsequent callbacks instead of hacking the core to do what you
want.

>         return amdgpu_device_resume(drm_dev, true);
>  }
> --

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors
  2025-10-20 17:18   ` Rafael J. Wysocki
@ 2025-10-20 17:24     ` Mario Limonciello (AMD) (kernel.org)
  0 siblings, 0 replies; 23+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2025-10-20 17:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, simona, Muhammad Usama Anjum, amd-gfx,
	dri-devel, linux-pm



On 10/20/2025 12:18 PM, Rafael J. Wysocki wrote:
> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
> <superm1@kernel.org> wrote:
>>
>> From: Mario Limonciello <mario.limonciello@amd.com>
>>
>> If a device resume returns -EBUSY the device resume sequence has
>> been skipped.
> 
> Is this actually demonstrably true in all of the cases?
> 
> And what about -EAGAIN?
> 

I haven't audited codepaths of all drivers to guarantee it to be true 
for all cases.  That's the main reason I wanted to make it RFC - to 
discuss the idea of a dedicated return code to indicate it was skipped.

Another idea I had is that we could make it return a positive number, 
and PM core could recognize that as a skip.

So would like your thoughts against the ideas currently presented:

* -EAGAIN
* -EBUSY
* Some other return code
* > 0

Whichever is decided the PM core documentation would need to be updated 
to match as well.

>> Don't show errors for this or pass it up to async
>> resume.  If resume is run again in another stage the device should
>> try again.
>>
>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
>> ---
>>   drivers/base/power/main.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
>> index bf9c3d79c455f..f6bc7ef9a8371 100644
>> --- a/drivers/base/power/main.c
>> +++ b/drivers/base/power/main.c
>> @@ -1112,7 +1112,9 @@ static void device_resume(struct device *dev, pm_message_t state, bool async)
>>
>>          TRACE_RESUME(error);
>>
>> -       if (error) {
>> +       if (error == -EBUSY)
>> +               pm_dev_dbg(dev, state, async ? " async" : "");
>> +       else if (error) {
>>                  WRITE_ONCE(async_error, error);
>>                  dpm_save_failed_dev(dev_name(dev));
>>                  pm_dev_err(dev, state, async ? " async" : "", error);
>> --
>> 2.43.0
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 17:21   ` Rafael J. Wysocki
@ 2025-10-20 17:28     ` Mario Limonciello (AMD) (kernel.org)
  2025-10-20 17:39       ` Rafael J. Wysocki
  0 siblings, 1 reply; 23+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2025-10-20 17:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, simona, Muhammad Usama Anjum, amd-gfx,
	dri-devel, linux-pm



On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
> <superm1@kernel.org> wrote:
>>
>> From: Mario Limonciello <mario.limonciello@amd.com>
>>
>> The PM core should be notified that thaw was skipped for the device
>> so that if it's tried to be resumed (such as an aborted hibernate)
>> that it gets another chance to resume.
>>
>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 61268aa82df4d..d40af069f24dd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>
>>          /* do not resume device if it's normal hibernation */
>>          if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>> -               return 0;
>> +               return -EBUSY;
> 
> So that's why you need the special handling of -EBUSY in the previous patch.

Yup.

> 
> I think that you need to save some state in this driver and then use
> it in subsequent callbacks instead of hacking the core to do what you
> want.
> 

The problem is the core decides "what" to call and more importantly 
"when" to call it.

IE if the core thinks that something is thawed it will never call 
resume, and that's why you end up in a bad place with Muhammad's 
cancellation series and why I proposed this one to discuss.

We could obviously go back to dropping this case entirely:

if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())

But then the display turns on at thaw(), you do an unnecessary resource 
eviction, it takes a lot longer if you have a ton of VRAM etc.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 17:28     ` Mario Limonciello (AMD) (kernel.org)
@ 2025-10-20 17:39       ` Rafael J. Wysocki
  2025-10-20 18:32         ` Mario Limonciello (AMD) (kernel.org)
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2025-10-20 17:39 UTC (permalink / raw)
  To: Mario Limonciello (AMD) (kernel.org)
  Cc: Rafael J. Wysocki, mario.limonciello, airlied, alexander.deucher,
	christian.koenig, dakr, gregkh, lenb, pavel, simona,
	Muhammad Usama Anjum, amd-gfx, dri-devel, linux-pm

On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
<superm1@kernel.org> wrote:
>
>
>
> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
> > On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
> > <superm1@kernel.org> wrote:
> >>
> >> From: Mario Limonciello <mario.limonciello@amd.com>
> >>
> >> The PM core should be notified that thaw was skipped for the device
> >> so that if it's tried to be resumed (such as an aborted hibernate)
> >> that it gets another chance to resume.
> >>
> >> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> >> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> index 61268aa82df4d..d40af069f24dd 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
> >>
> >>          /* do not resume device if it's normal hibernation */
> >>          if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >> -               return 0;
> >> +               return -EBUSY;
> >
> > So that's why you need the special handling of -EBUSY in the previous patch.
>
> Yup.
>
> >
> > I think that you need to save some state in this driver and then use
> > it in subsequent callbacks instead of hacking the core to do what you
> > want.
> >
>
> The problem is the core decides "what" to call and more importantly
> "when" to call it.
>
> IE if the core thinks that something is thawed it will never call
> resume, and that's why you end up in a bad place with Muhammad's
> cancellation series and why I proposed this one to discuss.
>
> We could obviously go back to dropping this case entirely:
>
> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>
> But then the display turns on at thaw(), you do an unnecessary resource
> eviction, it takes a lot longer if you have a ton of VRAM etc.

The cancellation series is at odds with this code path AFAICS because
what if hibernation is canceled after the entire thaw transition?

Some cleanup would need to be done before thawing user space I suppose.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 17:39       ` Rafael J. Wysocki
@ 2025-10-20 18:32         ` Mario Limonciello (AMD) (kernel.org)
  2025-10-20 18:50           ` Rafael J. Wysocki
  2025-10-21 14:12           ` Muhammad Usama Anjum
  0 siblings, 2 replies; 23+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2025-10-20 18:32 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, simona, Muhammad Usama Anjum, amd-gfx,
	dri-devel, linux-pm



On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
> <superm1@kernel.org> wrote:
>>
>>
>>
>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
>>> <superm1@kernel.org> wrote:
>>>>
>>>> From: Mario Limonciello <mario.limonciello@amd.com>
>>>>
>>>> The PM core should be notified that thaw was skipped for the device
>>>> so that if it's tried to be resumed (such as an aborted hibernate)
>>>> that it gets another chance to resume.
>>>>
>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 61268aa82df4d..d40af069f24dd 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>>>
>>>>           /* do not resume device if it's normal hibernation */
>>>>           if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>> -               return 0;
>>>> +               return -EBUSY;
>>>
>>> So that's why you need the special handling of -EBUSY in the previous patch.
>>
>> Yup.
>>
>>>
>>> I think that you need to save some state in this driver and then use
>>> it in subsequent callbacks instead of hacking the core to do what you
>>> want.
>>>
>>
>> The problem is the core decides "what" to call and more importantly
>> "when" to call it.
>>
>> IE if the core thinks that something is thawed it will never call
>> resume, and that's why you end up in a bad place with Muhammad's
>> cancellation series and why I proposed this one to discuss.
>>
>> We could obviously go back to dropping this case entirely:
>>
>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>
>> But then the display turns on at thaw(), you do an unnecessary resource
>> eviction, it takes a lot longer if you have a ton of VRAM etc.
> 
> The cancellation series is at odds with this code path AFAICS because
> what if hibernation is canceled after the entire thaw transition?

Muhammad - did you test that specific timing of cancelling the hibernate?
> 
> Some cleanup would need to be done before thawing user space I suppose.

I agree; I think that series would need changes for it.

But if you put that series aside, I think this one still has some merit 
on it's own.  If another driver aborted the hibernate, I think the same 
thing could happen if it happened to run before amdgpu's device thaw().

That series just exposed a very "easy" way to reproduce this issue.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 18:32         ` Mario Limonciello (AMD) (kernel.org)
@ 2025-10-20 18:50           ` Rafael J. Wysocki
  2025-10-20 19:14             ` Mario Limonciello (AMD) (kernel.org)
  2025-10-21 14:12           ` Muhammad Usama Anjum
  1 sibling, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2025-10-20 18:50 UTC (permalink / raw)
  To: Mario Limonciello (AMD) (kernel.org)
  Cc: Rafael J. Wysocki, mario.limonciello, airlied, alexander.deucher,
	christian.koenig, dakr, gregkh, lenb, pavel, simona,
	Muhammad Usama Anjum, amd-gfx, dri-devel, linux-pm

On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
<superm1@kernel.org> wrote:
>
>
>
> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
> > On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
> > <superm1@kernel.org> wrote:
> >>
> >>
> >>
> >> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
> >>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
> >>> <superm1@kernel.org> wrote:
> >>>>
> >>>> From: Mario Limonciello <mario.limonciello@amd.com>
> >>>>
> >>>> The PM core should be notified that thaw was skipped for the device
> >>>> so that if it's tried to be resumed (such as an aborted hibernate)
> >>>> that it gets another chance to resume.
> >>>>
> >>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> >>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
> >>>>    1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> index 61268aa82df4d..d40af069f24dd 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
> >>>>
> >>>>           /* do not resume device if it's normal hibernation */
> >>>>           if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>>> -               return 0;
> >>>> +               return -EBUSY;
> >>>
> >>> So that's why you need the special handling of -EBUSY in the previous patch.
> >>
> >> Yup.
> >>
> >>>
> >>> I think that you need to save some state in this driver and then use
> >>> it in subsequent callbacks instead of hacking the core to do what you
> >>> want.
> >>>
> >>
> >> The problem is the core decides "what" to call and more importantly
> >> "when" to call it.
> >>
> >> IE if the core thinks that something is thawed it will never call
> >> resume, and that's why you end up in a bad place with Muhammad's
> >> cancellation series and why I proposed this one to discuss.
> >>
> >> We could obviously go back to dropping this case entirely:
> >>
> >> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>
> >> But then the display turns on at thaw(), you do an unnecessary resource
> >> eviction, it takes a lot longer if you have a ton of VRAM etc.
> >
> > The cancellation series is at odds with this code path AFAICS because
> > what if hibernation is canceled after the entire thaw transition?
>
> Muhammad - did you test that specific timing of cancelling the hibernate?
> >
> > Some cleanup would need to be done before thawing user space I suppose.
>
> I agree; I think that series would need changes for it.
>
> But if you put that series aside, I think this one still has some merit
> on it's own.  If another driver aborted the hibernate, I think the same
> thing could happen if it happened to run before amdgpu's device thaw().
>
> That series just exposed a very "easy" way to reproduce this issue.

Device thaw errors don't abort anything AFAICS.

What can happen though is that another device may abort the final
"power off" transition, which is one of the reasons why I think that
rolling it back is generally hard.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 18:50           ` Rafael J. Wysocki
@ 2025-10-20 19:14             ` Mario Limonciello (AMD) (kernel.org)
  2025-10-20 19:18               ` Rafael J. Wysocki
  0 siblings, 1 reply; 23+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2025-10-20 19:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, simona, Muhammad Usama Anjum, amd-gfx,
	dri-devel, linux-pm



On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
> <superm1@kernel.org> wrote:
>>
>>
>>
>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
>>> <superm1@kernel.org> wrote:
>>>>
>>>>
>>>>
>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
>>>>> <superm1@kernel.org> wrote:
>>>>>>
>>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>
>>>>>> The PM core should be notified that thaw was skipped for the device
>>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
>>>>>> that it gets another chance to resume.
>>>>>>
>>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> index 61268aa82df4d..d40af069f24dd 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>>>>>
>>>>>>            /* do not resume device if it's normal hibernation */
>>>>>>            if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>> -               return 0;
>>>>>> +               return -EBUSY;
>>>>>
>>>>> So that's why you need the special handling of -EBUSY in the previous patch.
>>>>
>>>> Yup.
>>>>
>>>>>
>>>>> I think that you need to save some state in this driver and then use
>>>>> it in subsequent callbacks instead of hacking the core to do what you
>>>>> want.
>>>>>
>>>>
>>>> The problem is the core decides "what" to call and more importantly
>>>> "when" to call it.
>>>>
>>>> IE if the core thinks that something is thawed it will never call
>>>> resume, and that's why you end up in a bad place with Muhammad's
>>>> cancellation series and why I proposed this one to discuss.
>>>>
>>>> We could obviously go back to dropping this case entirely:
>>>>
>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>
>>>> But then the display turns on at thaw(), you do an unnecessary resource
>>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
>>>
>>> The cancellation series is at odds with this code path AFAICS because
>>> what if hibernation is canceled after the entire thaw transition?
>>
>> Muhammad - did you test that specific timing of cancelling the hibernate?
>>>
>>> Some cleanup would need to be done before thawing user space I suppose.
>>
>> I agree; I think that series would need changes for it.
>>
>> But if you put that series aside, I think this one still has some merit
>> on it's own.  If another driver aborted the hibernate, I think the same
>> thing could happen if it happened to run before amdgpu's device thaw().
>>
>> That series just exposed a very "easy" way to reproduce this issue.
> 
> Device thaw errors don't abort anything AFAICS.
> 

You're right; it doesn't abort, it just is saved to the logs.
The state is also not maintained.
> What can happen though is that another device may abort the final
> "power off" transition, which is one of the reasons why I think that
> rolling it back is generally hard.

That's exactly the reason for the first patch in this series.  The state 
of whether it succeeded isn't recorded.  So if thaw non-fatally fails 
and you've saved state to indicate this then any of the other calls that 
run can try again.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 19:14             ` Mario Limonciello (AMD) (kernel.org)
@ 2025-10-20 19:18               ` Rafael J. Wysocki
  2025-10-20 19:34                 ` Mario Limonciello (AMD) (kernel.org)
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2025-10-20 19:18 UTC (permalink / raw)
  To: Mario Limonciello (AMD) (kernel.org)
  Cc: Rafael J. Wysocki, mario.limonciello, airlied, alexander.deucher,
	christian.koenig, dakr, gregkh, lenb, pavel, simona,
	Muhammad Usama Anjum, amd-gfx, dri-devel, linux-pm

On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org)
<superm1@kernel.org> wrote:
>
>
>
> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
> > On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
> > <superm1@kernel.org> wrote:
> >>
> >>
> >>
> >> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
> >>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
> >>> <superm1@kernel.org> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
> >>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
> >>>>> <superm1@kernel.org> wrote:
> >>>>>>
> >>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
> >>>>>>
> >>>>>> The PM core should be notified that thaw was skipped for the device
> >>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
> >>>>>> that it gets another chance to resume.
> >>>>>>
> >>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> >>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
> >>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>> index 61268aa82df4d..d40af069f24dd 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
> >>>>>>
> >>>>>>            /* do not resume device if it's normal hibernation */
> >>>>>>            if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>>>>> -               return 0;
> >>>>>> +               return -EBUSY;
> >>>>>
> >>>>> So that's why you need the special handling of -EBUSY in the previous patch.
> >>>>
> >>>> Yup.
> >>>>
> >>>>>
> >>>>> I think that you need to save some state in this driver and then use
> >>>>> it in subsequent callbacks instead of hacking the core to do what you
> >>>>> want.
> >>>>>
> >>>>
> >>>> The problem is the core decides "what" to call and more importantly
> >>>> "when" to call it.
> >>>>
> >>>> IE if the core thinks that something is thawed it will never call
> >>>> resume, and that's why you end up in a bad place with Muhammad's
> >>>> cancellation series and why I proposed this one to discuss.
> >>>>
> >>>> We could obviously go back to dropping this case entirely:
> >>>>
> >>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>>>
> >>>> But then the display turns on at thaw(), you do an unnecessary resource
> >>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
> >>>
> >>> The cancellation series is at odds with this code path AFAICS because
> >>> what if hibernation is canceled after the entire thaw transition?
> >>
> >> Muhammad - did you test that specific timing of cancelling the hibernate?
> >>>
> >>> Some cleanup would need to be done before thawing user space I suppose.
> >>
> >> I agree; I think that series would need changes for it.
> >>
> >> But if you put that series aside, I think this one still has some merit
> >> on it's own.  If another driver aborted the hibernate, I think the same
> >> thing could happen if it happened to run before amdgpu's device thaw().
> >>
> >> That series just exposed a very "easy" way to reproduce this issue.
> >
> > Device thaw errors don't abort anything AFAICS.
> >
>
> You're right; it doesn't abort, it just is saved to the logs.
> The state is also not maintained.
> > What can happen though is that another device may abort the final
> > "power off" transition, which is one of the reasons why I think that
> > rolling it back is generally hard.
>
> That's exactly the reason for the first patch in this series.  The state
> of whether it succeeded isn't recorded.  So if thaw non-fatally fails
> and you've saved state to indicate this then any of the other calls that
> run can try again.

So long as they are called.

But as I said before, I would save the state in the driver thaw
callback and then clear it in the driver poweroff callback and look at
it in the driver restore callback.  If it is there at that point,
poweroff has not run and hibernation is rolling back, so you need to
do a "thaw".

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 19:18               ` Rafael J. Wysocki
@ 2025-10-20 19:34                 ` Mario Limonciello (AMD) (kernel.org)
  2025-10-20 19:55                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 23+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2025-10-20 19:34 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, simona, Muhammad Usama Anjum, amd-gfx,
	dri-devel, linux-pm



On 10/20/2025 2:18 PM, Rafael J. Wysocki wrote:
> On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org)
> <superm1@kernel.org> wrote:
>>
>>
>>
>> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
>>> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
>>> <superm1@kernel.org> wrote:
>>>>
>>>>
>>>>
>>>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
>>>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
>>>>> <superm1@kernel.org> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
>>>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
>>>>>>> <superm1@kernel.org> wrote:
>>>>>>>>
>>>>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>>
>>>>>>>> The PM core should be notified that thaw was skipped for the device
>>>>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
>>>>>>>> that it gets another chance to resume.
>>>>>>>>
>>>>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> index 61268aa82df4d..d40af069f24dd 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>>>>>>>
>>>>>>>>             /* do not resume device if it's normal hibernation */
>>>>>>>>             if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>>>> -               return 0;
>>>>>>>> +               return -EBUSY;
>>>>>>>
>>>>>>> So that's why you need the special handling of -EBUSY in the previous patch.
>>>>>>
>>>>>> Yup.
>>>>>>
>>>>>>>
>>>>>>> I think that you need to save some state in this driver and then use
>>>>>>> it in subsequent callbacks instead of hacking the core to do what you
>>>>>>> want.
>>>>>>>
>>>>>>
>>>>>> The problem is the core decides "what" to call and more importantly
>>>>>> "when" to call it.
>>>>>>
>>>>>> IE if the core thinks that something is thawed it will never call
>>>>>> resume, and that's why you end up in a bad place with Muhammad's
>>>>>> cancellation series and why I proposed this one to discuss.
>>>>>>
>>>>>> We could obviously go back to dropping this case entirely:
>>>>>>
>>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>>
>>>>>> But then the display turns on at thaw(), you do an unnecessary resource
>>>>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
>>>>>
>>>>> The cancellation series is at odds with this code path AFAICS because
>>>>> what if hibernation is canceled after the entire thaw transition?
>>>>
>>>> Muhammad - did you test that specific timing of cancelling the hibernate?
>>>>>
>>>>> Some cleanup would need to be done before thawing user space I suppose.
>>>>
>>>> I agree; I think that series would need changes for it.
>>>>
>>>> But if you put that series aside, I think this one still has some merit
>>>> on it's own.  If another driver aborted the hibernate, I think the same
>>>> thing could happen if it happened to run before amdgpu's device thaw().
>>>>
>>>> That series just exposed a very "easy" way to reproduce this issue.
>>>
>>> Device thaw errors don't abort anything AFAICS.
>>>
>>
>> You're right; it doesn't abort, it just is saved to the logs.
>> The state is also not maintained.
>>> What can happen though is that another device may abort the final
>>> "power off" transition, which is one of the reasons why I think that
>>> rolling it back is generally hard.
>>
>> That's exactly the reason for the first patch in this series.  The state
>> of whether it succeeded isn't recorded.  So if thaw non-fatally fails
>> and you've saved state to indicate this then any of the other calls that
>> run can try again.
> 
> So long as they are called.
> 
> But as I said before, I would save the state in the driver thaw
> callback and then clear it in the driver poweroff callback and look at
> it in the driver restore callback.  If it is there at that point,
> poweroff has not run and hibernation is rolling back, so you need to
> do a "thaw".

Are you suggesting that the device driver should directly manipulate 
dev->power.is_suspended?

I'll do some testing but; I suppose that would work as well without 
needing to make core changes if you don't see a need for other devices 
to do this.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 19:34                 ` Mario Limonciello (AMD) (kernel.org)
@ 2025-10-20 19:55                   ` Rafael J. Wysocki
  2025-10-20 21:09                     ` Mario Limonciello (AMD) (kernel.org)
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2025-10-20 19:55 UTC (permalink / raw)
  To: Mario Limonciello (AMD) (kernel.org)
  Cc: Rafael J. Wysocki, mario.limonciello, airlied, alexander.deucher,
	christian.koenig, dakr, gregkh, lenb, pavel, simona,
	Muhammad Usama Anjum, amd-gfx, dri-devel, linux-pm

On Mon, Oct 20, 2025 at 9:34 PM Mario Limonciello (AMD) (kernel.org)
<superm1@kernel.org> wrote:
>
>
>
> On 10/20/2025 2:18 PM, Rafael J. Wysocki wrote:
> > On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org)
> > <superm1@kernel.org> wrote:
> >>
> >>
> >>
> >> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
> >>> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
> >>> <superm1@kernel.org> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
> >>>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
> >>>>> <superm1@kernel.org> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
> >>>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
> >>>>>>> <superm1@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
> >>>>>>>>
> >>>>>>>> The PM core should be notified that thaw was skipped for the device
> >>>>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
> >>>>>>>> that it gets another chance to resume.
> >>>>>>>>
> >>>>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> >>>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> >>>>>>>> ---
> >>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
> >>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> index 61268aa82df4d..d40af069f24dd 100644
> >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
> >>>>>>>>
> >>>>>>>>             /* do not resume device if it's normal hibernation */
> >>>>>>>>             if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>>>>>>> -               return 0;
> >>>>>>>> +               return -EBUSY;
> >>>>>>>
> >>>>>>> So that's why you need the special handling of -EBUSY in the previous patch.
> >>>>>>
> >>>>>> Yup.
> >>>>>>
> >>>>>>>
> >>>>>>> I think that you need to save some state in this driver and then use
> >>>>>>> it in subsequent callbacks instead of hacking the core to do what you
> >>>>>>> want.
> >>>>>>>
> >>>>>>
> >>>>>> The problem is the core decides "what" to call and more importantly
> >>>>>> "when" to call it.
> >>>>>>
> >>>>>> IE if the core thinks that something is thawed it will never call
> >>>>>> resume, and that's why you end up in a bad place with Muhammad's
> >>>>>> cancellation series and why I proposed this one to discuss.
> >>>>>>
> >>>>>> We could obviously go back to dropping this case entirely:
> >>>>>>
> >>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>>>>>
> >>>>>> But then the display turns on at thaw(), you do an unnecessary resource
> >>>>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
> >>>>>
> >>>>> The cancellation series is at odds with this code path AFAICS because
> >>>>> what if hibernation is canceled after the entire thaw transition?
> >>>>
> >>>> Muhammad - did you test that specific timing of cancelling the hibernate?
> >>>>>
> >>>>> Some cleanup would need to be done before thawing user space I suppose.
> >>>>
> >>>> I agree; I think that series would need changes for it.
> >>>>
> >>>> But if you put that series aside, I think this one still has some merit
> >>>> on it's own.  If another driver aborted the hibernate, I think the same
> >>>> thing could happen if it happened to run before amdgpu's device thaw().
> >>>>
> >>>> That series just exposed a very "easy" way to reproduce this issue.
> >>>
> >>> Device thaw errors don't abort anything AFAICS.
> >>>
> >>
> >> You're right; it doesn't abort, it just is saved to the logs.
> >> The state is also not maintained.
> >>> What can happen though is that another device may abort the final
> >>> "power off" transition, which is one of the reasons why I think that
> >>> rolling it back is generally hard.
> >>
> >> That's exactly the reason for the first patch in this series.  The state
> >> of whether it succeeded isn't recorded.  So if thaw non-fatally fails
> >> and you've saved state to indicate this then any of the other calls that
> >> run can try again.
> >
> > So long as they are called.
> >
> > But as I said before, I would save the state in the driver thaw
> > callback and then clear it in the driver poweroff callback and look at
> > it in the driver restore callback.  If it is there at that point,
> > poweroff has not run and hibernation is rolling back, so you need to
> > do a "thaw".
>
> Are you suggesting that the device driver should directly manipulate
> dev->power.is_suspended?

No, it needs to have its own state for that.  power.is_suspended
should not be manipulated by drivers (or anything other than the core
for that matter).

> I'll do some testing but; I suppose that would work as well without
> needing to make core changes if you don't see a need for other devices
> to do this.

So long as they don't try to skip the "thaw" actions, I don't.

If there are more drivers wanting to do it, I guess it would be good
to have a common approach although at this point I'm not sure how much
in common there would be.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 19:55                   ` Rafael J. Wysocki
@ 2025-10-20 21:09                     ` Mario Limonciello (AMD) (kernel.org)
  2025-10-21 13:25                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 23+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2025-10-20 21:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, simona, Muhammad Usama Anjum, amd-gfx,
	dri-devel, linux-pm



On 10/20/2025 2:55 PM, Rafael J. Wysocki wrote:
> On Mon, Oct 20, 2025 at 9:34 PM Mario Limonciello (AMD) (kernel.org)
> <superm1@kernel.org> wrote:
>>
>>
>>
>> On 10/20/2025 2:18 PM, Rafael J. Wysocki wrote:
>>> On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org)
>>> <superm1@kernel.org> wrote:
>>>>
>>>>
>>>>
>>>> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
>>>>> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
>>>>> <superm1@kernel.org> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
>>>>>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
>>>>>>> <superm1@kernel.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
>>>>>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
>>>>>>>>> <superm1@kernel.org> wrote:
>>>>>>>>>>
>>>>>>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>>>>
>>>>>>>>>> The PM core should be notified that thaw was skipped for the device
>>>>>>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
>>>>>>>>>> that it gets another chance to resume.
>>>>>>>>>>
>>>>>>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>>>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>>>> ---
>>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>>>>>>>>>       1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>> index 61268aa82df4d..d40af069f24dd 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>>>>>>>>>
>>>>>>>>>>              /* do not resume device if it's normal hibernation */
>>>>>>>>>>              if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>>>>>> -               return 0;
>>>>>>>>>> +               return -EBUSY;
>>>>>>>>>
>>>>>>>>> So that's why you need the special handling of -EBUSY in the previous patch.
>>>>>>>>
>>>>>>>> Yup.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think that you need to save some state in this driver and then use
>>>>>>>>> it in subsequent callbacks instead of hacking the core to do what you
>>>>>>>>> want.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The problem is the core decides "what" to call and more importantly
>>>>>>>> "when" to call it.
>>>>>>>>
>>>>>>>> IE if the core thinks that something is thawed it will never call
>>>>>>>> resume, and that's why you end up in a bad place with Muhammad's
>>>>>>>> cancellation series and why I proposed this one to discuss.
>>>>>>>>
>>>>>>>> We could obviously go back to dropping this case entirely:
>>>>>>>>
>>>>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>>>>
>>>>>>>> But then the display turns on at thaw(), you do an unnecessary resource
>>>>>>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
>>>>>>>
>>>>>>> The cancellation series is at odds with this code path AFAICS because
>>>>>>> what if hibernation is canceled after the entire thaw transition?
>>>>>>
>>>>>> Muhammad - did you test that specific timing of cancelling the hibernate?
>>>>>>>
>>>>>>> Some cleanup would need to be done before thawing user space I suppose.
>>>>>>
>>>>>> I agree; I think that series would need changes for it.
>>>>>>
>>>>>> But if you put that series aside, I think this one still has some merit
>>>>>> on it's own.  If another driver aborted the hibernate, I think the same
>>>>>> thing could happen if it happened to run before amdgpu's device thaw().
>>>>>>
>>>>>> That series just exposed a very "easy" way to reproduce this issue.
>>>>>
>>>>> Device thaw errors don't abort anything AFAICS.
>>>>>
>>>>
>>>> You're right; it doesn't abort, it just is saved to the logs.
>>>> The state is also not maintained.
>>>>> What can happen though is that another device may abort the final
>>>>> "power off" transition, which is one of the reasons why I think that
>>>>> rolling it back is generally hard.
>>>>
>>>> That's exactly the reason for the first patch in this series.  The state
>>>> of whether it succeeded isn't recorded.  So if thaw non-fatally fails
>>>> and you've saved state to indicate this then any of the other calls that
>>>> run can try again.
>>>
>>> So long as they are called.
>>>
>>> But as I said before, I would save the state in the driver thaw
>>> callback and then clear it in the driver poweroff callback and look at
>>> it in the driver restore callback.  If it is there at that point,
>>> poweroff has not run and hibernation is rolling back, so you need to
>>> do a "thaw".
>>
>> Are you suggesting that the device driver should directly manipulate
>> dev->power.is_suspended?
> 
> No, it needs to have its own state for that.  power.is_suspended
> should not be manipulated by drivers (or anything other than the core
> for that matter).

That's what I originally thought which is why this series looks like it 
does.

> 
>> I'll do some testing but; I suppose that would work as well without
>> needing to make core changes if you don't see a need for other devices
>> to do this.
> 
> So long as they don't try to skip the "thaw" actions, I don't.
> 
> If there are more drivers wanting to do it, I guess it would be good
> to have a common approach although at this point I'm not sure how much
> in common there would be.

But so if the state is maintained in the driver dev->power.is_suspended 
will be FALSE at the end of thaw().  That means that restore() is never 
called for a cancellation/abort.

So I think the only place to do the cleanup would be in the complete() 
callback.  Do you think that's the best place for this based upon that 
internal driver state variable?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 21:09                     ` Mario Limonciello (AMD) (kernel.org)
@ 2025-10-21 13:25                       ` Rafael J. Wysocki
  2025-10-21 14:19                         ` Mario Limonciello (AMD) (kernel.org)
  0 siblings, 1 reply; 23+ messages in thread
From: Rafael J. Wysocki @ 2025-10-21 13:25 UTC (permalink / raw)
  To: Mario Limonciello (AMD) (kernel.org)
  Cc: Rafael J. Wysocki, mario.limonciello, airlied, alexander.deucher,
	christian.koenig, dakr, gregkh, lenb, pavel, simona,
	Muhammad Usama Anjum, amd-gfx, dri-devel, linux-pm

On Mon, Oct 20, 2025 at 11:09 PM Mario Limonciello (AMD) (kernel.org)
<superm1@kernel.org> wrote:
>
>
>
> On 10/20/2025 2:55 PM, Rafael J. Wysocki wrote:
> > On Mon, Oct 20, 2025 at 9:34 PM Mario Limonciello (AMD) (kernel.org)
> > <superm1@kernel.org> wrote:
> >>
> >>
> >>
> >> On 10/20/2025 2:18 PM, Rafael J. Wysocki wrote:
> >>> On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org)
> >>> <superm1@kernel.org> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
> >>>>> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
> >>>>> <superm1@kernel.org> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
> >>>>>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
> >>>>>>> <superm1@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
> >>>>>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
> >>>>>>>>> <superm1@kernel.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
> >>>>>>>>>>
> >>>>>>>>>> The PM core should be notified that thaw was skipped for the device
> >>>>>>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
> >>>>>>>>>> that it gets another chance to resume.
> >>>>>>>>>>
> >>>>>>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
> >>>>>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> >>>>>>>>>> ---
> >>>>>>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
> >>>>>>>>>>       1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>>>> index 61268aa82df4d..d40af069f24dd 100644
> >>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
> >>>>>>>>>>
> >>>>>>>>>>              /* do not resume device if it's normal hibernation */
> >>>>>>>>>>              if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>>>>>>>>> -               return 0;
> >>>>>>>>>> +               return -EBUSY;
> >>>>>>>>>
> >>>>>>>>> So that's why you need the special handling of -EBUSY in the previous patch.
> >>>>>>>>
> >>>>>>>> Yup.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I think that you need to save some state in this driver and then use
> >>>>>>>>> it in subsequent callbacks instead of hacking the core to do what you
> >>>>>>>>> want.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> The problem is the core decides "what" to call and more importantly
> >>>>>>>> "when" to call it.
> >>>>>>>>
> >>>>>>>> IE if the core thinks that something is thawed it will never call
> >>>>>>>> resume, and that's why you end up in a bad place with Muhammad's
> >>>>>>>> cancellation series and why I proposed this one to discuss.
> >>>>>>>>
> >>>>>>>> We could obviously go back to dropping this case entirely:
> >>>>>>>>
> >>>>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>>>>>>>
> >>>>>>>> But then the display turns on at thaw(), you do an unnecessary resource
> >>>>>>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
> >>>>>>>
> >>>>>>> The cancellation series is at odds with this code path AFAICS because
> >>>>>>> what if hibernation is canceled after the entire thaw transition?
> >>>>>>
> >>>>>> Muhammad - did you test that specific timing of cancelling the hibernate?
> >>>>>>>
> >>>>>>> Some cleanup would need to be done before thawing user space I suppose.
> >>>>>>
> >>>>>> I agree; I think that series would need changes for it.
> >>>>>>
> >>>>>> But if you put that series aside, I think this one still has some merit
> >>>>>> on it's own.  If another driver aborted the hibernate, I think the same
> >>>>>> thing could happen if it happened to run before amdgpu's device thaw().
> >>>>>>
> >>>>>> That series just exposed a very "easy" way to reproduce this issue.
> >>>>>
> >>>>> Device thaw errors don't abort anything AFAICS.
> >>>>>
> >>>>
> >>>> You're right; it doesn't abort, it just is saved to the logs.
> >>>> The state is also not maintained.
> >>>>> What can happen though is that another device may abort the final
> >>>>> "power off" transition, which is one of the reasons why I think that
> >>>>> rolling it back is generally hard.
> >>>>
> >>>> That's exactly the reason for the first patch in this series.  The state
> >>>> of whether it succeeded isn't recorded.  So if thaw non-fatally fails
> >>>> and you've saved state to indicate this then any of the other calls that
> >>>> run can try again.
> >>>
> >>> So long as they are called.
> >>>
> >>> But as I said before, I would save the state in the driver thaw
> >>> callback and then clear it in the driver poweroff callback and look at
> >>> it in the driver restore callback.  If it is there at that point,
> >>> poweroff has not run and hibernation is rolling back, so you need to
> >>> do a "thaw".
> >>
> >> Are you suggesting that the device driver should directly manipulate
> >> dev->power.is_suspended?
> >
> > No, it needs to have its own state for that.  power.is_suspended
> > should not be manipulated by drivers (or anything other than the core
> > for that matter).
>
> That's what I originally thought which is why this series looks like it
> does.
>
> >
> >> I'll do some testing but; I suppose that would work as well without
> >> needing to make core changes if you don't see a need for other devices
> >> to do this.
> >
> > So long as they don't try to skip the "thaw" actions, I don't.
> >
> > If there are more drivers wanting to do it, I guess it would be good
> > to have a common approach although at this point I'm not sure how much
> > in common there would be.
>
> But so if the state is maintained in the driver dev->power.is_suspended
> will be FALSE at the end of thaw().  That means that restore() is never
> called for a cancellation/abort.

OK, I see what you mean.

The failing scenario is when "thaw" leaves the devices in "freeze" and
then "poweroff" is not called because the final transition is aborted
and so "restore" is not called either and the device remains "frozen".

> So I think the only place to do the cleanup would be in the complete()
> callback.  Do you think that's the best place for this based upon that
> internal driver state variable?

It would be if nothing else depended on the device in question, but I
somehow suspect that it is not the case.

I think that you need to trigger a "restore" for the "frozen" device
in the right order with respect to the rest of dpm_list.  I guess you
could add a special power.frozen flag that will be set by drivers
leaving their devices in a "frozen" state in their "thaw" callback.
Then, it could be converted to power.is_suspended in the error path of
dpm_suspend() for "poweroff" transitions.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-20 18:32         ` Mario Limonciello (AMD) (kernel.org)
  2025-10-20 18:50           ` Rafael J. Wysocki
@ 2025-10-21 14:12           ` Muhammad Usama Anjum
  1 sibling, 0 replies; 23+ messages in thread
From: Muhammad Usama Anjum @ 2025-10-21 14:12 UTC (permalink / raw)
  To: Mario Limonciello (AMD) (kernel.org), Rafael J. Wysocki
  Cc: usama.anjum, mario.limonciello, airlied, alexander.deucher,
	christian.koenig, dakr, gregkh, lenb, pavel, simona, amd-gfx,
	dri-devel, linux-pm

On 10/20/25 11:32 PM, Mario Limonciello (AMD) (kernel.org) wrote:
> 
> 
> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
>> <superm1@kernel.org> wrote:
>>>
>>>
>>>
>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
>>>> <superm1@kernel.org> wrote:
>>>>>
>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
>>>>>
>>>>> The PM core should be notified that thaw was skipped for the device
>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
>>>>> that it gets another chance to resume.
>>>>>
>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> index 61268aa82df4d..d40af069f24dd 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>>>>
>>>>>           /* do not resume device if it's normal hibernation */
>>>>>           if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>> -               return 0;
>>>>> +               return -EBUSY;
>>>>
>>>> So that's why you need the special handling of -EBUSY in the previous patch.
>>>
>>> Yup.
>>>
>>>>
>>>> I think that you need to save some state in this driver and then use
>>>> it in subsequent callbacks instead of hacking the core to do what you
>>>> want.
>>>>
>>>
>>> The problem is the core decides "what" to call and more importantly
>>> "when" to call it.
>>>
>>> IE if the core thinks that something is thawed it will never call
>>> resume, and that's why you end up in a bad place with Muhammad's
>>> cancellation series and why I proposed this one to discuss.
>>>
>>> We could obviously go back to dropping this case entirely:
>>>
>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>
>>> But then the display turns on at thaw(), you do an unnecessary resource
>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
>>
>> The cancellation series is at odds with this code path AFAICS because
>> what if hibernation is canceled after the entire thaw transition?
> 
> Muhammad - did you test that specific timing of cancelling the hibernate?
Yes, I've tested the cancellations before and after the thaw both.

>>
>> Some cleanup would need to be done before thawing user space I suppose.
> 
> I agree; I think that series would need changes for it.
> 
> But if you put that series aside, I think this one still has some merit on it's own.  If another driver aborted the hibernate, I think the same thing could happen if it happened to run before amdgpu's device thaw().
> 
> That series just exposed a very "easy" way to reproduce this issue.


-- 
---
Thanks,
Usama

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
  2025-10-21 13:25                       ` Rafael J. Wysocki
@ 2025-10-21 14:19                         ` Mario Limonciello (AMD) (kernel.org)
  0 siblings, 0 replies; 23+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2025-10-21 14:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: mario.limonciello, airlied, alexander.deucher, christian.koenig,
	dakr, gregkh, lenb, pavel, simona, Muhammad Usama Anjum, amd-gfx,
	dri-devel, linux-pm



On 10/21/2025 8:25 AM, Rafael J. Wysocki wrote:
> On Mon, Oct 20, 2025 at 11:09 PM Mario Limonciello (AMD) (kernel.org)
> <superm1@kernel.org> wrote:
>>
>>
>>
>> On 10/20/2025 2:55 PM, Rafael J. Wysocki wrote:
>>> On Mon, Oct 20, 2025 at 9:34 PM Mario Limonciello (AMD) (kernel.org)
>>> <superm1@kernel.org> wrote:
>>>>
>>>>
>>>>
>>>> On 10/20/2025 2:18 PM, Rafael J. Wysocki wrote:
>>>>> On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org)
>>>>> <superm1@kernel.org> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
>>>>>>> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
>>>>>>> <superm1@kernel.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
>>>>>>>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
>>>>>>>>> <superm1@kernel.org> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
>>>>>>>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
>>>>>>>>>>> <superm1@kernel.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>> The PM core should be notified that thaw was skipped for the device
>>>>>>>>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
>>>>>>>>>>>> that it gets another chance to resume.
>>>>>>>>>>>>
>>>>>>>>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>>>>>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>>>>>>>>>>>        1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>>>> index 61268aa82df4d..d40af069f24dd 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>>>>>>>>>>>
>>>>>>>>>>>>               /* do not resume device if it's normal hibernation */
>>>>>>>>>>>>               if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>>>>>>>> -               return 0;
>>>>>>>>>>>> +               return -EBUSY;
>>>>>>>>>>>
>>>>>>>>>>> So that's why you need the special handling of -EBUSY in the previous patch.
>>>>>>>>>>
>>>>>>>>>> Yup.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think that you need to save some state in this driver and then use
>>>>>>>>>>> it in subsequent callbacks instead of hacking the core to do what you
>>>>>>>>>>> want.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The problem is the core decides "what" to call and more importantly
>>>>>>>>>> "when" to call it.
>>>>>>>>>>
>>>>>>>>>> IE if the core thinks that something is thawed it will never call
>>>>>>>>>> resume, and that's why you end up in a bad place with Muhammad's
>>>>>>>>>> cancellation series and why I proposed this one to discuss.
>>>>>>>>>>
>>>>>>>>>> We could obviously go back to dropping this case entirely:
>>>>>>>>>>
>>>>>>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>>>>>>
>>>>>>>>>> But then the display turns on at thaw(), you do an unnecessary resource
>>>>>>>>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
>>>>>>>>>
>>>>>>>>> The cancellation series is at odds with this code path AFAICS because
>>>>>>>>> what if hibernation is canceled after the entire thaw transition?
>>>>>>>>
>>>>>>>> Muhammad - did you test that specific timing of cancelling the hibernate?
>>>>>>>>>
>>>>>>>>> Some cleanup would need to be done before thawing user space I suppose.
>>>>>>>>
>>>>>>>> I agree; I think that series would need changes for it.
>>>>>>>>
>>>>>>>> But if you put that series aside, I think this one still has some merit
>>>>>>>> on it's own.  If another driver aborted the hibernate, I think the same
>>>>>>>> thing could happen if it happened to run before amdgpu's device thaw().
>>>>>>>>
>>>>>>>> That series just exposed a very "easy" way to reproduce this issue.
>>>>>>>
>>>>>>> Device thaw errors don't abort anything AFAICS.
>>>>>>>
>>>>>>
>>>>>> You're right; it doesn't abort, it just is saved to the logs.
>>>>>> The state is also not maintained.
>>>>>>> What can happen though is that another device may abort the final
>>>>>>> "power off" transition, which is one of the reasons why I think that
>>>>>>> rolling it back is generally hard.
>>>>>>
>>>>>> That's exactly the reason for the first patch in this series.  The state
>>>>>> of whether it succeeded isn't recorded.  So if thaw non-fatally fails
>>>>>> and you've saved state to indicate this then any of the other calls that
>>>>>> run can try again.
>>>>>
>>>>> So long as they are called.
>>>>>
>>>>> But as I said before, I would save the state in the driver thaw
>>>>> callback and then clear it in the driver poweroff callback and look at
>>>>> it in the driver restore callback.  If it is there at that point,
>>>>> poweroff has not run and hibernation is rolling back, so you need to
>>>>> do a "thaw".
>>>>
>>>> Are you suggesting that the device driver should directly manipulate
>>>> dev->power.is_suspended?
>>>
>>> No, it needs to have its own state for that.  power.is_suspended
>>> should not be manipulated by drivers (or anything other than the core
>>> for that matter).
>>
>> That's what I originally thought which is why this series looks like it
>> does.
>>
>>>
>>>> I'll do some testing but; I suppose that would work as well without
>>>> needing to make core changes if you don't see a need for other devices
>>>> to do this.
>>>
>>> So long as they don't try to skip the "thaw" actions, I don't.
>>>
>>> If there are more drivers wanting to do it, I guess it would be good
>>> to have a common approach although at this point I'm not sure how much
>>> in common there would be.
>>
>> But so if the state is maintained in the driver dev->power.is_suspended
>> will be FALSE at the end of thaw().  That means that restore() is never
>> called for a cancellation/abort.
> 
> OK, I see what you mean.
> 
> The failing scenario is when "thaw" leaves the devices in "freeze" and
> then "poweroff" is not called because the final transition is aborted
> and so "restore" is not called either and the device remains "frozen".
> 
>> So I think the only place to do the cleanup would be in the complete()
>> callback.  Do you think that's the best place for this based upon that
>> internal driver state variable?
> 
> It would be if nothing else depended on the device in question, but I
> somehow suspect that it is not the case.

Muhammad had a try with this and confirmed it worked on an mobile APU 
(which has no device dependency)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index aad620cdfd399..a88c28579e290 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2594,6 +2594,12 @@ static int amdgpu_pmops_prepare(struct device *dev)

  static void amdgpu_pmops_complete(struct device *dev)
  {
+       struct drm_device *drm_dev = dev_get_drvdata(dev);
+       struct amdgpu_device *adev = drm_to_adev(drm_dev);
+
+       if (adev->in_s4 && adev->in_suspend && pm_hibernate_is_recovering())
+               amdgpu_device_resume(drm_dev, true);
+
         amdgpu_device_complete(dev_get_drvdata(dev));
  }

But I think you're right it won't work in the case of a dGPU because the 
ordering between HDMI audio and GFX needs to be done properly.

> 
> I think that you need to trigger a "restore" for the "frozen" device
> in the right order with respect to the rest of dpm_list.  I guess you
> could add a special power.frozen flag that will be set by drivers
> leaving their devices in a "frozen" state in their "thaw" callback.
> Then, it could be converted to power.is_suspended in the error path of
> dpm_suspend() for "poweroff" transitions.

OK, I'll have a try with changing the series to do this, thanks.

^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-10-21 14:19 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-20 16:50 [RFC 0/3] Fixups for cancelled hibernate Mario Limonciello (AMD)
2025-10-20 16:50 ` [RFC 1/3] PM: Mark device as suspended if it failed to resume Mario Limonciello (AMD)
2025-10-20 16:58   ` Muhammad Usama Anjum
2025-10-20 17:15     ` Rafael J. Wysocki
2025-10-20 16:50 ` [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors Mario Limonciello (AMD)
2025-10-20 16:58   ` Muhammad Usama Anjum
2025-10-20 17:18   ` Rafael J. Wysocki
2025-10-20 17:24     ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 16:50 ` [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success Mario Limonciello (AMD)
2025-10-20 16:59   ` Muhammad Usama Anjum
2025-10-20 17:21   ` Rafael J. Wysocki
2025-10-20 17:28     ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 17:39       ` Rafael J. Wysocki
2025-10-20 18:32         ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 18:50           ` Rafael J. Wysocki
2025-10-20 19:14             ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 19:18               ` Rafael J. Wysocki
2025-10-20 19:34                 ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 19:55                   ` Rafael J. Wysocki
2025-10-20 21:09                     ` Mario Limonciello (AMD) (kernel.org)
2025-10-21 13:25                       ` Rafael J. Wysocki
2025-10-21 14:19                         ` Mario Limonciello (AMD) (kernel.org)
2025-10-21 14:12           ` Muhammad Usama Anjum

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox