From: "Mario Limonciello (AMD) (kernel.org)" <superm1@kernel.org>
To: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: mario.limonciello@amd.com, airlied@gmail.com,
alexander.deucher@amd.com, christian.koenig@amd.com,
dakr@kernel.org, gregkh@linuxfoundation.org, lenb@kernel.org,
pavel@kernel.org, simona@ffwll.ch,
Muhammad Usama Anjum <usama.anjum@collabora.com>,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-pm@vger.kernel.org
Subject: Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success
Date: Mon, 20 Oct 2025 16:09:12 -0500 [thread overview]
Message-ID: <1b86e583-1f3a-4192-8864-d2a60a8787e1@kernel.org> (raw)
In-Reply-To: <CAJZ5v0jWou0vxNZhe-pU-wQyWxhikaDkF+ZO0rsUieXs_nYjjQ@mail.gmail.com>
On 10/20/2025 2:55 PM, Rafael J. Wysocki wrote:
> On Mon, Oct 20, 2025 at 9:34 PM Mario Limonciello (AMD) (kernel.org)
> <superm1@kernel.org> wrote:
>>
>>
>>
>> On 10/20/2025 2:18 PM, Rafael J. Wysocki wrote:
>>> On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org)
>>> <superm1@kernel.org> wrote:
>>>>
>>>>
>>>>
>>>> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
>>>>> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
>>>>> <superm1@kernel.org> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
>>>>>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
>>>>>>> <superm1@kernel.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
>>>>>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
>>>>>>>>> <superm1@kernel.org> wrote:
>>>>>>>>>>
>>>>>>>>>> From: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>>>>
>>>>>>>>>> The PM core should be notified that thaw was skipped for the device
>>>>>>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
>>>>>>>>>> that it gets another chance to resume.
>>>>>>>>>>
>>>>>>>>>> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>>>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>>>>>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>> index 61268aa82df4d..d40af069f24dd 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
>>>>>>>>>>
>>>>>>>>>> /* do not resume device if it's normal hibernation */
>>>>>>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>>>>>> - return 0;
>>>>>>>>>> + return -EBUSY;
>>>>>>>>>
>>>>>>>>> So that's why you need the special handling of -EBUSY in the previous patch.
>>>>>>>>
>>>>>>>> Yup.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think that you need to save some state in this driver and then use
>>>>>>>>> it in subsequent callbacks instead of hacking the core to do what you
>>>>>>>>> want.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The problem is the core decides "what" to call and more importantly
>>>>>>>> "when" to call it.
>>>>>>>>
>>>>>>>> IE if the core thinks that something is thawed it will never call
>>>>>>>> resume, and that's why you end up in a bad place with Muhammad's
>>>>>>>> cancellation series and why I proposed this one to discuss.
>>>>>>>>
>>>>>>>> We could obviously go back to dropping this case entirely:
>>>>>>>>
>>>>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
>>>>>>>>
>>>>>>>> But then the display turns on at thaw(), you do an unnecessary resource
>>>>>>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
>>>>>>>
>>>>>>> The cancellation series is at odds with this code path AFAICS because
>>>>>>> what if hibernation is canceled after the entire thaw transition?
>>>>>>
>>>>>> Muhammad - did you test that specific timing of cancelling the hibernate?
>>>>>>>
>>>>>>> Some cleanup would need to be done before thawing user space I suppose.
>>>>>>
>>>>>> I agree; I think that series would need changes for it.
>>>>>>
>>>>>> But if you put that series aside, I think this one still has some merit
>>>>>> on it's own. If another driver aborted the hibernate, I think the same
>>>>>> thing could happen if it happened to run before amdgpu's device thaw().
>>>>>>
>>>>>> That series just exposed a very "easy" way to reproduce this issue.
>>>>>
>>>>> Device thaw errors don't abort anything AFAICS.
>>>>>
>>>>
>>>> You're right; it doesn't abort, it just is saved to the logs.
>>>> The state is also not maintained.
>>>>> What can happen though is that another device may abort the final
>>>>> "power off" transition, which is one of the reasons why I think that
>>>>> rolling it back is generally hard.
>>>>
>>>> That's exactly the reason for the first patch in this series. The state
>>>> of whether it succeeded isn't recorded. So if thaw non-fatally fails
>>>> and you've saved state to indicate this then any of the other calls that
>>>> run can try again.
>>>
>>> So long as they are called.
>>>
>>> But as I said before, I would save the state in the driver thaw
>>> callback and then clear it in the driver poweroff callback and look at
>>> it in the driver restore callback. If it is there at that point,
>>> poweroff has not run and hibernation is rolling back, so you need to
>>> do a "thaw".
>>
>> Are you suggesting that the device driver should directly manipulate
>> dev->power.is_suspended?
>
> No, it needs to have its own state for that. power.is_suspended
> should not be manipulated by drivers (or anything other than the core
> for that matter).
That's what I originally thought which is why this series looks like it
does.
>
>> I'll do some testing but; I suppose that would work as well without
>> needing to make core changes if you don't see a need for other devices
>> to do this.
>
> So long as they don't try to skip the "thaw" actions, I don't.
>
> If there are more drivers wanting to do it, I guess it would be good
> to have a common approach although at this point I'm not sure how much
> in common there would be.
But so if the state is maintained in the driver dev->power.is_suspended
will be FALSE at the end of thaw(). That means that restore() is never
called for a cancellation/abort.
So I think the only place to do the cleanup would be in the complete()
callback. Do you think that's the best place for this based upon that
internal driver state variable?
next prev parent reply other threads:[~2025-10-20 21:09 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-20 16:50 [RFC 0/3] Fixups for cancelled hibernate Mario Limonciello (AMD)
2025-10-20 16:50 ` [RFC 1/3] PM: Mark device as suspended if it failed to resume Mario Limonciello (AMD)
2025-10-20 16:58 ` Muhammad Usama Anjum
2025-10-20 17:15 ` Rafael J. Wysocki
2025-10-20 16:50 ` [RFC 2/3] PM: Don't pass up device_resume() -EBUSY errors Mario Limonciello (AMD)
2025-10-20 16:58 ` Muhammad Usama Anjum
2025-10-20 17:18 ` Rafael J. Wysocki
2025-10-20 17:24 ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 16:50 ` [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success Mario Limonciello (AMD)
2025-10-20 16:59 ` Muhammad Usama Anjum
2025-10-20 17:21 ` Rafael J. Wysocki
2025-10-20 17:28 ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 17:39 ` Rafael J. Wysocki
2025-10-20 18:32 ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 18:50 ` Rafael J. Wysocki
2025-10-20 19:14 ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 19:18 ` Rafael J. Wysocki
2025-10-20 19:34 ` Mario Limonciello (AMD) (kernel.org)
2025-10-20 19:55 ` Rafael J. Wysocki
2025-10-20 21:09 ` Mario Limonciello (AMD) (kernel.org) [this message]
2025-10-21 13:25 ` Rafael J. Wysocki
2025-10-21 14:19 ` Mario Limonciello (AMD) (kernel.org)
2025-10-21 14:12 ` Muhammad Usama Anjum
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1b86e583-1f3a-4192-8864-d2a60a8787e1@kernel.org \
--to=superm1@kernel.org \
--cc=airlied@gmail.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
--cc=dakr@kernel.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=gregkh@linuxfoundation.org \
--cc=lenb@kernel.org \
--cc=linux-pm@vger.kernel.org \
--cc=mario.limonciello@amd.com \
--cc=pavel@kernel.org \
--cc=rafael@kernel.org \
--cc=simona@ffwll.ch \
--cc=usama.anjum@collabora.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox