[PATCH] drm/amdgpu: Ignore first evction failure during suspend

AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] drm/amdgpu: Ignore first evction failure during suspend
@ 2023-09-08  3:39 xinhui pan
  2023-09-08  6:48 ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: xinhui pan @ 2023-09-08  3:39 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, xinhui pan, christian.koenig, shikang.fan

Some BOs might be pinned. So the first eviction's failure will abort the
suspend sequence. These pinned BOs will be unpined afterwards during
suspend.

Actaully it has evicted most BOs, so that should stil work fine in sriov
full access mode.

Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call during device_suspend.")
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5c0e2b766026..39af526cdbbe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 
 	adev->in_suspend = true;
 
-	/* Evict the majority of BOs before grabbing the full access */
-	r = amdgpu_device_evict_resources(adev);
-	if (r)
-		return r;
+	/* Try to evict the majority of BOs before grabbing the full access
+	 * Ignore the ret val at first place as we will unpin some BOs if any
+	 * afterwards.
+	 */
+	(void)amdgpu_device_evict_resources(adev);
 
 	if (amdgpu_sriov_vf(adev)) {
 		amdgpu_virt_fini_data_exchange(adev);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-08  3:39 [PATCH] drm/amdgpu: Ignore first evction failure during suspend xinhui pan
@ 2023-09-08  6:48 ` Christian König
  2023-09-12  0:21   ` Pan, Xinhui
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2023-09-08  6:48 UTC (permalink / raw)
  To: xinhui pan, amd-gfx; +Cc: alexander.deucher, christian.koenig, shikang.fan

Am 08.09.23 um 05:39 schrieb xinhui pan:
> Some BOs might be pinned. So the first eviction's failure will abort the
> suspend sequence. These pinned BOs will be unpined afterwards during
> suspend.

That doesn't make much sense since pinned BOs don't cause eviction 
failure here.

What exactly is the error code you see?

Christian.

>
> Actaully it has evicted most BOs, so that should stil work fine in sriov
> full access mode.
>
> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call during device_suspend.")
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5c0e2b766026..39af526cdbbe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   
>   	adev->in_suspend = true;
>   
> -	/* Evict the majority of BOs before grabbing the full access */
> -	r = amdgpu_device_evict_resources(adev);
> -	if (r)
> -		return r;
> +	/* Try to evict the majority of BOs before grabbing the full access
> +	 * Ignore the ret val at first place as we will unpin some BOs if any
> +	 * afterwards.
> +	 */
> +	(void)amdgpu_device_evict_resources(adev);
>   
>   	if (amdgpu_sriov_vf(adev)) {
>   		amdgpu_virt_fini_data_exchange(adev);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-08  6:48 ` Christian König
@ 2023-09-12  0:21   ` Pan, Xinhui
  2023-09-12  9:01     ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Pan, Xinhui @ 2023-09-12  0:21 UTC (permalink / raw)
  To: Christian König, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Fan, Shikang

[AMD Official Use Only - General]

Oh yep, Pinned BO is moved to other LRU list, So eviction fails because of other reason.
I will change the comments in the patch.
The problem is eviction fails as many reasons, say, BO is locked.
ASAIK, kfd will stop the queues and flush some evict/restore work in its suspend callback. SO the first eviction before kfd callback likely fails.

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@gmail.com>
Sent: Friday, September 8, 2023 2:49 PM
To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend

Am 08.09.23 um 05:39 schrieb xinhui pan:
> Some BOs might be pinned. So the first eviction's failure will abort
> the suspend sequence. These pinned BOs will be unpined afterwards
> during suspend.

That doesn't make much sense since pinned BOs don't cause eviction failure here.

What exactly is the error code you see?

Christian.

>
> Actaully it has evicted most BOs, so that should stil work fine in
> sriov full access mode.
>
> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
> during device_suspend.")
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5c0e2b766026..39af526cdbbe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
> *dev, bool fbcon)
>
>       adev->in_suspend = true;
>
> -     /* Evict the majority of BOs before grabbing the full access */
> -     r = amdgpu_device_evict_resources(adev);
> -     if (r)
> -             return r;
> +     /* Try to evict the majority of BOs before grabbing the full access
> +      * Ignore the ret val at first place as we will unpin some BOs if any
> +      * afterwards.
> +      */
> +     (void)amdgpu_device_evict_resources(adev);
>
>       if (amdgpu_sriov_vf(adev)) {
>               amdgpu_virt_fini_data_exchange(adev);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] drm/amdgpu: Ignore first evction failure during suspend
@ 2023-09-12  2:03 xinhui pan
  0 siblings, 0 replies; 15+ messages in thread
From: xinhui pan @ 2023-09-12  2:03 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, xinhui pan, christian.koenig, shikang.fan

Some BOs might be in use or locked and then the first eviction's failure
will abort the suspend sequence. We will try to unlock or stop any user
accessing these BOs afterwards during suspend. So only the second
eviction should succeed.

Actaully the first eviction has evicted most BOs, so that should still
work fine in sriov full access mode.

Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call during device_suspend.")
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5c0e2b766026..f381cb90c964 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 
 	adev->in_suspend = true;
 
-	/* Evict the majority of BOs before grabbing the full access */
-	r = amdgpu_device_evict_resources(adev);
-	if (r)
-		return r;
+	/* Try to evict the majority of BOs before grabbing the full access
+	 * Ignore the ret val at first place as we will unlock or stop accessing
+	 * any BOs afterwards.
+	 */
+	(void)amdgpu_device_evict_resources(adev);
 
 	if (amdgpu_sriov_vf(adev)) {
 		amdgpu_virt_fini_data_exchange(adev);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-12  0:21   ` Pan, Xinhui
@ 2023-09-12  9:01     ` Christian König
  2023-09-13  5:13       ` 回复: " Pan, Xinhui
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2023-09-12  9:01 UTC (permalink / raw)
  To: Pan, Xinhui, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Fan, Shikang

When amdgpu_device_suspend() is called processes should be frozen 
already. In other words KFD queues etc... should already be idle.

So when the eviction fails here we missed something previously and that 
in turn can cause tons amount of problems.

So ignoring those errors is most likely not a good idea at all.

Regards,
Christian.

Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
> [AMD Official Use Only - General]
>
> Oh yep, Pinned BO is moved to other LRU list, So eviction fails because of other reason.
> I will change the comments in the patch.
> The problem is eviction fails as many reasons, say, BO is locked.
> ASAIK, kfd will stop the queues and flush some evict/restore work in its suspend callback. SO the first eviction before kfd callback likely fails.
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Friday, September 8, 2023 2:49 PM
> To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
>
> Am 08.09.23 um 05:39 schrieb xinhui pan:
>> Some BOs might be pinned. So the first eviction's failure will abort
>> the suspend sequence. These pinned BOs will be unpined afterwards
>> during suspend.
> That doesn't make much sense since pinned BOs don't cause eviction failure here.
>
> What exactly is the error code you see?
>
> Christian.
>
>> Actaully it has evicted most BOs, so that should stil work fine in
>> sriov full access mode.
>>
>> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>> during device_suspend.")
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 5c0e2b766026..39af526cdbbe 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>> *dev, bool fbcon)
>>
>>        adev->in_suspend = true;
>>
>> -     /* Evict the majority of BOs before grabbing the full access */
>> -     r = amdgpu_device_evict_resources(adev);
>> -     if (r)
>> -             return r;
>> +     /* Try to evict the majority of BOs before grabbing the full access
>> +      * Ignore the ret val at first place as we will unpin some BOs if any
>> +      * afterwards.
>> +      */
>> +     (void)amdgpu_device_evict_resources(adev);
>>
>>        if (amdgpu_sriov_vf(adev)) {
>>                amdgpu_virt_fini_data_exchange(adev);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-12  9:01     ` Christian König
@ 2023-09-13  5:13       ` Pan, Xinhui
  2023-09-13  8:07         ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Pan, Xinhui @ 2023-09-13  5:13 UTC (permalink / raw)
  To: Christian König, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 5190 bytes --]

[AMD Official Use Only - General]

I notice that only user space process are frozen on my side.  kthread and workqueue  keeps running. Maybe some kernel configs are not enabled.
I made one module which just prints something like i++ with mutex lock both in workqueue and kthread. I paste some logs below.
[438619.696196] XH: 14 from workqueue
[438619.700193] XH: 15 from kthread
[438620.394335] PM: suspend entry (deep)
[438620.399619] Filesystems sync: 0.001 seconds
[438620.403887] PM: Preparing system for sleep (deep)
[438620.409299] Freezing user space processes
[438620.414862] Freezing user space processes completed (elapsed 0.001 seconds)
[438620.421881] OOM killer disabled.
[438620.425197] Freezing remaining freezable tasks
[438620.430890] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[438620.438348] PM: Suspending system (deep)
.....
[438623.746038] PM: suspend of devices complete after 3303.137 msecs
[438623.752125] PM: start suspend of devices complete after 3309.713 msecs
[438623.758722] PM: suspend debug: Waiting for 5 second(s).
[438623.792166] XH: 22 from kthread
[438623.824140] XH: 23 from workqueue


So BOs definitely can be in use during suspend.
Even if kthread or workqueue can be stopped with one special kernel config. I think suspend can only stop the workqueue with its callback finish.
otherwise something like below makes things crazy.
LOCK BO
do something
    -> schedule or wait, anycode might sleep.  Stopped by suspend now? no, i think.
UNLOCK BO

I do tests  with  cmds below.
echo devices  > /sys/power/pm_test
echo 0  > /sys/power/pm_async
echo 1  > /sys/power/pm_print_times
echo 1 > /sys/power/pm_debug_messages
echo 1 > /sys/module/amdgpu/parameters/debug_evictions
./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
pm-suspend

thanks
xinhui


________________________________
发件人: Christian König <ckoenig.leichtzumerken@gmail.com>
发送时间: 2023年9月12日 17:01
收件人: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
抄送: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
主题: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend

When amdgpu_device_suspend() is called processes should be frozen
already. In other words KFD queues etc... should already be idle.

So when the eviction fails here we missed something previously and that
in turn can cause tons amount of problems.

So ignoring those errors is most likely not a good idea at all.

Regards,
Christian.

Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
> [AMD Official Use Only - General]
>
> Oh yep, Pinned BO is moved to other LRU list, So eviction fails because of other reason.
> I will change the comments in the patch.
> The problem is eviction fails as many reasons, say, BO is locked.
> ASAIK, kfd will stop the queues and flush some evict/restore work in its suspend callback. SO the first eviction before kfd callback likely fails.
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> Sent: Friday, September 8, 2023 2:49 PM
> To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
>
> Am 08.09.23 um 05:39 schrieb xinhui pan:
>> Some BOs might be pinned. So the first eviction's failure will abort
>> the suspend sequence. These pinned BOs will be unpined afterwards
>> during suspend.
> That doesn't make much sense since pinned BOs don't cause eviction failure here.
>
> What exactly is the error code you see?
>
> Christian.
>
>> Actaully it has evicted most BOs, so that should stil work fine in
>> sriov full access mode.
>>
>> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>> during device_suspend.")
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 5c0e2b766026..39af526cdbbe 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>> *dev, bool fbcon)
>>
>>        adev->in_suspend = true;
>>
>> -     /* Evict the majority of BOs before grabbing the full access */
>> -     r = amdgpu_device_evict_resources(adev);
>> -     if (r)
>> -             return r;
>> +     /* Try to evict the majority of BOs before grabbing the full access
>> +      * Ignore the ret val at first place as we will unpin some BOs if any
>> +      * afterwards.
>> +      */
>> +     (void)amdgpu_device_evict_resources(adev);
>>
>>        if (amdgpu_sriov_vf(adev)) {
>>                amdgpu_virt_fini_data_exchange(adev);


[-- Attachment #2: Type: text/html, Size: 12185 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-13  5:13       ` 回复: " Pan, Xinhui
@ 2023-09-13  8:07         ` Christian König
  2023-09-13 13:54           ` Felix Kuehling
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2023-09-13  8:07 UTC (permalink / raw)
  To: Pan, Xinhui, amd-gfx@lists.freedesktop.org, Kuehling, Felix
  Cc: Deucher, Alexander, Koenig, Christian, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 6386 bytes --]

[+Fleix]

Well that looks like quite a serious bug.

If I'm not completely mistaken the KFD work item tries to restore the 
process by moving BOs into memory even after the suspend freeze. 
Normally work items are frozen together with the user space processes 
unless explicitly marked as not freezable.

That this causes problem during the first eviction phase is just the tip 
of the iceberg here. If a BO is moved into invisible memory during this 
we wouldn't be able to get it out of that in the second phase because 
SDMA and hw is already turned off.

@Felix any idea how that can happen? Have you guys marked a work item / 
work queue as not freezable? Or maybe the display guys?

@Xinhui please investigate what work item that is and where that is 
coming from. Something like "if (adev->in_suspend) dump_stack();" in the 
right place should probably do it.

Thanks,
Christian.

Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>
> [AMD Official Use Only - General]
>
>
> I notice that only user space process are frozen on my side. kthread 
> and workqueue  keeps running. Maybe some kernel configs are not enabled.
> I made one module which just prints something like i++ with mutex lock 
> both in workqueue and kthread. I paste some logs below.
> [438619.696196] XH: 14 from workqueue
> [438619.700193] XH: 15 from kthread
> [438620.394335] PM: suspend entry (deep)
> [438620.399619] Filesystems sync: 0.001 seconds
> [438620.403887] PM: Preparing system for sleep (deep)
> [438620.409299] Freezing user space processes
> [438620.414862] Freezing user space processes completed (elapsed 0.001 
> seconds)
> [438620.421881] OOM killer disabled.
> [438620.425197] Freezing remaining freezable tasks
> [438620.430890] Freezing remaining freezable tasks completed (elapsed 
> 0.001 seconds)
> [438620.438348] PM: Suspending system (deep)
> .....
> [438623.746038] PM: suspend of devices complete after 3303.137 msecs
> [438623.752125] PM: start suspend of devices complete after 3309.713 msecs
> [438623.758722] PM: suspend debug: Waiting for 5 second(s).
> [438623.792166] XH: 22 from kthread
> [438623.824140] XH: 23 from workqueue
>
>
> So BOs definitely can be in use during suspend.
> Even if kthread or workqueue can be stopped with one special kernel 
> config. I think suspend can only stop the workqueue with its callback 
> finish.
> otherwise something like below makes things crazy.
> LOCK BO
> do something
>     -> schedule or wait, anycode might sleep.  Stopped by suspend now? 
> no, i think.
> UNLOCK BO
>
> I do tests  with  cmds below.
> echo devices  > /sys/power/pm_test
> echo 0  > /sys/power/pm_async
> echo 1  > /sys/power/pm_print_times
> echo 1 > /sys/power/pm_debug_messages
> echo 1 > /sys/module/amdgpu/parameters/debug_evictions
> ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
> pm-suspend
>
> thanks
> xinhui
>
>
> ------------------------------------------------------------------------
> *发件人:* Christian König <ckoenig.leichtzumerken@gmail.com>
> *发送时间:* 2023年9月12日 17:01
> *收件人:* Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org 
> <amd-gfx@lists.freedesktop.org>
> *抄送:* Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
> *主题:* Re: [PATCH] drm/amdgpu: Ignore first evction failure during 
> suspend
> When amdgpu_device_suspend() is called processes should be frozen
> already. In other words KFD queues etc... should already be idle.
>
> So when the eviction fails here we missed something previously and that
> in turn can cause tons amount of problems.
>
> So ignoring those errors is most likely not a good idea at all.
>
> Regards,
> Christian.
>
> Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
> > [AMD Official Use Only - General]
> >
> > Oh yep, Pinned BO is moved to other LRU list, So eviction fails 
> because of other reason.
> > I will change the comments in the patch.
> > The problem is eviction fails as many reasons, say, BO is locked.
> > ASAIK, kfd will stop the queues and flush some evict/restore work in 
> its suspend callback. SO the first eviction before kfd callback likely 
> fails.
> >
> > -----Original Message-----
> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
> > Sent: Friday, September 8, 2023 2:49 PM
> > To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure during 
> suspend
> >
> > Am 08.09.23 um 05:39 schrieb xinhui pan:
> >> Some BOs might be pinned. So the first eviction's failure will abort
> >> the suspend sequence. These pinned BOs will be unpined afterwards
> >> during suspend.
> > That doesn't make much sense since pinned BOs don't cause eviction 
> failure here.
> >
> > What exactly is the error code you see?
> >
> > Christian.
> >
> >> Actaully it has evicted most BOs, so that should stil work fine in
> >> sriov full access mode.
> >>
> >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
> >> during device_suspend.")
> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> >> ---
> >>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
> >>    1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index 5c0e2b766026..39af526cdbbe 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
> >> *dev, bool fbcon)
> >>
> >>        adev->in_suspend = true;
> >>
> >> -     /* Evict the majority of BOs before grabbing the full access */
> >> -     r = amdgpu_device_evict_resources(adev);
> >> -     if (r)
> >> -             return r;
> >> +     /* Try to evict the majority of BOs before grabbing the full 
> access
> >> +      * Ignore the ret val at first place as we will unpin some 
> BOs if any
> >> +      * afterwards.
> >> +      */
> >> + (void)amdgpu_device_evict_resources(adev);
> >>
> >>        if (amdgpu_sriov_vf(adev)) {
> >> amdgpu_virt_fini_data_exchange(adev);
>

[-- Attachment #2: Type: text/html, Size: 17558 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-13  8:07         ` Christian König
@ 2023-09-13 13:54           ` Felix Kuehling
  2023-09-13 14:28             ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Felix Kuehling @ 2023-09-13 13:54 UTC (permalink / raw)
  To: Christian König, Pan, Xinhui, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 6692 bytes --]

On 2023-09-13 4:07, Christian König wrote:
> [+Fleix]
>
> Well that looks like quite a serious bug.
>
> If I'm not completely mistaken the KFD work item tries to restore the 
> process by moving BOs into memory even after the suspend freeze. 
> Normally work items are frozen together with the user space processes 
> unless explicitly marked as not freezable.
>
> That this causes problem during the first eviction phase is just the 
> tip of the iceberg here. If a BO is moved into invisible memory during 
> this we wouldn't be able to get it out of that in the second phase 
> because SDMA and hw is already turned off.
>
> @Felix any idea how that can happen? Have you guys marked a work item 
> / work queue as not freezable?

We don't set anything to non-freezable in KFD.


Regards,
   Felix


> Or maybe the display guys?
>
> @Xinhui please investigate what work item that is and where that is 
> coming from. Something like "if (adev->in_suspend) dump_stack();" in 
> the right place should probably do it.
>
> Thanks,
> Christian.
>
> Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>
>> [AMD Official Use Only - General]
>>
>>
>> I notice that only user space process are frozen on my side.  kthread 
>> and workqueue  keeps running. Maybe some kernel configs are not enabled.
>> I made one module which just prints something like i++ with mutex 
>> lock both in workqueue and kthread. I paste some logs below.
>> [438619.696196] XH: 14 from workqueue
>> [438619.700193] XH: 15 from kthread
>> [438620.394335] PM: suspend entry (deep)
>> [438620.399619] Filesystems sync: 0.001 seconds
>> [438620.403887] PM: Preparing system for sleep (deep)
>> [438620.409299] Freezing user space processes
>> [438620.414862] Freezing user space processes completed (elapsed 
>> 0.001 seconds)
>> [438620.421881] OOM killer disabled.
>> [438620.425197] Freezing remaining freezable tasks
>> [438620.430890] Freezing remaining freezable tasks completed (elapsed 
>> 0.001 seconds)
>> [438620.438348] PM: Suspending system (deep)
>> .....
>> [438623.746038] PM: suspend of devices complete after 3303.137 msecs
>> [438623.752125] PM: start suspend of devices complete after 3309.713 
>> msecs
>> [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>> [438623.792166] XH: 22 from kthread
>> [438623.824140] XH: 23 from workqueue
>>
>>
>> So BOs definitely can be in use during suspend.
>> Even if kthread or workqueue can be stopped with one special kernel 
>> config. I think suspend can only stop the workqueue with its callback 
>> finish.
>> otherwise something like below makes things crazy.
>> LOCK BO
>> do something
>>     -> schedule or wait, anycode might sleep.  Stopped by suspend 
>> now? no, i think.
>> UNLOCK BO
>>
>> I do tests  with  cmds below.
>> echo devices  > /sys/power/pm_test
>> echo 0  > /sys/power/pm_async
>> echo 1  > /sys/power/pm_print_times
>> echo 1 > /sys/power/pm_debug_messages
>> echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>> ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>> pm-suspend
>>
>> thanks
>> xinhui
>>
>>
>> ------------------------------------------------------------------------
>> *发件人:* Christian König <ckoenig.leichtzumerken@gmail.com>
>> *发送时间:* 2023年9月12日 17:01
>> *收件人:* Pan, Xinhui <Xinhui.Pan@amd.com>; 
>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>> *抄送:* Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>> *主题:* Re: [PATCH] drm/amdgpu: Ignore first evction failure during 
>> suspend
>> When amdgpu_device_suspend() is called processes should be frozen
>> already. In other words KFD queues etc... should already be idle.
>>
>> So when the eviction fails here we missed something previously and that
>> in turn can cause tons amount of problems.
>>
>> So ignoring those errors is most likely not a good idea at all.
>>
>> Regards,
>> Christian.
>>
>> Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>> > [AMD Official Use Only - General]
>> >
>> > Oh yep, Pinned BO is moved to other LRU list, So eviction fails 
>> because of other reason.
>> > I will change the comments in the patch.
>> > The problem is eviction fails as many reasons, say, BO is locked.
>> > ASAIK, kfd will stop the queues and flush some evict/restore work 
>> in its suspend callback. SO the first eviction before kfd callback 
>> likely fails.
>> >
>> > -----Original Message-----
>> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> > Sent: Friday, September 8, 2023 2:49 PM
>> > To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>> > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure 
>> during suspend
>> >
>> > Am 08.09.23 um 05:39 schrieb xinhui pan:
>> >> Some BOs might be pinned. So the first eviction's failure will abort
>> >> the suspend sequence. These pinned BOs will be unpined afterwards
>> >> during suspend.
>> > That doesn't make much sense since pinned BOs don't cause eviction 
>> failure here.
>> >
>> > What exactly is the error code you see?
>> >
>> > Christian.
>> >
>> >> Actaully it has evicted most BOs, so that should stil work fine in
>> >> sriov full access mode.
>> >>
>> >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>> >> during device_suspend.")
>> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> >> ---
>> >>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>> >>    1 file changed, 5 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> index 5c0e2b766026..39af526cdbbe 100644
>> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>> >> *dev, bool fbcon)
>> >>
>> >>        adev->in_suspend = true;
>> >>
>> >> -     /* Evict the majority of BOs before grabbing the full access */
>> >> -     r = amdgpu_device_evict_resources(adev);
>> >> -     if (r)
>> >> -             return r;
>> >> +     /* Try to evict the majority of BOs before grabbing the full 
>> access
>> >> +      * Ignore the ret val at first place as we will unpin some 
>> BOs if any
>> >> +      * afterwards.
>> >> +      */
>> >> + (void)amdgpu_device_evict_resources(adev);
>> >>
>> >>        if (amdgpu_sriov_vf(adev)) {
>> >> amdgpu_virt_fini_data_exchange(adev);
>>
>

[-- Attachment #2: Type: text/html, Size: 19029 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-13 13:54           ` Felix Kuehling
@ 2023-09-13 14:28             ` Christian König
  2023-09-14  0:02               ` Pan, Xinhui
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2023-09-13 14:28 UTC (permalink / raw)
  To: Felix Kuehling, Christian König, Pan, Xinhui,
	amd-gfx@lists.freedesktop.org, Wentland, Harry
  Cc: Deucher, Alexander, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 7104 bytes --]

[+Harry]

Am 13.09.23 um 15:54 schrieb Felix Kuehling:
> On 2023-09-13 4:07, Christian König wrote:
>> [+Fleix]
>>
>> Well that looks like quite a serious bug.
>>
>> If I'm not completely mistaken the KFD work item tries to restore the 
>> process by moving BOs into memory even after the suspend freeze. 
>> Normally work items are frozen together with the user space processes 
>> unless explicitly marked as not freezable.
>>
>> That this causes problem during the first eviction phase is just the 
>> tip of the iceberg here. If a BO is moved into invisible memory 
>> during this we wouldn't be able to get it out of that in the second 
>> phase because SDMA and hw is already turned off.
>>
>> @Felix any idea how that can happen? Have you guys marked a work item 
>> / work queue as not freezable?
>
> We don't set anything to non-freezable in KFD.
>
>
> Regards,
>   Felix
>
>
>> Or maybe the display guys?

Do you guys in the display do any delayed update in a work item which is 
marked as not-freezable?

Otherwise I have absolutely no idea what's going on here.

Thanks,
Christian.

>>
>> @Xinhui please investigate what work item that is and where that is 
>> coming from. Something like "if (adev->in_suspend) dump_stack();" in 
>> the right place should probably do it.
>>
>> Thanks,
>> Christian.
>>
>> Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>>
>>> [AMD Official Use Only - General]
>>>
>>>
>>> I notice that only user space process are frozen on my side.  
>>> kthread and workqueue keeps running. Maybe some kernel configs are 
>>> not enabled.
>>> I made one module which just prints something like i++ with mutex 
>>> lock both in workqueue and kthread. I paste some logs below.
>>> [438619.696196] XH: 14 from workqueue
>>> [438619.700193] XH: 15 from kthread
>>> [438620.394335] PM: suspend entry (deep)
>>> [438620.399619] Filesystems sync: 0.001 seconds
>>> [438620.403887] PM: Preparing system for sleep (deep)
>>> [438620.409299] Freezing user space processes
>>> [438620.414862] Freezing user space processes completed (elapsed 
>>> 0.001 seconds)
>>> [438620.421881] OOM killer disabled.
>>> [438620.425197] Freezing remaining freezable tasks
>>> [438620.430890] Freezing remaining freezable tasks completed 
>>> (elapsed 0.001 seconds)
>>> [438620.438348] PM: Suspending system (deep)
>>> .....
>>> [438623.746038] PM: suspend of devices complete after 3303.137 msecs
>>> [438623.752125] PM: start suspend of devices complete after 3309.713 
>>> msecs
>>> [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>>> [438623.792166] XH: 22 from kthread
>>> [438623.824140] XH: 23 from workqueue
>>>
>>>
>>> So BOs definitely can be in use during suspend.
>>> Even if kthread or workqueue can be stopped with one special kernel 
>>> config. I think suspend can only stop the workqueue with its 
>>> callback finish.
>>> otherwise something like below makes things crazy.
>>> LOCK BO
>>> do something
>>>     -> schedule or wait, anycode might sleep.  Stopped by suspend 
>>> now? no, i think.
>>> UNLOCK BO
>>>
>>> I do tests  with  cmds below.
>>> echo devices  > /sys/power/pm_test
>>> echo 0  > /sys/power/pm_async
>>> echo 1  > /sys/power/pm_print_times
>>> echo 1 > /sys/power/pm_debug_messages
>>> echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>>> ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>>> pm-suspend
>>>
>>> thanks
>>> xinhui
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *发件人:* Christian König <ckoenig.leichtzumerken@gmail.com>
>>> *发送时间:* 2023年9月12日 17:01
>>> *收件人:* Pan, Xinhui <Xinhui.Pan@amd.com>; 
>>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>>> *抄送:* Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
>>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>>> *主题:* Re: [PATCH] drm/amdgpu: Ignore first evction failure during 
>>> suspend
>>> When amdgpu_device_suspend() is called processes should be frozen
>>> already. In other words KFD queues etc... should already be idle.
>>>
>>> So when the eviction fails here we missed something previously and that
>>> in turn can cause tons amount of problems.
>>>
>>> So ignoring those errors is most likely not a good idea at all.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>>> > [AMD Official Use Only - General]
>>> >
>>> > Oh yep, Pinned BO is moved to other LRU list, So eviction fails 
>>> because of other reason.
>>> > I will change the comments in the patch.
>>> > The problem is eviction fails as many reasons, say, BO is locked.
>>> > ASAIK, kfd will stop the queues and flush some evict/restore work 
>>> in its suspend callback. SO the first eviction before kfd callback 
>>> likely fails.
>>> >
>>> > -----Original Message-----
>>> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>> > Sent: Friday, September 8, 2023 2:49 PM
>>> > To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>>> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
>>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>>> > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure 
>>> during suspend
>>> >
>>> > Am 08.09.23 um 05:39 schrieb xinhui pan:
>>> >> Some BOs might be pinned. So the first eviction's failure will abort
>>> >> the suspend sequence. These pinned BOs will be unpined afterwards
>>> >> during suspend.
>>> > That doesn't make much sense since pinned BOs don't cause eviction 
>>> failure here.
>>> >
>>> > What exactly is the error code you see?
>>> >
>>> > Christian.
>>> >
>>> >> Actaully it has evicted most BOs, so that should stil work fine in
>>> >> sriov full access mode.
>>> >>
>>> >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>>> >> during device_suspend.")
>>> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>> >> ---
>>> >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>> >>    1 file changed, 5 insertions(+), 4 deletions(-)
>>> >>
>>> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> >> index 5c0e2b766026..39af526cdbbe 100644
>>> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> >> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>>> >> *dev, bool fbcon)
>>> >>
>>> >>        adev->in_suspend = true;
>>> >>
>>> >> -     /* Evict the majority of BOs before grabbing the full access */
>>> >> -     r = amdgpu_device_evict_resources(adev);
>>> >> -     if (r)
>>> >> -             return r;
>>> >> +     /* Try to evict the majority of BOs before grabbing the 
>>> full access
>>> >> +      * Ignore the ret val at first place as we will unpin some 
>>> BOs if any
>>> >> +      * afterwards.
>>> >> +      */
>>> >> + (void)amdgpu_device_evict_resources(adev);
>>> >>
>>> >>        if (amdgpu_sriov_vf(adev)) {
>>> >> amdgpu_virt_fini_data_exchange(adev);
>>>
>>

[-- Attachment #2: Type: text/html, Size: 20227 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-13 14:28             ` Christian König
@ 2023-09-14  0:02               ` Pan, Xinhui
  2023-09-14  1:54                 ` Pan, Xinhui
  0 siblings, 1 reply; 15+ messages in thread
From: Pan, Xinhui @ 2023-09-14  0:02 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix, Christian König,
	amd-gfx@lists.freedesktop.org, Wentland, Harry
  Cc: Deucher, Alexander, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 7737 bytes --]

[AMD Official Use Only - General]

Chris,
I can dump these busy BOs with their alloc/free stack later today.

BTW, the two evictions and the kfd suspend are all called before hw_fini. IOW, between phase 1 and phase 2. SDMA is turned only in phase2. So current code works fine maybe.

From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Wednesday, September 13, 2023 10:29 PM
To: Kuehling, Felix <Felix.Kuehling@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org; Wentland, Harry <Harry.Wentland@amd.com>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
Subject: Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend

[+Harry]
Am 13.09.23 um 15:54 schrieb Felix Kuehling:
On 2023-09-13 4:07, Christian König wrote:
[+Fleix]

Well that looks like quite a serious bug.

If I'm not completely mistaken the KFD work item tries to restore the process by moving BOs into memory even after the suspend freeze. Normally work items are frozen together with the user space processes unless explicitly marked as not freezable.

That this causes problem during the first eviction phase is just the tip of the iceberg here. If a BO is moved into invisible memory during this we wouldn't be able to get it out of that in the second phase because SDMA and hw is already turned off.

@Felix any idea how that can happen? Have you guys marked a work item / work queue as not freezable?

We don't set anything to non-freezable in KFD.

Regards,
  Felix

Or maybe the display guys?

Do you guys in the display do any delayed update in a work item which is marked as not-freezable?

Otherwise I have absolutely no idea what's going on here.

Thanks,
Christian.

@Xinhui please investigate what work item that is and where that is coming from. Something like "if (adev->in_suspend) dump_stack();" in the right place should probably do it.

Thanks,
Christian.
Am 13.09.23 um 07:13 schrieb Pan, Xinhui:

[AMD Official Use Only - General]

I notice that only user space process are frozen on my side.  kthread and workqueue  keeps running. Maybe some kernel configs are not enabled.
I made one module which just prints something like i++ with mutex lock both in workqueue and kthread. I paste some logs below.
[438619.696196] XH: 14 from workqueue
[438619.700193] XH: 15 from kthread
[438620.394335] PM: suspend entry (deep)
[438620.399619] Filesystems sync: 0.001 seconds
[438620.403887] PM: Preparing system for sleep (deep)
[438620.409299] Freezing user space processes
[438620.414862] Freezing user space processes completed (elapsed 0.001 seconds)
[438620.421881] OOM killer disabled.
[438620.425197] Freezing remaining freezable tasks
[438620.430890] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[438620.438348] PM: Suspending system (deep)
.....
[438623.746038] PM: suspend of devices complete after 3303.137 msecs
[438623.752125] PM: start suspend of devices complete after 3309.713 msecs
[438623.758722] PM: suspend debug: Waiting for 5 second(s).
[438623.792166] XH: 22 from kthread
[438623.824140] XH: 23 from workqueue

So BOs definitely can be in use during suspend.
Even if kthread or workqueue can be stopped with one special kernel config. I think suspend can only stop the workqueue with its callback finish.
otherwise something like below makes things crazy.
LOCK BO
do something
    -> schedule or wait, anycode might sleep.  Stopped by suspend now? no, i think.
UNLOCK BO

I do tests  with  cmds below.
echo devices  > /sys/power/pm_test
echo 0  > /sys/power/pm_async
echo 1  > /sys/power/pm_print_times
echo 1 > /sys/power/pm_debug_messages
echo 1 > /sys/module/amdgpu/parameters/debug_evictions
./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
pm-suspend

thanks
xinhui

________________________________
发件人: Christian König <ckoenig.leichtzumerken@gmail.com><mailto:ckoenig.leichtzumerken@gmail.com>
发送时间: 2023年9月12日 17:01
收件人: Pan, Xinhui <Xinhui.Pan@amd.com><mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org><mailto:amd-gfx@lists.freedesktop.org>
抄送: Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com><mailto:Shikang.Fan@amd.com>
主题: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend

When amdgpu_device_suspend() is called processes should be frozen
already. In other words KFD queues etc... should already be idle.

So when the eviction fails here we missed something previously and that
in turn can cause tons amount of problems.

So ignoring those errors is most likely not a good idea at all.

Regards,
Christian.

Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
> [AMD Official Use Only - General]
>
> Oh yep, Pinned BO is moved to other LRU list, So eviction fails because of other reason.
> I will change the comments in the patch.
> The problem is eviction fails as many reasons, say, BO is locked.
> ASAIK, kfd will stop the queues and flush some evict/restore work in its suspend callback. SO the first eviction before kfd callback likely fails.
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com><mailto:ckoenig.leichtzumerken@gmail.com>
> Sent: Friday, September 8, 2023 2:49 PM
> To: Pan, Xinhui <Xinhui.Pan@amd.com><mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com><mailto:Shikang.Fan@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
>
> Am 08.09.23 um 05:39 schrieb xinhui pan:
>> Some BOs might be pinned. So the first eviction's failure will abort
>> the suspend sequence. These pinned BOs will be unpined afterwards
>> during suspend.
> That doesn't make much sense since pinned BOs don't cause eviction failure here.
>
> What exactly is the error code you see?
>
> Christian.
>
>> Actaully it has evicted most BOs, so that should stil work fine in
>> sriov full access mode.
>>
>> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>> during device_suspend.")
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com><mailto:xinhui.pan@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 5c0e2b766026..39af526cdbbe 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>> *dev, bool fbcon)
>>
>>        adev->in_suspend = true;
>>
>> -     /* Evict the majority of BOs before grabbing the full access */
>> -     r = amdgpu_device_evict_resources(adev);
>> -     if (r)
>> -             return r;
>> +     /* Try to evict the majority of BOs before grabbing the full access
>> +      * Ignore the ret val at first place as we will unpin some BOs if any
>> +      * afterwards.
>> +      */
>> +     (void)amdgpu_device_evict_resources(adev);
>>
>>        if (amdgpu_sriov_vf(adev)) {
>>                amdgpu_virt_fini_data_exchange(adev);

[-- Attachment #2: Type: text/html, Size: 20619 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-14  0:02               ` Pan, Xinhui
@ 2023-09-14  1:54                 ` Pan, Xinhui
  2023-09-14  6:23                   ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Pan, Xinhui @ 2023-09-14  1:54 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix, Christian König,
	amd-gfx@lists.freedesktop.org, Wentland, Harry
  Cc: Deucher, Alexander, Fan, Shikang


[-- Attachment #1.1: Type: text/plain, Size: 8855 bytes --]

[AMD Official Use Only - General]

I just make one debug patch to show busy BO’s alloc-trace when the eviction fails in suspend.
And dmesg log attached.
Looks like they are just kfd user Bos and locked by evict/restore work.
So in kfd suspend callback, it really need to flush the evict/restore work before HW fini as it do now.
That is why the first very early eviction fails and the second eviction succeed.

Thanks
xinhui
From: Pan, Xinhui
Sent: Thursday, September 14, 2023 8:02 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org; Wentland, Harry <Harry.Wentland@amd.com>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
Subject: RE: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend

Chris,
I can dump these busy BOs with their alloc/free stack later today.

BTW, the two evictions and the kfd suspend are all called before hw_fini. IOW, between phase 1 and phase 2. SDMA is turned only in phase2. So current code works fine maybe.

From: Koenig, Christian <Christian.Koenig@amd.com<mailto:Christian.Koenig@amd.com>>
Sent: Wednesday, September 13, 2023 10:29 PM
To: Kuehling, Felix <Felix.Kuehling@amd.com<mailto:Felix.Kuehling@amd.com>>; Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>>; Pan, Xinhui <Xinhui.Pan@amd.com<mailto:Xinhui.Pan@amd.com>>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Wentland, Harry <Harry.Wentland@amd.com<mailto:Harry.Wentland@amd.com>>
Cc: Deucher, Alexander <Alexander.Deucher@amd.com<mailto:Alexander.Deucher@amd.com>>; Fan, Shikang <Shikang.Fan@amd.com<mailto:Shikang.Fan@amd.com>>
Subject: Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend

[+Harry]
Am 13.09.23 um 15:54 schrieb Felix Kuehling:
On 2023-09-13 4:07, Christian König wrote:
[+Fleix]

Well that looks like quite a serious bug.

If I'm not completely mistaken the KFD work item tries to restore the process by moving BOs into memory even after the suspend freeze. Normally work items are frozen together with the user space processes unless explicitly marked as not freezable.

That this causes problem during the first eviction phase is just the tip of the iceberg here. If a BO is moved into invisible memory during this we wouldn't be able to get it out of that in the second phase because SDMA and hw is already turned off.

@Felix any idea how that can happen? Have you guys marked a work item / work queue as not freezable?

We don't set anything to non-freezable in KFD.



Regards,
  Felix


Or maybe the display guys?

Do you guys in the display do any delayed update in a work item which is marked as not-freezable?

Otherwise I have absolutely no idea what's going on here.

Thanks,
Christian.


@Xinhui please investigate what work item that is and where that is coming from. Something like "if (adev->in_suspend) dump_stack();" in the right place should probably do it.

Thanks,
Christian.
Am 13.09.23 um 07:13 schrieb Pan, Xinhui:

[AMD Official Use Only - General]

I notice that only user space process are frozen on my side.  kthread and workqueue  keeps running. Maybe some kernel configs are not enabled.
I made one module which just prints something like i++ with mutex lock both in workqueue and kthread. I paste some logs below.
[438619.696196] XH: 14 from workqueue
[438619.700193] XH: 15 from kthread
[438620.394335] PM: suspend entry (deep)
[438620.399619] Filesystems sync: 0.001 seconds
[438620.403887] PM: Preparing system for sleep (deep)
[438620.409299] Freezing user space processes
[438620.414862] Freezing user space processes completed (elapsed 0.001 seconds)
[438620.421881] OOM killer disabled.
[438620.425197] Freezing remaining freezable tasks
[438620.430890] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[438620.438348] PM: Suspending system (deep)
.....
[438623.746038] PM: suspend of devices complete after 3303.137 msecs
[438623.752125] PM: start suspend of devices complete after 3309.713 msecs
[438623.758722] PM: suspend debug: Waiting for 5 second(s).
[438623.792166] XH: 22 from kthread
[438623.824140] XH: 23 from workqueue


So BOs definitely can be in use during suspend.
Even if kthread or workqueue can be stopped with one special kernel config. I think suspend can only stop the workqueue with its callback finish.
otherwise something like below makes things crazy.
LOCK BO
do something
    -> schedule or wait, anycode might sleep.  Stopped by suspend now? no, i think.
UNLOCK BO

I do tests  with  cmds below.
echo devices  > /sys/power/pm_test
echo 0  > /sys/power/pm_async
echo 1  > /sys/power/pm_print_times
echo 1 > /sys/power/pm_debug_messages
echo 1 > /sys/module/amdgpu/parameters/debug_evictions
./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
pm-suspend

thanks
xinhui


________________________________
发件人: Christian König <ckoenig.leichtzumerken@gmail.com><mailto:ckoenig.leichtzumerken@gmail.com>
发送时间: 2023年9月12日 17:01
收件人: Pan, Xinhui <Xinhui.Pan@amd.com><mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org><mailto:amd-gfx@lists.freedesktop.org>
抄送: Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com><mailto:Shikang.Fan@amd.com>
主题: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend

When amdgpu_device_suspend() is called processes should be frozen
already. In other words KFD queues etc... should already be idle.

So when the eviction fails here we missed something previously and that
in turn can cause tons amount of problems.

So ignoring those errors is most likely not a good idea at all.

Regards,
Christian.

Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
> [AMD Official Use Only - General]
>
> Oh yep, Pinned BO is moved to other LRU list, So eviction fails because of other reason.
> I will change the comments in the patch.
> The problem is eviction fails as many reasons, say, BO is locked.
> ASAIK, kfd will stop the queues and flush some evict/restore work in its suspend callback. SO the first eviction before kfd callback likely fails.
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@gmail.com><mailto:ckoenig.leichtzumerken@gmail.com>
> Sent: Friday, September 8, 2023 2:49 PM
> To: Pan, Xinhui <Xinhui.Pan@amd.com><mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com><mailto:Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com><mailto:Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com><mailto:Shikang.Fan@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
>
> Am 08.09.23 um 05:39 schrieb xinhui pan:
>> Some BOs might be pinned. So the first eviction's failure will abort
>> the suspend sequence. These pinned BOs will be unpined afterwards
>> during suspend.
> That doesn't make much sense since pinned BOs don't cause eviction failure here.
>
> What exactly is the error code you see?
>
> Christian.
>
>> Actaully it has evicted most BOs, so that should stil work fine in
>> sriov full access mode.
>>
>> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>> during device_suspend.")
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com><mailto:xinhui.pan@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 5c0e2b766026..39af526cdbbe 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>> *dev, bool fbcon)
>>
>>        adev->in_suspend = true;
>>
>> -     /* Evict the majority of BOs before grabbing the full access */
>> -     r = amdgpu_device_evict_resources(adev);
>> -     if (r)
>> -             return r;
>> +     /* Try to evict the majority of BOs before grabbing the full access
>> +      * Ignore the ret val at first place as we will unpin some BOs if any
>> +      * afterwards.
>> +      */
>> +     (void)amdgpu_device_evict_resources(adev);
>>
>>        if (amdgpu_sriov_vf(adev)) {
>>                amdgpu_virt_fini_data_exchange(adev);



[-- Attachment #1.2: Type: text/html, Size: 22487 bytes --]

[-- Attachment #2: suspend.log --]
[-- Type: application/octet-stream, Size: 96208 bytes --]

[   42.562453] PM: suspend entry (deep)
[   42.568619] Filesystems sync: 0.002 seconds
[   42.572925] PM: Preparing system for sleep (deep)
[   42.580638] Freezing user space processes
[   42.592971] Freezing user space processes completed (elapsed 0.008 seconds)
[   42.600082] OOM killer disabled.
[   42.603313] Freezing remaining freezable tasks
[   42.608884] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[   42.616307] PM: Suspending system (deep)
[   42.623045] input input19: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card0
[   42.631100] input input19: PM: input_dev_suspend+0x0/0x60 returned 0 after 3 usecs
[   42.638727] input input18: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card0
[   42.646736] input input18: PM: input_dev_suspend+0x0/0x60 returned 0 after 2 usecs
[   42.654875] input input17: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card0
[   42.662855] input input17: PM: input_dev_suspend+0x0/0x60 returned 0 after 1 usecs
[   42.670475] input input16: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card0
[   42.678446] input input16: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   42.686052] input input15: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card0
[   42.694223] input input15: PM: input_dev_suspend+0x0/0x60 returned 0 after 1 usecs
[   42.701889] input input14: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card0
[   42.710016] input input14: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   42.717575] intel_rapl_msr intel_rapl_msr.0: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   42.727684] intel_rapl_msr intel_rapl_msr.0: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   42.737050] input input13: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card0
[   42.745148] input input13: PM: input_dev_suspend+0x0/0x60 returned 0 after 1 usecs
[   42.752811] input input12: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card1
[   42.760903] input input12: PM: input_dev_suspend+0x0/0x60 returned 0 after 1 usecs
[   42.768550] input input11: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card1
[   42.776594] input input11: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   42.784238] sound pcmC0D2c: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card0
[   42.793095] sound pcmC0D2c: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 0 usecs
[   42.801507] input input10: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card1
[   42.809595] input input10: PM: input_dev_suspend+0x0/0x60 returned 0 after 1 usecs
[   42.817226] sound pcmC0D1p: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card0
[   42.826051] sound pcmC0D1p: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 0 usecs
[   42.834532] input input9: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card1
[   42.842676] input input9: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   42.850434] sound pcmC0D0c: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card0
[   42.859214] sound pcmC0D0c: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 0 usecs
[   42.867640] input input8: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card1
[   42.875572] input input8: PM: input_dev_suspend+0x0/0x60 returned 0 after 1 usecs
[   42.883057] sound pcmC0D0p: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card0
[   42.891831] sound pcmC0D0p: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 1 usecs
[   42.900138] input input7: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: card1
[   42.908088] input input7: PM: input_dev_suspend+0x0/0x60 returned 0 after 2 usecs
[   42.915591] sound pcmC1D11p: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card1
[   42.924421] sound pcmC1D11p: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 1 usecs
[   42.932819] sound pcmC1D10p: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card1
[   42.941651] sound pcmC1D10p: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 0 usecs
[   42.950024] sound pcmC1D9p: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card1
[   42.958880] sound pcmC1D9p: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 0 usecs
[   42.967311] sound pcmC1D8p: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card1
[   42.976183] sound pcmC1D8p: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 0 usecs
[   42.984607] sound pcmC1D7p: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card1
[   42.993498] sound pcmC1D7p: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 0 usecs
[   43.001942] sound pcmC1D3p: PM: calling do_pcm_suspend+0x0/0x50 [snd_pcm] @ 1485, parent: card1
[   43.010824] sound pcmC1D3p: PM: do_pcm_suspend+0x0/0x50 [snd_pcm] returned 0 after 0 usecs
[   43.019511] snd-soc-dummy snd-soc-dummy: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   43.029442] snd-soc-dummy snd-soc-dummy: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   43.038556] snd_hda_codec_realtek hdaudioC0D0: PM: calling hda_codec_pm_suspend+0x0/0x20 [snd_hda_codec] @ 1485, parent: 0000:00:1f.3
[   43.070726] snd_hda_codec_realtek hdaudioC0D0: PM: hda_codec_pm_suspend+0x0/0x20 [snd_hda_codec] returned 0 after 19949 usecs
[   43.082005] platform coretemp.0: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   43.090922] platform coretemp.0: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   43.099170] leds input3::kana: PM: calling led_suspend+0x0/0x50 @ 1485, parent: input3
[   43.107051] leds input3::kana: PM: led_suspend+0x0/0x50 returned 0 after 0 usecs
[   43.114429] leds input3::compose: PM: calling led_suspend+0x0/0x50 @ 1485, parent: input3
[   43.122575] leds input3::compose: PM: led_suspend+0x0/0x50 returned 0 after 0 usecs
[   43.130203] leds input3::scrolllock: PM: calling led_suspend+0x0/0x50 @ 1485, parent: input3
[   43.138602] leds input3::scrolllock: PM: led_suspend+0x0/0x50 returned 0 after 0 usecs
[   43.146496] leds input3::capslock: PM: calling led_suspend+0x0/0x50 @ 1485, parent: input3
[   43.154761] leds input3::capslock: PM: led_suspend+0x0/0x50 returned 0 after 0 usecs
[   43.162498] leds input3::numlock: PM: calling led_suspend+0x0/0x50 @ 1485, parent: input3
[   43.170643] leds input3::numlock: PM: led_suspend+0x0/0x50 returned 0 after 0 usecs
[   43.178283] input input6: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: eeepc-wmi
[   43.186523] input input6: PM: input_dev_suspend+0x0/0x60 returned 0 after 1 usecs
[   43.194002] eeepc-wmi eeepc-wmi: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   43.202931] eeepc-wmi eeepc-wmi: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   43.211291] input input5: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: 0003:046D:C077.0003
[   43.220383] input input5: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   43.227843] input input4: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: 0003:046D:C31C.0002
[   43.236932] input input4: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   43.244394] input input3: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: 0003:046D:C31C.0001
[   43.253483] input input3: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   43.260979] usb 1-6: PM: calling usb_dev_suspend+0x0/0x20 @ 1485, parent: usb1
[   43.383247] usb 1-6: PM: usb_dev_suspend+0x0/0x20 returned 0 after 115072 usecs
[   43.390548] usb 1-3: PM: calling usb_dev_suspend+0x0/0x20 @ 1485, parent: usb1
[   43.397936] usb 1-3: PM: usb_dev_suspend+0x0/0x20 returned 0 after 193 usecs
[   43.405000] usb 1-2: PM: calling usb_dev_suspend+0x0/0x20 @ 1485, parent: usb1
[   43.412215] usb 1-2: PM: usb_dev_suspend+0x0/0x20 returned 0 after 5 usecs
[   43.419090] sd 5:0:0:0: PM: calling scsi_bus_suspend+0x0/0xc0 @ 1485, parent: target5:0:0
[   43.437232] sd 5:0:0:0: [sda] Synchronizing SCSI cache
[   43.443870] sd 5:0:0:0: [sda] Stopping disk
[   44.119669] sd 5:0:0:0: PM: scsi_bus_suspend+0x0/0xc0 returned 0 after 692426 usecs
[   44.127336] scsi target5:0:0: PM: calling scsi_bus_suspend+0x0/0xc0 @ 1485, parent: host5
[   44.135478] scsi target5:0:0: PM: scsi_bus_suspend+0x0/0xc0 returned 0 after 0 usecs
[   44.143217] usb 1-1: PM: calling usb_dev_suspend+0x0/0x20 @ 1485, parent: usb1
[   44.188849] usb 1-1: PM: usb_dev_suspend+0x0/0x20 returned 0 after 38438 usecs
[   44.196111] scsi host5: PM: calling scsi_bus_suspend+0x0/0xc0 @ 1485, parent: ata6
[   44.203659] scsi host5: PM: scsi_bus_suspend+0x0/0xc0 returned 0 after 0 usecs
[   44.210880] scsi host4: PM: calling scsi_bus_suspend+0x0/0xc0 @ 1485, parent: ata5
[   44.218426] scsi host4: PM: scsi_bus_suspend+0x0/0xc0 returned 0 after 0 usecs
[   44.225632] scsi host3: PM: calling scsi_bus_suspend+0x0/0xc0 @ 1485, parent: ata4
[   44.233175] scsi host3: PM: scsi_bus_suspend+0x0/0xc0 returned 0 after 0 usecs
[   44.240377] scsi host2: PM: calling scsi_bus_suspend+0x0/0xc0 @ 1485, parent: ata3
[   44.247917] scsi host2: PM: scsi_bus_suspend+0x0/0xc0 returned 0 after 0 usecs
[   44.255141] scsi host1: PM: calling scsi_bus_suspend+0x0/0xc0 @ 1485, parent: ata2
[   44.262677] scsi host1: PM: scsi_bus_suspend+0x0/0xc0 returned 0 after 0 usecs
[   44.269880] scsi host0: PM: calling scsi_bus_suspend+0x0/0xc0 @ 1485, parent: ata1
[   44.277420] scsi host0: PM: scsi_bus_suspend+0x0/0xc0 returned 0 after 0 usecs
[   44.284634] usb usb2: PM: calling usb_dev_suspend+0x0/0x20 @ 1485, parent: 0000:00:14.0
[   44.317420] usb usb2: PM: usb_dev_suspend+0x0/0x20 returned 0 after 24820 usecs
[   44.324729]  ata6: PM: calling ata_port_pm_suspend+0x0/0x50 @ 1485, parent: 0000:00:17.0
[   44.335246]  ata6: PM: ata_port_pm_suspend+0x0/0x50 returned 0 after 2452 usecs
[   44.342554]  ata5: PM: calling ata_port_pm_suspend+0x0/0x50 @ 1485, parent: 0000:00:17.0
[   44.351864]  ata5: PM: ata_port_pm_suspend+0x0/0x50 returned 0 after 1253 usecs
[   44.359172]  ata4: PM: calling ata_port_pm_suspend+0x0/0x50 @ 1485, parent: 0000:00:17.0
[   44.368490]  ata4: PM: ata_port_pm_suspend+0x0/0x50 returned 0 after 1258 usecs
[   44.375793]  ata3: PM: calling ata_port_pm_suspend+0x0/0x50 @ 1485, parent: 0000:00:17.0
[   44.385007]  ata3: PM: ata_port_pm_suspend+0x0/0x50 returned 0 after 1155 usecs
[   44.392307]  ata2: PM: calling ata_port_pm_suspend+0x0/0x50 @ 1485, parent: 0000:00:17.0
[   44.401617]  ata2: PM: ata_port_pm_suspend+0x0/0x50 returned 0 after 1256 usecs
[   44.408919]  ata1: PM: calling ata_port_pm_suspend+0x0/0x50 @ 1485, parent: 0000:00:17.0
[   44.418230]  ata1: PM: ata_port_pm_suspend+0x0/0x50 returned 0 after 1252 usecs
[   44.425589] usb usb1: PM: calling usb_dev_suspend+0x0/0x20 @ 1485, parent: 0000:00:14.0
[   44.440786] usb usb1: PM: usb_dev_suspend+0x0/0x20 returned 0 after 7219 usecs
[   44.448000] platform iTCO_wdt: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: 0000:00:1f.4
[   44.457096] platform iTCO_wdt: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.465200] pwm pwmchip0: PM: calling pwm_class_suspend+0x0/0x100 @ 1485, parent: INT3450:00
[   44.473601] pwm pwmchip0: PM: pwm_class_suspend+0x0/0x100 returned 0 after 0 usecs
[   44.481155] platform microcode: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.489988] platform microcode: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.498164] platform eisa.0: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.506737] platform eisa.0: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.514636] alarmtimer alarmtimer.0.auto: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: rtc0
[   44.524035] alarmtimer alarmtimer.0.auto: PM: platform_pm_suspend+0x0/0x60 returned 0 after 50 usecs
[   44.533130] rtc rtc0: PM: calling rtc_suspend+0x0/0x190 @ 1485, parent: rtc_cmos
[   44.540502] rtc rtc0: PM: rtc_suspend+0x0/0x190 returned 0 after 0 usecs
[   44.547213] platform Fixed MDIO bus.0: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.556649] platform Fixed MDIO bus.0: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.565542] serial8250 serial8250: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.574631] serial8250 serial8250: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.583082] input input2: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: LNXPWRBN:00
[   44.591483] input input2: PM: input_dev_suspend+0x0/0x60 returned 0 after 1 usecs
[   44.598948] input input1: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: PNP0C0C:00
[   44.607265] input input1: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.614722] input input0: PM: calling input_dev_suspend+0x0/0x60 @ 1485, parent: PNP0C0E:00
[   44.623044] input input0: PM: input_dev_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.630604] platform pcspkr: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.639175] platform pcspkr: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.647071] rtc_cmos rtc_cmos: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.656097] rtc_cmos rtc_cmos: PM: platform_pm_suspend+0x0/0x60 returned 0 after 276 usecs
[   44.664329] snd_hda_intel 0000:03:00.1: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: 0000:02:00.0
[   44.674073] snd_hda_intel 0000:03:00.1: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 220 usecs
[   44.682940] system 00:08: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.690577] system 00:08: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 0 usecs
[   44.697866] system 00:07: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.705496] system 00:07: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 0 usecs
[   44.712779] system 00:06: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.720405] system 00:06: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 0 usecs
[   44.727697] system 00:05: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.735328] system 00:05: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 0 usecs
[   44.742620] system 00:04: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.750251] system 00:04: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 0 usecs
[   44.757535] system 00:03: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.765157] system 00:03: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 0 usecs
[   44.772442] serial 00:02: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.780113] serial 00:02: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 35 usecs
[   44.787491] system 00:01: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.795126] system 00:01: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 5 usecs
[   44.802418] system 00:00: PM: calling pnp_bus_suspend+0x0/0x20 @ 1485, parent: pnp0
[   44.810055] system 00:00: PM: pnp_bus_suspend+0x0/0x20 returned 0 after 0 usecs
[   44.817364] button LNXPWRBN:00: PM: calling acpi_button_suspend+0x0/0x20 @ 1485, parent: LNXSYSTM:00
[   44.826458] button LNXPWRBN:00: PM: acpi_button_suspend+0x0/0x20 returned 0 after 0 usecs
[   44.834639] acpi-wmi PNP0C14:05: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: platform
[   44.843564] acpi-wmi PNP0C14:05: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 1 usecs
[   44.851805] acpi-wmi PNP0C14:04: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: platform
[   44.860729] acpi-wmi PNP0C14:04: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.868971] acpi-fan PNP0C0B:04: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.877895] acpi-fan PNP0C0B:04: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.886136] acpi-fan PNP0C0B:03: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.895060] acpi-fan PNP0C0B:03: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.903302] acpi-fan PNP0C0B:02: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.912226] acpi-fan PNP0C0B:02: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.920468] acpi-fan PNP0C0B:01: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.929392] acpi-fan PNP0C0B:01: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.937629] acpi-fan PNP0C0B:00: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.946555] acpi-fan PNP0C0B:00: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.954796] platform PNP0C0C:00: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   44.963721] platform PNP0C0C:00: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.971962] intel_pmc_core INT33A1:00: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: platform
[   44.981402] intel_pmc_core INT33A1:00: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 0 usecs
[   44.990157] acpi-wmi PNP0C14:03: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: platform
[   44.999081] acpi-wmi PNP0C14:03: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.007323] acpi-wmi PNP0C14:02: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: platform
[   45.016247] acpi-wmi PNP0C14:02: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.024483] platform PNP0C0E:00: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   45.033408] platform PNP0C0E:00: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.041650] acpi-wmi PNP0C14:01: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: platform
[   45.050574] acpi-wmi PNP0C14:01: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.058815] acpi-tad ACPI000E:00: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: platform
[   45.067891] acpi-tad ACPI000E:00: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 67 usecs
[   45.076300] platform ACPI000C:00: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   45.085309] platform ACPI000C:00: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.093632] acpi-wmi PNP0C14:00: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: pci0000:00
[   45.102732] acpi-wmi PNP0C14:00: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.110970] cannonlake-pinctrl INT3450:00: PM: calling acpi_subsys_suspend+0x0/0x60 @ 1485, parent: pci0000:00
[   45.120929] cannonlake-pinctrl INT3450:00: PM: acpi_subsys_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.130032] platform PNP0C04:00: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: 0000:00:1f.0
[   45.139297] platform PNP0C04:00: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.147538] platform PNP0103:00: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: 0000:00:1f.0
[   45.156806] platform PNP0103:00: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.165048] platform PNP0C09:00: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: 0000:00:1f.0
[   45.174314] platform PNP0C09:00: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   45.182567] amdgpu 0000:03:00.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: 0000:02:00.0
[   45.194601] ------------[ cut here ]------------
[   45.199213] Scheduling eviction of pid 1483 in 0 jiffies
[   45.199229] WARNING: CPU: 2 PID: 1346 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:1137 kgd2kfd_schedule_evict_and_restore_process+0x1f4/0x230 [amdgpu]
[   45.219279] Modules linked in: amdgpu i2c_algo_bit drm_ttm_helper ttm amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt cec rc_core binfmt_misc snd_sof_pci_intel_cnl intel_rapl_msr snd_sof_intel_hda_common mei_hdcp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_hdac_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_common snd_sof_intel_hda intel_tcc_cooling snd_sof snd_hda_codec_realtek x86_pkg_temp_thermal snd_sof_utils snd_hda_codec_generic snd_hda_ext_core intel_powerclamp snd_soc_core snd_hda_codec_hdmi snd_compress coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_intel_dspcfg aesni_intel snd_hda_codec crypto_simd pl2303 cryptd snd_hwdep usbserial input_leds snd_hda_core rapl joydev snd_pcm intel_cstate eeepc_wmi asus_wmi snd_seq ledtrig_audio sparse_keymap snd_seq_device platform_profile mei_me wmi_bmof snd_timer intel_wmi_thunderbolt mxm_wmi ee1004 snd mei soundcore acpi_pad mac_hid acpi_tad sch_fq_codel msr
[   45.219351]  parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_generic usbhid hid e1000e ahci i2c_i801 xhci_pci libahci i2c_smbus xhci_pci_renesas video pinctrl_cannonlake wmi
[   45.322250] CPU: 2 PID: 1346 Comm: page0 Tainted: G        W          6.2.8+ #44
[   45.329616] Hardware name: System manufacturer System Product Name/PRIME Z390-A, BIOS 1401 11/26/2019
[   45.338783] RIP: 0010:kgd2kfd_schedule_evict_and_restore_process+0x1f4/0x230 [amdgpu]
[   45.347066] Code: 00 48 89 ca 48 89 4d e8 48 c7 c7 78 44 7c c1 48 83 05 87 6b dd 00 01 8b b0 90 09 00 00 e8 54 12 77 f0 48 83 05 7c 6b dd 00 01 <0f> 0b 48 83 05 7a 6b dd 00 01 48 8b 4d e8 e9 f0 fe ff ff 48 83 05
[   45.365722] RSP: 0018:ffffb356c3067da8 EFLAGS: 00010002
[   45.370932] RAX: 0000000000000000 RBX: ffff890fb0565bc8 RCX: 0000000000000000
[   45.378033] RDX: 0000000000000003 RSI: ffffffffb2f03b10 RDI: 00000000ffffffff
[   45.385134] RBP: ffffb356c3067dc0 R08: 0000000000000003 R09: 00000000ffffffff
[   45.392230] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[   45.399331] R13: ffff890ead4c9000 R14: 0000000000000246 R15: 0000000000000000
[   45.406432] FS:  0000000000000000(0000) GS:ffff891dadb00000(0000) knlGS:0000000000000000
[   45.414478] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   45.420199] CR2: 00007fcf4541ba70 CR3: 0000000d2d026004 CR4: 00000000003706e0
[   45.427300] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   45.434401] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   45.441498] Call Trace:
[   45.443947]  <TASK>
[   45.446049]  amdkfd_fence_enable_signaling+0x9b/0xe0 [amdgpu]
[   45.452117]  __dma_fence_enable_signaling+0x7b/0x120
[   45.457070]  ? __pfx_drm_sched_entity_wakeup+0x10/0x10 [gpu_sched]
[   45.463239]  dma_fence_add_callback+0x50/0xe0
[   45.467592]  drm_sched_entity_pop_job+0x175/0xaa0 [gpu_sched]
[   45.473331]  drm_sched_main+0x11b/0x7f0 [gpu_sched]
[   45.478207]  ? __pfx_autoremove_wake_function+0x10/0x10
[   45.483417]  ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
[   45.488811]  kthread+0x105/0x130
[   45.492041]  ? __pfx_kthread+0x10/0x10
[   45.495784]  ret_from_fork+0x29/0x50
[   45.499363]  </TASK>
[   45.501560] irq event stamp: 844274
[   45.505044] hardirqs last  enabled at (844273): [<ffffffffb26b3248>] _raw_spin_unlock_irqrestore+0x68/0x80
[   45.514649] hardirqs last disabled at (844274): [<ffffffffb26b2edc>] _raw_spin_lock_irqsave+0x6c/0x70
[   45.523816] softirqs last  enabled at (766272): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   45.532382] softirqs last disabled at (766267): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   45.540954] ---[ end trace 0000000000000000 ]---
[   46.315890] ------------[ cut here ]------------
[   46.320507] Scheduling eviction of pid 1484 in 0 jiffies
[   46.320533] WARNING: CPU: 2 PID: 1346 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:1137 kgd2kfd_schedule_evict_and_restore_process+0x1f4/0x230 [amdgpu]
[   46.340943] Modules linked in: amdgpu i2c_algo_bit drm_ttm_helper ttm amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt cec rc_core binfmt_misc snd_sof_pci_intel_cnl intel_rapl_msr snd_sof_intel_hda_common mei_hdcp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_hdac_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_common snd_sof_intel_hda intel_tcc_cooling snd_sof snd_hda_codec_realtek x86_pkg_temp_thermal snd_sof_utils snd_hda_codec_generic snd_hda_ext_core intel_powerclamp snd_soc_core snd_hda_codec_hdmi snd_compress coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_intel_dspcfg aesni_intel snd_hda_codec crypto_simd pl2303 cryptd snd_hwdep usbserial input_leds snd_hda_core rapl joydev snd_pcm intel_cstate eeepc_wmi asus_wmi snd_seq ledtrig_audio sparse_keymap snd_seq_device platform_profile mei_me wmi_bmof snd_timer intel_wmi_thunderbolt mxm_wmi ee1004 snd mei soundcore acpi_pad mac_hid acpi_tad sch_fq_codel msr
[   46.341084]  parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_generic usbhid hid e1000e ahci i2c_i801 xhci_pci libahci i2c_smbus xhci_pci_renesas video pinctrl_cannonlake wmi
[   46.443982] CPU: 2 PID: 1346 Comm: page0 Tainted: G        W          6.2.8+ #44
[   46.451340] Hardware name: System manufacturer System Product Name/PRIME Z390-A, BIOS 1401 11/26/2019
[   46.460507] RIP: 0010:kgd2kfd_schedule_evict_and_restore_process+0x1f4/0x230 [amdgpu]
[   46.468781] Code: 00 48 89 ca 48 89 4d e8 48 c7 c7 78 44 7c c1 48 83 05 87 6b dd 00 01 8b b0 90 09 00 00 e8 54 12 77 f0 48 83 05 7c 6b dd 00 01 <0f> 0b 48 83 05 7a 6b dd 00 01 48 8b 4d e8 e9 f0 fe ff ff 48 83 05
[   46.487433] RSP: 0018:ffffb356c3067da8 EFLAGS: 00010002
[   46.492639] RAX: 0000000000000000 RBX: ffff890fb0565e18 RCX: 0000000000000000
[   46.499740] RDX: 0000000000000003 RSI: ffffffffb2f03b10 RDI: 00000000ffffffff
[   46.506845] RBP: ffffb356c3067dc0 R08: 0000000000000003 R09: 00000000ffffffff
[   46.513946] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[   46.521045] R13: ffff890ea9f8c000 R14: 0000000000000246 R15: 0000000000000000
[   46.528147] FS:  0000000000000000(0000) GS:ffff891dadb00000(0000) knlGS:0000000000000000
[   46.536193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   46.541915] CR2: 00007fcf4541ba70 CR3: 0000000d2d026004 CR4: 00000000003706e0
[   46.549016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   46.556116] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   46.563213] Call Trace:
[   46.565662]  <TASK>
[   46.567765]  amdkfd_fence_enable_signaling+0x9b/0xe0 [amdgpu]
[   46.573828]  __dma_fence_enable_signaling+0x7b/0x120
[   46.578777]  ? __pfx_drm_sched_entity_wakeup+0x10/0x10 [gpu_sched]
[   46.584945]  dma_fence_add_callback+0x50/0xe0
[   46.589290]  drm_sched_entity_pop_job+0x175/0xaa0 [gpu_sched]
[   46.595030]  drm_sched_main+0x11b/0x7f0 [gpu_sched]
[   46.599905]  ? __pfx_autoremove_wake_function+0x10/0x10
[   46.605116]  ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
[   46.610509]  kthread+0x105/0x130
[   46.613739]  ? __pfx_kthread+0x10/0x10
[   46.617482]  ret_from_fork+0x29/0x50
[   46.621061]  </TASK>
[   46.623259] irq event stamp: 941820
[   46.626742] hardirqs last  enabled at (941819): [<ffffffffb26b3248>] _raw_spin_unlock_irqrestore+0x68/0x80
[   46.636347] hardirqs last disabled at (941820): [<ffffffffb26b2edc>] _raw_spin_lock_irqsave+0x6c/0x70
[   46.645515] softirqs last  enabled at (923428): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   46.654080] softirqs last disabled at (923421): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   46.662646] ---[ end trace 0000000000000000 ]---
[   47.666272] [drm] evicting device resources failed
[   47.671578] XH: 0 bo(ffff890ea0e7a000) locked
[   47.671608]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   47.683148]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   47.689771]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   47.696631]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   47.704943]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   47.712121]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   47.717503]          __x64_sys_ioctl+0x92/0xd0
[   47.721948]          do_syscall_64+0x59/0x90
[   47.726220]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   47.731960] XH: 1 bo(ffff890ea0e78800) locked
[   47.731962]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   47.742070]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   47.748087]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   47.754371]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   47.762184]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   47.769122]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   47.774341]          __x64_sys_ioctl+0x92/0xd0
[   47.778787]          do_syscall_64+0x59/0x90
[   47.783060]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   47.788808] XH: 2 bo(ffff890e90aee800) locked
[   47.788823]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   47.798967]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   47.805007]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   47.811306]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   47.819228]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   47.826153]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   47.831379]          __x64_sys_ioctl+0x92/0xd0
[   47.835834]          do_syscall_64+0x59/0x90
[   47.840115]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   47.845863] XH: 3 bo(ffff890e90aea000) locked
[   47.845865]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   47.856052]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   47.862072]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   47.868370]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   47.876219]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   47.883168]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   47.888401]          __x64_sys_ioctl+0x92/0xd0
[   47.892879]          do_syscall_64+0x59/0x90
[   47.897147]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   47.902893] XH: 4 bo(ffff890e90ae8800) locked
[   47.902895]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   47.913052]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   47.919112]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   47.925391]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   47.933222]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   47.940178]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   47.945409]          __x64_sys_ioctl+0x92/0xd0
[   47.949862]          do_syscall_64+0x59/0x90
[   47.954155]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   47.959901] XH: 5 bo(ffff890e90aed000) locked
[   47.959903]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   47.970060]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   47.976101]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   47.982399]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   47.990230]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   47.997185]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.002414]          __x64_sys_ioctl+0x92/0xd0
[   48.006866]          do_syscall_64+0x59/0x90
[   48.011150]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.016899] XH: 6 bo(ffff890e90aeb800) locked
[   48.016900]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.027057]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.033096]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.039398]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.047230]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.054184]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.059422]          __x64_sys_ioctl+0x92/0xd0
[   48.063897]          do_syscall_64+0x59/0x90
[   48.068167]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.073913] XH: 7 bo(ffff890e90af5000) locked
[   48.073915]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.084065]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.090105]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.096406]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.104254]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.111209]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.116439]          __x64_sys_ioctl+0x92/0xd0
[   48.120894]          do_syscall_64+0x59/0x90
[   48.125186]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.130936] XH: 8 bo(ffff890e90af3800) locked
[   48.130952]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.141085]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.147127]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.153423]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.161272]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.168209]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.173455]          __x64_sys_ioctl+0x92/0xd0
[   48.177892]          do_syscall_64+0x59/0x90
[   48.182174]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.187922] XH: 9 bo(ffff890e90af6800) locked
[   48.187924]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.198072]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.204105]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.210402]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.218224]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.225179]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.230407]          __x64_sys_ioctl+0x92/0xd0
[   48.234862]          do_syscall_64+0x59/0x90
[   48.239143]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.244889] XH: 10 bo(ffff890e90af2000) locked
[   48.244891]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.255127]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.261168]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.267493]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.275306]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.282260]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.287488]          __x64_sys_ioctl+0x92/0xd0
[   48.291942]          do_syscall_64+0x59/0x90
[   48.296224]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.301998] XH: 11 bo(ffff890e90af0800) locked
[   48.302000]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.312208]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.318249]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.324548]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.332425]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.339436]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.344668]          __x64_sys_ioctl+0x92/0xd0
[   48.349122]          do_syscall_64+0x59/0x90
[   48.353406]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.359152] XH: 12 bo(ffff890e90680800) locked
[   48.359154]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.369389]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.375430]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.381730]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.389563]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.396514]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.401796]          __x64_sys_ioctl+0x92/0xd0
[   48.406250]          do_syscall_64+0x59/0x90
[   48.410534]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.416305] XH: 13 bo(ffff890e90685000) locked
[   48.416306]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.426527]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.432567]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.438889]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.446713]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.453668]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.458901]          __x64_sys_ioctl+0x92/0xd0
[   48.463353]          do_syscall_64+0x59/0x90
[   48.467637]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.473384] XH: 14 bo(ffff890e90683800) locked
[   48.473385]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.483621]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.489661]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.495959]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.503790]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.510744]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.515974]          __x64_sys_ioctl+0x92/0xd0
[   48.520426]          do_syscall_64+0x59/0x90
[   48.524710]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.530456] XH: 15 bo(ffff890e90686800) locked
[   48.530458]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.540694]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.546785]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.553087]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.560917]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.567873]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.573105]          __x64_sys_ioctl+0x92/0xd0
[   48.577560]          do_syscall_64+0x59/0x90
[   48.581837]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.587586] XH: 16 bo(ffff890e90682000) locked
[   48.587588]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.597821]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.603855]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.610155]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.617987]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.624941]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.630169]          __x64_sys_ioctl+0x92/0xd0
[   48.634624]          do_syscall_64+0x59/0x90
[   48.638906]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.644654] XH: 17 bo(ffff890e9068a000) locked
[   48.644656]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.654941]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.660974]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.667277]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.675107]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.682070]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.687298]          __x64_sys_ioctl+0x92/0xd0
[   48.691752]          do_syscall_64+0x59/0x90
[   48.696034]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.701782] XH: 18 bo(ffff890e90688800) locked
[   48.701783]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.712010]          amdgpu_bo_create_user+0x40/0x80 [amdgpu]
[   48.718042]          amdgpu_gem_object_create+0x7f/0xd0 [amdgpu]
[   48.724345]          amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x280/0xea0 [amdgpu]
[   48.732176]          kfd_ioctl_alloc_memory_of_gpu+0x27d/0x540 [amdgpu]
[   48.739130]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.744358]          __x64_sys_ioctl+0x92/0xd0
[   48.748812]          do_syscall_64+0x59/0x90
[   48.753112]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.758859] XH: 19 bo(ffff890e905e8000) locked
[   48.758861]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.769104]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   48.774973]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   48.781013]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   48.787234]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   48.793531]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   48.799570]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   48.805413]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   48.813149]          kfd_process_alloc_gpuvm+0x92/0x1b0 [amdgpu]
[   48.819499]          kfd_process_device_init_vm+0x2a6/0x400 [amdgpu]
[   48.826195]          kfd_ioctl_acquire_vm+0xc5/0x140 [amdgpu]
[   48.832286]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.837519]          __x64_sys_ioctl+0x92/0xd0
[   48.841973]          do_syscall_64+0x59/0x90
[   48.846259]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.852008] XH: 20 bo(ffff890e8fda0000) locked
[   48.852026]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.862243]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   48.868113]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   48.874162]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   48.880381]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   48.886679]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   48.892719]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   48.898586]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   48.906323]          kfd_process_alloc_gpuvm+0x92/0x1b0 [amdgpu]
[   48.912676]          kfd_process_device_init_vm+0x2a6/0x400 [amdgpu]
[   48.919393]          kfd_ioctl_acquire_vm+0xc5/0x140 [amdgpu]
[   48.925486]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   48.930719]          __x64_sys_ioctl+0x92/0xd0
[   48.935180]          do_syscall_64+0x59/0x90
[   48.939460]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   48.945207] XH: 21 bo(ffff890ea1bb2000) locked
[   48.945208]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   48.955435]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   48.961295]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   48.967343]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   48.973564]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   48.979862]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   48.985901]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   48.991743]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   48.999488]          kfd_process_alloc_gpuvm+0x92/0x1b0 [amdgpu]
[   49.005837]          kfd_process_device_init_vm+0x2a6/0x400 [amdgpu]
[   49.012535]          kfd_ioctl_acquire_vm+0xc5/0x140 [amdgpu]
[   49.018685]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   49.023918]          __x64_sys_ioctl+0x92/0xd0
[   49.028373]          do_syscall_64+0x59/0x90
[   49.032654]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.038402] XH: 22 bo(ffff890e8fda8000) locked
[   49.038403]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.048634]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.054519]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.060561]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.066827]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.073134]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.079173]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   49.085016]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   49.092778]          kfd_ioctl_map_memory_to_gpu+0x173/0x4f0 [amdgpu]
[   49.099545]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   49.104780]          __x64_sys_ioctl+0x92/0xd0
[   49.109225]          do_syscall_64+0x59/0x90
[   49.113509]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.119256] XH: 23 bo(ffff890e8fdb0000) locked
[   49.119257]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.129500]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.135370]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.141418]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.147631]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.153936]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.159975]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   49.165818]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   49.173554]          kfd_ioctl_map_memory_to_gpu+0x173/0x4f0 [amdgpu]
[   49.180338]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   49.185565]          __x64_sys_ioctl+0x92/0xd0
[   49.190019]          do_syscall_64+0x59/0x90
[   49.194303]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.200049] XH: 24 bo(ffff890ea3e55000) locked
[   49.200051]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.210269]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.216138]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.222186]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.228398]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.234699]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.240767]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   49.246598]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   49.254344]          kfd_ioctl_map_memory_to_gpu+0x173/0x4f0 [amdgpu]
[   49.261126]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   49.266356]          __x64_sys_ioctl+0x92/0xd0
[   49.270808]          do_syscall_64+0x59/0x90
[   49.275092]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.280837] XH: 25 bo(ffff890e90af8000) locked
[   49.280839]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.291076]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.296944]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.302992]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.309212]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.315511]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.321550]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   49.327392]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   49.335129]          kfd_ioctl_map_memory_to_gpu+0x173/0x4f0 [amdgpu]
[   49.341916]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   49.347149]          __x64_sys_ioctl+0x92/0xd0
[   49.351601]          do_syscall_64+0x59/0x90
[   49.355885]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.361628] XH: 26 bo(ffff890e90698000) locked
[   49.361629]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.371865]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.377733]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.383773]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.389985]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.396283]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.402322]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   49.408164]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   49.415909]          kfd_ioctl_map_memory_to_gpu+0x173/0x4f0 [amdgpu]
[   49.422688]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   49.427922]          __x64_sys_ioctl+0x92/0xd0
[   49.432373]          do_syscall_64+0x59/0x90
[   49.436657]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.442400] XH: 27 bo(ffff890e906b8000) locked
[   49.442402]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.452629]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.458548]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.464588]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.470824]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.477127]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.483177]          update_gpuvm_pte+0x36c/0x750 [amdgpu]
[   49.489022]          amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x41d/0x1a50 [amdgpu]
[   49.496776]          kfd_ioctl_map_memory_to_gpu+0x173/0x4f0 [amdgpu]
[   49.503542]          kfd_ioctl+0x497/0x9b0 [amdgpu]
[   49.508783]          __x64_sys_ioctl+0x92/0xd0
[   49.513229]          do_syscall_64+0x59/0x90
[   49.517514]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.523263] XH: 28 bo(ffff890e90960000) locked
[   49.523265]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.533498]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.539360]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.545407]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.551619]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.557916]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.563957]          amdgpu_gem_va_ioctl+0x7c3/0x880 [amdgpu]
[   49.569996]          drm_ioctl_kernel+0x113/0x260 [drm]
[   49.575288]          drm_ioctl+0x355/0x760 [drm]
[   49.579967]          amdgpu_drm_ioctl+0x5e/0xc0 [amdgpu]
[   49.585568]          __x64_sys_ioctl+0x92/0xd0
[   49.590022]          do_syscall_64+0x59/0x90
[   49.594301]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.600048] XH: 29 bo(ffff890e90968000) locked
[   49.600049]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.610268]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.616136]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.622176]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.628388]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.634685]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.640725]          amdgpu_gem_va_ioctl+0x7c3/0x880 [amdgpu]
[   49.646772]          drm_ioctl_kernel+0x113/0x260 [drm]
[   49.652054]          drm_ioctl+0x355/0x760 [drm]
[   49.656730]          amdgpu_drm_ioctl+0x5e/0xc0 [amdgpu]
[   49.662348]          __x64_sys_ioctl+0x92/0xd0
[   49.666804]          do_syscall_64+0x59/0x90
[   49.671086]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.676845] XH: 30 bo(ffff890e90972000) locked
[   49.676846]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.687065]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.692926]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.698967]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.705186]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.711484]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.717524]          amdgpu_gem_va_ioctl+0x7c3/0x880 [amdgpu]
[   49.723568]          drm_ioctl_kernel+0x113/0x260 [drm]
[   49.728846]          drm_ioctl+0x355/0x760 [drm]
[   49.733525]          amdgpu_drm_ioctl+0x5e/0xc0 [amdgpu]
[   49.739119]          __x64_sys_ioctl+0x92/0xd0
[   49.743571]          do_syscall_64+0x59/0x90
[   49.747855]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.753602] XH: 31 bo(ffff890e90978000) locked
[   49.753604]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.763839]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.769699]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.775747]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.781959]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.788262]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.794309]          amdgpu_cs_ioctl+0x197e/0x2b80 [amdgpu]
[   49.800177]          drm_ioctl_kernel+0x113/0x260 [drm]
[   49.805456]          drm_ioctl+0x355/0x760 [drm]
[   49.810136]          amdgpu_drm_ioctl+0x5e/0xc0 [amdgpu]
[   49.815736]          __x64_sys_ioctl+0x92/0xd0
[   49.820191]          do_syscall_64+0x59/0x90
[   49.824472]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.830220] XH: 32 bo(ffff890e90b40000) locked
[   49.830222]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.840461]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.846380]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.852422]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.858654]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.864960]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.871000]          amdgpu_cs_ioctl+0x197e/0x2b80 [amdgpu]
[   49.876867]          drm_ioctl_kernel+0x113/0x260 [drm]
[   49.882149]          drm_ioctl+0x355/0x760 [drm]
[   49.886830]          amdgpu_drm_ioctl+0x5e/0xc0 [amdgpu]
[   49.892429]          __x64_sys_ioctl+0x92/0xd0
[   49.896884]          do_syscall_64+0x59/0x90
[   49.901163]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.906912] XH: 33 bo(ffff890e90970800) locked
[   49.906914]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.917148]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.923018]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   49.929065]          amdgpu_vm_ptes_update+0xa37/0xbf0 [amdgpu]
[   49.935277]          amdgpu_vm_update_range+0x25f/0xb70 [amdgpu]
[   49.941583]          amdgpu_vm_bo_update+0x2e5/0x950 [amdgpu]
[   49.947632]          amdgpu_cs_ioctl+0x197e/0x2b80 [amdgpu]
[   49.953499]          drm_ioctl_kernel+0x113/0x260 [drm]
[   49.958779]          drm_ioctl+0x355/0x760 [drm]
[   49.963458]          amdgpu_drm_ioctl+0x5e/0xc0 [amdgpu]
[   49.969058]          __x64_sys_ioctl+0x92/0xd0
[   49.973512]          do_syscall_64+0x59/0x90
[   49.977795]          entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   49.983543] XH: 34 bo(ffff890ea1238000) locked
[   49.983561]          amdgpu_bo_create+0x352/0x8d0 [amdgpu]
[   49.993787]          amdgpu_bo_create_vm+0x3a/0x90 [amdgpu]
[   49.999656]          amdgpu_vm_pt_create+0x124/0x370 [amdgpu]
[   50.005705]          amdgpu_vm_init+0x2f7/0x840 [amdgpu]
[   50.011311]          amdgpu_driver_open_kms+0x17b/0x3d0 [amdgpu]
[   50.017604]          drm_file_alloc+0x242/0x410 [drm]
[   50.022711]          drm_open_helper+0x8f/0x200 [drm]
[   50.027823]          drm_open+0x88/0x180 [drm]
[   50.032330]          drm_stub_open+0xdb/0x250 [drm]
[   50.037265]          chrdev_open+0xc4/0x260
[   50.041461]          do_dentry_open+0x16a/0x440
[   50.046002]          vfs_open+0x2d/0x40
[   50.049855]          path_openat+0x3d7/0xae0
[   50.054139]          do_filp_open+0xb2/0x160
[   50.058425]          do_sys_openat2+0x9f/0x160
[   50.062884]          __x64_sys_openat+0x55/0x90
[   50.177854] ------------[ cut here ]------------
[   50.182706] Evicting all processes
[   50.182736] WARNING: CPU: 1 PID: 1485 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1995 kfd_suspend_all_processes+0x2f7/0x340 [amdgpu]
[   50.200005] Modules linked in: amdgpu i2c_algo_bit drm_ttm_helper ttm amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt cec rc_core binfmt_misc snd_sof_pci_intel_cnl intel_rapl_msr snd_sof_intel_hda_common mei_hdcp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_hdac_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_common snd_sof_intel_hda intel_tcc_cooling snd_sof snd_hda_codec_realtek x86_pkg_temp_thermal snd_sof_utils snd_hda_codec_generic snd_hda_ext_core intel_powerclamp snd_soc_core snd_hda_codec_hdmi snd_compress coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_intel_dspcfg aesni_intel snd_hda_codec crypto_simd pl2303 cryptd snd_hwdep usbserial input_leds snd_hda_core rapl joydev snd_pcm intel_cstate eeepc_wmi asus_wmi snd_seq ledtrig_audio sparse_keymap snd_seq_device platform_profile mei_me wmi_bmof snd_timer intel_wmi_thunderbolt mxm_wmi ee1004 snd mei soundcore acpi_pad mac_hid acpi_tad sch_fq_codel msr
[   50.200285]  parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_generic usbhid hid e1000e ahci i2c_i801 xhci_pci libahci i2c_smbus xhci_pci_renesas video pinctrl_cannonlake wmi
[   50.303204] CPU: 1 PID: 1485 Comm: pm-suspend Tainted: G        W          6.2.8+ #44
[   50.311009] Hardware name: System manufacturer System Product Name/PRIME Z390-A, BIOS 1401 11/26/2019
[   50.320197] RIP: 0010:kfd_suspend_all_processes+0x2f7/0x340 [amdgpu]
[   50.327039] Code: d7 48 83 05 62 6b dc 00 01 e9 2e ff ff ff 48 c7 c7 df bd 83 c1 48 83 05 2e 6b dc 00 01 e8 31 91 75 f0 48 83 05 29 6b dc 00 01 <0f> 0b 48 83 05 27 6b dc 00 01 e9 a1 fd ff ff 48 83 05 ba 4e dc 00
[   50.345700] RSP: 0018:ffffb356c36e3a58 EFLAGS: 00010202
[   50.350924] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[   50.358033] RDX: 0000000000000002 RSI: ffffffffb2f03b10 RDI: 00000000ffffffff
[   50.365173] RBP: ffffb356c36e3a78 R08: 0000000000000003 R09: 0000000000000001
[   50.372282] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[   50.379393] R13: ffff890eaf33cc00 R14: 0000000000000000 R15: 0000000000000002
[   50.386509] FS:  00007fe400332740(0000) GS:ffff891dada80000(0000) knlGS:0000000000000000
[   50.394569] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.400308] CR2: 0000560c98e55456 CR3: 0000000110582004 CR4: 00000000003706e0
[   50.407421] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   50.414534] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   50.421649] Call Trace:
[   50.424127]  <TASK>
[   50.426245]  kgd2kfd_suspend+0xf3/0x130 [amdgpu]
[   50.431345]  amdgpu_amdkfd_suspend+0x26/0x50 [amdgpu]
[   50.436728]  amdgpu_device_suspend+0x147/0x250 [amdgpu]
[   50.442272]  amdgpu_pmops_suspend+0x50/0xe0 [amdgpu]
[   50.447515]  pci_pm_suspend+0x87/0x1b0
[   50.451268]  ? __pfx_pci_pm_suspend+0x10/0x10
[   50.455630]  dpm_run_callback+0x6c/0x1f0
[   50.459562]  __device_suspend+0x14e/0x5a0
[   50.463578]  dpm_suspend+0x173/0x380
[   50.467160]  dpm_suspend_start+0x93/0xb0
[   50.471091]  suspend_devices_and_enter+0x13e/0x9a0
[   50.475884]  pm_suspend+0x378/0x7c0
[   50.479376]  state_store+0x81/0xe0
[   50.482793]  kobj_attr_store+0xf/0x30
[   50.486459]  sysfs_kf_write+0x48/0x70
[   50.490149]  kernfs_fop_write_iter+0x16e/0x220
[   50.494597]  vfs_write+0x33f/0x500
[   50.498005]  ? __this_cpu_preempt_check+0x13/0x20
[   50.502718]  ksys_write+0x6d/0xf0
[   50.506037]  __x64_sys_write+0x19/0x20
[   50.509818]  do_syscall_64+0x59/0x90
[   50.513402]  ? do_syscall_64+0x69/0x90
[   50.517160]  ? __this_cpu_preempt_check+0x13/0x20
[   50.521872]  ? lockdep_hardirqs_on+0xcc/0x150
[   50.526233]  ? syscall_exit_to_user_mode+0x37/0x50
[   50.531023]  ? do_syscall_64+0x69/0x90
[   50.534782]  ? syscall_exit_to_user_mode+0x37/0x50
[   50.539572]  ? do_syscall_64+0x69/0x90
[   50.543326]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   50.548377] RIP: 0033:0x7fe400114a37
[   50.551959] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   50.570624] RSP: 002b:00007ffdde07ad68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   50.578171] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe400114a37
[   50.585280] RDX: 0000000000000003 RSI: 00005631dfefe710 RDI: 0000000000000001
[   50.592399] RBP: 00005631dfefe710 R08: 00005631dfef359c R09: 0000000000000000
[   50.599516] R10: 00005631dfef359b R11: 0000000000000246 R12: 0000000000000001
[   50.606628] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[   50.613757]  </TASK>
[   50.615952] irq event stamp: 178359
[   50.619440] hardirqs last  enabled at (178369): [<ffffffffb17950a6>] __up_console_sem+0x86/0x90
[   50.628109] hardirqs last disabled at (178378): [<ffffffffb179508b>] __up_console_sem+0x6b/0x90
[   50.636804] softirqs last  enabled at (177942): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   50.645404] softirqs last disabled at (177937): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   50.653982] ---[ end trace 0000000000000000 ]---
[   52.762639] [drm] psp gfx command UNLOAD_TA(0x2) failed and response status is (0x117)
[   52.798845] amdgpu 0000:03:00.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 7607355 usecs
[   52.807263] pcieport 0000:02:00.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: 0000:01:00.0
[   52.816363] pcieport 0000:02:00.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 4 usecs
[   52.824432] pcieport 0000:01:00.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: 0000:00:01.0
[   52.833525] pcieport 0000:01:00.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 2 usecs
[   52.841594] e1000e 0000:00:1f.6: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   52.850398] e1000e: EEE TX LPI TIMER: 00000011
[   52.898521] e1000e 0000:00:1f.6: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 48133 usecs
[   52.906768] pci 0000:00:1f.5: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   52.915259] pci 0000:00:1f.5: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 0 usecs
[   52.922892] i801_smbus 0000:00:1f.4: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   52.932086] i801_smbus 0000:00:1f.4: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 100 usecs
[   52.940496] snd_hda_intel 0000:00:1f.3: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   52.951201] snd_hda_intel 0000:00:1f.3: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 1349 usecs
[   52.959964] pci 0000:00:1f.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   52.968454] pci 0000:00:1f.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 0 usecs
[   52.976094] pcieport 0000:00:1d.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   52.985018] pcieport 0000:00:1d.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 2 usecs
[   52.993083] pcieport 0000:00:1c.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   53.013019] pcieport 0000:00:1c.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 11013 usecs
[   53.021436] pcieport 0000:00:1b.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   53.030364] pcieport 0000:00:1b.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 1 usecs
[   53.038430] ahci 0000:00:17.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   53.047021] ahci 0000:00:17.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 13 usecs
[   53.054829] mei_me 0000:00:16.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   53.077101] mei_me 0000:00:16.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 13520 usecs
[   53.085339] pci 0000:00:14.2: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   53.093831] pci 0000:00:14.2: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 0 usecs
[   53.101472] xhci_hcd 0000:00:14.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   53.110755] xhci_hcd 0000:00:14.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 361 usecs
[   53.118986] pcieport 0000:00:01.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   53.127913] pcieport 0000:00:01.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 2 usecs
[   53.135978] skl_uncore 0000:00:00.0: PM: calling pci_pm_suspend+0x0/0x1b0 @ 1485, parent: pci0000:00
[   53.145072] skl_uncore 0000:00:00.0: PM: pci_pm_suspend+0x0/0x1b0 returned 0 after 0 usecs
[   53.153341] thermal LNXTHERM:00: PM: calling acpi_thermal_suspend+0x0/0x20 @ 1485, parent: LNXSYBUS:01
[   53.162644] thermal LNXTHERM:00: PM: acpi_thermal_suspend+0x0/0x20 returned 0 after 36 usecs
[   53.171116] button PNP0C0C:00: PM: calling acpi_button_suspend+0x0/0x20 @ 1485, parent: LNXSYBUS:00
[   53.180126] button PNP0C0C:00: PM: acpi_button_suspend+0x0/0x20 returned 0 after 0 usecs
[   53.188209] button PNP0C0E:00: PM: calling acpi_button_suspend+0x0/0x20 @ 1485, parent: LNXSYBUS:00
[   53.197218] button PNP0C0E:00: PM: acpi_button_suspend+0x0/0x20 returned 0 after 0 usecs
[   53.205794] ec PNP0C09:00: PM: calling acpi_ec_suspend+0x0/0x90 @ 1485, parent: device:19
[   53.213941] ec PNP0C09:00: PM: acpi_ec_suspend+0x0/0x90 returned 0 after 0 usecs
[   53.221506] regulator regulator.0: PM: calling regulator_suspend+0x0/0x140 @ 1485, parent: reg-dummy
[   53.230601] regulator regulator.0: PM: regulator_suspend+0x0/0x140 returned 0 after 0 usecs
[   53.238928] reg-dummy reg-dummy: PM: calling platform_pm_suspend+0x0/0x60 @ 1485, parent: platform
[   53.247852] reg-dummy reg-dummy: PM: platform_pm_suspend+0x0/0x60 returned 0 after 0 usecs
[   53.256200] PM: suspend of devices complete after 10633.520 msecs
[   53.262358] PM: start suspend of devices complete after 10641.391 msecs
[   53.268956] PM: suspend debug: Waiting for 5 second(s).
[   58.234642] reg-dummy reg-dummy: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   58.243489] reg-dummy reg-dummy: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   58.251640] regulator regulator.0: PM: calling regulator_resume+0x0/0x1c0 @ 1485, parent: reg-dummy
[   58.260650] regulator regulator.0: PM: regulator_resume+0x0/0x1c0 returned 0 after 0 usecs
[   58.268946] ec PNP0C09:00: PM: calling acpi_ec_resume+0x0/0x20 @ 1485, parent: device:19
[   58.277032] ec PNP0C09:00: PM: acpi_ec_resume+0x0/0x20 returned 0 after 29 usecs
[   58.284622] button PNP0C0E:00: PM: calling acpi_button_resume+0x0/0x110 @ 1485, parent: LNXSYBUS:00
[   58.293632] button PNP0C0E:00: PM: acpi_button_resume+0x0/0x110 returned 0 after 0 usecs
[   58.301701] button PNP0C0C:00: PM: calling acpi_button_resume+0x0/0x110 @ 1485, parent: LNXSYBUS:00
[   58.310707] button PNP0C0C:00: PM: acpi_button_resume+0x0/0x110 returned 0 after 0 usecs
[   58.318794] thermal LNXTHERM:00: PM: calling acpi_thermal_resume+0x0/0x240 @ 1485, parent: LNXSYBUS:01
[   58.328077] thermal LNXTHERM:00: PM: acpi_thermal_resume+0x0/0x240 returned 0 after 15 usecs
[   58.336533] skl_uncore 0000:00:00.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.345548] skl_uncore 0000:00:00.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 0 usecs
[   58.353698] pcieport 0000:00:01.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.362540] pcieport 0000:00:01.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 2 usecs
[   58.370513] xhci_hcd 0000:00:14.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.505031] xhci_hcd 0000:00:14.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 125674 usecs
[   58.513525] pci 0000:00:14.2: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.522005] pci 0000:00:14.2: PM: pci_pm_resume+0x0/0x100 returned 0 after 3 usecs
[   58.529612] mei_me 0000:00:16.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.554511] mei_me 0000:00:16.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 16188 usecs
[   58.562719] ahci 0000:00:17.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.581432] ahci 0000:00:17.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 10210 usecs
[   58.589416] pcieport 0000:00:1b.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.598253] pcieport 0000:00:1b.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 1 usecs
[   58.606228] pcieport 0000:00:1c.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.615072] pcieport 0000:00:1c.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 1 usecs
[   58.623046] pcieport 0000:00:1d.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.631886] pcieport 0000:00:1d.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 1 usecs
[   58.639868] pci 0000:00:1f.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.648273] pci 0000:00:1f.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 0 usecs
[   58.655826] snd_hda_intel 0000:00:1f.3: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.668875] snd_hda_intel 0000:00:1f.3: PM: pci_pm_resume+0x0/0x100 returned 0 after 3780 usecs
[   58.677544] i801_smbus 0000:00:1f.4: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.686594] i801_smbus 0000:00:1f.4: PM: pci_pm_resume+0x0/0x100 returned 0 after 41 usecs
[   58.694827] pci 0000:00:1f.5: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   58.703237] pci 0000:00:1f.5: PM: pci_pm_resume+0x0/0x100 returned 0 after 0 usecs
[   58.710785] e1000e 0000:00:1f.6: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: pci0000:00
[   59.196643] e1000e 0000:00:1f.6: PM: pci_pm_resume+0x0/0x100 returned 0 after 477184 usecs
[   59.205237] pcieport 0000:01:00.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: 0000:00:01.0
[   59.214294] pcieport 0000:01:00.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 7 usecs
[   59.222287] pcieport 0000:02:00.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: 0000:01:00.0
[   59.231312] pcieport 0000:02:00.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 4 usecs
[   59.239295] amdgpu 0000:03:00.0: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: 0000:02:00.0
[   59.248339] [drm] PCIE GART of 512M enabled.
[   59.252611] [drm] PTB located at 0x000000F5FEF00000
[   59.257540] [drm] PSP is resuming...
[   59.320677] [drm] reserve 0x400000 from 0xf5fe400000 for PSP TMR
[   59.819830] [drm] kiq ring mec 2 pipe 1 q 0
[   59.846884] [drm] UVD and UVD ENC initialized successfully.
[   59.952063] [drm] VCE initialized successfully.
[   59.956605] amdgpu 0000:03:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[   59.963630] amdgpu 0000:03:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
[   59.971006] amdgpu 0000:03:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
[   59.978467] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
[   59.986101] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
[   59.993734] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
[   60.001367] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
[   60.009005] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
[   60.016639] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
[   60.024358] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
[   60.032078] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
[   60.039797] amdgpu 0000:03:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
[   60.047600] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
[   60.054797] amdgpu 0000:03:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
[   60.062001] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
[   60.069205] amdgpu 0000:03:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
[   60.076409] amdgpu 0000:03:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
[   60.083600] amdgpu 0000:03:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
[   60.091324] amdgpu 0000:03:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
[   60.099035] amdgpu 0000:03:00.0: amdgpu: ring vce0 uses VM inv eng 9 on hub 8
[   60.106148] amdgpu 0000:03:00.0: amdgpu: ring vce1 uses VM inv eng 10 on hub 8
[   60.113352] amdgpu 0000:03:00.0: amdgpu: ring vce2 uses VM inv eng 11 on hub 8
[   60.211271] amdgpu 0000:03:00.0: PM: pci_pm_resume+0x0/0x100 returned 0 after 963139 usecs
[   60.219537] platform PNP0C09:00: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: 0000:00:1f.0
[   60.228737] platform PNP0C09:00: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.236906] platform PNP0103:00: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: 0000:00:1f.0
[   60.246089] platform PNP0103:00: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.254239] platform PNP0C04:00: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: 0000:00:1f.0
[   60.263427] platform PNP0C04:00: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.271594] cannonlake-pinctrl INT3450:00: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: pci0000:00
[   60.281470] cannonlake-pinctrl INT3450:00: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 1 usecs
[   60.290483] acpi-wmi PNP0C14:00: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: pci0000:00
[   60.299494] acpi-wmi PNP0C14:00: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 0 usecs
[   60.307658] platform ACPI000C:00: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.316582] platform ACPI000C:00: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.324870] acpi-tad ACPI000E:00: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: platform
[   60.333807] acpi-tad ACPI000E:00: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 0 usecs
[   60.342050] acpi-wmi PNP0C14:01: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: platform
[   60.350891] acpi-wmi PNP0C14:01: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 0 usecs
[   60.359071] platform PNP0C0E:00: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.367910] platform PNP0C0E:00: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.376100] acpi-wmi PNP0C14:02: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: platform
[   60.384940] acpi-wmi PNP0C14:02: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 0 usecs
[   60.393090] acpi-wmi PNP0C14:03: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: platform
[   60.401930] acpi-wmi PNP0C14:03: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 0 usecs
[   60.410084] intel_pmc_core INT33A1:00: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: platform
[   60.419439] intel_pmc_core INT33A1:00: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 0 usecs
[   60.428126] platform PNP0C0C:00: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.436965] platform PNP0C0C:00: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.445116] acpi-fan PNP0C0B:00: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.453959] acpi-fan PNP0C0B:00: PM: platform_pm_resume+0x0/0x50 returned 0 after 1 usecs
[   60.462110] acpi-fan PNP0C0B:01: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.470967] acpi-fan PNP0C0B:01: PM: platform_pm_resume+0x0/0x50 returned 0 after 17 usecs
[   60.479209] acpi-fan PNP0C0B:02: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.488065] acpi-fan PNP0C0B:02: PM: platform_pm_resume+0x0/0x50 returned 0 after 1 usecs
[   60.496217] acpi-fan PNP0C0B:03: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.505057] acpi-fan PNP0C0B:03: PM: platform_pm_resume+0x0/0x50 returned 0 after 1 usecs
[   60.513227] acpi-fan PNP0C0B:04: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.522068] acpi-fan PNP0C0B:04: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.530219] acpi-wmi PNP0C14:04: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: platform
[   60.539057] acpi-wmi PNP0C14:04: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 0 usecs
[   60.547212] acpi-wmi PNP0C14:05: PM: calling acpi_subsys_resume+0x0/0x80 @ 1485, parent: platform
[   60.556047] acpi-wmi PNP0C14:05: PM: acpi_subsys_resume+0x0/0x80 returned 0 after 0 usecs
[   60.564222] button LNXPWRBN:00: PM: calling acpi_button_resume+0x0/0x110 @ 1485, parent: LNXSYSTM:00
[   60.573316] button LNXPWRBN:00: PM: acpi_button_resume+0x0/0x110 returned 0 after 0 usecs
[   60.581475] system 00:00: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.589023] system 00:00: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 0 usecs
[   60.596253] system 00:01: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.603803] system 00:01: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 0 usecs
[   60.611009] serial 00:02: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.622809] serial 00:02: activated
[   60.626352] serial 00:02: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 7799 usecs
[   60.633805] system 00:03: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.641350] system 00:03: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 0 usecs
[   60.648555] system 00:04: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.656106] system 00:04: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 0 usecs
[   60.663311] system 00:05: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.670857] system 00:05: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 0 usecs
[   60.678071] system 00:06: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.685617] system 00:06: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 0 usecs
[   60.692834] system 00:07: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.700377] system 00:07: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 0 usecs
[   60.707603] system 00:08: PM: calling pnp_bus_resume+0x0/0xb0 @ 1485, parent: pnp0
[   60.715147] system 00:08: PM: pnp_bus_resume+0x0/0xb0 returned 0 after 0 usecs
[   60.722461] snd_hda_intel 0000:03:00.1: PM: calling pci_pm_resume+0x0/0x100 @ 1485, parent: 0000:02:00.0
[   60.734103] snd_hda_intel 0000:03:00.1: PM: pci_pm_resume+0x0/0x100 returned 0 after 2207 usecs
[   60.742769] rtc_cmos rtc_cmos: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.751435] rtc_cmos rtc_cmos: PM: platform_pm_resume+0x0/0x50 returned 0 after 1 usecs
[   60.759415] platform pcspkr: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.767910] platform pcspkr: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.775770] input input0: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: PNP0C0E:00
[   60.783998] input input0: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   60.791400] input input1: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: PNP0C0C:00
[   60.799634] input input1: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   60.807011] input input2: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: LNXPWRBN:00
[   60.815335] input input2: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   60.822736] serial8250 serial8250: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.831741] serial8250 serial8250: PM: platform_pm_resume+0x0/0x50 returned 0 after 1 usecs
[   60.840135] platform Fixed MDIO bus.0: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.849500] platform Fixed MDIO bus.0: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.858177] rtc rtc0: PM: calling rtc_resume+0x0/0x80 @ 1485, parent: rtc_cmos
[   60.865369] rtc rtc0: PM: rtc_resume+0x0/0x80 returned 0 after 0 usecs
[   60.871904] alarmtimer alarmtimer.0.auto: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: rtc0
[   60.881184] alarmtimer alarmtimer.0.auto: PM: platform_pm_resume+0x0/0x50 returned 0 after 12 usecs
[   60.890196] platform eisa.0: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.898680] platform eisa.0: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.906516] platform microcode: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   60.915283] platform microcode: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.923359] pwm pwmchip0: PM: calling pwm_class_resume+0x0/0x20 @ 1485, parent: INT3450:00
[   60.931585] pwm pwmchip0: PM: pwm_class_resume+0x0/0x20 returned 0 after 0 usecs
[   60.938989] platform iTCO_wdt: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: 0000:00:1f.4
[   60.947995] platform iTCO_wdt: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   60.955982] usb usb1: PM: calling usb_dev_resume+0x0/0x20 @ 1485, parent: 0000:00:14.0
[   61.020311] usb usb1: PM: usb_dev_resume+0x0/0x20 returned 0 after 56442 usecs
[   61.027579]  ata1: PM: calling ata_port_pm_resume+0x0/0x60 @ 1485, parent: 0000:00:17.0
[   61.035829]  ata1: PM: ata_port_pm_resume+0x0/0x60 returned 0 after 268 usecs
[   61.042956]  ata2: PM: calling ata_port_pm_resume+0x0/0x60 @ 1485, parent: 0000:00:17.0
[   61.050932]  ata2: PM: ata_port_pm_resume+0x0/0x60 returned 0 after 6 usecs
[   61.057881]  ata3: PM: calling ata_port_pm_resume+0x0/0x60 @ 1485, parent: 0000:00:17.0
[   61.065852]  ata3: PM: ata_port_pm_resume+0x0/0x60 returned 0 after 5 usecs
[   61.072814]  ata4: PM: calling ata_port_pm_resume+0x0/0x60 @ 1485, parent: 0000:00:17.0
[   61.080782]  ata4: PM: ata_port_pm_resume+0x0/0x60 returned 0 after 4 usecs
[   61.087749]  ata5: PM: calling ata_port_pm_resume+0x0/0x60 @ 1485, parent: 0000:00:17.0
[   61.095727]  ata5: PM: ata_port_pm_resume+0x0/0x60 returned 0 after 4 usecs
[   61.102698]  ata6: PM: calling ata_port_pm_resume+0x0/0x60 @ 1485, parent: 0000:00:17.0
[   61.110676]  ata6: PM: ata_port_pm_resume+0x0/0x60 returned 0 after 4 usecs
[   61.117626] usb usb2: PM: calling usb_dev_resume+0x0/0x20 @ 1485, parent: 0000:00:14.0
[   61.153015] usb usb2: PM: usb_dev_resume+0x0/0x20 returned 0 after 27504 usecs
[   61.160224] scsi host0: PM: calling scsi_bus_resume+0x0/0xa0 @ 1485, parent: ata1
[   61.167679] scsi host0: PM: scsi_bus_resume+0x0/0xa0 returned 0 after 0 usecs
[   61.174796] scsi host1: PM: calling scsi_bus_resume+0x0/0xa0 @ 1485, parent: ata2
[   61.182244] scsi host1: PM: scsi_bus_resume+0x0/0xa0 returned 0 after 0 usecs
[   61.189363] scsi host2: PM: calling scsi_bus_resume+0x0/0xa0 @ 1485, parent: ata3
[   61.196826] scsi host2: PM: scsi_bus_resume+0x0/0xa0 returned 0 after 0 usecs
[   61.203936] scsi host3: PM: calling scsi_bus_resume+0x0/0xa0 @ 1485, parent: ata4
[   61.211398] scsi host3: PM: scsi_bus_resume+0x0/0xa0 returned 0 after 0 usecs
[   61.218517] scsi host4: PM: calling scsi_bus_resume+0x0/0xa0 @ 1485, parent: ata5
[   61.225995] scsi host4: PM: scsi_bus_resume+0x0/0xa0 returned 0 after 0 usecs
[   61.233119] scsi host5: PM: calling scsi_bus_resume+0x0/0xa0 @ 1485, parent: ata6
[   61.240575] scsi host5: PM: scsi_bus_resume+0x0/0xa0 returned 0 after 0 usecs
[   61.247708] usb 1-1: PM: calling usb_dev_resume+0x0/0x20 @ 1485, parent: usb1
[   61.352265] ata1: SATA link down (SStatus 4 SControl 300)
[   61.362894] ata2: SATA link down (SStatus 4 SControl 300)
[   61.377963] usb 1-1: PM: usb_dev_resume+0x0/0x20 returned 0 after 123143 usecs
[   61.378612] ata3: SATA link down (SStatus 4 SControl 300)
[   61.385199] scsi target5:0:0: PM: calling scsi_bus_resume+0x0/0xa0 @ 1485, parent: host5
[   61.398630] scsi target5:0:0: PM: scsi_bus_resume+0x0/0xa0 returned 0 after 0 usecs
[   61.400839] ata4: SATA link down (SStatus 4 SControl 300)
[   61.406261] sd 5:0:0:0: PM: calling scsi_bus_resume+0x0/0xa0 @ 1485, parent: target5:0:0
[   61.411771] ata5: SATA link down (SStatus 4 SControl 300)
[   61.419703] sd 5:0:0:0: [sda] Starting disk
[   61.425190] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   61.437226] ata6.00: supports DRM functions and may not be fully accessible
[   61.449769] ata6.00: supports DRM functions and may not be fully accessible
[   61.460620] ata6.00: configured for UDMA/133
[   61.464996] ahci 0000:00:17.0: port does not support device sleep
[   61.471319] sd 5:0:0:0: PM: scsi_bus_resume+0x0/0xa0 returned 0 after 51636 usecs
[   61.478806] usb 1-2: PM: calling usb_dev_resume+0x0/0x20 @ 1485, parent: usb1
[   61.505037] usb 1-2: PM: usb_dev_resume+0x0/0x20 returned 0 after 19115 usecs
[   61.512166] usb 1-3: PM: calling usb_dev_resume+0x0/0x20 @ 1485, parent: usb1
[   61.537279] usb 1-3: PM: usb_dev_resume+0x0/0x20 returned 0 after 18000 usecs
[   61.544395] usb 1-6: PM: calling usb_dev_resume+0x0/0x20 @ 1485, parent: usb1
[   61.570773] usb 1-6: PM: usb_dev_resume+0x0/0x20 returned 0 after 19268 usecs
[   61.577901] input input3: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: 0003:046D:C31C.0001
[   61.587180] input input3: PM: input_dev_resume+0x0/0x50 returned 0 after 268 usecs
[   61.594797] input input4: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: 0003:046D:C31C.0002
[   61.603810] input input4: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   61.611185] input input5: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: 0003:046D:C077.0003
[   61.620196] input input5: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   61.627628] eeepc-wmi eeepc-wmi: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   61.637300] eeepc-wmi eeepc-wmi: PM: platform_pm_resume+0x0/0x50 returned 0 after 837 usecs
[   61.645617] input input6: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: eeepc-wmi
[   61.653767] input input6: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   61.661147] leds input3::numlock: PM: calling led_resume+0x0/0x50 @ 1485, parent: input3
[   61.669205] leds input3::numlock: PM: led_resume+0x0/0x50 returned 0 after 0 usecs
[   61.676754] leds input3::capslock: PM: calling led_resume+0x0/0x50 @ 1485, parent: input3
[   61.684905] leds input3::capslock: PM: led_resume+0x0/0x50 returned 0 after 0 usecs
[   61.692540] leds input3::scrolllock: PM: calling led_resume+0x0/0x50 @ 1485, parent: input3
[   61.700861] leds input3::scrolllock: PM: led_resume+0x0/0x50 returned 0 after 0 usecs
[   61.708672] leds input3::compose: PM: calling led_resume+0x0/0x50 @ 1485, parent: input3
[   61.716729] leds input3::compose: PM: led_resume+0x0/0x50 returned 0 after 0 usecs
[   61.724278] leds input3::kana: PM: calling led_resume+0x0/0x50 @ 1485, parent: input3
[   61.732081] leds input3::kana: PM: led_resume+0x0/0x50 returned 0 after 0 usecs
[   61.739376] platform coretemp.0: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   61.748211] platform coretemp.0: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   61.756365] snd_hda_codec_realtek hdaudioC0D0: PM: calling hda_codec_pm_resume+0x0/0x20 [snd_hda_codec] @ 1485, parent: 0000:00:1f.3
[   61.946992] snd_hda_codec_realtek hdaudioC0D0: PM: hda_codec_pm_resume+0x0/0x20 [snd_hda_codec] returned 0 after 178763 usecs
[   61.958381] snd-soc-dummy snd-soc-dummy: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   61.967951] snd-soc-dummy snd-soc-dummy: PM: platform_pm_resume+0x0/0x50 returned 0 after 1 usecs
[   61.976837] input input7: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card1
[   61.984643] input input7: PM: input_dev_resume+0x0/0x50 returned 0 after 1 usecs
[   61.992041] input input8: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card1
[   61.999839] input input8: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.007218] input input9: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card1
[   62.015017] input input9: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.022397] input input10: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card1
[   62.030276] input input10: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.037742] input input11: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card1
[   62.045624] input input11: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.053090] input input12: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card1
[   62.060977] input input12: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.068436] input input13: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card0
[   62.076317] input input13: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.083782] intel_rapl_msr intel_rapl_msr.0: PM: calling platform_pm_resume+0x0/0x50 @ 1485, parent: platform
[   62.093650] intel_rapl_msr intel_rapl_msr.0: PM: platform_pm_resume+0x0/0x50 returned 0 after 0 usecs
[   62.102836] input input14: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card0
[   62.110722] input input14: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.118186] input input15: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card0
[   62.126075] input input15: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.133539] input input16: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card0
[   62.141427] input input16: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.148892] input input17: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card0
[   62.156784] input input17: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.164243] input input18: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card0
[   62.172137] input input18: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.179601] input input19: PM: calling input_dev_resume+0x0/0x50 @ 1485, parent: card0
[   62.187493] input input19: PM: input_dev_resume+0x0/0x50 returned 0 after 0 usecs
[   62.195036] PM: resume of devices complete after 3960.540 msecs
[   62.202873] PM: Finishing wakeup.
[   62.206450] OOM killer enabled.
[   62.209591] Restarting tasks ... done.
[   62.216797] random: crng reseeded on system resumption
[   62.217460] ------------[ cut here ]------------
[   62.227201] Scheduling eviction of pid 1484 in 0 jiffies
[   62.227216] WARNING: CPU: 2 PID: 1346 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:1137 kgd2kfd_schedule_evict_and_restore_process+0x1f4/0x230 [amdgpu]
[   62.227801] Modules linked in: amdgpu i2c_algo_bit drm_ttm_helper ttm amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt cec rc_core binfmt_misc snd_sof_pci_intel_cnl intel_rapl_msr snd_sof_intel_hda_common mei_hdcp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_hdac_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_common snd_sof_intel_hda intel_tcc_cooling snd_sof snd_hda_codec_realtek x86_pkg_temp_thermal snd_sof_utils snd_hda_codec_generic snd_hda_ext_core intel_powerclamp snd_soc_core snd_hda_codec_hdmi snd_compress coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_intel_dspcfg aesni_intel snd_hda_codec crypto_simd pl2303 cryptd snd_hwdep usbserial input_leds snd_hda_core rapl joydev snd_pcm intel_cstate eeepc_wmi asus_wmi snd_seq ledtrig_audio sparse_keymap snd_seq_device platform_profile mei_me wmi_bmof snd_timer intel_wmi_thunderbolt mxm_wmi ee1004 snd mei soundcore acpi_pad mac_hid acpi_tad sch_fq_codel msr
[   62.227849]  parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_generic usbhid hid e1000e ahci i2c_i801 xhci_pci libahci i2c_smbus xhci_pci_renesas video pinctrl_cannonlake wmi
[   62.227862] CPU: 2 PID: 1346 Comm: page0 Tainted: G        W          6.2.8+ #44
[   62.227864] Hardware name: System manufacturer System Product Name/PRIME Z390-A, BIOS 1401 11/26/2019
[   62.227865] RIP: 0010:kgd2kfd_schedule_evict_and_restore_process+0x1f4/0x230 [amdgpu]
[   62.228518] Code: 00 48 89 ca 48 89 4d e8 48 c7 c7 78 44 7c c1 48 83 05 87 6b dd 00 01 8b b0 90 09 00 00 e8 54 12 77 f0 48 83 05 7c 6b dd 00 01 <0f> 0b 48 83 05 7a 6b dd 00 01 48 8b 4d e8 e9 f0 fe ff ff 48 83 05
[   62.228519] RSP: 0018:ffffb356c3067da8 EFLAGS: 00010006
[   62.228521] RAX: 0000000000000000 RBX: ffff890fb0564b98 RCX: 0000000000000000
[   62.228522] RDX: 0000000000000003 RSI: ffffffffb2f03b10 RDI: 00000000ffffffff
[   62.228523] RBP: ffffb356c3067dc0 R08: 00000000ffffffff R09: 00000000ffffffff
[   62.228524] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[   62.228525] R13: ffff890ea9f8c000 R14: 0000000000000246 R15: 0000000000000000
[   62.228526] FS:  0000000000000000(0000) GS:ffff891dadb00000(0000) knlGS:0000000000000000
[   62.228527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.228528] CR2: 00007f4c43e20010 CR3: 0000000123ab0005 CR4: 00000000003706e0
[   62.228529] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   62.228530] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   62.228531] Call Trace:
[   62.228532]  <TASK>
[   62.228535]  amdkfd_fence_enable_signaling+0x9b/0xe0 [amdgpu]
[   62.229009]  __dma_fence_enable_signaling+0x7b/0x120
[   62.229014]  ? __pfx_drm_sched_entity_wakeup+0x10/0x10 [gpu_sched]
[   62.229021]  dma_fence_add_callback+0x50/0xe0
[   62.229026]  drm_sched_entity_pop_job+0x175/0xaa0 [gpu_sched]
[   62.229034]  drm_sched_main+0x11b/0x7f0 [gpu_sched]
[   62.229042]  ? __pfx_autoremove_wake_function+0x10/0x10
[   62.229047]  ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
[   62.229053]  kthread+0x105/0x130
[   62.229056]  ? __pfx_kthread+0x10/0x10
[   62.229059]  ret_from_fork+0x29/0x50
[   62.229069]  </TASK>
[   62.229070] irq event stamp: 1674960
[   62.229071] hardirqs last  enabled at (1674959): [<ffffffffb26b3248>] _raw_spin_unlock_irqrestore+0x68/0x80
[   62.229074] hardirqs last disabled at (1674960): [<ffffffffb26b2edc>] _raw_spin_lock_irqsave+0x6c/0x70
[   62.229075] softirqs last  enabled at (1668668): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   62.229077] softirqs last disabled at (1668663): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   62.229078] ---[ end trace 0000000000000000 ]---
[   62.239160] show_signal: 39 callbacks suppressed
[   62.239162] traps: nxserver.bin[1115] general protection fault ip:7fa9a1860748 sp:7fa96f7fda60 error:0
[   62.244418] ------------[ cut here ]------------
[   62.244420] Scheduling eviction of pid 1483 in 0 jiffies
[   62.244439] WARNING: CPU: 2 PID: 1346 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:1137 kgd2kfd_schedule_evict_and_restore_process+0x1f4/0x230 [amdgpu]
[   62.245782] Modules linked in: amdgpu i2c_algo_bit drm_ttm_helper ttm amdxcp drm_buddy gpu_sched drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt cec rc_core binfmt_misc snd_sof_pci_intel_cnl intel_rapl_msr snd_sof_intel_hda_common mei_hdcp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_hdac_hda snd_sof_pci snd_sof_xtensa_dsp intel_rapl_common snd_sof_intel_hda intel_tcc_cooling snd_sof snd_hda_codec_realtek x86_pkg_temp_thermal snd_sof_utils snd_hda_codec_generic snd_hda_ext_core intel_powerclamp snd_soc_core snd_hda_codec_hdmi snd_compress coretemp crct10dif_pclmul snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_intel_dspcfg aesni_intel snd_hda_codec crypto_simd pl2303 cryptd snd_hwdep usbserial input_leds snd_hda_core rapl joydev snd_pcm intel_cstate eeepc_wmi asus_wmi snd_seq ledtrig_audio sparse_keymap snd_seq_device platform_profile mei_me wmi_bmof snd_timer intel_wmi_thunderbolt mxm_wmi ee1004 snd mei soundcore acpi_pad mac_hid acpi_tad sch_fq_codel msr
[   62.245839]  parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_generic usbhid hid e1000e ahci i2c_i801 xhci_pci libahci i2c_smbus xhci_pci_renesas video pinctrl_cannonlake wmi
[   62.245855] CPU: 2 PID: 1346 Comm: page0 Tainted: G        W          6.2.8+ #44
[   62.245857] Hardware name: System manufacturer System Product Name/PRIME Z390-A, BIOS 1401 11/26/2019
[   62.245858] RIP: 0010:kgd2kfd_schedule_evict_and_restore_process+0x1f4/0x230 [amdgpu]
[   62.248033] Code: 00 48 89 ca 48 89 4d e8 48 c7 c7 78 44 7c c1 48 83 05 87 6b dd 00 01 8b b0 90 09 00 00 e8 54 12 77 f0 48 83 05 7c 6b dd 00 01 <0f> 0b 48 83 05 7a 6b dd 00 01 48 8b 4d e8 e9 f0 fe ff ff 48 83 05
[   62.248035] RSP: 0018:ffffb356c3067da8 EFLAGS: 00010006
[   62.248038] RAX: 0000000000000000 RBX: ffff890fb0565bc8 RCX: 0000000000000000
[   62.248039] RDX: 0000000000000003 RSI: ffffffffb2f03b10 RDI: 00000000ffffffff
[   62.248040] RBP: ffffb356c3067dc0 R08: 00000000ffffffff R09: 00000000ffffffff
[   62.248041] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[   62.248042] R13: ffff890ead4c9000 R14: 0000000000000246 R15: 0000000000000000
[   62.248043] FS:  0000000000000000(0000) GS:ffff891dadb00000(0000) knlGS:0000000000000000
[   62.248044] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   62.248045] CR2: 000000c0006e4000 CR3: 000000010cfea004 CR4: 00000000003706e0
[   62.248047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   62.248047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   62.248049] Call Trace:
[   62.248049]  <TASK>
[   62.248054]  amdkfd_fence_enable_signaling+0x9b/0xe0 [amdgpu]
[   62.250639]  __dma_fence_enable_signaling+0x7b/0x120
[   62.250650]  ? __pfx_drm_sched_entity_wakeup+0x10/0x10 [gpu_sched]
[   62.250666]  dma_fence_add_callback+0x50/0xe0
[   62.250670]  drm_sched_entity_pop_job+0x175/0xaa0 [gpu_sched]
[   62.250678]  drm_sched_main+0x11b/0x7f0 [gpu_sched]
[   62.250686]  ? __pfx_autoremove_wake_function+0x10/0x10
[   62.250696]  ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
[   62.250701]  kthread+0x105/0x130
[   62.250708]  ? __pfx_kthread+0x10/0x10
[   62.250712]  ret_from_fork+0x29/0x50
[   62.250724]  </TASK>
[   62.250726] irq event stamp: 1680300
[   62.250727] hardirqs last  enabled at (1680299): [<ffffffffb26b3248>] _raw_spin_unlock_irqrestore+0x68/0x80
[   62.250732] hardirqs last disabled at (1680300): [<ffffffffb26b2edc>] _raw_spin_lock_irqsave+0x6c/0x70
[   62.250733] softirqs last  enabled at (1674968): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   62.250737] softirqs last disabled at (1674963): [<ffffffffb16e199d>] __irq_exit_rcu+0xdd/0x130
[   62.250738] ---[ end trace 0000000000000000 ]---
[   62.940878]  in libnxhs.so[7fa9a1800000+8e000]
[   62.946135] PM: suspend exit

[-- Attachment #3: 0001-suspend-debug.patch --]
[-- Type: application/octet-stream, Size: 3851 bytes --]

From 3a6d65993f1f0db81a9d2964227f3412949ecaa8 Mon Sep 17 00:00:00 2001
From: xinhui pan <xinhui.pan@amd.com>
Date: Thu, 14 Sep 2023 09:36:20 +0800
Subject: [PATCH] suspend debug

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 39 ++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  4 ++-
 drivers/gpu/drm/ttm/ttm_resource.c         |  2 ++
 4 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f381cb90c964..88b1c163d9bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4154,6 +4154,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
 	 */
 	(void)amdgpu_device_evict_resources(adev);
 
+	dump_all_bo_type(adev, TTM_PL_VRAM);
 	if (amdgpu_sriov_vf(adev)) {
 		amdgpu_virt_fini_data_exchange(adev);
 		r = amdgpu_virt_request_full_gpu(adev, false);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 95d15ae5fe52..4b53bbfff985 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -40,6 +40,44 @@
 #include "amdgpu_trace.h"
 #include "amdgpu_amdkfd.h"
 
+void bo_add_trace(struct amdgpu_bo *bo) {
+	bo->n = stack_trace_save(bo->stk, ARRAY_SIZE(bo->stk), 0);
+}
+
+void* bo_print_trace(struct amdgpu_bo *abo, char *buf, int size)
+{
+	if (!buf) {
+		stack_trace_print(abo->stk, abo->n, 8);
+		return NULL;
+	}
+	stack_trace_snprint(buf, size, abo->stk, abo->n, 8);
+	return buf;
+}
+
+void dump_all_bo_type(struct amdgpu_device *adev, int type)
+{
+	struct ttm_device *bdev = &adev->mman.bdev;
+	struct ttm_resource_manager *man = ttm_manager_type(bdev, type);
+	struct ttm_buffer_object *bo;
+	struct amdgpu_bo *abo;
+	struct ttm_resource_cursor cursor;
+	struct ttm_resource *res;
+	int i = 0;
+
+	spin_lock(&bdev->lru_lock);
+	ttm_resource_manager_for_each_res(man, &cursor, res) {
+		bo = res->bo;
+		if (!amdgpu_bo_is_amdgpu_bo(bo))
+			continue;
+		abo = ttm_to_amdgpu_bo(bo);
+		printk("XH: %d bo(%lx) %s",
+		       i++,
+		       abo,
+		       dma_resv_is_locked(bo->base.resv) ? "locked" : "unlocked");
+		bo_print_trace(abo, NULL, 0);
+	}
+	spin_unlock(&bdev->lru_lock);
+}
 /**
  * DOC: amdgpu_object
  *
@@ -647,6 +685,7 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
 	if (!bp->resv)
 		amdgpu_bo_unreserve(bo);
 	*bo_ptr = bo;
+	bo_add_trace(bo);
 
 	trace_amdgpu_bo_create(bo);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index c007523d6f83..06616e70faba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -117,6 +117,8 @@ struct amdgpu_bo {
 	 * for memory accounting.
 	 */
 	int8_t				xcp_id;
+	unsigned long stk[16];
+	int n;
 };
 
 struct amdgpu_bo_user {
@@ -401,5 +403,5 @@ void amdgpu_debugfs_sa_init(struct amdgpu_device *adev);
 
 bool amdgpu_bo_support_uswc(u64 bo_flags);
 
-
+void dump_all_bo_type(struct amdgpu_device *adev, int type);
 #endif
diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
index b8a826a24fb2..2b0fe65dc51e 100644
--- a/drivers/gpu/drm/ttm/ttm_resource.c
+++ b/drivers/gpu/drm/ttm/ttm_resource.c
@@ -509,6 +509,7 @@ ttm_resource_manager_first(struct ttm_resource_manager *man,
 
 	return NULL;
 }
+EXPORT_SYMBOL(ttm_resource_manager_first);
 
 /**
  * ttm_resource_manager_next
@@ -536,6 +537,7 @@ ttm_resource_manager_next(struct ttm_resource_manager *man,
 
 	return NULL;
 }
+EXPORT_SYMBOL(ttm_resource_manager_next);
 
 static void ttm_kmap_iter_iomap_map_local(struct ttm_kmap_iter *iter,
 					  struct iosys_map *dmap,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-14  1:54                 ` Pan, Xinhui
@ 2023-09-14  6:23                   ` Christian König
  2023-09-14 13:37                     ` Felix Kuehling
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2023-09-14  6:23 UTC (permalink / raw)
  To: Pan, Xinhui, Koenig, Christian, Kuehling, Felix,
	amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 11743 bytes --]

[putting Harry on BCC, sorry for the noise]

Yeah, that is clearly a bug in the KFD.

During the second eviction the hw should already be disabled, so we 
don't have any SDMA or similar to evict BOs any more and can only copy 
them with the CPU.

@Felix what workqueue do you guys use for the restore work? I've just 
double checked and on the system workqueues you explicitly need to 
specify that stuff is freezable. E.g. use system_freezable_wq instead of 
system_wq.

Alternatively as Xinhui mentioned it might be necessary to flush all 
restore work before the first eviction phase or we have the chance that 
BOs are moved back into VRAM again.

Regards,
Christian.

Am 14.09.23 um 03:54 schrieb Pan, Xinhui:
>
> [AMD Official Use Only - General]
>
>
> I just make one debug patch to show busy BO’s alloc-trace when the 
> eviction fails in suspend.
>
> And dmesg log attached.
>
> Looks like they are just kfd user Bos and locked by evict/restore work.
>
> So in kfd suspend callback, it really need to flush the evict/restore 
> work before HW fini as it do now.
>
> That is why the first very early eviction fails and the second 
> eviction succeed.
>
> Thanks
>
> xinhui
>
> *From:* Pan, Xinhui
> *Sent:* Thursday, September 14, 2023 8:02 AM
> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix 
> <Felix.Kuehling@amd.com>; Christian König 
> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org; 
> Wentland, Harry <Harry.Wentland@amd.com>
> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang 
> <Shikang.Fan@amd.com>
> *Subject:* RE: 回复: [PATCH] drm/amdgpu: Ignore first evction failure 
> during suspend
>
> Chris,
>
> I can dump these busy BOs with their alloc/free stack later today.
>
> BTW, the two evictions and the kfd suspend are all called before 
> hw_fini. IOW, between phase 1 and phase 2. SDMA is turned only in 
> phase2. So current code works fine maybe.
>
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:* Wednesday, September 13, 2023 10:29 PM
> *To:* Kuehling, Felix <Felix.Kuehling@amd.com>; Christian König 
> <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; 
> amd-gfx@lists.freedesktop.org; Wentland, Harry <Harry.Wentland@amd.com>
> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang 
> <Shikang.Fan@amd.com>
> *Subject:* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure 
> during suspend
>
> [+Harry]
>
> Am 13.09.23 um 15:54 schrieb Felix Kuehling:
>
>     On 2023-09-13 4:07, Christian König wrote:
>
>         [+Fleix]
>
>         Well that looks like quite a serious bug.
>
>         If I'm not completely mistaken the KFD work item tries to
>         restore the process by moving BOs into memory even after the
>         suspend freeze. Normally work items are frozen together with
>         the user space processes unless explicitly marked as not
>         freezable.
>
>         That this causes problem during the first eviction phase is
>         just the tip of the iceberg here. If a BO is moved into
>         invisible memory during this we wouldn't be able to get it out
>         of that in the second phase because SDMA and hw is already
>         turned off.
>
>         @Felix any idea how that can happen? Have you guys marked a
>         work item / work queue as not freezable?
>
>     We don't set anything to non-freezable in KFD.
>
>     Regards,
>       Felix
>
>         Or maybe the display guys?
>
>
> Do you guys in the display do any delayed update in a work item which 
> is marked as not-freezable?
>
> Otherwise I have absolutely no idea what's going on here.
>
> Thanks,
> Christian.
>
>
>         @Xinhui please investigate what work item that is and where
>         that is coming from. Something like "if (adev->in_suspend)
>         dump_stack();" in the right place should probably do it.
>
>         Thanks,
>         Christian.
>
>         Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>
>             [AMD Official Use Only - General]
>
>             I notice that only user space process are frozen on my
>             side.  kthread and workqueue  keeps running. Maybe some
>             kernel configs are not enabled.
>
>             I made one module which just prints something like i++
>             with mutex lock both in workqueue and kthread. I paste
>             some logs below.
>
>             [438619.696196] XH: 14 from workqueue
>
>             [438619.700193] XH: 15 from kthread
>
>             [438620.394335] PM: suspend entry (deep)
>
>             [438620.399619] Filesystems sync: 0.001 seconds
>
>             [438620.403887] PM: Preparing system for sleep (deep)
>
>             [438620.409299] Freezing user space processes
>
>             [438620.414862] Freezing user space processes completed
>             (elapsed 0.001 seconds)
>
>             [438620.421881] OOM killer disabled.
>
>             [438620.425197] Freezing remaining freezable tasks
>
>             [438620.430890] Freezing remaining freezable tasks
>             completed (elapsed 0.001 seconds)
>
>             [438620.438348] PM: Suspending system (deep)
>
>             .....
>
>             [438623.746038] PM: suspend of devices complete after
>             3303.137 msecs
>
>             [438623.752125] PM: start suspend of devices complete
>             after 3309.713 msecs
>
>             [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>
>             [438623.792166] XH: 22 from kthread
>
>             [438623.824140] XH: 23 from workqueue
>
>             So BOs definitely can be in use during suspend.
>
>             Even if kthread or workqueue can be stopped with one
>             special kernel config. I think suspend can only stop the
>             workqueue with its callback finish.
>
>             otherwise something like below makes things crazy.
>
>             LOCK BO
>
>             do something
>
>             -> schedule or wait, anycode might sleep. Stopped by
>             suspend now? no, i think.
>
>             UNLOCK BO
>
>             I do tests  with  cmds below.
>
>             echo devices  > /sys/power/pm_test
>
>             echo 0  > /sys/power/pm_async
>
>             echo 1  > /sys/power/pm_print_times
>
>             echo 1 > /sys/power/pm_debug_messages
>
>             echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>
>             ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>
>             pm-suspend
>
>             thanks
>
>             xinhui
>
>             ------------------------------------------------------------------------
>
>             *发件人:*Christian König <ckoenig.leichtzumerken@gmail.com>
>             <mailto:ckoenig.leichtzumerken@gmail.com>
>             *发送时间:*2023年9月12日17:01
>             *收件人:*Pan, Xinhui <Xinhui.Pan@amd.com>
>             <mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>             <amd-gfx@lists.freedesktop.org>
>             <mailto:amd-gfx@lists.freedesktop.org>
>             *抄送:*Deucher, Alexander <Alexander.Deucher@amd.com>
>             <mailto:Alexander.Deucher@amd.com>; Koenig, Christian
>             <Christian.Koenig@amd.com>
>             <mailto:Christian.Koenig@amd.com>; Fan, Shikang
>             <Shikang.Fan@amd.com> <mailto:Shikang.Fan@amd.com>
>             *主题:*Re: [PATCH] drm/amdgpu: Ignore first evction failure
>             during suspend
>
>             When amdgpu_device_suspend() is called processes should be
>             frozen
>             already. In other words KFD queues etc... should already
>             be idle.
>
>             So when the eviction fails here we missed something
>             previously and that
>             in turn can cause tons amount of problems.
>
>             So ignoring those errors is most likely not a good idea at
>             all.
>
>             Regards,
>             Christian.
>
>             Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>             > [AMD Official Use Only - General]
>             >
>             > Oh yep, Pinned BO is moved to other LRU list, So
>             eviction fails because of other reason.
>             > I will change the comments in the patch.
>             > The problem is eviction fails as many reasons, say, BO
>             is locked.
>             > ASAIK, kfd will stop the queues and flush some
>             evict/restore work in its suspend callback. SO the first
>             eviction before kfd callback likely fails.
>             >
>             > -----Original Message-----
>             > From: Christian König <ckoenig.leichtzumerken@gmail.com>
>             <mailto:ckoenig.leichtzumerken@gmail.com>
>             > Sent: Friday, September 8, 2023 2:49 PM
>             > To: Pan, Xinhui <Xinhui.Pan@amd.com>
>             <mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>             > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>             <mailto:Alexander.Deucher@amd.com>; Koenig, Christian
>             <Christian.Koenig@amd.com>
>             <mailto:Christian.Koenig@amd.com>; Fan, Shikang
>             <Shikang.Fan@amd.com> <mailto:Shikang.Fan@amd.com>
>             > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction
>             failure during suspend
>             >
>             > Am 08.09.23 um 05:39 schrieb xinhui pan:
>             >> Some BOs might be pinned. So the first eviction's
>             failure will abort
>             >> the suspend sequence. These pinned BOs will be unpined
>             afterwards
>             >> during suspend.
>             > That doesn't make much sense since pinned BOs don't
>             cause eviction failure here.
>             >
>             > What exactly is the error code you see?
>             >
>             > Christian.
>             >
>             >> Actaully it has evicted most BOs, so that should stil
>             work fine in
>             >> sriov full access mode.
>             >>
>             >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra
>             evict_resource call
>             >> during device_suspend.")
>             >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>             <mailto:xinhui.pan@amd.com>
>             >> ---
>             >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>             >>    1 file changed, 5 insertions(+), 4 deletions(-)
>             >>
>             >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             >> index 5c0e2b766026..39af526cdbbe 100644
>             >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>             >> @@ -4148,10 +4148,11 @@ int
>             amdgpu_device_suspend(struct drm_device
>             >> *dev, bool fbcon)
>             >>
>             >>        adev->in_suspend = true;
>             >>
>             >> -     /* Evict the majority of BOs before grabbing the
>             full access */
>             >> -     r = amdgpu_device_evict_resources(adev);
>             >> -     if (r)
>             >> -             return r;
>             >> +     /* Try to evict the majority of BOs before
>             grabbing the full access
>             >> +      * Ignore the ret val at first place as we will
>             unpin some BOs if any
>             >> +      * afterwards.
>             >> +      */
>             >> + (void)amdgpu_device_evict_resources(adev);
>             >>
>             >>        if (amdgpu_sriov_vf(adev)) {
>             >> amdgpu_virt_fini_data_exchange(adev);
>

[-- Attachment #2: Type: text/html, Size: 33927 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-14  6:23                   ` Christian König
@ 2023-09-14 13:37                     ` Felix Kuehling
  2023-09-14 13:59                       ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Felix Kuehling @ 2023-09-14 13:37 UTC (permalink / raw)
  To: Christian König, Pan, Xinhui, Koenig, Christian,
	amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 13216 bytes --]

Userptr and SVM restore work is scheduled to the system WQ with 
schedule_delayed_work. See amdgpu_amdkfd_evict_userptr and 
svm_range_evict. This would need to use queue_delayed_work with the 
system_freezable_wq.

BO restoration is scheduled with queue_delayed_work on our own 
kfd_restore_wq that was allocated with alloc_ordered_workqueue. This 
would need to add the WQ_FREEZABLE flag when we create the wq in 
kfd_process_create_wq.

There is also evict_process_worker scheduled with schedule_delayed_work, 
which handles stopping of user mode queues, signaling of eviction fences 
and scheduling of restore work when BOs are evicted. I think that should 
not be freezable because it's needed to signal the eviction fences to 
allow suspend to evict BOs.

To make sure I'm not misunderstanding, I assume that freezing a 
freezable workqueue flushes work items in progress and prevents 
execution of more work until it is unfrozen. I assume work items are not 
frozen in the middle of execution, because that would not solve the problem.

Regards,
   Felix


On 2023-09-14 2:23, Christian König wrote:
> [putting Harry on BCC, sorry for the noise]
>
> Yeah, that is clearly a bug in the KFD.
>
> During the second eviction the hw should already be disabled, so we 
> don't have any SDMA or similar to evict BOs any more and can only copy 
> them with the CPU.
>
> @Felix what workqueue do you guys use for the restore work? I've just 
> double checked and on the system workqueues you explicitly need to 
> specify that stuff is freezable. E.g. use system_freezable_wq instead 
> of system_wq.
>
> Alternatively as Xinhui mentioned it might be necessary to flush all 
> restore work before the first eviction phase or we have the chance 
> that BOs are moved back into VRAM again.
>
> Regards,
> Christian.
>
> Am 14.09.23 um 03:54 schrieb Pan, Xinhui:
>>
>> [AMD Official Use Only - General]
>>
>>
>> I just make one debug patch to show busy BO’s alloc-trace when the 
>> eviction fails in suspend.
>>
>> And dmesg log attached.
>>
>> Looks like they are just kfd user Bos and locked by evict/restore work.
>>
>> So in kfd suspend callback, it really need to flush the evict/restore 
>> work before HW fini as it do now.
>>
>> That is why the first very early eviction fails and the second 
>> eviction succeed.
>>
>> Thanks
>>
>> xinhui
>>
>> *From:* Pan, Xinhui
>> *Sent:* Thursday, September 14, 2023 8:02 AM
>> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix 
>> <Felix.Kuehling@amd.com>; Christian König 
>> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org; 
>> Wentland, Harry <Harry.Wentland@amd.com>
>> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang 
>> <Shikang.Fan@amd.com>
>> *Subject:* RE: 回复: [PATCH] drm/amdgpu: Ignore first evction failure 
>> during suspend
>>
>> Chris,
>>
>> I can dump these busy BOs with their alloc/free stack later today.
>>
>> BTW, the two evictions and the kfd suspend are all called before 
>> hw_fini. IOW, between phase 1 and phase 2. SDMA is turned only in 
>> phase2. So current code works fine maybe.
>>
>> *From:* Koenig, Christian <Christian.Koenig@amd.com>
>> *Sent:* Wednesday, September 13, 2023 10:29 PM
>> *To:* Kuehling, Felix <Felix.Kuehling@amd.com>; Christian König 
>> <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; 
>> amd-gfx@lists.freedesktop.org; Wentland, Harry <Harry.Wentland@amd.com>
>> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang 
>> <Shikang.Fan@amd.com>
>> *Subject:* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure 
>> during suspend
>>
>> [+Harry]
>>
>> Am 13.09.23 um 15:54 schrieb Felix Kuehling:
>>
>>     On 2023-09-13 4:07, Christian König wrote:
>>
>>         [+Fleix]
>>
>>         Well that looks like quite a serious bug.
>>
>>         If I'm not completely mistaken the KFD work item tries to
>>         restore the process by moving BOs into memory even after the
>>         suspend freeze. Normally work items are frozen together with
>>         the user space processes unless explicitly marked as not
>>         freezable.
>>
>>         That this causes problem during the first eviction phase is
>>         just the tip of the iceberg here. If a BO is moved into
>>         invisible memory during this we wouldn't be able to get it
>>         out of that in the second phase because SDMA and hw is
>>         already turned off.
>>
>>         @Felix any idea how that can happen? Have you guys marked a
>>         work item / work queue as not freezable?
>>
>>     We don't set anything to non-freezable in KFD.
>>
>>     Regards,
>>       Felix
>>
>>         Or maybe the display guys?
>>
>>
>> Do you guys in the display do any delayed update in a work item which 
>> is marked as not-freezable?
>>
>> Otherwise I have absolutely no idea what's going on here.
>>
>> Thanks,
>> Christian.
>>
>>
>>         @Xinhui please investigate what work item that is and where
>>         that is coming from. Something like "if (adev->in_suspend)
>>         dump_stack();" in the right place should probably do it.
>>
>>         Thanks,
>>         Christian.
>>
>>         Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>
>>             [AMD Official Use Only - General]
>>
>>             I notice that only user space process are frozen on my
>>             side.  kthread and workqueue  keeps running. Maybe some
>>             kernel configs are not enabled.
>>
>>             I made one module which just prints something like i++
>>             with mutex lock both in workqueue and kthread. I paste
>>             some logs below.
>>
>>             [438619.696196] XH: 14 from workqueue
>>
>>             [438619.700193] XH: 15 from kthread
>>
>>             [438620.394335] PM: suspend entry (deep)
>>
>>             [438620.399619] Filesystems sync: 0.001 seconds
>>
>>             [438620.403887] PM: Preparing system for sleep (deep)
>>
>>             [438620.409299] Freezing user space processes
>>
>>             [438620.414862] Freezing user space processes completed
>>             (elapsed 0.001 seconds)
>>
>>             [438620.421881] OOM killer disabled.
>>
>>             [438620.425197] Freezing remaining freezable tasks
>>
>>             [438620.430890] Freezing remaining freezable tasks
>>             completed (elapsed 0.001 seconds)
>>
>>             [438620.438348] PM: Suspending system (deep)
>>
>>             .....
>>
>>             [438623.746038] PM: suspend of devices complete after
>>             3303.137 msecs
>>
>>             [438623.752125] PM: start suspend of devices complete
>>             after 3309.713 msecs
>>
>>             [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>>
>>             [438623.792166] XH: 22 from kthread
>>
>>             [438623.824140] XH: 23 from workqueue
>>
>>             So BOs definitely can be in use during suspend.
>>
>>             Even if kthread or workqueue can be stopped with one
>>             special kernel config. I think suspend can only stop the
>>             workqueue with its callback finish.
>>
>>             otherwise something like below makes things crazy.
>>
>>             LOCK BO
>>
>>             do something
>>
>>             -> schedule or wait, anycode might sleep. Stopped by
>>             suspend now? no, i think.
>>
>>             UNLOCK BO
>>
>>             I do tests  with  cmds below.
>>
>>             echo devices  > /sys/power/pm_test
>>
>>             echo 0  > /sys/power/pm_async
>>
>>             echo 1  > /sys/power/pm_print_times
>>
>>             echo 1 > /sys/power/pm_debug_messages
>>
>>             echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>>
>>             ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>>
>>             pm-suspend
>>
>>             thanks
>>
>>             xinhui
>>
>>             ------------------------------------------------------------------------
>>
>>             *发件人:*Christian König <ckoenig.leichtzumerken@gmail.com>
>>             <mailto:ckoenig.leichtzumerken@gmail.com>
>>             *发送时间:*2023年9月12日17:01
>>             *收件人:*Pan, Xinhui <Xinhui.Pan@amd.com>
>>             <mailto:Xinhui.Pan@amd.com>;
>>             amd-gfx@lists.freedesktop.org
>>             <amd-gfx@lists.freedesktop.org>
>>             <mailto:amd-gfx@lists.freedesktop.org>
>>             *抄送:*Deucher, Alexander <Alexander.Deucher@amd.com>
>>             <mailto:Alexander.Deucher@amd.com>; Koenig, Christian
>>             <Christian.Koenig@amd.com>
>>             <mailto:Christian.Koenig@amd.com>; Fan, Shikang
>>             <Shikang.Fan@amd.com> <mailto:Shikang.Fan@amd.com>
>>             *主题:*Re: [PATCH] drm/amdgpu: Ignore first evction failure
>>             during suspend
>>
>>             When amdgpu_device_suspend() is called processes should
>>             be frozen
>>             already. In other words KFD queues etc... should already
>>             be idle.
>>
>>             So when the eviction fails here we missed something
>>             previously and that
>>             in turn can cause tons amount of problems.
>>
>>             So ignoring those errors is most likely not a good idea
>>             at all.
>>
>>             Regards,
>>             Christian.
>>
>>             Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>>             > [AMD Official Use Only - General]
>>             >
>>             > Oh yep, Pinned BO is moved to other LRU list, So
>>             eviction fails because of other reason.
>>             > I will change the comments in the patch.
>>             > The problem is eviction fails as many reasons, say, BO
>>             is locked.
>>             > ASAIK, kfd will stop the queues and flush some
>>             evict/restore work in its suspend callback. SO the first
>>             eviction before kfd callback likely fails.
>>             >
>>             > -----Original Message-----
>>             > From: Christian König
>>             <ckoenig.leichtzumerken@gmail.com>
>>             <mailto:ckoenig.leichtzumerken@gmail.com>
>>             > Sent: Friday, September 8, 2023 2:49 PM
>>             > To: Pan, Xinhui <Xinhui.Pan@amd.com>
>>             <mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>>             > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>             <mailto:Alexander.Deucher@amd.com>; Koenig, Christian
>>             <Christian.Koenig@amd.com>
>>             <mailto:Christian.Koenig@amd.com>; Fan, Shikang
>>             <Shikang.Fan@amd.com> <mailto:Shikang.Fan@amd.com>
>>             > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction
>>             failure during suspend
>>             >
>>             > Am 08.09.23 um 05:39 schrieb xinhui pan:
>>             >> Some BOs might be pinned. So the first eviction's
>>             failure will abort
>>             >> the suspend sequence. These pinned BOs will be unpined
>>             afterwards
>>             >> during suspend.
>>             > That doesn't make much sense since pinned BOs don't
>>             cause eviction failure here.
>>             >
>>             > What exactly is the error code you see?
>>             >
>>             > Christian.
>>             >
>>             >> Actaully it has evicted most BOs, so that should stil
>>             work fine in
>>             >> sriov full access mode.
>>             >>
>>             >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra
>>             evict_resource call
>>             >> during device_suspend.")
>>             >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>             <mailto:xinhui.pan@amd.com>
>>             >> ---
>>             >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>             >>    1 file changed, 5 insertions(+), 4 deletions(-)
>>             >>
>>             >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>             >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>             >> index 5c0e2b766026..39af526cdbbe 100644
>>             >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>             >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>             >> @@ -4148,10 +4148,11 @@ int
>>             amdgpu_device_suspend(struct drm_device
>>             >> *dev, bool fbcon)
>>             >>
>>             >>        adev->in_suspend = true;
>>             >>
>>             >> -     /* Evict the majority of BOs before grabbing the
>>             full access */
>>             >> -     r = amdgpu_device_evict_resources(adev);
>>             >> -     if (r)
>>             >> -             return r;
>>             >> +     /* Try to evict the majority of BOs before
>>             grabbing the full access
>>             >> +      * Ignore the ret val at first place as we will
>>             unpin some BOs if any
>>             >> +      * afterwards.
>>             >> +      */
>>             >> + (void)amdgpu_device_evict_resources(adev);
>>             >>
>>             >>        if (amdgpu_sriov_vf(adev)) {
>>             >> amdgpu_virt_fini_data_exchange(adev);
>>
>

[-- Attachment #2: Type: text/html, Size: 35258 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-14 13:37                     ` Felix Kuehling
@ 2023-09-14 13:59                       ` Christian König
  2023-09-22 10:38                         ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2023-09-14 13:59 UTC (permalink / raw)
  To: Felix Kuehling, Christian König, Pan, Xinhui,
	amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 14024 bytes --]

Am 14.09.23 um 15:37 schrieb Felix Kuehling:
>
> Userptr and SVM restore work is scheduled to the system WQ with 
> schedule_delayed_work. See amdgpu_amdkfd_evict_userptr and 
> svm_range_evict. This would need to use queue_delayed_work with the 
> system_freezable_wq.
>
> BO restoration is scheduled with queue_delayed_work on our own 
> kfd_restore_wq that was allocated with alloc_ordered_workqueue. This 
> would need to add the WQ_FREEZABLE flag when we create the wq in 
> kfd_process_create_wq.
>
> There is also evict_process_worker scheduled with 
> schedule_delayed_work, which handles stopping of user mode queues, 
> signaling of eviction fences and scheduling of restore work when BOs 
> are evicted. I think that should not be freezable because it's needed 
> to signal the eviction fences to allow suspend to evict BOs.
>
> To make sure I'm not misunderstanding, I assume that freezing a 
> freezable workqueue flushes work items in progress and prevents 
> execution of more work until it is unfrozen. I assume work items are 
> not frozen in the middle of execution, because that would not solve 
> the problem.
>

I was wondering the exact same thing and to be honest I don't know that 
detail either and of hand can't find any documentation about it.

My suspicion is that a work item can freeze when it calls schedule(), 
e.g. when taking a look or similar.

That would then indeed not work at all and we would need to make sure 
that the work is completed manually somehow.

Regards,
Christian.

> Regards,
>   Felix
>
>
> On 2023-09-14 2:23, Christian König wrote:
>> [putting Harry on BCC, sorry for the noise]
>>
>> Yeah, that is clearly a bug in the KFD.
>>
>> During the second eviction the hw should already be disabled, so we 
>> don't have any SDMA or similar to evict BOs any more and can only 
>> copy them with the CPU.
>>
>> @Felix what workqueue do you guys use for the restore work? I've just 
>> double checked and on the system workqueues you explicitly need to 
>> specify that stuff is freezable. E.g. use system_freezable_wq instead 
>> of system_wq.
>>
>> Alternatively as Xinhui mentioned it might be necessary to flush all 
>> restore work before the first eviction phase or we have the chance 
>> that BOs are moved back into VRAM again.
>>
>> Regards,
>> Christian.
>>
>> Am 14.09.23 um 03:54 schrieb Pan, Xinhui:
>>>
>>> [AMD Official Use Only - General]
>>>
>>>
>>> I just make one debug patch to show busy BO’s alloc-trace when the 
>>> eviction fails in suspend.
>>>
>>> And dmesg log attached.
>>>
>>> Looks like they are just kfd user Bos and locked by evict/restore work.
>>>
>>> So in kfd suspend callback, it really need to flush the 
>>> evict/restore work before HW fini as it do now.
>>>
>>> That is why the first very early eviction fails and the second 
>>> eviction succeed.
>>>
>>> Thanks
>>>
>>> xinhui
>>>
>>> *From:* Pan, Xinhui
>>> *Sent:* Thursday, September 14, 2023 8:02 AM
>>> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix 
>>> <Felix.Kuehling@amd.com>; Christian König 
>>> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org; 
>>> Wentland, Harry <Harry.Wentland@amd.com>
>>> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang 
>>> <Shikang.Fan@amd.com>
>>> *Subject:* RE: 回复: [PATCH] drm/amdgpu: Ignore first evction failure 
>>> during suspend
>>>
>>> Chris,
>>>
>>> I can dump these busy BOs with their alloc/free stack later today.
>>>
>>> BTW, the two evictions and the kfd suspend are all called before 
>>> hw_fini. IOW, between phase 1 and phase 2. SDMA is turned only in 
>>> phase2. So current code works fine maybe.
>>>
>>> *From:* Koenig, Christian <Christian.Koenig@amd.com>
>>> *Sent:* Wednesday, September 13, 2023 10:29 PM
>>> *To:* Kuehling, Felix <Felix.Kuehling@amd.com>; Christian König 
>>> <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui 
>>> <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org; Wentland, Harry 
>>> <Harry.Wentland@amd.com>
>>> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang 
>>> <Shikang.Fan@amd.com>
>>> *Subject:* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure 
>>> during suspend
>>>
>>> [+Harry]
>>>
>>> Am 13.09.23 um 15:54 schrieb Felix Kuehling:
>>>
>>>     On 2023-09-13 4:07, Christian König wrote:
>>>
>>>         [+Fleix]
>>>
>>>         Well that looks like quite a serious bug.
>>>
>>>         If I'm not completely mistaken the KFD work item tries to
>>>         restore the process by moving BOs into memory even after the
>>>         suspend freeze. Normally work items are frozen together with
>>>         the user space processes unless explicitly marked as not
>>>         freezable.
>>>
>>>         That this causes problem during the first eviction phase is
>>>         just the tip of the iceberg here. If a BO is moved into
>>>         invisible memory during this we wouldn't be able to get it
>>>         out of that in the second phase because SDMA and hw is
>>>         already turned off.
>>>
>>>         @Felix any idea how that can happen? Have you guys marked a
>>>         work item / work queue as not freezable?
>>>
>>>     We don't set anything to non-freezable in KFD.
>>>
>>>     Regards,
>>>       Felix
>>>
>>>         Or maybe the display guys?
>>>
>>>
>>> Do you guys in the display do any delayed update in a work item 
>>> which is marked as not-freezable?
>>>
>>> Otherwise I have absolutely no idea what's going on here.
>>>
>>> Thanks,
>>> Christian.
>>>
>>>
>>>         @Xinhui please investigate what work item that is and where
>>>         that is coming from. Something like "if (adev->in_suspend)
>>>         dump_stack();" in the right place should probably do it.
>>>
>>>         Thanks,
>>>         Christian.
>>>
>>>         Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>>
>>>             [AMD Official Use Only - General]
>>>
>>>             I notice that only user space process are frozen on my
>>>             side.  kthread and workqueue keeps running. Maybe some
>>>             kernel configs are not enabled.
>>>
>>>             I made one module which just prints something like i++
>>>             with mutex lock both in workqueue and kthread. I paste
>>>             some logs below.
>>>
>>>             [438619.696196] XH: 14 from workqueue
>>>
>>>             [438619.700193] XH: 15 from kthread
>>>
>>>             [438620.394335] PM: suspend entry (deep)
>>>
>>>             [438620.399619] Filesystems sync: 0.001 seconds
>>>
>>>             [438620.403887] PM: Preparing system for sleep (deep)
>>>
>>>             [438620.409299] Freezing user space processes
>>>
>>>             [438620.414862] Freezing user space processes completed
>>>             (elapsed 0.001 seconds)
>>>
>>>             [438620.421881] OOM killer disabled.
>>>
>>>             [438620.425197] Freezing remaining freezable tasks
>>>
>>>             [438620.430890] Freezing remaining freezable tasks
>>>             completed (elapsed 0.001 seconds)
>>>
>>>             [438620.438348] PM: Suspending system (deep)
>>>
>>>             .....
>>>
>>>             [438623.746038] PM: suspend of devices complete after
>>>             3303.137 msecs
>>>
>>>             [438623.752125] PM: start suspend of devices complete
>>>             after 3309.713 msecs
>>>
>>>             [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>>>
>>>             [438623.792166] XH: 22 from kthread
>>>
>>>             [438623.824140] XH: 23 from workqueue
>>>
>>>             So BOs definitely can be in use during suspend.
>>>
>>>             Even if kthread or workqueue can be stopped with one
>>>             special kernel config. I think suspend can only stop the
>>>             workqueue with its callback finish.
>>>
>>>             otherwise something like below makes things crazy.
>>>
>>>             LOCK BO
>>>
>>>             do something
>>>
>>>             -> schedule or wait, anycode might sleep.  Stopped by
>>>             suspend now? no, i think.
>>>
>>>             UNLOCK BO
>>>
>>>             I do tests  with  cmds below.
>>>
>>>             echo devices  > /sys/power/pm_test
>>>
>>>             echo 0  > /sys/power/pm_async
>>>
>>>             echo 1  > /sys/power/pm_print_times
>>>
>>>             echo 1 > /sys/power/pm_debug_messages
>>>
>>>             echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>>>
>>>             ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>>>
>>>             pm-suspend
>>>
>>>             thanks
>>>
>>>             xinhui
>>>
>>>             ------------------------------------------------------------------------
>>>
>>>             *发件人:*Christian König <ckoenig.leichtzumerken@gmail.com>
>>>             <mailto:ckoenig.leichtzumerken@gmail.com>
>>>             *发送时间:*2023年9月12日17:01
>>>             *收件人:*Pan, Xinhui <Xinhui.Pan@amd.com>
>>>             <mailto:Xinhui.Pan@amd.com>;
>>>             amd-gfx@lists.freedesktop.org
>>>             <amd-gfx@lists.freedesktop.org>
>>>             <mailto:amd-gfx@lists.freedesktop.org>
>>>             *抄送:*Deucher, Alexander <Alexander.Deucher@amd.com>
>>>             <mailto:Alexander.Deucher@amd.com>; Koenig, Christian
>>>             <Christian.Koenig@amd.com>
>>>             <mailto:Christian.Koenig@amd.com>; Fan, Shikang
>>>             <Shikang.Fan@amd.com> <mailto:Shikang.Fan@amd.com>
>>>             *主题:*Re: [PATCH] drm/amdgpu: Ignore first evction
>>>             failure during suspend
>>>
>>>             When amdgpu_device_suspend() is called processes should
>>>             be frozen
>>>             already. In other words KFD queues etc... should already
>>>             be idle.
>>>
>>>             So when the eviction fails here we missed something
>>>             previously and that
>>>             in turn can cause tons amount of problems.
>>>
>>>             So ignoring those errors is most likely not a good idea
>>>             at all.
>>>
>>>             Regards,
>>>             Christian.
>>>
>>>             Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>>>             > [AMD Official Use Only - General]
>>>             >
>>>             > Oh yep, Pinned BO is moved to other LRU list, So
>>>             eviction fails because of other reason.
>>>             > I will change the comments in the patch.
>>>             > The problem is eviction fails as many reasons, say, BO
>>>             is locked.
>>>             > ASAIK, kfd will stop the queues and flush some
>>>             evict/restore work in its suspend callback. SO the first
>>>             eviction before kfd callback likely fails.
>>>             >
>>>             > -----Original Message-----
>>>             > From: Christian König
>>>             <ckoenig.leichtzumerken@gmail.com>
>>>             <mailto:ckoenig.leichtzumerken@gmail.com>
>>>             > Sent: Friday, September 8, 2023 2:49 PM
>>>             > To: Pan, Xinhui <Xinhui.Pan@amd.com>
>>>             <mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>>>             > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>>             <mailto:Alexander.Deucher@amd.com>; Koenig, Christian
>>>             <Christian.Koenig@amd.com>
>>>             <mailto:Christian.Koenig@amd.com>; Fan, Shikang
>>>             <Shikang.Fan@amd.com> <mailto:Shikang.Fan@amd.com>
>>>             > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction
>>>             failure during suspend
>>>             >
>>>             > Am 08.09.23 um 05:39 schrieb xinhui pan:
>>>             >> Some BOs might be pinned. So the first eviction's
>>>             failure will abort
>>>             >> the suspend sequence. These pinned BOs will be
>>>             unpined afterwards
>>>             >> during suspend.
>>>             > That doesn't make much sense since pinned BOs don't
>>>             cause eviction failure here.
>>>             >
>>>             > What exactly is the error code you see?
>>>             >
>>>             > Christian.
>>>             >
>>>             >> Actaully it has evicted most BOs, so that should stil
>>>             work fine in
>>>             >> sriov full access mode.
>>>             >>
>>>             >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra
>>>             evict_resource call
>>>             >> during device_suspend.")
>>>             >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>             <mailto:xinhui.pan@amd.com>
>>>             >> ---
>>>             >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>>             >>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>             >>
>>>             >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>             >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>             >> index 5c0e2b766026..39af526cdbbe 100644
>>>             >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>             >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>             >> @@ -4148,10 +4148,11 @@ int
>>>             amdgpu_device_suspend(struct drm_device
>>>             >> *dev, bool fbcon)
>>>             >>
>>>             >>        adev->in_suspend = true;
>>>             >>
>>>             >> -     /* Evict the majority of BOs before grabbing
>>>             the full access */
>>>             >> -     r = amdgpu_device_evict_resources(adev);
>>>             >> -     if (r)
>>>             >> -             return r;
>>>             >> +     /* Try to evict the majority of BOs before
>>>             grabbing the full access
>>>             >> +      * Ignore the ret val at first place as we will
>>>             unpin some BOs if any
>>>             >> +      * afterwards.
>>>             >> +      */
>>>             >> + (void)amdgpu_device_evict_resources(adev);
>>>             >>
>>>             >>        if (amdgpu_sriov_vf(adev)) {
>>>             >> amdgpu_virt_fini_data_exchange(adev);
>>>
>>

[-- Attachment #2: Type: text/html, Size: 37172 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
  2023-09-14 13:59                       ` Christian König
@ 2023-09-22 10:38                         ` Christian König
  0 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2023-09-22 10:38 UTC (permalink / raw)
  To: Felix Kuehling, Christian König, Pan, Xinhui,
	amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Fan, Shikang

[-- Attachment #1: Type: text/plain, Size: 14714 bytes --]

Am 14.09.23 um 15:59 schrieb Christian König:
> Am 14.09.23 um 15:37 schrieb Felix Kuehling:
>>
>> Userptr and SVM restore work is scheduled to the system WQ with 
>> schedule_delayed_work. See amdgpu_amdkfd_evict_userptr and 
>> svm_range_evict. This would need to use queue_delayed_work with the 
>> system_freezable_wq.
>>
>> BO restoration is scheduled with queue_delayed_work on our own 
>> kfd_restore_wq that was allocated with alloc_ordered_workqueue. This 
>> would need to add the WQ_FREEZABLE flag when we create the wq in 
>> kfd_process_create_wq.
>>
>> There is also evict_process_worker scheduled with 
>> schedule_delayed_work, which handles stopping of user mode queues, 
>> signaling of eviction fences and scheduling of restore work when BOs 
>> are evicted. I think that should not be freezable because it's needed 
>> to signal the eviction fences to allow suspend to evict BOs.
>>
>> To make sure I'm not misunderstanding, I assume that freezing a 
>> freezable workqueue flushes work items in progress and prevents 
>> execution of more work until it is unfrozen. I assume work items are 
>> not frozen in the middle of execution, because that would not solve 
>> the problem.
>>
>
> I was wondering the exact same thing and to be honest I don't know 
> that detail either and of hand can't find any documentation about it.
>
> My suspicion is that a work item can freeze when it calls schedule(), 
> e.g. when taking a look or similar.

I've found some time to double check this. At least of hand it looks 
like freezing workqueues means that no more work items are scheduled and 
we wait for existing to finish.

So using the freezable workqueues should solve the problem.

Christian.

>
> That would then indeed not work at all and we would need to make sure 
> that the work is completed manually somehow.
>
> Regards,
> Christian.
>
>> Regards,
>>   Felix
>>
>>
>> On 2023-09-14 2:23, Christian König wrote:
>>> [putting Harry on BCC, sorry for the noise]
>>>
>>> Yeah, that is clearly a bug in the KFD.
>>>
>>> During the second eviction the hw should already be disabled, so we 
>>> don't have any SDMA or similar to evict BOs any more and can only 
>>> copy them with the CPU.
>>>
>>> @Felix what workqueue do you guys use for the restore work? I've 
>>> just double checked and on the system workqueues you explicitly need 
>>> to specify that stuff is freezable. E.g. use system_freezable_wq 
>>> instead of system_wq.
>>>
>>> Alternatively as Xinhui mentioned it might be necessary to flush all 
>>> restore work before the first eviction phase or we have the chance 
>>> that BOs are moved back into VRAM again.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 14.09.23 um 03:54 schrieb Pan, Xinhui:
>>>>
>>>> [AMD Official Use Only - General]
>>>>
>>>>
>>>> I just make one debug patch to show busy BO’s alloc-trace when the 
>>>> eviction fails in suspend.
>>>>
>>>> And dmesg log attached.
>>>>
>>>> Looks like they are just kfd user Bos and locked by evict/restore work.
>>>>
>>>> So in kfd suspend callback, it really need to flush the 
>>>> evict/restore work before HW fini as it do now.
>>>>
>>>> That is why the first very early eviction fails and the second 
>>>> eviction succeed.
>>>>
>>>> Thanks
>>>>
>>>> xinhui
>>>>
>>>> *From:* Pan, Xinhui
>>>> *Sent:* Thursday, September 14, 2023 8:02 AM
>>>> *To:* Koenig, Christian <Christian.Koenig@amd.com>; Kuehling, Felix 
>>>> <Felix.Kuehling@amd.com>; Christian König 
>>>> <ckoenig.leichtzumerken@gmail.com>; amd-gfx@lists.freedesktop.org; 
>>>> Wentland, Harry <Harry.Wentland@amd.com>
>>>> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang 
>>>> <Shikang.Fan@amd.com>
>>>> *Subject:* RE: 回复: [PATCH] drm/amdgpu: Ignore first evction failure 
>>>> during suspend
>>>>
>>>> Chris,
>>>>
>>>> I can dump these busy BOs with their alloc/free stack later today.
>>>>
>>>> BTW, the two evictions and the kfd suspend are all called before 
>>>> hw_fini. IOW, between phase 1 and phase 2. SDMA is turned only in 
>>>> phase2. So current code works fine maybe.
>>>>
>>>> *From:* Koenig, Christian <Christian.Koenig@amd.com>
>>>> *Sent:* Wednesday, September 13, 2023 10:29 PM
>>>> *To:* Kuehling, Felix <Felix.Kuehling@amd.com>; Christian König 
>>>> <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui 
>>>> <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org; Wentland, 
>>>> Harry <Harry.Wentland@amd.com>
>>>> *Cc:* Deucher, Alexander <Alexander.Deucher@amd.com>; Fan, Shikang 
>>>> <Shikang.Fan@amd.com>
>>>> *Subject:* Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure 
>>>> during suspend
>>>>
>>>> [+Harry]
>>>>
>>>> Am 13.09.23 um 15:54 schrieb Felix Kuehling:
>>>>
>>>>     On 2023-09-13 4:07, Christian König wrote:
>>>>
>>>>         [+Fleix]
>>>>
>>>>         Well that looks like quite a serious bug.
>>>>
>>>>         If I'm not completely mistaken the KFD work item tries to
>>>>         restore the process by moving BOs into memory even after
>>>>         the suspend freeze. Normally work items are frozen together
>>>>         with the user space processes unless explicitly marked as
>>>>         not freezable.
>>>>
>>>>         That this causes problem during the first eviction phase is
>>>>         just the tip of the iceberg here. If a BO is moved into
>>>>         invisible memory during this we wouldn't be able to get it
>>>>         out of that in the second phase because SDMA and hw is
>>>>         already turned off.
>>>>
>>>>         @Felix any idea how that can happen? Have you guys marked a
>>>>         work item / work queue as not freezable?
>>>>
>>>>     We don't set anything to non-freezable in KFD.
>>>>
>>>>     Regards,
>>>>       Felix
>>>>
>>>>         Or maybe the display guys?
>>>>
>>>>
>>>> Do you guys in the display do any delayed update in a work item 
>>>> which is marked as not-freezable?
>>>>
>>>> Otherwise I have absolutely no idea what's going on here.
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>>
>>>>         @Xinhui please investigate what work item that is and where
>>>>         that is coming from. Something like "if (adev->in_suspend)
>>>>         dump_stack();" in the right place should probably do it.
>>>>
>>>>         Thanks,
>>>>         Christian.
>>>>
>>>>         Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>>>
>>>>             [AMD Official Use Only - General]
>>>>
>>>>             I notice that only user space process are frozen on my
>>>>             side.  kthread and workqueue keeps running. Maybe some
>>>>             kernel configs are not enabled.
>>>>
>>>>             I made one module which just prints something like i++
>>>>             with mutex lock both in workqueue and kthread. I paste
>>>>             some logs below.
>>>>
>>>>             [438619.696196] XH: 14 from workqueue
>>>>
>>>>             [438619.700193] XH: 15 from kthread
>>>>
>>>>             [438620.394335] PM: suspend entry (deep)
>>>>
>>>>             [438620.399619] Filesystems sync: 0.001 seconds
>>>>
>>>>             [438620.403887] PM: Preparing system for sleep (deep)
>>>>
>>>>             [438620.409299] Freezing user space processes
>>>>
>>>>             [438620.414862] Freezing user space processes completed
>>>>             (elapsed 0.001 seconds)
>>>>
>>>>             [438620.421881] OOM killer disabled.
>>>>
>>>>             [438620.425197] Freezing remaining freezable tasks
>>>>
>>>>             [438620.430890] Freezing remaining freezable tasks
>>>>             completed (elapsed 0.001 seconds)
>>>>
>>>>             [438620.438348] PM: Suspending system (deep)
>>>>
>>>>             .....
>>>>
>>>>             [438623.746038] PM: suspend of devices complete after
>>>>             3303.137 msecs
>>>>
>>>>             [438623.752125] PM: start suspend of devices complete
>>>>             after 3309.713 msecs
>>>>
>>>>             [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>>>>
>>>>             [438623.792166] XH: 22 from kthread
>>>>
>>>>             [438623.824140] XH: 23 from workqueue
>>>>
>>>>             So BOs definitely can be in use during suspend.
>>>>
>>>>             Even if kthread or workqueue can be stopped with one
>>>>             special kernel config. I think suspend can only stop
>>>>             the workqueue with its callback finish.
>>>>
>>>>             otherwise something like below makes things crazy.
>>>>
>>>>             LOCK BO
>>>>
>>>>             do something
>>>>
>>>>             -> schedule or wait, anycode might sleep.  Stopped by
>>>>             suspend now? no, i think.
>>>>
>>>>             UNLOCK BO
>>>>
>>>>             I do tests  with  cmds below.
>>>>
>>>>             echo devices  > /sys/power/pm_test
>>>>
>>>>             echo 0  > /sys/power/pm_async
>>>>
>>>>             echo 1  > /sys/power/pm_print_times
>>>>
>>>>             echo 1 > /sys/power/pm_debug_messages
>>>>
>>>>             echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>>>>
>>>>             ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>>>>
>>>>             pm-suspend
>>>>
>>>>             thanks
>>>>
>>>>             xinhui
>>>>
>>>>             ------------------------------------------------------------------------
>>>>
>>>>             *发件人:*Christian König
>>>>             <ckoenig.leichtzumerken@gmail.com>
>>>>             <mailto:ckoenig.leichtzumerken@gmail.com>
>>>>             *发送时间:*2023年9月12日17:01
>>>>             *收件人:*Pan, Xinhui <Xinhui.Pan@amd.com>
>>>>             <mailto:Xinhui.Pan@amd.com>;
>>>>             amd-gfx@lists.freedesktop.org
>>>>             <amd-gfx@lists.freedesktop.org>
>>>>             <mailto:amd-gfx@lists.freedesktop.org>
>>>>             *抄送:*Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>             <mailto:Alexander.Deucher@amd.com>; Koenig, Christian
>>>>             <Christian.Koenig@amd.com>
>>>>             <mailto:Christian.Koenig@amd.com>; Fan, Shikang
>>>>             <Shikang.Fan@amd.com> <mailto:Shikang.Fan@amd.com>
>>>>             *主题:*Re: [PATCH] drm/amdgpu: Ignore first evction
>>>>             failure during suspend
>>>>
>>>>             When amdgpu_device_suspend() is called processes should
>>>>             be frozen
>>>>             already. In other words KFD queues etc... should
>>>>             already be idle.
>>>>
>>>>             So when the eviction fails here we missed something
>>>>             previously and that
>>>>             in turn can cause tons amount of problems.
>>>>
>>>>             So ignoring those errors is most likely not a good idea
>>>>             at all.
>>>>
>>>>             Regards,
>>>>             Christian.
>>>>
>>>>             Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>>>>             > [AMD Official Use Only - General]
>>>>             >
>>>>             > Oh yep, Pinned BO is moved to other LRU list, So
>>>>             eviction fails because of other reason.
>>>>             > I will change the comments in the patch.
>>>>             > The problem is eviction fails as many reasons, say,
>>>>             BO is locked.
>>>>             > ASAIK, kfd will stop the queues and flush some
>>>>             evict/restore work in its suspend callback. SO the
>>>>             first eviction before kfd callback likely fails.
>>>>             >
>>>>             > -----Original Message-----
>>>>             > From: Christian König
>>>>             <ckoenig.leichtzumerken@gmail.com>
>>>>             <mailto:ckoenig.leichtzumerken@gmail.com>
>>>>             > Sent: Friday, September 8, 2023 2:49 PM
>>>>             > To: Pan, Xinhui <Xinhui.Pan@amd.com>
>>>>             <mailto:Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>>>>             > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>>>>             <mailto:Alexander.Deucher@amd.com>; Koenig, Christian
>>>>             <Christian.Koenig@amd.com>
>>>>             <mailto:Christian.Koenig@amd.com>; Fan, Shikang
>>>>             <Shikang.Fan@amd.com> <mailto:Shikang.Fan@amd.com>
>>>>             > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction
>>>>             failure during suspend
>>>>             >
>>>>             > Am 08.09.23 um 05:39 schrieb xinhui pan:
>>>>             >> Some BOs might be pinned. So the first eviction's
>>>>             failure will abort
>>>>             >> the suspend sequence. These pinned BOs will be
>>>>             unpined afterwards
>>>>             >> during suspend.
>>>>             > That doesn't make much sense since pinned BOs don't
>>>>             cause eviction failure here.
>>>>             >
>>>>             > What exactly is the error code you see?
>>>>             >
>>>>             > Christian.
>>>>             >
>>>>             >> Actaully it has evicted most BOs, so that should
>>>>             stil work fine in
>>>>             >> sriov full access mode.
>>>>             >>
>>>>             >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra
>>>>             evict_resource call
>>>>             >> during device_suspend.")
>>>>             >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>             <mailto:xinhui.pan@amd.com>
>>>>             >> ---
>>>>             >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>>>             >>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>>             >>
>>>>             >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>             >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>             >> index 5c0e2b766026..39af526cdbbe 100644
>>>>             >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>             >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>             >> @@ -4148,10 +4148,11 @@ int
>>>>             amdgpu_device_suspend(struct drm_device
>>>>             >> *dev, bool fbcon)
>>>>             >>
>>>>             >>        adev->in_suspend = true;
>>>>             >>
>>>>             >> -     /* Evict the majority of BOs before grabbing
>>>>             the full access */
>>>>             >> -     r = amdgpu_device_evict_resources(adev);
>>>>             >> -     if (r)
>>>>             >> -             return r;
>>>>             >> +     /* Try to evict the majority of BOs before
>>>>             grabbing the full access
>>>>             >> +      * Ignore the ret val at first place as we
>>>>             will unpin some BOs if any
>>>>             >> +      * afterwards.
>>>>             >> +      */
>>>>             >> + (void)amdgpu_device_evict_resources(adev);
>>>>             >>
>>>>             >>        if (amdgpu_sriov_vf(adev)) {
>>>>             >> amdgpu_virt_fini_data_exchange(adev);
>>>>
>>>
>

[-- Attachment #2: Type: text/html, Size: 39195 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-09-22 10:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-08  3:39 [PATCH] drm/amdgpu: Ignore first evction failure during suspend xinhui pan
2023-09-08  6:48 ` Christian König
2023-09-12  0:21   ` Pan, Xinhui
2023-09-12  9:01     ` Christian König
2023-09-13  5:13       ` 回复: " Pan, Xinhui
2023-09-13  8:07         ` Christian König
2023-09-13 13:54           ` Felix Kuehling
2023-09-13 14:28             ` Christian König
2023-09-14  0:02               ` Pan, Xinhui
2023-09-14  1:54                 ` Pan, Xinhui
2023-09-14  6:23                   ` Christian König
2023-09-14 13:37                     ` Felix Kuehling
2023-09-14 13:59                       ` Christian König
2023-09-22 10:38                         ` Christian König
  -- strict thread matches above, loose matches on Subject: below --
2023-09-12  2:03 xinhui pan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox