Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend

AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "Christian König" <christian.koenig@amd.com>
To: "Felix Kuehling" <felix.kuehling@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Pan, Xinhui" <Xinhui.Pan@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
	"Wentland, Harry" <Harry.Wentland@amd.com>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Fan, Shikang" <Shikang.Fan@amd.com>
Subject: Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
Date: Wed, 13 Sep 2023 16:28:59 +0200	[thread overview]
Message-ID: <59d9927d-5216-e2fa-22cd-205e4a35ebaa@amd.com> (raw)
In-Reply-To: <d9037d0e-e9fb-35f6-9e00-a2e1799bc2f6@amd.com>

[-- Attachment #1: Type: text/plain, Size: 7104 bytes --]

[+Harry]

Am 13.09.23 um 15:54 schrieb Felix Kuehling:
> On 2023-09-13 4:07, Christian König wrote:
>> [+Fleix]
>>
>> Well that looks like quite a serious bug.
>>
>> If I'm not completely mistaken the KFD work item tries to restore the 
>> process by moving BOs into memory even after the suspend freeze. 
>> Normally work items are frozen together with the user space processes 
>> unless explicitly marked as not freezable.
>>
>> That this causes problem during the first eviction phase is just the 
>> tip of the iceberg here. If a BO is moved into invisible memory 
>> during this we wouldn't be able to get it out of that in the second 
>> phase because SDMA and hw is already turned off.
>>
>> @Felix any idea how that can happen? Have you guys marked a work item 
>> / work queue as not freezable?
>
> We don't set anything to non-freezable in KFD.
>
>
> Regards,
>   Felix
>
>
>> Or maybe the display guys?

Do you guys in the display do any delayed update in a work item which is 
marked as not-freezable?

Otherwise I have absolutely no idea what's going on here.

Thanks,
Christian.

>>
>> @Xinhui please investigate what work item that is and where that is 
>> coming from. Something like "if (adev->in_suspend) dump_stack();" in 
>> the right place should probably do it.
>>
>> Thanks,
>> Christian.
>>
>> Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>>
>>> [AMD Official Use Only - General]
>>>
>>>
>>> I notice that only user space process are frozen on my side.  
>>> kthread and workqueue keeps running. Maybe some kernel configs are 
>>> not enabled.
>>> I made one module which just prints something like i++ with mutex 
>>> lock both in workqueue and kthread. I paste some logs below.
>>> [438619.696196] XH: 14 from workqueue
>>> [438619.700193] XH: 15 from kthread
>>> [438620.394335] PM: suspend entry (deep)
>>> [438620.399619] Filesystems sync: 0.001 seconds
>>> [438620.403887] PM: Preparing system for sleep (deep)
>>> [438620.409299] Freezing user space processes
>>> [438620.414862] Freezing user space processes completed (elapsed 
>>> 0.001 seconds)
>>> [438620.421881] OOM killer disabled.
>>> [438620.425197] Freezing remaining freezable tasks
>>> [438620.430890] Freezing remaining freezable tasks completed 
>>> (elapsed 0.001 seconds)
>>> [438620.438348] PM: Suspending system (deep)
>>> .....
>>> [438623.746038] PM: suspend of devices complete after 3303.137 msecs
>>> [438623.752125] PM: start suspend of devices complete after 3309.713 
>>> msecs
>>> [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>>> [438623.792166] XH: 22 from kthread
>>> [438623.824140] XH: 23 from workqueue
>>>
>>>
>>> So BOs definitely can be in use during suspend.
>>> Even if kthread or workqueue can be stopped with one special kernel 
>>> config. I think suspend can only stop the workqueue with its 
>>> callback finish.
>>> otherwise something like below makes things crazy.
>>> LOCK BO
>>> do something
>>>     -> schedule or wait, anycode might sleep.  Stopped by suspend 
>>> now? no, i think.
>>> UNLOCK BO
>>>
>>> I do tests  with  cmds below.
>>> echo devices  > /sys/power/pm_test
>>> echo 0  > /sys/power/pm_async
>>> echo 1  > /sys/power/pm_print_times
>>> echo 1 > /sys/power/pm_debug_messages
>>> echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>>> ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>>> pm-suspend
>>>
>>> thanks
>>> xinhui
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *发件人:* Christian König <ckoenig.leichtzumerken@gmail.com>
>>> *发送时间:* 2023年9月12日 17:01
>>> *收件人:* Pan, Xinhui <Xinhui.Pan@amd.com>; 
>>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>>> *抄送:* Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
>>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>>> *主题:* Re: [PATCH] drm/amdgpu: Ignore first evction failure during 
>>> suspend
>>> When amdgpu_device_suspend() is called processes should be frozen
>>> already. In other words KFD queues etc... should already be idle.
>>>
>>> So when the eviction fails here we missed something previously and that
>>> in turn can cause tons amount of problems.
>>>
>>> So ignoring those errors is most likely not a good idea at all.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>>> > [AMD Official Use Only - General]
>>> >
>>> > Oh yep, Pinned BO is moved to other LRU list, So eviction fails 
>>> because of other reason.
>>> > I will change the comments in the patch.
>>> > The problem is eviction fails as many reasons, say, BO is locked.
>>> > ASAIK, kfd will stop the queues and flush some evict/restore work 
>>> in its suspend callback. SO the first eviction before kfd callback 
>>> likely fails.
>>> >
>>> > -----Original Message-----
>>> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>> > Sent: Friday, September 8, 2023 2:49 PM
>>> > To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>>> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
>>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>>> > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure 
>>> during suspend
>>> >
>>> > Am 08.09.23 um 05:39 schrieb xinhui pan:
>>> >> Some BOs might be pinned. So the first eviction's failure will abort
>>> >> the suspend sequence. These pinned BOs will be unpined afterwards
>>> >> during suspend.
>>> > That doesn't make much sense since pinned BOs don't cause eviction 
>>> failure here.
>>> >
>>> > What exactly is the error code you see?
>>> >
>>> > Christian.
>>> >
>>> >> Actaully it has evicted most BOs, so that should stil work fine in
>>> >> sriov full access mode.
>>> >>
>>> >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>>> >> during device_suspend.")
>>> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>> >> ---
>>> >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>>> >>    1 file changed, 5 insertions(+), 4 deletions(-)
>>> >>
>>> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> >> index 5c0e2b766026..39af526cdbbe 100644
>>> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> >> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>>> >> *dev, bool fbcon)
>>> >>
>>> >>        adev->in_suspend = true;
>>> >>
>>> >> -     /* Evict the majority of BOs before grabbing the full access */
>>> >> -     r = amdgpu_device_evict_resources(adev);
>>> >> -     if (r)
>>> >> -             return r;
>>> >> +     /* Try to evict the majority of BOs before grabbing the 
>>> full access
>>> >> +      * Ignore the ret val at first place as we will unpin some 
>>> BOs if any
>>> >> +      * afterwards.
>>> >> +      */
>>> >> + (void)amdgpu_device_evict_resources(adev);
>>> >>
>>> >>        if (amdgpu_sriov_vf(adev)) {
>>> >> amdgpu_virt_fini_data_exchange(adev);
>>>
>>

[-- Attachment #2: Type: text/html, Size: 20227 bytes --]

next prev parent reply	other threads:[~2023-09-13 14:29 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-08  3:39 [PATCH] drm/amdgpu: Ignore first evction failure during suspend xinhui pan
2023-09-08  6:48 ` Christian König
2023-09-12  0:21   ` Pan, Xinhui
2023-09-12  9:01     ` Christian König
2023-09-13  5:13       ` 回复: " Pan, Xinhui
2023-09-13  8:07         ` Christian König
2023-09-13 13:54           ` Felix Kuehling
2023-09-13 14:28             ` Christian König [this message]
2023-09-14  0:02               ` Pan, Xinhui
2023-09-14  1:54                 ` Pan, Xinhui
2023-09-14  6:23                   ` Christian König
2023-09-14 13:37                     ` Felix Kuehling
2023-09-14 13:59                       ` Christian König
2023-09-22 10:38                         ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=59d9927d-5216-e2fa-22cd-205e4a35ebaa@amd.com \
    --to=christian.koenig@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Harry.Wentland@amd.com \
    --cc=Shikang.Fan@amd.com \
    --cc=Xinhui.Pan@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=felix.kuehling@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox