AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Felix Kuehling <felix.kuehling@amd.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Pan, Xinhui" <Xinhui.Pan@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>,
	"Fan, Shikang" <Shikang.Fan@amd.com>
Subject: Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
Date: Wed, 13 Sep 2023 09:54:33 -0400	[thread overview]
Message-ID: <d9037d0e-e9fb-35f6-9e00-a2e1799bc2f6@amd.com> (raw)
In-Reply-To: <303c2bbb-865c-d5da-1418-21dc803f61a3@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6692 bytes --]

On 2023-09-13 4:07, Christian König wrote:
> [+Fleix]
>
> Well that looks like quite a serious bug.
>
> If I'm not completely mistaken the KFD work item tries to restore the 
> process by moving BOs into memory even after the suspend freeze. 
> Normally work items are frozen together with the user space processes 
> unless explicitly marked as not freezable.
>
> That this causes problem during the first eviction phase is just the 
> tip of the iceberg here. If a BO is moved into invisible memory during 
> this we wouldn't be able to get it out of that in the second phase 
> because SDMA and hw is already turned off.
>
> @Felix any idea how that can happen? Have you guys marked a work item 
> / work queue as not freezable?

We don't set anything to non-freezable in KFD.


Regards,
   Felix


> Or maybe the display guys?
>
> @Xinhui please investigate what work item that is and where that is 
> coming from. Something like "if (adev->in_suspend) dump_stack();" in 
> the right place should probably do it.
>
> Thanks,
> Christian.
>
> Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>
>> [AMD Official Use Only - General]
>>
>>
>> I notice that only user space process are frozen on my side.  kthread 
>> and workqueue  keeps running. Maybe some kernel configs are not enabled.
>> I made one module which just prints something like i++ with mutex 
>> lock both in workqueue and kthread. I paste some logs below.
>> [438619.696196] XH: 14 from workqueue
>> [438619.700193] XH: 15 from kthread
>> [438620.394335] PM: suspend entry (deep)
>> [438620.399619] Filesystems sync: 0.001 seconds
>> [438620.403887] PM: Preparing system for sleep (deep)
>> [438620.409299] Freezing user space processes
>> [438620.414862] Freezing user space processes completed (elapsed 
>> 0.001 seconds)
>> [438620.421881] OOM killer disabled.
>> [438620.425197] Freezing remaining freezable tasks
>> [438620.430890] Freezing remaining freezable tasks completed (elapsed 
>> 0.001 seconds)
>> [438620.438348] PM: Suspending system (deep)
>> .....
>> [438623.746038] PM: suspend of devices complete after 3303.137 msecs
>> [438623.752125] PM: start suspend of devices complete after 3309.713 
>> msecs
>> [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>> [438623.792166] XH: 22 from kthread
>> [438623.824140] XH: 23 from workqueue
>>
>>
>> So BOs definitely can be in use during suspend.
>> Even if kthread or workqueue can be stopped with one special kernel 
>> config. I think suspend can only stop the workqueue with its callback 
>> finish.
>> otherwise something like below makes things crazy.
>> LOCK BO
>> do something
>>     -> schedule or wait, anycode might sleep.  Stopped by suspend 
>> now? no, i think.
>> UNLOCK BO
>>
>> I do tests  with  cmds below.
>> echo devices  > /sys/power/pm_test
>> echo 0  > /sys/power/pm_async
>> echo 1  > /sys/power/pm_print_times
>> echo 1 > /sys/power/pm_debug_messages
>> echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>> ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>> pm-suspend
>>
>> thanks
>> xinhui
>>
>>
>> ------------------------------------------------------------------------
>> *发件人:* Christian König <ckoenig.leichtzumerken@gmail.com>
>> *发送时间:* 2023年9月12日 17:01
>> *收件人:* Pan, Xinhui <Xinhui.Pan@amd.com>; 
>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>> *抄送:* Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>> *主题:* Re: [PATCH] drm/amdgpu: Ignore first evction failure during 
>> suspend
>> When amdgpu_device_suspend() is called processes should be frozen
>> already. In other words KFD queues etc... should already be idle.
>>
>> So when the eviction fails here we missed something previously and that
>> in turn can cause tons amount of problems.
>>
>> So ignoring those errors is most likely not a good idea at all.
>>
>> Regards,
>> Christian.
>>
>> Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>> > [AMD Official Use Only - General]
>> >
>> > Oh yep, Pinned BO is moved to other LRU list, So eviction fails 
>> because of other reason.
>> > I will change the comments in the patch.
>> > The problem is eviction fails as many reasons, say, BO is locked.
>> > ASAIK, kfd will stop the queues and flush some evict/restore work 
>> in its suspend callback. SO the first eviction before kfd callback 
>> likely fails.
>> >
>> > -----Original Message-----
>> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> > Sent: Friday, September 8, 2023 2:49 PM
>> > To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, 
>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>> > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure 
>> during suspend
>> >
>> > Am 08.09.23 um 05:39 schrieb xinhui pan:
>> >> Some BOs might be pinned. So the first eviction's failure will abort
>> >> the suspend sequence. These pinned BOs will be unpined afterwards
>> >> during suspend.
>> > That doesn't make much sense since pinned BOs don't cause eviction 
>> failure here.
>> >
>> > What exactly is the error code you see?
>> >
>> > Christian.
>> >
>> >> Actaully it has evicted most BOs, so that should stil work fine in
>> >> sriov full access mode.
>> >>
>> >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>> >> during device_suspend.")
>> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> >> ---
>> >>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>> >>    1 file changed, 5 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> index 5c0e2b766026..39af526cdbbe 100644
>> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>> >> *dev, bool fbcon)
>> >>
>> >>        adev->in_suspend = true;
>> >>
>> >> -     /* Evict the majority of BOs before grabbing the full access */
>> >> -     r = amdgpu_device_evict_resources(adev);
>> >> -     if (r)
>> >> -             return r;
>> >> +     /* Try to evict the majority of BOs before grabbing the full 
>> access
>> >> +      * Ignore the ret val at first place as we will unpin some 
>> BOs if any
>> >> +      * afterwards.
>> >> +      */
>> >> + (void)amdgpu_device_evict_resources(adev);
>> >>
>> >>        if (amdgpu_sriov_vf(adev)) {
>> >> amdgpu_virt_fini_data_exchange(adev);
>>
>

[-- Attachment #2: Type: text/html, Size: 19029 bytes --]

  reply	other threads:[~2023-09-13 13:54 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-08  3:39 [PATCH] drm/amdgpu: Ignore first evction failure during suspend xinhui pan
2023-09-08  6:48 ` Christian König
2023-09-12  0:21   ` Pan, Xinhui
2023-09-12  9:01     ` Christian König
2023-09-13  5:13       ` 回复: " Pan, Xinhui
2023-09-13  8:07         ` Christian König
2023-09-13 13:54           ` Felix Kuehling [this message]
2023-09-13 14:28             ` Christian König
2023-09-14  0:02               ` Pan, Xinhui
2023-09-14  1:54                 ` Pan, Xinhui
2023-09-14  6:23                   ` Christian König
2023-09-14 13:37                     ` Felix Kuehling
2023-09-14 13:59                       ` Christian König
2023-09-22 10:38                         ` Christian König

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d9037d0e-e9fb-35f6-9e00-a2e1799bc2f6@amd.com \
    --to=felix.kuehling@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Christian.Koenig@amd.com \
    --cc=Shikang.Fan@amd.com \
    --cc=Xinhui.Pan@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=ckoenig.leichtzumerken@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox