From: Felix Kuehling <felix.kuehling@amd.com>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Pan, Xinhui" <Xinhui.Pan@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
"Fan, Shikang" <Shikang.Fan@amd.com>
Subject: Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
Date: Wed, 13 Sep 2023 09:54:33 -0400 [thread overview]
Message-ID: <d9037d0e-e9fb-35f6-9e00-a2e1799bc2f6@amd.com> (raw)
In-Reply-To: <303c2bbb-865c-d5da-1418-21dc803f61a3@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 6692 bytes --]
On 2023-09-13 4:07, Christian König wrote:
> [+Fleix]
>
> Well that looks like quite a serious bug.
>
> If I'm not completely mistaken the KFD work item tries to restore the
> process by moving BOs into memory even after the suspend freeze.
> Normally work items are frozen together with the user space processes
> unless explicitly marked as not freezable.
>
> That this causes problem during the first eviction phase is just the
> tip of the iceberg here. If a BO is moved into invisible memory during
> this we wouldn't be able to get it out of that in the second phase
> because SDMA and hw is already turned off.
>
> @Felix any idea how that can happen? Have you guys marked a work item
> / work queue as not freezable?
We don't set anything to non-freezable in KFD.
Regards,
Felix
> Or maybe the display guys?
>
> @Xinhui please investigate what work item that is and where that is
> coming from. Something like "if (adev->in_suspend) dump_stack();" in
> the right place should probably do it.
>
> Thanks,
> Christian.
>
> Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>
>> [AMD Official Use Only - General]
>>
>>
>> I notice that only user space process are frozen on my side. kthread
>> and workqueue keeps running. Maybe some kernel configs are not enabled.
>> I made one module which just prints something like i++ with mutex
>> lock both in workqueue and kthread. I paste some logs below.
>> [438619.696196] XH: 14 from workqueue
>> [438619.700193] XH: 15 from kthread
>> [438620.394335] PM: suspend entry (deep)
>> [438620.399619] Filesystems sync: 0.001 seconds
>> [438620.403887] PM: Preparing system for sleep (deep)
>> [438620.409299] Freezing user space processes
>> [438620.414862] Freezing user space processes completed (elapsed
>> 0.001 seconds)
>> [438620.421881] OOM killer disabled.
>> [438620.425197] Freezing remaining freezable tasks
>> [438620.430890] Freezing remaining freezable tasks completed (elapsed
>> 0.001 seconds)
>> [438620.438348] PM: Suspending system (deep)
>> .....
>> [438623.746038] PM: suspend of devices complete after 3303.137 msecs
>> [438623.752125] PM: start suspend of devices complete after 3309.713
>> msecs
>> [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>> [438623.792166] XH: 22 from kthread
>> [438623.824140] XH: 23 from workqueue
>>
>>
>> So BOs definitely can be in use during suspend.
>> Even if kthread or workqueue can be stopped with one special kernel
>> config. I think suspend can only stop the workqueue with its callback
>> finish.
>> otherwise something like below makes things crazy.
>> LOCK BO
>> do something
>> -> schedule or wait, anycode might sleep. Stopped by suspend
>> now? no, i think.
>> UNLOCK BO
>>
>> I do tests with cmds below.
>> echo devices > /sys/power/pm_test
>> echo 0 > /sys/power/pm_async
>> echo 1 > /sys/power/pm_print_times
>> echo 1 > /sys/power/pm_debug_messages
>> echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>> ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>> pm-suspend
>>
>> thanks
>> xinhui
>>
>>
>> ------------------------------------------------------------------------
>> *发件人:* Christian König <ckoenig.leichtzumerken@gmail.com>
>> *发送时间:* 2023年9月12日 17:01
>> *收件人:* Pan, Xinhui <Xinhui.Pan@amd.com>;
>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
>> *抄送:* Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig,
>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>> *主题:* Re: [PATCH] drm/amdgpu: Ignore first evction failure during
>> suspend
>> When amdgpu_device_suspend() is called processes should be frozen
>> already. In other words KFD queues etc... should already be idle.
>>
>> So when the eviction fails here we missed something previously and that
>> in turn can cause tons amount of problems.
>>
>> So ignoring those errors is most likely not a good idea at all.
>>
>> Regards,
>> Christian.
>>
>> Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>> > [AMD Official Use Only - General]
>> >
>> > Oh yep, Pinned BO is moved to other LRU list, So eviction fails
>> because of other reason.
>> > I will change the comments in the patch.
>> > The problem is eviction fails as many reasons, say, BO is locked.
>> > ASAIK, kfd will stop the queues and flush some evict/restore work
>> in its suspend callback. SO the first eviction before kfd callback
>> likely fails.
>> >
>> > -----Original Message-----
>> > From: Christian König <ckoenig.leichtzumerken@gmail.com>
>> > Sent: Friday, September 8, 2023 2:49 PM
>> > To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig,
>> Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com>
>> > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure
>> during suspend
>> >
>> > Am 08.09.23 um 05:39 schrieb xinhui pan:
>> >> Some BOs might be pinned. So the first eviction's failure will abort
>> >> the suspend sequence. These pinned BOs will be unpined afterwards
>> >> during suspend.
>> > That doesn't make much sense since pinned BOs don't cause eviction
>> failure here.
>> >
>> > What exactly is the error code you see?
>> >
>> > Christian.
>> >
>> >> Actaully it has evicted most BOs, so that should stil work fine in
>> >> sriov full access mode.
>> >>
>> >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>> >> during device_suspend.")
>> >> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> >> ---
>> >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>> >> 1 file changed, 5 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> index 5c0e2b766026..39af526cdbbe 100644
>> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>> >> *dev, bool fbcon)
>> >>
>> >> adev->in_suspend = true;
>> >>
>> >> - /* Evict the majority of BOs before grabbing the full access */
>> >> - r = amdgpu_device_evict_resources(adev);
>> >> - if (r)
>> >> - return r;
>> >> + /* Try to evict the majority of BOs before grabbing the full
>> access
>> >> + * Ignore the ret val at first place as we will unpin some
>> BOs if any
>> >> + * afterwards.
>> >> + */
>> >> + (void)amdgpu_device_evict_resources(adev);
>> >>
>> >> if (amdgpu_sriov_vf(adev)) {
>> >> amdgpu_virt_fini_data_exchange(adev);
>>
>
[-- Attachment #2: Type: text/html, Size: 19029 bytes --]
next prev parent reply other threads:[~2023-09-13 13:54 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-08 3:39 [PATCH] drm/amdgpu: Ignore first evction failure during suspend xinhui pan
2023-09-08 6:48 ` Christian König
2023-09-12 0:21 ` Pan, Xinhui
2023-09-12 9:01 ` Christian König
2023-09-13 5:13 ` 回复: " Pan, Xinhui
2023-09-13 8:07 ` Christian König
2023-09-13 13:54 ` Felix Kuehling [this message]
2023-09-13 14:28 ` Christian König
2023-09-14 0:02 ` Pan, Xinhui
2023-09-14 1:54 ` Pan, Xinhui
2023-09-14 6:23 ` Christian König
2023-09-14 13:37 ` Felix Kuehling
2023-09-14 13:59 ` Christian König
2023-09-22 10:38 ` Christian König
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d9037d0e-e9fb-35f6-9e00-a2e1799bc2f6@amd.com \
--to=felix.kuehling@amd.com \
--cc=Alexander.Deucher@amd.com \
--cc=Christian.Koenig@amd.com \
--cc=Shikang.Fan@amd.com \
--cc=Xinhui.Pan@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=ckoenig.leichtzumerken@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox