On 2023-09-13 4:07, Christian König wrote: > [+Fleix] > > Well that looks like quite a serious bug. > > If I'm not completely mistaken the KFD work item tries to restore the > process by moving BOs into memory even after the suspend freeze. > Normally work items are frozen together with the user space processes > unless explicitly marked as not freezable. > > That this causes problem during the first eviction phase is just the > tip of the iceberg here. If a BO is moved into invisible memory during > this we wouldn't be able to get it out of that in the second phase > because SDMA and hw is already turned off. > > @Felix any idea how that can happen? Have you guys marked a work item > / work queue as not freezable? We don't set anything to non-freezable in KFD. Regards,   Felix > Or maybe the display guys? > > @Xinhui please investigate what work item that is and where that is > coming from. Something like "if (adev->in_suspend) dump_stack();" in > the right place should probably do it. > > Thanks, > Christian. > > Am 13.09.23 um 07:13 schrieb Pan, Xinhui: >> >> [AMD Official Use Only - General] >> >> >> I notice that only user space process are frozen on my side.  kthread >> and workqueue  keeps running. Maybe some kernel configs are not enabled. >> I made one module which just prints something like i++ with mutex >> lock both in workqueue and kthread. I paste some logs below. >> [438619.696196] XH: 14 from workqueue >> [438619.700193] XH: 15 from kthread >> [438620.394335] PM: suspend entry (deep) >> [438620.399619] Filesystems sync: 0.001 seconds >> [438620.403887] PM: Preparing system for sleep (deep) >> [438620.409299] Freezing user space processes >> [438620.414862] Freezing user space processes completed (elapsed >> 0.001 seconds) >> [438620.421881] OOM killer disabled. >> [438620.425197] Freezing remaining freezable tasks >> [438620.430890] Freezing remaining freezable tasks completed (elapsed >> 0.001 seconds) >> [438620.438348] PM: Suspending system (deep) >> ..... >> [438623.746038] PM: suspend of devices complete after 3303.137 msecs >> [438623.752125] PM: start suspend of devices complete after 3309.713 >> msecs >> [438623.758722] PM: suspend debug: Waiting for 5 second(s). >> [438623.792166] XH: 22 from kthread >> [438623.824140] XH: 23 from workqueue >> >> >> So BOs definitely can be in use during suspend. >> Even if kthread or workqueue can be stopped with one special kernel >> config. I think suspend can only stop the workqueue with its callback >> finish. >> otherwise something like below makes things crazy. >> LOCK BO >> do something >>     -> schedule or wait, anycode might sleep.  Stopped by suspend >> now? no, i think. >> UNLOCK BO >> >> I do tests  with  cmds below. >> echo devices  > /sys/power/pm_test >> echo 0  > /sys/power/pm_async >> echo 1  > /sys/power/pm_print_times >> echo 1 > /sys/power/pm_debug_messages >> echo 1 > /sys/module/amdgpu/parameters/debug_evictions >> ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest >> pm-suspend >> >> thanks >> xinhui >> >> >> ------------------------------------------------------------------------ >> *发件人:* Christian König >> *发送时间:* 2023年9月12日 17:01 >> *收件人:* Pan, Xinhui ; >> amd-gfx@lists.freedesktop.org >> *抄送:* Deucher, Alexander ; Koenig, >> Christian ; Fan, Shikang >> *主题:* Re: [PATCH] drm/amdgpu: Ignore first evction failure during >> suspend >> When amdgpu_device_suspend() is called processes should be frozen >> already. In other words KFD queues etc... should already be idle. >> >> So when the eviction fails here we missed something previously and that >> in turn can cause tons amount of problems. >> >> So ignoring those errors is most likely not a good idea at all. >> >> Regards, >> Christian. >> >> Am 12.09.23 um 02:21 schrieb Pan, Xinhui: >> > [AMD Official Use Only - General] >> > >> > Oh yep, Pinned BO is moved to other LRU list, So eviction fails >> because of other reason. >> > I will change the comments in the patch. >> > The problem is eviction fails as many reasons, say, BO is locked. >> > ASAIK, kfd will stop the queues and flush some evict/restore work >> in its suspend callback. SO the first eviction before kfd callback >> likely fails. >> > >> > -----Original Message----- >> > From: Christian König >> > Sent: Friday, September 8, 2023 2:49 PM >> > To: Pan, Xinhui ; amd-gfx@lists.freedesktop.org >> > Cc: Deucher, Alexander ; Koenig, >> Christian ; Fan, Shikang >> > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure >> during suspend >> > >> > Am 08.09.23 um 05:39 schrieb xinhui pan: >> >> Some BOs might be pinned. So the first eviction's failure will abort >> >> the suspend sequence. These pinned BOs will be unpined afterwards >> >> during suspend. >> > That doesn't make much sense since pinned BOs don't cause eviction >> failure here. >> > >> > What exactly is the error code you see? >> > >> > Christian. >> > >> >> Actaully it has evicted most BOs, so that should stil work fine in >> >> sriov full access mode. >> >> >> >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call >> >> during device_suspend.") >> >> Signed-off-by: xinhui pan >> >> --- >> >>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++---- >> >>    1 file changed, 5 insertions(+), 4 deletions(-) >> >> >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> >> index 5c0e2b766026..39af526cdbbe 100644 >> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> >> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device >> >> *dev, bool fbcon) >> >> >> >>        adev->in_suspend = true; >> >> >> >> -     /* Evict the majority of BOs before grabbing the full access */ >> >> -     r = amdgpu_device_evict_resources(adev); >> >> -     if (r) >> >> -             return r; >> >> +     /* Try to evict the majority of BOs before grabbing the full >> access >> >> +      * Ignore the ret val at first place as we will unpin some >> BOs if any >> >> +      * afterwards. >> >> +      */ >> >> + (void)amdgpu_device_evict_resources(adev); >> >> >> >>        if (amdgpu_sriov_vf(adev)) { >> >> amdgpu_virt_fini_data_exchange(adev); >> >