From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: Pekka Paalanen <ppaalanen@gmail.com>,
Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: daniel.vetter@ffwll.ch, michel@daenzer.net,
dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 0/6] RFC Support hot device unplug in amdgpu
Date: Mon, 11 May 2020 14:29:36 +0200 [thread overview]
Message-ID: <649674e1-df76-8f93-c52d-e6b52d9a3273@gmail.com> (raw)
In-Reply-To: <20200511122600.0adb4494@eldfell.localdomain>
On 11.05.20 at 11:26, Pekka Paalanen wrote:
> On Sat, 9 May 2020 14:51:44 -0400
> Andrey Grodzovsky <andrey.grodzovsky@amd.com> wrote:
>
>> This RFC is more of a proof of concept than a fully working
>> solution, as there are a few unresolved issues we are hoping to get
>> advice on from people on the mailing list. Until now, extracting a
>> card, either physically (e.g. an eGPU with a Thunderbolt connection)
>> or by emulation through sysfs
>> (/sys/bus/pci/devices/device_id/remove), would cause random crashes
>> in user apps. These crashes were mostly due to an app that had
>> mapped a device-backed BO into its address space still trying to
>> access the BO while the backing device was gone. To address this
>> first problem, Christian suggested fixing the handling of mapped
>> memory in the clients when the device goes away by forcibly
>> unmapping all buffers the user processes hold, clearing their
>> respective VMAs mapping the device BOs. Then, when the VMAs try to
>> fill in the page tables again, we check in the fault handler whether
>> the device has been removed and, if so, return an error. This
>> generates a SIGBUS in the application, which can then terminate
>> cleanly. This was indeed done, but it in turn created kernel OOPSes:
>> while the app was terminating because of the SIGBUS, it would
>> trigger a use-after-free in the driver by accessing device
>> structures that had already been released by the PCI remove
>> sequence. We handled this by introducing a 'flush' sequence during
>> device removal, where we wait for the drm file reference count to
>> drop to 0, meaning all user clients directly using this device have
>> terminated. With this I was able to cleanly emulate device unplug
>> with X and glxgears running, and later emulate plugging the device
>> back and restarting X and glxgears.
>>
>> But this covers only part of the picture. As I see it, the full set
>> of use cases, and the questions they raise, are as follows:
>>
>> 1) Application accesses a BO by opening a drm file
>> 1.1) BO is mapped into the application's address space (BO is CPU visible) - this one we have a solution for: invalidate the BO's CPU mapping, causing SIGBUS
>> and termination, and wait for the drm file refcount to drop to 0 before releasing the device
>> 1.2) BO is not mapped into the application's address space (BO is CPU invisible) - no solution yet, because how do we force the application to terminate in this case?
>>
>> 2) Application accesses a BO by importing a DMA-BUF
>> 2.1) BO is mapped into the application's address space (BO is CPU visible) - same solution as 1.1, but instead of waiting for the drm file release we wait for the
>> imported dma-buf's file release
>> 2.2) BO is not mapped into the application's address space (BO is CPU invisible) - our solution is to invalidate the GPUVM page tables and destroy the backing
>> storage of all exported BOs, which will in turn cause VM faults in the importing device; then, when the importing driver tries to re-attach the imported BO to
>> update the mappings, we return -ENODEV from the import hook, which will hopefully cause the user app to terminate.
>>
>> 3) Application opens a drm file or imports a dma-buf and holds a reference, but never accesses any BO, or does access it but never again after the device was
>> unplugged - how would we force this application to terminate before proceeding with the device removal code? Otherwise the wait in pci remove just hangs forever.
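[The fault-handler idea behind case 1.1 can be sketched in miniature. The types and return codes below are simplified userspace stand-ins for the kernel's vm_fault machinery, not the actual patches:]

```c
/* Sketch of the 1.1 approach: after unplug the BO VMAs are cleared, so
 * the next CPU access faults; the fault handler checks whether the
 * device is still present before refilling the page tables. */
#include <stdbool.h>

#define FAULT_SIGBUS 1           /* stand-in for VM_FAULT_SIGBUS */
#define FAULT_NOPAGE 0           /* stand-in for VM_FAULT_NOPAGE */

struct dev_state { bool removed; };

/* Called when an application touches a BO mapping whose PTEs were
 * cleared on unplug.  If the device is gone, refuse to refill them,
 * so the faulting task gets SIGBUS and can terminate cleanly. */
int bo_fault(const struct dev_state *dev)
{
    if (dev->removed)
        return FAULT_SIGBUS;
    /* ... otherwise re-insert the PFNs backing the BO page here ... */
    return FAULT_NOPAGE;
}
```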
>>
>> The attached patches address 1.1, 2.1 and 2.2. For now only 1.1 is fully tested and I am still testing the others, but I will be happy for any advice on all the
>> described use cases, and maybe some alternative and better (more generic) approach, such as obtaining the PIDs of the relevant processes through some reverse
>> mapping from the device file and exported dma-buf files and sending them SIGKILL - would this make more sense, or is there any other method?
>>
>> Patches 1-3 address 1.1
>> Patch 4 addresses 2.1
>> Patches 5-6 address 2.2
>>
>> Reference: https://gitlab.freedesktop.org/drm/amd/-/issues/1081
> Hi,
>
> how did you come up with the goal "make applications terminate"? Is
> that your end goal, or is it just step 1 of many on the road of
> supporting device hot-unplug?
>
> Why do you want to terminate also applications that don't "need" to
> terminate? Why hunt them down? I'm referring to your points 1.2, 2.2
> and 3.
Yeah, that is a known limitation. For now the whole idea is to terminate
the programs using the device as soon as possible.
> From an end user perspective, I believe making applications terminate
> is not helpful at all. Your display server still disappears, which
> means all your apps are forced to quit, and you lose your desktop. I do
> understand that a graceful termination is better than a hard lockup,
> but not much.
This is not for a desktop use case at all.
Regards,
Christian.
>
> When I've talked about DRM device hot-unplug with Daniel Vetter, our
> shared opinion seems to be that the unplug should not outright kill any
> programs that are prepared to handle errors, that is, functions or
> ioctls that return a success code can return an error, and then it is
> up for the application to decide how to handle that. The end goal must
> not be to terminate all applications that had something to do with the
> device. At the very least the display server must survive.
>
> The rough idea on how that should work is that DRM ioctls start
> returning errors and all mmaps are replaced with something harmless
> that does not cause a SIGBUS. Userspace can handle the errors if it
> wants to, and display servers will react to the device removed uevent
> if not earlier.
>
> Why deliberately avoid raising SIGBUS? Because it is such a huge pain
> to handle due to the archaic design of how signals are delivered. Most
> of existing userspace is also not prepared to handle SIGBUS anywhere.
>
> The problem with handling SIGBUS at all is that a process can only
> have a single signal handler per signal, but the process may be
> composed of multiple components that cannot cooperate on signal
> catching: Mesa GPU
> drivers, GUI toolkits, and the application itself may all do some
> things that would require handling SIGBUS if removing a DRM device
> raised it. For Mesa to cooperate with SIGBUS handling with the other
> components in the process, we'd need some whole new APIs, an EGL
> extension and maybe Vulkan extension too. The process may also have
> threads, which are really painful with signals. What if you need to
> handle the SIGBUS differently in different threads?
>
> Hence, mmaps should be replaced with something harmless, maybe
> something that reads back all zeros and ignores writes. The application
> will learn later that the DRM device is gone. Sending it a SIGBUS on
> the spot when it accesses an mmap does not help: the memory is gone
> already - if you didn't have a backup of the contents, you're not going
> to make one now.
>
> My point here is, are you designing things to specifically only
> terminate processes, or will you leave room in the design to improve the
> implementation towards a proper handling of DRM device hot-unplug?
>
>
> Thanks,
> pq
>
>
>> Andrey Grodzovsky (6):
>> drm/ttm: Add unampping of the entire device address space
>> drm/amdgpu: Force unmap all user VMAs on device removal.
>> drm/amdgpu: Wait for all user clients
>> drm/amdgpu: Wait for all clients importing out dma-bufs.
>> drm/ttm: Add destroy flag in TTM BO eviction interface
>> drm/amdgpu: Use TTM MMs destroy interface
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 2 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 27 ++++++++++++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 22 ++++++++--
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 9 +++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 17 +++++++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 +
>> drivers/gpu/drm/nouveau/nouveau_drm.c | 2 +-
>> drivers/gpu/drm/qxl/qxl_object.c | 4 +-
>> drivers/gpu/drm/radeon/radeon_object.c | 2 +-
>> drivers/gpu/drm/ttm/ttm_bo.c | 63 +++++++++++++++++++++--------
>> drivers/gpu/drm/vmwgfx/vmwgfx_drv.c | 6 +--
>> include/drm/ttm/ttm_bo_api.h | 2 +-
>> include/drm/ttm/ttm_bo_driver.h | 2 +
>> 16 files changed, 139 insertions(+), 34 deletions(-)
>>