* [PATCH] drm/amdkfd: Skip locking KFD when unbinding GPU
From: Lawrence Yiu @ 2023-11-06  7:14 UTC
To: amd-gfx, Felix.Kuehling
Cc: alexander.deucher, Xinhui.Pan, christian.koenig, Lawrence Yiu

After unbinding a GPU, KFD becomes locked and unusable, resulting in
applications not being able to use ROCm for compute anymore and rocminfo
outputting the following error message:

ROCk module is loaded
Unable to open /dev/kfd read-write: Invalid argument

KFD remains locked even after rebinding the same GPU and a system reboot
is required to unlock it. Fix this by not locking KFD during the GPU
unbind process.

Closes: https://github.com/RadeonOpenCompute/ROCm/issues/629
Signed-off-by: Lawrence Yiu <lawyiu.dev@gmail.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 0a9cf9dfc224..c9436039e619 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -949,8 +949,8 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
 	if (!kfd->init_complete)
 		return;
 
-	/* for runtime suspend, skip locking kfd */
-	if (!run_pm) {
+	/* for runtime suspend or GPU unbind, skip locking kfd */
+	if (!run_pm && !drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
 		mutex_lock(&kfd_processes_mutex);
 		count = ++kfd_locked;
 		mutex_unlock(&kfd_processes_mutex);
-- 
2.34.1

* Re: [PATCH] drm/amdkfd: Skip locking KFD when unbinding GPU
From: Felix Kuehling @ 2023-11-06 23:10 UTC
To: Lawrence Yiu, amd-gfx
Cc: alexander.deucher, Xinhui.Pan, christian.koenig

On 2023-11-06 2:14, Lawrence Yiu wrote:
> After unbinding a GPU, KFD becomes locked and unusable, resulting in
> applications not being able to use ROCm for compute anymore and rocminfo
> outputting the following error message:
>
> ROCk module is loaded
> Unable to open /dev/kfd read-write: Invalid argument
>
> KFD remains locked even after rebinding the same GPU and a system reboot
> is required to unlock it. Fix this by not locking KFD during the GPU
> unbind process.
>
> Closes: https://github.com/RadeonOpenCompute/ROCm/issues/629
> Signed-off-by: Lawrence Yiu <lawyiu.dev@gmail.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 0a9cf9dfc224..c9436039e619 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -949,8 +949,8 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>  	if (!kfd->init_complete)
>  		return;
>
> -	/* for runtime suspend, skip locking kfd */
> -	if (!run_pm) {
> +	/* for runtime suspend or GPU unbind, skip locking kfd */
> +	if (!run_pm && !drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>  		mutex_lock(&kfd_processes_mutex);
>  		count = ++kfd_locked;

This lock is meant to prevent new KFD processes from starting while a
GPU reset or suspend/resume is in progress. Just below it also suspends
the user mode queues of all processes to ensure the GPUs are idle before
suspending. It sounds like this is not applicable to the hot-unplug use
case. In particular, if there is no matching kgd2kfd_resume call, that
would lead to the symptom you describe, where KFD just gets stuck forever.

What's the semantics of GPU hot unplug? Is it more like a GPU reset or
more like runtime-PM? In other words, do we need to notify processes
when a GPU goes away, or is there some other mechanism that ensures a
GPU is idle before being unplugged?

If it's more like runtime PM, then simply call kgd2kfd_suspend with
run_pm=true.

If it's more like a GPU reset, you can't just remove this lock. User
mode won't be aware and will try to continue using the GPU. In the best
case applications will just soft hang. Instead you should probably
replace the kgd2kfd_suspend call with calls to kgd2kfd_pre_reset and
kgd2kfd_post_reset. That would idle the affected GPU, notify user mode
processes using the GPU that something is wrong, and resume all the GPUs
again. You'd need to be careful about the sequence between actual unplug
and post_reset. Not sure if post_reset would need changes to avoid
failing on the removed GPU.

Regards,
  Felix

>  		mutex_unlock(&kfd_processes_mutex);
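
The counter imbalance described in the reply above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the amdkfd code: every `model_` name is hypothetical, the decrement on resume is inferred from the remark about a "matching kgd2kfd_resume call" (the patch does not show that function), and the refusal to open merely stands in for the "Unable to open /dev/kfd" symptom.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; stands in for the kfd_locked counter in the patch. */
static int model_locked;

/* Models the locking branch of kgd2kfd_suspend(): the counter is only
 * bumped when run_pm is false, as in the pre-patch code. */
static void model_suspend(bool run_pm)
{
	if (!run_pm)
		model_locked++;
}

/* Assumed counterpart on resume: undo the increment taken at suspend. */
static void model_resume(bool run_pm)
{
	if (!run_pm) {
		assert(model_locked > 0); /* a resume needs an earlier suspend */
		model_locked--;
	}
}

/* Models opening /dev/kfd: refused while any lock is outstanding. */
static bool model_open_allowed(void)
{
	return model_locked == 0;
}
```

In this model a suspend followed by a resume leaves `model_open_allowed()` true, whereas a suspend with `run_pm` false and no resume (the unbind path as described in the thread) leaves it false for good, which matches the reported need for a reboot.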

* Re: [PATCH] drm/amdkfd: Skip locking KFD when unbinding GPU
From: Alex Deucher @ 2023-11-07 22:03 UTC
To: Felix Kuehling
Cc: alexander.deucher, Xinhui.Pan, christian.koenig, amd-gfx, Lawrence Yiu

On Mon, Nov 6, 2023 at 6:17 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>
> On 2023-11-06 2:14, Lawrence Yiu wrote:
> > After unbinding a GPU, KFD becomes locked and unusable, resulting in
> > applications not being able to use ROCm for compute anymore and rocminfo
> > outputting the following error message:
> >
> > ROCk module is loaded
> > Unable to open /dev/kfd read-write: Invalid argument
> >
> > KFD remains locked even after rebinding the same GPU and a system reboot
> > is required to unlock it. Fix this by not locking KFD during the GPU
> > unbind process.
> >
> > Closes: https://github.com/RadeonOpenCompute/ROCm/issues/629
> > Signed-off-by: Lawrence Yiu <lawyiu.dev@gmail.com>
> > ---
> >  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > index 0a9cf9dfc224..c9436039e619 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> > @@ -949,8 +949,8 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
> >  	if (!kfd->init_complete)
> >  		return;
> >
> > -	/* for runtime suspend, skip locking kfd */
> > -	if (!run_pm) {
> > +	/* for runtime suspend or GPU unbind, skip locking kfd */
> > +	if (!run_pm && !drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
> >  		mutex_lock(&kfd_processes_mutex);
> >  		count = ++kfd_locked;
>
> This lock is meant to prevent new KFD processes from starting while a
> GPU reset or suspend/resume is in progress. Just below it also suspends
> the user mode queues of all processes to ensure the GPUs are idle before
> suspending. It sounds like this is not applicable to the hot-unplug use
> case. In particular, if there is no matching kgd2kfd_resume call, that
> would lead to the symptom you describe, where KFD just gets stuck forever.
>
> What's the semantics of GPU hot unplug? Is it more like a GPU reset or
> more like runtime-PM? In other words, do we need to notify processes
> when a GPU goes away, or is there some other mechanism that ensures a
> GPU is idle before being unplugged?
>

It's a separate PCI entry point (remove() in this case). From a
driver perspective we quiesce any outstanding DMA and then tear down
the driver. It's the same whether you are actually physically
hotplugging the device or just unbinding the driver from the device.

Alex

> If it's more like runtime PM, then simply call kgd2kfd_suspend with
> run_pm=true.
>
> If it's more like a GPU reset, you can't just remove this lock. User
> mode won't be aware and will try to continue using the GPU. In the best
> case applications will just soft hang. Instead you should probably
> replace the kgd2kfd_suspend call with calls to kgd2kfd_pre_reset and
> kgd2kfd_post_reset. That would idle the affected GPU, notify user mode
> processes using the GPU that something is wrong, and resume all the GPUs
> again. You'd need to be careful about the sequence between actual unplug
> and post_reset. Not sure if post_reset would need changes to avoid
> failing on the removed GPU.
>
> Regards,
>   Felix
>
>
> >  		mutex_unlock(&kfd_processes_mutex);

* Re: [PATCH] drm/amdkfd: Skip locking KFD when unbinding GPU
From: Felix Kuehling @ 2023-11-07 22:16 UTC
To: Alex Deucher
Cc: alexander.deucher, Xinhui.Pan, christian.koenig, amd-gfx, Lawrence Yiu

On 2023-11-07 17:03, Alex Deucher wrote:
> On Mon, Nov 6, 2023 at 6:17 PM Felix Kuehling <felix.kuehling@amd.com> wrote:
>> On 2023-11-06 2:14, Lawrence Yiu wrote:
>>> After unbinding a GPU, KFD becomes locked and unusable, resulting in
>>> applications not being able to use ROCm for compute anymore and rocminfo
>>> outputting the following error message:
>>>
>>> ROCk module is loaded
>>> Unable to open /dev/kfd read-write: Invalid argument
>>>
>>> KFD remains locked even after rebinding the same GPU and a system reboot
>>> is required to unlock it. Fix this by not locking KFD during the GPU
>>> unbind process.
>>>
>>> Closes: https://github.com/RadeonOpenCompute/ROCm/issues/629
>>> Signed-off-by: Lawrence Yiu <lawyiu.dev@gmail.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 0a9cf9dfc224..c9436039e619 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -949,8 +949,8 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>>>  	if (!kfd->init_complete)
>>>  		return;
>>>
>>> -	/* for runtime suspend, skip locking kfd */
>>> -	if (!run_pm) {
>>> +	/* for runtime suspend or GPU unbind, skip locking kfd */
>>> +	if (!run_pm && !drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>>  		mutex_lock(&kfd_processes_mutex);
>>>  		count = ++kfd_locked;
>> This lock is meant to prevent new KFD processes from starting while a
>> GPU reset or suspend/resume is in progress. Just below it also suspends
>> the user mode queues of all processes to ensure the GPUs are idle before
>> suspending. It sounds like this is not applicable to the hot-unplug use
>> case. In particular, if there is no matching kgd2kfd_resume call, that
>> would lead to the symptom you describe, where KFD just gets stuck forever.
>>
>> What's the semantics of GPU hot unplug? Is it more like a GPU reset or
>> more like runtime-PM? In other words, do we need to notify processes
>> when a GPU goes away, or is there some other mechanism that ensures a
>> GPU is idle before being unplugged?
>>
> It's a separate PCI entry point (remove() in this case). From a
> driver perspective we quiesce any outstanding DMA and then tear down
> the driver. It's the same whether you are actually physically
> hotplugging the device or just unbinding the driver from the device.

It sounds like we should treat it like a GPU reset for KFD, where we
notify user mode that the context is gone. Except that between pre-reset
and post-reset the topology changes, so we don't bring the removed GPU
back up. That may require some non-trivial changes in a bunch of places,
if the kfd_process_device data structures still refer to a device that
no longer exists.

Regards,
  Felix

>
> Alex
>
>> If it's more like runtime PM, then simply call kgd2kfd_suspend with
>> run_pm=true.
>>
>> If it's more like a GPU reset, you can't just remove this lock. User
>> mode won't be aware and will try to continue using the GPU. In the best
>> case applications will just soft hang. Instead you should probably
>> replace the kgd2kfd_suspend call with calls to kgd2kfd_pre_reset and
>> kgd2kfd_post_reset. That would idle the affected GPU, notify user mode
>> processes using the GPU that something is wrong, and resume all the GPUs
>> again. You'd need to be careful about the sequence between actual unplug
>> and post_reset. Not sure if post_reset would need changes to avoid
>> failing on the removed GPU.
>>
>> Regards,
>>   Felix
>>
>>
>>>  		mutex_unlock(&kfd_processes_mutex);
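
The reset-style flow proposed in the last reply (idle and notify before the unplug, then skip the removed GPU when bringing the others back) might look roughly like the toy model below. It is a sketch under assumptions only: the `model_` names and the `struct model_gpu` fields are hypothetical, nothing here calls or reproduces kgd2kfd_pre_reset or kgd2kfd_post_reset, and the thread itself leaves open whether post_reset would need changes in the real driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-GPU state for the model. */
struct model_gpu {
	bool present;       /* false once the device has been removed */
	bool queues_active; /* user mode queues currently running */
	bool ctx_lost;      /* user mode has been told the context is gone */
};

/* Stand-in for the pre-reset step: idle the GPU and flag context loss. */
static void model_pre_reset(struct model_gpu *gpu)
{
	gpu->queues_active = false;
	gpu->ctx_lost = true;
}

/* Stand-in for the unplug itself: the topology loses this device. */
static void model_remove(struct model_gpu *gpu)
{
	gpu->present = false;
}

/* Stand-in for the post-reset step: resume only devices still present.
 * Returns how many GPUs were resumed. */
static size_t model_post_reset(struct model_gpu *gpus, size_t n)
{
	size_t resumed = 0;

	assert(gpus != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!gpus[i].present)
			continue; /* assumption: a removed GPU is skipped */
		gpus[i].queues_active = true;
		resumed++;
	}
	return resumed;
}
```

With two GPUs, calling `model_pre_reset` and `model_remove` on the first and then `model_post_reset` on both resumes only the second, while the first keeps `ctx_lost` set so that its users can be told the context is gone.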