From: Felix Kuehling <felix.kuehling@amd.com>
To: Lawrence Yiu <lawyiu.dev@gmail.com>, amd-gfx@lists.freedesktop.org
Cc: alexander.deucher@amd.com, Xinhui.Pan@amd.com, christian.koenig@amd.com
Subject: Re: [PATCH] drm/amdkfd: Skip locking KFD when unbinding GPU
Date: Mon, 6 Nov 2023 18:10:29 -0500
Message-ID: <4db20aad-c19f-4adf-ba13-97acbdb6ba16@amd.com>
In-Reply-To: <20231106071405.121981-1-lawyiu.dev@gmail.com>

On 2023-11-06 2:14, Lawrence Yiu wrote:
> After unbinding a GPU, KFD becomes locked and unusable, resulting in
> applications not being able to use ROCm for compute anymore and rocminfo
> outputting the following error message:
>
> ROCk module is loaded
> Unable to open /dev/kfd read-write: Invalid argument
>
> KFD remains locked even after rebinding the same GPU and a system reboot
> is required to unlock it. Fix this by not locking KFD during the GPU
> unbind process.
>
> Closes: https://github.com/RadeonOpenCompute/ROCm/issues/629
> Signed-off-by: Lawrence Yiu <lawyiu.dev@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 0a9cf9dfc224..c9436039e619 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -949,8 +949,8 @@ void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
>   	if (!kfd->init_complete)
>   		return;
>   
> -	/* for runtime suspend, skip locking kfd */
> -	if (!run_pm) {
> +	/* for runtime suspend or GPU unbind, skip locking kfd */
> +	if (!run_pm && !drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>   		mutex_lock(&kfd_processes_mutex);
>   		count = ++kfd_locked;

This lock is meant to prevent new KFD processes from starting while a 
GPU reset or suspend/resume is in progress. The code just below it also 
suspends the user mode queues of all processes to ensure the GPUs are 
idle before suspending. It sounds like this does not apply to the 
hot-unplug use case. In particular, if there is no matching 
kgd2kfd_resume call, that would lead to the symptom you describe, where 
KFD stays locked forever.
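
To make the failure mode concrete, here is a minimal user-space model of the lock counter. It is a sketch, not the kernel code: model_suspend, model_resume and model_open are hypothetical names, and the -1 returned by model_open is only a placeholder error value. The point is just that a non-runtime-PM suspend without a matching resume leaves the counter nonzero, so every later open is refused.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the global kfd_locked counter. */
static int model_locked;

/* Models the suspend path: runtime PM (run_pm == true) skips the lock. */
void model_suspend(bool run_pm)
{
	if (!run_pm)
		++model_locked;
}

/* Models the resume path: undoes exactly one earlier non-run_pm suspend. */
void model_resume(bool run_pm)
{
	if (!run_pm) {
		assert(model_locked > 0);
		--model_locked;
	}
}

/*
 * Models opening the device: refused while any lock is held.
 * Returns 0 on success, -1 (placeholder error) when locked.
 */
int model_open(void)
{
	return model_locked ? -1 : 0;
}
```

In this model, calling model_suspend(false) for an unbind and never calling model_resume(false) keeps model_open() failing, which mirrors the "stuck until reboot" behaviour described in the report.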

What are the semantics of GPU hot unplug? Is it more like a GPU reset or 
more like runtime-PM? In other words, do we need to notify processes 
when a GPU goes away, or is there some other mechanism that ensures a 
GPU is idle before being unplugged?

If it's more like runtime PM, then simply call kgd2kfd_suspend with 
run_pm=true.

If it's more like a GPU reset, you can't just remove this lock. User 
mode won't be aware and will try to continue using the GPU. In the best 
case applications will just soft hang. Instead you should probably 
replace the kgd2kfd_suspend call with calls to kgd2kfd_pre_reset and 
kgd2kfd_post_reset. That would idle the affected GPU, notify user mode 
processes using the GPU that something is wrong, and resume all the GPUs 
again. You'd need to be careful about the sequence between the actual 
unplug and post_reset. I'm not sure whether post_reset would need 
changes to avoid failing on the removed GPU.
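
The reset-style sequence can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the real kgd2kfd_pre_reset or kgd2kfd_post_reset: the structs, model_pre_reset, model_post_reset and model_lock_count are all hypothetical. It shows the ordering concern: the GPU is idled and its users are notified first, the device then disappears, and the resume step has to skip the removed GPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified state for one GPU and one process. */
struct model_gpu {
	bool present;        /* false once the device has been unplugged */
	bool queues_running; /* user mode queues active on this GPU */
};

struct model_proc {
	unsigned int gpu_mask; /* bit i set: process uses GPU i */
	bool notified;         /* process was told something went wrong */
};

int model_lock_count; /* stand-in for the KFD lock counter */

/* Reset-style entry: block new processes, idle the GPU, notify its users. */
void model_pre_reset(struct model_gpu *gpus, size_t idx,
		     struct model_proc *procs, size_t nprocs)
{
	++model_lock_count;
	gpus[idx].queues_running = false;
	for (size_t i = 0; i < nprocs; i++)
		if (procs[i].gpu_mask & (1u << idx))
			procs[i].notified = true;
}

/*
 * Reset-style exit: drop the lock and restart queues, skipping any GPU
 * that is gone so the resume step cannot fail on removed hardware.
 */
void model_post_reset(struct model_gpu *gpus, size_t ngpus)
{
	assert(model_lock_count > 0);
	--model_lock_count;
	for (size_t i = 0; i < ngpus; i++)
		gpus[i].queues_running = gpus[i].present;
}
```

Unlike simply skipping the lock, this ordering leaves the counter balanced after the unplug while processes that were using the removed GPU have been notified.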

Regards,
   Felix


>   		mutex_unlock(&kfd_processes_mutex);

Thread overview: 4+ messages
2023-11-06  7:14 [PATCH] drm/amdkfd: Skip locking KFD when unbinding GPU Lawrence Yiu
2023-11-06 23:10 ` Felix Kuehling [this message]
2023-11-07 22:03   ` Alex Deucher
2023-11-07 22:16     ` Felix Kuehling
