All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Kuehling, Felix" <felix.kuehling@amd.com>
To: Mario Limonciello <mario.limonciello@amd.com>,
	Mario Limonciello <superm1@kernel.org>,
	amd-gfx@lists.freedesktop.org
Cc: Kent Russell <kent.russell@amd.com>, Xiaogang.chen@amd.com
Subject: Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
Date: Mon, 20 Apr 2026 23:19:44 -0400	[thread overview]
Message-ID: <8fdea885-7ac0-42f0-a166-2fe2c9fe178f@amd.com> (raw)
In-Reply-To: <60e1e12d-7705-4531-ba15-c956f4d268d4@amd.com>

On 2026-04-20 17:25, Mario Limonciello wrote:
>
>
> On 3/7/26 06:49, Mario Limonciello wrote:
>>
>>
>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>> When a surprise unplug occurs while a process has active KFD queues,
>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>> complaining about active_queue_count or processes_count being non-zero.
>>>
>>> The issue is that during surprise unplug:
>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>> 3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>     dqm->processes_count > 0)
>>>
>>> The warning triggers because the queues were never destroyed - 
>>> userspace
>>> had no opportunity to clean them up before the device disappeared.
>>>
>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>> calling process_termination for each affected process before
>>> uninitializing the DQM. This mirrors what happens during normal process
>>> shutdown (kfd_process_notifier_release_internal), ensuring queues are
>>> properly cleaned up even during surprise removal.
>>>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Cc: Kent Russell <kent.russell@amd.com>
>>> Cc: Xiaogang.chen@amd.com
>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>
>> Ping?
> Ping?

Hi Mario,

Sorry for not responding. I think one of the reasons is that the people 
you CC'ed may not know how hot-unplug is even supposed to work with KFD. 
I think most of what the process_termination function does is 
unnecessary because the GPU is being unplugged anyway. But without it 
you would leak the MQDs. I think that's the only good reason to go 
through this. One the other hand, could this be done later when the 
process actually terminates?

I'm also worried that we have pdd->dev pointers that will probably be 
dangling after unplug. So maybe this is only the tip of the iceberg and 
we should really be cleaning up all the process-device data structures 
on unplug. Then we'd also need to make sure that all the code that loops 
over p->pdds is able to handle NULL pointers gracefully.

Regards,
   Felix


>>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
>>>   1 file changed, 32 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/ 
>>> drm/amd/amdkfd/kfd_device.c
>>> index e9cfb80bd436..7727b66e6afb 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev 
>>> *kfd, unsigned int num_nodes)
>>>       flush_workqueue(kfd->ih_wq);
>>>       destroy_workqueue(kfd->ih_wq);
>>> +    /*
>>> +     * For surprise unplugs with running processes, we need to 
>>> clean up
>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>> uninitialize.
>>> +     * This handles the case where userspace can't destroy queues 
>>> normally.
>>> +     */
>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>> +        struct kfd_process *p;
>>> +        unsigned int temp;
>>> +        int idx;
>>> +
>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>> kfd_processes) {
>>> +            int j;
>>> +
>>> +            for (j = 0; j < p->n_pdds; j++) {
>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>> +
>>> +                if (pdd->dev->kfd != kfd)
>>> +                    continue;
>>> +
>>> +                dev_info(kfd_device,
>>> +                     "Terminating queues for process %d on 
>>> unplugged device\n",
>>> +                     p->lead_thread->pid);
>>> +
>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>> +                                       &pdd->qpd);
>>> +                pdd->already_dequeued = true;
>>> +            }
>>> +        }
>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +    }
>>> +
>>>       for (i = 0; i < num_nodes; i++) {
>>>           knode = kfd->nodes[i];
>>>           device_queue_manager_uninit(knode->dqm);
>>
>>
>

  reply	other threads:[~2026-04-21  3:19 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-12 18:29 [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes Mario Limonciello
2026-03-07 12:49 ` Mario Limonciello
2026-04-20 21:25   ` Mario Limonciello
2026-04-21  3:19     ` Kuehling, Felix [this message]
2026-04-21  3:21     ` Kuehling, Felix
2026-04-21 15:00     ` Chen, Xiaogang
2026-04-22  1:56       ` Kuehling, Felix
2026-04-22 15:53         ` Chen, Xiaogang
2026-04-22 21:00           ` Felix Kuehling
2026-04-22 22:02             ` Chen, Xiaogang
2026-04-23  1:38               ` Felix Kuehling

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8fdea885-7ac0-42f0-a166-2fe2c9fe178f@amd.com \
    --to=felix.kuehling@amd.com \
    --cc=Xiaogang.chen@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=kent.russell@amd.com \
    --cc=mario.limonciello@amd.com \
    --cc=superm1@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.