All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
@ 2026-01-12 18:29 Mario Limonciello
  2026-03-07 12:49 ` Mario Limonciello
  0 siblings, 1 reply; 11+ messages in thread
From: Mario Limonciello @ 2026-01-12 18:29 UTC (permalink / raw)
  To: amd-gfx; +Cc: Mario Limonciello, Felix Kuehling, Kent Russell, Xiaogang.chen

When a surprise unplug occurs while a process has active KFD queues,
userspace never gets a chance to call kfd_ioctl_destroy_queue() to
properly clean them up. This leads to a WARN_ON in uninitialize()
complaining about active_queue_count or processes_count being non-zero.

The issue is that during surprise unplug:
1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
2. It calls amdgpu_amdkfd_device_fini_sw()
3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 || 
   dqm->processes_count > 0)

The warning triggers because the queues were never destroyed - userspace
had no opportunity to clean them up before the device disappeared.

Fix this by checking for device unplug in kfd_cleanup_nodes() and
calling process_termination for each affected process before
uninitializing the DQM. This mirrors what happens during normal process
shutdown (kfd_process_notifier_release_internal), ensuring queues are
properly cleaned up even during surprise removal.

Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Cc: Xiaogang.chen@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index e9cfb80bd436..7727b66e6afb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
 	flush_workqueue(kfd->ih_wq);
 	destroy_workqueue(kfd->ih_wq);
 
+	/*
+	 * For surprise unplugs with running processes, we need to clean up
+	 * queues before uninitializing the DQM to avoid WARN in uninitialize.
+	 * This handles the case where userspace can't destroy queues normally.
+	 */
+	if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
+		struct kfd_process *p;
+		unsigned int temp;
+		int idx;
+
+		idx = srcu_read_lock(&kfd_processes_srcu);
+		hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+			int j;
+
+			for (j = 0; j < p->n_pdds; j++) {
+				struct kfd_process_device *pdd = p->pdds[j];
+
+				if (pdd->dev->kfd != kfd)
+					continue;
+
+				dev_info(kfd_device,
+					 "Terminating queues for process %d on unplugged device\n",
+					 p->lead_thread->pid);
+
+				pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
+								       &pdd->qpd);
+				pdd->already_dequeued = true;
+			}
+		}
+		srcu_read_unlock(&kfd_processes_srcu, idx);
+	}
+
 	for (i = 0; i < num_nodes; i++) {
 		knode = kfd->nodes[i];
 		device_queue_manager_uninit(knode->dqm);
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-01-12 18:29 [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes Mario Limonciello
@ 2026-03-07 12:49 ` Mario Limonciello
  2026-04-20 21:25   ` Mario Limonciello
  0 siblings, 1 reply; 11+ messages in thread
From: Mario Limonciello @ 2026-03-07 12:49 UTC (permalink / raw)
  To: Mario Limonciello, amd-gfx; +Cc: Felix Kuehling, Kent Russell, Xiaogang.chen



On 1/12/26 12:29 PM, Mario Limonciello wrote:
> When a surprise unplug occurs while a process has active KFD queues,
> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
> properly clean them up. This leads to a WARN_ON in uninitialize()
> complaining about active_queue_count or processes_count being non-zero.
> 
> The issue is that during surprise unplug:
> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
> 2. It calls amdgpu_amdkfd_device_fini_sw()
> 3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>     dqm->processes_count > 0)
> 
> The warning triggers because the queues were never destroyed - userspace
> had no opportunity to clean them up before the device disappeared.
> 
> Fix this by checking for device unplug in kfd_cleanup_nodes() and
> calling process_termination for each affected process before
> uninitializing the DQM. This mirrors what happens during normal process
> shutdown (kfd_process_notifier_release_internal), ensuring queues are
> properly cleaned up even during surprise removal.
> 
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Kent Russell <kent.russell@amd.com>
> Cc: Xiaogang.chen@amd.com
> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>

Ping?

> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
>   1 file changed, 32 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index e9cfb80bd436..7727b66e6afb 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
>   	flush_workqueue(kfd->ih_wq);
>   	destroy_workqueue(kfd->ih_wq);
>   
> +	/*
> +	 * For surprise unplugs with running processes, we need to clean up
> +	 * queues before uninitializing the DQM to avoid WARN in uninitialize.
> +	 * This handles the case where userspace can't destroy queues normally.
> +	 */
> +	if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
> +		struct kfd_process *p;
> +		unsigned int temp;
> +		int idx;
> +
> +		idx = srcu_read_lock(&kfd_processes_srcu);
> +		hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> +			int j;
> +
> +			for (j = 0; j < p->n_pdds; j++) {
> +				struct kfd_process_device *pdd = p->pdds[j];
> +
> +				if (pdd->dev->kfd != kfd)
> +					continue;
> +
> +				dev_info(kfd_device,
> +					 "Terminating queues for process %d on unplugged device\n",
> +					 p->lead_thread->pid);
> +
> +				pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
> +								       &pdd->qpd);
> +				pdd->already_dequeued = true;
> +			}
> +		}
> +		srcu_read_unlock(&kfd_processes_srcu, idx);
> +	}
> +
>   	for (i = 0; i < num_nodes; i++) {
>   		knode = kfd->nodes[i];
>   		device_queue_manager_uninit(knode->dqm);


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-03-07 12:49 ` Mario Limonciello
@ 2026-04-20 21:25   ` Mario Limonciello
  2026-04-21  3:19     ` Kuehling, Felix
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Mario Limonciello @ 2026-04-20 21:25 UTC (permalink / raw)
  To: Mario Limonciello, amd-gfx, Felix Kuehling; +Cc: Kent Russell, Xiaogang.chen



On 3/7/26 06:49, Mario Limonciello wrote:
> 
> 
> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>> When a surprise unplug occurs while a process has active KFD queues,
>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>> properly clean them up. This leads to a WARN_ON in uninitialize()
>> complaining about active_queue_count or processes_count being non-zero.
>>
>> The issue is that during surprise unplug:
>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>> 3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>     dqm->processes_count > 0)
>>
>> The warning triggers because the queues were never destroyed - userspace
>> had no opportunity to clean them up before the device disappeared.
>>
>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>> calling process_termination for each affected process before
>> uninitializing the DQM. This mirrors what happens during normal process
>> shutdown (kfd_process_notifier_release_internal), ensuring queues are
>> properly cleaned up even during surprise removal.
>>
>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>> Cc: Kent Russell <kent.russell@amd.com>
>> Cc: Xiaogang.chen@amd.com
>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> 
> Ping?
Ping?
> 
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
>>   1 file changed, 32 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/ 
>> drm/amd/amdkfd/kfd_device.c
>> index e9cfb80bd436..7727b66e6afb 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev 
>> *kfd, unsigned int num_nodes)
>>       flush_workqueue(kfd->ih_wq);
>>       destroy_workqueue(kfd->ih_wq);
>> +    /*
>> +     * For surprise unplugs with running processes, we need to clean up
>> +     * queues before uninitializing the DQM to avoid WARN in 
>> uninitialize.
>> +     * This handles the case where userspace can't destroy queues 
>> normally.
>> +     */
>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>> +        struct kfd_process *p;
>> +        unsigned int temp;
>> +        int idx;
>> +
>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>> +        hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>> +            int j;
>> +
>> +            for (j = 0; j < p->n_pdds; j++) {
>> +                struct kfd_process_device *pdd = p->pdds[j];
>> +
>> +                if (pdd->dev->kfd != kfd)
>> +                    continue;
>> +
>> +                dev_info(kfd_device,
>> +                     "Terminating queues for process %d on unplugged 
>> device\n",
>> +                     p->lead_thread->pid);
>> +
>> +                pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>> +                                       &pdd->qpd);
>> +                pdd->already_dequeued = true;
>> +            }
>> +        }
>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>> +    }
>> +
>>       for (i = 0; i < num_nodes; i++) {
>>           knode = kfd->nodes[i];
>>           device_queue_manager_uninit(knode->dqm);
> 
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-04-20 21:25   ` Mario Limonciello
@ 2026-04-21  3:19     ` Kuehling, Felix
  2026-04-21  3:21     ` Kuehling, Felix
  2026-04-21 15:00     ` Chen, Xiaogang
  2 siblings, 0 replies; 11+ messages in thread
From: Kuehling, Felix @ 2026-04-21  3:19 UTC (permalink / raw)
  To: Mario Limonciello, Mario Limonciello, amd-gfx; +Cc: Kent Russell, Xiaogang.chen

On 2026-04-20 17:25, Mario Limonciello wrote:
>
>
> On 3/7/26 06:49, Mario Limonciello wrote:
>>
>>
>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>> When a surprise unplug occurs while a process has active KFD queues,
>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>> complaining about active_queue_count or processes_count being non-zero.
>>>
>>> The issue is that during surprise unplug:
>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>> 3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>     dqm->processes_count > 0)
>>>
>>> The warning triggers because the queues were never destroyed - 
>>> userspace
>>> had no opportunity to clean them up before the device disappeared.
>>>
>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>> calling process_termination for each affected process before
>>> uninitializing the DQM. This mirrors what happens during normal process
>>> shutdown (kfd_process_notifier_release_internal), ensuring queues are
>>> properly cleaned up even during surprise removal.
>>>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Cc: Kent Russell <kent.russell@amd.com>
>>> Cc: Xiaogang.chen@amd.com
>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>
>> Ping?
> Ping?

Hi Mario,

Sorry for not responding. I think one of the reasons is that the people 
you CC'ed may not know how hot-unplug is even supposed to work with KFD. 
I think most of what the process_termination function does is 
unnecessary because the GPU is being unplugged anyway. But without it 
you would leak the MQDs. I think that's the only good reason to go 
through this. One the other hand, could this be done later when the 
process actually terminates?

I'm also worried that we have pdd->dev pointers that will probably be 
dangling after unplug. So maybe this is only the tip of the iceberg and 
we should really be cleaning up all the process-device data structures 
on unplug. Then we'd also need to make sure that all the code that loops 
over p->pdds is able to handle NULL pointers gracefully.

Regards,
   Felix


>>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
>>>   1 file changed, 32 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/ 
>>> drm/amd/amdkfd/kfd_device.c
>>> index e9cfb80bd436..7727b66e6afb 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev 
>>> *kfd, unsigned int num_nodes)
>>>       flush_workqueue(kfd->ih_wq);
>>>       destroy_workqueue(kfd->ih_wq);
>>> +    /*
>>> +     * For surprise unplugs with running processes, we need to 
>>> clean up
>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>> uninitialize.
>>> +     * This handles the case where userspace can't destroy queues 
>>> normally.
>>> +     */
>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>> +        struct kfd_process *p;
>>> +        unsigned int temp;
>>> +        int idx;
>>> +
>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>> kfd_processes) {
>>> +            int j;
>>> +
>>> +            for (j = 0; j < p->n_pdds; j++) {
>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>> +
>>> +                if (pdd->dev->kfd != kfd)
>>> +                    continue;
>>> +
>>> +                dev_info(kfd_device,
>>> +                     "Terminating queues for process %d on 
>>> unplugged device\n",
>>> +                     p->lead_thread->pid);
>>> +
>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>> +                                       &pdd->qpd);
>>> +                pdd->already_dequeued = true;
>>> +            }
>>> +        }
>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +    }
>>> +
>>>       for (i = 0; i < num_nodes; i++) {
>>>           knode = kfd->nodes[i];
>>>           device_queue_manager_uninit(knode->dqm);
>>
>>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-04-20 21:25   ` Mario Limonciello
  2026-04-21  3:19     ` Kuehling, Felix
@ 2026-04-21  3:21     ` Kuehling, Felix
  2026-04-21 15:00     ` Chen, Xiaogang
  2 siblings, 0 replies; 11+ messages in thread
From: Kuehling, Felix @ 2026-04-21  3:21 UTC (permalink / raw)
  To: Mario Limonciello, Mario Limonciello, amd-gfx; +Cc: Kent Russell, Xiaogang.chen

On 2026-04-20 17:25, Mario Limonciello wrote:
>
>
> On 3/7/26 06:49, Mario Limonciello wrote:
>>
>>
>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>> When a surprise unplug occurs while a process has active KFD queues,
>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>> complaining about active_queue_count or processes_count being non-zero.
>>>
>>> The issue is that during surprise unplug:
>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>> 3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>     dqm->processes_count > 0)
>>>
>>> The warning triggers because the queues were never destroyed - 
>>> userspace
>>> had no opportunity to clean them up before the device disappeared.
>>>
>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>> calling process_termination for each affected process before
>>> uninitializing the DQM. This mirrors what happens during normal process
>>> shutdown (kfd_process_notifier_release_internal), ensuring queues are
>>> properly cleaned up even during surprise removal.
>>>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Cc: Kent Russell <kent.russell@amd.com>
>>> Cc: Xiaogang.chen@amd.com
>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>
>> Ping?
> Ping?

Hi Mario,

Sorry for not responding. I think one of the reasons is that the people 
you CC'ed may not know how hot-unplug is even supposed to work with KFD. 
I think most of what the process_termination function does is 
unnecessary because the GPU is being unplugged anyway. But without it 
you would leak the MQDs. I think that's the only good reason to go 
through this. One the other hand, could this be done later when the 
process actually terminates?

I'm also worried that we have pdd->dev pointers that will probably be 
dangling after unplug. So maybe this is only the tip of the iceberg and 
we should really be cleaning up all the process-device data structures 
on unplug. Then we'd also need to make sure that all the code that loops 
over p->pdds is able to handle NULL pointers gracefully.

Regards,
   Felix


>>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
>>>   1 file changed, 32 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/ 
>>> drm/amd/amdkfd/kfd_device.c
>>> index e9cfb80bd436..7727b66e6afb 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev 
>>> *kfd, unsigned int num_nodes)
>>>       flush_workqueue(kfd->ih_wq);
>>>       destroy_workqueue(kfd->ih_wq);
>>> +    /*
>>> +     * For surprise unplugs with running processes, we need to 
>>> clean up
>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>> uninitialize.
>>> +     * This handles the case where userspace can't destroy queues 
>>> normally.
>>> +     */
>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>> +        struct kfd_process *p;
>>> +        unsigned int temp;
>>> +        int idx;
>>> +
>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>> kfd_processes) {
>>> +            int j;
>>> +
>>> +            for (j = 0; j < p->n_pdds; j++) {
>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>> +
>>> +                if (pdd->dev->kfd != kfd)
>>> +                    continue;
>>> +
>>> +                dev_info(kfd_device,
>>> +                     "Terminating queues for process %d on 
>>> unplugged device\n",
>>> +                     p->lead_thread->pid);
>>> +
>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>> +                                       &pdd->qpd);
>>> +                pdd->already_dequeued = true;
>>> +            }
>>> +        }
>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +    }
>>> +
>>>       for (i = 0; i < num_nodes; i++) {
>>>           knode = kfd->nodes[i];
>>>           device_queue_manager_uninit(knode->dqm);
>>
>>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-04-20 21:25   ` Mario Limonciello
  2026-04-21  3:19     ` Kuehling, Felix
  2026-04-21  3:21     ` Kuehling, Felix
@ 2026-04-21 15:00     ` Chen, Xiaogang
  2026-04-22  1:56       ` Kuehling, Felix
  2 siblings, 1 reply; 11+ messages in thread
From: Chen, Xiaogang @ 2026-04-21 15:00 UTC (permalink / raw)
  To: Mario Limonciello, Mario Limonciello, amd-gfx, Felix Kuehling
  Cc: Kent Russell


On 4/20/2026 4:25 PM, Mario Limonciello wrote:
>
>
> On 3/7/26 06:49, Mario Limonciello wrote:
>>
>>
>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>> When a surprise unplug occurs while a process has active KFD queues,
>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>> complaining about active_queue_count or processes_count being non-zero.
>>>
During hot-unplug driver sends SIGBUS signal to all processes who are 
using the unplugged device. It is expected that affected processes will 
clean their workloads when get this signal.

When a device got removed physically all sources from it will be 
removed. It is unnecessary(in theory) to clean them up. I am not 
surprised to see some software warnings due to hardware got physically 
removed since it is unexpected behavior at run time.

I think what we need worry about is if there is memory leak. Driver also 
waits when an affected device is idle(by 
kgd2kfd_check_device_idle(adev)) by checking/waiting if there is process 
still using it. If there is no process using the being removed device 
the processes should have been terminated by same process termination 
logic from driver.

Regards

Xiaogang

>>> The issue is that during surprise unplug:
>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>> 3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>     dqm->processes_count > 0)
>>>
>>> The warning triggers because the queues were never destroyed - 
>>> userspace
>>> had no opportunity to clean them up before the device disappeared.
>>>
>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>> calling process_termination for each affected process before
>>> uninitializing the DQM. This mirrors what happens during normal process
>>> shutdown (kfd_process_notifier_release_internal), ensuring queues are
>>> properly cleaned up even during surprise removal.
>>>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Cc: Kent Russell <kent.russell@amd.com>
>>> Cc: Xiaogang.chen@amd.com
>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>
>> Ping?
> Ping?
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
>>>   1 file changed, 32 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/ 
>>> drm/amd/amdkfd/kfd_device.c
>>> index e9cfb80bd436..7727b66e6afb 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev 
>>> *kfd, unsigned int num_nodes)
>>>       flush_workqueue(kfd->ih_wq);
>>>       destroy_workqueue(kfd->ih_wq);
>>> +    /*
>>> +     * For surprise unplugs with running processes, we need to 
>>> clean up
>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>> uninitialize.
>>> +     * This handles the case where userspace can't destroy queues 
>>> normally.
>>> +     */
>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>> +        struct kfd_process *p;
>>> +        unsigned int temp;
>>> +        int idx;
>>> +
>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>> kfd_processes) {
>>> +            int j;
>>> +
>>> +            for (j = 0; j < p->n_pdds; j++) {
>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>> +
>>> +                if (pdd->dev->kfd != kfd)
>>> +                    continue;
>>> +
>>> +                dev_info(kfd_device,
>>> +                     "Terminating queues for process %d on 
>>> unplugged device\n",
>>> +                     p->lead_thread->pid);
>>> +
>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>> +                                       &pdd->qpd);
>>> +                pdd->already_dequeued = true;
>>> +            }
>>> +        }
>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +    }
>>> +
>>>       for (i = 0; i < num_nodes; i++) {
>>>           knode = kfd->nodes[i];
>>>           device_queue_manager_uninit(knode->dqm);
>>
>>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-04-21 15:00     ` Chen, Xiaogang
@ 2026-04-22  1:56       ` Kuehling, Felix
  2026-04-22 15:53         ` Chen, Xiaogang
  0 siblings, 1 reply; 11+ messages in thread
From: Kuehling, Felix @ 2026-04-22  1:56 UTC (permalink / raw)
  To: Chen, Xiaogang, Mario Limonciello, Mario Limonciello, amd-gfx
  Cc: Kent Russell


On 2026-04-21 11:00, Chen, Xiaogang wrote:
>
> On 4/20/2026 4:25 PM, Mario Limonciello wrote:
>>
>>
>> On 3/7/26 06:49, Mario Limonciello wrote:
>>>
>>>
>>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>>> When a surprise unplug occurs while a process has active KFD queues,
>>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>>> complaining about active_queue_count or processes_count being 
>>>> non-zero.
>>>>
> During hot-unplug driver sends SIGBUS signal to all processes who are 
> using the unplugged device. It is expected that affected processes 
> will clean their workloads when get this signal.
>
> When a device got removed physically all sources from it will be 
> removed. It is unnecessary(in theory) to clean them up. I am not 
> surprised to see some software warnings due to hardware got physically 
> removed since it is unexpected behavior at run time.
>
> I think what we need worry about is if there is memory leak. Driver 
> also waits when an affected device is idle(by 
> kgd2kfd_check_device_idle(adev)) by checking/waiting if there is 
> process still using it. If there is no process using the being removed 
> device the processes should have been terminated by same process 
> termination logic from driver.

The problem is, that a lot of the process termination stuff happens in a 
worker thread. It can happen after the hot-unplug is already done. That 
would lead to the cleanup worker accessing pointers to device structures 
that are no longer there (or used by something else).

We'd need to ensure proper synchronization so that the process cleanup 
completes before the device unplug frees the device structures.

Regards,
   Felix



>
> Regards
>
> Xiaogang
>
>>>> The issue is that during surprise unplug:
>>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>>> 3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
>>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>>     dqm->processes_count > 0)
>>>>
>>>> The warning triggers because the queues were never destroyed - 
>>>> userspace
>>>> had no opportunity to clean them up before the device disappeared.
>>>>
>>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>>> calling process_termination for each affected process before
>>>> uninitializing the DQM. This mirrors what happens during normal 
>>>> process
>>>> shutdown (kfd_process_notifier_release_internal), ensuring queues are
>>>> properly cleaned up even during surprise removal.
>>>>
>>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>>> Cc: Kent Russell <kent.russell@amd.com>
>>>> Cc: Xiaogang.chen@amd.com
>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>
>>> Ping?
>> Ping?
>>>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 
>>>> ++++++++++++++++++++++++
>>>>   1 file changed, 32 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/ 
>>>> drm/amd/amdkfd/kfd_device.c
>>>> index e9cfb80bd436..7727b66e6afb 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev 
>>>> *kfd, unsigned int num_nodes)
>>>>       flush_workqueue(kfd->ih_wq);
>>>>       destroy_workqueue(kfd->ih_wq);
>>>> +    /*
>>>> +     * For surprise unplugs with running processes, we need to 
>>>> clean up
>>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>>> uninitialize.
>>>> +     * This handles the case where userspace can't destroy queues 
>>>> normally.
>>>> +     */
>>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>>> +        struct kfd_process *p;
>>>> +        unsigned int temp;
>>>> +        int idx;
>>>> +
>>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>>> kfd_processes) {
>>>> +            int j;
>>>> +
>>>> +            for (j = 0; j < p->n_pdds; j++) {
>>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>>> +
>>>> +                if (pdd->dev->kfd != kfd)
>>>> +                    continue;
>>>> +
>>>> +                dev_info(kfd_device,
>>>> +                     "Terminating queues for process %d on 
>>>> unplugged device\n",
>>>> +                     p->lead_thread->pid);
>>>> +
>>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>>> +                                       &pdd->qpd);
>>>> +                pdd->already_dequeued = true;
>>>> +            }
>>>> +        }
>>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>>> +    }
>>>> +
>>>>       for (i = 0; i < num_nodes; i++) {
>>>>           knode = kfd->nodes[i];
>>>>           device_queue_manager_uninit(knode->dqm);
>>>
>>>
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-04-22  1:56       ` Kuehling, Felix
@ 2026-04-22 15:53         ` Chen, Xiaogang
  2026-04-22 21:00           ` Felix Kuehling
  0 siblings, 1 reply; 11+ messages in thread
From: Chen, Xiaogang @ 2026-04-22 15:53 UTC (permalink / raw)
  To: Kuehling, Felix, Mario Limonciello, Mario Limonciello, amd-gfx
  Cc: Kent Russell


On 4/21/2026 8:56 PM, Kuehling, Felix wrote:
>
> On 2026-04-21 11:00, Chen, Xiaogang wrote:
>>
>> On 4/20/2026 4:25 PM, Mario Limonciello wrote:
>>>
>>>
>>> On 3/7/26 06:49, Mario Limonciello wrote:
>>>>
>>>>
>>>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>>>> When a surprise unplug occurs while a process has active KFD queues,
>>>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>>>> complaining about active_queue_count or processes_count being 
>>>>> non-zero.
>>>>>
>> During hot-unplug driver sends SIGBUS signal to all processes who are 
>> using the unplugged device. It is expected that affected processes 
>> will clean their workloads when get this signal.
>>
>> When a device got removed physically all sources from it will be 
>> removed. It is unnecessary(in theory) to clean them up. I am not 
>> surprised to see some software warnings due to hardware got 
>> physically removed since it is unexpected behavior at run time.
>>
>> I think what we need worry about is if there is memory leak. Driver 
>> also waits when an affected device is idle(by 
>> kgd2kfd_check_device_idle(adev)) by checking/waiting if there is 
>> process still using it. If there is no process using the being 
>> removed device the processes should have been terminated by same 
>> process termination logic from driver.
>
> The problem is, that a lot of the process termination stuff happens in 
> a worker thread. It can happen after the hot-unplug is already done. 
> That would lead to the cleanup worker accessing pointers to device 
> structures that are no longer there (or used by something else).
>
> We'd need to ensure proper synchronization so that the process cleanup 
> completes before the device unplug frees the device structures.

How about at kgd2kfd_device_exit before doing any device clean up 
check/waiting there is no any kfd process run on this 
device(kgd2kfd_check_device_idle)?

Regards

Xiaogang

>
> Regards,
>   Felix
>
>
>
>>
>> Regards
>>
>> Xiaogang
>>
>>>>> The issue is that during surprise unplug:
>>>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>>>> 3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
>>>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>>>     dqm->processes_count > 0)
>>>>>
>>>>> The warning triggers because the queues were never destroyed - 
>>>>> userspace
>>>>> had no opportunity to clean them up before the device disappeared.
>>>>>
>>>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>>>> calling process_termination for each affected process before
>>>>> uninitializing the DQM. This mirrors what happens during normal 
>>>>> process
>>>>> shutdown (kfd_process_notifier_release_internal), ensuring queues are
>>>>> properly cleaned up even during surprise removal.
>>>>>
>>>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>> Cc: Kent Russell <kent.russell@amd.com>
>>>>> Cc: Xiaogang.chen@amd.com
>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>
>>>> Ping?
>>> Ping?
>>>>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 
>>>>> ++++++++++++++++++++++++
>>>>>   1 file changed, 32 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>> b/drivers/gpu/ drm/amd/amdkfd/kfd_device.c
>>>>> index e9cfb80bd436..7727b66e6afb 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev 
>>>>> *kfd, unsigned int num_nodes)
>>>>>       flush_workqueue(kfd->ih_wq);
>>>>>       destroy_workqueue(kfd->ih_wq);
>>>>> +    /*
>>>>> +     * For surprise unplugs with running processes, we need to 
>>>>> clean up
>>>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>>>> uninitialize.
>>>>> +     * This handles the case where userspace can't destroy queues 
>>>>> normally.
>>>>> +     */
>>>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>>>> +        struct kfd_process *p;
>>>>> +        unsigned int temp;
>>>>> +        int idx;
>>>>> +
>>>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>>>> kfd_processes) {
>>>>> +            int j;
>>>>> +
>>>>> +            for (j = 0; j < p->n_pdds; j++) {
>>>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>>>> +
>>>>> +                if (pdd->dev->kfd != kfd)
>>>>> +                    continue;
>>>>> +
>>>>> +                dev_info(kfd_device,
>>>>> +                     "Terminating queues for process %d on 
>>>>> unplugged device\n",
>>>>> +                     p->lead_thread->pid);
>>>>> +
>>>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>>>> +                                       &pdd->qpd);
>>>>> +                pdd->already_dequeued = true;
>>>>> +            }
>>>>> +        }
>>>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>> +    }
>>>>> +
>>>>>       for (i = 0; i < num_nodes; i++) {
>>>>>           knode = kfd->nodes[i];
>>>>>           device_queue_manager_uninit(knode->dqm);
>>>>
>>>>
>>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-04-22 15:53         ` Chen, Xiaogang
@ 2026-04-22 21:00           ` Felix Kuehling
  2026-04-22 22:02             ` Chen, Xiaogang
  0 siblings, 1 reply; 11+ messages in thread
From: Felix Kuehling @ 2026-04-22 21:00 UTC (permalink / raw)
  To: Chen, Xiaogang, Mario Limonciello, Mario Limonciello, amd-gfx
  Cc: Kent Russell


On 2026-04-22 11:53, Chen, Xiaogang wrote:
>
> On 4/21/2026 8:56 PM, Kuehling, Felix wrote:
>>
>> On 2026-04-21 11:00, Chen, Xiaogang wrote:
>>>
>>> On 4/20/2026 4:25 PM, Mario Limonciello wrote:
>>>>
>>>>
>>>> On 3/7/26 06:49, Mario Limonciello wrote:
>>>>>
>>>>>
>>>>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>>>>> When a surprise unplug occurs while a process has active KFD queues,
>>>>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>>>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>>>>> complaining about active_queue_count or processes_count being 
>>>>>> non-zero.
>>>>>>
>>> During hot-unplug driver sends SIGBUS signal to all processes who 
>>> are using the unplugged device. It is expected that affected 
>>> processes will clean their workloads when get this signal.
>>>
>>> When a device got removed physically all sources from it will be 
>>> removed. It is unnecessary(in theory) to clean them up. I am not 
>>> surprised to see some software warnings due to hardware got 
>>> physically removed since it is unexpected behavior at run time.
>>>
>>> I think what we need worry about is if there is memory leak. Driver 
>>> also waits when an affected device is idle(by 
>>> kgd2kfd_check_device_idle(adev)) by checking/waiting if there is 
>>> process still using it. If there is no process using the being 
>>> removed device the processes should have been terminated by same 
>>> process termination logic from driver.
>>
>> The problem is, that a lot of the process termination stuff happens 
>> in a worker thread. It can happen after the hot-unplug is already 
>> done. That would lead to the cleanup worker accessing pointers to 
>> device structures that are no longer there (or used by something else).
>>
>> We'd need to ensure proper synchronization so that the process 
>> cleanup completes before the device unplug frees the device structures.
>
> How about at kgd2kfd_device_exit before doing any device clean up 
> check/waiting there is no any kfd process run on this 
> device(kgd2kfd_check_device_idle)?

Looks like this should already be happening in this call chain: 
amdgpu_device_ip_fini_early -> amdgpu_amdkfd_teardown_processes -> 
kgd2kfd_teardown_processes -> kgd2kfd_check_device_idle

Maybe whats missing at the end of kgd2kfd_teardown_processes is a 
flush_workqueue(kfd_process_wq) to make sure that all the cleanup work 
is done. After that, there should be no more process data structures 
referencing the device.

Regards,
   Felix


>
> Regards
>
> Xiaogang
>
>>
>> Regards,
>>   Felix
>>
>>
>>
>>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>>>>> The issue is that during surprise unplug:
>>>>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>>>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>>>>> 3. This leads to kfd_cleanup_nodes() -> 
>>>>>> device_queue_manager_uninit()
>>>>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>>>>     dqm->processes_count > 0)
>>>>>>
>>>>>> The warning triggers because the queues were never destroyed - 
>>>>>> userspace
>>>>>> had no opportunity to clean them up before the device disappeared.
>>>>>>
>>>>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>>>>> calling process_termination for each affected process before
>>>>>> uninitializing the DQM. This mirrors what happens during normal 
>>>>>> process
>>>>>> shutdown (kfd_process_notifier_release_internal), ensuring queues 
>>>>>> are
>>>>>> properly cleaned up even during surprise removal.
>>>>>>
>>>>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>>> Cc: Kent Russell <kent.russell@amd.com>
>>>>>> Cc: Xiaogang.chen@amd.com
>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>>
>>>>> Ping?
>>>> Ping?
>>>>>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 
>>>>>> ++++++++++++++++++++++++
>>>>>>   1 file changed, 32 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>> b/drivers/gpu/ drm/amd/amdkfd/kfd_device.c
>>>>>> index e9cfb80bd436..7727b66e6afb 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev 
>>>>>> *kfd, unsigned int num_nodes)
>>>>>>       flush_workqueue(kfd->ih_wq);
>>>>>>       destroy_workqueue(kfd->ih_wq);
>>>>>> +    /*
>>>>>> +     * For surprise unplugs with running processes, we need to 
>>>>>> clean up
>>>>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>>>>> uninitialize.
>>>>>> +     * This handles the case where userspace can't destroy 
>>>>>> queues normally.
>>>>>> +     */
>>>>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>>>>> +        struct kfd_process *p;
>>>>>> +        unsigned int temp;
>>>>>> +        int idx;
>>>>>> +
>>>>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>>>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>>>>> kfd_processes) {
>>>>>> +            int j;
>>>>>> +
>>>>>> +            for (j = 0; j < p->n_pdds; j++) {
>>>>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>>>>> +
>>>>>> +                if (pdd->dev->kfd != kfd)
>>>>>> +                    continue;
>>>>>> +
>>>>>> +                dev_info(kfd_device,
>>>>>> +                     "Terminating queues for process %d on 
>>>>>> unplugged device\n",
>>>>>> +                     p->lead_thread->pid);
>>>>>> +
>>>>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>>>>> + &pdd->qpd);
>>>>>> +                pdd->already_dequeued = true;
>>>>>> +            }
>>>>>> +        }
>>>>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>> +    }
>>>>>> +
>>>>>>       for (i = 0; i < num_nodes; i++) {
>>>>>>           knode = kfd->nodes[i];
>>>>>>           device_queue_manager_uninit(knode->dqm);
>>>>>
>>>>>
>>>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-04-22 21:00           ` Felix Kuehling
@ 2026-04-22 22:02             ` Chen, Xiaogang
  2026-04-23  1:38               ` Felix Kuehling
  0 siblings, 1 reply; 11+ messages in thread
From: Chen, Xiaogang @ 2026-04-22 22:02 UTC (permalink / raw)
  To: Felix Kuehling, Mario Limonciello, Mario Limonciello, amd-gfx
  Cc: Kent Russell


On 4/22/2026 4:00 PM, Felix Kuehling wrote:
>
> On 2026-04-22 11:53, Chen, Xiaogang wrote:
>>
>> On 4/21/2026 8:56 PM, Kuehling, Felix wrote:
>>>
>>> On 2026-04-21 11:00, Chen, Xiaogang wrote:
>>>>
>>>> On 4/20/2026 4:25 PM, Mario Limonciello wrote:
>>>>>
>>>>>
>>>>> On 3/7/26 06:49, Mario Limonciello wrote:
>>>>>>
>>>>>>
>>>>>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>>>>>> When a surprise unplug occurs while a process has active KFD 
>>>>>>> queues,
>>>>>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>>>>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>>>>>> complaining about active_queue_count or processes_count being 
>>>>>>> non-zero.
>>>>>>>
>>>> During hot-unplug driver sends SIGBUS signal to all processes who 
>>>> are using the unplugged device. It is expected that affected 
>>>> processes will clean their workloads when get this signal.
>>>>
>>>> When a device got removed physically all sources from it will be 
>>>> removed. It is unnecessary(in theory) to clean them up. I am not 
>>>> surprised to see some software warnings due to hardware got 
>>>> physically removed since it is unexpected behavior at run time.
>>>>
>>>> I think what we need worry about is if there is memory leak. Driver 
>>>> also waits when an affected device is idle(by 
>>>> kgd2kfd_check_device_idle(adev)) by checking/waiting if there is 
>>>> process still using it. If there is no process using the being 
>>>> removed device the processes should have been terminated by same 
>>>> process termination logic from driver.
>>>
>>> The problem is, that a lot of the process termination stuff happens 
>>> in a worker thread. It can happen after the hot-unplug is already 
>>> done. That would lead to the cleanup worker accessing pointers to 
>>> device structures that are no longer there (or used by something else).
>>>
>>> We'd need to ensure proper synchronization so that the process 
>>> cleanup completes before the device unplug frees the device structures.
>>
>> How about at kgd2kfd_device_exit before doing any device clean up 
>> check/waiting there is no any kfd process run on this 
>> device(kgd2kfd_check_device_idle)?
>
> Looks like this should already be happening in this call chain: 
> amdgpu_device_ip_fini_early -> amdgpu_amdkfd_teardown_processes -> 
> kgd2kfd_teardown_processes -> kgd2kfd_check_device_idle
>
> Maybe whats missing at the end of kgd2kfd_teardown_processes is a 
> flush_workqueue(kfd_process_wq) to make sure that all the cleanup work 
> is done. After that, there should be no more process data structures 
> referencing the device.
>
We send signal SIGBUS to affected kfd processes(who are using the being 
removed device). The app signal handler will be executed asynchronously. 
There is a delay for signal handler got run. If call flush_workqueue 
immediately after sent the signal the kfd_process_ref_release(from 
kfd_release->kfd_unref_process) will not be got ran immediately after 
sent signal. Then flush_workqueue will not take effect since 
kfd_process_wq is empty or no filled work item yet.

Regards

Xiaogang


> Regards,
>   Felix
>
>
>>
>> Regards
>>
>> Xiaogang
>>
>>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>
>>>>
>>>> Regards
>>>>
>>>> Xiaogang
>>>>
>>>>>>> The issue is that during surprise unplug:
>>>>>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>>>>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>>>>>> 3. This leads to kfd_cleanup_nodes() -> 
>>>>>>> device_queue_manager_uninit()
>>>>>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>>>>>     dqm->processes_count > 0)
>>>>>>>
>>>>>>> The warning triggers because the queues were never destroyed - 
>>>>>>> userspace
>>>>>>> had no opportunity to clean them up before the device disappeared.
>>>>>>>
>>>>>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>>>>>> calling process_termination for each affected process before
>>>>>>> uninitializing the DQM. This mirrors what happens during normal 
>>>>>>> process
>>>>>>> shutdown (kfd_process_notifier_release_internal), ensuring 
>>>>>>> queues are
>>>>>>> properly cleaned up even during surprise removal.
>>>>>>>
>>>>>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>>>> Cc: Kent Russell <kent.russell@amd.com>
>>>>>>> Cc: Xiaogang.chen@amd.com
>>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>
>>>>>> Ping?
>>>>> Ping?
>>>>>>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 
>>>>>>> ++++++++++++++++++++++++
>>>>>>>   1 file changed, 32 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>> b/drivers/gpu/ drm/amd/amdkfd/kfd_device.c
>>>>>>> index e9cfb80bd436..7727b66e6afb 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct 
>>>>>>> kfd_dev *kfd, unsigned int num_nodes)
>>>>>>>       flush_workqueue(kfd->ih_wq);
>>>>>>>       destroy_workqueue(kfd->ih_wq);
>>>>>>> +    /*
>>>>>>> +     * For surprise unplugs with running processes, we need to 
>>>>>>> clean up
>>>>>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>>>>>> uninitialize.
>>>>>>> +     * This handles the case where userspace can't destroy 
>>>>>>> queues normally.
>>>>>>> +     */
>>>>>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>>>>>> +        struct kfd_process *p;
>>>>>>> +        unsigned int temp;
>>>>>>> +        int idx;
>>>>>>> +
>>>>>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>>>>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>>>>>> kfd_processes) {
>>>>>>> +            int j;
>>>>>>> +
>>>>>>> +            for (j = 0; j < p->n_pdds; j++) {
>>>>>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>>>>>> +
>>>>>>> +                if (pdd->dev->kfd != kfd)
>>>>>>> +                    continue;
>>>>>>> +
>>>>>>> +                dev_info(kfd_device,
>>>>>>> +                     "Terminating queues for process %d on 
>>>>>>> unplugged device\n",
>>>>>>> +                     p->lead_thread->pid);
>>>>>>> +
>>>>>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>>>>>> + &pdd->qpd);
>>>>>>> +                pdd->already_dequeued = true;
>>>>>>> +            }
>>>>>>> +        }
>>>>>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>> +    }
>>>>>>> +
>>>>>>>       for (i = 0; i < num_nodes; i++) {
>>>>>>>           knode = kfd->nodes[i];
>>>>>>>           device_queue_manager_uninit(knode->dqm);
>>>>>>
>>>>>>
>>>>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
  2026-04-22 22:02             ` Chen, Xiaogang
@ 2026-04-23  1:38               ` Felix Kuehling
  0 siblings, 0 replies; 11+ messages in thread
From: Felix Kuehling @ 2026-04-23  1:38 UTC (permalink / raw)
  To: Chen, Xiaogang, Mario Limonciello, Mario Limonciello, amd-gfx
  Cc: Kent Russell

[-- Attachment #1: Type: text/plain, Size: 7589 bytes --]

On 2026-04-22 18:02, Chen, Xiaogang wrote:
>
> On 4/22/2026 4:00 PM, Felix Kuehling wrote:
>>
>> On 2026-04-22 11:53, Chen, Xiaogang wrote:
>>>
>>> On 4/21/2026 8:56 PM, Kuehling, Felix wrote:
>>>>
>>>> On 2026-04-21 11:00, Chen, Xiaogang wrote:
>>>>>
>>>>> On 4/20/2026 4:25 PM, Mario Limonciello wrote:
>>>>>>
>>>>>>
>>>>>> On 3/7/26 06:49, Mario Limonciello wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 1/12/26 12:29 PM, Mario Limonciello wrote:
>>>>>>>> When a surprise unplug occurs while a process has active KFD 
>>>>>>>> queues,
>>>>>>>> userspace never gets a chance to call kfd_ioctl_destroy_queue() to
>>>>>>>> properly clean them up. This leads to a WARN_ON in uninitialize()
>>>>>>>> complaining about active_queue_count or processes_count being 
>>>>>>>> non-zero.
>>>>>>>>
>>>>> During hot-unplug driver sends SIGBUS signal to all processes who 
>>>>> are using the unplugged device. It is expected that affected 
>>>>> processes will clean their workloads when get this signal.
>>>>>
>>>>> When a device got removed physically all sources from it will be 
>>>>> removed. It is unnecessary(in theory) to clean them up. I am not 
>>>>> surprised to see some software warnings due to hardware got 
>>>>> physically removed since it is unexpected behavior at run time.
>>>>>
>>>>> I think what we need worry about is if there is memory leak. 
>>>>> Driver also waits when an affected device is idle(by 
>>>>> kgd2kfd_check_device_idle(adev)) by checking/waiting if there is 
>>>>> process still using it. If there is no process using the being 
>>>>> removed device the processes should have been terminated by same 
>>>>> process termination logic from driver.
>>>>
>>>> The problem is, that a lot of the process termination stuff happens 
>>>> in a worker thread. It can happen after the hot-unplug is already 
>>>> done. That would lead to the cleanup worker accessing pointers to 
>>>> device structures that are no longer there (or used by something 
>>>> else).
>>>>
>>>> We'd need to ensure proper synchronization so that the process 
>>>> cleanup completes before the device unplug frees the device 
>>>> structures.
>>>
>>> How about at kgd2kfd_device_exit before doing any device clean up 
>>> check/waiting there is no any kfd process run on this 
>>> device(kgd2kfd_check_device_idle)?
>>
>> Looks like this should already be happening in this call chain: 
>> amdgpu_device_ip_fini_early -> amdgpu_amdkfd_teardown_processes -> 
>> kgd2kfd_teardown_processes -> kgd2kfd_check_device_idle
>>
>> Maybe whats missing at the end of kgd2kfd_teardown_processes is a 
>> flush_workqueue(kfd_process_wq) to make sure that all the cleanup 
>> work is done. After that, there should be no more process data 
>> structures referencing the device.
>>
> We send signal SIGBUS to affected kfd processes(who are using the 
> being removed device). The app signal handler will be executed 
> asynchronously. There is a delay for signal handler got run. If call 
> flush_workqueue immediately after sent the signal the 
> kfd_process_ref_release(from kfd_release->kfd_unref_process) will not 
> be got ran immediately after sent signal. Then flush_workqueue will 
> not take effect since kfd_process_wq is empty or no filled work item yet.

This loop at the end of kgd2kfd_teardown_processes waits until all 
processes using the GPU have terminated:

         /* wait all kfd processes use adev terminate */
         while (!kgd2kfd_check_device_idle(adev))
                 cond_resched();

After this time the processes are no longer in the kfd_processes_table. 
But their kfd_process_wq_release workers haven't necessarily run to 
completion yet. Adding a flush_workqueue after this should do the job.

Regards,
   Felix


>
> Regards
>
> Xiaogang
>
>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>>
>>>>>
>>>>> Regards
>>>>>
>>>>> Xiaogang
>>>>>
>>>>>>>> The issue is that during surprise unplug:
>>>>>>>> 1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
>>>>>>>> 2. It calls amdgpu_amdkfd_device_fini_sw()
>>>>>>>> 3. This leads to kfd_cleanup_nodes() -> 
>>>>>>>> device_queue_manager_uninit()
>>>>>>>> 4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
>>>>>>>>     dqm->processes_count > 0)
>>>>>>>>
>>>>>>>> The warning triggers because the queues were never destroyed - 
>>>>>>>> userspace
>>>>>>>> had no opportunity to clean them up before the device disappeared.
>>>>>>>>
>>>>>>>> Fix this by checking for device unplug in kfd_cleanup_nodes() and
>>>>>>>> calling process_termination for each affected process before
>>>>>>>> uninitializing the DQM. This mirrors what happens during normal 
>>>>>>>> process
>>>>>>>> shutdown (kfd_process_notifier_release_internal), ensuring 
>>>>>>>> queues are
>>>>>>>> properly cleaned up even during surprise removal.
>>>>>>>>
>>>>>>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>>>>>>> Cc: Kent Russell <kent.russell@amd.com>
>>>>>>>> Cc: Xiaogang.chen@amd.com
>>>>>>>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
>>>>>>>
>>>>>>> Ping?
>>>>>> Ping?
>>>>>>>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 
>>>>>>>> ++++++++++++++++++++++++
>>>>>>>>   1 file changed, 32 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
>>>>>>>> b/drivers/gpu/ drm/amd/amdkfd/kfd_device.c
>>>>>>>> index e9cfb80bd436..7727b66e6afb 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> @@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct 
>>>>>>>> kfd_dev *kfd, unsigned int num_nodes)
>>>>>>>>       flush_workqueue(kfd->ih_wq);
>>>>>>>>       destroy_workqueue(kfd->ih_wq);
>>>>>>>> +    /*
>>>>>>>> +     * For surprise unplugs with running processes, we need to 
>>>>>>>> clean up
>>>>>>>> +     * queues before uninitializing the DQM to avoid WARN in 
>>>>>>>> uninitialize.
>>>>>>>> +     * This handles the case where userspace can't destroy 
>>>>>>>> queues normally.
>>>>>>>> +     */
>>>>>>>> +    if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
>>>>>>>> +        struct kfd_process *p;
>>>>>>>> +        unsigned int temp;
>>>>>>>> +        int idx;
>>>>>>>> +
>>>>>>>> +        idx = srcu_read_lock(&kfd_processes_srcu);
>>>>>>>> +        hash_for_each_rcu(kfd_processes_table, temp, p, 
>>>>>>>> kfd_processes) {
>>>>>>>> +            int j;
>>>>>>>> +
>>>>>>>> +            for (j = 0; j < p->n_pdds; j++) {
>>>>>>>> +                struct kfd_process_device *pdd = p->pdds[j];
>>>>>>>> +
>>>>>>>> +                if (pdd->dev->kfd != kfd)
>>>>>>>> +                    continue;
>>>>>>>> +
>>>>>>>> +                dev_info(kfd_device,
>>>>>>>> +                     "Terminating queues for process %d on 
>>>>>>>> unplugged device\n",
>>>>>>>> +                     p->lead_thread->pid);
>>>>>>>> +
>>>>>>>> + pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
>>>>>>>> + &pdd->qpd);
>>>>>>>> +                pdd->already_dequeued = true;
>>>>>>>> +            }
>>>>>>>> +        }
>>>>>>>> +        srcu_read_unlock(&kfd_processes_srcu, idx);
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>       for (i = 0; i < num_nodes; i++) {
>>>>>>>>           knode = kfd->nodes[i];
>>>>>>>> device_queue_manager_uninit(knode->dqm);
>>>>>>>
>>>>>>>
>>>>>>

[-- Attachment #2: Type: text/html, Size: 14765 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-04-23  1:38 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-12 18:29 [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes Mario Limonciello
2026-03-07 12:49 ` Mario Limonciello
2026-04-20 21:25   ` Mario Limonciello
2026-04-21  3:19     ` Kuehling, Felix
2026-04-21  3:21     ` Kuehling, Felix
2026-04-21 15:00     ` Chen, Xiaogang
2026-04-22  1:56       ` Kuehling, Felix
2026-04-22 15:53         ` Chen, Xiaogang
2026-04-22 21:00           ` Felix Kuehling
2026-04-22 22:02             ` Chen, Xiaogang
2026-04-23  1:38               ` Felix Kuehling

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.