* [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes
From: Mario Limonciello @ 2026-01-12 18:29 UTC
  To: amd-gfx; +Cc: Mario Limonciello, Felix Kuehling, Kent Russell, Xiaogang.chen

When a surprise unplug occurs while a process has active KFD queues,
userspace never gets a chance to issue the destroy-queue ioctl
(kfd_ioctl_destroy_queue()) to clean them up. uninitialize() then trips
its WARN_ON because active_queue_count or processes_count is still
non-zero.

The issue is that during surprise unplug:
1. amdgpu_device_fini_hw() checks drm_dev_is_unplugged()
2. It calls amdgpu_amdkfd_device_fini_sw()
3. This leads to kfd_cleanup_nodes() -> device_queue_manager_uninit()
4. uninitialize() has: WARN_ON(dqm->active_queue_count > 0 ||
   dqm->processes_count > 0)

The warning triggers because the queues were never destroyed - userspace
had no opportunity to clean them up before the device disappeared.

Fix this by checking for device unplug in kfd_cleanup_nodes() and
calling the DQM's process_termination() op for every process device
(pdd) that belongs to the unplugged device before uninitializing the
DQM, then marking that pdd as already_dequeued. This mirrors what
happens during normal process shutdown
(kfd_process_notifier_release_internal), ensuring queues are properly
cleaned up even during surprise removal.

Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Cc: Xiaogang.chen@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
---
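Not part of the commit message: a minimal user-space sketch of the
ordering problem described above, for reviewers who want to reason about
it without hardware. Every mock_* name is hypothetical and only stands
in for the corresponding kernel structure or function; this is not the
real KFD code, and the counter bookkeeping is an assumption about what
process_termination does to the DQM counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct device_queue_manager. */
struct mock_dqm {
	int active_queue_count;
	int processes_count;
};

/* Hypothetical stand-in for struct kfd_process_device. */
struct mock_pdd {
	struct mock_dqm *dqm;
	int queue_count;	/* queues this process still owns on the device */
	bool already_dequeued;
};

/* Models ops.process_termination(): drop the process's queues from the DQM. */
static void mock_process_termination(struct mock_pdd *pdd)
{
	if (pdd->already_dequeued)
		return;

	assert(pdd->dqm->processes_count > 0);
	pdd->dqm->active_queue_count -= pdd->queue_count;
	pdd->dqm->processes_count--;
	pdd->queue_count = 0;
	pdd->already_dequeued = true;
}

/* Models the check in uninitialize(): true where WARN_ON would fire. */
static bool mock_uninit_would_warn(const struct mock_dqm *dqm)
{
	return dqm->active_queue_count > 0 || dqm->processes_count > 0;
}

/*
 * Models kfd_cleanup_nodes(): when the device is unplugged, terminate
 * every pdd attached to this DQM before the uninit check runs.
 */
static bool mock_cleanup(struct mock_dqm *dqm, struct mock_pdd *pdds,
			 int n_pdds, bool unplugged)
{
	if (unplugged) {
		for (int i = 0; i < n_pdds; i++) {
			if (pdds[i].dqm != dqm)
				continue;
			mock_process_termination(&pdds[i]);
		}
	}

	return mock_uninit_would_warn(dqm);
}
```

With two processes holding three queues between them, mock_cleanup()
reports that the warning would fire when the unplug branch is skipped,
and that it would not fire once each pdd has been terminated first.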
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 32 ++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index e9cfb80bd436..7727b66e6afb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -664,6 +664,38 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd, unsigned int num_nodes)
 	flush_workqueue(kfd->ih_wq);
 	destroy_workqueue(kfd->ih_wq);
 
+	/*
+	 * For surprise unplugs with running processes, we need to clean up
+	 * queues before uninitializing the DQM to avoid WARN in uninitialize.
+	 * This handles the case where userspace can't destroy queues normally.
+	 */
+	if (drm_dev_is_unplugged(adev_to_drm(kfd->adev))) {
+		struct kfd_process *p;
+		unsigned int temp;
+		int idx;
+
+		idx = srcu_read_lock(&kfd_processes_srcu);
+		hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+			int j;
+
+			for (j = 0; j < p->n_pdds; j++) {
+				struct kfd_process_device *pdd = p->pdds[j];
+
+				if (pdd->dev->kfd != kfd)
+					continue;
+
+				dev_info(kfd_device,
+					 "Terminating queues for process %d on unplugged device\n",
+					 p->lead_thread->pid);
+
+				pdd->dev->dqm->ops.process_termination(pdd->dev->dqm,
+								       &pdd->qpd);
+				pdd->already_dequeued = true;
+			}
+		}
+		srcu_read_unlock(&kfd_processes_srcu, idx);
+	}
+
 	for (i = 0; i < num_nodes; i++) {
 		knode = kfd->nodes[i];
 		device_queue_manager_uninit(knode->dqm);
-- 
2.47.1

Thread overview: 11+ messages
2026-01-12 18:29 [PATCH] drm/amdkfd: Terminate queues on surprise unplug with running processes Mario Limonciello
2026-03-07 12:49 ` Mario Limonciello
2026-04-20 21:25   ` Mario Limonciello
2026-04-21  3:19     ` Kuehling, Felix
2026-04-21  3:21     ` Kuehling, Felix
2026-04-21 15:00     ` Chen, Xiaogang
2026-04-22  1:56       ` Kuehling, Felix
2026-04-22 15:53         ` Chen, Xiaogang
2026-04-22 21:00           ` Felix Kuehling
2026-04-22 22:02             ` Chen, Xiaogang
2026-04-23  1:38               ` Felix Kuehling
