AMD-GFX Archive on lore.kernel.org
* [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak
@ 2026-03-19  8:21 Prike Liang
  2026-03-19  8:21 ` [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Prike Liang
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Prike Liang @ 2026-03-19  8:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Christian.Koenig, Prike Liang

The TLB flush fence is leaked while walking over the freed
mapping list. Meanwhile, the TLB flush fence is also referenced
by the root BO's reservation object, which makes the remaining
reference a false-positive leak, so mark it as such.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c           | 3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c | 7 ++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 7ef0cb6bcbda..1cd4c4217b02 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1560,6 +1560,8 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
 		goto error_free;
 
 	while (!list_empty(&vm->freed)) {
+		struct dma_fence *old_f = f;
+
 		mapping = list_first_entry(&vm->freed,
 			struct amdgpu_bo_va_mapping, list);
 		list_del(&mapping->list);
@@ -1572,6 +1574,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
 			dma_fence_put(f);
 			goto error_free;
 		}
+		dma_fence_put(old_f);
 	}
 
 	if (fence && f) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
index 5d26797356a3..93b72289e7df 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
@@ -23,6 +23,7 @@
 
 #include <linux/dma-fence.h>
 #include <linux/workqueue.h>
+#include <linux/kmemleak.h>
 
 #include "amdgpu.h"
 #include "amdgpu_vm.h"
@@ -106,6 +107,10 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
 	/* TODO: We probably need a separate wq here */
 	dma_fence_get(&f->base);
 	schedule_work(&f->work);
-
+	/* The TLB fence is referenced by dma_resv and
+	 * the resv ref is the remaining ref, so that's
+	 * a false positive leak.
+	 */
+	kmemleak_not_leak(f);
 	*fence = &f->base;
 }
-- 
2.34.1



* [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl()
  2026-03-19  8:21 [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Prike Liang
@ 2026-03-19  8:21 ` Prike Liang
  2026-03-19  8:44   ` Christian König
  2026-03-19  8:21 ` [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock Prike Liang
  2026-03-19  8:41 ` [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Christian König
  2 siblings, 1 reply; 9+ messages in thread
From: Prike Liang @ 2026-03-19  8:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Christian.Koenig, Prike Liang

The syncobj reference and the dma_fence_chain allocation need to
be released on the ioctl exit paths.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 98276b55ad3c..f54e0fb5cb2d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -980,6 +980,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 						      timeline_chain,
 						      fence,
 						      args->vm_timeline_point);
+				timeline_chain = NULL;
 			}
 		}
 		dma_fence_put(fence);
@@ -987,6 +988,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 	}
 
 error:
+	dma_fence_chain_free(timeline_chain);
+	drm_syncobj_put(timeline_syncobj);
 	drm_exec_fini(&exec);
 error_put_gobj:
 	drm_gem_object_put(gobj);
-- 
2.34.1



* [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
  2026-03-19  8:21 [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Prike Liang
  2026-03-19  8:21 ` [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Prike Liang
@ 2026-03-19  8:21 ` Prike Liang
  2026-03-19  8:46   ` Christian König
  2026-03-20  7:56   ` Khatri, Sunil
  2026-03-19  8:41 ` [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Christian König
  2 siblings, 2 replies; 9+ messages in thread
From: Prike Liang @ 2026-03-19  8:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Christian.Koenig, Prike Liang

In the userq destroy routine the queue refcount should already
be 0 and the queue should already have been removed from the
manager list, so the list must not be touched again. Attempting
to lock the userq mutex here deadlocks, since the mutex is
already held by the eviction fence suspend work, as the lockdep
splat below shows.

[  107.881652] ============================================
[  107.881866] WARNING: possible recursive locking detected
[  107.882081] 6.19.0-custom #16 Tainted: G     U     OE
[  107.882305] --------------------------------------------
[  107.882518] kworker/15:1/158 is trying to acquire lock:
[  107.882728] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_userq_kref_destroy+0x57/0x540 [amdgpu]
[  107.883462]
               but task is already holding lock:
[  107.883701] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
[  107.884485]
               other info that might help us debug this:
[  107.884751]  Possible unsafe locking scenario:

[  107.884993]        CPU0
[  107.885100]        ----
[  107.885207]   lock(&userq_mgr->userq_mutex);
[  107.885385]   lock(&userq_mgr->userq_mutex);
[  107.885561]
                *** DEADLOCK ***

[  107.885798]  May be due to missing lock nesting notation

[  107.886069] 4 locks held by kworker/15:1/158:
[  107.886247]  #0: ffff8f2840057558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x455/0x650
[  107.886630]  #1: ffffd32f01a4fe18 ((work_completion)(&evf_mgr->suspend_work)){+.+.}-{0:0}, at: process_one_work+0x1f3/0x650
[  107.887075]  #2: ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
[  107.887799]  #3: ffffffffb8d3f700 (dma_fence_map){++++}-{0:0}, at: amdgpu_eviction_fence_suspend_worker+0x36/0xc0 [amdgpu]
[  107.888457]

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 50 +++++++++++++++++++++--
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
index bb5d572f5a3c..c7a9306a1c01 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
@@ -148,6 +148,52 @@ amdgpu_userq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr)
 	return r;
 }
 
+static int
+amdgpu_userq_perq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr,
+			struct amdgpu_usermode_queue *queue)
+{
+	struct amdgpu_device *adev = uq_mgr->adev;
+	bool gpu_reset = false;
+	int r = 0;
+
+	/* Warning if current process mutex is not held */
+	if (refcount_read(&queue->refcount.refcount))
+		WARN_ON(!mutex_is_locked(&uq_mgr->userq_mutex));
+
+	if (unlikely(adev->debug_disable_gpu_ring_reset)) {
+		dev_err(adev->dev, "userq reset disabled by debug mask\n");
+		return 0;
+	}
+
+	/*
+	 * If GPU recovery feature is disabled system-wide,
+	 * skip all reset detection logic
+	 */
+	if (!amdgpu_gpu_recovery)
+		return 0;
+
+	/*
+	 * Iterate through all queue types to detect and reset problematic queues
+	 * Process each queue type in the defined order
+	 */
+	int ring_type = queue->queue_type;
+	const struct amdgpu_userq_funcs *funcs = adev->userq_funcs[ring_type];
+
+	if (!amdgpu_userq_is_reset_type_supported(adev, ring_type, AMDGPU_RESET_TYPE_PER_QUEUE))
+			return r;
+
+	if (atomic_read(&uq_mgr->userq_count[ring_type]) > 0 &&
+	    funcs && funcs->detect_and_reset) {
+		r = funcs->detect_and_reset(adev, ring_type);
+		if (r)
+			gpu_reset = true;
+	}
+
+	if (gpu_reset)
+		amdgpu_userq_gpu_reset(adev);
+
+	return r;
+}
 static void amdgpu_userq_hang_detect_work(struct work_struct *work)
 {
 	struct amdgpu_usermode_queue *queue = container_of(work,
@@ -627,7 +673,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
 	/* Cancel any pending hang detection work and cleanup */
 	cancel_delayed_work_sync(&queue->hang_detect_work);
 
-	mutex_lock(&uq_mgr->userq_mutex);
 	queue->hang_detect_fence = NULL;
 	amdgpu_userq_wait_for_last_fence(queue);
 
@@ -649,7 +694,7 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
 #if defined(CONFIG_DEBUG_FS)
 	debugfs_remove_recursive(queue->debugfs_queue);
 #endif
-	amdgpu_userq_detect_and_reset_queues(uq_mgr);
+	amdgpu_userq_perq_detect_and_reset_queues(uq_mgr, queue);
 	r = amdgpu_userq_unmap_helper(queue);
 	/*TODO: It requires a reset for userq hw unmap error*/
 	if (unlikely(r != AMDGPU_USERQ_STATE_UNMAPPED)) {
@@ -657,7 +702,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
 		queue->state = AMDGPU_USERQ_STATE_HUNG;
 	}
 	amdgpu_userq_cleanup(queue);
-	mutex_unlock(&uq_mgr->userq_mutex);
 
 	pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
 
-- 
2.34.1



* Re: [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak
  2026-03-19  8:21 [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Prike Liang
  2026-03-19  8:21 ` [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Prike Liang
  2026-03-19  8:21 ` [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock Prike Liang
@ 2026-03-19  8:41 ` Christian König
  2 siblings, 0 replies; 9+ messages in thread
From: Christian König @ 2026-03-19  8:41 UTC (permalink / raw)
  To: Prike Liang, amd-gfx; +Cc: Alexander.Deucher



On 3/19/26 09:21, Prike Liang wrote:
> The TLB flush fence is leaked while walking over the freed
> mapping list. Meanwhile, the TLB flush fence is also referenced
> by the root BO's reservation object, which makes the remaining
> reference a false-positive leak, so mark it as such.
> 
> Signed-off-by: Prike Liang <Prike.Liang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c           | 3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c | 7 ++++++-
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 7ef0cb6bcbda..1cd4c4217b02 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -1560,6 +1560,8 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>  		goto error_free;
>  
>  	while (!list_empty(&vm->freed)) {
> +		struct dma_fence *old_f = f;
> +
>  		mapping = list_first_entry(&vm->freed,
>  			struct amdgpu_bo_va_mapping, list);
>  		list_del(&mapping->list);
> @@ -1572,6 +1574,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>  			dma_fence_put(f);
>  			goto error_free;
>  		}
> +		dma_fence_put(old_f);

That is incorrect as far as I can see: you are dropping the fence without having acquired a reference to it first.
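
For reference, the usual idiom when a fence pointer gets replaced (a generic
sketch only, not the actual amdgpu_vm_update_range() contract) keeps the
holder owning exactly one reference at all times:

	struct dma_fence *tmp;

	/* Take the new reference before dropping the old one, so the
	 * holder always owns exactly one reference to whatever the
	 * pointer currently points at.
	 */
	tmp = dma_fence_get(new_fence);
	dma_fence_put(*slot);
	*slot = tmp;

If amdgpu_vm_update_range() already consumes the previous value of 'f', an
extra dma_fence_put() on it underflows the reference count instead of fixing
a leak.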

>  	}
>  
>  	if (fence && f) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
> index 5d26797356a3..93b72289e7df 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_tlb_fence.c
> @@ -23,6 +23,7 @@
>  
>  #include <linux/dma-fence.h>
>  #include <linux/workqueue.h>
> +#include <linux/kmemleak.h>
>  
>  #include "amdgpu.h"
>  #include "amdgpu_vm.h"
> @@ -106,6 +107,10 @@ void amdgpu_vm_tlb_fence_create(struct amdgpu_device *adev, struct amdgpu_vm *vm
>  	/* TODO: We probably need a separate wq here */
>  	dma_fence_get(&f->base);
>  	schedule_work(&f->work);
> -
> +	/* The TLB fence is referenced by dma_resv and
> +	 * the resv ref is the remaining ref, so that's
> +	 * a false positive leak.
> +	 */
> +	kmemleak_not_leak(f);

That is just utter nonsense. kmemleak is perfectly capable of detecting the pointer in the dma_resv object.

Regards,
Christian.

>  	*fence = &f->base;
>  }



* Re: [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl()
  2026-03-19  8:21 ` [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Prike Liang
@ 2026-03-19  8:44   ` Christian König
  0 siblings, 0 replies; 9+ messages in thread
From: Christian König @ 2026-03-19  8:44 UTC (permalink / raw)
  To: Prike Liang, amd-gfx; +Cc: Alexander.Deucher

On 3/19/26 09:21, Prike Liang wrote:
> The syncobj reference and the dma_fence_chain allocation need to
> be released on the ioctl exit paths.
> 
> Signed-off-by: Prike Liang <Prike.Liang@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index 98276b55ad3c..f54e0fb5cb2d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -980,6 +980,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>  						      timeline_chain,
>  						      fence,
>  						      args->vm_timeline_point);
> +				timeline_chain = NULL;
>  			}
>  		}
>  		dma_fence_put(fence);
> @@ -987,6 +988,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>  	}
>  
>  error:
> +	dma_fence_chain_free(timeline_chain);
> +	drm_syncobj_put(timeline_syncobj);
>  	drm_exec_fini(&exec);
>  error_put_gobj:
>  	drm_gem_object_put(gobj);



* Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
  2026-03-19  8:21 ` [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock Prike Liang
@ 2026-03-19  8:46   ` Christian König
  2026-03-20  7:56   ` Khatri, Sunil
  1 sibling, 0 replies; 9+ messages in thread
From: Christian König @ 2026-03-19  8:46 UTC (permalink / raw)
  To: Prike Liang, amd-gfx, Khatri, Sunil; +Cc: Alexander.Deucher

On 3/19/26 09:21, Prike Liang wrote:
> In the userq destroy routine the queue refcount should already
> be 0 and the queue should already have been removed from the
> manager list, so the list must not be touched again. Attempting
> to lock the userq mutex here deadlocks, since the mutex is
> already held by the eviction fence suspend work, as the lockdep
> splat below shows.

If I'm not completely mistaken, Sunil has already taken a look into this.

@Sunil, if you haven't seen that before, please take a look at this patch.

Regards,
Christian.

> 
> [  107.881652] ============================================
> [  107.881866] WARNING: possible recursive locking detected
> [  107.882081] 6.19.0-custom #16 Tainted: G     U     OE
> [  107.882305] --------------------------------------------
> [  107.882518] kworker/15:1/158 is trying to acquire lock:
> [  107.882728] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_userq_kref_destroy+0x57/0x540 [amdgpu]
> [  107.883462]
>                but task is already holding lock:
> [  107.883701] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
> [  107.884485]
>                other info that might help us debug this:
> [  107.884751]  Possible unsafe locking scenario:
> 
> [  107.884993]        CPU0
> [  107.885100]        ----
> [  107.885207]   lock(&userq_mgr->userq_mutex);
> [  107.885385]   lock(&userq_mgr->userq_mutex);
> [  107.885561]
>                 *** DEADLOCK ***
> 
> [  107.885798]  May be due to missing lock nesting notation
> 
> [  107.886069] 4 locks held by kworker/15:1/158:
> [  107.886247]  #0: ffff8f2840057558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x455/0x650
> [  107.886630]  #1: ffffd32f01a4fe18 ((work_completion)(&evf_mgr->suspend_work)){+.+.}-{0:0}, at: process_one_work+0x1f3/0x650
> [  107.887075]  #2: ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
> [  107.887799]  #3: ffffffffb8d3f700 (dma_fence_map){++++}-{0:0}, at: amdgpu_eviction_fence_suspend_worker+0x36/0xc0 [amdgpu]
> [  107.888457]
> 
> Signed-off-by: Prike Liang <Prike.Liang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 50 +++++++++++++++++++++--
>  1 file changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> index bb5d572f5a3c..c7a9306a1c01 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> @@ -148,6 +148,52 @@ amdgpu_userq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr)
>  	return r;
>  }
>  
> +static int
> +amdgpu_userq_perq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr,
> +			struct amdgpu_usermode_queue *queue)
> +{
> +	struct amdgpu_device *adev = uq_mgr->adev;
> +	bool gpu_reset = false;
> +	int r = 0;
> +
> +	/* Warning if current process mutex is not held */
> +	if (refcount_read(&queue->refcount.refcount))
> +		WARN_ON(!mutex_is_locked(&uq_mgr->userq_mutex));
> +
> +	if (unlikely(adev->debug_disable_gpu_ring_reset)) {
> +		dev_err(adev->dev, "userq reset disabled by debug mask\n");
> +		return 0;
> +	}
> +
> +	/*
> +	 * If GPU recovery feature is disabled system-wide,
> +	 * skip all reset detection logic
> +	 */
> +	if (!amdgpu_gpu_recovery)
> +		return 0;
> +
> +	/*
> +	 * Iterate through all queue types to detect and reset problematic queues
> +	 * Process each queue type in the defined order
> +	 */
> +	int ring_type = queue->queue_type;
> +	const struct amdgpu_userq_funcs *funcs = adev->userq_funcs[ring_type];
> +
> +	if (!amdgpu_userq_is_reset_type_supported(adev, ring_type, AMDGPU_RESET_TYPE_PER_QUEUE))
> +			return r;
> +
> +	if (atomic_read(&uq_mgr->userq_count[ring_type]) > 0 &&
> +	    funcs && funcs->detect_and_reset) {
> +		r = funcs->detect_and_reset(adev, ring_type);
> +		if (r)
> +			gpu_reset = true;
> +	}
> +
> +	if (gpu_reset)
> +		amdgpu_userq_gpu_reset(adev);
> +
> +	return r;
> +}
>  static void amdgpu_userq_hang_detect_work(struct work_struct *work)
>  {
>  	struct amdgpu_usermode_queue *queue = container_of(work,
> @@ -627,7 +673,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>  	/* Cancel any pending hang detection work and cleanup */
>  	cancel_delayed_work_sync(&queue->hang_detect_work);
>  
> -	mutex_lock(&uq_mgr->userq_mutex);
>  	queue->hang_detect_fence = NULL;
>  	amdgpu_userq_wait_for_last_fence(queue);
>  
> @@ -649,7 +694,7 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>  #if defined(CONFIG_DEBUG_FS)
>  	debugfs_remove_recursive(queue->debugfs_queue);
>  #endif
> -	amdgpu_userq_detect_and_reset_queues(uq_mgr);
> +	amdgpu_userq_perq_detect_and_reset_queues(uq_mgr, queue);
>  	r = amdgpu_userq_unmap_helper(queue);
>  	/*TODO: It requires a reset for userq hw unmap error*/
>  	if (unlikely(r != AMDGPU_USERQ_STATE_UNMAPPED)) {
> @@ -657,7 +702,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>  		queue->state = AMDGPU_USERQ_STATE_HUNG;
>  	}
>  	amdgpu_userq_cleanup(queue);
> -	mutex_unlock(&uq_mgr->userq_mutex);
>  
>  	pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
>  



* Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
  2026-03-19  8:21 ` [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock Prike Liang
  2026-03-19  8:46   ` Christian König
@ 2026-03-20  7:56   ` Khatri, Sunil
  2026-03-20  8:49     ` Liang, Prike
  1 sibling, 1 reply; 9+ messages in thread
From: Khatri, Sunil @ 2026-03-20  7:56 UTC (permalink / raw)
  To: Prike Liang, amd-gfx; +Cc: Alexander.Deucher, Christian.Koenig


On 19-03-2026 01:51 pm, Prike Liang wrote:
> In the userq destroy routine the queue refcount should already
> be 0 and the queue should already have been removed from the
> manager list, so the list must not be touched again. Attempting
> to lock the userq mutex here deadlocks, since the mutex is
> already held by the eviction fence suspend work, as the lockdep
> splat below shows.
>
> [  107.881652] ============================================
> [  107.881866] WARNING: possible recursive locking detected
> [  107.882081] 6.19.0-custom #16 Tainted: G     U     OE
> [  107.882305] --------------------------------------------
> [  107.882518] kworker/15:1/158 is trying to acquire lock:
> [  107.882728] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_userq_kref_destroy+0x57/0x540 [amdgpu]
> [  107.883462]
>                 but task is already holding lock:
> [  107.883701] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
> [  107.884485]
>                 other info that might help us debug this:
> [  107.884751]  Possible unsafe locking scenario:
>
> [  107.884993]        CPU0
> [  107.885100]        ----
> [  107.885207]   lock(&userq_mgr->userq_mutex);
> [  107.885385]   lock(&userq_mgr->userq_mutex);
> [  107.885561]
>                  *** DEADLOCK ***
>
> [  107.885798]  May be due to missing lock nesting notation
>
> [  107.886069] 4 locks held by kworker/15:1/158:
> [  107.886247]  #0: ffff8f2840057558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x455/0x650
> [  107.886630]  #1: ffffd32f01a4fe18 ((work_completion)(&evf_mgr->suspend_work)){+.+.}-{0:0}, at: process_one_work+0x1f3/0x650
> [  107.887075]  #2: ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
> [  107.887799]  #3: ffffffffb8d3f700 (dma_fence_map){++++}-{0:0}, at: amdgpu_eviction_fence_suspend_worker+0x36/0xc0 [amdgpu]
> [  107.888457]
>
> Signed-off-by: Prike Liang <Prike.Liang@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 50 +++++++++++++++++++++--
>   1 file changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> index bb5d572f5a3c..c7a9306a1c01 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> @@ -148,6 +148,52 @@ amdgpu_userq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr)
>   	return r;
>   }
>   
> +static int
> +amdgpu_userq_perq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr,
> +			struct amdgpu_usermode_queue *queue)
> +{
> +	struct amdgpu_device *adev = uq_mgr->adev;
> +	bool gpu_reset = false;
> +	int r = 0;
> +
> +	/* Warning if current process mutex is not held */
> +	if (refcount_read(&queue->refcount.refcount))
> +		WARN_ON(!mutex_is_locked(&uq_mgr->userq_mutex));
> +
> +	if (unlikely(adev->debug_disable_gpu_ring_reset)) {
> +		dev_err(adev->dev, "userq reset disabled by debug mask\n");
> +		return 0;
> +	}
> +
> +	/*
> +	 * If GPU recovery feature is disabled system-wide,
> +	 * skip all reset detection logic
> +	 */
> +	if (!amdgpu_gpu_recovery)
> +		return 0;
> +
> +	/*
> +	 * Iterate through all queue types to detect and reset problematic queues
> +	 * Process each queue type in the defined order
> +	 */
> +	int ring_type = queue->queue_type;
> +	const struct amdgpu_userq_funcs *funcs = adev->userq_funcs[ring_type];
> +
> +	if (!amdgpu_userq_is_reset_type_supported(adev, ring_type, AMDGPU_RESET_TYPE_PER_QUEUE))
> +			return r;
> +
> +	if (atomic_read(&uq_mgr->userq_count[ring_type]) > 0 &&
> +	    funcs && funcs->detect_and_reset) {
> +		r = funcs->detect_and_reset(adev, ring_type);
> +		if (r)
> +			gpu_reset = true;
> +	}
> +
> +	if (gpu_reset)
> +		amdgpu_userq_gpu_reset(adev);
> +
> +	return r;
> +}
>   static void amdgpu_userq_hang_detect_work(struct work_struct *work)
>   {
>   	struct amdgpu_usermode_queue *queue = container_of(work,
> @@ -627,7 +673,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>   	/* Cancel any pending hang detection work and cleanup */
>   	cancel_delayed_work_sync(&queue->hang_detect_work);
>   
> -	mutex_lock(&uq_mgr->userq_mutex);

We can't release the locks here, and we still need the locks while updating hang_detect_fence and in all the other functions that follow.

>   	queue->hang_detect_fence = NULL;
>   	amdgpu_userq_wait_for_last_fence(queue);
>   
> @@ -649,7 +694,7 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>   #if defined(CONFIG_DEBUG_FS)
>   	debugfs_remove_recursive(queue->debugfs_queue);
>   #endif
> -	amdgpu_userq_detect_and_reset_queues(uq_mgr);
> +	amdgpu_userq_perq_detect_and_reset_queues(uq_mgr, queue);
The possibility of the deadlock looks correct, and I found some other
places with the same problem. But we can't just leave out the locks here
like this; we still need the lock for the cleanup and for the rest of the
function. I am looking into it and will share a fix where we don't have
to drop the locks, and probably a better way overall.

Regards
Sunil khatri

>   	r = amdgpu_userq_unmap_helper(queue);
>   	/*TODO: It requires a reset for userq hw unmap error*/
>   	if (unlikely(r != AMDGPU_USERQ_STATE_UNMAPPED)) {
> @@ -657,7 +702,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>   		queue->state = AMDGPU_USERQ_STATE_HUNG;
>   	}
>   	amdgpu_userq_cleanup(queue);
> -	mutex_unlock(&uq_mgr->userq_mutex);
>   
>   	pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
>   


* RE: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
  2026-03-20  7:56   ` Khatri, Sunil
@ 2026-03-20  8:49     ` Liang, Prike
  2026-03-20  9:17       ` Khatri, Sunil
  0 siblings, 1 reply; 9+ messages in thread
From: Liang, Prike @ 2026-03-20  8:49 UTC (permalink / raw)
  To: Khatri, Sunil, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian

[Public]

Regards,
      Prike

> -----Original Message-----
> From: Khatri, Sunil <Sunil.Khatri@amd.com>
> Sent: Friday, March 20, 2026 3:57 PM
> To: Liang, Prike <Prike.Liang@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
>
>
> On 19-03-2026 01:51 pm, Prike Liang wrote:
> > In the userq destroy routine the queue refcount should already be 0 and
> > the queue should already have been removed from the manager list, so the
> > list must not be touched again. Attempting to lock the userq mutex here
> > deadlocks, since the mutex is already held by the eviction fence suspend
> > work, as the lockdep splat below shows.
> >
> > [  107.881652] ============================================
> > [  107.881866] WARNING: possible recursive locking detected
> > [  107.882081] 6.19.0-custom #16 Tainted: G     U     OE
> > [  107.882305] --------------------------------------------
> > [  107.882518] kworker/15:1/158 is trying to acquire lock:
> > [  107.882728] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4},
> > at: amdgpu_userq_kref_destroy+0x57/0x540 [amdgpu] [  107.883462]
> >                 but task is already holding lock:
> > [  107.883701] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4},
> > at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu] [  107.884485]
> >                 other info that might help us debug this:
> > [  107.884751]  Possible unsafe locking scenario:
> >
> > [  107.884993]        CPU0
> > [  107.885100]        ----
> > [  107.885207]   lock(&userq_mgr->userq_mutex);
> > [  107.885385]   lock(&userq_mgr->userq_mutex);
> > [  107.885561]
> >                  *** DEADLOCK ***
> >
> > [  107.885798]  May be due to missing lock nesting notation
> >
> > [  107.886069] 4 locks held by kworker/15:1/158:
> > [  107.886247]  #0: ffff8f2840057558
> > ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x455/0x650
> > [  107.886630]  #1: ffffd32f01a4fe18
> > ((work_completion)(&evf_mgr->suspend_work)){+.+.}-{0:0}, at:
> > process_one_work+0x1f3/0x650 [  107.887075]  #2: ffff8f2854b3d110
> > (&userq_mgr->userq_mutex){+.+.}-{4:4}, at:
> > amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu] [  107.887799]
> > #3: ffffffffb8d3f700 (dma_fence_map){++++}-{0:0}, at:
> > amdgpu_eviction_fence_suspend_worker+0x36/0xc0 [amdgpu] [  107.888457]
> >
> > Signed-off-by: Prike Liang <Prike.Liang@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 50 +++++++++++++++++++++-
> -
> >   1 file changed, 47 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > index bb5d572f5a3c..c7a9306a1c01 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> > @@ -148,6 +148,52 @@ amdgpu_userq_detect_and_reset_queues(struct
> amdgpu_userq_mgr *uq_mgr)
> >     return r;
> >   }
> >
> > +static int
> > +amdgpu_userq_perq_detect_and_reset_queues(struct amdgpu_userq_mgr
> *uq_mgr,
> > +                   struct amdgpu_usermode_queue *queue) {
> > +   struct amdgpu_device *adev = uq_mgr->adev;
> > +   bool gpu_reset = false;
> > +   int r = 0;
> > +
> > +   /* Warning if current process mutex is not held */
> > +   if (refcount_read(&queue->refcount.refcount))
> > +           WARN_ON(!mutex_is_locked(&uq_mgr->userq_mutex));
> > +
> > +   if (unlikely(adev->debug_disable_gpu_ring_reset)) {
> > +           dev_err(adev->dev, "userq reset disabled by debug mask\n");
> > +           return 0;
> > +   }
> > +
> > +   /*
> > +    * If GPU recovery feature is disabled system-wide,
> > +    * skip all reset detection logic
> > +    */
> > +   if (!amdgpu_gpu_recovery)
> > +           return 0;
> > +
> > +   /*
> > +    * Iterate through all queue types to detect and reset problematic queues
> > +    * Process each queue type in the defined order
> > +    */
> > +   int ring_type = queue->queue_type;
> > +   const struct amdgpu_userq_funcs *funcs =
> > +adev->userq_funcs[ring_type];
> > +
> > +   if (!amdgpu_userq_is_reset_type_supported(adev, ring_type,
> AMDGPU_RESET_TYPE_PER_QUEUE))
> > +                   return r;
> > +
> > +   if (atomic_read(&uq_mgr->userq_count[ring_type]) > 0 &&
> > +       funcs && funcs->detect_and_reset) {
> > +           r = funcs->detect_and_reset(adev, ring_type);
> > +           if (r)
> > +                   gpu_reset = true;
> > +   }
> > +
> > +   if (gpu_reset)
> > +           amdgpu_userq_gpu_reset(adev);
> > +
> > +   return r;
> > +}
> >   static void amdgpu_userq_hang_detect_work(struct work_struct *work)
> >   {
> >     struct amdgpu_usermode_queue *queue = container_of(work, @@ -627,7
> > +673,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct
> amdgpu_usermode_que
> >     /* Cancel any pending hang detection work and cleanup */
> >     cancel_delayed_work_sync(&queue->hang_detect_work);
> >
> > -   mutex_lock(&uq_mgr->userq_mutex);
>
> Cant release locks here and we still need locks while updating hang_detect_fence
> and all other functions that follow.
It does not release the userq lock; rather, the mutex is already acquired by the eviction fence suspend work.
And I have reworked the hang-queue detection a bit in this patch so that it does not assert that the lock is held once all the queue references have been dropped.
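
Roughly, the recursion in question (reconstructed from the lockdep splat; the
two lock sites come from the splat, the intermediate calls are my assumption)
looks like this:

	amdgpu_eviction_fence_suspend_worker()
	  mutex_lock(&userq_mgr->userq_mutex)         /* lock taken       */
	    ... last queue reference dropped ...
	      amdgpu_userq_kref_destroy()
	        amdgpu_userq_destroy()
	          mutex_lock(&userq_mgr->userq_mutex)  /* same lock again: */
	                                               /* self-deadlock    */

So the destroy path must not try to take the mutex again when it is reached
from the suspend worker, which is the lock/unlock pair this patch removes.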


> >     queue->hang_detect_fence = NULL;
> >     amdgpu_userq_wait_for_last_fence(queue);
> >
> > @@ -649,7 +694,7 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr
> *uq_mgr, struct amdgpu_usermode_que
> >   #if defined(CONFIG_DEBUG_FS)
> >     debugfs_remove_recursive(queue->debugfs_queue);
> >   #endif
> > -   amdgpu_userq_detect_and_reset_queues(uq_mgr);
> > +   amdgpu_userq_perq_detect_and_reset_queues(uq_mgr, queue);
> Possibility of the deadlock seems correct and there are some other places too that i
> found out. But we cant leave the locks here like this.
> We still need lock to clean up and rest of the function.I am looking into it and share a
> fix where we dont have to release locks and probably a better way
We may need to resolve the deadlocks case by case, and this patch aims to resolve the
userq destroy deadlock triggered from the eviction fence suspend work. I hope this patch
helps you observe similar deadlock issues.

> Regards
> Sunil khatri
>
> >     r = amdgpu_userq_unmap_helper(queue);
> >     /*TODO: It requires a reset for userq hw unmap error*/
> >     if (unlikely(r != AMDGPU_USERQ_STATE_UNMAPPED)) { @@ -657,7
> +702,6
> > @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct
> amdgpu_usermode_que
> >             queue->state = AMDGPU_USERQ_STATE_HUNG;
> >     }
> >     amdgpu_userq_cleanup(queue);
> > -   mutex_unlock(&uq_mgr->userq_mutex);
> >
> >     pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
> >


* Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
  2026-03-20  8:49     ` Liang, Prike
@ 2026-03-20  9:17       ` Khatri, Sunil
  0 siblings, 0 replies; 9+ messages in thread
From: Khatri, Sunil @ 2026-03-20  9:17 UTC (permalink / raw)
  To: Liang, Prike, Khatri, Sunil, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian


On 20-03-2026 02:19 pm, Liang, Prike wrote:
> [Public]
>
> Regards,
>        Prike
>
>> -----Original Message-----
>> From: Khatri, Sunil <Sunil.Khatri@amd.com>
>> Sent: Friday, March 20, 2026 3:57 PM
>> To: Liang, Prike <Prike.Liang@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
>>
>>
>> On 19-03-2026 01:51 pm, Prike Liang wrote:
>>> In the userq destroy routine the queue refcount should already be 0 and
>>> the queue should already have been removed from the manager list, so the
>>> list must not be touched again. Attempting to lock the userq mutex here
>>> deadlocks, since the mutex is already held by the eviction fence suspend
>>> work, as the lockdep splat below shows.
>>>
>>> [  107.881652] ============================================
>>> [  107.881866] WARNING: possible recursive locking detected
>>> [  107.882081] 6.19.0-custom #16 Tainted: G     U     OE
>>> [  107.882305] --------------------------------------------
>>> [  107.882518] kworker/15:1/158 is trying to acquire lock:
>>> [  107.882728] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4},
>>> at: amdgpu_userq_kref_destroy+0x57/0x540 [amdgpu] [  107.883462]
>>>                  but task is already holding lock:
>>> [  107.883701] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4},
>>> at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu] [  107.884485]
>>>                  other info that might help us debug this:
>>> [  107.884751]  Possible unsafe locking scenario:
>>>
>>> [  107.884993]        CPU0
>>> [  107.885100]        ----
>>> [  107.885207]   lock(&userq_mgr->userq_mutex);
>>> [  107.885385]   lock(&userq_mgr->userq_mutex);
>>> [  107.885561]
>>>                   *** DEADLOCK ***
>>>
>>> [  107.885798]  May be due to missing lock nesting notation
>>>
>>> [  107.886069] 4 locks held by kworker/15:1/158:
>>> [  107.886247]  #0: ffff8f2840057558
>>> ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x455/0x650
>>> [  107.886630]  #1: ffffd32f01a4fe18
>>> ((work_completion)(&evf_mgr->suspend_work)){+.+.}-{0:0}, at:
>>> process_one_work+0x1f3/0x650 [  107.887075]  #2: ffff8f2854b3d110
>>> (&userq_mgr->userq_mutex){+.+.}-{4:4}, at:
>>> amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu] [  107.887799]
>>> #3: ffffffffb8d3f700 (dma_fence_map){++++}-{0:0}, at:
>>> amdgpu_eviction_fence_suspend_worker+0x36/0xc0 [amdgpu] [  107.888457]
>>>
>>> Signed-off-by: Prike Liang <Prike.Liang@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 50 +++++++++++++++++++++-
>> -
>>>    1 file changed, 47 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> index bb5d572f5a3c..c7a9306a1c01 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> @@ -148,6 +148,52 @@ amdgpu_userq_detect_and_reset_queues(struct
>> amdgpu_userq_mgr *uq_mgr)
>>>      return r;
>>>    }
>>>
>>> +static int
>>> +amdgpu_userq_perq_detect_and_reset_queues(struct amdgpu_userq_mgr
>> *uq_mgr,
>>> +                   struct amdgpu_usermode_queue *queue) {
>>> +   struct amdgpu_device *adev = uq_mgr->adev;
>>> +   bool gpu_reset = false;
>>> +   int r = 0;
>>> +
>>> +   /* Warning if current process mutex is not held */
>>> +   if (refcount_read(&queue->refcount.refcount))
>>> +           WARN_ON(!mutex_is_locked(&uq_mgr->userq_mutex));
>>> +
>>> +   if (unlikely(adev->debug_disable_gpu_ring_reset)) {
>>> +           dev_err(adev->dev, "userq reset disabled by debug mask\n");
>>> +           return 0;
>>> +   }
>>> +
>>> +   /*
>>> +    * If GPU recovery feature is disabled system-wide,
>>> +    * skip all reset detection logic
>>> +    */
>>> +   if (!amdgpu_gpu_recovery)
>>> +           return 0;
>>> +
>>> +   /*
>>> +    * Iterate through all queue types to detect and reset problematic queues
>>> +    * Process each queue type in the defined order
>>> +    */
>>> +   int ring_type = queue->queue_type;
>>> +   const struct amdgpu_userq_funcs *funcs =
>>> +adev->userq_funcs[ring_type];
>>> +
>>> +   if (!amdgpu_userq_is_reset_type_supported(adev, ring_type,
>> AMDGPU_RESET_TYPE_PER_QUEUE))
>>> +                   return r;
>>> +
>>> +   if (atomic_read(&uq_mgr->userq_count[ring_type]) > 0 &&
>>> +       funcs && funcs->detect_and_reset) {
>>> +           r = funcs->detect_and_reset(adev, ring_type);
>>> +           if (r)
>>> +                   gpu_reset = true;
>>> +   }
>>> +
>>> +   if (gpu_reset)
>>> +           amdgpu_userq_gpu_reset(adev);
>>> +
>>> +   return r;
>>> +}
>>>    static void amdgpu_userq_hang_detect_work(struct work_struct *work)
>>>    {
>>>      struct amdgpu_usermode_queue *queue = container_of(work, @@ -627,7
>>> +673,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct
>> amdgpu_usermode_que
>>>      /* Cancel any pending hang detection work and cleanup */
>>>      cancel_delayed_work_sync(&queue->hang_detect_work);
>>>
>>> -   mutex_lock(&uq_mgr->userq_mutex);
>> Cant release locks here and we still need locks while updating hang_detect_fence
>> and all other functions that follow.
> It does not release the userq lock, instead of the mutex lock is already acquired by the eviction fence suspend work.
> And the hang queue detect I have reworked a bit in this patch which doesn't asset the lock when all the queue dereferenced.
True, I missed that. But there is another problem: we can't call the
functions below with the lock held, as that deadlocks again, since both
the resume work and the hang_detect work take the lock themselves.
      cancel_delayed_work_sync(&uq_mgr->resume_work);
     /* Cancel any pending hang detection work and cleanup */
      cancel_delayed_work_sync(&queue->hang_detect_work);
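
One common way around that particular ordering problem (an illustrative
sketch only, relying on the point above that both works re-take userq_mutex
themselves) is to cancel the works before the mutex is taken for the rest of
the teardown:

	/* Cancel works that take userq_mutex while the mutex is NOT
	 * held, then take the mutex only for the part of the teardown
	 * that really needs it.
	 */
	cancel_delayed_work_sync(&uq_mgr->resume_work);
	cancel_delayed_work_sync(&queue->hang_detect_work);

	mutex_lock(&uq_mgr->userq_mutex);
	/* ... list/queue cleanup that needs the lock ... */
	mutex_unlock(&uq_mgr->userq_mutex);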

>
>
>>>      queue->hang_detect_fence = NULL;
>>>      amdgpu_userq_wait_for_last_fence(queue);
>>>
>>> @@ -649,7 +694,7 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr
>> *uq_mgr, struct amdgpu_usermode_que
>>>    #if defined(CONFIG_DEBUG_FS)
>>>      debugfs_remove_recursive(queue->debugfs_queue);
>>>    #endif
>>> -   amdgpu_userq_detect_and_reset_queues(uq_mgr);
>>> +   amdgpu_userq_perq_detect_and_reset_queues(uq_mgr, queue);
>> Possibility of the deadlock seems correct and there are some other places too that i
>> found out. But we cant leave the locks here like this.
>> We still need lock to clean up and rest of the function.I am looking into it and share a
>> fix where we dont have to release locks and probably a better way
> We may need to resolve the deadlock case by case, and this patch arm to resolve the
> userq destroyed deadlock in the eviction fence suspend work. I hope this patch can help you
> observe the similar deadlock issue.

Yeah, we could, but I feel this needs a redesign, or get/put helpers that
handle the locking on behalf of the caller for both the lock-already-taken
and the not-taken case. I have a patch for this which fixes some other
corner cases too; let's check it once.
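
Something along these lines is what I have in mind (a rough sketch only; the
helper names and the release callback are made up, not an actual patch yet):

	/* Callers that already hold the manager mutex use the _locked
	 * variant; everyone else goes through the wrapper that takes
	 * the mutex itself.
	 */
	static void userq_put_locked(struct amdgpu_userq_mgr *uq_mgr,
				     struct amdgpu_usermode_queue *queue)
	{
		lockdep_assert_held(&uq_mgr->userq_mutex);
		kref_put(&queue->refcount, userq_release); /* hypothetical release cb */
	}

	static void userq_put(struct amdgpu_userq_mgr *uq_mgr,
			      struct amdgpu_usermode_queue *queue)
	{
		mutex_lock(&uq_mgr->userq_mutex);
		userq_put_locked(uq_mgr, queue);
		mutex_unlock(&uq_mgr->userq_mutex);
	}

That way the eviction fence suspend worker, which already holds userq_mutex,
would use the _locked variant and the destroy path never has to re-take the
lock.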

Regards
Sunil khatri

>
>> Regards
>> Sunil khatri
>>
>>>      r = amdgpu_userq_unmap_helper(queue);
>>>      /*TODO: It requires a reset for userq hw unmap error*/
>>>      if (unlikely(r != AMDGPU_USERQ_STATE_UNMAPPED)) { @@ -657,7
>> +702,6
>>> @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct
>> amdgpu_usermode_que
>>>              queue->state = AMDGPU_USERQ_STATE_HUNG;
>>>      }
>>>      amdgpu_userq_cleanup(queue);
>>> -   mutex_unlock(&uq_mgr->userq_mutex);
>>>
>>>      pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
>>>


end of thread

Thread overview: 9+ messages
2026-03-19  8:21 [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Prike Liang
2026-03-19  8:21 ` [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Prike Liang
2026-03-19  8:44   ` Christian König
2026-03-19  8:21 ` [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock Prike Liang
2026-03-19  8:46   ` Christian König
2026-03-20  7:56   ` Khatri, Sunil
2026-03-20  8:49     ` Liang, Prike
2026-03-20  9:17       ` Khatri, Sunil
2026-03-19  8:41 ` [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Christian König
