From: "Christian König" <christian.koenig@amd.com>
To: Prike Liang <Prike.Liang@amd.com>, amd-gfx@lists.freedesktop.org,
"Khatri, Sunil" <Sunil.Khatri@amd.com>
Cc: Alexander.Deucher@amd.com
Subject: Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
Date: Thu, 19 Mar 2026 09:46:03 +0100 [thread overview]
Message-ID: <3e51e435-7eb2-41ed-ae8f-ca48703586a2@amd.com> (raw)
In-Reply-To: <20260319082150.3324177-3-Prike.Liang@amd.com>
On 3/19/26 09:21, Prike Liang wrote:
> In the userq destroy routine, the queue refcount
> should already be zero and the queue removed from
> the manager list, so the queue must not be touched.
> Taking the userq mutex here would deadlock, since
> it is already held by the eviction suspend worker,
> as the lockdep report below shows.
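For readers less familiar with the splat below: the recursion can be reproduced in miniature with a plain POSIX error-checking mutex (illustration only, not amdgpu code; a kernel mutex would simply hang instead of returning an error):

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Re-lock a mutex the calling thread already holds. The ERRORCHECK
 * type turns the self-deadlock into an observable return code;
 * a normal kernel mutex would just block forever here. */
static int relock_same_mutex(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t m;
	int r;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&m, &attr);

	pthread_mutex_lock(&m);     /* suspend worker takes userq_mutex */
	r = pthread_mutex_lock(&m); /* destroy path tries to take it again */

	pthread_mutex_unlock(&m);
	pthread_mutex_destroy(&m);
	pthread_mutexattr_destroy(&attr);
	return r;
}
```

The second lock fails with EDEADLK, which is exactly the pattern lockdep flags in the report.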
If I'm not completely mistaken, Sunil already took a look into this.
@Sunil, if you haven't seen this before, please take a look at this patch.
Regards,
Christian.
>
> [ 107.881652] ============================================
> [ 107.881866] WARNING: possible recursive locking detected
> [ 107.882081] 6.19.0-custom #16 Tainted: G U OE
> [ 107.882305] --------------------------------------------
> [ 107.882518] kworker/15:1/158 is trying to acquire lock:
> [ 107.882728] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_userq_kref_destroy+0x57/0x540 [amdgpu]
> [ 107.883462]
> but task is already holding lock:
> [ 107.883701] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
> [ 107.884485]
> other info that might help us debug this:
> [ 107.884751] Possible unsafe locking scenario:
>
> [ 107.884993] CPU0
> [ 107.885100] ----
> [ 107.885207] lock(&userq_mgr->userq_mutex);
> [ 107.885385] lock(&userq_mgr->userq_mutex);
> [ 107.885561]
> *** DEADLOCK ***
>
> [ 107.885798] May be due to missing lock nesting notation
>
> [ 107.886069] 4 locks held by kworker/15:1/158:
> [ 107.886247] #0: ffff8f2840057558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x455/0x650
> [ 107.886630] #1: ffffd32f01a4fe18 ((work_completion)(&evf_mgr->suspend_work)){+.+.}-{0:0}, at: process_one_work+0x1f3/0x650
> [ 107.887075] #2: ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
> [ 107.887799] #3: ffffffffb8d3f700 (dma_fence_map){++++}-{0:0}, at: amdgpu_eviction_fence_suspend_worker+0x36/0xc0 [amdgpu]
> [ 107.888457]
>
> Signed-off-by: Prike Liang <Prike.Liang@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 50 +++++++++++++++++++++--
> 1 file changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> index bb5d572f5a3c..c7a9306a1c01 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
> @@ -148,6 +148,52 @@ amdgpu_userq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr)
> return r;
> }
>
> +static int
> +amdgpu_userq_perq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr,
> + struct amdgpu_usermode_queue *queue)
> +{
> + struct amdgpu_device *adev = uq_mgr->adev;
> + bool gpu_reset = false;
> + int r = 0;
> +
> + /* Warn if the queue is still referenced but the manager's userq_mutex is not held */
> + if (refcount_read(&queue->refcount.refcount))
> + WARN_ON(!mutex_is_locked(&uq_mgr->userq_mutex));
> +
> + if (unlikely(adev->debug_disable_gpu_ring_reset)) {
> + dev_err(adev->dev, "userq reset disabled by debug mask\n");
> + return 0;
> + }
> +
> + /*
> + * If GPU recovery feature is disabled system-wide,
> + * skip all reset detection logic
> + */
> + if (!amdgpu_gpu_recovery)
> + return 0;
> +
> + /*
> + * Only the type of the queue being destroyed needs to be
> + * checked for a per-queue reset here
> + */
> + int ring_type = queue->queue_type;
> + const struct amdgpu_userq_funcs *funcs = adev->userq_funcs[ring_type];
> +
> + if (!amdgpu_userq_is_reset_type_supported(adev, ring_type, AMDGPU_RESET_TYPE_PER_QUEUE))
> + return r;
> +
> + if (atomic_read(&uq_mgr->userq_count[ring_type]) > 0 &&
> + funcs && funcs->detect_and_reset) {
> + r = funcs->detect_and_reset(adev, ring_type);
> + if (r)
> + gpu_reset = true;
> + }
> +
> + if (gpu_reset)
> + amdgpu_userq_gpu_reset(adev);
> +
> + return r;
> +}
> +
> static void amdgpu_userq_hang_detect_work(struct work_struct *work)
> {
> struct amdgpu_usermode_queue *queue = container_of(work,
> @@ -627,7 +673,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
> /* Cancel any pending hang detection work and cleanup */
> cancel_delayed_work_sync(&queue->hang_detect_work);
>
> - mutex_lock(&uq_mgr->userq_mutex);
> queue->hang_detect_fence = NULL;
> amdgpu_userq_wait_for_last_fence(queue);
>
> @@ -649,7 +694,7 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
> #if defined(CONFIG_DEBUG_FS)
> debugfs_remove_recursive(queue->debugfs_queue);
> #endif
> - amdgpu_userq_detect_and_reset_queues(uq_mgr);
> + amdgpu_userq_perq_detect_and_reset_queues(uq_mgr, queue);
> r = amdgpu_userq_unmap_helper(queue);
> /*TODO: It requires a reset for userq hw unmap error*/
> if (unlikely(r != AMDGPU_USERQ_STATE_UNMAPPED)) {
> @@ -657,7 +702,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
> queue->state = AMDGPU_USERQ_STATE_HUNG;
> }
> amdgpu_userq_cleanup(queue);
> - mutex_unlock(&uq_mgr->userq_mutex);
>
> pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
>
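For anyone skimming the diff: the fix moves lock ownership entirely to the caller (the eviction suspend worker), so the destroy path runs with the mutex already held and never takes it itself. A generic sketch of that convention, with purely illustrative names rather than amdgpu API:

```c
#include <pthread.h>

struct mgr {
	pthread_mutex_t lock;	/* stands in for userq_mgr->userq_mutex */
	int queue_count;
};

/* Callee: runs with mgr->lock already held by the caller and must
 * not take it again -- the equivalent of dropping mutex_lock()/
 * mutex_unlock() from amdgpu_userq_destroy(). Kernel code would
 * typically document this with lockdep_assert_held(). */
static void queue_destroy_locked(struct mgr *m)
{
	m->queue_count--;	/* manager state touched under the lock */
}

/* Caller: the single place the lock is acquired. */
static int suspend_worker(struct mgr *m)
{
	pthread_mutex_lock(&m->lock);
	queue_destroy_locked(m);	/* no second lock, no deadlock */
	pthread_mutex_unlock(&m->lock);
	return m->queue_count;
}
```

The trade-off is that every caller of the `_locked` helper must now hold the mutex, which is why the WARN_ON in the per-queue reset helper above is useful.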
Thread overview:
2026-03-19 8:21 [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Prike Liang
2026-03-19 8:21 ` [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Prike Liang
2026-03-19 8:44 ` Christian König
2026-03-19 8:21 ` [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock Prike Liang
2026-03-19 8:46 ` Christian König [this message]
2026-03-20 7:56 ` Khatri, Sunil
2026-03-20 8:49 ` Liang, Prike
2026-03-20 9:17 ` Khatri, Sunil
2026-03-19 8:41 ` [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Christian König