From: "Khatri, Sunil" <sukhatri@amd.com>
To: "Liang, Prike" <Prike.Liang@amd.com>,
"Khatri, Sunil" <Sunil.Khatri@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>
Subject: Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
Date: Fri, 20 Mar 2026 14:47:47 +0530 [thread overview]
Message-ID: <6a2a9e6b-c5e0-4e9e-a012-274a5c434fa5@amd.com> (raw)
In-Reply-To: <MN0PR12MB600408189043102213210DEEFB4CA@MN0PR12MB6004.namprd12.prod.outlook.com>
On 20-03-2026 02:19 pm, Liang, Prike wrote:
> [Public]
>
> Regards,
> Prike
>
>> -----Original Message-----
>> From: Khatri, Sunil <Sunil.Khatri@amd.com>
>> Sent: Friday, March 20, 2026 3:57 PM
>> To: Liang, Prike <Prike.Liang@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
>>
>>
>> On 19-03-2026 01:51 pm, Prike Liang wrote:
>>> In the userq destroy routine, the queue refcount should be 0 and the
>>> queue should already be removed from the manager list, so it must not
>>> be touched. Attempting to lock the userq mutex here would deadlock,
>>> as it is already held by the eviction fence suspend work, as shown in
>>> the following lockdep report.
>>>
>>> [ 107.881652] ============================================
>>> [ 107.881866] WARNING: possible recursive locking detected
>>> [ 107.882081] 6.19.0-custom #16 Tainted: G U OE
>>> [ 107.882305] --------------------------------------------
>>> [ 107.882518] kworker/15:1/158 is trying to acquire lock:
>>> [ 107.882728] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_userq_kref_destroy+0x57/0x540 [amdgpu]
>>> [ 107.883462] but task is already holding lock:
>>> [ 107.883701] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
>>> [ 107.884485] other info that might help us debug this:
>>> [ 107.884751]  Possible unsafe locking scenario:
>>>
>>> [ 107.884993]        CPU0
>>> [ 107.885100]        ----
>>> [ 107.885207]   lock(&userq_mgr->userq_mutex);
>>> [ 107.885385]   lock(&userq_mgr->userq_mutex);
>>> [ 107.885561]
>>>                 *** DEADLOCK ***
>>>
>>> [ 107.885798]  May be due to missing lock nesting notation
>>>
>>> [ 107.886069] 4 locks held by kworker/15:1/158:
>>> [ 107.886247]  #0: ffff8f2840057558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x455/0x650
>>> [ 107.886630]  #1: ffffd32f01a4fe18 ((work_completion)(&evf_mgr->suspend_work)){+.+.}-{0:0}, at: process_one_work+0x1f3/0x650
>>> [ 107.887075]  #2: ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
>>> [ 107.887799]  #3: ffffffffb8d3f700 (dma_fence_map){++++}-{0:0}, at: amdgpu_eviction_fence_suspend_worker+0x36/0xc0 [amdgpu]
>>> [ 107.888457]
>>>
>>> Signed-off-by: Prike Liang <Prike.Liang@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 50 +++++++++++++++++++++--
>>>  1 file changed, 47 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> index bb5d572f5a3c..c7a9306a1c01 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> @@ -148,6 +148,52 @@ amdgpu_userq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr)
>>>  	return r;
>>>  }
>>>
>>> +static int
>>> +amdgpu_userq_perq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr,
>>> +					  struct amdgpu_usermode_queue *queue)
>>> +{
>>> +	struct amdgpu_device *adev = uq_mgr->adev;
>>> +	const struct amdgpu_userq_funcs *funcs;
>>> +	bool gpu_reset = false;
>>> +	int ring_type;
>>> +	int r = 0;
>>> +
>>> +	/* Warn if the current process mutex is not held */
>>> +	if (refcount_read(&queue->refcount.refcount))
>>> +		WARN_ON(!mutex_is_locked(&uq_mgr->userq_mutex));
>>> +
>>> +	if (unlikely(adev->debug_disable_gpu_ring_reset)) {
>>> +		dev_err(adev->dev, "userq reset disabled by debug mask\n");
>>> +		return 0;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If the GPU recovery feature is disabled system-wide,
>>> +	 * skip all reset detection logic.
>>> +	 */
>>> +	if (!amdgpu_gpu_recovery)
>>> +		return 0;
>>> +
>>> +	/* Detect and reset only the ring type of the given queue. */
>>> +	ring_type = queue->queue_type;
>>> +	funcs = adev->userq_funcs[ring_type];
>>> +
>>> +	if (!amdgpu_userq_is_reset_type_supported(adev, ring_type,
>>> +						  AMDGPU_RESET_TYPE_PER_QUEUE))
>>> +		return r;
>>> +
>>> +	if (atomic_read(&uq_mgr->userq_count[ring_type]) > 0 &&
>>> +	    funcs && funcs->detect_and_reset) {
>>> +		r = funcs->detect_and_reset(adev, ring_type);
>>> +		if (r)
>>> +			gpu_reset = true;
>>> +	}
>>> +
>>> +	if (gpu_reset)
>>> +		amdgpu_userq_gpu_reset(adev);
>>> +
>>> +	return r;
>>> +}
>>> +
>>>  static void amdgpu_userq_hang_detect_work(struct work_struct *work)
>>>  {
>>>  	struct amdgpu_usermode_queue *queue = container_of(work,
>>> @@ -627,7 +673,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>>>  	/* Cancel any pending hang detection work and cleanup */
>>>  	cancel_delayed_work_sync(&queue->hang_detect_work);
>>>
>>> -	mutex_lock(&uq_mgr->userq_mutex);
>> Can't release the lock here; we still need it while updating hang_detect_fence
>> and in all the other functions that follow.
> It does not release the userq lock; rather, the mutex is already acquired by the eviction fence suspend work.
> And I have reworked the hang queue detection a bit in this patch so that it doesn't assert the lock
> once all the queue references are dropped.
True, I missed that. But there is another problem: we can't call the
functions below with the lock held, as that causes a deadlock again,
since both the resume work and the hang detect work take the lock
themselves.

cancel_delayed_work_sync(&uq_mgr->resume_work);
/* Cancel any pending hang detection work and cleanup */
cancel_delayed_work_sync(&queue->hang_detect_work);
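To illustrate the ordering constraint with a small userspace analogue
(pthreads standing in for the workqueue API; all names here are made up
for the sketch, not the driver's):

```c
#include <pthread.h>
#include <stdbool.h>

/* "worker" stands in for resume_work / hang_detect_work, and
 * mgr_lock stands in for userq_mutex. */
static pthread_mutex_t mgr_lock = PTHREAD_MUTEX_INITIALIZER;
static bool worker_ran;

static void *worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mgr_lock);	/* the work item takes the lock */
	worker_ran = true;
	pthread_mutex_unlock(&mgr_lock);
	return NULL;
}

/* Safe destroy ordering: synchronously wait for the worker (the
 * analogue of cancel_delayed_work_sync()) BEFORE taking mgr_lock.
 * Waiting with mgr_lock held would deadlock, because the worker
 * cannot finish until the lock is released. */
static bool destroy_queue(void)
{
	pthread_t t;
	bool ok;

	if (pthread_create(&t, NULL, worker, NULL))
		return false;
	pthread_join(t, NULL);		/* cancel/flush outside the lock */

	pthread_mutex_lock(&mgr_lock);	/* now safe: worker is gone */
	ok = worker_ran;
	pthread_mutex_unlock(&mgr_lock);
	return ok;
}
```

The same reasoning applies to both cancel_delayed_work_sync() calls
above: they must run before userq_mutex is taken, not under it.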
>
>
>>>  	queue->hang_detect_fence = NULL;
>>>  	amdgpu_userq_wait_for_last_fence(queue);
>>>
>>> @@ -649,7 +694,7 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>>>  #if defined(CONFIG_DEBUG_FS)
>>>  	debugfs_remove_recursive(queue->debugfs_queue);
>>>  #endif
>>> -	amdgpu_userq_detect_and_reset_queues(uq_mgr);
>>> +	amdgpu_userq_perq_detect_and_reset_queues(uq_mgr, queue);
>> The possibility of the deadlock seems correct, and I found some other places
>> with the same issue. But we can't drop the locks here like this; we still need
>> the lock for the cleanup and the rest of the function. I am looking into it and
>> will share a fix where we don't have to release the locks, and probably a
>> better approach.
> We may need to resolve the deadlocks case by case; this patch aims to resolve the
> userq destroy deadlock in the eviction fence suspend work. I hope this patch helps you
> observe similar deadlock issues.
Yeah, we could, but I feel it needs a redesign, or get/put helpers that
handle both cases from the caller itself, whether the lock is already
taken or not. I have a patch for this which fixes some other corner
cases too; let's check it once.
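Roughly the shape I have in mind, as a userspace sketch (hypothetical
names, not the actual patch):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the get/put idea: the final put tells the caller to run
 * teardown, and teardown then must not take userq_mutex again,
 * because by that point the queue is already off the manager list. */
struct userq_ref {
	atomic_int refcount;
};

static void userq_ref_init(struct userq_ref *q)
{
	atomic_store(&q->refcount, 1);
}

static void userq_get(struct userq_ref *q)
{
	atomic_fetch_add(&q->refcount, 1);
}

/* Returns true only for the last put; the caller performs the cleanup
 * outside userq_mutex, so the same helper works whether or not the
 * lock is currently held (e.g. from the eviction fence suspend
 * worker). */
static bool userq_put(struct userq_ref *q)
{
	return atomic_fetch_sub(&q->refcount, 1) == 1;
}
```

This mirrors the kernel's kref pattern; the point is that only the
release path frees, and it never needs to re-acquire the manager lock.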
Regards
Sunil Khatri
>
>> Regards
>> Sunil Khatri
>>
>>>  	r = amdgpu_userq_unmap_helper(queue);
>>>  	/* TODO: It requires a reset for userq hw unmap error */
>>>  	if (unlikely(r != AMDGPU_USERQ_STATE_UNMAPPED)) {
>>> @@ -657,7 +702,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>>>  		queue->state = AMDGPU_USERQ_STATE_HUNG;
>>>  	}
>>>  	amdgpu_userq_cleanup(queue);
>>> -	mutex_unlock(&uq_mgr->userq_mutex);
>>>
>>>  	pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
>>>
Thread overview: 9+ messages
2026-03-19 8:21 [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Prike Liang
2026-03-19 8:21 ` [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Prike Liang
2026-03-19 8:44 ` Christian König
2026-03-19 8:21 ` [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock Prike Liang
2026-03-19 8:46 ` Christian König
2026-03-20 7:56 ` Khatri, Sunil
2026-03-20 8:49 ` Liang, Prike
2026-03-20 9:17 ` Khatri, Sunil [this message]
2026-03-19 8:41 ` [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Christian König