AMD-GFX Archive on lore.kernel.org
From: "Khatri, Sunil" <sukhatri@amd.com>
To: "Liang, Prike" <Prike.Liang@amd.com>,
	"Khatri, Sunil" <Sunil.Khatri@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Koenig, Christian" <Christian.Koenig@amd.com>
Subject: Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
Date: Fri, 20 Mar 2026 14:47:47 +0530	[thread overview]
Message-ID: <6a2a9e6b-c5e0-4e9e-a012-274a5c434fa5@amd.com> (raw)
In-Reply-To: <MN0PR12MB600408189043102213210DEEFB4CA@MN0PR12MB6004.namprd12.prod.outlook.com>


On 20-03-2026 02:19 pm, Liang, Prike wrote:
> [Public]
>
> Regards,
>        Prike
>
>> -----Original Message-----
>> From: Khatri, Sunil <Sunil.Khatri@amd.com>
>> Sent: Friday, March 20, 2026 3:57 PM
>> To: Liang, Prike <Prike.Liang@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock
>>
>>
>> On 19-03-2026 01:51 pm, Prike Liang wrote:
>>> In the userq destroy routine, the queue refcount should be 0 and the
>>> queue already removed from the manager list, so it must not be
>>> touched. Attempting to lock the userq mutex here would deadlock, as
>>> it is already held by the eviction fence suspend work, as shown in
>>> the following lockdep report.
>>>
>>> [  107.881652] ============================================
>>> [  107.881866] WARNING: possible recursive locking detected
>>> [  107.882081] 6.19.0-custom #16 Tainted: G     U     OE
>>> [  107.882305] --------------------------------------------
>>> [  107.882518] kworker/15:1/158 is trying to acquire lock:
>>> [  107.882728] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_userq_kref_destroy+0x57/0x540 [amdgpu]
>>> [  107.883462]
>>>                  but task is already holding lock:
>>> [  107.883701] ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
>>> [  107.884485]
>>>                  other info that might help us debug this:
>>> [  107.884751]  Possible unsafe locking scenario:
>>>
>>> [  107.884993]        CPU0
>>> [  107.885100]        ----
>>> [  107.885207]   lock(&userq_mgr->userq_mutex);
>>> [  107.885385]   lock(&userq_mgr->userq_mutex);
>>> [  107.885561]
>>>                   *** DEADLOCK ***
>>>
>>> [  107.885798]  May be due to missing lock nesting notation
>>>
>>> [  107.886069] 4 locks held by kworker/15:1/158:
>>> [  107.886247]  #0: ffff8f2840057558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x455/0x650
>>> [  107.886630]  #1: ffffd32f01a4fe18 ((work_completion)(&evf_mgr->suspend_work)){+.+.}-{0:0}, at: process_one_work+0x1f3/0x650
>>> [  107.887075]  #2: ffff8f2854b3d110 (&userq_mgr->userq_mutex){+.+.}-{4:4}, at: amdgpu_eviction_fence_suspend_worker+0x31/0xc0 [amdgpu]
>>> [  107.887799]  #3: ffffffffb8d3f700 (dma_fence_map){++++}-{0:0}, at: amdgpu_eviction_fence_suspend_worker+0x36/0xc0 [amdgpu]
>>> [  107.888457]
>>>
>>> Signed-off-by: Prike Liang <Prike.Liang@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c | 50 +++++++++++++++++++++--
>>>    1 file changed, 47 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> index bb5d572f5a3c..c7a9306a1c01 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq.c
>>> @@ -148,6 +148,52 @@ amdgpu_userq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr)
>>>      return r;
>>>    }
>>>
>>> +static int
>>> +amdgpu_userq_perq_detect_and_reset_queues(struct amdgpu_userq_mgr *uq_mgr,
>>> +                                          struct amdgpu_usermode_queue *queue)
>>> +{
>>> +   struct amdgpu_device *adev = uq_mgr->adev;
>>> +   bool gpu_reset = false;
>>> +   int r = 0;
>>> +
>>> +   /* Warn if the current process mutex is not held */
>>> +   if (refcount_read(&queue->refcount.refcount))
>>> +           WARN_ON(!mutex_is_locked(&uq_mgr->userq_mutex));
>>> +
>>> +   if (unlikely(adev->debug_disable_gpu_ring_reset)) {
>>> +           dev_err(adev->dev, "userq reset disabled by debug mask\n");
>>> +           return 0;
>>> +   }
>>> +
>>> +   /*
>>> +    * If GPU recovery feature is disabled system-wide,
>>> +    * skip all reset detection logic
>>> +    */
>>> +   if (!amdgpu_gpu_recovery)
>>> +           return 0;
>>> +
>>> +   /*
>>> +    * Detect and reset only queues of the type of the queue
>>> +    * being destroyed, instead of iterating over all queue types
>>> +    */
>>> +   int ring_type = queue->queue_type;
>>> +   const struct amdgpu_userq_funcs *funcs = adev->userq_funcs[ring_type];
>>> +
>>> +   if (!amdgpu_userq_is_reset_type_supported(adev, ring_type, AMDGPU_RESET_TYPE_PER_QUEUE))
>>> +           return r;
>>> +
>>> +   if (atomic_read(&uq_mgr->userq_count[ring_type]) > 0 &&
>>> +       funcs && funcs->detect_and_reset) {
>>> +           r = funcs->detect_and_reset(adev, ring_type);
>>> +           if (r)
>>> +                   gpu_reset = true;
>>> +   }
>>> +
>>> +   if (gpu_reset)
>>> +           amdgpu_userq_gpu_reset(adev);
>>> +
>>> +   return r;
>>> +}
>>> +
>>>    static void amdgpu_userq_hang_detect_work(struct work_struct *work)
>>>    {
>>>      struct amdgpu_usermode_queue *queue = container_of(work,
>>> @@ -627,7 +673,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>>>      /* Cancel any pending hang detection work and cleanup */
>>>      cancel_delayed_work_sync(&queue->hang_detect_work);
>>>
>>> -   mutex_lock(&uq_mgr->userq_mutex);
>> We can't release the locks here; we still need them while updating
>> hang_detect_fence and in all the other functions that follow.
> It does not release the userq lock; rather, the mutex is already acquired by the eviction fence suspend work.
> And I have reworked the hang queue detection a bit in this patch so that it doesn't assert the lock once the queue is fully dereferenced.
True, I missed that. But there is another problem: we can't call the
functions below with the lock held, as that causes a deadlock again,
since the resume work and the hang detect work both take the lock too.

      cancel_delayed_work_sync(&uq_mgr->resume_work);
     /* Cancel any pending hang detection work and cleanup */
      cancel_delayed_work_sync(&queue->hang_detect_work);
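
A minimal sketch of the recursion, with hypothetical stand-in types
rather than the actual driver structures: the work item takes the
same mutex the canceller already holds, so cancel_delayed_work_sync()
waits forever on a work item that is blocked on us.

#include <linux/mutex.h>
#include <linux/workqueue.h>

struct uq_mgr_sketch {
	struct mutex userq_mutex;
	struct delayed_work hang_detect_work;
};

static void hang_detect_fn(struct work_struct *work)
{
	struct uq_mgr_sketch *mgr = container_of(to_delayed_work(work),
						 struct uq_mgr_sketch,
						 hang_detect_work);

	mutex_lock(&mgr->userq_mutex);	/* blocks behind the canceller */
	/* ... hang detection ... */
	mutex_unlock(&mgr->userq_mutex);
}

static void destroy_path(struct uq_mgr_sketch *mgr)
{
	mutex_lock(&mgr->userq_mutex);
	/*
	 * Deadlock: we hold userq_mutex while waiting for a work item
	 * that needs userq_mutex to finish, so the cancel has to
	 * happen before the lock is taken.
	 */
	cancel_delayed_work_sync(&mgr->hang_detect_work);
	mutex_unlock(&mgr->userq_mutex);
}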

>
>
>>>      queue->hang_detect_fence = NULL;
>>>      amdgpu_userq_wait_for_last_fence(queue);
>>>
>>> @@ -649,7 +694,7 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>>>    #if defined(CONFIG_DEBUG_FS)
>>>      debugfs_remove_recursive(queue->debugfs_queue);
>>>    #endif
>>> -   amdgpu_userq_detect_and_reset_queues(uq_mgr);
>>> +   amdgpu_userq_perq_detect_and_reset_queues(uq_mgr, queue);
>> The possibility of the deadlock seems correct, and I found some other
>> places with the same problem. But we can't leave the locks out like this;
>> we still need the lock for the cleanup and the rest of the function. I am
>> looking into it and will share a fix where we don't have to drop the
>> locks, which is probably a better way.
> We may need to resolve the deadlocks case by case, and this patch aims to resolve the
> userq destroy deadlock in the eviction fence suspend work. I hope this patch helps you
> observe similar deadlock issues.

Yeah, we could, but I feel it needs a redesign, or a get/put scheme that
handles this from the caller itself, whether the lock is already taken
or not. I have a patch for this which fixes other corner cases too;
let's check it out once.
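
Roughly what I have in mind, as a sketch only (the helper names here
are made up, not from an actual patch): the final put does the
teardown after the queue is already off the manager list, so no
caller needs to know whether userq_mutex is held.

#include <linux/kref.h>

/* hypothetical release callback, invoked by the last kref_put() */
static void amdgpu_userq_release(struct kref *kref)
{
	struct amdgpu_usermode_queue *queue =
		container_of(kref, struct amdgpu_usermode_queue, refcount);

	/* the queue is already off the manager list, so no lock is needed */
	amdgpu_userq_cleanup(queue);
}

/* hypothetical helper: callable with or without userq_mutex held */
static inline void amdgpu_userq_put(struct amdgpu_usermode_queue *queue)
{
	kref_put(&queue->refcount, amdgpu_userq_release);
}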

Regards
Sunil khatri

>
>> Regards
>> Sunil khatri
>>
>>>      r = amdgpu_userq_unmap_helper(queue);
>>>      /*TODO: It requires a reset for userq hw unmap error*/
>>>      if (unlikely(r != AMDGPU_USERQ_STATE_UNMAPPED)) {
>>> @@ -657,7 +702,6 @@ amdgpu_userq_destroy(struct amdgpu_userq_mgr *uq_mgr, struct amdgpu_usermode_que
>>>              queue->state = AMDGPU_USERQ_STATE_HUNG;
>>>      }
>>>      amdgpu_userq_cleanup(queue);
>>> -   mutex_unlock(&uq_mgr->userq_mutex);
>>>
>>>      pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
>>>


Thread overview: 9+ messages
2026-03-19  8:21 [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Prike Liang
2026-03-19  8:21 ` [PATCH 2/3] drm/amdgpu: fix syncobj leak for amdgpu_gem_va_ioctl() Prike Liang
2026-03-19  8:44   ` Christian König
2026-03-19  8:21 ` [PATCH 3/3] drm/amdgpu: fix the userq destroy dead lock Prike Liang
2026-03-19  8:46   ` Christian König
2026-03-20  7:56   ` Khatri, Sunil
2026-03-20  8:49     ` Liang, Prike
2026-03-20  9:17       ` Khatri, Sunil [this message]
2026-03-19  8:41 ` [PATCH 1/3] drm/amdgpu: fix the tlb flush fence leak Christian König
