public inbox for amd-gfx@lists.freedesktop.org
From: "Christian König" <christian.koenig@amd.com>
To: "Liang, Prike" <Prike.Liang@amd.com>,
	"Khatri, Sunil" <Sunil.Khatri@amd.com>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 04/11] drm/amdgpu: rework amdgpu_userq_signal_ioctl
Date: Fri, 24 Apr 2026 15:02:21 +0200	[thread overview]
Message-ID: <fead91ae-a962-4124-88fb-ac746cfc525c@amd.com> (raw)
In-Reply-To: <PH7PR12MB60002D0F42DB677AC0C2F40AFB2B2@PH7PR12MB6000.namprd12.prod.outlook.com>

Hi Prike,

On 4/24/26 10:01, Liang, Prike wrote:
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, April 23, 2026 6:48 PM
>> To: Liang, Prike <Prike.Liang@amd.com>; Khatri, Sunil <Sunil.Khatri@amd.com>
>> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 04/11] drm/amdgpu: rework amdgpu_userq_signal_ioctl
>>
>> Hi guys,
>>
>> On 4/23/26 11:58, Liang, Prike wrote:
>> ...
>>>> -static int amdgpu_userq_fence_alloc(struct amdgpu_userq_fence
>>>> **userq_fence)
>>>> +static int amdgpu_userq_fence_alloc(struct amdgpu_usermode_queue *userq,
>>>> +                                 struct amdgpu_userq_fence **pfence)
>>>>  {
>>>> -     *userq_fence = kmalloc(sizeof(**userq_fence), GFP_ATOMIC);
>>>> -     return *userq_fence ? 0 : -ENOMEM;
>>>> +     struct amdgpu_userq_fence_driver *fence_drv = userq->fence_drv;
>>>> +     struct amdgpu_userq_fence *userq_fence;
>>>> +     unsigned long count;
>>> We must initialize count; otherwise, it may contain a garbage value,
>>> which can cause amdgpu_userq_fence_alloc() to fail and, in turn, make
>>> userq fence emission fail.
>>
>> I've got the same comment from both Sunil and Prike, but as far as I can
>> see that is actually incorrect.
> This patch breaks the userq fence emit path, causing desktop boot to fail. Initializing count only works around the amdgpu_userq_fence_alloc() failure, and it doesn't address the root cause, which is that xa_find() cannot initialize count when fence_drv_xa itself hasn't been set up yet. Instead of just initializing count, we may need to check the return value of xa_find(), and if no wait fences are pending, skip retrieving the wait fence array entirely.

Yeah, Sunil and I figured out what was wrong here.

I was looking at the xas_find() function and thought that xa_find() would be just a wrapper around it.

But it doesn't work like that. So I not only need to initialize count, but also use the xas_find() function directly.

Thanks for pointing that out,
Christian.

> 
>>>
>>>> +     userq_fence = kmalloc(sizeof(*userq_fence), GFP_KERNEL);
>>>> +     if (!userq_fence)
>>>> +             return -ENOMEM;
>>>> +
>>>> +     /*
>>>> +      * Get the next unused entry, since we fill from the start this can be
>>>> +      * used as size to allocate the array.
>>>> +      */
>>>> +     mutex_lock(&userq->fence_drv_lock);
>>>> +     xa_find(&userq->fence_drv_xa, &count, ULONG_MAX, XA_FREE_MARK);
>>
>> The count should be initialized here. But it could be that this doesn't work.
>>
>> Did you guys get a KASAN warning or something like that?
> I didn't see the KASAN warning. However, the underlying problem is that when fence_drv_xa hasn't been set up, count remains uninitialized (garbage), which eventually causes kvmalloc_array() to fail when allocating fence_drv_array.
> 
>>>> +
>>>> +     userq_fence->fence_drv_array = kvmalloc_array(count, sizeof(fence_drv),
>>>> +                                                   GFP_KERNEL);
>>>> +     if (!userq_fence->fence_drv_array) {
>>>> +             mutex_unlock(&userq->fence_drv_lock);
>>>> +             kfree(userq_fence);
>>>> +             return -ENOMEM;
>>>> +     }
>>>> +
>>>> +     userq_fence->fence_drv_array_count = count;
>>>> +     xa_extract(&userq->fence_drv_xa, (void **)userq_fence->fence_drv_array,
>>>> +                0, ULONG_MAX, count, XA_PRESENT);
>>> We may need to assign userq_fence->fence_drv_array_count the exact
>>> copied number from xa_extract().
>>
>> Interesting point. Why could that differ?
> Generally, xa_extract() should return the same number as count, but when there's a retry entry, the actual number of copied entries may differ from the wait fence array capacity indicated by count.
> 
>> Thanks for the comments,
>> Christian.



Thread overview: 38+ messages
2026-04-21 12:55 [PATCH 01/11] drm/amdgpu: fix AMDGPU_INFO_READ_MMR_REG Christian König
2026-04-21 12:55 ` [PATCH 02/11] drm/amdgpu: remove deadlocks from amdgpu_userq_pre_reset Christian König
2026-04-22  4:53   ` Khatri, Sunil
2026-04-22  7:13     ` Christian König
2026-04-22  7:19       ` Khatri, Sunil
2026-04-22  7:24         ` Christian König
2026-04-22  7:29           ` Khatri, Sunil
2026-04-27  8:45   ` Liang, Prike
2026-04-21 12:55 ` [PATCH 03/11] drm/amdgpu: nuke amdgpu_userq_fence_free Christian König
2026-04-22  8:29   ` Khatri, Sunil
2026-04-22  9:26     ` Christian König
2026-04-22  9:40       ` Khatri, Sunil
2026-04-22 10:12         ` Christian König
2026-04-22 14:32           ` Khatri, Sunil
2026-04-27  6:21   ` Liang, Prike
2026-04-21 12:55 ` [PATCH 04/11] drm/amdgpu: rework amdgpu_userq_signal_ioctl Christian König
2026-04-22 10:08   ` Khatri, Sunil
2026-04-22 10:14     ` Christian König
2026-04-22 15:14       ` Khatri, Sunil
2026-04-23  9:58   ` Liang, Prike
2026-04-23 10:47     ` Christian König
2026-04-23 10:54       ` Khatri, Sunil
2026-04-24  8:01       ` Liang, Prike
2026-04-24 13:02         ` Christian König [this message]
2026-04-21 12:55 ` [PATCH 05/11] drm/amdgpu: rework userq fence signal processing Christian König
2026-04-22 10:16   ` Khatri, Sunil
2026-04-21 12:55 ` [PATCH 06/11] drm/amdgpu: remove almost all calls to amdgpu_userq_detect_and_reset_queues Christian König
2026-04-22 10:20   ` Khatri, Sunil
2026-04-21 12:55 ` [PATCH 07/11] drm/amdgpu: fix userq hang detection and reset Christian König
2026-04-22 10:35   ` Khatri, Sunil
2026-04-21 12:55 ` [PATCH 08/11] drm/amdgpu: rework userq reset work handling Christian König
2026-04-23 10:43   ` Khatri, Sunil
2026-04-21 12:55 ` [PATCH 09/11] drm/amdgpu: revert to old status lock handling v4 Christian König
2026-04-23 10:45   ` Khatri, Sunil
2026-04-21 12:55 ` [PATCH 10/11] drm/amdgpu: restructure VM state machine v2 Christian König
2026-04-23 10:46   ` Khatri, Sunil
2026-04-21 12:55 ` [PATCH 11/11] drm/amdgpu: WIP sync amdgpu_ttm_fill_mem only to kernel fences Christian König
2026-04-23 10:47   ` Khatri, Sunil
