From: "Christian König" <christian.koenig@amd.com>
To: "Liang, Prike" <Prike.Liang@amd.com>,
	"Khatri, Sunil" <Sunil.Khatri@amd.com>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 04/11] drm/amdgpu: rework amdgpu_userq_signal_ioctl
Date: Fri, 24 Apr 2026 15:02:21 +0200
Message-ID: <fead91ae-a962-4124-88fb-ac746cfc525c@amd.com>
In-Reply-To: <PH7PR12MB60002D0F42DB677AC0C2F40AFB2B2@PH7PR12MB6000.namprd12.prod.outlook.com>

Hi Prike,

On 4/24/26 10:01, Liang, Prike wrote:
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, April 23, 2026 6:48 PM
>> To: Liang, Prike <Prike.Liang@amd.com>; Khatri, Sunil <Sunil.Khatri@amd.com>
>> Cc: Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander
>> <Alexander.Deucher@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 04/11] drm/amdgpu: rework amdgpu_userq_signal_ioctl
>>
>> Hi guys,
>>
>> On 4/23/26 11:58, Liang, Prike wrote:
>> ...
>>>> -static int amdgpu_userq_fence_alloc(struct amdgpu_userq_fence **userq_fence)
>>>> +static int amdgpu_userq_fence_alloc(struct amdgpu_usermode_queue *userq,
>>>> +                                 struct amdgpu_userq_fence **pfence)
>>>>  {
>>>> -     *userq_fence = kmalloc(sizeof(**userq_fence), GFP_ATOMIC);
>>>> -     return *userq_fence ? 0 : -ENOMEM;
>>>> +     struct amdgpu_userq_fence_driver *fence_drv = userq->fence_drv;
>>>> +     struct amdgpu_userq_fence *userq_fence;
>>>> +     unsigned long count;
>>> We must initialize count; otherwise, it may contain a garbage value,
>>> which can cause amdgpu_userq_fence_alloc() to fail and, in turn, make userq
>>> fence emission fail.
>>
>> I've got the same comment from both Sunil and Prike, but as far as I can see
>> that is actually incorrect.
> This patch breaks the userq fence emit path, causing desktop boot to fail. Initializing count only works around the amdgpu_userq_fence_alloc() failure, and it doesn't address the root cause, which is that xa_find() cannot initialize count when fence_drv_xa itself hasn't been set up yet. Instead of just initializing count, we may need to check the return value of xa_find(), and if no wait fences are pending, skip retrieving the wait fence array entirely.

Yeah, Sunil and I figured out what was wrong here.

I was looking at the xas_find() function and thought that xa_find() would just be a wrapper around it.

But it doesn't work like that. So I not only need to initialize count, but also need to use the xas_find() function directly.
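
A minimal userspace sketch of the index contract discussed above, assuming a lookup helper that reads *indexp as the start of the search and only writes it back when an entry is found. The names model_xa, model_find and model_first_index are hypothetical stand-ins for illustration and are not kernel APIs; the sketch shows why the index has to be initialized and the return value checked.

```c
#include <stddef.h>

#define MODEL_SLOTS 8

/* Hypothetical stand-in for an xarray: a small fixed table of pointers. */
struct model_xa {
	void *slot[MODEL_SLOTS];
};

/*
 * Lookup with an in/out index: the search starts at *indexp and
 * *indexp is only updated when an entry is actually found.
 */
static void *model_find(const struct model_xa *xa, unsigned long *indexp,
			unsigned long max)
{
	unsigned long i;

	for (i = *indexp; i <= max && i < MODEL_SLOTS; i++) {
		if (xa->slot[i]) {
			*indexp = i;
			return xa->slot[i];
		}
	}
	return NULL;	/* miss: *indexp keeps whatever the caller put in */
}

/* Returns 1 and stores the first used index in *out, or 0 if the table is empty. */
static int model_first_index(const struct model_xa *xa, unsigned long *out)
{
	unsigned long index = 0;	/* start index, must be initialized */

	if (!model_find(xa, &index, MODEL_SLOTS - 1))
		return 0;	/* nothing found, so index is not a result */

	*out = index;
	return 1;
}
```

With an empty table model_find() returns NULL and leaves the index at the caller's value, so an uninitialized index would be passed on unchanged to whatever consumes it next.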

Thanks for pointing that out,
Christian.

> 
>>>
>>>> +     userq_fence = kmalloc(sizeof(*userq_fence), GFP_KERNEL);
>>>> +     if (!userq_fence)
>>>> +             return -ENOMEM;
>>>> +
>>>> +     /*
>>>> +      * Get the next unused entry, since we fill from the start this can be
>>>> +      * used as size to allocate the array.
>>>> +      */
>>>> +     mutex_lock(&userq->fence_drv_lock);
>>>> +     xa_find(&userq->fence_drv_xa, &count, ULONG_MAX, XA_FREE_MARK);
>>
>> The count should be initialized here. But it could be that this doesn't work.
>>
>> Did you guys get a KASAN warning or something like that?
> I didn't see the KASAN warning. However, the underlying problem is that when fence_drv_xa hasn't been set up, count remains uninitialized (garbage), which eventually causes kvmalloc_array() to fail when allocating fence_drv_array.
> 
>>>> +
>>>> +     userq_fence->fence_drv_array = kvmalloc_array(count, sizeof(fence_drv),
>>>> +                                                   GFP_KERNEL);
>>>> +     if (!userq_fence->fence_drv_array) {
>>>> +             mutex_unlock(&userq->fence_drv_lock);
>>>> +             kfree(userq_fence);
>>>> +             return -ENOMEM;
>>>> +     }
>>>> +
>>>> +     userq_fence->fence_drv_array_count = count;
>>>> +     xa_extract(&userq->fence_drv_xa, (void **)userq_fence->fence_drv_array,
>>>> +                0, ULONG_MAX, count, XA_PRESENT);
>>> We may need to assign userq_fence->fence_drv_array_count the exact number of
>>> entries copied by xa_extract().
>>
>> Interesting point. Why could that differ?
> Generally, xa_extract() should return the same number as count, but when there's a retry entry, the actual number of copied entries may differ from the wait fence array capacity indicated by count.
> 
>> Thanks for the comments,
>> Christian.
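
The suggestion about xa_extract() above can be illustrated with a small userspace sketch, assuming an extract helper that returns how many entries it actually copied. The names model_xa, model_extract, model_fence and model_fence_fill are hypothetical and only mirror the shape of the quoted code; the point is that the stored count comes from the copy result rather than from the capacity used for the allocation.

```c
#include <stdlib.h>

#define MODEL_SLOTS 8

/* Hypothetical stand-in for an xarray: a small fixed table of pointers. */
struct model_xa {
	void *slot[MODEL_SLOTS];
};

/* Hypothetical stand-in for the fence object holding the copied array. */
struct model_fence {
	void **array;
	unsigned int array_count;
};

/* Copy up to n present entries into dst and return how many were copied. */
static unsigned int model_extract(const struct model_xa *xa, void **dst,
				  unsigned int n)
{
	unsigned int copied = 0;
	unsigned long i;

	for (i = 0; i < MODEL_SLOTS && copied < n; i++) {
		if (xa->slot[i])
			dst[copied++] = xa->slot[i];
	}
	return copied;
}

/* Allocate room for capacity entries, then record the number really copied. */
static int model_fence_fill(struct model_fence *f, const struct model_xa *xa,
			    unsigned int capacity)
{
	f->array = NULL;
	f->array_count = 0;

	if (capacity == 0)
		return 0;	/* nothing pending: skip the array entirely */

	f->array = calloc(capacity, sizeof(*f->array));
	if (!f->array)
		return -1;

	/* Use the copy result, which may be smaller than capacity. */
	f->array_count = model_extract(xa, f->array, capacity);
	return 0;
}
```

Any code that later walks the array would then iterate over array_count entries and never touch slots that were allocated but not filled.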

