AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Chen, Xiaogang" <xiaogang.chen@amd.com>
To: "Zhang, Yifan" <Yifan1.Zhang@amd.com>,
	"Yang, Philip" <Philip.Yang@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Lazar, Lijo" <Lijo.Lazar@amd.com>
Subject: Re: [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw
Date: Fri, 26 Sep 2025 10:26:35 -0500	[thread overview]
Message-ID: <ed839495-0c90-47d8-bfad-5d4ae87a3c05@amd.com> (raw)
In-Reply-To: <CY5PR12MB6369CEE4F7341EA964855FBDC11EA@CY5PR12MB6369.namprd12.prod.outlook.com>


On 9/25/2025 8:15 PM, Zhang, Yifan wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> flush_workqueue(kfd->ih_wq) and destroy_workqueue(kfd->ih_wq) in kfd_cleanup_nodes clean up pending work items, and node->interrupts_active check prevent new work items from being enqueued. So after kfd_cleanup_nodes free kfd node, there is no pending kfd ih_wq work items.
>
> And I agree there is still potential race here, since kfd->nodes[i] NULL check is not protected by lock. Will address it together with the issue Lijo mentioned.

yes, the ih_wq is flushed and destroyed before kfd node got freed. But 
still, kfd->nodes[i] can be null at any time for kgd2kfd_interrupt at a 
different task.

I think the real issue is kfd->nodes[i] was not checked under 
node->interrupt_lock. Both kfd_cleanup_nodes and kgd2kfd_interrupt need 
use this lock when operate on a kfd node. Currently only 
kgd2kfd_interrupt uses it and does not include kfd->nodes[i]. 
kfd_cleanup_nodes also needs use this lock to exclude operation on kfd 
node from kgd2kfd_interrupt.

Regards

Xiaogang

>
>
> -----Original Message-----
> From: Chen, Xiaogang <Xiaogang.Chen@amd.com>
> Sent: Friday, September 26, 2025 3:11 AM
> To: Yang, Philip <Philip.Yang@amd.com>; Zhang, Yifan <Yifan1.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Yang, Philip <Philip.Yang@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>
> Subject: Re: [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw
>
>
> On 9/24/2025 5:48 PM, Philip Yang wrote:
>> On 2025-09-24 11:29, Yifan Zhang wrote:
>>> There is race in amdgpu_amdkfd_device_fini_sw and interrupt.
>>> if amdgpu_amdkfd_device_fini_sw run in b/w kfd_cleanup_nodes and
>>>     kfree(kfd), and KGD interrupt generated.
>>>
>>> kernel panic log:
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000098
>>> amdgpu 0000:c8:00.0: amdgpu: Requesting 4 partitions through PSP
>>>
>>> PGD d78c68067 P4D d78c68067
>>>
>>> kfd kfd: amdgpu: Allocated 3969056 bytes on gart
>>>
>>> PUD 1465b8067 PMD @
>>>
>>> Oops: @002 [#1] SMP NOPTI
>>>
>>> kfd kfd: amdgpu: Total number of KFD nodes to be created: 4
>>> CPU: 115 PID: @ Comm: swapper/115 Kdump: loaded Tainted: G S W OE K
>>>
>>> RIP: 0010:_raw_spin_lock_irqsave+0x12/0x40
>>>
>>> Code: 89 e@ 41 5c c3 cc cc cc cc 66 66 2e Of 1f 84 00 00 00 00 00 OF
>>> 1f 40 00 Of 1f 44% 00 00 41 54 9c 41 5c fa 31 cO ba 01 00 00 00 <fO>
>>> OF b1 17 75 Ba 4c 89 e@ 41 Sc
>>>
>>> 89 c6 e8 07 38 5d
>>>
>>> RSP: 0018: ffffc90@1a6b0e28 EFLAGS: 00010046
>>>
>>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018
>>> 0000000000000001 RSI: ffff8883bb623e00 RDI: 0000000000000098
>>> ffff8883bb000000 RO8: ffff888100055020 ROO: ffff888100055020
>>> 0000000000000000 R11: 0000000000000000 R12: 0900000000000002
>>> ffff888F2b97da0@ R14: @000000000000098 R15: ffff8883babdfo00
>>>
>>> CS: 010 DS: 0000 ES: 0000 CRO: 0000000080050033
>>>
>>> CR2: 0000000000000098 CR3: 0000000e7cae2006 CR4: 0000000002770ce0
>>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> 0000000000000000 DR6: 00000000fffeO7FO DR7: 0000000000000400
>>>
>>> PKRU: 55555554
>>>
>>> Call Trace:
>>>
>>> <IRQ>
>>>
>>> kgd2kfd_interrupt+@x6b/0x1f@ [amdgpu]
>>>
>>> ? amdgpu_fence_process+0xa4/0x150 [amdgpu]
>>>
>>> kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 3 YcpxFl Rant tErace
>>>
>>> amdgpu_irq_dispatch+0x165/0x210 [amdgpu]
>>>
>>> amdgpu_ih_process+0x80/0x100 [amdgpu]
>>>
>>> amdgpu: Virtual CRAT table created for GPU
>>>
>>> amdgpu_irq_handler+0x1f/@x60 [amdgpu]
>>>
>>> __handle_irq_event_percpu+0x3d/0x170
>>>
>>> amdgpu: Topology: Add dGPU node [0x74a2:0x1002]
>>>
>>> handle_irq_event+0x5a/@xcO
>>>
>>> handle_edge_irq+0x93/0x240
>>>
>>> kfd kfd: amdgpu: KFD node 1 partition @ size 49148M
>>>
>>> asm_call_irq_on_stack+0xf/@x20
>>>
>>> </IRQ>
>>>
>>> common_interrupt+0xb3/0x130
>>>
>>> asm_common_interrupt+0x1le/0x40
>>>
>>> 5.10.134-010.a1i5000.a18.x86_64 #1
>>>
>>> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
>> Reviewed-by: Philip Yang<Philip.Yang@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdkfd/kfd_device.c | 10 +++++++++-
>>>    1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index 349c351e242b..051a00152b08 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -1133,7 +1133,15 @@ void kgd2kfd_interrupt(struct kfd_dev *kfd,
>>> const void *ih_ring_entry)
>>>        }
>>>          for (i = 0; i < kfd->num_nodes; i++) {
>>> -        node = kfd->nodes[i];
>>> +        /* Race if another thread in b/w
>>> +         * kfd_cleanup_nodes and kfree(kfd),
>>> +         * when kfd->nodes[i] = NULL
>>> +         */
>>> +        if (kfd->nodes[i])
>>> +            node = kfd->nodes[i];
>>> +        else
>>> +            return;
>>> +
> KFD interrupt is handled later in wq node->kfd->ih_wq at interrupt_wq which uses  kfd->nodes[i].  Checking " kfd->nodes[i] == NULL" here is not enough as kfd_cleanup_nodes can free kfd node at any time.
>
> Regards
>
> Xiaogang
>
>>> spin_lock_irqsave(&node->interrupt_lock, flags);
>>>              if (node->interrupts_active

      reply	other threads:[~2025-09-26 15:26 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-24 15:29 [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw Yifan Zhang
2025-09-24 15:29 ` [PATCH v4 2/2] amd/amdkfd: enhance kfd process check in switch partition Yifan Zhang
2025-09-24 23:14   ` Philip Yang
2025-09-26 19:52   ` Chen, Xiaogang
2025-09-29 13:09     ` Philip Yang
2025-09-24 22:48 ` [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw Philip Yang
2025-09-25  6:19   ` Lazar, Lijo
2025-09-25  6:41     ` Zhang, Yifan
2025-09-25  6:58       ` Lazar, Lijo
2025-09-25  7:06         ` Lazar, Lijo
2025-09-25  9:54           ` Zhang, Yifan
2025-09-26  6:49             ` Lazar, Lijo
2025-09-26 14:33               ` Zhang, Yifan
2025-09-25 19:11   ` Chen, Xiaogang
2025-09-26  1:15     ` Zhang, Yifan
2025-09-26 15:26       ` Chen, Xiaogang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ed839495-0c90-47d8-bfad-5d4ae87a3c05@amd.com \
    --to=xiaogang.chen@amd.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Lijo.Lazar@amd.com \
    --cc=Philip.Yang@amd.com \
    --cc=Yifan1.Zhang@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox