From: "Chen, Xiaogang" <xiaogang.chen@amd.com>
To: Philip Yang <yangp@amd.com>, Yifan Zhang <yifan1.zhang@amd.com>,
amd-gfx@lists.freedesktop.org
Cc: Alexander.Deucher@amd.com, Felix.Kuehling@amd.com,
Philip.Yang@amd.com, Lijo.Lazar@amd.com
Subject: Re: [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw
Date: Thu, 25 Sep 2025 14:11:21 -0500 [thread overview]
Message-ID: <08f80a38-2aa6-49fd-a16d-2b460f2a2403@amd.com> (raw)
In-Reply-To: <ccc729b8-640e-6e71-3522-7208413990f0@amd.com>
On 9/24/2025 5:48 PM, Philip Yang wrote:
>
> On 2025-09-24 11:29, Yifan Zhang wrote:
>> There is race in amdgpu_amdkfd_device_fini_sw and interrupt.
>> if amdgpu_amdkfd_device_fini_sw run in b/w kfd_cleanup_nodes and
>> kfree(kfd), and KGD interrupt generated.
>>
>> kernel panic log:
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000098
>> amdgpu 0000:c8:00.0: amdgpu: Requesting 4 partitions through PSP
>>
>> PGD d78c68067 P4D d78c68067
>>
>> kfd kfd: amdgpu: Allocated 3969056 bytes on gart
>>
>> PUD 1465b8067 PMD @
>>
>> Oops: @002 [#1] SMP NOPTI
>>
>> kfd kfd: amdgpu: Total number of KFD nodes to be created: 4
>> CPU: 115 PID: @ Comm: swapper/115 Kdump: loaded Tainted: G S W OE K
>>
>> RIP: 0010:_raw_spin_lock_irqsave+0x12/0x40
>>
>> Code: 89 e@ 41 5c c3 cc cc cc cc 66 66 2e Of 1f 84 00 00 00 00 00 OF
>> 1f 40 00 Of 1f 44% 00 00 41 54 9c 41 5c fa 31 cO ba 01 00 00 00 <fO>
>> OF b1 17 75 Ba 4c 89 e@ 41 Sc
>>
>> 89 c6 e8 07 38 5d
>>
>> RSP: 0018: ffffc90@1a6b0e28 EFLAGS: 00010046
>>
>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018
>> 0000000000000001 RSI: ffff8883bb623e00 RDI: 0000000000000098
>> ffff8883bb000000 RO8: ffff888100055020 ROO: ffff888100055020
>> 0000000000000000 R11: 0000000000000000 R12: 0900000000000002
>> ffff888F2b97da0@ R14: @000000000000098 R15: ffff8883babdfo00
>>
>> CS: 010 DS: 0000 ES: 0000 CRO: 0000000080050033
>>
>> CR2: 0000000000000098 CR3: 0000000e7cae2006 CR4: 0000000002770ce0
>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> 0000000000000000 DR6: 00000000fffeO7FO DR7: 0000000000000400
>>
>> PKRU: 55555554
>>
>> Call Trace:
>>
>> <IRQ>
>>
>> kgd2kfd_interrupt+@x6b/0x1f@ [amdgpu]
>>
>> ? amdgpu_fence_process+0xa4/0x150 [amdgpu]
>>
>> kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 3 YcpxFl Rant tErace
>>
>> amdgpu_irq_dispatch+0x165/0x210 [amdgpu]
>>
>> amdgpu_ih_process+0x80/0x100 [amdgpu]
>>
>> amdgpu: Virtual CRAT table created for GPU
>>
>> amdgpu_irq_handler+0x1f/@x60 [amdgpu]
>>
>> __handle_irq_event_percpu+0x3d/0x170
>>
>> amdgpu: Topology: Add dGPU node [0x74a2:0x1002]
>>
>> handle_irq_event+0x5a/@xcO
>>
>> handle_edge_irq+0x93/0x240
>>
>> kfd kfd: amdgpu: KFD node 1 partition @ size 49148M
>>
>> asm_call_irq_on_stack+0xf/@x20
>>
>> </IRQ>
>>
>> common_interrupt+0xb3/0x130
>>
>> asm_common_interrupt+0x1le/0x40
>>
>> 5.10.134-010.a1i5000.a18.x86_64 #1
>>
>> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
> Reviewed-by: Philip Yang<Philip.Yang@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 10 +++++++++-
>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index 349c351e242b..051a00152b08 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -1133,7 +1133,15 @@ void kgd2kfd_interrupt(struct kfd_dev *kfd,
>> const void *ih_ring_entry)
>> }
>> for (i = 0; i < kfd->num_nodes; i++) {
>> - node = kfd->nodes[i];
>> + /* Race if another thread in b/w
>> + * kfd_cleanup_nodes and kfree(kfd),
>> + * when kfd->nodes[i] = NULL
>> + */
>> + if (kfd->nodes[i])
>> + node = kfd->nodes[i];
>> + else
>> + return;
>> +
KFD interrupt is handled later in wq node->kfd->ih_wq at interrupt_wq
which uses kfd->nodes[i]. Checking " kfd->nodes[i] == NULL" here is
not enough as kfd_cleanup_nodes can free kfd node at any time.
Regards
Xiaogang
>> spin_lock_irqsave(&node->interrupt_lock, flags);
>> if (node->interrupts_active
next prev parent reply other threads:[~2025-09-25 19:11 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-24 15:29 [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw Yifan Zhang
2025-09-24 15:29 ` [PATCH v4 2/2] amd/amdkfd: enhance kfd process check in switch partition Yifan Zhang
2025-09-24 23:14 ` Philip Yang
2025-09-26 19:52 ` Chen, Xiaogang
2025-09-29 13:09 ` Philip Yang
2025-09-24 22:48 ` [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw Philip Yang
2025-09-25 6:19 ` Lazar, Lijo
2025-09-25 6:41 ` Zhang, Yifan
2025-09-25 6:58 ` Lazar, Lijo
2025-09-25 7:06 ` Lazar, Lijo
2025-09-25 9:54 ` Zhang, Yifan
2025-09-26 6:49 ` Lazar, Lijo
2025-09-26 14:33 ` Zhang, Yifan
2025-09-25 19:11 ` Chen, Xiaogang [this message]
2025-09-26 1:15 ` Zhang, Yifan
2025-09-26 15:26 ` Chen, Xiaogang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=08f80a38-2aa6-49fd-a16d-2b460f2a2403@amd.com \
--to=xiaogang.chen@amd.com \
--cc=Alexander.Deucher@amd.com \
--cc=Felix.Kuehling@amd.com \
--cc=Lijo.Lazar@amd.com \
--cc=Philip.Yang@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=yangp@amd.com \
--cc=yifan1.zhang@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox