AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw
@ 2025-09-24 15:29 Yifan Zhang
  2025-09-24 15:29 ` [PATCH v4 2/2] amd/amdkfd: enhance kfd process check in switch partition Yifan Zhang
  2025-09-24 22:48 ` [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw Philip Yang
  0 siblings, 2 replies; 16+ messages in thread
From: Yifan Zhang @ 2025-09-24 15:29 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, Felix.Kuehling, Philip.Yang, Lijo.Lazar,
	Yifan Zhang

There is race in amdgpu_amdkfd_device_fini_sw and interrupt.
if amdgpu_amdkfd_device_fini_sw run in b/w kfd_cleanup_nodes and
  kfree(kfd), and KGD interrupt generated.

kernel panic log:

BUG: kernel NULL pointer dereference, address: 0000000000000098
amdgpu 0000:c8:00.0: amdgpu: Requesting 4 partitions through PSP

PGD d78c68067 P4D d78c68067

kfd kfd: amdgpu: Allocated 3969056 bytes on gart

PUD 1465b8067 PMD @

Oops: @002 [#1] SMP NOPTI

kfd kfd: amdgpu: Total number of KFD nodes to be created: 4
CPU: 115 PID: @ Comm: swapper/115 Kdump: loaded Tainted: G S W OE K

RIP: 0010:_raw_spin_lock_irqsave+0x12/0x40

Code: 89 e@ 41 5c c3 cc cc cc cc 66 66 2e Of 1f 84 00 00 00 00 00 OF 1f 40 00 Of 1f 44% 00 00 41 54 9c 41 5c fa 31 cO ba 01 00 00 00 <fO> OF b1 17 75 Ba 4c 89 e@ 41 Sc

89 c6 e8 07 38 5d

RSP: 0018: ffffc90@1a6b0e28 EFLAGS: 00010046

RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000018
0000000000000001 RSI: ffff8883bb623e00 RDI: 0000000000000098
ffff8883bb000000 RO8: ffff888100055020 ROO: ffff888100055020
0000000000000000 R11: 0000000000000000 R12: 0900000000000002
ffff888F2b97da0@ R14: @000000000000098 R15: ffff8883babdfo00

CS: 010 DS: 0000 ES: 0000 CRO: 0000000080050033

CR2: 0000000000000098 CR3: 0000000e7cae2006 CR4: 0000000002770ce0
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
0000000000000000 DR6: 00000000fffeO7FO DR7: 0000000000000400

PKRU: 55555554

Call Trace:

<IRQ>

kgd2kfd_interrupt+@x6b/0x1f@ [amdgpu]

? amdgpu_fence_process+0xa4/0x150 [amdgpu]

kfd kfd: amdgpu: Node: 0, interrupt_bitmap: 3 YcpxFl Rant tErace

amdgpu_irq_dispatch+0x165/0x210 [amdgpu]

amdgpu_ih_process+0x80/0x100 [amdgpu]

amdgpu: Virtual CRAT table created for GPU

amdgpu_irq_handler+0x1f/@x60 [amdgpu]

__handle_irq_event_percpu+0x3d/0x170

amdgpu: Topology: Add dGPU node [0x74a2:0x1002]

handle_irq_event+0x5a/@xcO

handle_edge_irq+0x93/0x240

kfd kfd: amdgpu: KFD node 1 partition @ size 49148M

asm_call_irq_on_stack+0xf/@x20

</IRQ>

common_interrupt+0xb3/0x130

asm_common_interrupt+0x1le/0x40

5.10.134-010.a1i5000.a18.x86_64 #1

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 349c351e242b..051a00152b08 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -1133,7 +1133,15 @@ void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry)
 	}
 
 	for (i = 0; i < kfd->num_nodes; i++) {
-		node = kfd->nodes[i];
+		/* Race if another thread in b/w
+		 * kfd_cleanup_nodes and kfree(kfd),
+		 * when kfd->nodes[i] = NULL
+		 */
+		if (kfd->nodes[i])
+			node = kfd->nodes[i];
+		else
+			return;
+
 		spin_lock_irqsave(&node->interrupt_lock, flags);
 
 		if (node->interrupts_active
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-09-29 13:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-24 15:29 [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw Yifan Zhang
2025-09-24 15:29 ` [PATCH v4 2/2] amd/amdkfd: enhance kfd process check in switch partition Yifan Zhang
2025-09-24 23:14   ` Philip Yang
2025-09-26 19:52   ` Chen, Xiaogang
2025-09-29 13:09     ` Philip Yang
2025-09-24 22:48 ` [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw Philip Yang
2025-09-25  6:19   ` Lazar, Lijo
2025-09-25  6:41     ` Zhang, Yifan
2025-09-25  6:58       ` Lazar, Lijo
2025-09-25  7:06         ` Lazar, Lijo
2025-09-25  9:54           ` Zhang, Yifan
2025-09-26  6:49             ` Lazar, Lijo
2025-09-26 14:33               ` Zhang, Yifan
2025-09-25 19:11   ` Chen, Xiaogang
2025-09-26  1:15     ` Zhang, Yifan
2025-09-26 15:26       ` Chen, Xiaogang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox