[PATCH v2 0/6] Fix several bugs in error handling during device probe

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/6] Fix several bugs in error handling during device probe
@ 2025-01-05  2:45 Jiang Liu
  2025-01-05  2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05  2:45 UTC (permalink / raw)
  To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu

This patchset tries to fix several memory leakages/invalid memory
accesses on error handling path during GPU driver loading/unloading.
They applies to:
https://gitlab.freedesktop.org/agd5f/linux.git amd-staging-drm-next

v2:
1) rebased to https://gitlab.freedesktop.org/agd5f/linux.git branch
   amd-staging-drm-next.
2) removed the first patch, which is unnecessary.
3) add amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp
   driver to better support device remove and error handling.
4) reworked patch 0005 to fix it in amdgpu instead of drm core.

Jiang Liu (6):
  amdgpu: fix invalid memory access in kfd_cleanup_nodes()
  amdgpu: clear adev->in_suspend flag when fails to suspend
  drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
  amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  amdgpu: get rid of false warnings caused by amdgpu_irq_put()

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 14 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 13 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c     |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     | 11 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h     |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_device.c     |  9 +--
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 70 +++++++++++++++++----
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
 9 files changed, 99 insertions(+), 25 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
  2025-01-05  2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
@ 2025-01-05  2:45 ` Jiang Liu
  2025-01-05  5:22   ` Shuo Liu
  2025-01-07 22:53   ` Chen, Xiaogang
  2025-01-05  2:45 ` [PATCH v2 2/6] amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05  2:45 UTC (permalink / raw)
  To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu

Fix possible resource leakage on error recovery path in function
kgd2kfd_device_init().

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index a29374c86405..fa5054940486 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 		if (kfd->adev->xcp_mgr)
 			kfd_setup_interrupt_bitmap(node, i);
 
+		spin_lock_init(&node->watch_points_lock);
+
+		kfd->nodes[i] = node;
+
 		/* Initialize the KFD node */
 		if (kfd_init_node(node)) {
 			dev_err(kfd_device, "Error initializing KFD node\n");
 			goto node_init_error;
 		}
-
-		spin_lock_init(&node->watch_points_lock);
-
-		kfd->nodes[i] = node;
 	}
 
 	svm_range_set_max_pages(kfd->adev);
@@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
 	goto out;
 
 node_init_error:
+	i++;
 node_alloc_error:
 	kfd_cleanup_nodes(kfd, i);
 	kfd_doorbell_fini(kfd);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/6] amdgpu: clear adev->in_suspend flag when fails to suspend
  2025-01-05  2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
  2025-01-05  2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
@ 2025-01-05  2:45 ` Jiang Liu
  2025-01-05  2:45 ` [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05  2:45 UTC (permalink / raw)
  To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu

Clear adev->in_suspend flag when fails to suspend, otherwise it will
cause too much warnings like:
[ 1802.212027] ------------[ cut here ]------------
[ 1802.212028] WARNING: CPU: 97 PID: 11282 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:452 amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212198] Modules linked in: amdgpu(E-) tcp_diag(E) inet_diag(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) iTCO_wdt(E) pmt_telemetry(E) iTCO_vendor_support(E) pmt_class(E) intel_cstate(E) snd_hda_intel(E) ipmi_ssif(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) intel_uncore(E) snd(E) cdc_ether(E) pcspkr(E) i2c_i801(E) idxd(E) mei_me(E) usbnet(E) ses(E) isst_if_mmio(E) isst_if_mbox_pci(E) mii(E) joydev(E) soundcore(E) isst_if_common(E) idxd_bus(E) mei(E) enclosure(E) intel_vsec(E) i2c_smbus(E) i2c_ismt(E) sunrpc(E) acpi_power_meter(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) acpi_pad(E) ipmi_msghandler(E) vfat(E) fat(E) sg(E) video(E) amdxcp(E) drm_ttm_helper(E) ttm(E) drm_exec(E) gpu_sched(E) drm_suballoc_helper(E) crc32c_intel(E) drm_buddy(E)
[ 1802.212223]  ast(E) drm_shmem_helper(E) drm_display_helper(E) i2c_algo_bit(E) drm_kms_helper(E) virtio_net(E) mpt3sas(E) drm(E) net_failover(E) ahci(E) raid_class(E) failover(E) libahci(E) scsi_transport_sas(E) dimlib(E) libata(E) i2c_core(E) wmi(E) pinctrl_emmitsburg(E) [last unloaded: amdgpu(E)]
[ 1802.212231] CPU: 97 PID: 11282 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
[ 1802.212232] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
[ 1802.212232] RIP: 0010:amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212401] Code: 00 00 00 4d 85 e4 74 08 49 c7 04 24 00 00 00 00 48 85 ed 74 08 48 c7 45 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b e9 3b ff ff ff 3d 00 fe ff ff 74 b3 49 8b be b0 10 ff ff 4c
[ 1802.212402] RSP: 0018:ffffbe0e92087cb0 EFLAGS: 00010202
[ 1802.212403] RAX: 0000000000000206 RBX: ffff9cbc06c37f10 RCX: 000000000020000d
[ 1802.212404] RDX: ffff9cbc06c37f18 RSI: ffff9cbc06c37f58 RDI: ffff9cbc06c37f10
[ 1802.212407] RBP: ffff9cbc06c37f18 R08: ffff9cba42388800 R09: 000000000020000d
[ 1802.212409] R10: 0000000000040000 R11: 0000000000000006 R12: ffff9cbc06c37f58
[ 1802.212410] R13: ffff9cba4238dc00 R14: ffff9cbc06c0ef50 R15: ffff9cbc06c00000
[ 1802.212411] FS:  00007f927dc4e740(0000) GS:ffff9db47e880000(0000) knlGS:0000000000000000
[ 1802.212412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1802.212413] CR2: 000056342b2be1a8 CR3: 00000003f1546005 CR4: 0000000000f70ef0
[ 1802.212413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1802.212414] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1802.212415] PKRU: 55555554
[ 1802.212415] Call Trace:
[ 1802.212416]  <TASK>
[ 1802.212417]  ? __warn+0x83/0x130
[ 1802.212419]  ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212586]  ? __report_bug+0xea/0x100
[ 1802.212588]  ? report_bug+0x24/0x70
[ 1802.212589]  ? handle_bug+0x3c/0x70
[ 1802.212590]  ? exc_invalid_op+0x18/0x70
[ 1802.212591]  ? asm_exc_invalid_op+0x1a/0x20
[ 1802.212594]  ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212746]  amdgpu_ring_fini+0x91/0x120 [amdgpu]
[ 1802.212901]  amdgpu_jpeg_sw_fini+0xb2/0xe0 [amdgpu]
[ 1802.213106]  amdgpu_device_ip_fini.isra.0+0xb1/0x1c0 [amdgpu]
[ 1802.213247]  amdgpu_device_fini_sw+0x49/0x290 [amdgpu]
[ 1802.213413]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 1802.213576]  devm_drm_dev_init_release+0x4e/0x70 [drm]
[ 1802.213602]  release_nodes+0x35/0xb0
[ 1802.213605]  devres_release_all+0x8f/0xd0
[ 1802.213606]  device_unbind_cleanup+0xe/0x70
[ 1802.213609]  device_release_driver_internal+0x1bc/0x200
[ 1802.213611]  driver_detach+0x48/0x90
[ 1802.213613]  bus_remove_driver+0x6d/0xf0
[ 1802.213615]  pci_unregister_driver+0x2e/0xb0
[ 1802.213618]  amdgpu_exit+0x15/0x1c4 [amdgpu]
[ 1802.213861]  __do_sys_delete_module.constprop.0+0x176/0x310
[ 1802.213863]  do_syscall_64+0x5d/0x170
[ 1802.213866]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1802.213869] RIP: 0033:0x7f927d6620cb
[ 1802.213870] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48
[ 1802.213871] RSP: 002b:00007fff031d9d78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 1802.213873] RAX: ffffffffffffffda RBX: 000055d8d6df69e0 RCX: 00007f927d6620cb
[ 1802.213873] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055d8d6df6a48
[ 1802.213874] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1802.213875] R10: 00007f927d7aaac0 R11: 0000000000000206 R12: 00007fff031d9fa0
[ 1802.213876] R13: 00007fff031da5f1 R14: 000055d8d6df62a0 R15: 000055d8d6df69e0
[ 1802.213878]  </TASK>
[ 1802.213879] ---[ end trace 0000000000000000 ]---

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0a121aab5c74..5ff53a3b9851 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4910,7 +4910,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
 		amdgpu_virt_fini_data_exchange(adev);
 		r = amdgpu_virt_request_full_gpu(adev, false);
 		if (r)
-			return r;
+			goto error_out;
 	}
 
 	if (amdgpu_acpi_smart_shift_update(dev, AMDGPU_SS_DEV_D3))
@@ -4930,7 +4930,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
 
 	r = amdgpu_device_evict_resources(adev);
 	if (r)
-		return r;
+		goto error_out;
 
 	amdgpu_ttm_set_buffer_funcs_status(adev, false);
 
@@ -4943,9 +4943,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
 
 	r = amdgpu_dpm_notify_rlc_state(adev, false);
 	if (r)
-		return r;
+		goto error_out;
 
 	return 0;
+error_out:
+	adev->in_suspend = false;
+	return r;
 }
 
 /**
@@ -5007,8 +5010,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 		amdgpu_virt_release_full_gpu(adev, true);
 	}
 
-	if (r)
+	if (r) {
+		adev->in_suspend = false;
 		return r;
+	}
 
 	/* Make sure IB tests flushed */
 	flush_delayed_work(&adev->delayed_init_work);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  2025-01-05  2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
  2025-01-05  2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
  2025-01-05  2:45 ` [PATCH v2 2/6] amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
@ 2025-01-05  2:45 ` Jiang Liu
  2025-01-06  6:51   ` Lazar, Lijo
  2025-01-05  2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Jiang Liu @ 2025-01-05  2:45 UTC (permalink / raw)
  To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu

Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     | 11 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h     |  1 +
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 70 +++++++++++++++++----
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
 4 files changed, 70 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index e209b5e101df..401fbaa0b6b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -359,6 +359,8 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
 		ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
 		if (ret)
 			return ret;
+
+		adev->xcp_mgr->xcp[i].registered = true;
 	}
 
 	return 0;
@@ -376,12 +378,19 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
 		if (!adev->xcp_mgr->xcp[i].ddev)
 			break;
 
+		// Restore and free the original drm_device.
 		p_ddev = adev->xcp_mgr->xcp[i].ddev;
-		drm_dev_unplug(p_ddev);
+		if (adev->xcp_mgr->xcp[i].registered) {
+			drm_dev_unplug(p_ddev);
+			adev->xcp_mgr->xcp[i].registered = false;
+		}
 		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
 		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
 		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
 		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
+		amdgpu_xcp_drm_dev_free(p_ddev);
+
+		adev->xcp_mgr->xcp[i].ddev = NULL;
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
index b63f53242c57..cd06a4a232be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
@@ -101,6 +101,7 @@ struct amdgpu_xcp {
 	uint8_t id;
 	uint8_t mem_id;
 	bool valid;
+	bool registered;
 	atomic_t	ref_cnt;
 	struct drm_device *ddev;
 	struct drm_device *rdev;
diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
index faed84172dd4..9058d71b4756 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
@@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
 
 static int8_t pdev_num;
 static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
+static struct mutex xcp_mutex;
 
 int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
 {
 	struct platform_device *pdev;
 	struct xcp_device *pxcp_dev;
 	char dev_name[20];
-	int ret;
+	int ret, index;
 
+	mutex_lock(&xcp_mutex);
+	ret = -ENODEV;
 	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
-		return -ENODEV;
+		goto out_unlock;
 
-	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
+	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+		if (!xcp_dev[index])
+			break;
+	}
+
+	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
 	pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
 	if (IS_ERR(pdev))
 		return PTR_ERR(pdev);
@@ -72,10 +80,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
 		goto out_devres;
 	}
 
-	xcp_dev[pdev_num] = pxcp_dev;
-	xcp_dev[pdev_num]->pdev = pdev;
+	xcp_dev[index] = pxcp_dev;
+	xcp_dev[index]->pdev = pdev;
 	*ddev = &pxcp_dev->drm;
 	pdev_num++;
+	mutex_unlock(&xcp_mutex);
 
 	return 0;
 
@@ -83,21 +92,58 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
 	devres_release_group(&pdev->dev, NULL);
 out_unregister:
 	platform_device_unregister(pdev);
+out_unlock:
+	mutex_unlock(&xcp_mutex);
 
 	return ret;
 }
 EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
 
-void amdgpu_xcp_drv_release(void)
+static void amdgpu_xcp_drm_dev_destroy(int index)
+{
+	struct platform_device *pdev;
+
+	pdev = xcp_dev[index]->pdev;
+	devres_release_group(&pdev->dev, NULL);
+	platform_device_unregister(pdev);
+	xcp_dev[index] = NULL;
+	pdev_num--;
+}
+
+void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
 {
-	for (--pdev_num; pdev_num >= 0; --pdev_num) {
-		struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
+	int index;
+	struct xcp_device *pxcp_dev;
+
+	if (ddev == NULL)
+		return;
 
-		devres_release_group(&pdev->dev, NULL);
-		platform_device_unregister(pdev);
-		xcp_dev[pdev_num] = NULL;
+	pxcp_dev = container_of(ddev, struct xcp_device, drm);
+
+	mutex_lock(&xcp_mutex);
+	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+		if (xcp_dev[index] == pxcp_dev) {
+			amdgpu_xcp_drm_dev_destroy(index);
+			break;
+		}
+	}
+	mutex_unlock(&xcp_mutex);
+}
+EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
+
+void amdgpu_xcp_drv_release(void)
+{
+	int index;
+
+	mutex_lock(&xcp_mutex);
+	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+		if (xcp_dev[index]) {
+			WARN_ON(xcp_dev[index]);
+			amdgpu_xcp_drm_dev_destroy(index);
+		}
 	}
-	pdev_num = 0;
+	WARN_ON(pdev_num != 0);
+	mutex_unlock(&xcp_mutex);
 }
 EXPORT_SYMBOL(amdgpu_xcp_drv_release);
 
diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
index c1c4b679bf95..580a1602c8e3 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
@@ -25,5 +25,6 @@
 #define _AMDGPU_XCP_DRV_H_
 
 int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
+void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
 void amdgpu_xcp_drv_release(void);
 #endif /* _AMDGPU_XCP_DRV_H_ */
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
  2025-01-05  2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
                   ` (2 preceding siblings ...)
  2025-01-05  2:45 ` [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
@ 2025-01-05  2:45 ` Jiang Liu
  2025-01-05  5:16   ` Shuo Liu
  2025-01-07 22:55   ` Chen, Xiaogang
  2025-01-05  2:45 ` [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
  2025-01-05  2:45 ` [PATCH v2 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
  5 siblings, 2 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05  2:45 UTC (permalink / raw)
  To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu

If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
after free bug related to amdgpu_driver_release_kms() as:
2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
2024-12-26 16:17:46 [16002.213648] FS:  00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
2024-12-26 16:17:46 [16002.223189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
e024se+0x3c/0x90 [amdxcp]
2024-12-26 16:17:46 [16002.337645]  __do_sys_delete_module.constprop.0+0x176/0x310
2024-12-26 16:17:46 [16002.344324]  do_syscall_64+0x5d/0x170
2024-12-26 16:17:46 [16002.348858]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26

Fix it by unplugging xcp drm devices when failed to probe GPU devices.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index d2a046736edd..9ebc0d47d1cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
 		DRM_WARN("smart shift update failed\n");
 
 out:
-	if (r)
+	if (r) {
+		amdgpu_xcp_dev_unplug(adev);
 		amdgpu_driver_unload_kms(dev);
+	}
 
 	return r;
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  2025-01-05  2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
                   ` (3 preceding siblings ...)
  2025-01-05  2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
@ 2025-01-05  2:45 ` Jiang Liu
  2025-01-07 22:55   ` Chen, Xiaogang
  2025-01-05  2:45 ` [PATCH v2 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
  5 siblings, 1 reply; 18+ messages in thread
From: Jiang Liu @ 2025-01-05  2:45 UTC (permalink / raw)
  To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu

Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function drm_sched_init()
to avoid possible invalid memory access on error recover path.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5ff53a3b9851..475ab635c699 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 		if (r) {
 			DRM_ERROR("Failed to create scheduler on ring %s.\n",
 				  ring->name);
+			ring->sched.ops = NULL;
 			return r;
 		}
 		r = amdgpu_uvd_entity_init(adev, ring);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 2f24a6aa13bf..b5e87b515139 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
 		 * The natural check would be sched.ready, which is
 		 * set as drm_sched_init() finishes...
 		 */
-		if (ring->sched.ops)
+		if (ring->sched.ops) {
 			drm_sched_fini(&ring->sched);
+			ring->sched.ops = NULL;
+		}
 
 		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
 			dma_fence_put(ring->fence_drv.fences[j]);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put()
  2025-01-05  2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
                   ` (4 preceding siblings ...)
  2025-01-05  2:45 ` [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
@ 2025-01-05  2:45 ` Jiang Liu
  5 siblings, 0 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05  2:45 UTC (permalink / raw)
  To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu

If error happens before amdgpu_fence_driver_hw_init() gets called during
device probe, it will trigger a false warning in amdgpu_irq_put() as
below:
[ 1209.300996] ------------[ cut here ]------------
[ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc9jFrl7/amd/amdgpu/amdgpu_irq.c:633 amdgpu_irq_put+0x45/0x70 [amdgpu]
[ 1209.301062] Modules linked in: ...
[ 1209.301093] CPU: 48 PID: 293 Comm: kworker/48:1 Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
[ 1209.301094] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
[ 1209.301095] Workqueue: events work_for_cpu_fn
[ 1209.301159] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
[ 1209.301160] Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 c3 cc cc cc cc e9 8b fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
[ 1209.301162] RSP: 0018:ffffb08a99c8fd88 EFLAGS: 00010246
[ 1209.301162] RAX: ffff9efe1bcbf500 RBX: ffff9efe1cc3e400 RCX: 0000000000000000
[ 1209.301163] RDX: 0000000000000000 RSI: ffff9efe1cc3b108 RDI: ffff9efe1cc00000
[ 1209.301163] RBP: ffff9efe1cc10818 R08: 0000000000000001 R09: 000000000000000d
[ 1209.301164] R10: ffffb08a99c8fb48 R11: ffffffffa2068018 R12: ffff9efe1cc109d0
[ 1209.301164] R13: ffff9efe1cc00010 R14: ffff9efe1cc00000 R15: ffff9efe1cc3b108
[ 1209.301165] FS:  0000000000000000(0000) GS:ffff9ff9fce00000(0000) knlGS:0000000000000000
[ 1209.301165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1209.301165] CR2: 00007fd0f6e860d0 CR3: 0000010092baa003 CR4: 0000000002770ee0
[ 1209.301166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1209.301166] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1209.301167] PKRU: 55555554
[ 1209.301167] Call Trace:
[ 1209.301225]  amdgpu_fence_driver_hw_fini+0xda/0x110 [amdgpu]
[ 1209.301284]  amdgpu_device_fini_hw+0xaf/0x200 [amdgpu]
[ 1209.301342]  amdgpu_driver_load_kms+0x7f/0xc0 [amdgpu]
[ 1209.301400]  amdgpu_pci_probe+0x1cd/0x4a0 [amdgpu]
[ 1209.301401]  local_pci_probe+0x40/0xa0
[ 1209.301402]  work_for_cpu_fn+0x13/0x20
[ 1209.301403]  process_one_work+0x1ad/0x380
[ 1209.301404]  worker_thread+0x1c8/0x310
[ 1209.301405]  ? process_one_work+0x380/0x380
[ 1209.301406]  kthread+0x118/0x140
[ 1209.301407]  ? __kthread_bind_mask+0x60/0x60
[ 1209.301408]  ret_from_fork+0x1f/0x30
[ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
[ 1209.301418] ------------[ cut here ]------------

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index b5e87b515139..0e41a535e05f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -614,9 +614,11 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
 
 		if (!drm_dev_is_unplugged(adev_to_drm(adev)) &&
 		    ring->fence_drv.irq_src &&
-		    amdgpu_fence_need_ring_interrupt_restore(ring))
+		    ring->fence_drv.irq_enabled) {
 			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
 				       ring->fence_drv.irq_type);
+		        ring->fence_drv.irq_enabled = false;
+		}
 
 		del_timer_sync(&ring->fence_drv.fallback_timer);
 	}
@@ -693,9 +695,12 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
 
 		/* enable the interrupt */
 		if (ring->fence_drv.irq_src &&
-		    amdgpu_fence_need_ring_interrupt_restore(ring))
+		    !ring->fence_drv.irq_enabled &&
+		    amdgpu_fence_need_ring_interrupt_restore(ring)) {
 			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
 				       ring->fence_drv.irq_type);
+		        ring->fence_drv.irq_enabled = true;
+		}
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index dee5a1b4e572..959d474a0516 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -118,6 +118,7 @@ struct amdgpu_fence_driver {
 	uint32_t			sync_seq;
 	atomic_t			last_seq;
 	bool				initialized;
+	bool				irq_enabled;
 	struct amdgpu_irq_src		*irq_src;
 	unsigned			irq_type;
 	struct timer_list		fallback_timer;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
  2025-01-05  2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
@ 2025-01-05  5:16   ` Shuo Liu
  2025-01-07 22:55   ` Chen, Xiaogang
  1 sibling, 0 replies; 18+ messages in thread
From: Shuo Liu @ 2025-01-05  5:16 UTC (permalink / raw)
  To: Jiang Liu; +Cc: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell

On Sun  5.Jan'25 at 10:45:32 +0800, Jiang Liu wrote:
>If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
>after free bug related to amdgpu_driver_release_kms() as:
>2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
>2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
>2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
>2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
>2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
>2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
>2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
>2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
>43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
>2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
>2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
>2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
>2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
>2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
>2024-12-26 16:17:46 [16002.213648] FS:  00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
>2024-12-26 16:17:46 [16002.223189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
>2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>e024se+0x3c/0x90 [amdxcp]
This calltrace seems corrupted.
You can use this one from my test.

BUG: unable to handle page fault for address: 000000010000008f
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 139 PID: 10172 Comm: rmmod Kdump: loaded Tainted: G            E      6.10.0 #5
RIP: 0010:amdgpu_fence_driver_sw_fini+0x3b/0xd0 [amdgpu]
Code: 00 41 55 49 89 fd 41 54 49 81 c5 78 07 01 00 41 bc ff ff ff ff 55 53 eb 09 49 83 c6 08 4d 39 ee 74 7d 49 8b 2e 48 85 ed 74 ef <80> 7d 28 00 74 e9 48 83 7d 78 00 74 09 48 8d 7d 78 e8 ff 65 e1 ff
RSP: 0018:ffffb8e77badfde0 EFLAGS: 00010202
RAX: ffffffffc09e13b0 RBX: ffffa1461a4287f0 RCX: 00000000000000c0
RDX: 00000000000000c0 RSI: 00000000000000c0 RDI: ffffa1461a4287f0
RBP: 0000000100000067 R08: 00000000000000c0 R09: ffffa146112a6c48
R10: 0000000000040000 R11: 0000000000000007 R12: 00000000ffffffff
R13: ffffa1461a438f68 R14: ffffa1461a438b88 R15: 0000000000000000
FS:  00007f98ed908740(0000) GS:ffffa141ffd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000010000008f CR3: 0000000127d40003 CR4: 0000000000f70ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
  <TASK>
  ? __die+0x24/0x70
  ? page_fault_oops+0x65/0x140
  ? exc_page_fault+0x68/0x140
  ? asm_exc_page_fault+0x26/0x30
  ? __pfx_amdgpu_driver_release_kms+0x10/0x10 [amdgpu]
  ? amdgpu_fence_driver_sw_fini+0x3b/0xd0 [amdgpu]
  amdgpu_device_fini_sw+0x16/0x220 [amdgpu]
  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
  devm_drm_dev_init_release+0x4e/0x70 [drm]
  release_nodes+0x35/0xb0
  devres_release_group+0xa0/0xf0
  amdgpu_xcp_drv_release+0x3c/0x90 [amdxcp]

shuo
>2024-12-26 16:17:46 [16002.337645]  __do_sys_delete_module.constprop.0+0x176/0x310
>2024-12-26 16:17:46 [16002.344324]  do_syscall_64+0x5d/0x170
>2024-12-26 16:17:46 [16002.348858]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26

>
>Fix it by unplugging xcp drm devices when failed to probe GPU devices.
>
>Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
>Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>index d2a046736edd..9ebc0d47d1cb 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>@@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
> 		DRM_WARN("smart shift update failed\n");
>
> out:
>-	if (r)
>+	if (r) {
>+		amdgpu_xcp_dev_unplug(adev);
> 		amdgpu_driver_unload_kms(dev);
>+	}
>
> 	return r;
> }
>-- 
>2.43.5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
  2025-01-05  2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
@ 2025-01-05  5:22   ` Shuo Liu
  2025-01-05  6:57     ` Gerry Liu
  2025-01-06  3:07     ` Gerry Liu
  2025-01-07 22:53   ` Chen, Xiaogang
  1 sibling, 2 replies; 18+ messages in thread
From: Shuo Liu @ 2025-01-05  5:22 UTC (permalink / raw)
  To: Jiang Liu; +Cc: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell

Hi Gerry,

On Sun  5.Jan'25 at 10:45:29 +0800, Jiang Liu wrote:
>Fix possible resource leakage on error recovery path in function
>kgd2kfd_device_init().
>
>Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>---
> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>index a29374c86405..fa5054940486 100644
>--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>@@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> 		if (kfd->adev->xcp_mgr)
> 			kfd_setup_interrupt_bitmap(node, i);
>
>+		spin_lock_init(&node->watch_points_lock);
>+
>+		kfd->nodes[i] = node;
>+
> 		/* Initialize the KFD node */
> 		if (kfd_init_node(node)) {
> 			dev_err(kfd_device, "Error initializing KFD node\n");
> 			goto node_init_error;
> 		}
>-
>-		spin_lock_init(&node->watch_points_lock);
>-
>-		kfd->nodes[i] = node;
> 	}
>
> 	svm_range_set_max_pages(kfd->adev);
>@@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> 	goto out;
>
> node_init_error:
>+	i++;
The err cleanup path can release node itself. So the following
kfd_cleanup_nodes() may do a double free? 

> node_alloc_error:
> 	kfd_cleanup_nodes(kfd, i);
> 	kfd_doorbell_fini(kfd);
shuo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
  2025-01-05  5:22   ` Shuo Liu
@ 2025-01-05  6:57     ` Gerry Liu
  2025-01-06  3:07     ` Gerry Liu
  1 sibling, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-05  6:57 UTC (permalink / raw)
  To: Shuo Liu; +Cc: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell

[-- Attachment #1: Type: text/plain, Size: 1774 bytes --]



> 2025年1月5日 13:22，Shuo Liu <shuox.liu@linux.alibaba.com> 写道：
> 
> Hi Gerry,
> 
> On Sun  5.Jan'25 at 10:45:29 +0800, Jiang Liu wrote:
>> Fix possible resource leakage on error recovery path in function
>> kgd2kfd_device_init().
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index a29374c86405..fa5054940486 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> 		if (kfd->adev->xcp_mgr)
>> 			kfd_setup_interrupt_bitmap(node, i);
>> 
>> +		spin_lock_init(&node->watch_points_lock);
>> +
>> +		kfd->nodes[i] = node;
>> +
>> 		/* Initialize the KFD node */
>> 		if (kfd_init_node(node)) {
>> 			dev_err(kfd_device, "Error initializing KFD node\n");
>> 			goto node_init_error;
>> 		}
>> -
>> -		spin_lock_init(&node->watch_points_lock);
>> -
>> -		kfd->nodes[i] = node;
>> 	}
>> 
>> 	svm_range_set_max_pages(kfd->adev);
>> @@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> 	goto out;
>> 
>> node_init_error:
>> +	i++;
> The err cleanup path can release node itself. So the following
> kfd_cleanup_nodes() may do a double free? 
The code in function kgd2kfd_device_exit() checks flag `kfd->init_complete`.
And if error happens within function kgd2kfd_device_init(), the flag `kfd->init_complete` will be set to false, thus avoid double free.

>> node_alloc_error:
>> 	kfd_cleanup_nodes(kfd, i);
>> 	kfd_doorbell_fini(kfd);
> shuo


[-- Attachment #2: Type: text/html, Size: 9061 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
  2025-01-05  5:22   ` Shuo Liu
  2025-01-05  6:57     ` Gerry Liu
@ 2025-01-06  3:07     ` Gerry Liu
  1 sibling, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-06  3:07 UTC (permalink / raw)
  To: Shuo Liu; +Cc: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell

[-- Attachment #1: Type: text/plain, Size: 1685 bytes --]



> 2025年1月5日 13:22，Shuo Liu <shuox.liu@linux.alibaba.com> 写道：
> 
> Hi Gerry,
> 
> On Sun  5.Jan'25 at 10:45:29 +0800, Jiang Liu wrote:
>> Fix possible resource leakage on error recovery path in function
>> kgd2kfd_device_init().
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index a29374c86405..fa5054940486 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> 		if (kfd->adev->xcp_mgr)
>> 			kfd_setup_interrupt_bitmap(node, i);
>> 
>> +		spin_lock_init(&node->watch_points_lock);
>> +
>> +		kfd->nodes[i] = node;
>> +
>> 		/* Initialize the KFD node */
>> 		if (kfd_init_node(node)) {
>> 			dev_err(kfd_device, "Error initializing KFD node\n");
>> 			goto node_init_error;
>> 		}
>> -
>> -		spin_lock_init(&node->watch_points_lock);
>> -
>> -		kfd->nodes[i] = node;
>> 	}
>> 
>> 	svm_range_set_max_pages(kfd->adev);
>> @@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> 	goto out;
>> 
>> node_init_error:
>> +	i++;
> The err cleanup path can release node itself. So the following
> kfd_cleanup_nodes() may do a double free? 
Hi Shuo,
	I missed the `kfree()` in function kfd_init_node(), so this patch is wrong and should be dropped.

Thanks,
Gerry
>> node_alloc_error:
>> 	kfd_cleanup_nodes(kfd, i);
>> 	kfd_doorbell_fini(kfd);
> shuo


[-- Attachment #2: Type: text/html, Size: 9057 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  2025-01-05  2:45 ` [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
@ 2025-01-06  6:51   ` Lazar, Lijo
  2025-01-07  2:00     ` Gerry Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Lazar, Lijo @ 2025-01-06  6:51 UTC (permalink / raw)
  To: Jiang Liu, amd-gfx, xiaogang.chen, Kent.Russell, shuox.liu



On 1/5/2025 8:15 AM, Jiang Liu wrote:
> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
> to do error recovery.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     | 11 +++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h     |  1 +
>  drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 70 +++++++++++++++++----
>  drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
>  4 files changed, 70 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> index e209b5e101df..401fbaa0b6b8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> @@ -359,6 +359,8 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
>  		ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
>  		if (ret)
>  			return ret;
> +
> +		adev->xcp_mgr->xcp[i].registered = true;
>  	}
>  
>  	return 0;
> @@ -376,12 +378,19 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
>  		if (!adev->xcp_mgr->xcp[i].ddev)
>  			break;
>  
> +		// Restore and free the original drm_device.
>  		p_ddev = adev->xcp_mgr->xcp[i].ddev;
> -		drm_dev_unplug(p_ddev);
> +		if (adev->xcp_mgr->xcp[i].registered) {
> +			drm_dev_unplug(p_ddev);
> +			adev->xcp_mgr->xcp[i].registered = false;
> +		}
>  		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
>  		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
>  		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
>  		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
> +		amdgpu_xcp_drm_dev_free(p_ddev);
> +
> +		adev->xcp_mgr->xcp[i].ddev = NULL;
>  	}
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> index b63f53242c57..cd06a4a232be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> @@ -101,6 +101,7 @@ struct amdgpu_xcp {
>  	uint8_t id;
>  	uint8_t mem_id;
>  	bool valid;
> +	bool registered;
>  	atomic_t	ref_cnt;
>  	struct drm_device *ddev;
>  	struct drm_device *rdev;
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> index faed84172dd4..9058d71b4756 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> @@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
>  
>  static int8_t pdev_num;
>  static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
> +static struct mutex xcp_mutex;

I think this needs to be static DEFINE_MUTEX(xcp_mutex).

>  
>  int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>  {
>  	struct platform_device *pdev;
>  	struct xcp_device *pxcp_dev;
>  	char dev_name[20];
> -	int ret;
> +	int ret, index;
>  
> +	mutex_lock(&xcp_mutex);
> +	ret = -ENODEV;

Preference would be do this inside the below if() to associate the error
with the condition.

>  	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
> -		return -ENODEV;
> +		goto out_unlock;
>  
> -	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (!xcp_dev[index])
> +			break;
> +	}
> +
> +	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
>  	pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
>  	if (IS_ERR(pdev))
>  		return PTR_ERR(pdev);

Seems mutex is left locked here.

> @@ -72,10 +80,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>  		goto out_devres;
>  	}
>  
> -	xcp_dev[pdev_num] = pxcp_dev;
> -	xcp_dev[pdev_num]->pdev = pdev;
> +	xcp_dev[index] = pxcp_dev;
> +	xcp_dev[index]->pdev = pdev;
>  	*ddev = &pxcp_dev->drm;
>  	pdev_num++;
> +	mutex_unlock(&xcp_mutex);
>  
>  	return 0;
>  
> @@ -83,21 +92,58 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>  	devres_release_group(&pdev->dev, NULL);
>  out_unregister:
>  	platform_device_unregister(pdev);
> +out_unlock:
> +	mutex_unlock(&xcp_mutex);
>  
>  	return ret;
>  }
>  EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>  
> -void amdgpu_xcp_drv_release(void)
> +static void amdgpu_xcp_drm_dev_destroy(int index)
> +{
> +	struct platform_device *pdev;
> +
> +	pdev = xcp_dev[index]->pdev;
> +	devres_release_group(&pdev->dev, NULL);
> +	platform_device_unregister(pdev);
> +	xcp_dev[index] = NULL;
> +	pdev_num--;
> +}
> +
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
>  {
> -	for (--pdev_num; pdev_num >= 0; --pdev_num) {
> -		struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
> +	int index;
> +	struct xcp_device *pxcp_dev;
> +
> +	if (ddev == NULL)
> +		return;
>  
> -		devres_release_group(&pdev->dev, NULL);
> -		platform_device_unregister(pdev);
> -		xcp_dev[pdev_num] = NULL;
> +	pxcp_dev = container_of(ddev, struct xcp_device, drm);
> +
> +	mutex_lock(&xcp_mutex);
> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (xcp_dev[index] == pxcp_dev) {
> +			amdgpu_xcp_drm_dev_destroy(index);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&xcp_mutex);
> +}
> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
> +
> +void amdgpu_xcp_drv_release(void)
> +{
> +	int index;
> +
> +	mutex_lock(&xcp_mutex);
> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (xcp_dev[index]) {
> +			WARN_ON(xcp_dev[index]);

Why is this WARN check needed? There is already a if() check for valid
index.

Also, would suggest to separate out amdgpu_xcp.c from xcp_drv.c. xcp_drv
introducing a new interface may be kept in a separate patch.

Thanks,
Lijo

> +			amdgpu_xcp_drm_dev_destroy(index);
> +		}
>  	}
> -	pdev_num = 0;
> +	WARN_ON(pdev_num != 0);
> +	mutex_unlock(&xcp_mutex);
>  }
>  EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>  
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> index c1c4b679bf95..580a1602c8e3 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> @@ -25,5 +25,6 @@
>  #define _AMDGPU_XCP_DRV_H_
>  
>  int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
>  void amdgpu_xcp_drv_release(void);
>  #endif /* _AMDGPU_XCP_DRV_H_ */


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  2025-01-06  6:51   ` Lazar, Lijo
@ 2025-01-07  2:00     ` Gerry Liu
  0 siblings, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-07  2:00 UTC (permalink / raw)
  To: Lazar, Lijo; +Cc: amd-gfx, xiaogang.chen, Kent.Russell, shuox.liu

[-- Attachment #1: Type: text/plain, Size: 6977 bytes --]



> 2025年1月6日 14:51，Lazar, Lijo <lijo.lazar@amd.com> 写道：
> 
> 
> 
> On 1/5/2025 8:15 AM, Jiang Liu wrote:
>> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
>> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
>> to do error recovery.
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     | 11 +++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h     |  1 +
>> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 70 +++++++++++++++++----
>> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
>> 4 files changed, 70 insertions(+), 13 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> index e209b5e101df..401fbaa0b6b8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> @@ -359,6 +359,8 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
>> 		ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
>> 		if (ret)
>> 			return ret;
>> +
>> +		adev->xcp_mgr->xcp[i].registered = true;
>> 	}
>> 
>> 	return 0;
>> @@ -376,12 +378,19 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
>> 		if (!adev->xcp_mgr->xcp[i].ddev)
>> 			break;
>> 
>> +		// Restore and free the original drm_device.
>> 		p_ddev = adev->xcp_mgr->xcp[i].ddev;
>> -		drm_dev_unplug(p_ddev);
>> +		if (adev->xcp_mgr->xcp[i].registered) {
>> +			drm_dev_unplug(p_ddev);
>> +			adev->xcp_mgr->xcp[i].registered = false;
>> +		}
>> 		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
>> 		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
>> 		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
>> 		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
>> +		amdgpu_xcp_drm_dev_free(p_ddev);
>> +
>> +		adev->xcp_mgr->xcp[i].ddev = NULL;
>> 	}
>> }
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> index b63f53242c57..cd06a4a232be 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> @@ -101,6 +101,7 @@ struct amdgpu_xcp {
>> 	uint8_t id;
>> 	uint8_t mem_id;
>> 	bool valid;
>> +	bool registered;
>> 	atomic_t	ref_cnt;
>> 	struct drm_device *ddev;
>> 	struct drm_device *rdev;
>> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> index faed84172dd4..9058d71b4756 100644
>> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> @@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
>> 
>> static int8_t pdev_num;
>> static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
>> +static struct mutex xcp_mutex;
> 
> I think this needs to be static DEFINE_MUTEX(xcp_mutex).
> 
>> 
>> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> {
>> 	struct platform_device *pdev;
>> 	struct xcp_device *pxcp_dev;
>> 	char dev_name[20];
>> -	int ret;
>> +	int ret, index;
>> 
>> +	mutex_lock(&xcp_mutex);
>> +	ret = -ENODEV;
> 
> Preference would be do this inside the below if() to associate the error
> with the condition.
> 
>> 	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
>> -		return -ENODEV;
>> +		goto out_unlock;
>> 
>> -	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (!xcp_dev[index])
>> +			break;
>> +	}
>> +
>> +	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
>> 	pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
>> 	if (IS_ERR(pdev))
>> 		return PTR_ERR(pdev);
> 
> Seems mutex is left locked here.
Will fixed in next version:)

> 
>> @@ -72,10 +80,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> 		goto out_devres;
>> 	}
>> 
>> -	xcp_dev[pdev_num] = pxcp_dev;
>> -	xcp_dev[pdev_num]->pdev = pdev;
>> +	xcp_dev[index] = pxcp_dev;
>> +	xcp_dev[index]->pdev = pdev;
>> 	*ddev = &pxcp_dev->drm;
>> 	pdev_num++;
>> +	mutex_unlock(&xcp_mutex);
>> 
>> 	return 0;
>> 
>> @@ -83,21 +92,58 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> 	devres_release_group(&pdev->dev, NULL);
>> out_unregister:
>> 	platform_device_unregister(pdev);
>> +out_unlock:
>> +	mutex_unlock(&xcp_mutex);
>> 
>> 	return ret;
>> }
>> EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>> 
>> -void amdgpu_xcp_drv_release(void)
>> +static void amdgpu_xcp_drm_dev_destroy(int index)
>> +{
>> +	struct platform_device *pdev;
>> +
>> +	pdev = xcp_dev[index]->pdev;
>> +	devres_release_group(&pdev->dev, NULL);
>> +	platform_device_unregister(pdev);
>> +	xcp_dev[index] = NULL;
>> +	pdev_num--;
>> +}
>> +
>> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
>> {
>> -	for (--pdev_num; pdev_num >= 0; --pdev_num) {
>> -		struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
>> +	int index;
>> +	struct xcp_device *pxcp_dev;
>> +
>> +	if (ddev == NULL)
>> +		return;
>> 
>> -		devres_release_group(&pdev->dev, NULL);
>> -		platform_device_unregister(pdev);
>> -		xcp_dev[pdev_num] = NULL;
>> +	pxcp_dev = container_of(ddev, struct xcp_device, drm);
>> +
>> +	mutex_lock(&xcp_mutex);
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (xcp_dev[index] == pxcp_dev) {
>> +			amdgpu_xcp_drm_dev_destroy(index);
>> +			break;
>> +		}
>> +	}
>> +	mutex_unlock(&xcp_mutex);
>> +}
>> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
>> +
>> +void amdgpu_xcp_drv_release(void)
>> +{
>> +	int index;
>> +
>> +	mutex_lock(&xcp_mutex);
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (xcp_dev[index]) {
>> +			WARN_ON(xcp_dev[index]);
> 
> Why is this WARN check needed? There is already a if() check for valid
> index.
> 
> Also, would suggest to separate out amdgpu_xcp.c from xcp_drv.c. xcp_drv
> introducing a new interface may be kept in a separate patch.
With new implementation, all xcp devices should have already be removed when amdgpu_xcp_drv_release() gets called,
So hope to verify whether it works as expected.

Thanks!
> 
> Thanks,
> Lijo
> 
>> +			amdgpu_xcp_drm_dev_destroy(index);
>> +		}
>> 	}
>> -	pdev_num = 0;
>> +	WARN_ON(pdev_num != 0);
>> +	mutex_unlock(&xcp_mutex);
>> }
>> EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>> 
>> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> index c1c4b679bf95..580a1602c8e3 100644
>> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> @@ -25,5 +25,6 @@
>> #define _AMDGPU_XCP_DRV_H_
>> 
>> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
>> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
>> void amdgpu_xcp_drv_release(void);
>> #endif /* _AMDGPU_XCP_DRV_H_ */


[-- Attachment #2: Type: text/html, Size: 32096 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
  2025-01-05  2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
  2025-01-05  5:22   ` Shuo Liu
@ 2025-01-07 22:53   ` Chen, Xiaogang
  2025-01-08  2:29     ` Gerry Liu
  1 sibling, 1 reply; 18+ messages in thread
From: Chen, Xiaogang @ 2025-01-07 22:53 UTC (permalink / raw)
  To: Jiang Liu, amd-gfx, lijo.lazar, Kent.Russell, shuox.liu

[-- Attachment #1: Type: text/plain, Size: 1563 bytes --]


On 1/4/2025 8:45 PM, Jiang Liu wrote:
> Fix possible resource leakage on error recovery path in function
> kgd2kfd_device_init().
>
> Signed-off-by: Jiang Liu<gerry@linux.alibaba.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index a29374c86405..fa5054940486 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>   		if (kfd->adev->xcp_mgr)
>   			kfd_setup_interrupt_bitmap(node, i);
>   
> +		spin_lock_init(&node->watch_points_lock);
> +
> +		kfd->nodes[i] = node;
> +
>   		/* Initialize the KFD node */
>   		if (kfd_init_node(node)) {
>   			dev_err(kfd_device, "Error initializing KFD node\n");
>   			goto node_init_error;
>   		}
> -
> -		spin_lock_init(&node->watch_points_lock);
> -
> -		kfd->nodes[i] = node;
>   	}
>   
>   	svm_range_set_max_pages(kfd->adev);
> @@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>   	goto out;
>   
>   node_init_error:
> +	i++;
>   node_alloc_error:
>   	kfd_cleanup_nodes(kfd, i);
>   	kfd_doorbell_fini(kfd);

I think this change is not right: if kfd_init_node fail it does clean up 
for the kfd_node that it is initializing internally. kfd_cleanup_nodes 
does not need to clean up the kfd node i which failed to init, just 
clean up the kfd_nodes that were initialized previously.

Regard

Xiaogang

[-- Attachment #2: Type: text/html, Size: 2300 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
  2025-01-05  2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
  2025-01-05  5:16   ` Shuo Liu
@ 2025-01-07 22:55   ` Chen, Xiaogang
  2025-01-08  3:34     ` Gerry Liu
  1 sibling, 1 reply; 18+ messages in thread
From: Chen, Xiaogang @ 2025-01-07 22:55 UTC (permalink / raw)
  To: Jiang Liu, amd-gfx, lijo.lazar, Kent.Russell, shuox.liu

[-- Attachment #1: Type: text/plain, Size: 3649 bytes --]


On 1/4/2025 8:45 PM, Jiang Liu wrote:
> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
> after free bug related to amdgpu_driver_release_kms() as:
> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
> 2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
> 2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
> 2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
> 2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> 2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
> 2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
> 2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
> 2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
> 43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
> 2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
> 2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
> 2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
> 2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
> 2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> 2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
> 2024-12-26 16:17:46 [16002.213648] FS:  00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
> 2024-12-26 16:17:46 [16002.223189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
> 2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> e024se+0x3c/0x90 [amdxcp]
> 2024-12-26 16:17:46 [16002.337645]  __do_sys_delete_module.constprop.0+0x176/0x310
> 2024-12-26 16:17:46 [16002.344324]  do_syscall_64+0x5d/0x170
> 2024-12-26 16:17:46 [16002.348858]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
>
> Fix it by unplugging xcp drm devices when failed to probe GPU devices.
>
> Signed-off-by: Jiang Liu<gerry@linux.alibaba.com>
> Tested-by: Shuo Liu<shuox.liu@linux.alibaba.com>
> Reviewed-by: Lijo Lazar<lijo.lazar@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index d2a046736edd..9ebc0d47d1cb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
>   		DRM_WARN("smart shift update failed\n");
>   
>   out:
> -	if (r)
> +	if (r) {
> +		amdgpu_xcp_dev_unplug(adev);

You have made amdgpu_xcp_drm_dev_free, why still use 
amdgpu_xcp_dev_unplug here? I think you want undo 
amdgpu_xcp_drm_dev_alloc in error path. Why involve adev device unplug? 
It is a different scenario.

Regards

Xiaogang

>   		amdgpu_driver_unload_kms(dev);
> +	}
>   
>   	return r;
>   }

[-- Attachment #2: Type: text/html, Size: 4618 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  2025-01-05  2:45 ` [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
@ 2025-01-07 22:55   ` Chen, Xiaogang
  0 siblings, 0 replies; 18+ messages in thread
From: Chen, Xiaogang @ 2025-01-07 22:55 UTC (permalink / raw)
  To: Jiang Liu, amd-gfx, lijo.lazar, Kent.Russell, shuox.liu

[-- Attachment #1: Type: text/plain, Size: 1878 bytes --]


On 1/4/2025 8:45 PM, Jiang Liu wrote:
> Function detects initialization status by checking sched->ops, so set
> sched->ops to non-NULL just before return in function drm_sched_init()

This commit message is not what the change did: you set sched->ops to 
NULL just after return from drm_sched_init.

Other than that this change looks good to me.

Regards

Xiaogang


> to avoid possible invalid memory access on error recover path.
>
> Signed-off-by: Jiang Liu<gerry@linux.alibaba.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 4 +++-
>   2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5ff53a3b9851..475ab635c699 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>   		if (r) {
>   			DRM_ERROR("Failed to create scheduler on ring %s.\n",
>   				  ring->name);
> +			ring->sched.ops = NULL;
>   			return r;
>   		}
>   		r = amdgpu_uvd_entity_init(adev, ring);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 2f24a6aa13bf..b5e87b515139 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>   		 * The natural check would be sched.ready, which is
>   		 * set as drm_sched_init() finishes...
>   		 */
> -		if (ring->sched.ops)
> +		if (ring->sched.ops) {
>   			drm_sched_fini(&ring->sched);
> +			ring->sched.ops = NULL;
> +		}
>   
>   		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>   			dma_fence_put(ring->fence_drv.fences[j]);

[-- Attachment #2: Type: text/html, Size: 2713 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
  2025-01-07 22:53   ` Chen, Xiaogang
@ 2025-01-08  2:29     ` Gerry Liu
  0 siblings, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-08  2:29 UTC (permalink / raw)
  To: Chen, Xiaogang; +Cc: amd-gfx, lijo.lazar, Kent.Russell, shuox.liu

[-- Attachment #1: Type: text/plain, Size: 1859 bytes --]



> 2025年1月8日 06:53，Chen, Xiaogang <xiaogang.chen@amd.com> 写道：
> 
> 
> 
> On 1/4/2025 8:45 PM, Jiang Liu wrote:
>> Fix possible resource leakage on error recovery path in function
>> kgd2kfd_device_init().
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> <mailto:gerry@linux.alibaba.com>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index a29374c86405..fa5054940486 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>  		if (kfd->adev->xcp_mgr)
>>  			kfd_setup_interrupt_bitmap(node, i);
>>  
>> +		spin_lock_init(&node->watch_points_lock);
>> +
>> +		kfd->nodes[i] = node;
>> +
>>  		/* Initialize the KFD node */
>>  		if (kfd_init_node(node)) {
>>  			dev_err(kfd_device, "Error initializing KFD node\n");
>>  			goto node_init_error;
>>  		}
>> -
>> -		spin_lock_init(&node->watch_points_lock);
>> -
>> -		kfd->nodes[i] = node;
>>  	}
>>  
>>  	svm_range_set_max_pages(kfd->adev);
>> @@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>>  	goto out;
>>  
>>  node_init_error:
>> +	i++;
>>  node_alloc_error:
>>  	kfd_cleanup_nodes(kfd, i);
>>  	kfd_doorbell_fini(kfd);
> I think this change is not right: if kfd_init_node fail it does clean up for the kfd_node that it is initializing internally. kfd_cleanup_nodes does not need to clean up the kfd node i which failed to init, just clean up the kfd_nodes that were initialized previously.
> 
Yes, I made a mistake here and will drop this patch in next version.

Thanks,
Gerry
> Regard
> 
> Xiaogang
> 


[-- Attachment #2: Type: text/html, Size: 3133 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
  2025-01-07 22:55   ` Chen, Xiaogang
@ 2025-01-08  3:34     ` Gerry Liu
  0 siblings, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-08  3:34 UTC (permalink / raw)
  To: Chen, Xiaogang; +Cc: amd-gfx, lijo.lazar, Kent.Russell, shuox.liu

[-- Attachment #1: Type: text/plain, Size: 4698 bytes --]



> 2025年1月8日 06:55，Chen, Xiaogang <xiaogang.chen@amd.com> 写道：
> 
> 
> 
> On 1/4/2025 8:45 PM, Jiang Liu wrote:
>> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
>> after free bug related to amdgpu_driver_release_kms() as:
>> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> 2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
>> 2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
>> 2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
>> 2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
>> 2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
>> 2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>> 2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
>> 2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
>> 43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
>> 2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
>> 2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
>> 2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
>> 2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
>> 2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> 2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
>> 2024-12-26 16:17:46 [16002.213648] FS:  00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
>> 2024-12-26 16:17:46 [16002.223189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> 2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
>> 2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> 2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> e024se+0x3c/0x90 [amdxcp]
>> 2024-12-26 16:17:46 [16002.337645]  __do_sys_delete_module.constprop.0+0x176/0x310
>> 2024-12-26 16:17:46 [16002.344324]  do_syscall_64+0x5d/0x170
>> 2024-12-26 16:17:46 [16002.348858]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
>> 
>> Fix it by unplugging xcp drm devices when failed to probe GPU devices.
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> <mailto:gerry@linux.alibaba.com>
>> Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com> <mailto:shuox.liu@linux.alibaba.com>
>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> <mailto:lijo.lazar@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index d2a046736edd..9ebc0d47d1cb 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
>>  		DRM_WARN("smart shift update failed\n");
>>  
>>  out:
>> -	if (r)
>> +	if (r) {
>> +		amdgpu_xcp_dev_unplug(adev);
> You have made amdgpu_xcp_drm_dev_free, why still use amdgpu_xcp_dev_unplug here? I think you want undo amdgpu_xcp_drm_dev_alloc in error path. Why involve adev device unplug? It is a different scenario.
> 
Hi xiaogang,
     I didn’t get your point yet. Current work flow is:
amdgpu_pci_probe()
	-> amdgpu_driver_load_kms()
		-> amdgpu_device_init()
			->amdgpu_discovery_set_ip_blocks()
				-> amdgpu_xcp_mgr_init()
					-> amdgpu_xcp_alloc_dev()
	-> amdgpu_xcp_dev_register()
And amdgpu_xcp_dev_unplug() undos work done by amdgpu_xcp_dev_register() and part of amdgpu_xcp_mgr_ini().
How about splitting amdgpu_xcp_dev_unplug() into amdgpuxcp_dev_deregister() and amdgpu_xcp_mgr_fini(), then things should get much clear.
And it seems the xcp_mgr object is leaked current, and there are resource leakage on error handling path of amdgpu_pci_probe. 
Above proposal may also help to fix those resource leakages.
Thanks,
Gerry
> Regards
> 
> Xiaogang
> 
>>  		amdgpu_driver_unload_kms(dev);
>> +	}
>>  
>>  	return r;
>>  }


[-- Attachment #2: Type: text/html, Size: 6610 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-01-08  9:07 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-05  2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
2025-01-05  2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
2025-01-05  5:22   ` Shuo Liu
2025-01-05  6:57     ` Gerry Liu
2025-01-06  3:07     ` Gerry Liu
2025-01-07 22:53   ` Chen, Xiaogang
2025-01-08  2:29     ` Gerry Liu
2025-01-05  2:45 ` [PATCH v2 2/6] amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
2025-01-05  2:45 ` [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
2025-01-06  6:51   ` Lazar, Lijo
2025-01-07  2:00     ` Gerry Liu
2025-01-05  2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
2025-01-05  5:16   ` Shuo Liu
2025-01-07 22:55   ` Chen, Xiaogang
2025-01-08  3:34     ` Gerry Liu
2025-01-05  2:45 ` [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
2025-01-07 22:55   ` Chen, Xiaogang
2025-01-05  2:45 ` [PATCH v2 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.