[PATCH v3 0/6] Fix several bugs in error handling during device probe

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 0/6] Fix several bugs in error handling during device probe
@ 2025-01-08  8:55 Jiang Liu
  2025-01-08  8:55 ` [v3 1/6] drm/amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Jiang Liu @ 2025-01-08  8:55 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

This patchset tries to fix several memory leakages/invalid memory
accesses on error handling path during GPU driver loading/unloading.
They applies to:
https://gitlab.freedesktop.org/agd5f/linux.git amd-staging-drm-next

v3:
1) drop first patch of v2
2) rework the 0003/0004 patches of v2 according to review comments
3) add patch 0004 to fix possible resource leakage in amdgpu_pci_probe()

v2:
1) rebased to https://gitlab.freedesktop.org/agd5f/linux.git branch
   amd-staging-drm-next.
2) removed the first patch, which is unnecessary.
3) add amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp
   driver to better support device remove and error handling.
4) reworked patch 0005 to fix it in amdgpu instead of drm core.

Jiang Liu (6):
  drm/amdgpu: clear adev->in_suspend flag when fails to suspend
  drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  drm/amdgpu: fix use after free bug related to
    amdgpu_driver_release_kms()
  drm/amdgpu: enhance error handling in function amdgpu_pci_probe()
  drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 16 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c     | 14 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 13 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     | 47 +++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h     |  4 +-
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 +++++++++++++++++----
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
 9 files changed, 140 insertions(+), 33 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [v3 1/6] drm/amdgpu: clear adev->in_suspend flag when fails to suspend
  2025-01-08  8:55 [PATCH v3 0/6] Fix several bugs in error handling during device probe Jiang Liu
@ 2025-01-08  8:55 ` Jiang Liu
  2025-01-08  8:56 ` [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Jiang Liu @ 2025-01-08  8:55 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

Clear adev->in_suspend flag when fails to suspend, otherwise it will
cause too much warnings like:
[ 1802.212027] ------------[ cut here ]------------
[ 1802.212028] WARNING: CPU: 97 PID: 11282 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:452 amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212198] Modules linked in: amdgpu(E-) tcp_diag(E) inet_diag(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) iTCO_wdt(E) pmt_telemetry(E) iTCO_vendor_support(E) pmt_class(E) intel_cstate(E) snd_hda_intel(E) ipmi_ssif(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) intel_uncore(E) snd(E) cdc_ether(E) pcspkr(E) i2c_i801(E) idxd(E) mei_me(E) usbnet(E) ses(E) isst_if_mmio(E) isst_if_mbox_pci(E) mii(E) joydev(E) soundcore(E) isst_if_common(E) idxd_bus(E) mei(E) enclosure(E) intel_vsec(E) i2c_smbus(E) i2c_ismt(E) sunrpc(E) acpi_power_meter(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) acpi_pad(E) ipmi_msghandler(E) vfat(E) fat(E) sg(E) video(E) amdxcp(E) drm_ttm_helper(E) ttm(E) drm_exec(E) gpu_sched(E) drm_suballoc_helper(E) crc32c_intel(E) drm_buddy(E)
[ 1802.212223]  ast(E) drm_shmem_helper(E) drm_display_helper(E) i2c_algo_bit(E) drm_kms_helper(E) virtio_net(E) mpt3sas(E) drm(E) net_failover(E) ahci(E) raid_class(E) failover(E) libahci(E) scsi_transport_sas(E) dimlib(E) libata(E) i2c_core(E) wmi(E) pinctrl_emmitsburg(E) [last unloaded: amdgpu(E)]
[ 1802.212231] CPU: 97 PID: 11282 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
[ 1802.212232] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
[ 1802.212232] RIP: 0010:amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212401] Code: 00 00 00 4d 85 e4 74 08 49 c7 04 24 00 00 00 00 48 85 ed 74 08 48 c7 45 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b e9 3b ff ff ff 3d 00 fe ff ff 74 b3 49 8b be b0 10 ff ff 4c
[ 1802.212402] RSP: 0018:ffffbe0e92087cb0 EFLAGS: 00010202
[ 1802.212403] RAX: 0000000000000206 RBX: ffff9cbc06c37f10 RCX: 000000000020000d
[ 1802.212404] RDX: ffff9cbc06c37f18 RSI: ffff9cbc06c37f58 RDI: ffff9cbc06c37f10
[ 1802.212407] RBP: ffff9cbc06c37f18 R08: ffff9cba42388800 R09: 000000000020000d
[ 1802.212409] R10: 0000000000040000 R11: 0000000000000006 R12: ffff9cbc06c37f58
[ 1802.212410] R13: ffff9cba4238dc00 R14: ffff9cbc06c0ef50 R15: ffff9cbc06c00000
[ 1802.212411] FS:  00007f927dc4e740(0000) GS:ffff9db47e880000(0000) knlGS:0000000000000000
[ 1802.212412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1802.212413] CR2: 000056342b2be1a8 CR3: 00000003f1546005 CR4: 0000000000f70ef0
[ 1802.212413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1802.212414] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1802.212415] PKRU: 55555554
[ 1802.212415] Call Trace:
[ 1802.212416]  <TASK>
[ 1802.212417]  ? __warn+0x83/0x130
[ 1802.212419]  ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212586]  ? __report_bug+0xea/0x100
[ 1802.212588]  ? report_bug+0x24/0x70
[ 1802.212589]  ? handle_bug+0x3c/0x70
[ 1802.212590]  ? exc_invalid_op+0x18/0x70
[ 1802.212591]  ? asm_exc_invalid_op+0x1a/0x20
[ 1802.212594]  ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212746]  amdgpu_ring_fini+0x91/0x120 [amdgpu]
[ 1802.212901]  amdgpu_jpeg_sw_fini+0xb2/0xe0 [amdgpu]
[ 1802.213106]  amdgpu_device_ip_fini.isra.0+0xb1/0x1c0 [amdgpu]
[ 1802.213247]  amdgpu_device_fini_sw+0x49/0x290 [amdgpu]
[ 1802.213413]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 1802.213576]  devm_drm_dev_init_release+0x4e/0x70 [drm]
[ 1802.213602]  release_nodes+0x35/0xb0
[ 1802.213605]  devres_release_all+0x8f/0xd0
[ 1802.213606]  device_unbind_cleanup+0xe/0x70
[ 1802.213609]  device_release_driver_internal+0x1bc/0x200
[ 1802.213611]  driver_detach+0x48/0x90
[ 1802.213613]  bus_remove_driver+0x6d/0xf0
[ 1802.213615]  pci_unregister_driver+0x2e/0xb0
[ 1802.213618]  amdgpu_exit+0x15/0x1c4 [amdgpu]
[ 1802.213861]  __do_sys_delete_module.constprop.0+0x176/0x310
[ 1802.213863]  do_syscall_64+0x5d/0x170
[ 1802.213866]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1802.213869] RIP: 0033:0x7f927d6620cb
[ 1802.213870] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48
[ 1802.213871] RSP: 002b:00007fff031d9d78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 1802.213873] RAX: ffffffffffffffda RBX: 000055d8d6df69e0 RCX: 00007f927d6620cb
[ 1802.213873] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055d8d6df6a48
[ 1802.213874] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1802.213875] R10: 00007f927d7aaac0 R11: 0000000000000206 R12: 00007fff031d9fa0
[ 1802.213876] R13: 00007fff031da5f1 R14: 000055d8d6df62a0 R15: 000055d8d6df69e0
[ 1802.213878]  </TASK>
[ 1802.213879] ---[ end trace 0000000000000000 ]---

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0a121aab5c74..5ff53a3b9851 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4910,7 +4910,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
 		amdgpu_virt_fini_data_exchange(adev);
 		r = amdgpu_virt_request_full_gpu(adev, false);
 		if (r)
-			return r;
+			goto error_out;
 	}
 
 	if (amdgpu_acpi_smart_shift_update(dev, AMDGPU_SS_DEV_D3))
@@ -4930,7 +4930,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
 
 	r = amdgpu_device_evict_resources(adev);
 	if (r)
-		return r;
+		goto error_out;
 
 	amdgpu_ttm_set_buffer_funcs_status(adev, false);
 
@@ -4943,9 +4943,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
 
 	r = amdgpu_dpm_notify_rlc_state(adev, false);
 	if (r)
-		return r;
+		goto error_out;
 
 	return 0;
+error_out:
+	adev->in_suspend = false;
+	return r;
 }
 
 /**
@@ -5007,8 +5010,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 		amdgpu_virt_release_full_gpu(adev, true);
 	}
 
-	if (r)
+	if (r) {
+		adev->in_suspend = false;
 		return r;
+	}
 
 	/* Make sure IB tests flushed */
 	flush_delayed_work(&adev->delayed_init_work);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  2025-01-08  8:55 [PATCH v3 0/6] Fix several bugs in error handling during device probe Jiang Liu
  2025-01-08  8:55 ` [v3 1/6] drm/amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
@ 2025-01-08  8:56 ` Jiang Liu
  2025-01-08  9:31   ` Lazar, Lijo
  2025-01-08 16:04   ` Mario Limonciello
  2025-01-08  8:56 ` [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Jiang Liu @ 2025-01-08  8:56 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 +++++++++++++++++----
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
 2 files changed, 63 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
index faed84172dd4..fc92b5fe1040 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
@@ -45,21 +45,32 @@ static const struct drm_driver amdgpu_xcp_driver = {
 
 static int8_t pdev_num;
 static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
+static DEFINE_MUTEX(xcp_mutex);
 
 int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
 {
 	struct platform_device *pdev;
 	struct xcp_device *pxcp_dev;
 	char dev_name[20];
-	int ret;
+	int ret, index;
 
-	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
-		return -ENODEV;
+	mutex_lock(&xcp_mutex);
+	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
+	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+		if (!xcp_dev[index])
+			break;
+	}
 
-	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
+	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
 	pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
-	if (IS_ERR(pdev))
-		return PTR_ERR(pdev);
+	if (IS_ERR(pdev)) {
+		ret = PTR_ERR(pdev);
+		goto out_unregister;
+	}
 
 	if (!devres_open_group(&pdev->dev, NULL, GFP_KERNEL)) {
 		ret = -ENOMEM;
@@ -72,10 +83,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
 		goto out_devres;
 	}
 
-	xcp_dev[pdev_num] = pxcp_dev;
-	xcp_dev[pdev_num]->pdev = pdev;
+	xcp_dev[index] = pxcp_dev;
+	xcp_dev[index]->pdev = pdev;
 	*ddev = &pxcp_dev->drm;
 	pdev_num++;
+	mutex_unlock(&xcp_mutex);
 
 	return 0;
 
@@ -83,21 +95,57 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
 	devres_release_group(&pdev->dev, NULL);
 out_unregister:
 	platform_device_unregister(pdev);
+out_unlock:
+	mutex_unlock(&xcp_mutex);
 
 	return ret;
 }
 EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
 
+static void amdgpu_xcp_drm_dev_destroy(int index)
+{
+	struct platform_device *pdev;
+
+	pdev = xcp_dev[index]->pdev;
+	devres_release_group(&pdev->dev, NULL);
+	platform_device_unregister(pdev);
+	xcp_dev[index] = NULL;
+	pdev_num--;
+}
+
+void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
+{
+	int index;
+	struct xcp_device *pxcp_dev;
+
+	if (ddev == NULL)
+		return;
+
+	pxcp_dev = container_of(ddev, struct xcp_device, drm);
+
+	mutex_lock(&xcp_mutex);
+	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+		if (xcp_dev[index] == pxcp_dev) {
+			amdgpu_xcp_drm_dev_destroy(index);
+			break;
+		}
+	}
+	mutex_unlock(&xcp_mutex);
+}
+EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
+
 void amdgpu_xcp_drv_release(void)
 {
-	for (--pdev_num; pdev_num >= 0; --pdev_num) {
-		struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
+	int index;
 
-		devres_release_group(&pdev->dev, NULL);
-		platform_device_unregister(pdev);
-		xcp_dev[pdev_num] = NULL;
+	mutex_lock(&xcp_mutex);
+	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+		if (xcp_dev[index]) {
+			amdgpu_xcp_drm_dev_destroy(index);
+		}
 	}
-	pdev_num = 0;
+	WARN_ON(pdev_num != 0);
+	mutex_unlock(&xcp_mutex);
 }
 EXPORT_SYMBOL(amdgpu_xcp_drv_release);
 
diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
index c1c4b679bf95..580a1602c8e3 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
@@ -25,5 +25,6 @@
 #define _AMDGPU_XCP_DRV_H_
 
 int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
+void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
 void amdgpu_xcp_drv_release(void);
 #endif /* _AMDGPU_XCP_DRV_H_ */
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
  2025-01-08  8:55 [PATCH v3 0/6] Fix several bugs in error handling during device probe Jiang Liu
  2025-01-08  8:55 ` [v3 1/6] drm/amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
  2025-01-08  8:56 ` [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
@ 2025-01-08  8:56 ` Jiang Liu
  2025-01-08  9:54   ` Lazar, Lijo
  2025-01-08  8:56 ` [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe() Jiang Liu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Jiang Liu @ 2025-01-08  8:56 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
after free bug related to amdgpu_driver_release_kms() as:
2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
2024-12-26 16:17:46 [16002.213648] FS:  00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
2024-12-26 16:17:46 [16002.223189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
e024se+0x3c/0x90 [amdxcp]
2024-12-26 16:17:46 [16002.337645]  __do_sys_delete_module.constprop.0+0x176/0x310
2024-12-26 16:17:46 [16002.344324]  do_syscall_64+0x5d/0x170
2024-12-26 16:17:46 [16002.348858]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26

Fix it by removing xcp drm devices when failed to probe GPU devices.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c    | 47 +++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h    |  4 +-
 5 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5ff53a3b9851..510074a9074e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6682,7 +6682,7 @@ void amdgpu_device_halt(struct amdgpu_device *adev)
 	struct pci_dev *pdev = adev->pdev;
 	struct drm_device *ddev = adev_to_drm(adev);
 
-	amdgpu_xcp_dev_unplug(adev);
+	amdgpu_xcp_dev_deregister(adev);
 	drm_dev_unplug(ddev);
 
 	amdgpu_irq_disable_all(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 62de668e9ff8..41d1b06be600 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2435,7 +2435,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
 
-	amdgpu_xcp_dev_unplug(adev);
+	amdgpu_xcp_dev_deregister(adev);
 	amdgpu_gmc_prepare_nps_mode_change(adev);
 	drm_dev_unplug(dev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index d2a046736edd..be9147eb8308 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -1508,6 +1508,7 @@ void amdgpu_driver_release_kms(struct drm_device *dev)
 	struct amdgpu_device *adev = drm_to_adev(dev);
 
 	amdgpu_device_fini_sw(adev);
+	amdgpu_xcp_mgr_fini(adev);
 	pci_set_drvdata(adev->pdev, NULL);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index e209b5e101df..62dd5287808b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -283,6 +283,33 @@ static int amdgpu_xcp_dev_alloc(struct amdgpu_device *adev)
 	return 0;
 }
 
+static void amdgpu_xcp_dev_free(struct amdgpu_device *adev)
+{
+	struct drm_device *p_ddev;
+	int i;
+
+	if (!adev->xcp_mgr)
+		return;
+
+	for (i = 1; i < MAX_XCP; i++) {
+		if (!adev->xcp_mgr->xcp[i].ddev)
+			break;
+
+		// Restore and free the original drm_device.
+		p_ddev = adev->xcp_mgr->xcp[i].ddev;
+		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
+		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
+		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
+		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
+		amdgpu_xcp_drm_dev_free(p_ddev);
+
+		adev->xcp_mgr->xcp[i].ddev = NULL;
+	}
+
+	adev->xcp_mgr->xcp->ddev = NULL;
+}
+
+
 int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
 			int init_num_xcps,
 			struct amdgpu_xcp_mgr_funcs *xcp_funcs)
@@ -310,6 +337,13 @@ int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
 	return amdgpu_xcp_dev_alloc(adev);
 }
 
+void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev)
+{
+	amdgpu_xcp_dev_free(adev);
+	kfree(adev->xcp_mgr);
+	adev->xcp_mgr = NULL;
+}
+
 int amdgpu_xcp_get_partition(struct amdgpu_xcp_mgr *xcp_mgr,
 			     enum AMDGPU_XCP_IP_BLOCK ip, int instance)
 {
@@ -359,12 +393,14 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
 		ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
 		if (ret)
 			return ret;
+
+		adev->xcp_mgr->xcp[i].registered = true;
 	}
 
 	return 0;
 }
 
-void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
+void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev)
 {
 	struct drm_device *p_ddev;
 	int i;
@@ -377,11 +413,10 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
 			break;
 
 		p_ddev = adev->xcp_mgr->xcp[i].ddev;
-		drm_dev_unplug(p_ddev);
-		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
-		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
-		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
-		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
+		if (adev->xcp_mgr->xcp[i].registered) {
+			drm_dev_unplug(p_ddev);
+			adev->xcp_mgr->xcp[i].registered = false;
+		}
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
index b63f53242c57..be22d4398463 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
@@ -101,6 +101,7 @@ struct amdgpu_xcp {
 	uint8_t id;
 	uint8_t mem_id;
 	bool valid;
+	bool registered;
 	atomic_t	ref_cnt;
 	struct drm_device *ddev;
 	struct drm_device *rdev;
@@ -155,6 +156,7 @@ int amdgpu_xcp_resume(struct amdgpu_xcp_mgr *xcp_mgr, int xcp_id);
 
 int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
 			int init_xcps, struct amdgpu_xcp_mgr_funcs *xcp_funcs);
+void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev);
 int amdgpu_xcp_init(struct amdgpu_xcp_mgr *xcp_mgr, int num_xcps, int mode);
 int amdgpu_xcp_query_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, u32 flags);
 int amdgpu_xcp_switch_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, int mode);
@@ -168,7 +170,7 @@ int amdgpu_xcp_get_inst_details(struct amdgpu_xcp *xcp,
 
 int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
 				const struct pci_device_id *ent);
-void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev);
+void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev);
 int amdgpu_xcp_open_device(struct amdgpu_device *adev,
 			   struct amdgpu_fpriv *fpriv,
 			   struct drm_file *file_priv);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()
  2025-01-08  8:55 [PATCH v3 0/6] Fix several bugs in error handling during device probe Jiang Liu
                   ` (2 preceding siblings ...)
  2025-01-08  8:56 ` [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
@ 2025-01-08  8:56 ` Jiang Liu
  2025-01-08 16:08   ` Mario Limonciello
  2025-01-08  8:56 ` [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
  2025-01-08  8:56 ` [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
  5 siblings, 1 reply; 23+ messages in thread
From: Jiang Liu @ 2025-01-08  8:56 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

Enhance error handling in function amdgpu_pci_probe() to avoid
possible resource leakage.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 41d1b06be600..f8deca2f2696 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2346,16 +2346,16 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
 		msleep(5000);
 		goto retry_init;
 	} else if (ret) {
-		goto err_pci;
+		goto unload_kms;
 	}
 
 	ret = amdgpu_xcp_dev_register(adev, ent);
 	if (ret)
-		goto err_pci;
+		goto unplug_drm;
 
 	ret = amdgpu_amdkfd_drm_client_create(adev);
 	if (ret)
-		goto err_pci;
+		goto deregister_xcp;
 
 	/*
 	 * 1. don't init fbdev on hw without DCE
@@ -2424,6 +2424,12 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
 
 	return 0;
 
+deregister_xcp:
+	amdgpu_xcp_dev_deregister(adev);
+unplug_drm:
+	drm_dev_unplug(ddev);
+unload_kms:
+	amdgpu_driver_unload_kms(ddev);
 err_pci:
 	pci_disable_device(pdev);
 	return ret;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  2025-01-08  8:55 [PATCH v3 0/6] Fix several bugs in error handling during device probe Jiang Liu
                   ` (3 preceding siblings ...)
  2025-01-08  8:56 ` [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe() Jiang Liu
@ 2025-01-08  8:56 ` Jiang Liu
  2025-01-08  9:16   ` Christian König
  2025-01-08  8:56 ` [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
  5 siblings, 1 reply; 23+ messages in thread
From: Jiang Liu @ 2025-01-08  8:56 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function
amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
to avoid possible invalid memory access on error recover path.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 510074a9074e..741807a1fd2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 		if (r) {
 			DRM_ERROR("Failed to create scheduler on ring %s.\n",
 				  ring->name);
+			ring->sched.ops = NULL;
 			return r;
 		}
 		r = amdgpu_uvd_entity_init(adev, ring);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 2f24a6aa13bf..b5e87b515139 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
 		 * The natural check would be sched.ready, which is
 		 * set as drm_sched_init() finishes...
 		 */
-		if (ring->sched.ops)
+		if (ring->sched.ops) {
 			drm_sched_fini(&ring->sched);
+			ring->sched.ops = NULL;
+		}
 
 		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
 			dma_fence_put(ring->fence_drv.fences[j]);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()
  2025-01-08  8:55 [PATCH v3 0/6] Fix several bugs in error handling during device probe Jiang Liu
                   ` (4 preceding siblings ...)
  2025-01-08  8:56 ` [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
@ 2025-01-08  8:56 ` Jiang Liu
  2025-01-08  9:05   ` Christian König
  2025-01-08 10:02   ` Lazar, Lijo
  5 siblings, 2 replies; 23+ messages in thread
From: Jiang Liu @ 2025-01-08  8:56 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
  Cc: Jiang Liu

If error happens before amdgpu_fence_driver_hw_init() gets called during
device probe, it will trigger a false warning in amdgpu_irq_put() as
below:
[ 1209.300996] ------------[ cut here ]------------
[ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc9jFrl7/amd/amdgpu/amdgpu_irq.c:633 amdgpu_irq_put+0x45/0x70 [amdgpu]
[ 1209.301062] Modules linked in: ...
[ 1209.301093] CPU: 48 PID: 293 Comm: kworker/48:1 Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
[ 1209.301094] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
[ 1209.301095] Workqueue: events work_for_cpu_fn
[ 1209.301159] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
[ 1209.301160] Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 c3 cc cc cc cc e9 8b fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
[ 1209.301162] RSP: 0018:ffffb08a99c8fd88 EFLAGS: 00010246
[ 1209.301162] RAX: ffff9efe1bcbf500 RBX: ffff9efe1cc3e400 RCX: 0000000000000000
[ 1209.301163] RDX: 0000000000000000 RSI: ffff9efe1cc3b108 RDI: ffff9efe1cc00000
[ 1209.301163] RBP: ffff9efe1cc10818 R08: 0000000000000001 R09: 000000000000000d
[ 1209.301164] R10: ffffb08a99c8fb48 R11: ffffffffa2068018 R12: ffff9efe1cc109d0
[ 1209.301164] R13: ffff9efe1cc00010 R14: ffff9efe1cc00000 R15: ffff9efe1cc3b108
[ 1209.301165] FS:  0000000000000000(0000) GS:ffff9ff9fce00000(0000) knlGS:0000000000000000
[ 1209.301165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1209.301165] CR2: 00007fd0f6e860d0 CR3: 0000010092baa003 CR4: 0000000002770ee0
[ 1209.301166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1209.301166] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1209.301167] PKRU: 55555554
[ 1209.301167] Call Trace:
[ 1209.301225]  amdgpu_fence_driver_hw_fini+0xda/0x110 [amdgpu]
[ 1209.301284]  amdgpu_device_fini_hw+0xaf/0x200 [amdgpu]
[ 1209.301342]  amdgpu_driver_load_kms+0x7f/0xc0 [amdgpu]
[ 1209.301400]  amdgpu_pci_probe+0x1cd/0x4a0 [amdgpu]
[ 1209.301401]  local_pci_probe+0x40/0xa0
[ 1209.301402]  work_for_cpu_fn+0x13/0x20
[ 1209.301403]  process_one_work+0x1ad/0x380
[ 1209.301404]  worker_thread+0x1c8/0x310
[ 1209.301405]  ? process_one_work+0x380/0x380
[ 1209.301406]  kthread+0x118/0x140
[ 1209.301407]  ? __kthread_bind_mask+0x60/0x60
[ 1209.301408]  ret_from_fork+0x1f/0x30
[ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
[ 1209.301418] ------------[ cut here ]------------

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index b5e87b515139..0e41a535e05f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -614,9 +614,11 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
 
 		if (!drm_dev_is_unplugged(adev_to_drm(adev)) &&
 		    ring->fence_drv.irq_src &&
-		    amdgpu_fence_need_ring_interrupt_restore(ring))
+		    ring->fence_drv.irq_enabled) {
 			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
 				       ring->fence_drv.irq_type);
+		        ring->fence_drv.irq_enabled = false;
+		}
 
 		del_timer_sync(&ring->fence_drv.fallback_timer);
 	}
@@ -693,9 +695,12 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
 
 		/* enable the interrupt */
 		if (ring->fence_drv.irq_src &&
-		    amdgpu_fence_need_ring_interrupt_restore(ring))
+		    !ring->fence_drv.irq_enabled &&
+		    amdgpu_fence_need_ring_interrupt_restore(ring)) {
 			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
 				       ring->fence_drv.irq_type);
+		        ring->fence_drv.irq_enabled = true;
+		}
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index dee5a1b4e572..959d474a0516 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -118,6 +118,7 @@ struct amdgpu_fence_driver {
 	uint32_t			sync_seq;
 	atomic_t			last_seq;
 	bool				initialized;
+	bool				irq_enabled;
 	struct amdgpu_irq_src		*irq_src;
 	unsigned			irq_type;
 	struct timer_list		fallback_timer;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()
  2025-01-08  8:56 ` [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
@ 2025-01-08  9:05   ` Christian König
  2025-01-10  2:01     ` Gerry Liu
  2025-01-08 10:02   ` Lazar, Lijo
  1 sibling, 1 reply; 23+ messages in thread
From: Christian König @ 2025-01-08  9:05 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

Am 08.01.25 um 09:56 schrieb Jiang Liu:
> If error happens before amdgpu_fence_driver_hw_init() gets called during
> device probe, it will trigger a false warning in amdgpu_irq_put() as
> below:
> [ 1209.300996] ------------[ cut here ]------------
> [ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc9jFrl7/amd/amdgpu/amdgpu_irq.c:633 amdgpu_irq_put+0x45/0x70 [amdgpu]
> [ 1209.301062] Modules linked in: ...
> [ 1209.301093] CPU: 48 PID: 293 Comm: kworker/48:1 Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
> [ 1209.301094] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
> [ 1209.301095] Workqueue: events work_for_cpu_fn
> [ 1209.301159] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
> [ 1209.301160] Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 c3 cc cc cc cc e9 8b fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
> [ 1209.301162] RSP: 0018:ffffb08a99c8fd88 EFLAGS: 00010246
> [ 1209.301162] RAX: ffff9efe1bcbf500 RBX: ffff9efe1cc3e400 RCX: 0000000000000000
> [ 1209.301163] RDX: 0000000000000000 RSI: ffff9efe1cc3b108 RDI: ffff9efe1cc00000
> [ 1209.301163] RBP: ffff9efe1cc10818 R08: 0000000000000001 R09: 000000000000000d
> [ 1209.301164] R10: ffffb08a99c8fb48 R11: ffffffffa2068018 R12: ffff9efe1cc109d0
> [ 1209.301164] R13: ffff9efe1cc00010 R14: ffff9efe1cc00000 R15: ffff9efe1cc3b108
> [ 1209.301165] FS:  0000000000000000(0000) GS:ffff9ff9fce00000(0000) knlGS:0000000000000000
> [ 1209.301165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1209.301165] CR2: 00007fd0f6e860d0 CR3: 0000010092baa003 CR4: 0000000002770ee0
> [ 1209.301166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1209.301166] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1209.301167] PKRU: 55555554
> [ 1209.301167] Call Trace:
> [ 1209.301225]  amdgpu_fence_driver_hw_fini+0xda/0x110 [amdgpu]
> [ 1209.301284]  amdgpu_device_fini_hw+0xaf/0x200 [amdgpu]
> [ 1209.301342]  amdgpu_driver_load_kms+0x7f/0xc0 [amdgpu]
> [ 1209.301400]  amdgpu_pci_probe+0x1cd/0x4a0 [amdgpu]
> [ 1209.301401]  local_pci_probe+0x40/0xa0
> [ 1209.301402]  work_for_cpu_fn+0x13/0x20
> [ 1209.301403]  process_one_work+0x1ad/0x380
> [ 1209.301404]  worker_thread+0x1c8/0x310
> [ 1209.301405]  ? process_one_work+0x380/0x380
> [ 1209.301406]  kthread+0x118/0x140
> [ 1209.301407]  ? __kthread_bind_mask+0x60/0x60
> [ 1209.301408]  ret_from_fork+0x1f/0x30
> [ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
> [ 1209.301418] ------------[ cut here ]------------
>
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  | 1 +
>   2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index b5e87b515139..0e41a535e05f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -614,9 +614,11 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>   
>   		if (!drm_dev_is_unplugged(adev_to_drm(adev)) &&
>   		    ring->fence_drv.irq_src &&
> -		    amdgpu_fence_need_ring_interrupt_restore(ring))
> +		    ring->fence_drv.irq_enabled) {
>   			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>   				       ring->fence_drv.irq_type);
> +		        ring->fence_drv.irq_enabled = false;
> +		}

Clearly a NAK, that is exactly what the warning is supposed to warn about.

Regards,
Christian.

>   
>   		del_timer_sync(&ring->fence_drv.fallback_timer);
>   	}
> @@ -693,9 +695,12 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>   
>   		/* enable the interrupt */
>   		if (ring->fence_drv.irq_src &&
> -		    amdgpu_fence_need_ring_interrupt_restore(ring))
> +		    !ring->fence_drv.irq_enabled &&
> +		    amdgpu_fence_need_ring_interrupt_restore(ring)) {
>   			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>   				       ring->fence_drv.irq_type);
> +		        ring->fence_drv.irq_enabled = true;
> +		}
>   	}
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index dee5a1b4e572..959d474a0516 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -118,6 +118,7 @@ struct amdgpu_fence_driver {
>   	uint32_t			sync_seq;
>   	atomic_t			last_seq;
>   	bool				initialized;
> +	bool				irq_enabled;
>   	struct amdgpu_irq_src		*irq_src;
>   	unsigned			irq_type;
>   	struct timer_list		fallback_timer;


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  2025-01-08  8:56 ` [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
@ 2025-01-08  9:16   ` Christian König
  2025-01-08  9:58     ` Lazar, Lijo
  2025-01-08 16:30     ` Chen, Xiaogang
  0 siblings, 2 replies; 23+ messages in thread
From: Christian König @ 2025-01-08  9:16 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

Am 08.01.25 um 09:56 schrieb Jiang Liu:
> Function detects initialization status by checking sched->ops,

Where is that done? Inside the scheduler or inside amdgpu?

Regards,
Christian.

>   so set
> sched->ops to non-NULL just before return in function
> amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
> to avoid possible invalid memory access on error recover path.
>
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 4 +++-
>   2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 510074a9074e..741807a1fd2e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>   		if (r) {
>   			DRM_ERROR("Failed to create scheduler on ring %s.\n",
>   				  ring->name);
> +			ring->sched.ops = NULL;
>   			return r;
>   		}
>   		r = amdgpu_uvd_entity_init(adev, ring);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 2f24a6aa13bf..b5e87b515139 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>   		 * The natural check would be sched.ready, which is
>   		 * set as drm_sched_init() finishes...
>   		 */
> -		if (ring->sched.ops)
> +		if (ring->sched.ops) {
>   			drm_sched_fini(&ring->sched);
> +			ring->sched.ops = NULL;
> +		}
>   
>   		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>   			dma_fence_put(ring->fence_drv.fences[j]);


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  2025-01-08  8:56 ` [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
@ 2025-01-08  9:31   ` Lazar, Lijo
  2025-01-09  1:46     ` Gerry Liu
  2025-01-08 16:04   ` Mario Limonciello
  1 sibling, 1 reply; 23+ messages in thread
From: Lazar, Lijo @ 2025-01-08  9:31 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
	airlied, simona, sunil.khatri, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx



On 1/8/2025 2:26 PM, Jiang Liu wrote:
> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
> to do error recovery.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
>  drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 +++++++++++++++++----
>  drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
>  2 files changed, 63 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> index faed84172dd4..fc92b5fe1040 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> @@ -45,21 +45,32 @@ static const struct drm_driver amdgpu_xcp_driver = {
>  
>  static int8_t pdev_num;
>  static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
> +static DEFINE_MUTEX(xcp_mutex);
>  
>  int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>  {
>  	struct platform_device *pdev;
>  	struct xcp_device *pxcp_dev;
>  	char dev_name[20];
> -	int ret;
> +	int ret, index;
>  
> -	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
> -		return -ENODEV;
> +	mutex_lock(&xcp_mutex);
> +	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (!xcp_dev[index])
> +			break;
> +	}
>  
> -	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
> +	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
>  	pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
> -	if (IS_ERR(pdev))
> -		return PTR_ERR(pdev);
> +	if (IS_ERR(pdev)) {
> +		ret = PTR_ERR(pdev);
> +		goto out_unregister;
> +	}
>  
>  	if (!devres_open_group(&pdev->dev, NULL, GFP_KERNEL)) {
>  		ret = -ENOMEM;
> @@ -72,10 +83,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>  		goto out_devres;
>  	}
>  
> -	xcp_dev[pdev_num] = pxcp_dev;
> -	xcp_dev[pdev_num]->pdev = pdev;
> +	xcp_dev[index] = pxcp_dev;
> +	xcp_dev[index]->pdev = pdev;
>  	*ddev = &pxcp_dev->drm;
>  	pdev_num++;
> +	mutex_unlock(&xcp_mutex);
>  
>  	return 0;
>  
> @@ -83,21 +95,57 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>  	devres_release_group(&pdev->dev, NULL);
>  out_unregister:
>  	platform_device_unregister(pdev);
> +out_unlock:
> +	mutex_unlock(&xcp_mutex);
>  
>  	return ret;
>  }
>  EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>  
> +static void amdgpu_xcp_drm_dev_destroy(int index)

<Nit> Use something like __amdgpu_xcp_drm_dev_free(int index) to keep
the 'free' suffix.

> +{
> +	struct platform_device *pdev;
> +
> +	pdev = xcp_dev[index]->pdev;
> +	devres_release_group(&pdev->dev, NULL);
> +	platform_device_unregister(pdev);
> +	xcp_dev[index] = NULL;
> +	pdev_num--;
> +}
> +
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
> +{
> +	int index;
> +	struct xcp_device *pxcp_dev;
> +
> +	if (ddev == NULL)

Does it make sense to add !pdev_num check or a WARN_ON(!pdev_num) below?

> +		return;
> +
> +	pxcp_dev = container_of(ddev, struct xcp_device, drm);
> +
> +	mutex_lock(&xcp_mutex);
> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (xcp_dev[index] == pxcp_dev) {
> +			amdgpu_xcp_drm_dev_destroy(index);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&xcp_mutex);
> +}
> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
> +
>  void amdgpu_xcp_drv_release(void)
>  {
> -	for (--pdev_num; pdev_num >= 0; --pdev_num) {
> -		struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
> +	int index;
>  
> -		devres_release_group(&pdev->dev, NULL);
> -		platform_device_unregister(pdev);
> -		xcp_dev[pdev_num] = NULL;

To better optimize, add one check like below.
	if (!pdev_num)
		return;

Thanks,
Lijo

> +	mutex_lock(&xcp_mutex);
> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (xcp_dev[index]) {
> +			amdgpu_xcp_drm_dev_destroy(index);
> +		}
>  	}
> -	pdev_num = 0;
> +	WARN_ON(pdev_num != 0);
> +	mutex_unlock(&xcp_mutex);
>  }
>  EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>  
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> index c1c4b679bf95..580a1602c8e3 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> @@ -25,5 +25,6 @@
>  #define _AMDGPU_XCP_DRV_H_
>  
>  int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
>  void amdgpu_xcp_drv_release(void);
>  #endif /* _AMDGPU_XCP_DRV_H_ */


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
  2025-01-08  8:56 ` [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
@ 2025-01-08  9:54   ` Lazar, Lijo
  2025-01-09  3:00     ` Gerry Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Lazar, Lijo @ 2025-01-08  9:54 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
	airlied, simona, sunil.khatri, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx



On 1/8/2025 2:26 PM, Jiang Liu wrote:
> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
> after free bug related to amdgpu_driver_release_kms() as:
> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
> 2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
> 2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
> 2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
> 2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> 2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
> 2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
> 2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
> 2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
> 43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
> 2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
> 2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
> 2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
> 2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
> 2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> 2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
> 2024-12-26 16:17:46 [16002.213648] FS:  00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
> 2024-12-26 16:17:46 [16002.223189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
> 2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> e024se+0x3c/0x90 [amdxcp]
> 2024-12-26 16:17:46 [16002.337645]  __do_sys_delete_module.constprop.0+0x176/0x310
> 2024-12-26 16:17:46 [16002.344324]  do_syscall_64+0x5d/0x170
> 2024-12-26 16:17:46 [16002.348858]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
> 
> Fix it by removing xcp drm devices when failed to probe GPU devices.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c    | 47 +++++++++++++++++++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h    |  4 +-
>  5 files changed, 47 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5ff53a3b9851..510074a9074e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6682,7 +6682,7 @@ void amdgpu_device_halt(struct amdgpu_device *adev)
>  	struct pci_dev *pdev = adev->pdev;
>  	struct drm_device *ddev = adev_to_drm(adev);
>  
> -	amdgpu_xcp_dev_unplug(adev);
> +	amdgpu_xcp_dev_deregister(adev);
>  	drm_dev_unplug(ddev);
>  
>  	amdgpu_irq_disable_all(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 62de668e9ff8..41d1b06be600 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2435,7 +2435,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
>  
> -	amdgpu_xcp_dev_unplug(adev);
> +	amdgpu_xcp_dev_deregister(adev);
>  	amdgpu_gmc_prepare_nps_mode_change(adev);
>  	drm_dev_unplug(dev);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index d2a046736edd..be9147eb8308 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -1508,6 +1508,7 @@ void amdgpu_driver_release_kms(struct drm_device *dev)
>  	struct amdgpu_device *adev = drm_to_adev(dev);
>  
>  	amdgpu_device_fini_sw(adev);
> +	amdgpu_xcp_mgr_fini(adev);

Suggest to move this inside fini_sw()

>  	pci_set_drvdata(adev->pdev, NULL);
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> index e209b5e101df..62dd5287808b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> @@ -283,6 +283,33 @@ static int amdgpu_xcp_dev_alloc(struct amdgpu_device *adev)
>  	return 0;
>  }
>  
> +static void amdgpu_xcp_dev_free(struct amdgpu_device *adev)
> +{
> +	struct drm_device *p_ddev;
> +	int i;
> +
> +	if (!adev->xcp_mgr)
> +		return;
> +
> +	for (i = 1; i < MAX_XCP; i++) {
> +		if (!adev->xcp_mgr->xcp[i].ddev)
> +			break;
> +
> +		// Restore and free the original drm_device.
> +		p_ddev = adev->xcp_mgr->xcp[i].ddev;
> +		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
> +		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
> +		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
> +		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;

Now that there are more calls, this doesn't make sense here. What about
moving the redirection along with register() (I guess it matters from
that point onwards) and undoing it (restore back saved values) along
with deregister()? With that, there won't be a need to have registered
flag. You may only need to check if xcp rdev/pdev is not NULL.

> +		amdgpu_xcp_drm_dev_free(p_ddev);
> +
> +		adev->xcp_mgr->xcp[i].ddev = NULL;
> +	}
> +
> +	adev->xcp_mgr->xcp->ddev = NULL;
> +}
> +
> +
>  int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
>  			int init_num_xcps,
>  			struct amdgpu_xcp_mgr_funcs *xcp_funcs)
> @@ -310,6 +337,13 @@ int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
>  	return amdgpu_xcp_dev_alloc(adev);
>  }
>  
> +void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev)
> +{
> +	amdgpu_xcp_dev_free(adev);
> +	kfree(adev->xcp_mgr);
> +	adev->xcp_mgr = NULL;

Thanks for adding this.

Thanks,
Lijo

> +}
> +
>  int amdgpu_xcp_get_partition(struct amdgpu_xcp_mgr *xcp_mgr,
>  			     enum AMDGPU_XCP_IP_BLOCK ip, int instance)
>  {
> @@ -359,12 +393,14 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
>  		ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
>  		if (ret)
>  			return ret;
> +
> +		adev->xcp_mgr->xcp[i].registered = true;
>  	}
>  
>  	return 0;
>  }
>  
> -void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
> +void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev)
>  {
>  	struct drm_device *p_ddev;
>  	int i;
> @@ -377,11 +413,10 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
>  			break;
>  
>  		p_ddev = adev->xcp_mgr->xcp[i].ddev;
> -		drm_dev_unplug(p_ddev);
> -		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
> -		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
> -		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
> -		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
> +		if (adev->xcp_mgr->xcp[i].registered) {
> +			drm_dev_unplug(p_ddev);
> +			adev->xcp_mgr->xcp[i].registered = false;
> +		}
>  	}
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> index b63f53242c57..be22d4398463 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> @@ -101,6 +101,7 @@ struct amdgpu_xcp {
>  	uint8_t id;
>  	uint8_t mem_id;
>  	bool valid;
> +	bool registered;
>  	atomic_t	ref_cnt;
>  	struct drm_device *ddev;
>  	struct drm_device *rdev;
> @@ -155,6 +156,7 @@ int amdgpu_xcp_resume(struct amdgpu_xcp_mgr *xcp_mgr, int xcp_id);
>  
>  int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
>  			int init_xcps, struct amdgpu_xcp_mgr_funcs *xcp_funcs);
> +void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev);
>  int amdgpu_xcp_init(struct amdgpu_xcp_mgr *xcp_mgr, int num_xcps, int mode);
>  int amdgpu_xcp_query_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, u32 flags);
>  int amdgpu_xcp_switch_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, int mode);
> @@ -168,7 +170,7 @@ int amdgpu_xcp_get_inst_details(struct amdgpu_xcp *xcp,
>  
>  int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
>  				const struct pci_device_id *ent);
> -void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev);
> +void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev);
>  int amdgpu_xcp_open_device(struct amdgpu_device *adev,
>  			   struct amdgpu_fpriv *fpriv,
>  			   struct drm_file *file_priv);


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  2025-01-08  9:16   ` Christian König
@ 2025-01-08  9:58     ` Lazar, Lijo
  2025-01-08 16:30     ` Chen, Xiaogang
  1 sibling, 0 replies; 23+ messages in thread
From: Lazar, Lijo @ 2025-01-08  9:58 UTC (permalink / raw)
  To: Christian König, Jiang Liu, alexander.deucher, Xinhui.Pan,
	airlied, simona, sunil.khatri, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx



On 1/8/2025 2:46 PM, Christian König wrote:
> Am 08.01.25 um 09:56 schrieb Jiang Liu:
>> Function detects initialization status by checking sched->ops,
> 
> Where is that done? Inside the scheduler or inside amdgpu?

Down below inside amdgpu_fence_driver_sw_fini(). I think sched.ready is
repurposed within amdgpu to indicate ring being ready to accept packets.

Thanks,
Lijo

> 
> Regards,
> Christian.
> 
>>   so set
>> sched->ops to non-NULL just before return in function
>> amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
>> to avoid possible invalid memory access on error recover path.
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 4 +++-
>>   2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/
>> drm/amd/amdgpu/amdgpu_device.c
>> index 510074a9074e..741807a1fd2e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct
>> amdgpu_device *adev)
>>           if (r) {
>>               DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>                     ring->name);
>> +            ring->sched.ops = NULL;
>>               return r;
>>           }
>>           r = amdgpu_uvd_entity_init(adev, ring);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/
>> drm/amd/amdgpu/amdgpu_fence.c
>> index 2f24a6aa13bf..b5e87b515139 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct
>> amdgpu_device *adev)
>>            * The natural check would be sched.ready, which is
>>            * set as drm_sched_init() finishes...
>>            */
>> -        if (ring->sched.ops)
>> +        if (ring->sched.ops) {
>>               drm_sched_fini(&ring->sched);
>> +            ring->sched.ops = NULL;
>> +        }
>>             for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>>               dma_fence_put(ring->fence_drv.fences[j]);
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()
  2025-01-08  8:56 ` [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
  2025-01-08  9:05   ` Christian König
@ 2025-01-08 10:02   ` Lazar, Lijo
  2025-01-08 10:05     ` Gerry Liu
  1 sibling, 1 reply; 23+ messages in thread
From: Lazar, Lijo @ 2025-01-08 10:02 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
	airlied, simona, sunil.khatri, Hawking.Zhang, mario.limonciello,
	Jun.Ma2, xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx



On 1/8/2025 2:26 PM, Jiang Liu wrote:
> If error happens before amdgpu_fence_driver_hw_init() gets called during
> device probe, it will trigger a false warning in amdgpu_irq_put() as
> below:
> [ 1209.300996] ------------[ cut here ]------------
> [ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc9jFrl7/amd/amdgpu/amdgpu_irq.c:633 amdgpu_irq_put+0x45/0x70 [amdgpu]
> [ 1209.301062] Modules linked in: ...
> [ 1209.301093] CPU: 48 PID: 293 Comm: kworker/48:1 Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
> [ 1209.301094] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
> [ 1209.301095] Workqueue: events work_for_cpu_fn
> [ 1209.301159] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
> [ 1209.301160] Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 c3 cc cc cc cc e9 8b fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
> [ 1209.301162] RSP: 0018:ffffb08a99c8fd88 EFLAGS: 00010246
> [ 1209.301162] RAX: ffff9efe1bcbf500 RBX: ffff9efe1cc3e400 RCX: 0000000000000000
> [ 1209.301163] RDX: 0000000000000000 RSI: ffff9efe1cc3b108 RDI: ffff9efe1cc00000
> [ 1209.301163] RBP: ffff9efe1cc10818 R08: 0000000000000001 R09: 000000000000000d
> [ 1209.301164] R10: ffffb08a99c8fb48 R11: ffffffffa2068018 R12: ffff9efe1cc109d0
> [ 1209.301164] R13: ffff9efe1cc00010 R14: ffff9efe1cc00000 R15: ffff9efe1cc3b108
> [ 1209.301165] FS:  0000000000000000(0000) GS:ffff9ff9fce00000(0000) knlGS:0000000000000000
> [ 1209.301165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1209.301165] CR2: 00007fd0f6e860d0 CR3: 0000010092baa003 CR4: 0000000002770ee0
> [ 1209.301166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1209.301166] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1209.301167] PKRU: 55555554
> [ 1209.301167] Call Trace:
> [ 1209.301225]  amdgpu_fence_driver_hw_fini+0xda/0x110 [amdgpu]
> [ 1209.301284]  amdgpu_device_fini_hw+0xaf/0x200 [amdgpu]
> [ 1209.301342]  amdgpu_driver_load_kms+0x7f/0xc0 [amdgpu]
> [ 1209.301400]  amdgpu_pci_probe+0x1cd/0x4a0 [amdgpu]
> [ 1209.301401]  local_pci_probe+0x40/0xa0
> [ 1209.301402]  work_for_cpu_fn+0x13/0x20
> [ 1209.301403]  process_one_work+0x1ad/0x380
> [ 1209.301404]  worker_thread+0x1c8/0x310
> [ 1209.301405]  ? process_one_work+0x380/0x380
> [ 1209.301406]  kthread+0x118/0x140
> [ 1209.301407]  ? __kthread_bind_mask+0x60/0x60
> [ 1209.301408]  ret_from_fork+0x1f/0x30
> [ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
> [ 1209.301418] ------------[ cut here ]------------
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  | 1 +
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index b5e87b515139..0e41a535e05f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -614,9 +614,11 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>  
>  		if (!drm_dev_is_unplugged(adev_to_drm(adev)) &&
>  		    ring->fence_drv.irq_src &&
> -		    amdgpu_fence_need_ring_interrupt_restore(ring))
> +		    ring->fence_drv.irq_enabled) {
>  			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>  				       ring->fence_drv.irq_type);
> +		        ring->fence_drv.irq_enabled = false;
> +		}
>  
>  		del_timer_sync(&ring->fence_drv.fallback_timer);
>  	}
> @@ -693,9 +695,12 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>  
>  		/* enable the interrupt */
>  		if (ring->fence_drv.irq_src &&
> -		    amdgpu_fence_need_ring_interrupt_restore(ring))
> +		    !ring->fence_drv.irq_enabled &&
> +		    amdgpu_fence_need_ring_interrupt_restore(ring)) {
>  			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>  				       ring->fence_drv.irq_type);
> +		        ring->fence_drv.irq_enabled = true;
> +		}

I guess the problem is more generic like calling fence driver hw_fini()
when hw_init is not called.

Thanks,
Lijo

>  	}
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index dee5a1b4e572..959d474a0516 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -118,6 +118,7 @@ struct amdgpu_fence_driver {
>  	uint32_t			sync_seq;
>  	atomic_t			last_seq;
>  	bool				initialized;
> +	bool				irq_enabled;
>  	struct amdgpu_irq_src		*irq_src;
>  	unsigned			irq_type;
>  	struct timer_list		fallback_timer;


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()
  2025-01-08 10:02   ` Lazar, Lijo
@ 2025-01-08 10:05     ` Gerry Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Gerry Liu @ 2025-01-08 10:05 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, Hawking.Zhang, mario.limonciello, Jun.Ma2,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 5093 bytes --]



> 2025年1月8日 18:02，Lazar, Lijo <lijo.lazar@amd.com> 写道：
> 
> 
> 
> On 1/8/2025 2:26 PM, Jiang Liu wrote:
>> If error happens before amdgpu_fence_driver_hw_init() gets called during
>> device probe, it will trigger a false warning in amdgpu_irq_put() as
>> below:
>> [ 1209.300996] ------------[ cut here ]------------
>> [ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc9jFrl7/amd/amdgpu/amdgpu_irq.c:633 amdgpu_irq_put+0x45/0x70 [amdgpu]
>> [ 1209.301062] Modules linked in: ...
>> [ 1209.301093] CPU: 48 PID: 293 Comm: kworker/48:1 Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
>> [ 1209.301094] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>> [ 1209.301095] Workqueue: events work_for_cpu_fn
>> [ 1209.301159] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
>> [ 1209.301160] Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 c3 cc cc cc cc e9 8b fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
>> [ 1209.301162] RSP: 0018:ffffb08a99c8fd88 EFLAGS: 00010246
>> [ 1209.301162] RAX: ffff9efe1bcbf500 RBX: ffff9efe1cc3e400 RCX: 0000000000000000
>> [ 1209.301163] RDX: 0000000000000000 RSI: ffff9efe1cc3b108 RDI: ffff9efe1cc00000
>> [ 1209.301163] RBP: ffff9efe1cc10818 R08: 0000000000000001 R09: 000000000000000d
>> [ 1209.301164] R10: ffffb08a99c8fb48 R11: ffffffffa2068018 R12: ffff9efe1cc109d0
>> [ 1209.301164] R13: ffff9efe1cc00010 R14: ffff9efe1cc00000 R15: ffff9efe1cc3b108
>> [ 1209.301165] FS:  0000000000000000(0000) GS:ffff9ff9fce00000(0000) knlGS:0000000000000000
>> [ 1209.301165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1209.301165] CR2: 00007fd0f6e860d0 CR3: 0000010092baa003 CR4: 0000000002770ee0
>> [ 1209.301166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1209.301166] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [ 1209.301167] PKRU: 55555554
>> [ 1209.301167] Call Trace:
>> [ 1209.301225]  amdgpu_fence_driver_hw_fini+0xda/0x110 [amdgpu]
>> [ 1209.301284]  amdgpu_device_fini_hw+0xaf/0x200 [amdgpu]
>> [ 1209.301342]  amdgpu_driver_load_kms+0x7f/0xc0 [amdgpu]
>> [ 1209.301400]  amdgpu_pci_probe+0x1cd/0x4a0 [amdgpu]
>> [ 1209.301401]  local_pci_probe+0x40/0xa0
>> [ 1209.301402]  work_for_cpu_fn+0x13/0x20
>> [ 1209.301403]  process_one_work+0x1ad/0x380
>> [ 1209.301404]  worker_thread+0x1c8/0x310
>> [ 1209.301405]  ? process_one_work+0x380/0x380
>> [ 1209.301406]  kthread+0x118/0x140
>> [ 1209.301407]  ? __kthread_bind_mask+0x60/0x60
>> [ 1209.301408]  ret_from_fork+0x1f/0x30
>> [ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
>> [ 1209.301418] ------------[ cut here ]------------
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++++++--
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  | 1 +
>> 2 files changed, 8 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index b5e87b515139..0e41a535e05f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -614,9 +614,11 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>> 
>> 		if (!drm_dev_is_unplugged(adev_to_drm(adev)) &&
>> 		    ring->fence_drv.irq_src &&
>> -		    amdgpu_fence_need_ring_interrupt_restore(ring))
>> +		    ring->fence_drv.irq_enabled) {
>> 			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>> 				       ring->fence_drv.irq_type);
>> +		        ring->fence_drv.irq_enabled = false;
>> +		}
>> 
>> 		del_timer_sync(&ring->fence_drv.fallback_timer);
>> 	}
>> @@ -693,9 +695,12 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>> 
>> 		/* enable the interrupt */
>> 		if (ring->fence_drv.irq_src &&
>> -		    amdgpu_fence_need_ring_interrupt_restore(ring))
>> +		    !ring->fence_drv.irq_enabled &&
>> +		    amdgpu_fence_need_ring_interrupt_restore(ring)) {
>> 			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>> 				       ring->fence_drv.irq_type);
>> +		        ring->fence_drv.irq_enabled = true;
>> +		}
> 
> I guess the problem is more generic like calling fence driver hw_fini()
> when hw_init is not called.
> 
You are so smart:)
I’m working on another patch set to fix these generic issues by tweaking the ip block and ras block state machine.
Thanks,
Gerry

> Thanks,
> Lijo
> 
>> 	}
>> }
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> index dee5a1b4e572..959d474a0516 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> @@ -118,6 +118,7 @@ struct amdgpu_fence_driver {
>> 	uint32_t			sync_seq;
>> 	atomic_t			last_seq;
>> 	bool				initialized;
>> +	bool				irq_enabled;
>> 	struct amdgpu_irq_src		*irq_src;
>> 	unsigned			irq_type;
>> 	struct timer_list		fallback_timer;


[-- Attachment #2: Type: text/html, Size: 17770 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  2025-01-08  8:56 ` [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
  2025-01-08  9:31   ` Lazar, Lijo
@ 2025-01-08 16:04   ` Mario Limonciello
  2025-01-09  2:39     ` Gerry Liu
  1 sibling, 1 reply; 23+ messages in thread
From: Mario Limonciello @ 2025-01-08 16:04 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
	airlied, simona, sunil.khatri, lijo.lazar, Hawking.Zhang, Jun.Ma2,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

On 1/8/2025 02:56, Jiang Liu wrote:
> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
> to do error recovery.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
>   drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 +++++++++++++++++----
>   drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
>   2 files changed, 63 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> index faed84172dd4..fc92b5fe1040 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> @@ -45,21 +45,32 @@ static const struct drm_driver amdgpu_xcp_driver = {
>   
>   static int8_t pdev_num;
>   static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
> +static DEFINE_MUTEX(xcp_mutex);
>   
>   int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>   {
>   	struct platform_device *pdev;
>   	struct xcp_device *pxcp_dev;
>   	char dev_name[20];
> -	int ret;
> +	int ret, index;
>   
> -	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
> -		return -ENODEV;
> +	mutex_lock(&xcp_mutex);

For "new" code what do you think about about using scoped guards like 
guard(mutex) instead of lock; goto label; unlock pattern?

I think it can generally keep the code cleaner, but you need to be 
careful because if you still need "other" goto cleanup patterns you can 
unintentionally break the compiled output.

> +	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (!xcp_dev[index])
> +			break;
> +	}
>   
> -	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
> +	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
>   	pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
> -	if (IS_ERR(pdev))
> -		return PTR_ERR(pdev);
> +	if (IS_ERR(pdev)) {
> +		ret = PTR_ERR(pdev);
> +		goto out_unregister;

Er, if you fail to register why would you unregister?  I think in this 
case with the current code you would 'goto out_unlock' instead.

The design pattern you might have been following was from platform 
drivers that do this, which is different:

platform_driver_register()
foo = platform_device_register_simple()
if (IS_ERR(foo))
	platform_driver_unregister()
return 0;

> +	}
>   
>   	if (!devres_open_group(&pdev->dev, NULL, GFP_KERNEL)) {
>   		ret = -ENOMEM;
> @@ -72,10 +83,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>   		goto out_devres;
>   	}
>   
> -	xcp_dev[pdev_num] = pxcp_dev;
> -	xcp_dev[pdev_num]->pdev = pdev;
> +	xcp_dev[index] = pxcp_dev;
> +	xcp_dev[index]->pdev = pdev;
>   	*ddev = &pxcp_dev->drm;
>   	pdev_num++;
> +	mutex_unlock(&xcp_mutex);
>   
>   	return 0;
>   
> @@ -83,21 +95,57 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>   	devres_release_group(&pdev->dev, NULL);
>   out_unregister:
>   	platform_device_unregister(pdev);
> +out_unlock:
> +	mutex_unlock(&xcp_mutex);
>   
>   	return ret;
>   }
>   EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>   
> +static void amdgpu_xcp_drm_dev_destroy(int index)
> +{
> +	struct platform_device *pdev;
> +
> +	pdev = xcp_dev[index]->pdev;
> +	devres_release_group(&pdev->dev, NULL);
> +	platform_device_unregister(pdev);
> +	xcp_dev[index] = NULL;
> +	pdev_num--;
> +}
> +
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
> +{
> +	int index;
> +	struct xcp_device *pxcp_dev;
> +
> +	if (ddev == NULL)
> +		return;
> +
> +	pxcp_dev = container_of(ddev, struct xcp_device, drm);
> +
> +	mutex_lock(&xcp_mutex);

I think this is a good case for a guard(mutex) instead.

> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (xcp_dev[index] == pxcp_dev) {
> +			amdgpu_xcp_drm_dev_destroy(index);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&xcp_mutex);
> +}
> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
> +
>   void amdgpu_xcp_drv_release(void)
>   {
> -	for (--pdev_num; pdev_num >= 0; --pdev_num) {
> -		struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
> +	int index;
>   
> -		devres_release_group(&pdev->dev, NULL);
> -		platform_device_unregister(pdev);
> -		xcp_dev[pdev_num] = NULL;
> +	mutex_lock(&xcp_mutex);

Another good case for guard(mutex)

> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> +		if (xcp_dev[index]) {
> +			amdgpu_xcp_drm_dev_destroy(index);
> +		}
>   	}
> -	pdev_num = 0;
> +	WARN_ON(pdev_num != 0);
> +	mutex_unlock(&xcp_mutex);
>   }
>   EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>   
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> index c1c4b679bf95..580a1602c8e3 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> @@ -25,5 +25,6 @@
>   #define _AMDGPU_XCP_DRV_H_
>   
>   int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
>   void amdgpu_xcp_drv_release(void);
>   #endif /* _AMDGPU_XCP_DRV_H_ */


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()
  2025-01-08  8:56 ` [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe() Jiang Liu
@ 2025-01-08 16:08   ` Mario Limonciello
  2025-01-09  3:34     ` Gerry Liu
  0 siblings, 1 reply; 23+ messages in thread
From: Mario Limonciello @ 2025-01-08 16:08 UTC (permalink / raw)
  To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
	airlied, simona, sunil.khatri, lijo.lazar, Hawking.Zhang, Jun.Ma2,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

On 1/8/2025 02:56, Jiang Liu wrote:
> Enhance error handling in function amdgpu_pci_probe() to avoid
> possible resource leakage.
> 
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +++++++++---
>   1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 41d1b06be600..f8deca2f2696 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2346,16 +2346,16 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>   		msleep(5000);
>   		goto retry_init;
>   	} else if (ret) {
> -		goto err_pci;
> +		goto unload_kms;

This label change looks wrong to me.  If you fail to 
amdgpu_driver_load_kms(), why would you amdgpu_driver_unload_kms()?

amdgpu_driver_load_kms() has cleanup handling already to call 
amdgpu_driver_unload_kms()

>   	}
>   
>   	ret = amdgpu_xcp_dev_register(adev, ent);
>   	if (ret)
> -		goto err_pci;
> +		goto unplug_drm;
>   
>   	ret = amdgpu_amdkfd_drm_client_create(adev);
>   	if (ret)
> -		goto err_pci;
> +		goto deregister_xcp;
>   
>   	/*
>   	 * 1. don't init fbdev on hw without DCE
> @@ -2424,6 +2424,12 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>   
>   	return 0;
>   
> +deregister_xcp:
> +	amdgpu_xcp_dev_deregister(adev);
> +unplug_drm:
> +	drm_dev_unplug(ddev);
> +unload_kms:
> +	amdgpu_driver_unload_kms(ddev);
>   err_pci:
>   	pci_disable_device(pdev);
>   	return ret;


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  2025-01-08  9:16   ` Christian König
  2025-01-08  9:58     ` Lazar, Lijo
@ 2025-01-08 16:30     ` Chen, Xiaogang
  2025-01-09 15:21       ` Christian König
  1 sibling, 1 reply; 23+ messages in thread
From: Chen, Xiaogang @ 2025-01-08 16:30 UTC (permalink / raw)
  To: Christian König, Jiang Liu, alexander.deucher, Xinhui.Pan,
	airlied, simona, sunil.khatri, lijo.lazar, Hawking.Zhang,
	mario.limonciello, Jun.Ma2, Kent.Russell, shuox.liu, amd-gfx


On 1/8/2025 3:16 AM, Christian König wrote:
> Am 08.01.25 um 09:56 schrieb Jiang Liu:
>> Function detects initialization status by checking sched->ops,
>
> Where is that done? Inside the scheduler or inside amdgpu?
Inside amdgpu set ring->sched.ops to null if ring's scheduler init fail 
since we use ring->sched.ops to decide uninit it by drm_sched_fini.
>
> Regards,
> Christian.
>
>>   so set
>> sched->ops to non-NULL just before return in function
>> amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
>> to avoid possible invalid memory access on error recover path.
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 4 +++-
>>   2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 510074a9074e..741807a1fd2e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct 
>> amdgpu_device *adev)
>>           if (r) {
>>               DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>                     ring->name);
>> +            ring->sched.ops = NULL;
>>               return r;
>>           }
>>           r = amdgpu_uvd_entity_init(adev, ring);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 2f24a6aa13bf..b5e87b515139 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct 
>> amdgpu_device *adev)
>>            * The natural check would be sched.ready, which is
>>            * set as drm_sched_init() finishes...
>>            */
>> -        if (ring->sched.ops)
>> +        if (ring->sched.ops) {
>>               drm_sched_fini(&ring->sched);
>> +            ring->sched.ops = NULL;
>> +        }
>>             for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>>               dma_fence_put(ring->fence_drv.fences[j]);
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  2025-01-08  9:31   ` Lazar, Lijo
@ 2025-01-09  1:46     ` Gerry Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Gerry Liu @ 2025-01-09  1:46 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, Hawking.Zhang, mario.limonciello, Jun.Ma2,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 5014 bytes --]



> 2025年1月8日 17:31，Lazar, Lijo <lijo.lazar@amd.com> 写道：
> 
> 
> 
> On 1/8/2025 2:26 PM, Jiang Liu wrote:
>> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
>> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
>> to do error recovery.
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 +++++++++++++++++----
>> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
>> 2 files changed, 63 insertions(+), 14 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> index faed84172dd4..fc92b5fe1040 100644
>> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> @@ -45,21 +45,32 @@ static const struct drm_driver amdgpu_xcp_driver = {
>> 
>> static int8_t pdev_num;
>> static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
>> +static DEFINE_MUTEX(xcp_mutex);
>> 
>> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> {
>> 	struct platform_device *pdev;
>> 	struct xcp_device *pxcp_dev;
>> 	char dev_name[20];
>> -	int ret;
>> +	int ret, index;
>> 
>> -	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
>> -		return -ENODEV;
>> +	mutex_lock(&xcp_mutex);
>> +	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE) {
>> +		ret = -ENODEV;
>> +		goto out_unlock;
>> +	}
>> +
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (!xcp_dev[index])
>> +			break;
>> +	}
>> 
>> -	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
>> +	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
>> 	pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
>> -	if (IS_ERR(pdev))
>> -		return PTR_ERR(pdev);
>> +	if (IS_ERR(pdev)) {
>> +		ret = PTR_ERR(pdev);
>> +		goto out_unregister;
>> +	}
>> 
>> 	if (!devres_open_group(&pdev->dev, NULL, GFP_KERNEL)) {
>> 		ret = -ENOMEM;
>> @@ -72,10 +83,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> 		goto out_devres;
>> 	}
>> 
>> -	xcp_dev[pdev_num] = pxcp_dev;
>> -	xcp_dev[pdev_num]->pdev = pdev;
>> +	xcp_dev[index] = pxcp_dev;
>> +	xcp_dev[index]->pdev = pdev;
>> 	*ddev = &pxcp_dev->drm;
>> 	pdev_num++;
>> +	mutex_unlock(&xcp_mutex);
>> 
>> 	return 0;
>> 
>> @@ -83,21 +95,57 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> 	devres_release_group(&pdev->dev, NULL);
>> out_unregister:
>> 	platform_device_unregister(pdev);
>> +out_unlock:
>> +	mutex_unlock(&xcp_mutex);
>> 
>> 	return ret;
>> }
>> EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>> 
>> +static void amdgpu_xcp_drm_dev_destroy(int index)
> 
> <Nit> Use something like __amdgpu_xcp_drm_dev_free(int index) to keep
> the 'free' suffix.
> 
>> +{
>> +	struct platform_device *pdev;
>> +
>> +	pdev = xcp_dev[index]->pdev;
>> +	devres_release_group(&pdev->dev, NULL);
>> +	platform_device_unregister(pdev);
>> +	xcp_dev[index] = NULL;
>> +	pdev_num--;
>> +}
>> +
>> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
>> +{
>> +	int index;
>> +	struct xcp_device *pxcp_dev;
>> +
>> +	if (ddev == NULL)
> 
> Does it make sense to add !pdev_num check or a WARN_ON(!pdev_num) below?
> 
>> +		return;
>> +
>> +	pxcp_dev = container_of(ddev, struct xcp_device, drm);
>> +
>> +	mutex_lock(&xcp_mutex);
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (xcp_dev[index] == pxcp_dev) {
>> +			amdgpu_xcp_drm_dev_destroy(index);
>> +			break;
>> +		}
>> +	}
>> +	mutex_unlock(&xcp_mutex);
>> +}
>> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
>> +
>> void amdgpu_xcp_drv_release(void)
>> {
>> -	for (--pdev_num; pdev_num >= 0; --pdev_num) {
>> -		struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
>> +	int index;
>> 
>> -		devres_release_group(&pdev->dev, NULL);
>> -		platform_device_unregister(pdev);
>> -		xcp_dev[pdev_num] = NULL;
> 
> To better optimize, add one check like below.
> 	if (!pdev_num)
> 		return;
Thanks for review and suggestions, all comments will be addressed in next version.

> 
> Thanks,
> Lijo
> 
>> +	mutex_lock(&xcp_mutex);
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (xcp_dev[index]) {
>> +			amdgpu_xcp_drm_dev_destroy(index);
>> +		}
>> 	}
>> -	pdev_num = 0;
>> +	WARN_ON(pdev_num != 0);
>> +	mutex_unlock(&xcp_mutex);
>> }
>> EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>> 
>> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> index c1c4b679bf95..580a1602c8e3 100644
>> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> @@ -25,5 +25,6 @@
>> #define _AMDGPU_XCP_DRV_H_
>> 
>> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
>> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
>> void amdgpu_xcp_drv_release(void);
>> #endif /* _AMDGPU_XCP_DRV_H_ */


[-- Attachment #2: Type: text/html, Size: 25092 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
  2025-01-08 16:04   ` Mario Limonciello
@ 2025-01-09  2:39     ` Gerry Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Gerry Liu @ 2025-01-09  2:39 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, Jun.Ma2, xiaogang.chen,
	Kent.Russell, shuox.liu, amd-gfx



> 2025年1月9日 00:04，Mario Limonciello <mario.limonciello@amd.com> 写道：
> 
> On 1/8/2025 02:56, Jiang Liu wrote:
>> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
>> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
>> to do error recovery.
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>>  drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 76 +++++++++++++++++----
>>  drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h |  1 +
>>  2 files changed, 63 insertions(+), 14 deletions(-)
>> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> index faed84172dd4..fc92b5fe1040 100644
>> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> @@ -45,21 +45,32 @@ static const struct drm_driver amdgpu_xcp_driver = {
>>    static int8_t pdev_num;
>>  static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
>> +static DEFINE_MUTEX(xcp_mutex);
>>    int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>>  {
>>  	struct platform_device *pdev;
>>  	struct xcp_device *pxcp_dev;
>>  	char dev_name[20];
>> -	int ret;
>> +	int ret, index;
>>  -	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
>> -		return -ENODEV;
>> +	mutex_lock(&xcp_mutex);
> 
> For "new" code what do you think about about using scoped guards like guard(mutex) instead of lock; goto label; unlock pattern?
> 
> I think it can generally keep the code cleaner, but you need to be careful because if you still need "other" goto cleanup patterns you can unintentionally break the compiled output.
Thanks for introducing this new utility, which makes me feel like writing rust code:)

> 
>> +	if (pdev_num >= MAX_XCP_PLATFORM_DEVICE) {
>> +		ret = -ENODEV;
>> +		goto out_unlock;
>> +	}
>> +
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (!xcp_dev[index])
>> +			break;
>> +	}
>>  -	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
>> +	snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
>>  	pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
>> -	if (IS_ERR(pdev))
>> -		return PTR_ERR(pdev);
>> +	if (IS_ERR(pdev)) {
>> +		ret = PTR_ERR(pdev);
>> +		goto out_unregister;
> 
> Er, if you fail to register why would you unregister?  I think in this case with the current code you would 'goto out_unlock' instead.
> 
> The design pattern you might have been following was from platform drivers that do this, which is different:
> 
> platform_driver_register()
> foo = platform_device_register_simple()
> if (IS_ERR(foo))
> 	platform_driver_unregister()
> return 0;
Will fix it in next version.

> 
>> +	}
>>    	if (!devres_open_group(&pdev->dev, NULL, GFP_KERNEL)) {
>>  		ret = -ENOMEM;
>> @@ -72,10 +83,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>>  		goto out_devres;
>>  	}
>>  -	xcp_dev[pdev_num] = pxcp_dev;
>> -	xcp_dev[pdev_num]->pdev = pdev;
>> +	xcp_dev[index] = pxcp_dev;
>> +	xcp_dev[index]->pdev = pdev;
>>  	*ddev = &pxcp_dev->drm;
>>  	pdev_num++;
>> +	mutex_unlock(&xcp_mutex);
>>    	return 0;
>>  @@ -83,21 +95,57 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>>  	devres_release_group(&pdev->dev, NULL);
>>  out_unregister:
>>  	platform_device_unregister(pdev);
>> +out_unlock:
>> +	mutex_unlock(&xcp_mutex);
>>    	return ret;
>>  }
>>  EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>>  +static void amdgpu_xcp_drm_dev_destroy(int index)
>> +{
>> +	struct platform_device *pdev;
>> +
>> +	pdev = xcp_dev[index]->pdev;
>> +	devres_release_group(&pdev->dev, NULL);
>> +	platform_device_unregister(pdev);
>> +	xcp_dev[index] = NULL;
>> +	pdev_num--;
>> +}
>> +
>> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
>> +{
>> +	int index;
>> +	struct xcp_device *pxcp_dev;
>> +
>> +	if (ddev == NULL)
>> +		return;
>> +
>> +	pxcp_dev = container_of(ddev, struct xcp_device, drm);
>> +
>> +	mutex_lock(&xcp_mutex);
> 
> I think this is a good case for a guard(mutex) instead.
> 
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (xcp_dev[index] == pxcp_dev) {
>> +			amdgpu_xcp_drm_dev_destroy(index);
>> +			break;
>> +		}
>> +	}
>> +	mutex_unlock(&xcp_mutex);
>> +}
>> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
>> +
>>  void amdgpu_xcp_drv_release(void)
>>  {
>> -	for (--pdev_num; pdev_num >= 0; --pdev_num) {
>> -		struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
>> +	int index;
>>  -		devres_release_group(&pdev->dev, NULL);
>> -		platform_device_unregister(pdev);
>> -		xcp_dev[pdev_num] = NULL;
>> +	mutex_lock(&xcp_mutex);
> 
> Another good case for guard(mutex)
> 
>> +	for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> +		if (xcp_dev[index]) {
>> +			amdgpu_xcp_drm_dev_destroy(index);
>> +		}
>>  	}
>> -	pdev_num = 0;
>> +	WARN_ON(pdev_num != 0);
>> +	mutex_unlock(&xcp_mutex);
>>  }
>>  EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>>  diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> index c1c4b679bf95..580a1602c8e3 100644
>> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> @@ -25,5 +25,6 @@
>>  #define _AMDGPU_XCP_DRV_H_
>>    int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
>> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
>>  void amdgpu_xcp_drv_release(void);
>>  #endif /* _AMDGPU_XCP_DRV_H_ */


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
  2025-01-08  9:54   ` Lazar, Lijo
@ 2025-01-09  3:00     ` Gerry Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Gerry Liu @ 2025-01-09  3:00 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, Hawking.Zhang, mario.limonciello, Jun.Ma2,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 9649 bytes --]



> 2025年1月8日 17:54，Lazar, Lijo <lijo.lazar@amd.com> 写道：
> 
> 
> 
> On 1/8/2025 2:26 PM, Jiang Liu wrote:
>> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
>> after free bug related to amdgpu_driver_release_kms() as:
>> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> 2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
>> 2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
>> 2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
>> 2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
>> 2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G        W   E      6.10.0+ #2
>> 2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>> 2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
>> 2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
>> 43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
>> 2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
>> 2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
>> 2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
>> 2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
>> 2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> 2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
>> 2024-12-26 16:17:46 [16002.213648] FS:  00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
>> 2024-12-26 16:17:46 [16002.223189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> 2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
>> 2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> 2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> e024se+0x3c/0x90 [amdxcp]
>> 2024-12-26 16:17:46 [16002.337645]  __do_sys_delete_module.constprop.0+0x176/0x310
>> 2024-12-26 16:17:46 [16002.344324]  do_syscall_64+0x5d/0x170
>> 2024-12-26 16:17:46 [16002.348858]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
>> 
>> Fix it by removing xcp drm devices when failed to probe GPU devices.
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  2 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c    | 47 +++++++++++++++++++---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h    |  4 +-
>> 5 files changed, 47 insertions(+), 9 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 5ff53a3b9851..510074a9074e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -6682,7 +6682,7 @@ void amdgpu_device_halt(struct amdgpu_device *adev)
>> 	struct pci_dev *pdev = adev->pdev;
>> 	struct drm_device *ddev = adev_to_drm(adev);
>> 
>> -	amdgpu_xcp_dev_unplug(adev);
>> +	amdgpu_xcp_dev_deregister(adev);
>> 	drm_dev_unplug(ddev);
>> 
>> 	amdgpu_irq_disable_all(adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 62de668e9ff8..41d1b06be600 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2435,7 +2435,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
>> 	struct drm_device *dev = pci_get_drvdata(pdev);
>> 	struct amdgpu_device *adev = drm_to_adev(dev);
>> 
>> -	amdgpu_xcp_dev_unplug(adev);
>> +	amdgpu_xcp_dev_deregister(adev);
>> 	amdgpu_gmc_prepare_nps_mode_change(adev);
>> 	drm_dev_unplug(dev);
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index d2a046736edd..be9147eb8308 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -1508,6 +1508,7 @@ void amdgpu_driver_release_kms(struct drm_device *dev)
>> 	struct amdgpu_device *adev = drm_to_adev(dev);
>> 
>> 	amdgpu_device_fini_sw(adev);
>> +	amdgpu_xcp_mgr_fini(adev);
> 
> Suggest to move this inside fini_sw()
> 
>> 	pci_set_drvdata(adev->pdev, NULL);
>> }
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> index e209b5e101df..62dd5287808b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> @@ -283,6 +283,33 @@ static int amdgpu_xcp_dev_alloc(struct amdgpu_device *adev)
>> 	return 0;
>> }
>> 
>> +static void amdgpu_xcp_dev_free(struct amdgpu_device *adev)
>> +{
>> +	struct drm_device *p_ddev;
>> +	int i;
>> +
>> +	if (!adev->xcp_mgr)
>> +		return;
>> +
>> +	for (i = 1; i < MAX_XCP; i++) {
>> +		if (!adev->xcp_mgr->xcp[i].ddev)
>> +			break;
>> +
>> +		// Restore and free the original drm_device.
>> +		p_ddev = adev->xcp_mgr->xcp[i].ddev;
>> +		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
>> +		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
>> +		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
>> +		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
> 
> Now that there are more calls, this doesn't make sense here. What about
> moving the redirection along with register() (I guess it matters from
> that point onwards) and undoing it (restore back saved values) along
> with deregister()? With that, there won't be a need to have registered
> flag. You may only need to check if xcp rdev/pdev is not NULL.
Good point, this makes code more clear.


> 
>> +		amdgpu_xcp_drm_dev_free(p_ddev);
>> +
>> +		adev->xcp_mgr->xcp[i].ddev = NULL;
>> +	}
>> +
>> +	adev->xcp_mgr->xcp->ddev = NULL;
>> +}
>> +
>> +
>> int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
>> 			int init_num_xcps,
>> 			struct amdgpu_xcp_mgr_funcs *xcp_funcs)
>> @@ -310,6 +337,13 @@ int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
>> 	return amdgpu_xcp_dev_alloc(adev);
>> }
>> 
>> +void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev)
>> +{
>> +	amdgpu_xcp_dev_free(adev);
>> +	kfree(adev->xcp_mgr);
>> +	adev->xcp_mgr = NULL;
> 
> Thanks for adding this.
> 
> Thanks,
> Lijo
> 
>> +}
>> +
>> int amdgpu_xcp_get_partition(struct amdgpu_xcp_mgr *xcp_mgr,
>> 			     enum AMDGPU_XCP_IP_BLOCK ip, int instance)
>> {
>> @@ -359,12 +393,14 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
>> 		ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
>> 		if (ret)
>> 			return ret;
>> +
>> +		adev->xcp_mgr->xcp[i].registered = true;
>> 	}
>> 
>> 	return 0;
>> }
>> 
>> -void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
>> +void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev)
>> {
>> 	struct drm_device *p_ddev;
>> 	int i;
>> @@ -377,11 +413,10 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
>> 			break;
>> 
>> 		p_ddev = adev->xcp_mgr->xcp[i].ddev;
>> -		drm_dev_unplug(p_ddev);
>> -		p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
>> -		p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
>> -		p_ddev->driver =  adev->xcp_mgr->xcp[i].driver;
>> -		p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
>> +		if (adev->xcp_mgr->xcp[i].registered) {
>> +			drm_dev_unplug(p_ddev);
>> +			adev->xcp_mgr->xcp[i].registered = false;
>> +		}
>> 	}
>> }
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> index b63f53242c57..be22d4398463 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> @@ -101,6 +101,7 @@ struct amdgpu_xcp {
>> 	uint8_t id;
>> 	uint8_t mem_id;
>> 	bool valid;
>> +	bool registered;
>> 	atomic_t	ref_cnt;
>> 	struct drm_device *ddev;
>> 	struct drm_device *rdev;
>> @@ -155,6 +156,7 @@ int amdgpu_xcp_resume(struct amdgpu_xcp_mgr *xcp_mgr, int xcp_id);
>> 
>> int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
>> 			int init_xcps, struct amdgpu_xcp_mgr_funcs *xcp_funcs);
>> +void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev);
>> int amdgpu_xcp_init(struct amdgpu_xcp_mgr *xcp_mgr, int num_xcps, int mode);
>> int amdgpu_xcp_query_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, u32 flags);
>> int amdgpu_xcp_switch_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, int mode);
>> @@ -168,7 +170,7 @@ int amdgpu_xcp_get_inst_details(struct amdgpu_xcp *xcp,
>> 
>> int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
>> 				const struct pci_device_id *ent);
>> -void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev);
>> +void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev);
>> int amdgpu_xcp_open_device(struct amdgpu_device *adev,
>> 			   struct amdgpu_fpriv *fpriv,
>> 			   struct drm_file *file_priv);


[-- Attachment #2: Type: text/html, Size: 31479 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()
  2025-01-08 16:08   ` Mario Limonciello
@ 2025-01-09  3:34     ` Gerry Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Gerry Liu @ 2025-01-09  3:34 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
	sunil.khatri, lijo.lazar, Hawking.Zhang, Jun.Ma2, xiaogang.chen,
	Kent.Russell, shuox.liu, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 1805 bytes --]



> 2025年1月9日 00:08，Mario Limonciello <mario.limonciello@amd.com> 写道：
> 
> On 1/8/2025 02:56, Jiang Liu wrote:
>> Enhance error handling in function amdgpu_pci_probe() to avoid
>> possible resource leakage.
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +++++++++---
>>  1 file changed, 9 insertions(+), 3 deletions(-)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 41d1b06be600..f8deca2f2696 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2346,16 +2346,16 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>>  		msleep(5000);
>>  		goto retry_init;
>>  	} else if (ret) {
>> -		goto err_pci;
>> +		goto unload_kms;
> 
> This label change looks wrong to me.  If you fail to amdgpu_driver_load_kms(), why would you amdgpu_driver_unload_kms()?
> 
> amdgpu_driver_load_kms() has cleanup handling already to call amdgpu_driver_unload_kms()
This goto is to handle failure of drm_dev_register() instead of failure of amdgpu_driver_load_kms().
Thanks,
Gerry

> 
>>  	}
>>    	ret = amdgpu_xcp_dev_register(adev, ent);
>>  	if (ret)
>> -		goto err_pci;
>> +		goto unplug_drm;
>>    	ret = amdgpu_amdkfd_drm_client_create(adev);
>>  	if (ret)
>> -		goto err_pci;
>> +		goto deregister_xcp;
>>    	/*
>>  	 * 1. don't init fbdev on hw without DCE
>> @@ -2424,6 +2424,12 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>>    	return 0;
>>  +deregister_xcp:
>> +	amdgpu_xcp_dev_deregister(adev);
>> +unplug_drm:
>> +	drm_dev_unplug(ddev);
>> +unload_kms:
>> +	amdgpu_driver_unload_kms(ddev);
>>  err_pci:
>>  	pci_disable_device(pdev);
>>  	return ret;


[-- Attachment #2: Type: text/html, Size: 8666 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
  2025-01-08 16:30     ` Chen, Xiaogang
@ 2025-01-09 15:21       ` Christian König
  0 siblings, 0 replies; 23+ messages in thread
From: Christian König @ 2025-01-09 15:21 UTC (permalink / raw)
  To: Chen, Xiaogang, Jiang Liu, alexander.deucher, Xinhui.Pan, airlied,
	simona, sunil.khatri, lijo.lazar, Hawking.Zhang,
	mario.limonciello, Jun.Ma2, Kent.Russell, shuox.liu, amd-gfx

Am 08.01.25 um 17:30 schrieb Chen, Xiaogang:
>
> On 1/8/2025 3:16 AM, Christian König wrote:
>> Am 08.01.25 um 09:56 schrieb Jiang Liu:
>>> Function detects initialization status by checking sched->ops,
>>
>> Where is that done? Inside the scheduler or inside amdgpu?
> Inside amdgpu set ring->sched.ops to null if ring's scheduler init 
> fail since we use ring->sched.ops to decide uninit it by drm_sched_fini.

That is probably something we should stop doing instead.

amdgpu_device_init_schedulers() needs some proper error handling and a 
matching amdgpu_device_fini_schedulers() function.

That this is still in the fence code and looking at the scheduler ops is 
probably just a leftover from very long ago.

Regards,
Christian.

>>
>> Regards,
>> Christian.
>>
>>>   so set
>>> sched->ops to non-NULL just before return in function
>>> amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
>>> to avoid possible invalid memory access on error recover path.
>>>
>>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 4 +++-
>>>   2 files changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 510074a9074e..741807a1fd2e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2857,6 +2857,7 @@ static int 
>>> amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>>           if (r) {
>>>               DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>                     ring->name);
>>> +            ring->sched.ops = NULL;
>>>               return r;
>>>           }
>>>           r = amdgpu_uvd_entity_init(adev, ring);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 2f24a6aa13bf..b5e87b515139 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct 
>>> amdgpu_device *adev)
>>>            * The natural check would be sched.ready, which is
>>>            * set as drm_sched_init() finishes...
>>>            */
>>> -        if (ring->sched.ops)
>>> +        if (ring->sched.ops) {
>>>               drm_sched_fini(&ring->sched);
>>> +            ring->sched.ops = NULL;
>>> +        }
>>>             for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>>>               dma_fence_put(ring->fence_drv.fences[j]);
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()
  2025-01-08  9:05   ` Christian König
@ 2025-01-10  2:01     ` Gerry Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Gerry Liu @ 2025-01-10  2:01 UTC (permalink / raw)
  To: Christian König
  Cc: alexander.deucher, Xinhui.Pan, airlied, simona, sunil.khatri,
	lijo.lazar, Hawking.Zhang, mario.limonciello, Jun.Ma2,
	xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 5160 bytes --]



> 2025年1月8日 17:05，Christian König <christian.koenig@amd.com> 写道：
> 
> Am 08.01.25 um 09:56 schrieb Jiang Liu:
>> If error happens before amdgpu_fence_driver_hw_init() gets called during
>> device probe, it will trigger a false warning in amdgpu_irq_put() as
>> below:
>> [ 1209.300996] ------------[ cut here ]------------
>> [ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc9jFrl7/amd/amdgpu/amdgpu_irq.c:633 amdgpu_irq_put+0x45/0x70 [amdgpu]
>> [ 1209.301062] Modules linked in: ...
>> [ 1209.301093] CPU: 48 PID: 293 Comm: kworker/48:1 Kdump: loaded Tainted: G        W  OE     5.10.134-17.2.al8.x86_64 #1
>> [ 1209.301094] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>> [ 1209.301095] Workqueue: events work_for_cpu_fn
>> [ 1209.301159] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
>> [ 1209.301160] Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 c3 cc cc cc cc e9 8b fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
>> [ 1209.301162] RSP: 0018:ffffb08a99c8fd88 EFLAGS: 00010246
>> [ 1209.301162] RAX: ffff9efe1bcbf500 RBX: ffff9efe1cc3e400 RCX: 0000000000000000
>> [ 1209.301163] RDX: 0000000000000000 RSI: ffff9efe1cc3b108 RDI: ffff9efe1cc00000
>> [ 1209.301163] RBP: ffff9efe1cc10818 R08: 0000000000000001 R09: 000000000000000d
>> [ 1209.301164] R10: ffffb08a99c8fb48 R11: ffffffffa2068018 R12: ffff9efe1cc109d0
>> [ 1209.301164] R13: ffff9efe1cc00010 R14: ffff9efe1cc00000 R15: ffff9efe1cc3b108
>> [ 1209.301165] FS:  0000000000000000(0000) GS:ffff9ff9fce00000(0000) knlGS:0000000000000000
>> [ 1209.301165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1209.301165] CR2: 00007fd0f6e860d0 CR3: 0000010092baa003 CR4: 0000000002770ee0
>> [ 1209.301166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1209.301166] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [ 1209.301167] PKRU: 55555554
>> [ 1209.301167] Call Trace:
>> [ 1209.301225]  amdgpu_fence_driver_hw_fini+0xda/0x110 [amdgpu]
>> [ 1209.301284]  amdgpu_device_fini_hw+0xaf/0x200 [amdgpu]
>> [ 1209.301342]  amdgpu_driver_load_kms+0x7f/0xc0 [amdgpu]
>> [ 1209.301400]  amdgpu_pci_probe+0x1cd/0x4a0 [amdgpu]
>> [ 1209.301401]  local_pci_probe+0x40/0xa0
>> [ 1209.301402]  work_for_cpu_fn+0x13/0x20
>> [ 1209.301403]  process_one_work+0x1ad/0x380
>> [ 1209.301404]  worker_thread+0x1c8/0x310
>> [ 1209.301405]  ? process_one_work+0x380/0x380
>> [ 1209.301406]  kthread+0x118/0x140
>> [ 1209.301407]  ? __kthread_bind_mask+0x60/0x60
>> [ 1209.301408]  ret_from_fork+0x1f/0x30
>> [ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
>> [ 1209.301418] ------------[ cut here ]------------
>> 
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++++++--
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  | 1 +
>>  2 files changed, 8 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index b5e87b515139..0e41a535e05f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -614,9 +614,11 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>>    		if (!drm_dev_is_unplugged(adev_to_drm(adev)) &&
>>  		    ring->fence_drv.irq_src &&
>> -		    amdgpu_fence_need_ring_interrupt_restore(ring))
>> +		    ring->fence_drv.irq_enabled) {
>>  			amdgpu_irq_put(adev, ring->fence_drv.irq_src,
>>  				       ring->fence_drv.irq_type);
>> +		        ring->fence_drv.irq_enabled = false;
>> +		}
> 
> Clearly a NAK, that is exactly what the warning is supposed to warn about.
Hi Christian,
	This is part of a more generic issue related ip block state transition, I will move this patch into
the next patch set, which tries to enhance the ip block state machine to avoid false warnings.
Thanks,
Gerry

> 
> Regards,
> Christian.
> 
>>    		del_timer_sync(&ring->fence_drv.fallback_timer);
>>  	}
>> @@ -693,9 +695,12 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>>    		/* enable the interrupt */
>>  		if (ring->fence_drv.irq_src &&
>> -		    amdgpu_fence_need_ring_interrupt_restore(ring))
>> +		    !ring->fence_drv.irq_enabled &&
>> +		    amdgpu_fence_need_ring_interrupt_restore(ring)) {
>>  			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
>>  				       ring->fence_drv.irq_type);
>> +		        ring->fence_drv.irq_enabled = true;
>> +		}
>>  	}
>>  }
>>  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> index dee5a1b4e572..959d474a0516 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>> @@ -118,6 +118,7 @@ struct amdgpu_fence_driver {
>>  	uint32_t			sync_seq;
>>  	atomic_t			last_seq;
>>  	bool				initialized;
>> +	bool				irq_enabled;
>>  	struct amdgpu_irq_src		*irq_src;
>>  	unsigned			irq_type;
>>  	struct timer_list		fallback_timer;


[-- Attachment #2: Type: text/html, Size: 16652 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-01-10  2:01 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-08  8:55 [PATCH v3 0/6] Fix several bugs in error handling during device probe Jiang Liu
2025-01-08  8:55 ` [v3 1/6] drm/amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
2025-01-08  8:56 ` [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
2025-01-08  9:31   ` Lazar, Lijo
2025-01-09  1:46     ` Gerry Liu
2025-01-08 16:04   ` Mario Limonciello
2025-01-09  2:39     ` Gerry Liu
2025-01-08  8:56 ` [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
2025-01-08  9:54   ` Lazar, Lijo
2025-01-09  3:00     ` Gerry Liu
2025-01-08  8:56 ` [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe() Jiang Liu
2025-01-08 16:08   ` Mario Limonciello
2025-01-09  3:34     ` Gerry Liu
2025-01-08  8:56 ` [v3 5/6] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
2025-01-08  9:16   ` Christian König
2025-01-08  9:58     ` Lazar, Lijo
2025-01-08 16:30     ` Chen, Xiaogang
2025-01-09 15:21       ` Christian König
2025-01-08  8:56 ` [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
2025-01-08  9:05   ` Christian König
2025-01-10  2:01     ` Gerry Liu
2025-01-08 10:02   ` Lazar, Lijo
2025-01-08 10:05     ` Gerry Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.