* [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
2025-01-05 2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
@ 2025-01-05 2:45 ` Jiang Liu
2025-01-05 5:22 ` Shuo Liu
2025-01-07 22:53 ` Chen, Xiaogang
2025-01-05 2:45 ` [PATCH v2 2/6] amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
` (4 subsequent siblings)
5 siblings, 2 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05 2:45 UTC (permalink / raw)
To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu
Fix possible resource leakage on error recovery path in function
kgd2kfd_device_init().
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index a29374c86405..fa5054940486 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
if (kfd->adev->xcp_mgr)
kfd_setup_interrupt_bitmap(node, i);
+ spin_lock_init(&node->watch_points_lock);
+
+ kfd->nodes[i] = node;
+
/* Initialize the KFD node */
if (kfd_init_node(node)) {
dev_err(kfd_device, "Error initializing KFD node\n");
goto node_init_error;
}
-
- spin_lock_init(&node->watch_points_lock);
-
- kfd->nodes[i] = node;
}
svm_range_set_max_pages(kfd->adev);
@@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
goto out;
node_init_error:
+ i++;
node_alloc_error:
kfd_cleanup_nodes(kfd, i);
kfd_doorbell_fini(kfd);
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
2025-01-05 2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
@ 2025-01-05 5:22 ` Shuo Liu
2025-01-05 6:57 ` Gerry Liu
2025-01-06 3:07 ` Gerry Liu
2025-01-07 22:53 ` Chen, Xiaogang
1 sibling, 2 replies; 18+ messages in thread
From: Shuo Liu @ 2025-01-05 5:22 UTC (permalink / raw)
To: Jiang Liu; +Cc: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell
Hi Gerry,
On Sun 5.Jan'25 at 10:45:29 +0800, Jiang Liu wrote:
>Fix possible resource leakage on error recovery path in function
>kgd2kfd_device_init().
>
>Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>---
> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>index a29374c86405..fa5054940486 100644
>--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>@@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> if (kfd->adev->xcp_mgr)
> kfd_setup_interrupt_bitmap(node, i);
>
>+ spin_lock_init(&node->watch_points_lock);
>+
>+ kfd->nodes[i] = node;
>+
> /* Initialize the KFD node */
> if (kfd_init_node(node)) {
> dev_err(kfd_device, "Error initializing KFD node\n");
> goto node_init_error;
> }
>-
>- spin_lock_init(&node->watch_points_lock);
>-
>- kfd->nodes[i] = node;
> }
>
> svm_range_set_max_pages(kfd->adev);
>@@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> goto out;
>
> node_init_error:
>+ i++;
The err cleanup path can release node itself. So the following
kfd_cleanup_nodes() may do a double free?
> node_alloc_error:
> kfd_cleanup_nodes(kfd, i);
> kfd_doorbell_fini(kfd);
shuo
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
2025-01-05 5:22 ` Shuo Liu
@ 2025-01-05 6:57 ` Gerry Liu
2025-01-06 3:07 ` Gerry Liu
1 sibling, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-05 6:57 UTC (permalink / raw)
To: Shuo Liu; +Cc: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell
[-- Attachment #1: Type: text/plain, Size: 1774 bytes --]
> 2025年1月5日 13:22,Shuo Liu <shuox.liu@linux.alibaba.com> 写道:
>
> Hi Gerry,
>
> On Sun 5.Jan'25 at 10:45:29 +0800, Jiang Liu wrote:
>> Fix possible resource leakage on error recovery path in function
>> kgd2kfd_device_init().
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index a29374c86405..fa5054940486 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> if (kfd->adev->xcp_mgr)
>> kfd_setup_interrupt_bitmap(node, i);
>>
>> + spin_lock_init(&node->watch_points_lock);
>> +
>> + kfd->nodes[i] = node;
>> +
>> /* Initialize the KFD node */
>> if (kfd_init_node(node)) {
>> dev_err(kfd_device, "Error initializing KFD node\n");
>> goto node_init_error;
>> }
>> -
>> - spin_lock_init(&node->watch_points_lock);
>> -
>> - kfd->nodes[i] = node;
>> }
>>
>> svm_range_set_max_pages(kfd->adev);
>> @@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> goto out;
>>
>> node_init_error:
>> + i++;
> The err cleanup path can release node itself. So the following
> kfd_cleanup_nodes() may do a double free?
The code in function kgd2kfd_device_exit() checks flag `kfd->init_complete`.
And if error happens within function kgd2kfd_device_init(), the flag `kfd->init_complete` will be set to false, thus avoid double free.
>> node_alloc_error:
>> kfd_cleanup_nodes(kfd, i);
>> kfd_doorbell_fini(kfd);
> shuo
[-- Attachment #2: Type: text/html, Size: 9061 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
2025-01-05 5:22 ` Shuo Liu
2025-01-05 6:57 ` Gerry Liu
@ 2025-01-06 3:07 ` Gerry Liu
1 sibling, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-06 3:07 UTC (permalink / raw)
To: Shuo Liu; +Cc: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell
[-- Attachment #1: Type: text/plain, Size: 1685 bytes --]
> 2025年1月5日 13:22,Shuo Liu <shuox.liu@linux.alibaba.com> 写道:
>
> Hi Gerry,
>
> On Sun 5.Jan'25 at 10:45:29 +0800, Jiang Liu wrote:
>> Fix possible resource leakage on error recovery path in function
>> kgd2kfd_device_init().
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index a29374c86405..fa5054940486 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> if (kfd->adev->xcp_mgr)
>> kfd_setup_interrupt_bitmap(node, i);
>>
>> + spin_lock_init(&node->watch_points_lock);
>> +
>> + kfd->nodes[i] = node;
>> +
>> /* Initialize the KFD node */
>> if (kfd_init_node(node)) {
>> dev_err(kfd_device, "Error initializing KFD node\n");
>> goto node_init_error;
>> }
>> -
>> - spin_lock_init(&node->watch_points_lock);
>> -
>> - kfd->nodes[i] = node;
>> }
>>
>> svm_range_set_max_pages(kfd->adev);
>> @@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> goto out;
>>
>> node_init_error:
>> + i++;
> The err cleanup path can release node itself. So the following
> kfd_cleanup_nodes() may do a double free?
Hi Shuo,
I missed the `kfree()` in function kfd_init_node(), so this patch is wrong and should be dropped.
Thanks,
Gerry
>> node_alloc_error:
>> kfd_cleanup_nodes(kfd, i);
>> kfd_doorbell_fini(kfd);
> shuo
[-- Attachment #2: Type: text/html, Size: 9057 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
2025-01-05 2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
2025-01-05 5:22 ` Shuo Liu
@ 2025-01-07 22:53 ` Chen, Xiaogang
2025-01-08 2:29 ` Gerry Liu
1 sibling, 1 reply; 18+ messages in thread
From: Chen, Xiaogang @ 2025-01-07 22:53 UTC (permalink / raw)
To: Jiang Liu, amd-gfx, lijo.lazar, Kent.Russell, shuox.liu
[-- Attachment #1: Type: text/plain, Size: 1563 bytes --]
On 1/4/2025 8:45 PM, Jiang Liu wrote:
> Fix possible resource leakage on error recovery path in function
> kgd2kfd_device_init().
>
> Signed-off-by: Jiang Liu<gerry@linux.alibaba.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index a29374c86405..fa5054940486 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> if (kfd->adev->xcp_mgr)
> kfd_setup_interrupt_bitmap(node, i);
>
> + spin_lock_init(&node->watch_points_lock);
> +
> + kfd->nodes[i] = node;
> +
> /* Initialize the KFD node */
> if (kfd_init_node(node)) {
> dev_err(kfd_device, "Error initializing KFD node\n");
> goto node_init_error;
> }
> -
> - spin_lock_init(&node->watch_points_lock);
> -
> - kfd->nodes[i] = node;
> }
>
> svm_range_set_max_pages(kfd->adev);
> @@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
> goto out;
>
> node_init_error:
> + i++;
> node_alloc_error:
> kfd_cleanup_nodes(kfd, i);
> kfd_doorbell_fini(kfd);
I think this change is not right: if kfd_init_node fail it does clean up
for the kfd_node that it is initializing internally. kfd_cleanup_nodes
does not need to clean up the kfd node i which failed to init, just
clean up the kfd_nodes that were initialized previously.
Regard
Xiaogang
[-- Attachment #2: Type: text/html, Size: 2300 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()
2025-01-07 22:53 ` Chen, Xiaogang
@ 2025-01-08 2:29 ` Gerry Liu
0 siblings, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-08 2:29 UTC (permalink / raw)
To: Chen, Xiaogang; +Cc: amd-gfx, lijo.lazar, Kent.Russell, shuox.liu
[-- Attachment #1: Type: text/plain, Size: 1859 bytes --]
> 2025年1月8日 06:53,Chen, Xiaogang <xiaogang.chen@amd.com> 写道:
>
>
>
> On 1/4/2025 8:45 PM, Jiang Liu wrote:
>> Fix possible resource leakage on error recovery path in function
>> kgd2kfd_device_init().
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> <mailto:gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index a29374c86405..fa5054940486 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -898,15 +898,15 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> if (kfd->adev->xcp_mgr)
>> kfd_setup_interrupt_bitmap(node, i);
>>
>> + spin_lock_init(&node->watch_points_lock);
>> +
>> + kfd->nodes[i] = node;
>> +
>> /* Initialize the KFD node */
>> if (kfd_init_node(node)) {
>> dev_err(kfd_device, "Error initializing KFD node\n");
>> goto node_init_error;
>> }
>> -
>> - spin_lock_init(&node->watch_points_lock);
>> -
>> - kfd->nodes[i] = node;
>> }
>>
>> svm_range_set_max_pages(kfd->adev);
>> @@ -921,6 +921,7 @@ bool kgd2kfd_device_init(struct kfd_dev *kfd,
>> goto out;
>>
>> node_init_error:
>> + i++;
>> node_alloc_error:
>> kfd_cleanup_nodes(kfd, i);
>> kfd_doorbell_fini(kfd);
> I think this change is not right: if kfd_init_node fail it does clean up for the kfd_node that it is initializing internally. kfd_cleanup_nodes does not need to clean up the kfd node i which failed to init, just clean up the kfd_nodes that were initialized previously.
>
Yes, I made a mistake here and will drop this patch in next version.
Thanks,
Gerry
> Regard
>
> Xiaogang
>
[-- Attachment #2: Type: text/html, Size: 3133 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 2/6] amdgpu: clear adev->in_suspend flag when fails to suspend
2025-01-05 2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
2025-01-05 2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
@ 2025-01-05 2:45 ` Jiang Liu
2025-01-05 2:45 ` [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
` (3 subsequent siblings)
5 siblings, 0 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05 2:45 UTC (permalink / raw)
To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu
Clear adev->in_suspend flag when fails to suspend, otherwise it will
cause too much warnings like:
[ 1802.212027] ------------[ cut here ]------------
[ 1802.212028] WARNING: CPU: 97 PID: 11282 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:452 amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212198] Modules linked in: amdgpu(E-) tcp_diag(E) inet_diag(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) iTCO_wdt(E) pmt_telemetry(E) iTCO_vendor_support(E) pmt_class(E) intel_cstate(E) snd_hda_intel(E) ipmi_ssif(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) intel_uncore(E) snd(E) cdc_ether(E) pcspkr(E) i2c_i801(E) idxd(E) mei_me(E) usbnet(E) ses(E) isst_if_mmio(E) isst_if_mbox_pci(E) mii(E) joydev(E) soundcore(E) isst_if_common(E) idxd_bus(E) mei(E) enclosure(E) intel_vsec(E) i2c_smbus(E) i2c_ismt(E) sunrpc(E) acpi_power_meter(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) acpi_pad(E) ipmi_msghandler(E) vfat(E) fat(E) sg(E) video(E) amdxcp(E) drm_ttm_helper(E) ttm(E) drm_exec(E) gpu_sched(E) drm_suballoc_helper(E) crc32c_intel(E) drm_buddy(E)
[ 1802.212223] ast(E) drm_shmem_helper(E) drm_display_helper(E) i2c_algo_bit(E) drm_kms_helper(E) virtio_net(E) mpt3sas(E) drm(E) net_failover(E) ahci(E) raid_class(E) failover(E) libahci(E) scsi_transport_sas(E) dimlib(E) libata(E) i2c_core(E) wmi(E) pinctrl_emmitsburg(E) [last unloaded: amdgpu(E)]
[ 1802.212231] CPU: 97 PID: 11282 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
[ 1802.212232] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
[ 1802.212232] RIP: 0010:amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212401] Code: 00 00 00 4d 85 e4 74 08 49 c7 04 24 00 00 00 00 48 85 ed 74 08 48 c7 45 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b e9 3b ff ff ff 3d 00 fe ff ff 74 b3 49 8b be b0 10 ff ff 4c
[ 1802.212402] RSP: 0018:ffffbe0e92087cb0 EFLAGS: 00010202
[ 1802.212403] RAX: 0000000000000206 RBX: ffff9cbc06c37f10 RCX: 000000000020000d
[ 1802.212404] RDX: ffff9cbc06c37f18 RSI: ffff9cbc06c37f58 RDI: ffff9cbc06c37f10
[ 1802.212407] RBP: ffff9cbc06c37f18 R08: ffff9cba42388800 R09: 000000000020000d
[ 1802.212409] R10: 0000000000040000 R11: 0000000000000006 R12: ffff9cbc06c37f58
[ 1802.212410] R13: ffff9cba4238dc00 R14: ffff9cbc06c0ef50 R15: ffff9cbc06c00000
[ 1802.212411] FS: 00007f927dc4e740(0000) GS:ffff9db47e880000(0000) knlGS:0000000000000000
[ 1802.212412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1802.212413] CR2: 000056342b2be1a8 CR3: 00000003f1546005 CR4: 0000000000f70ef0
[ 1802.212413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1802.212414] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1802.212415] PKRU: 55555554
[ 1802.212415] Call Trace:
[ 1802.212416] <TASK>
[ 1802.212417] ? __warn+0x83/0x130
[ 1802.212419] ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212586] ? __report_bug+0xea/0x100
[ 1802.212588] ? report_bug+0x24/0x70
[ 1802.212589] ? handle_bug+0x3c/0x70
[ 1802.212590] ? exc_invalid_op+0x18/0x70
[ 1802.212591] ? asm_exc_invalid_op+0x1a/0x20
[ 1802.212594] ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212746] amdgpu_ring_fini+0x91/0x120 [amdgpu]
[ 1802.212901] amdgpu_jpeg_sw_fini+0xb2/0xe0 [amdgpu]
[ 1802.213106] amdgpu_device_ip_fini.isra.0+0xb1/0x1c0 [amdgpu]
[ 1802.213247] amdgpu_device_fini_sw+0x49/0x290 [amdgpu]
[ 1802.213413] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 1802.213576] devm_drm_dev_init_release+0x4e/0x70 [drm]
[ 1802.213602] release_nodes+0x35/0xb0
[ 1802.213605] devres_release_all+0x8f/0xd0
[ 1802.213606] device_unbind_cleanup+0xe/0x70
[ 1802.213609] device_release_driver_internal+0x1bc/0x200
[ 1802.213611] driver_detach+0x48/0x90
[ 1802.213613] bus_remove_driver+0x6d/0xf0
[ 1802.213615] pci_unregister_driver+0x2e/0xb0
[ 1802.213618] amdgpu_exit+0x15/0x1c4 [amdgpu]
[ 1802.213861] __do_sys_delete_module.constprop.0+0x176/0x310
[ 1802.213863] do_syscall_64+0x5d/0x170
[ 1802.213866] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1802.213869] RIP: 0033:0x7f927d6620cb
[ 1802.213870] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48
[ 1802.213871] RSP: 002b:00007fff031d9d78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 1802.213873] RAX: ffffffffffffffda RBX: 000055d8d6df69e0 RCX: 00007f927d6620cb
[ 1802.213873] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055d8d6df6a48
[ 1802.213874] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1802.213875] R10: 00007f927d7aaac0 R11: 0000000000000206 R12: 00007fff031d9fa0
[ 1802.213876] R13: 00007fff031da5f1 R14: 000055d8d6df62a0 R15: 000055d8d6df69e0
[ 1802.213878] </TASK>
[ 1802.213879] ---[ end trace 0000000000000000 ]---
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0a121aab5c74..5ff53a3b9851 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4910,7 +4910,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
amdgpu_virt_fini_data_exchange(adev);
r = amdgpu_virt_request_full_gpu(adev, false);
if (r)
- return r;
+ goto error_out;
}
if (amdgpu_acpi_smart_shift_update(dev, AMDGPU_SS_DEV_D3))
@@ -4930,7 +4930,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
r = amdgpu_device_evict_resources(adev);
if (r)
- return r;
+ goto error_out;
amdgpu_ttm_set_buffer_funcs_status(adev, false);
@@ -4943,9 +4943,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
r = amdgpu_dpm_notify_rlc_state(adev, false);
if (r)
- return r;
+ goto error_out;
return 0;
+error_out:
+ adev->in_suspend = false;
+ return r;
}
/**
@@ -5007,8 +5010,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
amdgpu_virt_release_full_gpu(adev, true);
}
- if (r)
+ if (r) {
+ adev->in_suspend = false;
return r;
+ }
/* Make sure IB tests flushed */
flush_delayed_work(&adev->delayed_init_work);
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread* [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
2025-01-05 2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
2025-01-05 2:45 ` [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes() Jiang Liu
2025-01-05 2:45 ` [PATCH v2 2/6] amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
@ 2025-01-05 2:45 ` Jiang Liu
2025-01-06 6:51 ` Lazar, Lijo
2025-01-05 2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
` (2 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: Jiang Liu @ 2025-01-05 2:45 UTC (permalink / raw)
To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu
Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 11 +++-
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h | 1 +
drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 70 +++++++++++++++++----
drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h | 1 +
4 files changed, 70 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index e209b5e101df..401fbaa0b6b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -359,6 +359,8 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
if (ret)
return ret;
+
+ adev->xcp_mgr->xcp[i].registered = true;
}
return 0;
@@ -376,12 +378,19 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
if (!adev->xcp_mgr->xcp[i].ddev)
break;
+ // Restore and free the original drm_device.
p_ddev = adev->xcp_mgr->xcp[i].ddev;
- drm_dev_unplug(p_ddev);
+ if (adev->xcp_mgr->xcp[i].registered) {
+ drm_dev_unplug(p_ddev);
+ adev->xcp_mgr->xcp[i].registered = false;
+ }
p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
p_ddev->driver = adev->xcp_mgr->xcp[i].driver;
p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
+ amdgpu_xcp_drm_dev_free(p_ddev);
+
+ adev->xcp_mgr->xcp[i].ddev = NULL;
}
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
index b63f53242c57..cd06a4a232be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
@@ -101,6 +101,7 @@ struct amdgpu_xcp {
uint8_t id;
uint8_t mem_id;
bool valid;
+ bool registered;
atomic_t ref_cnt;
struct drm_device *ddev;
struct drm_device *rdev;
diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
index faed84172dd4..9058d71b4756 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
@@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
static int8_t pdev_num;
static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
+static struct mutex xcp_mutex;
int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
{
struct platform_device *pdev;
struct xcp_device *pxcp_dev;
char dev_name[20];
- int ret;
+ int ret, index;
+ mutex_lock(&xcp_mutex);
+ ret = -ENODEV;
if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
- return -ENODEV;
+ goto out_unlock;
- snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
+ for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+ if (!xcp_dev[index])
+ break;
+ }
+
+ snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
if (IS_ERR(pdev))
return PTR_ERR(pdev);
@@ -72,10 +80,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
goto out_devres;
}
- xcp_dev[pdev_num] = pxcp_dev;
- xcp_dev[pdev_num]->pdev = pdev;
+ xcp_dev[index] = pxcp_dev;
+ xcp_dev[index]->pdev = pdev;
*ddev = &pxcp_dev->drm;
pdev_num++;
+ mutex_unlock(&xcp_mutex);
return 0;
@@ -83,21 +92,58 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
devres_release_group(&pdev->dev, NULL);
out_unregister:
platform_device_unregister(pdev);
+out_unlock:
+ mutex_unlock(&xcp_mutex);
return ret;
}
EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
-void amdgpu_xcp_drv_release(void)
+static void amdgpu_xcp_drm_dev_destroy(int index)
+{
+ struct platform_device *pdev;
+
+ pdev = xcp_dev[index]->pdev;
+ devres_release_group(&pdev->dev, NULL);
+ platform_device_unregister(pdev);
+ xcp_dev[index] = NULL;
+ pdev_num--;
+}
+
+void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
{
- for (--pdev_num; pdev_num >= 0; --pdev_num) {
- struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
+ int index;
+ struct xcp_device *pxcp_dev;
+
+ if (ddev == NULL)
+ return;
- devres_release_group(&pdev->dev, NULL);
- platform_device_unregister(pdev);
- xcp_dev[pdev_num] = NULL;
+ pxcp_dev = container_of(ddev, struct xcp_device, drm);
+
+ mutex_lock(&xcp_mutex);
+ for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+ if (xcp_dev[index] == pxcp_dev) {
+ amdgpu_xcp_drm_dev_destroy(index);
+ break;
+ }
+ }
+ mutex_unlock(&xcp_mutex);
+}
+EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
+
+void amdgpu_xcp_drv_release(void)
+{
+ int index;
+
+ mutex_lock(&xcp_mutex);
+ for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+ if (xcp_dev[index]) {
+ WARN_ON(xcp_dev[index]);
+ amdgpu_xcp_drm_dev_destroy(index);
+ }
}
- pdev_num = 0;
+ WARN_ON(pdev_num != 0);
+ mutex_unlock(&xcp_mutex);
}
EXPORT_SYMBOL(amdgpu_xcp_drv_release);
diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
index c1c4b679bf95..580a1602c8e3 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
@@ -25,5 +25,6 @@
#define _AMDGPU_XCP_DRV_H_
int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
+void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
void amdgpu_xcp_drv_release(void);
#endif /* _AMDGPU_XCP_DRV_H_ */
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread* Re: [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
2025-01-05 2:45 ` [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
@ 2025-01-06 6:51 ` Lazar, Lijo
2025-01-07 2:00 ` Gerry Liu
0 siblings, 1 reply; 18+ messages in thread
From: Lazar, Lijo @ 2025-01-06 6:51 UTC (permalink / raw)
To: Jiang Liu, amd-gfx, xiaogang.chen, Kent.Russell, shuox.liu
On 1/5/2025 8:15 AM, Jiang Liu wrote:
> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
> to do error recovery.
>
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 11 +++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h | 1 +
> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 70 +++++++++++++++++----
> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h | 1 +
> 4 files changed, 70 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> index e209b5e101df..401fbaa0b6b8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> @@ -359,6 +359,8 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
> ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
> if (ret)
> return ret;
> +
> + adev->xcp_mgr->xcp[i].registered = true;
> }
>
> return 0;
> @@ -376,12 +378,19 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
> if (!adev->xcp_mgr->xcp[i].ddev)
> break;
>
> + // Restore and free the original drm_device.
> p_ddev = adev->xcp_mgr->xcp[i].ddev;
> - drm_dev_unplug(p_ddev);
> + if (adev->xcp_mgr->xcp[i].registered) {
> + drm_dev_unplug(p_ddev);
> + adev->xcp_mgr->xcp[i].registered = false;
> + }
> p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
> p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
> p_ddev->driver = adev->xcp_mgr->xcp[i].driver;
> p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
> + amdgpu_xcp_drm_dev_free(p_ddev);
> +
> + adev->xcp_mgr->xcp[i].ddev = NULL;
> }
> }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> index b63f53242c57..cd06a4a232be 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> @@ -101,6 +101,7 @@ struct amdgpu_xcp {
> uint8_t id;
> uint8_t mem_id;
> bool valid;
> + bool registered;
> atomic_t ref_cnt;
> struct drm_device *ddev;
> struct drm_device *rdev;
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> index faed84172dd4..9058d71b4756 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> @@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
>
> static int8_t pdev_num;
> static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
> +static struct mutex xcp_mutex;
I think this needs to be static DEFINE_MUTEX(xcp_mutex).
>
> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> {
> struct platform_device *pdev;
> struct xcp_device *pxcp_dev;
> char dev_name[20];
> - int ret;
> + int ret, index;
>
> + mutex_lock(&xcp_mutex);
> + ret = -ENODEV;
Preference would be do this inside the below if() to associate the error
with the condition.
> if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
> - return -ENODEV;
> + goto out_unlock;
>
> - snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> + if (!xcp_dev[index])
> + break;
> + }
> +
> + snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
> pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
> if (IS_ERR(pdev))
> return PTR_ERR(pdev);
Seems mutex is left locked here.
> @@ -72,10 +80,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> goto out_devres;
> }
>
> - xcp_dev[pdev_num] = pxcp_dev;
> - xcp_dev[pdev_num]->pdev = pdev;
> + xcp_dev[index] = pxcp_dev;
> + xcp_dev[index]->pdev = pdev;
> *ddev = &pxcp_dev->drm;
> pdev_num++;
> + mutex_unlock(&xcp_mutex);
>
> return 0;
>
> @@ -83,21 +92,58 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> devres_release_group(&pdev->dev, NULL);
> out_unregister:
> platform_device_unregister(pdev);
> +out_unlock:
> + mutex_unlock(&xcp_mutex);
>
> return ret;
> }
> EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>
> -void amdgpu_xcp_drv_release(void)
> +static void amdgpu_xcp_drm_dev_destroy(int index)
> +{
> + struct platform_device *pdev;
> +
> + pdev = xcp_dev[index]->pdev;
> + devres_release_group(&pdev->dev, NULL);
> + platform_device_unregister(pdev);
> + xcp_dev[index] = NULL;
> + pdev_num--;
> +}
> +
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
> {
> - for (--pdev_num; pdev_num >= 0; --pdev_num) {
> - struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
> + int index;
> + struct xcp_device *pxcp_dev;
> +
> + if (ddev == NULL)
> + return;
>
> - devres_release_group(&pdev->dev, NULL);
> - platform_device_unregister(pdev);
> - xcp_dev[pdev_num] = NULL;
> + pxcp_dev = container_of(ddev, struct xcp_device, drm);
> +
> + mutex_lock(&xcp_mutex);
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> + if (xcp_dev[index] == pxcp_dev) {
> + amdgpu_xcp_drm_dev_destroy(index);
> + break;
> + }
> + }
> + mutex_unlock(&xcp_mutex);
> +}
> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
> +
> +void amdgpu_xcp_drv_release(void)
> +{
> + int index;
> +
> + mutex_lock(&xcp_mutex);
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> + if (xcp_dev[index]) {
> + WARN_ON(xcp_dev[index]);
Why is this WARN check needed? There is already a if() check for valid
index.
Also, would suggest to separate out amdgpu_xcp.c from xcp_drv.c. xcp_drv
introducing a new interface may be kept in a separate patch.
Thanks,
Lijo
> + amdgpu_xcp_drm_dev_destroy(index);
> + }
> }
> - pdev_num = 0;
> + WARN_ON(pdev_num != 0);
> + mutex_unlock(&xcp_mutex);
> }
> EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> index c1c4b679bf95..580a1602c8e3 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> @@ -25,5 +25,6 @@
> #define _AMDGPU_XCP_DRV_H_
>
> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
> void amdgpu_xcp_drv_release(void);
> #endif /* _AMDGPU_XCP_DRV_H_ */
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
2025-01-06 6:51 ` Lazar, Lijo
@ 2025-01-07 2:00 ` Gerry Liu
0 siblings, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-07 2:00 UTC (permalink / raw)
To: Lazar, Lijo; +Cc: amd-gfx, xiaogang.chen, Kent.Russell, shuox.liu
[-- Attachment #1: Type: text/plain, Size: 6977 bytes --]
> 2025年1月6日 14:51,Lazar, Lijo <lijo.lazar@amd.com> 写道:
>
>
>
> On 1/5/2025 8:15 AM, Jiang Liu wrote:
>> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
>> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
>> to do error recovery.
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 11 +++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h | 1 +
>> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 70 +++++++++++++++++----
>> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h | 1 +
>> 4 files changed, 70 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> index e209b5e101df..401fbaa0b6b8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
>> @@ -359,6 +359,8 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
>> ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
>> if (ret)
>> return ret;
>> +
>> + adev->xcp_mgr->xcp[i].registered = true;
>> }
>>
>> return 0;
>> @@ -376,12 +378,19 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
>> if (!adev->xcp_mgr->xcp[i].ddev)
>> break;
>>
>> + // Restore and free the original drm_device.
>> p_ddev = adev->xcp_mgr->xcp[i].ddev;
>> - drm_dev_unplug(p_ddev);
>> + if (adev->xcp_mgr->xcp[i].registered) {
>> + drm_dev_unplug(p_ddev);
>> + adev->xcp_mgr->xcp[i].registered = false;
>> + }
>> p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
>> p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
>> p_ddev->driver = adev->xcp_mgr->xcp[i].driver;
>> p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
>> + amdgpu_xcp_drm_dev_free(p_ddev);
>> +
>> + adev->xcp_mgr->xcp[i].ddev = NULL;
>> }
>> }
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> index b63f53242c57..cd06a4a232be 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
>> @@ -101,6 +101,7 @@ struct amdgpu_xcp {
>> uint8_t id;
>> uint8_t mem_id;
>> bool valid;
>> + bool registered;
>> atomic_t ref_cnt;
>> struct drm_device *ddev;
>> struct drm_device *rdev;
>> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> index faed84172dd4..9058d71b4756 100644
>> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
>> @@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
>>
>> static int8_t pdev_num;
>> static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
>> +static struct mutex xcp_mutex;
>
> I think this needs to be static DEFINE_MUTEX(xcp_mutex).
>
>>
>> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> {
>> struct platform_device *pdev;
>> struct xcp_device *pxcp_dev;
>> char dev_name[20];
>> - int ret;
>> + int ret, index;
>>
>> + mutex_lock(&xcp_mutex);
>> + ret = -ENODEV;
>
> Preference would be do this inside the below if() to associate the error
> with the condition.
>
>> if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
>> - return -ENODEV;
>> + goto out_unlock;
>>
>> - snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
>> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> + if (!xcp_dev[index])
>> + break;
>> + }
>> +
>> + snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
>> pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
>> if (IS_ERR(pdev))
>> return PTR_ERR(pdev);
>
> Seems mutex is left locked here.
Will fixed in next version:)
>
>> @@ -72,10 +80,11 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> goto out_devres;
>> }
>>
>> - xcp_dev[pdev_num] = pxcp_dev;
>> - xcp_dev[pdev_num]->pdev = pdev;
>> + xcp_dev[index] = pxcp_dev;
>> + xcp_dev[index]->pdev = pdev;
>> *ddev = &pxcp_dev->drm;
>> pdev_num++;
>> + mutex_unlock(&xcp_mutex);
>>
>> return 0;
>>
>> @@ -83,21 +92,58 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
>> devres_release_group(&pdev->dev, NULL);
>> out_unregister:
>> platform_device_unregister(pdev);
>> +out_unlock:
>> + mutex_unlock(&xcp_mutex);
>>
>> return ret;
>> }
>> EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>>
>> -void amdgpu_xcp_drv_release(void)
>> +static void amdgpu_xcp_drm_dev_destroy(int index)
>> +{
>> + struct platform_device *pdev;
>> +
>> + pdev = xcp_dev[index]->pdev;
>> + devres_release_group(&pdev->dev, NULL);
>> + platform_device_unregister(pdev);
>> + xcp_dev[index] = NULL;
>> + pdev_num--;
>> +}
>> +
>> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
>> {
>> - for (--pdev_num; pdev_num >= 0; --pdev_num) {
>> - struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
>> + int index;
>> + struct xcp_device *pxcp_dev;
>> +
>> + if (ddev == NULL)
>> + return;
>>
>> - devres_release_group(&pdev->dev, NULL);
>> - platform_device_unregister(pdev);
>> - xcp_dev[pdev_num] = NULL;
>> + pxcp_dev = container_of(ddev, struct xcp_device, drm);
>> +
>> + mutex_lock(&xcp_mutex);
>> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> + if (xcp_dev[index] == pxcp_dev) {
>> + amdgpu_xcp_drm_dev_destroy(index);
>> + break;
>> + }
>> + }
>> + mutex_unlock(&xcp_mutex);
>> +}
>> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
>> +
>> +void amdgpu_xcp_drv_release(void)
>> +{
>> + int index;
>> +
>> + mutex_lock(&xcp_mutex);
>> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
>> + if (xcp_dev[index]) {
>> + WARN_ON(xcp_dev[index]);
>
> Why is this WARN check needed? There is already a if() check for valid
> index.
>
> Also, would suggest to separate out amdgpu_xcp.c from xcp_drv.c. xcp_drv
> introducing a new interface may be kept in a separate patch.
With new implementation, all xcp devices should have already be removed when amdgpu_xcp_drv_release() gets called,
So hope to verify whether it works as expected.
Thanks!
>
> Thanks,
> Lijo
>
>> + amdgpu_xcp_drm_dev_destroy(index);
>> + }
>> }
>> - pdev_num = 0;
>> + WARN_ON(pdev_num != 0);
>> + mutex_unlock(&xcp_mutex);
>> }
>> EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>>
>> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> index c1c4b679bf95..580a1602c8e3 100644
>> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
>> @@ -25,5 +25,6 @@
>> #define _AMDGPU_XCP_DRV_H_
>>
>> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
>> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
>> void amdgpu_xcp_drv_release(void);
>> #endif /* _AMDGPU_XCP_DRV_H_ */
[-- Attachment #2: Type: text/html, Size: 32096 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
2025-01-05 2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
` (2 preceding siblings ...)
2025-01-05 2:45 ` [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
@ 2025-01-05 2:45 ` Jiang Liu
2025-01-05 5:16 ` Shuo Liu
2025-01-07 22:55 ` Chen, Xiaogang
2025-01-05 2:45 ` [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
2025-01-05 2:45 ` [PATCH v2 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
5 siblings, 2 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05 2:45 UTC (permalink / raw)
To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu
If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
after free bug related to amdgpu_driver_release_kms() as:
2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
2024-12-26 16:17:46 [16002.213648] FS: 00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
2024-12-26 16:17:46 [16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
e024se+0x3c/0x90 [amdxcp]
2024-12-26 16:17:46 [16002.337645] __do_sys_delete_module.constprop.0+0x176/0x310
2024-12-26 16:17:46 [16002.344324] do_syscall_64+0x5d/0x170
2024-12-26 16:17:46 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
Fix it by unplugging xcp drm devices when failed to probe GPU devices.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index d2a046736edd..9ebc0d47d1cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
DRM_WARN("smart shift update failed\n");
out:
- if (r)
+ if (r) {
+ amdgpu_xcp_dev_unplug(adev);
amdgpu_driver_unload_kms(dev);
+ }
return r;
}
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread* Re: [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
2025-01-05 2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
@ 2025-01-05 5:16 ` Shuo Liu
2025-01-07 22:55 ` Chen, Xiaogang
1 sibling, 0 replies; 18+ messages in thread
From: Shuo Liu @ 2025-01-05 5:16 UTC (permalink / raw)
To: Jiang Liu; +Cc: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell
On Sun 5.Jan'25 at 10:45:32 +0800, Jiang Liu wrote:
>If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
>after free bug related to amdgpu_driver_release_kms() as:
>2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
>2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
>2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
>2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
>2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
>2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
>2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
>2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
>43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
>2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
>2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
>2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
>2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
>2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
>2024-12-26 16:17:46 [16002.213648] FS: 00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
>2024-12-26 16:17:46 [16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
>2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>e024se+0x3c/0x90 [amdxcp]
This calltrace seems corrupted.
You can use this one from my test.
BUG: unable to handle page fault for address: 000000010000008f
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 139 PID: 10172 Comm: rmmod Kdump: loaded Tainted: G E 6.10.0 #5
RIP: 0010:amdgpu_fence_driver_sw_fini+0x3b/0xd0 [amdgpu]
Code: 00 41 55 49 89 fd 41 54 49 81 c5 78 07 01 00 41 bc ff ff ff ff 55 53 eb 09 49 83 c6 08 4d 39 ee 74 7d 49 8b 2e 48 85 ed 74 ef <80> 7d 28 00 74 e9 48 83 7d 78 00 74 09 48 8d 7d 78 e8 ff 65 e1 ff
RSP: 0018:ffffb8e77badfde0 EFLAGS: 00010202
RAX: ffffffffc09e13b0 RBX: ffffa1461a4287f0 RCX: 00000000000000c0
RDX: 00000000000000c0 RSI: 00000000000000c0 RDI: ffffa1461a4287f0
RBP: 0000000100000067 R08: 00000000000000c0 R09: ffffa146112a6c48
R10: 0000000000040000 R11: 0000000000000007 R12: 00000000ffffffff
R13: ffffa1461a438f68 R14: ffffa1461a438b88 R15: 0000000000000000
FS: 00007f98ed908740(0000) GS:ffffa141ffd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000010000008f CR3: 0000000127d40003 CR4: 0000000000f70ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x65/0x140
? exc_page_fault+0x68/0x140
? asm_exc_page_fault+0x26/0x30
? __pfx_amdgpu_driver_release_kms+0x10/0x10 [amdgpu]
? amdgpu_fence_driver_sw_fini+0x3b/0xd0 [amdgpu]
amdgpu_device_fini_sw+0x16/0x220 [amdgpu]
amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
devm_drm_dev_init_release+0x4e/0x70 [drm]
release_nodes+0x35/0xb0
devres_release_group+0xa0/0xf0
amdgpu_xcp_drv_release+0x3c/0x90 [amdxcp]
shuo
>2024-12-26 16:17:46 [16002.337645] __do_sys_delete_module.constprop.0+0x176/0x310
>2024-12-26 16:17:46 [16002.344324] do_syscall_64+0x5d/0x170
>2024-12-26 16:17:46 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
>
>Fix it by unplugging xcp drm devices when failed to probe GPU devices.
>
>Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
>Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>index d2a046736edd..9ebc0d47d1cb 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>@@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
> DRM_WARN("smart shift update failed\n");
>
> out:
>- if (r)
>+ if (r) {
>+ amdgpu_xcp_dev_unplug(adev);
> amdgpu_driver_unload_kms(dev);
>+ }
>
> return r;
> }
>--
>2.43.5
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
2025-01-05 2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
2025-01-05 5:16 ` Shuo Liu
@ 2025-01-07 22:55 ` Chen, Xiaogang
2025-01-08 3:34 ` Gerry Liu
1 sibling, 1 reply; 18+ messages in thread
From: Chen, Xiaogang @ 2025-01-07 22:55 UTC (permalink / raw)
To: Jiang Liu, amd-gfx, lijo.lazar, Kent.Russell, shuox.liu
[-- Attachment #1: Type: text/plain, Size: 3649 bytes --]
On 1/4/2025 8:45 PM, Jiang Liu wrote:
> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
> after free bug related to amdgpu_driver_release_kms() as:
> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
> 2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
> 2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
> 2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
> 2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> 2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
> 2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
> 2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
> 2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
> 43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
> 2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
> 2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
> 2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
> 2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
> 2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> 2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
> 2024-12-26 16:17:46 [16002.213648] FS: 00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
> 2024-12-26 16:17:46 [16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
> 2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> e024se+0x3c/0x90 [amdxcp]
> 2024-12-26 16:17:46 [16002.337645] __do_sys_delete_module.constprop.0+0x176/0x310
> 2024-12-26 16:17:46 [16002.344324] do_syscall_64+0x5d/0x170
> 2024-12-26 16:17:46 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
>
> Fix it by unplugging xcp drm devices when failed to probe GPU devices.
>
> Signed-off-by: Jiang Liu<gerry@linux.alibaba.com>
> Tested-by: Shuo Liu<shuox.liu@linux.alibaba.com>
> Reviewed-by: Lijo Lazar<lijo.lazar@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index d2a046736edd..9ebc0d47d1cb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
> DRM_WARN("smart shift update failed\n");
>
> out:
> - if (r)
> + if (r) {
> + amdgpu_xcp_dev_unplug(adev);
You have made amdgpu_xcp_drm_dev_free, why still use
amdgpu_xcp_dev_unplug here? I think you want undo
amdgpu_xcp_drm_dev_alloc in error path. Why involve adev device unplug?
It is a different scenario.
Regards
Xiaogang
> amdgpu_driver_unload_kms(dev);
> + }
>
> return r;
> }
[-- Attachment #2: Type: text/html, Size: 4618 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
2025-01-07 22:55 ` Chen, Xiaogang
@ 2025-01-08 3:34 ` Gerry Liu
0 siblings, 0 replies; 18+ messages in thread
From: Gerry Liu @ 2025-01-08 3:34 UTC (permalink / raw)
To: Chen, Xiaogang; +Cc: amd-gfx, lijo.lazar, Kent.Russell, shuox.liu
[-- Attachment #1: Type: text/plain, Size: 4698 bytes --]
> 2025年1月8日 06:55,Chen, Xiaogang <xiaogang.chen@amd.com> 写道:
>
>
>
> On 1/4/2025 8:45 PM, Jiang Liu wrote:
>> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
>> after free bug related to amdgpu_driver_release_kms() as:
>> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> 2024-12-26 16:17:45 [16002.093792] #PF: supervisor read access in kernel mode
>> 2024-12-26 16:17:45 [16002.099993] #PF: error_code(0x0000) - not-present page
>> 2024-12-26 16:17:45 [16002.106188] PGD 0 P4D 0
>> 2024-12-26 16:17:45 [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
>> 2024-12-26 16:17:45 [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
>> 2024-12-26 16:17:45 [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>> 2024-12-26 16:17:45 [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
>> 2024-12-26 16:17:45 [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
>> 43 28 4d 63 ec <4a> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
>> 2024-12-26 16:17:45 [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
>> 2024-12-26 16:17:45 [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
>> 2024-12-26 16:17:46 [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
>> 2024-12-26 16:17:46 [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
>> 2024-12-26 16:17:46 [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> 2024-12-26 16:17:46 [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
>> 2024-12-26 16:17:46 [16002.213648] FS: 00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
>> 2024-12-26 16:17:46 [16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> 2024-12-26 16:17:46 [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
>> 2024-12-26 16:17:46 [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> 2024-12-26 16:17:46 [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> e024se+0x3c/0x90 [amdxcp]
>> 2024-12-26 16:17:46 [16002.337645] __do_sys_delete_module.constprop.0+0x176/0x310
>> 2024-12-26 16:17:46 [16002.344324] do_syscall_64+0x5d/0x170
>> 2024-12-26 16:17:46 [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> 2024-12-26 16:17:46 [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
>>
>> Fix it by unplugging xcp drm devices when failed to probe GPU devices.
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> <mailto:gerry@linux.alibaba.com>
>> Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com> <mailto:shuox.liu@linux.alibaba.com>
>> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> <mailto:lijo.lazar@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index d2a046736edd..9ebc0d47d1cb 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -165,8 +165,10 @@ int amdgpu_driver_load_kms(struct amdgpu_device *adev, unsigned long flags)
>> DRM_WARN("smart shift update failed\n");
>>
>> out:
>> - if (r)
>> + if (r) {
>> + amdgpu_xcp_dev_unplug(adev);
> You have made amdgpu_xcp_drm_dev_free, why still use amdgpu_xcp_dev_unplug here? I think you want undo amdgpu_xcp_drm_dev_alloc in error path. Why involve adev device unplug? It is a different scenario.
>
Hi xiaogang,
I didn’t get your point yet. Current work flow is:
amdgpu_pci_probe()
-> amdgpu_driver_load_kms()
-> amdgpu_device_init()
->amdgpu_discovery_set_ip_blocks()
-> amdgpu_xcp_mgr_init()
-> amdgpu_xcp_alloc_dev()
-> amdgpu_xcp_dev_register()
And amdgpu_xcp_dev_unplug() undos work done by amdgpu_xcp_dev_register() and part of amdgpu_xcp_mgr_ini().
How about splitting amdgpu_xcp_dev_unplug() into amdgpuxcp_dev_deregister() and amdgpu_xcp_mgr_fini(), then things should get much clear.
And it seems the xcp_mgr object is leaked current, and there are resource leakage on error handling path of amdgpu_pci_probe.
Above proposal may also help to fix those resource leakages.
Thanks,
Gerry
> Regards
>
> Xiaogang
>
>> amdgpu_driver_unload_kms(dev);
>> + }
>>
>> return r;
>> }
[-- Attachment #2: Type: text/html, Size: 6610 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
2025-01-05 2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
` (3 preceding siblings ...)
2025-01-05 2:45 ` [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
@ 2025-01-05 2:45 ` Jiang Liu
2025-01-07 22:55 ` Chen, Xiaogang
2025-01-05 2:45 ` [PATCH v2 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put() Jiang Liu
5 siblings, 1 reply; 18+ messages in thread
From: Jiang Liu @ 2025-01-05 2:45 UTC (permalink / raw)
To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu
Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function drm_sched_init()
to avoid possible invalid memory access on error recover path.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5ff53a3b9851..475ab635c699 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
if (r) {
DRM_ERROR("Failed to create scheduler on ring %s.\n",
ring->name);
+ ring->sched.ops = NULL;
return r;
}
r = amdgpu_uvd_entity_init(adev, ring);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 2f24a6aa13bf..b5e87b515139 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
* The natural check would be sched.ready, which is
* set as drm_sched_init() finishes...
*/
- if (ring->sched.ops)
+ if (ring->sched.ops) {
drm_sched_fini(&ring->sched);
+ ring->sched.ops = NULL;
+ }
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
dma_fence_put(ring->fence_drv.fences[j]);
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread* Re: [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
2025-01-05 2:45 ` [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
@ 2025-01-07 22:55 ` Chen, Xiaogang
0 siblings, 0 replies; 18+ messages in thread
From: Chen, Xiaogang @ 2025-01-07 22:55 UTC (permalink / raw)
To: Jiang Liu, amd-gfx, lijo.lazar, Kent.Russell, shuox.liu
[-- Attachment #1: Type: text/plain, Size: 1878 bytes --]
On 1/4/2025 8:45 PM, Jiang Liu wrote:
> Function detects initialization status by checking sched->ops, so set
> sched->ops to non-NULL just before return in function drm_sched_init()
This commit message is not what the change did: you set sched->ops to
NULL just after return from drm_sched_init.
Other than that this change looks good to me.
Regards
Xiaogang
> to avoid possible invalid memory access on error recover path.
>
> Signed-off-by: Jiang Liu<gerry@linux.alibaba.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5ff53a3b9851..475ab635c699 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
> if (r) {
> DRM_ERROR("Failed to create scheduler on ring %s.\n",
> ring->name);
> + ring->sched.ops = NULL;
> return r;
> }
> r = amdgpu_uvd_entity_init(adev, ring);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 2f24a6aa13bf..b5e87b515139 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
> * The natural check would be sched.ready, which is
> * set as drm_sched_init() finishes...
> */
> - if (ring->sched.ops)
> + if (ring->sched.ops) {
> drm_sched_fini(&ring->sched);
> + ring->sched.ops = NULL;
> + }
>
> for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
> dma_fence_put(ring->fence_drv.fences[j]);
[-- Attachment #2: Type: text/html, Size: 2713 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 6/6] amdgpu: get rid of false warnings caused by amdgpu_irq_put()
2025-01-05 2:45 [PATCH v2 0/6] Fix several bugs in error handling during device probe Jiang Liu
` (4 preceding siblings ...)
2025-01-05 2:45 ` [PATCH v2 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
@ 2025-01-05 2:45 ` Jiang Liu
5 siblings, 0 replies; 18+ messages in thread
From: Jiang Liu @ 2025-01-05 2:45 UTC (permalink / raw)
To: amd-gfx, xiaogang.chen, lijo.lazar, Kent.Russell, shuox.liu; +Cc: Jiang Liu
If error happens before amdgpu_fence_driver_hw_init() gets called during
device probe, it will trigger a false warning in amdgpu_irq_put() as
below:
[ 1209.300996] ------------[ cut here ]------------
[ 1209.301061] WARNING: CPU: 48 PID: 293 at /tmp/amd.Rc9jFrl7/amd/amdgpu/amdgpu_irq.c:633 amdgpu_irq_put+0x45/0x70 [amdgpu]
[ 1209.301062] Modules linked in: ...
[ 1209.301093] CPU: 48 PID: 293 Comm: kworker/48:1 Kdump: loaded Tainted: G W OE 5.10.134-17.2.al8.x86_64 #1
[ 1209.301094] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
[ 1209.301095] Workqueue: events work_for_cpu_fn
[ 1209.301159] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
[ 1209.301160] Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 c3 cc cc cc cc e9 8b fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
[ 1209.301162] RSP: 0018:ffffb08a99c8fd88 EFLAGS: 00010246
[ 1209.301162] RAX: ffff9efe1bcbf500 RBX: ffff9efe1cc3e400 RCX: 0000000000000000
[ 1209.301163] RDX: 0000000000000000 RSI: ffff9efe1cc3b108 RDI: ffff9efe1cc00000
[ 1209.301163] RBP: ffff9efe1cc10818 R08: 0000000000000001 R09: 000000000000000d
[ 1209.301164] R10: ffffb08a99c8fb48 R11: ffffffffa2068018 R12: ffff9efe1cc109d0
[ 1209.301164] R13: ffff9efe1cc00010 R14: ffff9efe1cc00000 R15: ffff9efe1cc3b108
[ 1209.301165] FS: 0000000000000000(0000) GS:ffff9ff9fce00000(0000) knlGS:0000000000000000
[ 1209.301165] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1209.301165] CR2: 00007fd0f6e860d0 CR3: 0000010092baa003 CR4: 0000000002770ee0
[ 1209.301166] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1209.301166] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1209.301167] PKRU: 55555554
[ 1209.301167] Call Trace:
[ 1209.301225] amdgpu_fence_driver_hw_fini+0xda/0x110 [amdgpu]
[ 1209.301284] amdgpu_device_fini_hw+0xaf/0x200 [amdgpu]
[ 1209.301342] amdgpu_driver_load_kms+0x7f/0xc0 [amdgpu]
[ 1209.301400] amdgpu_pci_probe+0x1cd/0x4a0 [amdgpu]
[ 1209.301401] local_pci_probe+0x40/0xa0
[ 1209.301402] work_for_cpu_fn+0x13/0x20
[ 1209.301403] process_one_work+0x1ad/0x380
[ 1209.301404] worker_thread+0x1c8/0x310
[ 1209.301405] ? process_one_work+0x380/0x380
[ 1209.301406] kthread+0x118/0x140
[ 1209.301407] ? __kthread_bind_mask+0x60/0x60
[ 1209.301408] ret_from_fork+0x1f/0x30
[ 1209.301410] ---[ end trace 733f120fe2ab13e5 ]---
[ 1209.301418] ------------[ cut here ]------------
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 +++++++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index b5e87b515139..0e41a535e05f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -614,9 +614,11 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
if (!drm_dev_is_unplugged(adev_to_drm(adev)) &&
ring->fence_drv.irq_src &&
- amdgpu_fence_need_ring_interrupt_restore(ring))
+ ring->fence_drv.irq_enabled) {
amdgpu_irq_put(adev, ring->fence_drv.irq_src,
ring->fence_drv.irq_type);
+ ring->fence_drv.irq_enabled = false;
+ }
del_timer_sync(&ring->fence_drv.fallback_timer);
}
@@ -693,9 +695,12 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
/* enable the interrupt */
if (ring->fence_drv.irq_src &&
- amdgpu_fence_need_ring_interrupt_restore(ring))
+ !ring->fence_drv.irq_enabled &&
+ amdgpu_fence_need_ring_interrupt_restore(ring)) {
amdgpu_irq_get(adev, ring->fence_drv.irq_src,
ring->fence_drv.irq_type);
+ ring->fence_drv.irq_enabled = true;
+ }
}
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index dee5a1b4e572..959d474a0516 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -118,6 +118,7 @@ struct amdgpu_fence_driver {
uint32_t sync_seq;
atomic_t last_seq;
bool initialized;
+ bool irq_enabled;
struct amdgpu_irq_src *irq_src;
unsigned irq_type;
struct timer_list fallback_timer;
--
2.43.5
^ permalink raw reply related [flat|nested] 18+ messages in thread