* [v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend
2025-01-10 2:08 [v4 0/6] Fix several bugs in error handling during device probe Jiang Liu
@ 2025-01-10 2:08 ` Jiang Liu
2025-01-10 18:57 ` Mario Limonciello
2025-01-10 2:08 ` [v4 2/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
` (4 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Jiang Liu @ 2025-01-10 2:08 UTC (permalink / raw)
To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
Cc: Jiang Liu
Clear adev->in_suspend flag when fails to suspend, otherwise it will
cause too much warnings like:
[ 1802.212027] ------------[ cut here ]------------
[ 1802.212028] WARNING: CPU: 97 PID: 11282 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:452 amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212198] Modules linked in: amdgpu(E-) tcp_diag(E) inet_diag(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) iTCO_wdt(E) pmt_telemetry(E) iTCO_vendor_support(E) pmt_class(E) intel_cstate(E) snd_hda_intel(E) ipmi_ssif(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) intel_uncore(E) snd(E) cdc_ether(E) pcspkr(E) i2c_i801(E) idxd(E) mei_me(E) usbnet(E) ses(E) isst_if_mmio(E) isst_if_mbox_pci(E) mii(E) joydev(E) soundcore(E) isst_if_common(E) idxd_bus(E) mei(E) enclosure(E) intel_vsec(E) i2c_smbus(E) i2c_ismt(E) sunrpc(E) acpi_power_meter(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) acpi_pad(E) ipmi_msghandler(E) vfat(E) fat(E) sg(E) video(E) amdxcp(E) drm_ttm_helper(E) ttm(E) drm_exec(E) gpu_sched(E) drm_suballoc_helper(E) crc32c_intel(E) drm_buddy(E)
[ 1802.212223] ast(E) drm_shmem_helper(E) drm_display_helper(E) i2c_algo_bit(E) drm_kms_helper(E) virtio_net(E) mpt3sas(E) drm(E) net_failover(E) ahci(E) raid_class(E) failover(E) libahci(E) scsi_transport_sas(E) dimlib(E) libata(E) i2c_core(E) wmi(E) pinctrl_emmitsburg(E) [last unloaded: amdgpu(E)]
[ 1802.212231] CPU: 97 PID: 11282 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
[ 1802.212232] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
[ 1802.212232] RIP: 0010:amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212401] Code: 00 00 00 4d 85 e4 74 08 49 c7 04 24 00 00 00 00 48 85 ed 74 08 48 c7 45 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b e9 3b ff ff ff 3d 00 fe ff ff 74 b3 49 8b be b0 10 ff ff 4c
[ 1802.212402] RSP: 0018:ffffbe0e92087cb0 EFLAGS: 00010202
[ 1802.212403] RAX: 0000000000000206 RBX: ffff9cbc06c37f10 RCX: 000000000020000d
[ 1802.212404] RDX: ffff9cbc06c37f18 RSI: ffff9cbc06c37f58 RDI: ffff9cbc06c37f10
[ 1802.212407] RBP: ffff9cbc06c37f18 R08: ffff9cba42388800 R09: 000000000020000d
[ 1802.212409] R10: 0000000000040000 R11: 0000000000000006 R12: ffff9cbc06c37f58
[ 1802.212410] R13: ffff9cba4238dc00 R14: ffff9cbc06c0ef50 R15: ffff9cbc06c00000
[ 1802.212411] FS: 00007f927dc4e740(0000) GS:ffff9db47e880000(0000) knlGS:0000000000000000
[ 1802.212412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1802.212413] CR2: 000056342b2be1a8 CR3: 00000003f1546005 CR4: 0000000000f70ef0
[ 1802.212413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1802.212414] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1802.212415] PKRU: 55555554
[ 1802.212415] Call Trace:
[ 1802.212416] <TASK>
[ 1802.212417] ? __warn+0x83/0x130
[ 1802.212419] ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212586] ? __report_bug+0xea/0x100
[ 1802.212588] ? report_bug+0x24/0x70
[ 1802.212589] ? handle_bug+0x3c/0x70
[ 1802.212590] ? exc_invalid_op+0x18/0x70
[ 1802.212591] ? asm_exc_invalid_op+0x1a/0x20
[ 1802.212594] ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
[ 1802.212746] amdgpu_ring_fini+0x91/0x120 [amdgpu]
[ 1802.212901] amdgpu_jpeg_sw_fini+0xb2/0xe0 [amdgpu]
[ 1802.213106] amdgpu_device_ip_fini.isra.0+0xb1/0x1c0 [amdgpu]
[ 1802.213247] amdgpu_device_fini_sw+0x49/0x290 [amdgpu]
[ 1802.213413] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 1802.213576] devm_drm_dev_init_release+0x4e/0x70 [drm]
[ 1802.213602] release_nodes+0x35/0xb0
[ 1802.213605] devres_release_all+0x8f/0xd0
[ 1802.213606] device_unbind_cleanup+0xe/0x70
[ 1802.213609] device_release_driver_internal+0x1bc/0x200
[ 1802.213611] driver_detach+0x48/0x90
[ 1802.213613] bus_remove_driver+0x6d/0xf0
[ 1802.213615] pci_unregister_driver+0x2e/0xb0
[ 1802.213618] amdgpu_exit+0x15/0x1c4 [amdgpu]
[ 1802.213861] __do_sys_delete_module.constprop.0+0x176/0x310
[ 1802.213863] do_syscall_64+0x5d/0x170
[ 1802.213866] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1802.213869] RIP: 0033:0x7f927d6620cb
[ 1802.213870] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48
[ 1802.213871] RSP: 002b:00007fff031d9d78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 1802.213873] RAX: ffffffffffffffda RBX: 000055d8d6df69e0 RCX: 00007f927d6620cb
[ 1802.213873] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055d8d6df6a48
[ 1802.213874] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1802.213875] R10: 00007f927d7aaac0 R11: 0000000000000206 R12: 00007fff031d9fa0
[ 1802.213876] R13: 00007fff031da5f1 R14: 000055d8d6df62a0 R15: 000055d8d6df69e0
[ 1802.213878] </TASK>
[ 1802.213879] ---[ end trace 0000000000000000 ]---
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0a121aab5c74..5ff53a3b9851 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4910,7 +4910,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
amdgpu_virt_fini_data_exchange(adev);
r = amdgpu_virt_request_full_gpu(adev, false);
if (r)
- return r;
+ goto error_out;
}
if (amdgpu_acpi_smart_shift_update(dev, AMDGPU_SS_DEV_D3))
@@ -4930,7 +4930,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
r = amdgpu_device_evict_resources(adev);
if (r)
- return r;
+ goto error_out;
amdgpu_ttm_set_buffer_funcs_status(adev, false);
@@ -4943,9 +4943,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
r = amdgpu_dpm_notify_rlc_state(adev, false);
if (r)
- return r;
+ goto error_out;
return 0;
+error_out:
+ adev->in_suspend = false;
+ return r;
}
/**
@@ -5007,8 +5010,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
amdgpu_virt_release_full_gpu(adev, true);
}
- if (r)
+ if (r) {
+ adev->in_suspend = false;
return r;
+ }
/* Make sure IB tests flushed */
flush_delayed_work(&adev->delayed_init_work);
--
2.43.5
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend
2025-01-10 2:08 ` [v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
@ 2025-01-10 18:57 ` Mario Limonciello
2025-01-14 3:12 ` Gerry Liu
0 siblings, 1 reply; 16+ messages in thread
From: Mario Limonciello @ 2025-01-10 18:57 UTC (permalink / raw)
To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
airlied, simona, sunil.khatri, lijo.lazar, Hawking.Zhang,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
On 1/9/2025 20:08, Jiang Liu wrote:
> Clear adev->in_suspend flag when fails to suspend, otherwise it will
> cause too much warnings like:
> [ 1802.212027] ------------[ cut here ]------------
> [ 1802.212028] WARNING: CPU: 97 PID: 11282 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:452 amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
> [ 1802.212198] Modules linked in: amdgpu(E-) tcp_diag(E) inet_diag(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) iTCO_wdt(E) pmt_telemetry(E) iTCO_vendor_support(E) pmt_class(E) intel_cstate(E) snd_hda_intel(E) ipmi_ssif(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) intel_uncore(E) snd(E) cdc_ether(E) pcspkr(E) i2c_i801(E) idxd(E) mei_me(E) usbnet(E) ses(E) isst_if_mmio(E) isst_if_mbox_pci(E) mii(E) joydev(E) soundcore(E) isst_if_common(E) idxd_bus(E) mei(E) enclosure(E) intel_vsec(E) i2c_smbus(E) i2c_ismt(E) sunrpc(E) acpi_power_meter(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) acpi_pad(E) ipmi_msghandler(E) vfat(E) fat(E) sg(E) video(E) amdxcp(E) drm_ttm_helper(E) ttm(E) drm_exec(E) gpu_sched(E) drm_suballoc_helper(E) crc32c_intel(E) drm_buddy(E)
> [ 1802.212223] ast(E) drm_shmem_helper(E) drm_display_helper(E) i2c_algo_bit(E) drm_kms_helper(E) virtio_net(E) mpt3sas(E) drm(E) net_failover(E) ahci(E) raid_class(E) failover(E) libahci(E) scsi_transport_sas(E) dimlib(E) libata(E) i2c_core(E) wmi(E) pinctrl_emmitsburg(E) [last unloaded: amdgpu(E)]
> [ 1802.212231] CPU: 97 PID: 11282 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
> [ 1802.212232] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
> [ 1802.212232] RIP: 0010:amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
> [ 1802.212401] Code: 00 00 00 4d 85 e4 74 08 49 c7 04 24 00 00 00 00 48 85 ed 74 08 48 c7 45 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b e9 3b ff ff ff 3d 00 fe ff ff 74 b3 49 8b be b0 10 ff ff 4c
> [ 1802.212402] RSP: 0018:ffffbe0e92087cb0 EFLAGS: 00010202
> [ 1802.212403] RAX: 0000000000000206 RBX: ffff9cbc06c37f10 RCX: 000000000020000d
> [ 1802.212404] RDX: ffff9cbc06c37f18 RSI: ffff9cbc06c37f58 RDI: ffff9cbc06c37f10
> [ 1802.212407] RBP: ffff9cbc06c37f18 R08: ffff9cba42388800 R09: 000000000020000d
> [ 1802.212409] R10: 0000000000040000 R11: 0000000000000006 R12: ffff9cbc06c37f58
> [ 1802.212410] R13: ffff9cba4238dc00 R14: ffff9cbc06c0ef50 R15: ffff9cbc06c00000
> [ 1802.212411] FS: 00007f927dc4e740(0000) GS:ffff9db47e880000(0000) knlGS:0000000000000000
> [ 1802.212412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1802.212413] CR2: 000056342b2be1a8 CR3: 00000003f1546005 CR4: 0000000000f70ef0
> [ 1802.212413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1802.212414] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [ 1802.212415] PKRU: 55555554
> [ 1802.212415] Call Trace:
> [ 1802.212416] <TASK>
> [ 1802.212417] ? __warn+0x83/0x130
> [ 1802.212419] ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
> [ 1802.212586] ? __report_bug+0xea/0x100
> [ 1802.212588] ? report_bug+0x24/0x70
> [ 1802.212589] ? handle_bug+0x3c/0x70
> [ 1802.212590] ? exc_invalid_op+0x18/0x70
> [ 1802.212591] ? asm_exc_invalid_op+0x1a/0x20
> [ 1802.212594] ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
> [ 1802.212746] amdgpu_ring_fini+0x91/0x120 [amdgpu]
> [ 1802.212901] amdgpu_jpeg_sw_fini+0xb2/0xe0 [amdgpu]
> [ 1802.213106] amdgpu_device_ip_fini.isra.0+0xb1/0x1c0 [amdgpu]
> [ 1802.213247] amdgpu_device_fini_sw+0x49/0x290 [amdgpu]
> [ 1802.213413] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> [ 1802.213576] devm_drm_dev_init_release+0x4e/0x70 [drm]
> [ 1802.213602] release_nodes+0x35/0xb0
> [ 1802.213605] devres_release_all+0x8f/0xd0
> [ 1802.213606] device_unbind_cleanup+0xe/0x70
> [ 1802.213609] device_release_driver_internal+0x1bc/0x200
> [ 1802.213611] driver_detach+0x48/0x90
> [ 1802.213613] bus_remove_driver+0x6d/0xf0
> [ 1802.213615] pci_unregister_driver+0x2e/0xb0
> [ 1802.213618] amdgpu_exit+0x15/0x1c4 [amdgpu]
> [ 1802.213861] __do_sys_delete_module.constprop.0+0x176/0x310
> [ 1802.213863] do_syscall_64+0x5d/0x170
> [ 1802.213866] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1802.213869] RIP: 0033:0x7f927d6620cb
> [ 1802.213870] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48
> [ 1802.213871] RSP: 002b:00007fff031d9d78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [ 1802.213873] RAX: ffffffffffffffda RBX: 000055d8d6df69e0 RCX: 00007f927d6620cb
> [ 1802.213873] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055d8d6df6a48
> [ 1802.213874] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [ 1802.213875] R10: 00007f927d7aaac0 R11: 0000000000000206 R12: 00007fff031d9fa0
> [ 1802.213876] R13: 00007fff031da5f1 R14: 000055d8d6df62a0 R15: 000055d8d6df69e0
> [ 1802.213878] </TASK>
> [ 1802.213879] ---[ end trace 0000000000000000 ]---
>
This patch seems like it's attacking two different problems; a failure
to suspend and a failure to resume. Some comments inline on each.
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 0a121aab5c74..5ff53a3b9851 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4910,7 +4910,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
> amdgpu_virt_fini_data_exchange(adev);
> r = amdgpu_virt_request_full_gpu(adev, false);
> if (r)
> - return r;
> + goto error_out;
> }
>
> if (amdgpu_acpi_smart_shift_update(dev, AMDGPU_SS_DEV_D3))
> @@ -4930,7 +4930,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>
> r = amdgpu_device_evict_resources(adev);
> if (r)
> - return r;
> + goto error_out;
>
> amdgpu_ttm_set_buffer_funcs_status(adev, false);
>
> @@ -4943,9 +4943,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>
> r = amdgpu_dpm_notify_rlc_state(adev, false);
> if (r)
> - return r;
> + goto error_out;
For the changes to suspend, I would worry that this isn't really
"unwinding" the suspend. I think that if you just "abort" and say
in_suspend = false this is going to leave some IP in an inconsistent
state depending on where you failed. For example let's (hypothetically)
say this is a dGPU that you failed to evict resources.
You would have still called amdgpu_device_ip_suspend_phase1() which
means that all the valid blocks would have had amdgpu_ip_block_suspend()
called. This calls an individual block's suspend() callback but not
it's resume().
The failure code for amdgpu_device_suspend() would bubble up to the
pmops handlers. I think if it's called from runtime pm
(amdgpu_pmops_runtime_suspend) then the runtime support is no longer
functional. If it's from amdgpu_pmops_suspend() or
amdgpu_pmops_freeze() though I don't think anyone calls asic reset.
To change that flag I think you either need to unwind everything or we
need a recovery path in the callers to do amdgpu_asic_reset().
>
> return 0;
> +error_out:
> + adev->in_suspend = false;
> + return r;
> }
>
> /**
> @@ -5007,8 +5010,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
> amdgpu_virt_release_full_gpu(adev, true);
> }
>
> - if (r)
> + if (r) {
> + adev->in_suspend = false;
> return r;
> + }
>
> /* Make sure IB tests flushed */
> flush_delayed_work(&adev->delayed_init_work);
With the resume path is the failure to resume a real problem? Or is
that just hypothetical?
Similarly; I don't think you can just say you're not in suspend anymore.
If this is just hypothetical I think this chunk should be dropped.
If it's real, again maybe a better recovery is a reset.
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend
2025-01-10 18:57 ` Mario Limonciello
@ 2025-01-14 3:12 ` Gerry Liu
0 siblings, 0 replies; 16+ messages in thread
From: Gerry Liu @ 2025-01-14 3:12 UTC (permalink / raw)
To: Mario Limonciello
Cc: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
sunil.khatri, lijo.lazar, Hawking.Zhang, xiaogang.chen,
Kent.Russell, shuox.liu, amd-gfx
> 2025年1月11日 02:57,Mario Limonciello <mario.limonciello@amd.com> 写道:
>
> On 1/9/2025 20:08, Jiang Liu wrote:
>> Clear adev->in_suspend flag when fails to suspend, otherwise it will
>> cause too much warnings like:
>> [ 1802.212027] ------------[ cut here ]------------
>> [ 1802.212028] WARNING: CPU: 97 PID: 11282 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:452 amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
>> [ 1802.212198] Modules linked in: amdgpu(E-) tcp_diag(E) inet_diag(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) i10nm_edac(E) nfit(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) iTCO_wdt(E) pmt_telemetry(E) iTCO_vendor_support(E) pmt_class(E) intel_cstate(E) snd_hda_intel(E) ipmi_ssif(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) intel_uncore(E) snd(E) cdc_ether(E) pcspkr(E) i2c_i801(E) idxd(E) mei_me(E) usbnet(E) ses(E) isst_if_mmio(E) isst_if_mbox_pci(E) mii(E) joydev(E) soundcore(E) isst_if_common(E) idxd_bus(E) mei(E) enclosure(E) intel_vsec(E) i2c_smbus(E) i2c_ismt(E) sunrpc(E) acpi_power_meter(E) ipmi_si(E) acpi_ipmi(E) ipmi_devintf(E) acpi_pad(E) ipmi_msghandler(E) vfat(E) fat(E) sg(E) video(E) amdxcp(E) drm_ttm_helper(E) ttm(E) drm_exec(E) gpu_sched(E) drm_suballoc_helper(E) crc32c_intel(E) drm_buddy(E)
>> [ 1802.212223] ast(E) drm_shmem_helper(E) drm_display_helper(E) i2c_algo_bit(E) drm_kms_helper(E) virtio_net(E) mpt3sas(E) drm(E) net_failover(E) ahci(E) raid_class(E) failover(E) libahci(E) scsi_transport_sas(E) dimlib(E) libata(E) i2c_core(E) wmi(E) pinctrl_emmitsburg(E) [last unloaded: amdgpu(E)]
>> [ 1802.212231] CPU: 97 PID: 11282 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
>> [ 1802.212232] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>> [ 1802.212232] RIP: 0010:amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
>> [ 1802.212401] Code: 00 00 00 4d 85 e4 74 08 49 c7 04 24 00 00 00 00 48 85 ed 74 08 48 c7 45 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc <0f> 0b e9 3b ff ff ff 3d 00 fe ff ff 74 b3 49 8b be b0 10 ff ff 4c
>> [ 1802.212402] RSP: 0018:ffffbe0e92087cb0 EFLAGS: 00010202
>> [ 1802.212403] RAX: 0000000000000206 RBX: ffff9cbc06c37f10 RCX: 000000000020000d
>> [ 1802.212404] RDX: ffff9cbc06c37f18 RSI: ffff9cbc06c37f58 RDI: ffff9cbc06c37f10
>> [ 1802.212407] RBP: ffff9cbc06c37f18 R08: ffff9cba42388800 R09: 000000000020000d
>> [ 1802.212409] R10: 0000000000040000 R11: 0000000000000006 R12: ffff9cbc06c37f58
>> [ 1802.212410] R13: ffff9cba4238dc00 R14: ffff9cbc06c0ef50 R15: ffff9cbc06c00000
>> [ 1802.212411] FS: 00007f927dc4e740(0000) GS:ffff9db47e880000(0000) knlGS:0000000000000000
>> [ 1802.212412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1802.212413] CR2: 000056342b2be1a8 CR3: 00000003f1546005 CR4: 0000000000f70ef0
>> [ 1802.212413] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1802.212414] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [ 1802.212415] PKRU: 55555554
>> [ 1802.212415] Call Trace:
>> [ 1802.212416] <TASK>
>> [ 1802.212417] ? __warn+0x83/0x130
>> [ 1802.212419] ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
>> [ 1802.212586] ? __report_bug+0xea/0x100
>> [ 1802.212588] ? report_bug+0x24/0x70
>> [ 1802.212589] ? handle_bug+0x3c/0x70
>> [ 1802.212590] ? exc_invalid_op+0x18/0x70
>> [ 1802.212591] ? asm_exc_invalid_op+0x1a/0x20
>> [ 1802.212594] ? amdgpu_bo_free_kernel+0xf9/0x120 [amdgpu]
>> [ 1802.212746] amdgpu_ring_fini+0x91/0x120 [amdgpu]
>> [ 1802.212901] amdgpu_jpeg_sw_fini+0xb2/0xe0 [amdgpu]
>> [ 1802.213106] amdgpu_device_ip_fini.isra.0+0xb1/0x1c0 [amdgpu]
>> [ 1802.213247] amdgpu_device_fini_sw+0x49/0x290 [amdgpu]
>> [ 1802.213413] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>> [ 1802.213576] devm_drm_dev_init_release+0x4e/0x70 [drm]
>> [ 1802.213602] release_nodes+0x35/0xb0
>> [ 1802.213605] devres_release_all+0x8f/0xd0
>> [ 1802.213606] device_unbind_cleanup+0xe/0x70
>> [ 1802.213609] device_release_driver_internal+0x1bc/0x200
>> [ 1802.213611] driver_detach+0x48/0x90
>> [ 1802.213613] bus_remove_driver+0x6d/0xf0
>> [ 1802.213615] pci_unregister_driver+0x2e/0xb0
>> [ 1802.213618] amdgpu_exit+0x15/0x1c4 [amdgpu]
>> [ 1802.213861] __do_sys_delete_module.constprop.0+0x176/0x310
>> [ 1802.213863] do_syscall_64+0x5d/0x170
>> [ 1802.213866] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [ 1802.213869] RIP: 0033:0x7f927d6620cb
>> [ 1802.213870] Code: 73 01 c3 48 8b 0d a5 6d 19 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8 64 89 01 48
>> [ 1802.213871] RSP: 002b:00007fff031d9d78 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
>> [ 1802.213873] RAX: ffffffffffffffda RBX: 000055d8d6df69e0 RCX: 00007f927d6620cb
>> [ 1802.213873] RDX: 000000000000000a RSI: 0000000000000800 RDI: 000055d8d6df6a48
>> [ 1802.213874] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> [ 1802.213875] R10: 00007f927d7aaac0 R11: 0000000000000206 R12: 00007fff031d9fa0
>> [ 1802.213876] R13: 00007fff031da5f1 R14: 000055d8d6df62a0 R15: 000055d8d6df69e0
>> [ 1802.213878] </TASK>
>> [ 1802.213879] ---[ end trace 0000000000000000 ]---
>
> This patch seems like it's attacking two different problems; a failure to suspend and a failure to resume. Some comments inline on each.
>
>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++----
>> 1 file changed, 9 insertions(+), 4 deletions(-)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 0a121aab5c74..5ff53a3b9851 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4910,7 +4910,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>> amdgpu_virt_fini_data_exchange(adev);
>> r = amdgpu_virt_request_full_gpu(adev, false);
>> if (r)
>> - return r;
>> + goto error_out;
>> }
>> if (amdgpu_acpi_smart_shift_update(dev, AMDGPU_SS_DEV_D3))
>> @@ -4930,7 +4930,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>> r = amdgpu_device_evict_resources(adev);
>> if (r)
>> - return r;
>> + goto error_out;
>> amdgpu_ttm_set_buffer_funcs_status(adev, false);
>> @@ -4943,9 +4943,12 @@ int amdgpu_device_suspend(struct drm_device *dev, bool notify_clients)
>> r = amdgpu_dpm_notify_rlc_state(adev, false);
>> if (r)
>> - return r;
>> + goto error_out;
>
> For the changes to suspend, I would worry that this isn't really "unwinding" the suspend. I think that if you just "abort" and say in_suspend = false this is going to leave some IP in an inconsistent state depending on where you failed. For example let's (hypothetically) say this is a dGPU that you failed to evict resources.
>
> You would have still called amdgpu_device_ip_suspend_phase1() which means that all the valid blocks would have had amdgpu_ip_block_suspend() called. This calls an individual block's suspend() callback but not it's resume().
>
> The failure code for amdgpu_device_suspend() would bubble up to the pmops handlers. I think if it's called from runtime pm (amdgpu_pmops_runtime_suspend) then the runtime support is no longer functional. If it's from amdgpu_pmops_suspend() or amdgpu_pmops_freeze() though I don't think anyone calls asic reset.
>
> To change that flag I think you either need to unwind everything or we need a recovery path in the callers to do amdgpu_asic_reset().
Hi Mario,
You are right. My fix is just hiding the warning instead of solving the issue.
I have tried to handle the failure at amdgpu_pmops_suspend() by calling amdgpu_asic_reset(). But that doesn’t really fix the issue, it will cause following `amd-smi` hang forever with many kernel warnings.
So seem I can’t find a reasonable way to handle the situation and have to drop this patch:(
>
>> return 0;
>> +error_out:
>> + adev->in_suspend = false;
>> + return r;
>> }
>> /**
>> @@ -5007,8 +5010,10 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>> amdgpu_virt_release_full_gpu(adev, true);
>> }
>> - if (r)
>> + if (r) {
>> + adev->in_suspend = false;
>> return r;
>> + }
>> /* Make sure IB tests flushed */
>> flush_delayed_work(&adev->delayed_init_work);
>
> With the resume path is the failure to resume a real problem? Or is that just hypothetical?
>
> Similarly; I don't think you can just say you're not in suspend anymore. If this is just hypothetical I think this chunk should be dropped.
>
> If it's real, again maybe a better recovery is a reset.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [v4 2/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
2025-01-10 2:08 [v4 0/6] Fix several bugs in error handling during device probe Jiang Liu
2025-01-10 2:08 ` [v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
@ 2025-01-10 2:08 ` Jiang Liu
2025-01-10 19:02 ` Mario Limonciello
2025-01-13 4:40 ` Lazar, Lijo
2025-01-10 2:08 ` [v4 3/5] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
` (3 subsequent siblings)
5 siblings, 2 replies; 16+ messages in thread
From: Jiang Liu @ 2025-01-10 2:08 UTC (permalink / raw)
To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
Cc: Jiang Liu
Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
to do error recovery.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 65 +++++++++++++++++----
drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h | 1 +
2 files changed, 56 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
index faed84172dd4..0473fe0479d9 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
@@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
static int8_t pdev_num;
static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
+static DEFINE_MUTEX(xcp_mutex);
int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
{
struct platform_device *pdev;
struct xcp_device *pxcp_dev;
char dev_name[20];
- int ret;
+ int ret, index;
+
+ guard(mutex)(&xcp_mutex);
if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
return -ENODEV;
- snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
+ for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+ if (!xcp_dev[index])
+ break;
+ }
+
+ snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
if (IS_ERR(pdev))
return PTR_ERR(pdev);
@@ -72,8 +80,8 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
goto out_devres;
}
- xcp_dev[pdev_num] = pxcp_dev;
- xcp_dev[pdev_num]->pdev = pdev;
+ xcp_dev[index] = pxcp_dev;
+ xcp_dev[index]->pdev = pdev;
*ddev = &pxcp_dev->drm;
pdev_num++;
@@ -88,16 +96,53 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
}
EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
+static void __amdgpu_xcp_drm_dev_free(int index)
+{
+ struct platform_device *pdev;
+
+ pdev = xcp_dev[index]->pdev;
+ devres_release_group(&pdev->dev, NULL);
+ platform_device_unregister(pdev);
+ xcp_dev[index] = NULL;
+ pdev_num--;
+}
+
+void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
+{
+ int index;
+ struct xcp_device *pxcp_dev;
+
+ if (ddev == NULL)
+ return;
+
+ guard(mutex)(&xcp_mutex);
+ WARN_ON(!pdev_num);
+
+ pxcp_dev = container_of(ddev, struct xcp_device, drm);
+ for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+ if (xcp_dev[index] == pxcp_dev) {
+ __amdgpu_xcp_drm_dev_free(index);
+ break;
+ }
+ }
+}
+EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
+
void amdgpu_xcp_drv_release(void)
{
- for (--pdev_num; pdev_num >= 0; --pdev_num) {
- struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
+ int index;
- devres_release_group(&pdev->dev, NULL);
- platform_device_unregister(pdev);
- xcp_dev[pdev_num] = NULL;
+ guard(mutex)(&xcp_mutex);
+
+ for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
+ if (xcp_dev[index]) {
+ __amdgpu_xcp_drm_dev_free(index);
+ if (!pdev_num)
+ break;
+ }
}
- pdev_num = 0;
+
+ WARN_ON(pdev_num != 0);
}
EXPORT_SYMBOL(amdgpu_xcp_drv_release);
diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
index c1c4b679bf95..580a1602c8e3 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
@@ -25,5 +25,6 @@
#define _AMDGPU_XCP_DRV_H_
int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
+void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
void amdgpu_xcp_drv_release(void);
#endif /* _AMDGPU_XCP_DRV_H_ */
--
2.43.5
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [v4 2/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
2025-01-10 2:08 ` [v4 2/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
@ 2025-01-10 19:02 ` Mario Limonciello
2025-01-13 4:40 ` Lazar, Lijo
1 sibling, 0 replies; 16+ messages in thread
From: Mario Limonciello @ 2025-01-10 19:02 UTC (permalink / raw)
To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
airlied, simona, sunil.khatri, lijo.lazar, Hawking.Zhang,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
On 1/9/2025 20:08, Jiang Liu wrote:
> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
s/crreated/created/
> to do error recovery.
>
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 65 +++++++++++++++++----
> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h | 1 +
> 2 files changed, 56 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> index faed84172dd4..0473fe0479d9 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> @@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
>
> static int8_t pdev_num;
> static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
> +static DEFINE_MUTEX(xcp_mutex);
>
> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> {
> struct platform_device *pdev;
> struct xcp_device *pxcp_dev;
> char dev_name[20];
> - int ret;
> + int ret, index;
> +
> + guard(mutex)(&xcp_mutex);
>
> if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
> return -ENODEV;
>
> - snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> + if (!xcp_dev[index])
> + break;
> + }
> +
> + snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
> pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
> if (IS_ERR(pdev))
> return PTR_ERR(pdev);
> @@ -72,8 +80,8 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> goto out_devres;
> }
>
> - xcp_dev[pdev_num] = pxcp_dev;
> - xcp_dev[pdev_num]->pdev = pdev;
> + xcp_dev[index] = pxcp_dev;
> + xcp_dev[index]->pdev = pdev;
> *ddev = &pxcp_dev->drm;
> pdev_num++;
>
> @@ -88,16 +96,53 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> }
> EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>
> +static void __amdgpu_xcp_drm_dev_free(int index)
> +{
> + struct platform_device *pdev;
> +
Would it be valuable to have WARN_ON(!pdev_num) here?
> + pdev = xcp_dev[index]->pdev;
> + devres_release_group(&pdev->dev, NULL);
> + platform_device_unregister(pdev);
> + xcp_dev[index] = NULL;
> + pdev_num--;
> +}
> +
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
> +{
> + int index;
> + struct xcp_device *pxcp_dev;
Just a nit; this should be reverse xmas tree order.
> +
> + if (ddev == NULL)
> + return;
> +
> + guard(mutex)(&xcp_mutex);
> + WARN_ON(!pdev_num);
> +
> + pxcp_dev = container_of(ddev, struct xcp_device, drm);
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> + if (xcp_dev[index] == pxcp_dev) {
> + __amdgpu_xcp_drm_dev_free(index);
> + break;
> + }
> + }
> +}
> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
> +
> void amdgpu_xcp_drv_release(void)
> {
> - for (--pdev_num; pdev_num >= 0; --pdev_num) {
> - struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
> + int index;
>
> - devres_release_group(&pdev->dev, NULL);
> - platform_device_unregister(pdev);
> - xcp_dev[pdev_num] = NULL;
> + guard(mutex)(&xcp_mutex);
> +
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> + if (xcp_dev[index]) {
> + __amdgpu_xcp_drm_dev_free(index);
> + if (!pdev_num)
> + break;
> + }
> }
> - pdev_num = 0;
> +
> + WARN_ON(pdev_num != 0);
> }
> EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> index c1c4b679bf95..580a1602c8e3 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> @@ -25,5 +25,6 @@
> #define _AMDGPU_XCP_DRV_H_
>
> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
> void amdgpu_xcp_drv_release(void);
> #endif /* _AMDGPU_XCP_DRV_H_ */
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [v4 2/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
2025-01-10 2:08 ` [v4 2/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
2025-01-10 19:02 ` Mario Limonciello
@ 2025-01-13 4:40 ` Lazar, Lijo
1 sibling, 0 replies; 16+ messages in thread
From: Lazar, Lijo @ 2025-01-13 4:40 UTC (permalink / raw)
To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
airlied, simona, sunil.khatri, Hawking.Zhang, mario.limonciello,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
On 1/10/2025 7:38 AM, Jiang Liu wrote:
> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific
> drm_device crreated by amdgpu_xcp_drm_dev_alloc(), which will be used
> to do error recovery.
>
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 65 +++++++++++++++++----
> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h | 1 +
> 2 files changed, 56 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> index faed84172dd4..0473fe0479d9 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
> @@ -45,18 +45,26 @@ static const struct drm_driver amdgpu_xcp_driver = {
>
> static int8_t pdev_num;
> static struct xcp_device *xcp_dev[MAX_XCP_PLATFORM_DEVICE];
> +static DEFINE_MUTEX(xcp_mutex);
>
> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> {
> struct platform_device *pdev;
> struct xcp_device *pxcp_dev;
> char dev_name[20];
> - int ret;
> + int ret, index;
> +
> + guard(mutex)(&xcp_mutex);
>
> if (pdev_num >= MAX_XCP_PLATFORM_DEVICE)
> return -ENODEV;
>
> - snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", pdev_num);
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> + if (!xcp_dev[index])
> + break;
> + }
> +
> + snprintf(dev_name, sizeof(dev_name), "amdgpu_xcp_%d", index);
> pdev = platform_device_register_simple(dev_name, -1, NULL, 0);
> if (IS_ERR(pdev))
> return PTR_ERR(pdev);
> @@ -72,8 +80,8 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> goto out_devres;
> }
>
> - xcp_dev[pdev_num] = pxcp_dev;
> - xcp_dev[pdev_num]->pdev = pdev;
> + xcp_dev[index] = pxcp_dev;
> + xcp_dev[index]->pdev = pdev;
> *ddev = &pxcp_dev->drm;
> pdev_num++;
>
> @@ -88,16 +96,53 @@ int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev)
> }
> EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
>
> +static void __amdgpu_xcp_drm_dev_free(int index)
> +{
> + struct platform_device *pdev;
> +
> + pdev = xcp_dev[index]->pdev;
> + devres_release_group(&pdev->dev, NULL);
> + platform_device_unregister(pdev);
> + xcp_dev[index] = NULL;
> + pdev_num--;
> +}
> +
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev)
> +{
> + int index;
> + struct xcp_device *pxcp_dev;
> +
> + if (ddev == NULL)
> + return;
> +
> + guard(mutex)(&xcp_mutex);
> + WARN_ON(!pdev_num);
> +
> + pxcp_dev = container_of(ddev, struct xcp_device, drm);
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
> + if (xcp_dev[index] == pxcp_dev) {
> + __amdgpu_xcp_drm_dev_free(index);
> + break;
> + }
> + }
> +}
> +EXPORT_SYMBOL(amdgpu_xcp_drm_dev_free);
> +
> void amdgpu_xcp_drv_release(void)
> {
> - for (--pdev_num; pdev_num >= 0; --pdev_num) {
> - struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
> + int index;
>
> - devres_release_group(&pdev->dev, NULL);
> - platform_device_unregister(pdev);
> - xcp_dev[pdev_num] = NULL;
> + guard(mutex)(&xcp_mutex);
> +
> + for (index = 0; index < MAX_XCP_PLATFORM_DEVICE; index++) {
May be this - before entering or as first check.
for (index = 0; pdev_num && index < MAX_XCP_PLATFORM_DEVICE; index++)
Thanks,
Lijo
> + if (xcp_dev[index]) {
> + __amdgpu_xcp_drm_dev_free(index);
> + if (!pdev_num)
> + break;
> + }
> }
> - pdev_num = 0;
> +
> + WARN_ON(pdev_num != 0);
> }
> EXPORT_SYMBOL(amdgpu_xcp_drv_release);
>
> diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> index c1c4b679bf95..580a1602c8e3 100644
> --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h
> @@ -25,5 +25,6 @@
> #define _AMDGPU_XCP_DRV_H_
>
> int amdgpu_xcp_drm_dev_alloc(struct drm_device **ddev);
> +void amdgpu_xcp_drm_dev_free(struct drm_device *ddev);
> void amdgpu_xcp_drv_release(void);
> #endif /* _AMDGPU_XCP_DRV_H_ */
^ permalink raw reply [flat|nested] 16+ messages in thread
* [v4 3/5] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
2025-01-10 2:08 [v4 0/6] Fix several bugs in error handling during device probe Jiang Liu
2025-01-10 2:08 ` [v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend Jiang Liu
2025-01-10 2:08 ` [v4 2/5] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free() Jiang Liu
@ 2025-01-10 2:08 ` Jiang Liu
2025-01-10 19:03 ` Mario Limonciello
2025-01-10 2:08 ` [v4 4/5] drm/amdgpu: enhance error handling in function amdgpu_pci_probe() Jiang Liu
` (2 subsequent siblings)
5 siblings, 1 reply; 16+ messages in thread
From: Jiang Liu @ 2025-01-10 2:08 UTC (permalink / raw)
To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
Cc: Jiang Liu
If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
after free bug related to amdgpu_driver_release_kms() as:
[16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
[16002.093792] #PF: supervisor read access in kernel mode
[16002.099993] #PF: error_code(0x0000) - not-present page
[16002.106188] PGD 0 P4D 0
[16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
[16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
[16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
[16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
[16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
[16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
[16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
[16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
[16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
[16002.213648] FS: 00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
[16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
[16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
dxcp]
[16002.337645] __do_sys_delete_module.constprop.0+0x176/0x310
[16002.344324] do_syscall_64+0x5d/0x170
[16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[16002.354956] RIP: 0033:0x7f2736a620cb-12-26
Fix it by removing xcp drm devices when failed to probe GPU devices.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 70 ++++++++++++++++++----
drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h | 4 +-
4 files changed, 64 insertions(+), 16 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5ff53a3b9851..f29a4ad3c6d0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4777,6 +4777,8 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
amdgpu_reset_put_reset_domain(adev->reset_domain);
adev->reset_domain = NULL;
+ amdgpu_xcp_mgr_fini(adev);
+
kfree(adev->pci_state);
}
@@ -6682,7 +6684,7 @@ void amdgpu_device_halt(struct amdgpu_device *adev)
struct pci_dev *pdev = adev->pdev;
struct drm_device *ddev = adev_to_drm(adev);
- amdgpu_xcp_dev_unplug(adev);
+ amdgpu_xcp_dev_deregister(adev);
drm_dev_unplug(ddev);
amdgpu_irq_disable_all(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 62de668e9ff8..41d1b06be600 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2435,7 +2435,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
struct drm_device *dev = pci_get_drvdata(pdev);
struct amdgpu_device *adev = drm_to_adev(dev);
- amdgpu_xcp_dev_unplug(adev);
+ amdgpu_xcp_dev_deregister(adev);
amdgpu_gmc_prepare_nps_mode_change(adev);
drm_dev_unplug(dev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index e209b5e101df..21993aff0dbc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -268,21 +268,34 @@ static int amdgpu_xcp_dev_alloc(struct amdgpu_device *adev)
return ret;
}
- /* Redirect all IOCTLs to the primary device */
- adev->xcp_mgr->xcp[i].rdev = p_ddev->render->dev;
- adev->xcp_mgr->xcp[i].pdev = p_ddev->primary->dev;
- adev->xcp_mgr->xcp[i].driver = (struct drm_driver *)p_ddev->driver;
- adev->xcp_mgr->xcp[i].vma_offset_manager = p_ddev->vma_offset_manager;
- p_ddev->render->dev = ddev;
- p_ddev->primary->dev = ddev;
- p_ddev->vma_offset_manager = ddev->vma_offset_manager;
- p_ddev->driver = &amdgpu_partition_driver;
adev->xcp_mgr->xcp[i].ddev = p_ddev;
}
return 0;
}
+static void amdgpu_xcp_dev_free(struct amdgpu_device *adev)
+{
+ struct drm_device *p_ddev;
+ int i;
+
+ if (!adev->xcp_mgr)
+ return;
+
+ for (i = 1; i < MAX_XCP; i++) {
+ if (!adev->xcp_mgr->xcp[i].ddev)
+ break;
+
+ p_ddev = adev->xcp_mgr->xcp[i].ddev;
+ adev->xcp_mgr->xcp[i].ddev = NULL;
+
+ amdgpu_xcp_drm_dev_free(p_ddev);
+ }
+
+ adev->xcp_mgr->xcp->ddev = NULL;
+}
+
+
int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
int init_num_xcps,
struct amdgpu_xcp_mgr_funcs *xcp_funcs)
@@ -310,6 +323,13 @@ int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
return amdgpu_xcp_dev_alloc(adev);
}
+void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev)
+{
+ amdgpu_xcp_dev_free(adev);
+ kfree(adev->xcp_mgr);
+ adev->xcp_mgr = NULL;
+}
+
int amdgpu_xcp_get_partition(struct amdgpu_xcp_mgr *xcp_mgr,
enum AMDGPU_XCP_IP_BLOCK ip, int instance)
{
@@ -348,23 +368,44 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
const struct pci_device_id *ent)
{
int i, ret;
+ struct drm_device *p_ddev;
+ struct drm_device *ddev;
if (!adev->xcp_mgr)
return 0;
+ ddev = adev_to_drm(adev);
+
for (i = 1; i < MAX_XCP; i++) {
- if (!adev->xcp_mgr->xcp[i].ddev)
+ if (!adev->xcp_mgr->xcp[i].ddev || adev->xcp_mgr->xcp[i].driver)
break;
+ /* Redirect all IOCTLs to the primary device */
+ p_ddev = adev->xcp_mgr->xcp[i].ddev;
+ adev->xcp_mgr->xcp[i].rdev = p_ddev->render->dev;
+ adev->xcp_mgr->xcp[i].pdev = p_ddev->primary->dev;
+ adev->xcp_mgr->xcp[i].driver = (struct drm_driver *)p_ddev->driver;
+ adev->xcp_mgr->xcp[i].vma_offset_manager = p_ddev->vma_offset_manager;
+ p_ddev->render->dev = ddev;
+ p_ddev->primary->dev = ddev;
+ p_ddev->driver = &amdgpu_partition_driver;
+ p_ddev->vma_offset_manager = ddev->vma_offset_manager;
+
ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
- if (ret)
+ if (ret) {
+ p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
+ p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
+ p_ddev->driver = adev->xcp_mgr->xcp[i].driver;
+ p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
+ adev->xcp_mgr->xcp[i].driver = NULL;
return ret;
+ }
}
return 0;
}
-void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
+void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev)
{
struct drm_device *p_ddev;
int i;
@@ -373,15 +414,18 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
return;
for (i = 1; i < MAX_XCP; i++) {
- if (!adev->xcp_mgr->xcp[i].ddev)
+ if (!adev->xcp_mgr->xcp[i].ddev || !adev->xcp_mgr->xcp[i].driver)
break;
+ // Restore and free the original drm_device.
p_ddev = adev->xcp_mgr->xcp[i].ddev;
drm_dev_unplug(p_ddev);
+
p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
p_ddev->driver = adev->xcp_mgr->xcp[i].driver;
p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
+ adev->xcp_mgr->xcp[i].driver = NULL;
}
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
index b63f53242c57..be22d4398463 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
@@ -101,6 +101,7 @@ struct amdgpu_xcp {
uint8_t id;
uint8_t mem_id;
bool valid;
+ bool registered;
atomic_t ref_cnt;
struct drm_device *ddev;
struct drm_device *rdev;
@@ -155,6 +156,7 @@ int amdgpu_xcp_resume(struct amdgpu_xcp_mgr *xcp_mgr, int xcp_id);
int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
int init_xcps, struct amdgpu_xcp_mgr_funcs *xcp_funcs);
+void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev);
int amdgpu_xcp_init(struct amdgpu_xcp_mgr *xcp_mgr, int num_xcps, int mode);
int amdgpu_xcp_query_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, u32 flags);
int amdgpu_xcp_switch_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, int mode);
@@ -168,7 +170,7 @@ int amdgpu_xcp_get_inst_details(struct amdgpu_xcp *xcp,
int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
const struct pci_device_id *ent);
-void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev);
+void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev);
int amdgpu_xcp_open_device(struct amdgpu_device *adev,
struct amdgpu_fpriv *fpriv,
struct drm_file *file_priv);
--
2.43.5
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [v4 3/5] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()
2025-01-10 2:08 ` [v4 3/5] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
@ 2025-01-10 19:03 ` Mario Limonciello
0 siblings, 0 replies; 16+ messages in thread
From: Mario Limonciello @ 2025-01-10 19:03 UTC (permalink / raw)
To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
airlied, simona, sunil.khatri, lijo.lazar, Hawking.Zhang,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
On 1/9/2025 20:08, Jiang Liu wrote:
> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use
> after free bug related to amdgpu_driver_release_kms() as:
> [16002.085540] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [16002.093792] #PF: supervisor read access in kernel mode
> [16002.099993] #PF: error_code(0x0000) - not-present page
> [16002.106188] PGD 0 P4D 0
> [16002.109464] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [16002.115372] CPU: 2 PID: 14375 Comm: rmmod Kdump: loaded Tainted: G W E 6.10.0+ #2
> [16002.125577] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
> [16002.136858] RIP: 0010:drm_sched_fini+0x3f/0xe0 [gpu_sched]
> [16002.143463] Code: 60 c6 87 be 00 00 00 01 e8 ce e0 90 d8 48 8d bb 80 00 00 00 e8 c2 e0 90 d8 8b 43 20 85 c0 74 51 45 31 e4 48 8b
> 8b 2c e8 48 89 ef e8 f5 0e 59 d9 48 8b 45 10 48 8d 55 10 48 39
> [16002.164992] RSP: 0018:ffffb570dbbb7da8 EFLAGS: 00010246
> [16002.171316] RAX: 0000000000000000 RBX: ffff96b0fdadc878 RCX: 0000000000000000
> [16002.179784] RDX: 000fffffffe00000 RSI: 0000000000000000 RDI: ffff96b0fdadc8f8
> [16002.188252] RBP: ffff96b0fdadc800 R08: ffff97abbd035040 R09: ffffffff9ac52540
> [16002.196722] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [16002.205179] R13: 0000000000000000 R14: ffff96b0fdadfc00 R15: 0000000000000000
> [16002.213648] FS: 00007f2737000740(0000) GS:ffff97abbd100000(0000) knlGS:0000000000000000
> [16002.223189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [16002.230103] CR2: 0000000000000000 CR3: 000000011be3a005 CR4: 0000000000f70ef0
> [16002.238581] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [16002.247053] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> dxcp]
> [16002.337645] __do_sys_delete_module.constprop.0+0x176/0x310
> [16002.344324] do_syscall_64+0x5d/0x170
> [16002.348858] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [16002.354956] RIP: 0033:0x7f2736a620cb-12-26
>
> Fix it by removing xcp drm devices when failed to probe GPU devices.
>
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> Tested-by: Shuo Liu <shuox.liu@linux.alibaba.com>
> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
I have one nit below, but otherwise LGTM.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 70 ++++++++++++++++++----
> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h | 4 +-
> 4 files changed, 64 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5ff53a3b9851..f29a4ad3c6d0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4777,6 +4777,8 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> amdgpu_reset_put_reset_domain(adev->reset_domain);
> adev->reset_domain = NULL;
>
> + amdgpu_xcp_mgr_fini(adev);
> +
> kfree(adev->pci_state);
>
> }
> @@ -6682,7 +6684,7 @@ void amdgpu_device_halt(struct amdgpu_device *adev)
> struct pci_dev *pdev = adev->pdev;
> struct drm_device *ddev = adev_to_drm(adev);
>
> - amdgpu_xcp_dev_unplug(adev);
> + amdgpu_xcp_dev_deregister(adev);
> drm_dev_unplug(ddev);
>
> amdgpu_irq_disable_all(adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 62de668e9ff8..41d1b06be600 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2435,7 +2435,7 @@ amdgpu_pci_remove(struct pci_dev *pdev)
> struct drm_device *dev = pci_get_drvdata(pdev);
> struct amdgpu_device *adev = drm_to_adev(dev);
>
> - amdgpu_xcp_dev_unplug(adev);
> + amdgpu_xcp_dev_deregister(adev);
> amdgpu_gmc_prepare_nps_mode_change(adev);
> drm_dev_unplug(dev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> index e209b5e101df..21993aff0dbc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
> @@ -268,21 +268,34 @@ static int amdgpu_xcp_dev_alloc(struct amdgpu_device *adev)
> return ret;
> }
>
> - /* Redirect all IOCTLs to the primary device */
> - adev->xcp_mgr->xcp[i].rdev = p_ddev->render->dev;
> - adev->xcp_mgr->xcp[i].pdev = p_ddev->primary->dev;
> - adev->xcp_mgr->xcp[i].driver = (struct drm_driver *)p_ddev->driver;
> - adev->xcp_mgr->xcp[i].vma_offset_manager = p_ddev->vma_offset_manager;
> - p_ddev->render->dev = ddev;
> - p_ddev->primary->dev = ddev;
> - p_ddev->vma_offset_manager = ddev->vma_offset_manager;
> - p_ddev->driver = &amdgpu_partition_driver;
> adev->xcp_mgr->xcp[i].ddev = p_ddev;
> }
>
> return 0;
> }
>
> +static void amdgpu_xcp_dev_free(struct amdgpu_device *adev)
> +{
> + struct drm_device *p_ddev;
> + int i;
> +
> + if (!adev->xcp_mgr)
> + return;
> +
> + for (i = 1; i < MAX_XCP; i++) {
> + if (!adev->xcp_mgr->xcp[i].ddev)
> + break;
> +
> + p_ddev = adev->xcp_mgr->xcp[i].ddev;
> + adev->xcp_mgr->xcp[i].ddev = NULL;
> +
> + amdgpu_xcp_drm_dev_free(p_ddev);
> + }
> +
> + adev->xcp_mgr->xcp->ddev = NULL;
> +}
> +
> +
You only need one newline here.
> int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
> int init_num_xcps,
> struct amdgpu_xcp_mgr_funcs *xcp_funcs)
> @@ -310,6 +323,13 @@ int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
> return amdgpu_xcp_dev_alloc(adev);
> }
>
> +void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev)
> +{
> + amdgpu_xcp_dev_free(adev);
> + kfree(adev->xcp_mgr);
> + adev->xcp_mgr = NULL;
> +}
> +
> int amdgpu_xcp_get_partition(struct amdgpu_xcp_mgr *xcp_mgr,
> enum AMDGPU_XCP_IP_BLOCK ip, int instance)
> {
> @@ -348,23 +368,44 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
> const struct pci_device_id *ent)
> {
> int i, ret;
> + struct drm_device *p_ddev;
> + struct drm_device *ddev;
>
> if (!adev->xcp_mgr)
> return 0;
>
> + ddev = adev_to_drm(adev);
> +
> for (i = 1; i < MAX_XCP; i++) {
> - if (!adev->xcp_mgr->xcp[i].ddev)
> + if (!adev->xcp_mgr->xcp[i].ddev || adev->xcp_mgr->xcp[i].driver)
> break;
>
> + /* Redirect all IOCTLs to the primary device */
> + p_ddev = adev->xcp_mgr->xcp[i].ddev;
> + adev->xcp_mgr->xcp[i].rdev = p_ddev->render->dev;
> + adev->xcp_mgr->xcp[i].pdev = p_ddev->primary->dev;
> + adev->xcp_mgr->xcp[i].driver = (struct drm_driver *)p_ddev->driver;
> + adev->xcp_mgr->xcp[i].vma_offset_manager = p_ddev->vma_offset_manager;
> + p_ddev->render->dev = ddev;
> + p_ddev->primary->dev = ddev;
> + p_ddev->driver = &amdgpu_partition_driver;
> + p_ddev->vma_offset_manager = ddev->vma_offset_manager;
> +
> ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, ent->driver_data);
> - if (ret)
> + if (ret) {
> + p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
> + p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
> + p_ddev->driver = adev->xcp_mgr->xcp[i].driver;
> + p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
> + adev->xcp_mgr->xcp[i].driver = NULL;
> return ret;
> + }
> }
>
> return 0;
> }
>
> -void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
> +void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev)
> {
> struct drm_device *p_ddev;
> int i;
> @@ -373,15 +414,18 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
> return;
>
> for (i = 1; i < MAX_XCP; i++) {
> - if (!adev->xcp_mgr->xcp[i].ddev)
> + if (!adev->xcp_mgr->xcp[i].ddev || !adev->xcp_mgr->xcp[i].driver)
> break;
>
> + // Restore and free the original drm_device.
> p_ddev = adev->xcp_mgr->xcp[i].ddev;
> drm_dev_unplug(p_ddev);
> +
> p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
> p_ddev->primary->dev = adev->xcp_mgr->xcp[i].pdev;
> p_ddev->driver = adev->xcp_mgr->xcp[i].driver;
> p_ddev->vma_offset_manager = adev->xcp_mgr->xcp[i].vma_offset_manager;
> + adev->xcp_mgr->xcp[i].driver = NULL;
> }
> }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> index b63f53242c57..be22d4398463 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
> @@ -101,6 +101,7 @@ struct amdgpu_xcp {
> uint8_t id;
> uint8_t mem_id;
> bool valid;
> + bool registered;
> atomic_t ref_cnt;
> struct drm_device *ddev;
> struct drm_device *rdev;
> @@ -155,6 +156,7 @@ int amdgpu_xcp_resume(struct amdgpu_xcp_mgr *xcp_mgr, int xcp_id);
>
> int amdgpu_xcp_mgr_init(struct amdgpu_device *adev, int init_mode,
> int init_xcps, struct amdgpu_xcp_mgr_funcs *xcp_funcs);
> +void amdgpu_xcp_mgr_fini(struct amdgpu_device *adev);
> int amdgpu_xcp_init(struct amdgpu_xcp_mgr *xcp_mgr, int num_xcps, int mode);
> int amdgpu_xcp_query_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, u32 flags);
> int amdgpu_xcp_switch_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, int mode);
> @@ -168,7 +170,7 @@ int amdgpu_xcp_get_inst_details(struct amdgpu_xcp *xcp,
>
> int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
> const struct pci_device_id *ent);
> -void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev);
> +void amdgpu_xcp_dev_deregister(struct amdgpu_device *adev);
> int amdgpu_xcp_open_device(struct amdgpu_device *adev,
> struct amdgpu_fpriv *fpriv,
> struct drm_file *file_priv);
^ permalink raw reply [flat|nested] 16+ messages in thread
* [v4 4/5] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()
2025-01-10 2:08 [v4 0/6] Fix several bugs in error handling during device probe Jiang Liu
` (2 preceding siblings ...)
2025-01-10 2:08 ` [v4 3/5] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms() Jiang Liu
@ 2025-01-10 2:08 ` Jiang Liu
2025-01-10 19:05 ` Mario Limonciello
2025-01-10 2:08 ` [v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
2025-01-10 19:06 ` [v4 0/6] Fix several bugs in error handling during device probe Mario Limonciello
5 siblings, 1 reply; 16+ messages in thread
From: Jiang Liu @ 2025-01-10 2:08 UTC (permalink / raw)
To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
Cc: Jiang Liu
Enhance error handling in function amdgpu_pci_probe() to avoid
possible resource leakage.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 41d1b06be600..f8deca2f2696 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2346,16 +2346,16 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
msleep(5000);
goto retry_init;
} else if (ret) {
- goto err_pci;
+ goto unload_kms;
}
ret = amdgpu_xcp_dev_register(adev, ent);
if (ret)
- goto err_pci;
+ goto unplug_drm;
ret = amdgpu_amdkfd_drm_client_create(adev);
if (ret)
- goto err_pci;
+ goto deregister_xcp;
/*
* 1. don't init fbdev on hw without DCE
@@ -2424,6 +2424,12 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
return 0;
+deregister_xcp:
+ amdgpu_xcp_dev_deregister(adev);
+unplug_drm:
+ drm_dev_unplug(ddev);
+unload_kms:
+ amdgpu_driver_unload_kms(ddev);
err_pci:
pci_disable_device(pdev);
return ret;
--
2.43.5
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [v4 4/5] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()
2025-01-10 2:08 ` [v4 4/5] drm/amdgpu: enhance error handling in function amdgpu_pci_probe() Jiang Liu
@ 2025-01-10 19:05 ` Mario Limonciello
0 siblings, 0 replies; 16+ messages in thread
From: Mario Limonciello @ 2025-01-10 19:05 UTC (permalink / raw)
To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
airlied, simona, sunil.khatri, lijo.lazar, Hawking.Zhang,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
On 1/9/2025 20:08, Jiang Liu wrote:
> Enhance error handling in function amdgpu_pci_probe() to avoid
> possible resource leakage.
>
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 41d1b06be600..f8deca2f2696 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2346,16 +2346,16 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
> msleep(5000);
> goto retry_init;
> } else if (ret) {
> - goto err_pci;
> + goto unload_kms;
> }
>
> ret = amdgpu_xcp_dev_register(adev, ent);
> if (ret)
> - goto err_pci;
> + goto unplug_drm;
>
> ret = amdgpu_amdkfd_drm_client_create(adev);
> if (ret)
> - goto err_pci;
> + goto deregister_xcp;
>
> /*
> * 1. don't init fbdev on hw without DCE
> @@ -2424,6 +2424,12 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>
> return 0;
>
> +deregister_xcp:
> + amdgpu_xcp_dev_deregister(adev);
> +unplug_drm:
> + drm_dev_unplug(ddev);
> +unload_kms:
> + amdgpu_driver_unload_kms(ddev);
> err_pci:
> pci_disable_device(pdev);
> return ret;
^ permalink raw reply [flat|nested] 16+ messages in thread
* [v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
2025-01-10 2:08 [v4 0/6] Fix several bugs in error handling during device probe Jiang Liu
` (3 preceding siblings ...)
2025-01-10 2:08 ` [v4 4/5] drm/amdgpu: enhance error handling in function amdgpu_pci_probe() Jiang Liu
@ 2025-01-10 2:08 ` Jiang Liu
2025-01-10 6:51 ` Christian König
2025-01-10 19:06 ` [v4 0/6] Fix several bugs in error handling during device probe Mario Limonciello
5 siblings, 1 reply; 16+ messages in thread
From: Jiang Liu @ 2025-01-10 2:08 UTC (permalink / raw)
To: alexander.deucher, christian.koenig, Xinhui.Pan, airlied, simona,
sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
Cc: Jiang Liu
Function detects initialization status by checking sched->ops, so set
sched->ops to non-NULL just before return in function
amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
to avoid possible invalid memory access on error recover path.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f29a4ad3c6d0..3688e9eb949b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
if (r) {
DRM_ERROR("Failed to create scheduler on ring %s.\n",
ring->name);
+ ring->sched.ops = NULL;
return r;
}
r = amdgpu_uvd_entity_init(adev, ring);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 2f24a6aa13bf..b5e87b515139 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
* The natural check would be sched.ready, which is
* set as drm_sched_init() finishes...
*/
- if (ring->sched.ops)
+ if (ring->sched.ops) {
drm_sched_fini(&ring->sched);
+ ring->sched.ops = NULL;
+ }
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
dma_fence_put(ring->fence_drv.fences[j]);
--
2.43.5
^ permalink raw reply related [flat|nested] 16+ messages in thread* Re: [v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
2025-01-10 2:08 ` [v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
@ 2025-01-10 6:51 ` Christian König
2025-01-10 7:37 ` Gerry Liu
0 siblings, 1 reply; 16+ messages in thread
From: Christian König @ 2025-01-10 6:51 UTC (permalink / raw)
To: Jiang Liu, alexander.deucher, Xinhui.Pan, airlied, simona,
sunil.khatri, lijo.lazar, Hawking.Zhang, mario.limonciello,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
Am 10.01.25 um 03:08 schrieb Jiang Liu:
> Function detects initialization status by checking sched->ops, so set
> sched->ops to non-NULL just before return in function
> amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
> to avoid possible invalid memory access on error recover path.
>
> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index f29a4ad3c6d0..3688e9eb949b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
> if (r) {
> DRM_ERROR("Failed to create scheduler on ring %s.\n",
> ring->name);
> + ring->sched.ops = NULL;
> return r;
> }
> r = amdgpu_uvd_entity_init(adev, ring);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 2f24a6aa13bf..b5e87b515139 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
> * The natural check would be sched.ready, which is
> * set as drm_sched_init() finishes...
> */
> - if (ring->sched.ops)
> + if (ring->sched.ops) {
> drm_sched_fini(&ring->sched);
> + ring->sched.ops = NULL;
> + }
As said in my previous reply we should really stop checking sched.ops here.
The scheduler should not be cleaned up by this function in the first place.
Regards,
Christian.
>
> for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
> dma_fence_put(ring->fence_drv.fences[j]);
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
2025-01-10 6:51 ` Christian König
@ 2025-01-10 7:37 ` Gerry Liu
2025-01-10 8:19 ` Christian König
0 siblings, 1 reply; 16+ messages in thread
From: Gerry Liu @ 2025-01-10 7:37 UTC (permalink / raw)
To: Christian König
Cc: alexander.deucher, Xinhui.Pan, airlied, simona, sunil.khatri,
lijo.lazar, Hawking.Zhang, mario.limonciello, xiaogang.chen,
Kent.Russell, shuox.liu, amd-gfx
> 2025年1月10日 14:51,Christian König <christian.koenig@amd.com> 写道:
>
> Am 10.01.25 um 03:08 schrieb Jiang Liu:
>> Function detects initialization status by checking sched->ops, so set
>> sched->ops to non-NULL just before return in function
>> amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
>> to avoid possible invalid memory access on error recover path.
>>
>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
>> 2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index f29a4ad3c6d0..3688e9eb949b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>> if (r) {
>> DRM_ERROR("Failed to create scheduler on ring %s.\n",
>> ring->name);
>> + ring->sched.ops = NULL;
>> return r;
>> }
>> r = amdgpu_uvd_entity_init(adev, ring);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 2f24a6aa13bf..b5e87b515139 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>> * The natural check would be sched.ready, which is
>> * set as drm_sched_init() finishes...
>> */
>> - if (ring->sched.ops)
>> + if (ring->sched.ops) {
>> drm_sched_fini(&ring->sched);
>> + ring->sched.ops = NULL;
>> + }
>
> As said in my previous reply we should really stop checking sched.ops here.
>
> The scheduler should not be cleaned up by this function in the first place.
Hi Christian,
The current workflow is as below:
amdgpu_device_init
amdgpu_fence_driver_sw_init
amdgpu_device_ip_init
amdgpu_device_init_schedulers
drm_sched_init
amdgpu_fence_driver_hw_init
amdgpu_device_fini_hw
amdgpu_fence_driver_hw_fini
amdgpu_device_fini_sw
amdgpu_device_ip_fini
amdgpu_fence_driver_sw_fini
drm_sched_fini
As you have mentioned, we should introduce amdgpu_device_fini_schedulers() and gets it called by amdgpu_device_ip_fini().
But I have no idea about why currently drm_sched_fini() is called by amdgpu_fence_driver_sw_fini() and whether it’s safe to move it into amdgpu_device_ip_fini().
Thanks,
Gerry
>
> Regards,
> Christian.
>
>> for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>> dma_fence_put(ring->fence_drv.fences[j]);
^ permalink raw reply [flat|nested] 16+ messages in thread* Re: [v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
2025-01-10 7:37 ` Gerry Liu
@ 2025-01-10 8:19 ` Christian König
0 siblings, 0 replies; 16+ messages in thread
From: Christian König @ 2025-01-10 8:19 UTC (permalink / raw)
To: Gerry Liu
Cc: alexander.deucher, Xinhui.Pan, airlied, simona, sunil.khatri,
lijo.lazar, Hawking.Zhang, mario.limonciello, xiaogang.chen,
Kent.Russell, shuox.liu, amd-gfx
Am 10.01.25 um 08:37 schrieb Gerry Liu:
>> 2025年1月10日 14:51,Christian König <christian.koenig@amd.com> 写道:
>>
>> Am 10.01.25 um 03:08 schrieb Jiang Liu:
>>> Function detects initialization status by checking sched->ops, so set
>>> sched->ops to non-NULL just before return in function
>>> amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers()
>>> to avoid possible invalid memory access on error recover path.
>>>
>>> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
>>> 2 files changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index f29a4ad3c6d0..3688e9eb949b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2857,6 +2857,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>> if (r) {
>>> DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>> ring->name);
>>> + ring->sched.ops = NULL;
>>> return r;
>>> }
>>> r = amdgpu_uvd_entity_init(adev, ring);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 2f24a6aa13bf..b5e87b515139 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -656,8 +656,10 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>>> * The natural check would be sched.ready, which is
>>> * set as drm_sched_init() finishes...
>>> */
>>> - if (ring->sched.ops)
>>> + if (ring->sched.ops) {
>>> drm_sched_fini(&ring->sched);
>>> + ring->sched.ops = NULL;
>>> + }
>> As said in my previous reply we should really stop checking sched.ops here.
>>
>> The scheduler should not be cleaned up by this function in the first place.
> Hi Christian,
> The current workflow is as below:
> amdgpu_device_init
> amdgpu_fence_driver_sw_init
> amdgpu_device_ip_init
> amdgpu_device_init_schedulers
> drm_sched_init
> amdgpu_fence_driver_hw_init
> amdgpu_device_fini_hw
> amdgpu_fence_driver_hw_fini
> amdgpu_device_fini_sw
> amdgpu_device_ip_fini
> amdgpu_fence_driver_sw_fini
> drm_sched_fini
>
> As you have mentioned, we should introduce amdgpu_device_fini_schedulers() and gets it called by amdgpu_device_ip_fini().
> But I have no idea about why currently drm_sched_fini() is called by amdgpu_fence_driver_sw_fini() and whether it’s safe to move it into amdgpu_device_ip_fini().
If I remember correctly we just put the drm_sched_fini() with the ops
check as cheap way of cleaning up all schedulers before we had the
ring->no_scheduler flag which indicates if the scheduler is initialized
or not in the first place.
So it should be save to move the drm_sched_fini() into
amdgpu_device_ip_fini() as well. Just check ring->no_scheduler instead
of ring->sched.ops to decide if the scheduler needs to be cleaned up or not.
Regards,
Christian.
>
> Thanks,
> Gerry
>
>> Regards,
>> Christian.
>>
>>> for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>>> dma_fence_put(ring->fence_drv.fences[j]);
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [v4 0/6] Fix several bugs in error handling during device probe
2025-01-10 2:08 [v4 0/6] Fix several bugs in error handling during device probe Jiang Liu
` (4 preceding siblings ...)
2025-01-10 2:08 ` [v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini() Jiang Liu
@ 2025-01-10 19:06 ` Mario Limonciello
5 siblings, 0 replies; 16+ messages in thread
From: Mario Limonciello @ 2025-01-10 19:06 UTC (permalink / raw)
To: Jiang Liu, alexander.deucher, christian.koenig, Xinhui.Pan,
airlied, simona, sunil.khatri, lijo.lazar, Hawking.Zhang,
xiaogang.chen, Kent.Russell, shuox.liu, amd-gfx
Just a minor nit on this cover letter; you have it listed as [v4 0/6]
but there are only 5 patches in the series in this version so the cover
letter title should be updated too.
On 1/9/2025 20:08, Jiang Liu wrote:
> This patchset tries to fix several memory leakages/invalid memory
> accesses on error handling path during GPU driver loading/unloading.
> They applies to:
> https://gitlab.freedesktop.org/agd5f/linux.git amd-staging-drm-next
>
> v4:
> 1) drop patch 1 in v3
> 2) split out amdxcp related change into a dedicated patch
> 3) use `guard(mutex)` instead of mutex_lock/unlock().
> 4) move patch 6 in v3 to next patch set
>
> v3:
> 1) drop first patch of v2
> 2) rework the 0003/0004 patches of v2 according to review comments
> 3) add patch 0004 to fix possible resource leakage in amdgpu_pci_probe()
>
> v2:
> 1) rebased to https://gitlab.freedesktop.org/agd5f/linux.git branch
> amd-staging-drm-next.
> 2) removed the first patch, which is unnecessary.
> 3) add amdgpu_xcp_drm_dev_free() in patch 0003 to enhance amdxcp
> driver to better support device remove and error handling.
> 4) reworked patch 0005 to fix it in amdgpu instead of drm core.
>
> Jiang Liu (5):
> drm/amdgpu: clear adev->in_suspend flag when fails to suspend
> drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()
> drm/amdgpu: fix use after free bug related to
> amdgpu_driver_release_kms()
> drm/amdgpu: enhance error handling in function amdgpu_pci_probe()
> drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 18 ++++--
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 14 +++--
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 70 +++++++++++++++++----
> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h | 4 +-
> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 65 ++++++++++++++++---
> drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.h | 1 +
> 7 files changed, 142 insertions(+), 34 deletions(-)
>
^ permalink raw reply [flat|nested] 16+ messages in thread