amdgpu vs kexec

amd-gfx.lists.freedesktop.org archive mirror

* amdgpu vs kexec
@ 2025-06-16  9:39 Peter Zijlstra
  2025-06-16 11:51 ` Christian König
  2025-06-16 14:02 ` Lazar, Lijo
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2025-06-16  9:39 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig; +Cc: Borislav Petkov, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 4440 bytes --]

Hi guys,

My (Intel Sapphire Rapids) workstation has an RX 7800 XT and when I kexec
a bunch of times, the amdgpu driver gets upset and barfs on boot.

It starts like so:

[   16.926489] amdgpu 0000:19:00.0: amdgpu: Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
[   16.980590] amdgpu 0000:19:00.0: amdgpu: reserve 0xa700000 from 0x83e0000000 for PSP TMR
[   19.204585] amdgpu 0000:19:00.0: amdgpu: failed to load ucode SMC(0x32)
[   19.227333] amdgpu 0000:19:00.0: amdgpu: psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
[   19.256420] amdgpu 0000:19:00.0: amdgpu: PSP load smu failed!
[   19.467875] [drm:psp_v13_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[   19.491771] amdgpu 0000:19:00.0: amdgpu: PSP firmware loading failed
[   19.513372] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[   19.540397] amdgpu 0000:19:00.0: amdgpu: amdgpu_device_ip_init failed
[   19.562177] amdgpu 0000:19:00.0: amdgpu: Fatal error during GPU init
[   19.583785] amdgpu 0000:19:00.0: amdgpu: amdgpu: finishing device.
[   19.605474] ------------[ cut here ]------------
[   19.615370] WARNING: CPU: 0 PID: 704 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:631 amdgpu_irq_put+0x46/0x70 [amdgpu]
[   19.638375] Modules linked in: rndis_host hid_generic cdc_ether usbhid usbnet mii hid amdgpu(+) amdxcp gpu_sched drm_panel_backlight_quirks drm_buddy drm_ttm_helper ttm video wmi drm_exec drm_suballoc_helper drm_display_helper ast ahci cec rc_core iTCO_wdt libahci drm_shmem_helper xhci_pci drm_client_lib intel_pmc_bxt libata xhci_hcd iTCO_vendor_support igb nvme watchdog drm_kms_helper idxd atlantic intel_lpss_pci usbcore scsi_mod i2c_algo_bit drm nvme_core i2c_i801 idxd_bus crc16 intel_lpss macsec dca i2c_smbus idma64 scsi_common ucsi_acpi typec_ucsi typec roles usb_common pinctrl_alderlake button efivarfs
[   19.754852] CPU: 0 UID: 0 PID: 704 Comm: kworker/0:5 Not tainted 6.15.0-dirty #51 PREEMPT(full)
[   19.773770] Hardware name: Supermicro SYS-531A-I/X13SRA-TF, BIOS 1.1b 08/01/2023
[   19.789693] Workqueue: events work_for_cpu_fn
[   19.799066] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[   19.810480] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 c3 cc cc cc cc e9 5a fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
[   19.851066] RSP: 0018:ff55eefd81aafd48 EFLAGS: 00010246
[   19.862314] RAX: ff466ca3653aac00 RBX: ff466ca2d7f98b40 RCX: 0000000000000000
[   19.877675] RDX: 0000000000000000 RSI: ff466ca2d7fa5990 RDI: ff466ca2d7f80000
[   19.893037] RBP: ff466ca2d7f90388 R08: 0000000000000000 R09: ff55eefd81aafb10
[   19.908401] R10: ff466cc1ffcd2fa8 R11: 0000000000000003 R12: ff466ca2d7f90830
[   19.923763] R13: ff466ca2d7f80010 R14: ff466ca2d7f80000 R15: ff466ca2d7fa5990
[   19.939132] FS:  0000000000000000(0000) GS:ff466cc1db2ee000(0000) knlGS:0000000000000000
[   19.956551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   19.968920] CR2: 00007f45f54e3de8 CR3: 000000207e624003 CR4: 0000000000f71ef0
[   19.984282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   19.999645] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   20.015010] PKRU: 55555554
[   20.020820] Call Trace:
[   20.026075]  <TASK>
[   20.030581]  amdgpu_fence_driver_hw_fini+0xfc/0x130 [amdgpu]
[   20.042894]  amdgpu_device_fini_hw+0xb7/0x2c6 [amdgpu]
[   20.054152]  amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
[   20.066323]  amdgpu_pci_probe+0x1cf/0x470 [amdgpu]
[   20.076775]  local_pci_probe+0x42/0x90
[   20.084839]  work_for_cpu_fn+0x17/0x30
[   20.092899]  process_one_work+0x188/0x340
[   20.101523]  worker_thread+0x256/0x3a0
[   20.109584]  ? __pfx_worker_thread+0x10/0x10
[   20.118767]  kthread+0xf9/0x240
[   20.125519]  ? __pfx_kthread+0x10/0x10
[   20.133578]  ret_from_fork+0x31/0x50
[   20.141268]  ? __pfx_kthread+0x10/0x10
[   20.149326]  ret_from_fork_asm+0x1a/0x30
[   20.157765]  </TASK>
[   20.162457] ---[ end trace 0000000000000000 ]---

and then continues to barf for a while longer. Full dmesg attached.

When I do a full power cycle it's okay again for a few kexecs, but will
ultimately go unhappy again.

I'm doing a 'normal' systemctl kexec, which I figure should more or less
shut things down normally. It's not like a crash-kexec -- which is a
whole other story and can be expected to cause trouble.

[-- Attachment #2: dmesg.gz --]
[-- Type: application/gzip, Size: 35326 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: amdgpu vs kexec
  2025-06-16  9:39 amdgpu vs kexec Peter Zijlstra
@ 2025-06-16 11:51 ` Christian König
  2025-06-16 14:54   ` Peter Zijlstra
  2025-06-16 14:02 ` Lazar, Lijo
  1 sibling, 1 reply; 15+ messages in thread
From: Christian König @ 2025-06-16 11:51 UTC (permalink / raw)
  To: Peter Zijlstra, alexander.deucher; +Cc: Borislav Petkov, amd-gfx

Hi Peter,

On 6/16/25 11:39, Peter Zijlstra wrote:
> Hi guys,
> 
> My (Intel Sapphire Rapids) workstation has a RX 7800 XT and when I kexec
> a bunch of times, the amdgpu driver gets upset and barfs on boot.

yeah, that is an "intentional" HW feature and yes you're certainly not the first one to complain about it :(

The PSP (platform security processor IIRC) is designed in such a way that you can initialize it only once after a power cycle / hard reset for security reasons (e.g. to not leak crypto keys used for digital rights management etc.).

On dGPUs we work around that manually by power cycling the ASIC when that situation is detected during amdgpu load, but that unfortunately doesn't work 100% reliably.

On APUs the situation is even worse because the PSP is shared between the GPU and the CPU.

We have forwarded such complaints internally for years, but there is not much else Alex and I can do about it.

Regards,
Christian.
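
To make the constraint described above concrete, here is a minimal userspace sketch in C. It is a toy model, not amdgpu code: struct toy_gpu, toy_psp_init(), toy_asic_reset() and toy_probe() are all hypothetical names, and returning -EINVAL is chosen only to echo the -22 in the log at the top of the thread. It assumes the PSP state survives a kexec and is cleared only by a reset that actually takes effect.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model: every name here is hypothetical, none of it is amdgpu code. */
struct toy_gpu {
	bool psp_initialized;	/* survives kexec, cleared only by a reset */
	bool reset_works;	/* models the "not 100% reliable" reset */
};

/* The PSP accepts initialization only once per power cycle / reset. */
static int toy_psp_init(struct toy_gpu *gpu)
{
	if (gpu->psp_initialized)
		return -EINVAL;
	gpu->psp_initialized = true;
	return 0;
}

/* Power-cycle the ASIC; PSP state is cleared only if the reset succeeds. */
static int toy_asic_reset(struct toy_gpu *gpu)
{
	if (!gpu->reset_works)
		return -EIO;
	gpu->psp_initialized = false;
	return 0;
}

/* Driver load: if the PSP is still live from a previous kernel, reset first. */
static int toy_probe(struct toy_gpu *gpu)
{
	if (gpu->psp_initialized)
		toy_asic_reset(gpu);	/* best effort; a failure shows up in init */
	return toy_psp_init(gpu);
}
```

With reset_works set, a second toy_probe() (the kexec case) succeeds. With it cleared, the second probe fails, which is the shape of the failure reported in this thread.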

* Re: amdgpu vs kexec
  2025-06-16  9:39 amdgpu vs kexec Peter Zijlstra
  2025-06-16 11:51 ` Christian König
@ 2025-06-16 14:02 ` Lazar, Lijo
  1 sibling, 0 replies; 15+ messages in thread
From: Lazar, Lijo @ 2025-06-16 14:02 UTC (permalink / raw)
  To: Peter Zijlstra, alexander.deucher, christian.koenig
  Cc: Borislav Petkov, amd-gfx, Kenneth Feng



On 6/16/2025 3:09 PM, Peter Zijlstra wrote:
> Hi guys,
> 
> My (Intel Sapphire Rapids) workstation has a RX 7800 XT and when I kexec
> a bunch of times, the amdgpu driver gets upset and barfs on boot.
> 
> It starts like so:
> 
> [   16.926489] amdgpu 0000:19:00.0: amdgpu: Found VCN firmware Version ENC: 1.23 DEC: 9 VEP: 0 Revision: 16
> [   16.980590] amdgpu 0000:19:00.0: amdgpu: reserve 0xa700000 from 0x83e0000000 for PSP TMR
> [   19.204585] amdgpu 0000:19:00.0: amdgpu: failed to load ucode SMC(0x32)
> [   19.227333] amdgpu 0000:19:00.0: amdgpu: psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0)
> [   19.256420] amdgpu 0000:19:00.0: amdgpu: PSP load smu failed!
> [   19.467875] [drm:psp_v13_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
> [   19.491771] amdgpu 0000:19:00.0: amdgpu: PSP firmware loading failed
> [   19.513372] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
> [   19.540397] amdgpu 0000:19:00.0: amdgpu: amdgpu_device_ip_init failed
> [   19.562177] amdgpu 0000:19:00.0: amdgpu: Fatal error during GPU init
> [   19.583785] amdgpu 0000:19:00.0: amdgpu: amdgpu: finishing device.
> [   19.605474] ------------[ cut here ]------------
> [   19.615370] WARNING: CPU: 0 PID: 704 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:631 amdgpu_irq_put+0x46/0x70 [amdgpu]
> [   19.638375] Modules linked in: rndis_host hid_generic cdc_ether usbhid usbnet mii hid amdgpu(+) amdxcp gpu_sched drm_panel_backlight_quirks drm_buddy drm_ttm_helper ttm video wmi drm_exec drm_suballoc_helper drm_display_helper ast ahci cec rc_core iTCO_wdt libahci drm_shmem_helper xhci_pci drm_client_lib intel_pmc_bxt libata xhci_hcd iTCO_vendor_support igb nvme watchdog drm_kms_helper idxd atlantic intel_lpss_pci usbcore scsi_mod i2c_algo_bit drm nvme_core i2c_i801 idxd_bus crc16 intel_lpss macsec dca i2c_smbus idma64 scsi_common ucsi_acpi typec_ucsi typec roles usb_common pinctrl_alderlake button efivarfs
> [   19.754852] CPU: 0 UID: 0 PID: 704 Comm: kworker/0:5 Not tainted 6.15.0-dirty #51 PREEMPT(full)
> [   19.773770] Hardware name: Supermicro SYS-531A-I/X13SRA-TF, BIOS 1.1b 08/01/2023
> [   19.789693] Workqueue: events work_for_cpu_fn
> [   19.799066] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
> [   19.810480] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 c3 cc cc cc cc e9 5a fd ff ff <0f> 0b b8 ea ff ff ff c3 cc cc cc cc b8 ea ff ff ff c3 cc cc cc cc
> [   19.851066] RSP: 0018:ff55eefd81aafd48 EFLAGS: 00010246
> [   19.862314] RAX: ff466ca3653aac00 RBX: ff466ca2d7f98b40 RCX: 0000000000000000
> [   19.877675] RDX: 0000000000000000 RSI: ff466ca2d7fa5990 RDI: ff466ca2d7f80000
> [   19.893037] RBP: ff466ca2d7f90388 R08: 0000000000000000 R09: ff55eefd81aafb10
> [   19.908401] R10: ff466cc1ffcd2fa8 R11: 0000000000000003 R12: ff466ca2d7f90830
> [   19.923763] R13: ff466ca2d7f80010 R14: ff466ca2d7f80000 R15: ff466ca2d7fa5990
> [   19.939132] FS:  0000000000000000(0000) GS:ff466cc1db2ee000(0000) knlGS:0000000000000000
> [   19.956551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   19.968920] CR2: 00007f45f54e3de8 CR3: 000000207e624003 CR4: 0000000000f71ef0
> [   19.984282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   19.999645] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [   20.015010] PKRU: 55555554
> [   20.020820] Call Trace:
> [   20.026075]  <TASK>
> [   20.030581]  amdgpu_fence_driver_hw_fini+0xfc/0x130 [amdgpu]
> [   20.042894]  amdgpu_device_fini_hw+0xb7/0x2c6 [amdgpu]
> [   20.054152]  amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
> [   20.066323]  amdgpu_pci_probe+0x1cf/0x470 [amdgpu]
> [   20.076775]  local_pci_probe+0x42/0x90
> [   20.084839]  work_for_cpu_fn+0x17/0x30
> [   20.092899]  process_one_work+0x188/0x340
> [   20.101523]  worker_thread+0x256/0x3a0
> [   20.109584]  ? __pfx_worker_thread+0x10/0x10
> [   20.118767]  kthread+0xf9/0x240
> [   20.125519]  ? __pfx_kthread+0x10/0x10
> [   20.133578]  ret_from_fork+0x31/0x50
> [   20.141268]  ? __pfx_kthread+0x10/0x10
> [   20.149326]  ret_from_fork_asm+0x1a/0x30
> [   20.157765]  </TASK>
> [   20.162457] ---[ end trace 0000000000000000 ]---
> 
> and then continues to barf for a while longer. Full dmesg attached.
> 
> When I do a full power cycle its okay again for a few kexecs, but will
> ultimately go unhappy again.
> 
> I'm doing a 'normal' systemctl kexec, which I figure should more or less
> shut things down normally. Its not like a crash-kexec -- which is a
> whole other story and can be expected to cause trouble.

From the log -

[   16.512201] amdgpu 0000:19:00.0: amdgpu: GPU mode1 reset
[   16.531581] amdgpu 0000:19:00.0: amdgpu: SMU: valid command, bad prerequisites: index:2 param:0x00000000 message:GetSmuVersion
[   16.564138] amdgpu 0000:19:00.0: amdgpu: GPU psp mode1 reset

It looks like the PMFW, which is responsible for the reset, is not in good
shape, and the driver then takes an unexpected code path that breaks the
PSP as well.

Kenneth, any thoughts?

Thanks,
Lijo



* Re: amdgpu vs kexec
  2025-06-16 11:51 ` Christian König
@ 2025-06-16 14:54   ` Peter Zijlstra
  2025-06-18  2:12     ` Mario Limonciello
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2025-06-16 14:54 UTC (permalink / raw)
  To: Christian König; +Cc: alexander.deucher, Borislav Petkov, amd-gfx

On Mon, Jun 16, 2025 at 01:51:21PM +0200, Christian König wrote:
> Hi Peter,
> 
> On 6/16/25 11:39, Peter Zijlstra wrote:
> > Hi guys,
> > 
> > My (Intel Sapphire Rapids) workstation has a RX 7800 XT and when I kexec
> > a bunch of times, the amdgpu driver gets upset and barfs on boot.
> 
> yeah, that is an "intentional" HW feature and yes you're certainly not
> the first one to complain about it :(
> 
> The PSP (platform security processor IIRC) is designed in such a way
> that you can initialize it only once after a power cycle / hard reset
> for security reasons (e.g. to not leak crypto keys used for digital
> rights management etc..).
> 
> On dGPUs we work around that manually by power cycling the ASIC when
> that situation is detected during amdgpu load, but that unfortunately
> doesn't work 100% reliable.

Right.. hence the splats.

> On APUs the situation is even worse because the PSP is shared between
> the GPU and the CPU.
> 
> We have forwarded such complains internally for years, but there is
> not much else Alex and I can do about it.

Oh well. Thanks for the info!

* Re: amdgpu vs kexec
  2025-06-16 14:54   ` Peter Zijlstra
@ 2025-06-18  2:12     ` Mario Limonciello
  2025-06-18  8:51       ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Mario Limonciello @ 2025-06-18  2:12 UTC (permalink / raw)
  To: Peter Zijlstra, Christian König
  Cc: alexander.deucher, Borislav Petkov, amd-gfx

On 6/16/2025 9:54 AM, Peter Zijlstra wrote:
> On Mon, Jun 16, 2025 at 01:51:21PM +0200, Christian König wrote:
>> Hi Peter,
>>
>> On 6/16/25 11:39, Peter Zijlstra wrote:
>>> Hi guys,
>>>
>>> My (Intel Sapphire Rapids) workstation has a RX 7800 XT and when I kexec
>>> a bunch of times, the amdgpu driver gets upset and barfs on boot.
>>
>> yeah, that is an "intentional" HW feature and yes you're certainly not
>> the first one to complain about it :(
>>
>> The PSP (platform security processor IIRC) is designed in such a way
>> that you can initialize it only once after a power cycle / hard reset
>> for security reasons (e.g. to not leak crypto keys used for digital
>> rights management etc..).
>>
>> On dGPUs we work around that manually by power cycling the ASIC when
>> that situation is detected during amdgpu load, but that unfortunately
>> doesn't work 100% reliable.
> 
> Right.. hence the splats.

How about if we reset before the kexec?  There is a symbol for drivers 
to use to know they're about to go through kexec to do $THINGS.

Something like this:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 0fc0eeedc6461..2b1216b14d618 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -34,6 +34,7 @@

  #include <linux/cc_platform.h>
  #include <linux/dynamic_debug.h>
+#include <linux/kexec.h>
  #include <linux/module.h>
  #include <linux/mmu_notifier.h>
  #include <linux/pm_runtime.h>
@@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
                 adev->mp1_state = PP_MP1_STATE_UNLOAD;
         amdgpu_device_ip_suspend(adev);
         adev->mp1_state = PP_MP1_STATE_NONE;
+
+       if (kexec_in_progress)
+               amdgpu_asic_reset(adev);
  }

  static int amdgpu_pmops_prepare(struct device *dev)

> 
>> On APUs the situation is even worse because the PSP is shared between
>> the GPU and the CPU.
>>
>> We have forwarded such complains internally for years, but there is
>> not much else Alex and I can do about it.
> 
> Oh well. Thanks for the info!
> 
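
The control flow of the patch proposed above can be modelled in userspace as follows. This is only a sketch: struct toy_adev, toy_ip_suspend(), toy_asic_reset() and toy_pci_shutdown() are hypothetical stand-ins, and the local kexec_in_progress flag imitates the kernel variable of the same name that the patch picks up from <linux/kexec.h>. It says nothing about whether the real reset succeeds on the hardware.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's kexec_in_progress flag. */
static bool kexec_in_progress;

/* Hypothetical device state: just counts what the shutdown path did. */
struct toy_adev {
	int suspend_calls;
	int reset_calls;
};

static void toy_ip_suspend(struct toy_adev *adev)
{
	adev->suspend_calls++;
}

static void toy_asic_reset(struct toy_adev *adev)
{
	adev->reset_calls++;
}

/*
 * Same shape as the patched shutdown hook: always suspend the IP blocks,
 * and reset the ASIC only when the shutdown is part of a kexec.
 */
static void toy_pci_shutdown(struct toy_adev *adev)
{
	toy_ip_suspend(adev);
	if (kexec_in_progress)
		toy_asic_reset(adev);
}
```

A plain reboot or poweroff leaves reset_calls untouched, so the extra reset is paid only on the kexec path, where the next kernel would otherwise find the PSP already initialized.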


* Re: amdgpu vs kexec
  2025-06-18  2:12     ` Mario Limonciello
@ 2025-06-18  8:51       ` Peter Zijlstra
  2025-06-18  9:05         ` Christian König
  2025-06-18  9:12         ` Peter Zijlstra
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2025-06-18  8:51 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: Christian König, alexander.deucher, Borislav Petkov, amd-gfx

On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:

> How about if we reset before the kexec?  There is a symbol for drivers to
> use to know they're about to go through kexec to do $THINGS.
> 
> Something like this:
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 0fc0eeedc6461..2b1216b14d618 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -34,6 +34,7 @@
> 
>  #include <linux/cc_platform.h>
>  #include <linux/dynamic_debug.h>
> +#include <linux/kexec.h>
>  #include <linux/module.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/pm_runtime.h>
> @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>                 adev->mp1_state = PP_MP1_STATE_UNLOAD;
>         amdgpu_device_ip_suspend(adev);
>         adev->mp1_state = PP_MP1_STATE_NONE;
> +
> +       if (kexec_in_progress)
> +               amdgpu_asic_reset(adev);
>  }
> 
>  static int amdgpu_pmops_prepare(struct device *dev)

I will throw this in the dev kernel... I'll let you know.

* Re: amdgpu vs kexec
  2025-06-18  8:51       ` Peter Zijlstra
@ 2025-06-18  9:05         ` Christian König
  2025-06-18 13:34           ` Mario Limonciello
  2025-06-18  9:12         ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Christian König @ 2025-06-18  9:05 UTC (permalink / raw)
  To: Peter Zijlstra, Mario Limonciello, Lazar, Lijo
  Cc: alexander.deucher, Borislav Petkov, amd-gfx

On 6/18/25 10:51, Peter Zijlstra wrote:
> On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
> 
>> How about if we reset before the kexec?  There is a symbol for drivers to
>> use to know they're about to go through kexec to do $THINGS.
>>
>> Something like this:
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 0fc0eeedc6461..2b1216b14d618 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -34,6 +34,7 @@
>>
>>  #include <linux/cc_platform.h>
>>  #include <linux/dynamic_debug.h>
>> +#include <linux/kexec.h>
>>  #include <linux/module.h>
>>  #include <linux/mmu_notifier.h>
>>  #include <linux/pm_runtime.h>
>> @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>>                 adev->mp1_state = PP_MP1_STATE_UNLOAD;
>>         amdgpu_device_ip_suspend(adev);
>>         adev->mp1_state = PP_MP1_STATE_NONE;
>> +
>> +       if (kexec_in_progress)
>> +               amdgpu_asic_reset(adev);
>>  }
>>
>>  static int amdgpu_pmops_prepare(struct device *dev)
> 
> I will throw this in the dev kernel... I'll let you know.

Mhm if the drivers are informed about the kexec then we could also send the unload/reset packet only to the PSP IIRC.

That might have a better chance of succeeding than a full ASIC reset.

Lijo should know more about that.

Regards,
Christian.

* Re: amdgpu vs kexec
  2025-06-18  8:51       ` Peter Zijlstra
  2025-06-18  9:05         ` Christian König
@ 2025-06-18  9:12         ` Peter Zijlstra
  2025-06-18  9:26           ` Peter Zijlstra
  2025-06-18 23:55           ` Baoquan He
  1 sibling, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2025-06-18  9:12 UTC (permalink / raw)
  To: Mario Limonciello, bhe
  Cc: Christian König, alexander.deucher, Borislav Petkov, amd-gfx

On Wed, Jun 18, 2025 at 10:51:23AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
> 
> > How about if we reset before the kexec?  There is a symbol for drivers to
> > use to know they're about to go through kexec to do $THINGS.
> > 
> > Something like this:
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 0fc0eeedc6461..2b1216b14d618 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -34,6 +34,7 @@
> > 
> >  #include <linux/cc_platform.h>
> >  #include <linux/dynamic_debug.h>
> > +#include <linux/kexec.h>
> >  #include <linux/module.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/pm_runtime.h>
> > @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
> >                 adev->mp1_state = PP_MP1_STATE_UNLOAD;
> >         amdgpu_device_ip_suspend(adev);
> >         adev->mp1_state = PP_MP1_STATE_NONE;
> > +
> > +       if (kexec_in_progress)
> > +               amdgpu_asic_reset(adev);
> >  }
> > 
> >  static int amdgpu_pmops_prepare(struct device *dev)
> 
> I will throw this in the dev kernel... I'll let you know.

First hurdle appears to be that this symbol is not exported. I fixed
that, but perhaps the kexec folks don't like drivers to use this?

* Re: amdgpu vs kexec
  2025-06-18  9:12         ` Peter Zijlstra
@ 2025-06-18  9:26           ` Peter Zijlstra
  2025-06-18 13:35             ` Mario Limonciello
  2025-06-20 10:39             ` Lazar, Lijo
  2025-06-18 23:55           ` Baoquan He
  1 sibling, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2025-06-18  9:26 UTC (permalink / raw)
  To: Mario Limonciello, bhe
  Cc: Christian König, alexander.deucher, Borislav Petkov, amd-gfx

On Wed, Jun 18, 2025 at 11:12:32AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 18, 2025 at 10:51:23AM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
> > 
> > > How about if we reset before the kexec?  There is a symbol for drivers to
> > > use to know they're about to go through kexec to do $THINGS.
> > > 
> > > Something like this:
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index 0fc0eeedc6461..2b1216b14d618 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -34,6 +34,7 @@
> > > 
> > >  #include <linux/cc_platform.h>
> > >  #include <linux/dynamic_debug.h>
> > > +#include <linux/kexec.h>
> > >  #include <linux/module.h>
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/pm_runtime.h>
> > > @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
> > >                 adev->mp1_state = PP_MP1_STATE_UNLOAD;
> > >         amdgpu_device_ip_suspend(adev);
> > >         adev->mp1_state = PP_MP1_STATE_NONE;
> > > +
> > > +       if (kexec_in_progress)
> > > +               amdgpu_asic_reset(adev);
> > >  }
> > > 
> > >  static int amdgpu_pmops_prepare(struct device *dev)
> > 
> > I will throw this in the dev kernel... I'll let you know.
> 
> First hurdle appears to be that this symbol is not exported. I fixed
> that, but perhaps the kexec folks don't like drivers to use this?

Bah, so first kexec after a fresh reboot into a kernel carrying this has
the thing failing.


* Re: amdgpu vs kexec
  2025-06-18  9:05         ` Christian König
@ 2025-06-18 13:34           ` Mario Limonciello
  2025-06-18 13:46             ` Alex Deucher
  0 siblings, 1 reply; 15+ messages in thread
From: Mario Limonciello @ 2025-06-18 13:34 UTC (permalink / raw)
  To: Christian König, Peter Zijlstra, Lazar, Lijo
  Cc: alexander.deucher, Borislav Petkov, amd-gfx

On 6/18/2025 4:05 AM, Christian König wrote:
> On 6/18/25 10:51, Peter Zijlstra wrote:
>> On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
>>
>>> How about if we reset before the kexec?  There is a symbol for drivers to
>>> use to know they're about to go through kexec to do $THINGS.
>>>
>>> Something like this:
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 0fc0eeedc6461..2b1216b14d618 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -34,6 +34,7 @@
>>>
>>>   #include <linux/cc_platform.h>
>>>   #include <linux/dynamic_debug.h>
>>> +#include <linux/kexec.h>
>>>   #include <linux/module.h>
>>>   #include <linux/mmu_notifier.h>
>>>   #include <linux/pm_runtime.h>
>>> @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>>>                  adev->mp1_state = PP_MP1_STATE_UNLOAD;
>>>          amdgpu_device_ip_suspend(adev);
>>>          adev->mp1_state = PP_MP1_STATE_NONE;
>>> +
>>> +       if (kexec_in_progress)
>>> +               amdgpu_asic_reset(adev);
>>>   }
>>>
>>>   static int amdgpu_pmops_prepare(struct device *dev)
>>
>> I will throw this in the dev kernel... I'll let you know.
> 
> Mhm if the drivers are informed about the kexec

It looks like PeterZ found the symbol isn't exported; but that's not to 
say it "can't be" if it fixes this issue.

> then we could also send the unload/reset packet only to the PSP IIRC.
> 
> That might have a better chance of succeeding than a full ASIC reset.
> 
> Lijo should know more about that.
> 
> Regards,
> Christian.

Another idea is to do an FLR on the way down.
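
For anyone who wants to experiment with a function reset from userspace before touching the driver, the sketch below writes "1" to the PCI sysfs reset attribute. This is not the in-driver change suggested above but a swapped-in userspace approximation, with several assumptions: pci_reset_path() and pci_function_reset() are hypothetical helper names, the device must expose a reset attribute, the caller needs root, and the kernel picks among the reset methods the device supports, so the result is not guaranteed to be an FLR. Resetting a GPU that still has a driver bound may well make things worse.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Build "/sys/bus/pci/devices/<bdf>/reset" into buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
static int pci_reset_path(const char *bdf, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/reset", bdf);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Ask the kernel to reset one PCI function, e.g. "0000:19:00.0".
 * Returns 0 on success, -1 on any failure.
 */
static int pci_function_reset(const char *bdf)
{
	char path[128];
	int fd, ok;

	if (pci_reset_path(bdf, path, sizeof(path)))
		return -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;	/* no reset attribute, or no permission */

	ok = write(fd, "1", 1) == 1;
	close(fd);
	return ok ? 0 : -1;
}
```

Calling pci_function_reset("0000:19:00.0") would target the device address seen in the log at the top of the thread.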

* Re: amdgpu vs kexec
  2025-06-18  9:26           ` Peter Zijlstra
@ 2025-06-18 13:35             ` Mario Limonciello
  2025-06-20 10:39             ` Lazar, Lijo
  1 sibling, 0 replies; 15+ messages in thread
From: Mario Limonciello @ 2025-06-18 13:35 UTC (permalink / raw)
  To: Peter Zijlstra, bhe
  Cc: Christian König, alexander.deucher, Borislav Petkov, amd-gfx

On 6/18/2025 4:26 AM, Peter Zijlstra wrote:
> On Wed, Jun 18, 2025 at 11:12:32AM +0200, Peter Zijlstra wrote:
>> On Wed, Jun 18, 2025 at 10:51:23AM +0200, Peter Zijlstra wrote:
>>> On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
>>>
>>>> How about if we reset before the kexec?  There is a symbol for drivers to
>>>> use to know they're about to go through kexec to do $THINGS.
>>>>
>>>> Something like this:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 0fc0eeedc6461..2b1216b14d618 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -34,6 +34,7 @@
>>>>
>>>>   #include <linux/cc_platform.h>
>>>>   #include <linux/dynamic_debug.h>
>>>> +#include <linux/kexec.h>
>>>>   #include <linux/module.h>
>>>>   #include <linux/mmu_notifier.h>
>>>>   #include <linux/pm_runtime.h>
>>>> @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>>>>                  adev->mp1_state = PP_MP1_STATE_UNLOAD;
>>>>          amdgpu_device_ip_suspend(adev);
>>>>          adev->mp1_state = PP_MP1_STATE_NONE;
>>>> +
>>>> +       if (kexec_in_progress)
>>>> +               amdgpu_asic_reset(adev);
>>>>   }
>>>>
>>>>   static int amdgpu_pmops_prepare(struct device *dev)
>>>
>>> I will throw this in the dev kernel... I'll let you know.
>>
>> First hurdle appears to be that this symbol is not exported. I fixed
>> that, but perhaps the kexec folks don't like drivers to use this?
> 
> Bah, so first kexec after a fresh reboot into a kernel carrying this has
> the thing failing.
> 

Dang, too bad.

* Re: amdgpu vs kexec
  2025-06-18 13:34           ` Mario Limonciello
@ 2025-06-18 13:46             ` Alex Deucher
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Deucher @ 2025-06-18 13:46 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: Christian König, Peter Zijlstra, Lazar, Lijo,
	alexander.deucher, Borislav Petkov, amd-gfx

On Wed, Jun 18, 2025 at 9:41 AM Mario Limonciello <superm1@kernel.org> wrote:
>
> On 6/18/2025 4:05 AM, Christian König wrote:
> > On 6/18/25 10:51, Peter Zijlstra wrote:
> >> On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
> >>
> >>> How about if we reset before the kexec?  There is a symbol for drivers to
> >>> use to know they're about to go through kexec to do $THINGS.
> >>>
> >>> Something like this:
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> index 0fc0eeedc6461..2b1216b14d618 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> @@ -34,6 +34,7 @@
> >>>
> >>>   #include <linux/cc_platform.h>
> >>>   #include <linux/dynamic_debug.h>
> >>> +#include <linux/kexec.h>
> >>>   #include <linux/module.h>
> >>>   #include <linux/mmu_notifier.h>
> >>>   #include <linux/pm_runtime.h>
> >>> @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
> >>>                  adev->mp1_state = PP_MP1_STATE_UNLOAD;
> >>>          amdgpu_device_ip_suspend(adev);
> >>>          adev->mp1_state = PP_MP1_STATE_NONE;
> >>> +
> >>> +       if (kexec_in_progress)
> >>> +               amdgpu_asic_reset(adev);
> >>>   }
> >>>
> >>>   static int amdgpu_pmops_prepare(struct device *dev)
> >>
> >> I will throw this in the dev kernel... I'll let you know.
> >
> > Mhm if the drivers are informed about the kexec
>
> It looks like PeterZ found the symbol isn't exported; but that's not to
> say it "can't be" if it fixes this issue.
>
> > then we could also send the unload/reset packet only to the PSP IIRC.
> >
> > That might have a better chance of succeeding than a full ASIC reset.
> >
> > Lijo should know more about that.
> >
> > Regards,
> > Christian.
>
> Another idea is to do a FLR on the way down.

I think you want something like:

r = amdgpu_dpm_set_mp1_state(adev, PP_MP1_STATE_UNLOAD);

Alex

* Re: amdgpu vs kexec
  2025-06-18  9:12         ` Peter Zijlstra
  2025-06-18  9:26           ` Peter Zijlstra
@ 2025-06-18 23:55           ` Baoquan He
  2025-06-19 13:32             ` Mario Limonciello
  1 sibling, 1 reply; 15+ messages in thread
From: Baoquan He @ 2025-06-18 23:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mario Limonciello, Christian König, alexander.deucher,
	Borislav Petkov, amd-gfx

On 06/18/25 at 11:12am, Peter Zijlstra wrote:
> On Wed, Jun 18, 2025 at 10:51:23AM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
> > 
> > > How about if we reset before the kexec?  There is a symbol for drivers to
> > > use to know they're about to go through kexec to do $THINGS.
> > > 
> > > Something like this:
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index 0fc0eeedc6461..2b1216b14d618 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -34,6 +34,7 @@
> > > 
> > >  #include <linux/cc_platform.h>
> > >  #include <linux/dynamic_debug.h>
> > > +#include <linux/kexec.h>
> > >  #include <linux/module.h>
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/pm_runtime.h>
> > > @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
> > >                 adev->mp1_state = PP_MP1_STATE_UNLOAD;
> > >         amdgpu_device_ip_suspend(adev);
> > >         adev->mp1_state = PP_MP1_STATE_NONE;
> > > +
> > > +       if (kexec_in_progress)
> > > +               amdgpu_asic_reset(adev);
> > >  }
> > > 
> > >  static int amdgpu_pmops_prepare(struct device *dev)
> > 
> > I will throw this in the dev kernel... I'll let you know.
> 
> First hurdle appears to be that this symbol is not exported. I fixed
> that, but perhaps the kexec folks don't like drivers to use this?

I can't find the original mail of this thread, but we don't have a
known restriction about that, afaik.


* Re: amdgpu vs kexec
  2025-06-18 23:55           ` Baoquan He
@ 2025-06-19 13:32             ` Mario Limonciello
  0 siblings, 0 replies; 15+ messages in thread
From: Mario Limonciello @ 2025-06-19 13:32 UTC (permalink / raw)
  To: Baoquan He, Peter Zijlstra
  Cc: Christian König, alexander.deucher, Borislav Petkov, amd-gfx

On 6/18/2025 6:55 PM, Baoquan He wrote:
> On 06/18/25 at 11:12am, Peter Zijlstra wrote:
>> On Wed, Jun 18, 2025 at 10:51:23AM +0200, Peter Zijlstra wrote:
>>> On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
>>>
>>>> How about if we reset before the kexec?  There is a symbol for drivers to
>>>> use to know they're about to go through kexec to do $THINGS.
>>>>
>>>> Something like this:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 0fc0eeedc6461..2b1216b14d618 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -34,6 +34,7 @@
>>>>
>>>>   #include <linux/cc_platform.h>
>>>>   #include <linux/dynamic_debug.h>
>>>> +#include <linux/kexec.h>
>>>>   #include <linux/module.h>
>>>>   #include <linux/mmu_notifier.h>
>>>>   #include <linux/pm_runtime.h>
>>>> @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>>>>                  adev->mp1_state = PP_MP1_STATE_UNLOAD;
>>>>          amdgpu_device_ip_suspend(adev);
>>>>          adev->mp1_state = PP_MP1_STATE_NONE;
>>>> +
>>>> +       if (kexec_in_progress)
>>>> +               amdgpu_asic_reset(adev);
>>>>   }
>>>>
>>>>   static int amdgpu_pmops_prepare(struct device *dev)
>>>
>>> I will throw this in the dev kernel... I'll let you know.
>>
>> First hurdle appears to be that this symbol is not exported. I fixed
>> that, but perhaps the kexec folks don't like drivers to use this?
> 
> I can't find the original mail of this thread, while we don't have a
> known restriction about that afaik.
> 

FYI here's the whole thread:

https://lore.kernel.org/amd-gfx/423aec58-0ab2-4471-b986-dfb955e63ca8@kernel.org/T/#m68bea029aac9b7ec015a26a8dfb8268ffb007125




* Re: amdgpu vs kexec
  2025-06-18  9:26           ` Peter Zijlstra
  2025-06-18 13:35             ` Mario Limonciello
@ 2025-06-20 10:39             ` Lazar, Lijo
  1 sibling, 0 replies; 15+ messages in thread
From: Lazar, Lijo @ 2025-06-20 10:39 UTC (permalink / raw)
  To: Peter Zijlstra, Mario Limonciello, bhe
  Cc: Christian König, alexander.deucher, Borislav Petkov, amd-gfx



On 6/18/2025 2:56 PM, Peter Zijlstra wrote:
> On Wed, Jun 18, 2025 at 11:12:32AM +0200, Peter Zijlstra wrote:
>> On Wed, Jun 18, 2025 at 10:51:23AM +0200, Peter Zijlstra wrote:
>>> On Tue, Jun 17, 2025 at 09:12:12PM -0500, Mario Limonciello wrote:
>>>
>>>> How about if we reset before the kexec?  There is a symbol for drivers to
>>>> use to know they're about to go through kexec to do $THINGS.
>>>>
>>>> Something like this:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 0fc0eeedc6461..2b1216b14d618 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -34,6 +34,7 @@
>>>>
>>>>  #include <linux/cc_platform.h>
>>>>  #include <linux/dynamic_debug.h>
>>>> +#include <linux/kexec.h>
>>>>  #include <linux/module.h>
>>>>  #include <linux/mmu_notifier.h>
>>>>  #include <linux/pm_runtime.h>
>>>> @@ -2544,6 +2545,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>>>>                 adev->mp1_state = PP_MP1_STATE_UNLOAD;
>>>>         amdgpu_device_ip_suspend(adev);
>>>>         adev->mp1_state = PP_MP1_STATE_NONE;
>>>> +
>>>> +       if (kexec_in_progress)
>>>> +               amdgpu_asic_reset(adev);
>>>>  }
>>>>
>>>>  static int amdgpu_pmops_prepare(struct device *dev)
>>>
>>> I will throw this in the dev kernel... I'll let you know.
>>
>> First hurdle appears to be that this symbol is not exported. I fixed
>> that, but perhaps the kexec folks don't like drivers to use this?
> 
> Bah, so first kexec after a fresh reboot into a kernel carrying this has
> the thing failing.
> 

Could you check whether passing the amdgpu module parameter 'runpm=0' helps?

Thanks,
Lijo
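
For reference, the 'runpm' parameter mentioned above controls amdgpu
runtime power management, and a value of 0 disables it. A minimal sketch
of two common ways to pass it follows. The modprobe file name is
hypothetical, and which option applies depends on how the system loads
the module; neither was confirmed in this thread.

```
# Option 1: on the kernel command line of the kernel being booted
# (boot loader entry, or the command line handed to the kexec'ed kernel)
amdgpu.runpm=0

# Option 2: as a modprobe option
# /etc/modprobe.d/amdgpu-runpm.conf   <- hypothetical file name
options amdgpu runpm=0
```

If amdgpu is loaded from an initramfs, the modprobe option likely only
takes effect once the initramfs has been regenerated, so the kernel
command line form is probably the quicker one to test with.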



Thread overview: 15+ messages
2025-06-16  9:39 amdgpu vs kexec Peter Zijlstra
2025-06-16 11:51 ` Christian König
2025-06-16 14:54   ` Peter Zijlstra
2025-06-18  2:12     ` Mario Limonciello
2025-06-18  8:51       ` Peter Zijlstra
2025-06-18  9:05         ` Christian König
2025-06-18 13:34           ` Mario Limonciello
2025-06-18 13:46             ` Alex Deucher
2025-06-18  9:12         ` Peter Zijlstra
2025-06-18  9:26           ` Peter Zijlstra
2025-06-18 13:35             ` Mario Limonciello
2025-06-20 10:39             ` Lazar, Lijo
2025-06-18 23:55           ` Baoquan He
2025-06-19 13:32             ` Mario Limonciello
2025-06-16 14:02 ` Lazar, Lijo
