* [PATCH v4 0/2] Address amdgpu reload issues in APUs
From: Rodrigo Siqueira @ 2025-11-19 0:22 UTC
To: Alex Deucher, Christian König, Mario Limonciello
Cc: Robert Beckett, amd-gfx, kernel-dev, Rodrigo Siqueira

This series addresses the issue of amdgpu reload failures in APUs. The
first commit adds a GPU reset during unload time, and the second commit
removes a specific fix for the Renoir device that becomes outdated with
the first patch.

Thanks
Siqueira

Rodrigo Siqueira (2):
  drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
  Revert "drm/amd: fix gfx hang on renoir in IGT reload test"

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
 drivers/gpu/drm/amd/amdgpu/soc15.c         | 4 ----
 2 files changed, 9 insertions(+), 4 deletions(-)

-- 
2.51.0

* [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
From: Rodrigo Siqueira @ 2025-11-19 0:22 UTC
To: Alex Deucher, Christian König, Mario Limonciello
Cc: Robert Beckett, amd-gfx, kernel-dev, Rodrigo Siqueira

When trying to unload amdgpu in the SteamDeck (TTY mode), the following
set of errors happens and the system gets unstable:

[..]
[drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
[..]
amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
[..]

When the driver initializes the GPU, the PSP validates all the firmware
loaded, and after that, it is not possible to load any other firmware
unless the device is reset. What is happening in the load/unload
situation is that PSP halts the GC engine because it suspects that
something is amiss. To address this issue, this commit ensures that the
GPU is reset (mode 2 reset) in the unload sequence.

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 860ac1f9e35d..80d00475bc9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3680,6 +3680,15 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 				"failed to release exclusive mode on fini\n");
 	}
 
+	/* Reset the device before entirely removing it to avoid load issues
+	 * caused by firmware validation.
+	 */
+	if ((adev->flags & AMD_IS_APU) && !adev->gmc.is_app_apu) {
+		r = amdgpu_asic_reset(adev);
+		if (r)
+			dev_err(adev->dev, "asic reset on %s failed\n", __func__);
+	}
+
 	return 0;
 }
 
-- 
2.51.0

* Re: [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
From: Christian König @ 2025-11-19 9:20 UTC
To: Rodrigo Siqueira, Alex Deucher, Mario Limonciello
Cc: Robert Beckett, amd-gfx, kernel-dev

On 11/19/25 01:22, Rodrigo Siqueira wrote:
> When trying to unload amdgpu in the SteamDeck (TTY mode), the following
> set of errors happens and the system gets unstable:
>
> [..]
> [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
> amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
> amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
> [..]
> amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> [..]
>
> When the driver initializes the GPU, the PSP validates all the firmware
> loaded, and after that, it is not possible to load any other firmware
> unless the device is reset. What is happening in the load/unload
> situation is that PSP halts the GC engine because it suspects that
> something is amiss. To address this issue, this commit ensures that the
> GPU is reset (mode 2 reset) in the unload sequence.

Mhm doing that on unload sounds like a bad idea to me.

We should rather do that on re-load to also cover the case of aborted VMs for example.

Regards,
Christian.

* Re: [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
From: Alex Deucher @ 2025-11-19 14:00 UTC
To: Christian König
Cc: Rodrigo Siqueira, Alex Deucher, Mario Limonciello, Robert Beckett, amd-gfx, kernel-dev

On Wed, Nov 19, 2025 at 4:29 AM Christian König <christian.koenig@amd.com> wrote:
>
> On 11/19/25 01:22, Rodrigo Siqueira wrote:
> > When trying to unload amdgpu in the SteamDeck (TTY mode), the following
> > set of errors happens and the system gets unstable:
> >
> > [..]
> > [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
> > amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
> > amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
> > [..]
> > amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> > amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> > amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> > [..]
> >
> > When the driver initializes the GPU, the PSP validates all the firmware
> > loaded, and after that, it is not possible to load any other firmware
> > unless the device is reset. What is happening in the load/unload
> > situation is that PSP halts the GC engine because it suspects that
> > something is amiss. To address this issue, this commit ensures that the
> > GPU is reset (mode 2 reset) in the unload sequence.
>
> Mhm doing that on unload sounds like a bad idea to me.
>
> We should rather do that on re-load to also cover the case of aborted VMs for example.

That's what we already do for dGPUs, but for APUs, there's not really
a good way to detect this case on startup. On dGPUs we check to see
if the PSP is running, on APUs the PSP is always running because it's
shared with the whole SoC. Always resetting on init is not desirable
as it adds latency and causes screen flicker.

Alex

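Alex's point can be illustrated with a toy model. The sketch below is not driver code: `struct mock_gpu`, `mock_psp_is_running()` and `mock_need_reset_on_init()` are hypothetical names, and the real load-time check in amdgpu is more involved. It only assumes what the reply states (the PSP is always running on an APU) and shows why a "PSP already running at probe time" heuristic can separate a fresh dGPU from a reloaded one but cannot do the same on an APU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device state; not the real struct amdgpu_device. */
struct mock_gpu {
	bool is_apu;            /* PSP is shared with the rest of the SoC */
	bool driver_was_loaded; /* a previous driver instance already brought the GPU up */
};

/*
 * Assumption taken from the thread: on an APU the PSP is always running,
 * whether or not amdgpu was loaded before. In this simplified model a dGPU's
 * PSP is only running if a previous driver instance started it.
 */
static bool mock_psp_is_running(const struct mock_gpu *gpu)
{
	assert(gpu != NULL);
	return gpu->is_apu || gpu->driver_was_loaded;
}

/* dGPU-style heuristic: a PSP that is already running at probe time means "reset first". */
static bool mock_need_reset_on_init(const struct mock_gpu *gpu)
{
	return mock_psp_is_running(gpu);
}
```

In this model a fresh APU and a reloaded APU give the same answer, so applying the heuristic to APUs would force a reset on every init, which is the latency and screen-flicker cost mentioned in the reply.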
* Re: [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
From: Christian König @ 2025-11-19 14:14 UTC
To: Alex Deucher
Cc: Rodrigo Siqueira, Alex Deucher, Mario Limonciello, Robert Beckett, amd-gfx, kernel-dev

On 11/19/25 15:00, Alex Deucher wrote:
> On Wed, Nov 19, 2025 at 4:29 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 11/19/25 01:22, Rodrigo Siqueira wrote:
>>> When trying to unload amdgpu in the SteamDeck (TTY mode), the following
>>> set of errors happens and the system gets unstable:
>>>
>>> [..]
>>> [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
>>> amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
>>> amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
>>> [..]
>>> amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
>>> amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
>>> amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
>>> amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
>>> [..]
>>>
>>> When the driver initializes the GPU, the PSP validates all the firmware
>>> loaded, and after that, it is not possible to load any other firmware
>>> unless the device is reset. What is happening in the load/unload
>>> situation is that PSP halts the GC engine because it suspects that
>>> something is amiss. To address this issue, this commit ensures that the
>>> GPU is reset (mode 2 reset) in the unload sequence.
>>
>> Mhm doing that on unload sounds like a bad idea to me.
>>
>> We should rather do that on re-load to also cover the case of aborted VMs for example.
>
> That's what we already do for dGPUs, but for APUs, there's not really
> a good way to detect this case on startup. On dGPUs we check to see
> if the PSP is running, on APUs the PSP is always running because it's
> shared with the whole SoC. Always resetting on init is not desirable
> as it adds latency and causes screen flicker.

Ah! Good point, we need the reasoning that the PSP is always running on APUs as code comment here.

With that done looks fine to me as well.

Regards,
Christian.

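One way the requested comment could read is sketched below. This is a self-contained mock, not a later revision of the patch: `struct mock_adev`, `MOCK_IS_APU`, `mock_asic_reset()` and `mock_ip_fini_early()` are hypothetical stand-ins for the driver's types and functions, and the comment wording is only a suggestion assembled from Alex's explanation in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MOCK_IS_APU (1UL << 0) /* hypothetical stand-in for AMD_IS_APU */

/* Hypothetical, minimal replacement for struct amdgpu_device. */
struct mock_adev {
	unsigned long flags;
	bool is_app_apu; /* stand-in for adev->gmc.is_app_apu */
	int reset_calls; /* counts resets so the behaviour can be observed */
};

/* Stand-in for amdgpu_asic_reset(); always succeeds in this mock. */
static int mock_asic_reset(struct mock_adev *adev)
{
	adev->reset_calls++;
	return 0;
}

static int mock_ip_fini_early(struct mock_adev *adev)
{
	int r;

	assert(adev != NULL);

	/*
	 * Reset the device before entirely removing it to avoid load issues
	 * caused by firmware validation.
	 *
	 * dGPUs handle this at load time by checking whether the PSP is
	 * already running. On APUs the PSP is shared with the whole SoC and
	 * is always running, so that check cannot tell whether a reset is
	 * needed, and resetting on every init would add latency and cause
	 * screen flicker. Reset on unload instead.
	 */
	if ((adev->flags & MOCK_IS_APU) && !adev->is_app_apu) {
		r = mock_asic_reset(adev);
		if (r)
			fprintf(stderr, "asic reset on %s failed\n", __func__);
	}

	return 0;
}
```

The condition mirrors the one in the patch: only APUs that are not app APUs are reset on unload, while dGPUs keep relying on their existing load-time check.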
* [PATCH v4 2/2] Revert "drm/amd: fix gfx hang on renoir in IGT reload test"
From: Rodrigo Siqueira @ 2025-11-19 0:22 UTC
To: Alex Deucher, Christian König, Mario Limonciello
Cc: Robert Beckett, amd-gfx, kernel-dev, Rodrigo Siqueira

The original patch introduced additional latency during boot time
because it triggers a driver reload to avoid a CP hang when the driver
is reloaded multiple times. This has been addressed with a more generic
solution that triggers the GPU reset only during the unload phase,
avoiding extra latency during boot time. For this reason, this commit
reverts the original change.

This reverts commit 72a98763b473890e6605604bfcaf71fc212b4720.

Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/soc15.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 9785fada4fa7..42f5d9c0e3af 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -853,10 +853,6 @@ static bool soc15_need_reset_on_init(struct amdgpu_device *adev)
 {
 	u32 sol_reg;
 
-	/* CP hangs in IGT reloading test on RN, reset to WA */
-	if (adev->asic_type == CHIP_RENOIR)
-		return true;
-
 	if (amdgpu_gmc_need_reset_on_init(adev))
 		return true;
 	if (amdgpu_psp_tos_reload_needed(adev))
-- 
2.51.0

Thread overview: 6+ messages
2025-11-19  0:22 [PATCH v4 0/2] Address amdgpu reload issues in APUs Rodrigo Siqueira
2025-11-19  0:22 ` [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded Rodrigo Siqueira
2025-11-19  9:20   ` Christian König
2025-11-19 14:00     ` Alex Deucher
2025-11-19 14:14       ` Christian König
2025-11-19  0:22 ` [PATCH v4 2/2] Revert "drm/amd: fix gfx hang on renoir in IGT reload test" Rodrigo Siqueira