AMD-GFX Archive on lore.kernel.org
* [PATCH v4 0/2] Address amdgpu reload issues in APUs
@ 2025-11-19  0:22 Rodrigo Siqueira
  2025-11-19  0:22 ` [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded Rodrigo Siqueira
  2025-11-19  0:22 ` [PATCH v4 2/2] Revert "drm/amd: fix gfx hang on renoir in IGT reload test" Rodrigo Siqueira
  0 siblings, 2 replies; 6+ messages in thread
From: Rodrigo Siqueira @ 2025-11-19  0:22 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Mario Limonciello
  Cc: Robert Beckett, amd-gfx, kernel-dev, Rodrigo Siqueira

This series addresses amdgpu reload failures on APUs. The first commit
adds a GPU reset at unload time, and the second commit removes a
Renoir-specific fix that the first patch makes obsolete.

Thanks
Siqueira

Rodrigo Siqueira (2):
  drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
  Revert "drm/amd: fix gfx hang on renoir in IGT reload test"

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
 drivers/gpu/drm/amd/amdgpu/soc15.c         | 4 ----
 2 files changed, 9 insertions(+), 4 deletions(-)

-- 
2.51.0



* [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
  2025-11-19  0:22 [PATCH v4 0/2] Address amdgpu reload issues in APUs Rodrigo Siqueira
@ 2025-11-19  0:22 ` Rodrigo Siqueira
  2025-11-19  9:20   ` Christian König
  2025-11-19  0:22 ` [PATCH v4 2/2] Revert "drm/amd: fix gfx hang on renoir in IGT reload test" Rodrigo Siqueira
  1 sibling, 1 reply; 6+ messages in thread
From: Rodrigo Siqueira @ 2025-11-19  0:22 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Mario Limonciello
  Cc: Robert Beckett, amd-gfx, kernel-dev, Rodrigo Siqueira

When trying to unload amdgpu on the SteamDeck (TTY mode), the following
errors occur and the system becomes unstable:

[..]
 [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
 amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
 amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
[..]
 amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
 amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
 amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
 amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
[..]

When the driver initializes the GPU, the PSP validates all the loaded
firmware, and after that, it is not possible to load any other firmware
unless the device is reset. In the load/unload case, the PSP halts the
GC engine because it suspects that something is amiss. To address this
issue, this commit ensures that the GPU is reset (mode 2 reset) in the
unload sequence.

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 860ac1f9e35d..80d00475bc9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3680,6 +3680,15 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 				"failed to release exclusive mode on fini\n");
 	}
 
+	/* Reset the device before entirely removing it to avoid load issues
+	 * caused by firmware validation.
+	 */
+	if ((adev->flags & AMD_IS_APU) && !adev->gmc.is_app_apu) {
+		r = amdgpu_asic_reset(adev);
+		if (r)
+			dev_err(adev->dev, "asic reset on %s failed\n", __func__);
+	}
+
 	return 0;
 }
 
-- 
2.51.0
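
A minimal standalone sketch of the unload-time check added by the hunk
above, for readers who want to exercise the condition outside the kernel.
Everything prefixed with sketch_ or SKETCH_ is a hypothetical stand-in
(the real flag, struct, and reset call live in the amdgpu driver); only
the shape of the predicate and the "log the failure, still return 0"
behavior are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for AMD_IS_APU; the real value is not assumed. */
#define SKETCH_IS_APU (1u << 0)

/* Hypothetical stand-in for the fields of struct amdgpu_device used here. */
struct sketch_device {
	unsigned int flags;  /* mirrors adev->flags */
	bool is_app_apu;     /* mirrors adev->gmc.is_app_apu */
	int reset_result;    /* value the mocked reset returns */
	int reset_calls;     /* how many times the mocked reset ran */
};

/* Mocked reset: records the call and returns the preset result. */
static int sketch_asic_reset(struct sketch_device *dev)
{
	dev->reset_calls++;
	return dev->reset_result;
}

/* Same predicate as the patch: APUs only, excluding app APUs. */
static bool sketch_needs_reset_on_unload(const struct sketch_device *dev)
{
	return (dev->flags & SKETCH_IS_APU) && !dev->is_app_apu;
}

/* Tail of the teardown path: a failed reset is logged, not propagated. */
static int sketch_fini_early(struct sketch_device *dev)
{
	if (sketch_needs_reset_on_unload(dev)) {
		int r = sketch_asic_reset(dev);

		if (r)
			fprintf(stderr, "asic reset on %s failed\n", __func__);
	}

	return 0;
}
```

Calling sketch_fini_early() on a device with SKETCH_IS_APU set and
is_app_apu clear runs the mocked reset once; a device without the flag,
or an app APU, skips it. In all cases the function returns 0, matching
the patch, where a reset failure does not abort teardown.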



* [PATCH v4 2/2] Revert "drm/amd: fix gfx hang on renoir in IGT reload test"
  2025-11-19  0:22 [PATCH v4 0/2] Address amdgpu reload issues in APUs Rodrigo Siqueira
  2025-11-19  0:22 ` [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded Rodrigo Siqueira
@ 2025-11-19  0:22 ` Rodrigo Siqueira
  1 sibling, 0 replies; 6+ messages in thread
From: Rodrigo Siqueira @ 2025-11-19  0:22 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Mario Limonciello
  Cc: Robert Beckett, amd-gfx, kernel-dev, Rodrigo Siqueira

The original patch introduced additional latency during boot time
because it forces a GPU reset on every init on Renoir to avoid a CP hang
when the driver is reloaded multiple times. This has been addressed with
a more generic solution that triggers the GPU reset only during the
unload phase, avoiding extra latency during boot time. For this reason,
this commit reverts the original change.

This reverts commit 72a98763b473890e6605604bfcaf71fc212b4720.

Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/soc15.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 9785fada4fa7..42f5d9c0e3af 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -853,10 +853,6 @@ static bool soc15_need_reset_on_init(struct amdgpu_device *adev)
 {
 	u32 sol_reg;
 
-	/* CP hangs in IGT reloading test on RN, reset to WA */
-	if (adev->asic_type == CHIP_RENOIR)
-		return true;
-
 	if (amdgpu_gmc_need_reset_on_init(adev))
 		return true;
 	if (amdgpu_psp_tos_reload_needed(adev))
-- 
2.51.0



* Re: [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
  2025-11-19  0:22 ` [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded Rodrigo Siqueira
@ 2025-11-19  9:20   ` Christian König
  2025-11-19 14:00     ` Alex Deucher
  0 siblings, 1 reply; 6+ messages in thread
From: Christian König @ 2025-11-19  9:20 UTC (permalink / raw)
  To: Rodrigo Siqueira, Alex Deucher, Mario Limonciello
  Cc: Robert Beckett, amd-gfx, kernel-dev



On 11/19/25 01:22, Rodrigo Siqueira wrote:
> When trying to unload amdgpu in the SteamDeck (TTY mode), the following
> set of errors happens and the system gets unstable:
> 
> [..]
>  [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
>  amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
>  amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
> [..]
>  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
>  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
>  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
>  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> [..]
> 
> When the driver initializes the GPU, the PSP validates all the firmware
> loaded, and after that, it is not possible to load any other firmware
> unless the device is reset. What is happening in the load/unload
> situation is that PSP halts the GC engine because it suspects that
> something is amiss. To address this issue, this commit ensures that the
> GPU is reset (mode 2 reset) in the unload sequence.

Mhm doing that on unload sounds like a bad idea to me.

We should rather do that on re-load to also cover the case of aborted VMs for example.

Regards,
Christian.

> 
> Suggested-by: Alex Deucher <alexander.deucher@amd.com>
> Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 860ac1f9e35d..80d00475bc9f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3680,6 +3680,15 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
>  				"failed to release exclusive mode on fini\n");
>  	}
>  
> +	/* Reset the device before entirely removing it to avoid load issues
> +	 * caused by firmware validation.
> +	 */
> +	if ((adev->flags & AMD_IS_APU) && !adev->gmc.is_app_apu) {
> +		r = amdgpu_asic_reset(adev);
> +		if (r)
> +			dev_err(adev->dev, "asic reset on %s failed\n", __func__);
> +	}
> +
>  	return 0;
>  }
>  



* Re: [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
  2025-11-19  9:20   ` Christian König
@ 2025-11-19 14:00     ` Alex Deucher
  2025-11-19 14:14       ` Christian König
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Deucher @ 2025-11-19 14:00 UTC (permalink / raw)
  To: Christian König
  Cc: Rodrigo Siqueira, Alex Deucher, Mario Limonciello, Robert Beckett,
	amd-gfx, kernel-dev

On Wed, Nov 19, 2025 at 4:29 AM Christian König
<christian.koenig@amd.com> wrote:
>
>
>
> On 11/19/25 01:22, Rodrigo Siqueira wrote:
> > When trying to unload amdgpu in the SteamDeck (TTY mode), the following
> > set of errors happens and the system gets unstable:
> >
> > [..]
> >  [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
> >  amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
> >  amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
> > [..]
> >  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> >  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> >  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
> >  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
> > [..]
> >
> > When the driver initializes the GPU, the PSP validates all the firmware
> > loaded, and after that, it is not possible to load any other firmware
> > unless the device is reset. What is happening in the load/unload
> > situation is that PSP halts the GC engine because it suspects that
> > something is amiss. To address this issue, this commit ensures that the
> > GPU is reset (mode 2 reset) in the unload sequence.
>
> Mhm doing that on unload sounds like a bad idea to me.
>
> We should rather do that on re-load to also cover the case of aborted VMs for example.

That's what we already do for dGPUs, but for APUs, there's not really
a good way to detect this case on startup.  On dGPUs we check to see
if the PSP is running, on APUs the PSP is always running because it's
shared with the whole SoC.  Always resetting on init is not desirable
as it adds latency and causes screen flicker.

Alex
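
The reasoning above can be sketched as a small standalone model. This is
an illustration under stated assumptions, not driver code: all sketch_
names are hypothetical, the "PSP running implies an earlier load" rule
for dGPUs is a simplification of the check described above, and the
app-APU exclusion from the patch is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_kind { SKETCH_DGPU, SKETCH_APU };

/* Hypothetical model of a device as seen at driver load time. */
struct sketch_gpu {
	enum sketch_kind kind;
	bool driver_loaded_before; /* ground truth the driver cannot read directly */
};

/*
 * The only signal available at init in this model. On an APU the PSP is
 * shared with the whole SoC, so it is always running and the signal
 * carries no information about a previous driver load.
 */
static bool sketch_psp_running(const struct sketch_gpu *gpu)
{
	if (gpu->kind == SKETCH_APU)
		return true;
	/* Simplifying assumption: on a dGPU, running means a prior load. */
	return gpu->driver_loaded_before;
}

/* Init-time reset: only meaningful where the PSP signal is informative. */
static bool sketch_reset_on_init(const struct sketch_gpu *gpu)
{
	return gpu->kind == SKETCH_DGPU && sketch_psp_running(gpu);
}

/* Unload-time reset: the fallback for APUs, avoiding a reset on every init. */
static bool sketch_reset_on_unload(const struct sketch_gpu *gpu)
{
	return gpu->kind == SKETCH_APU;
}
```

In this model a freshly booted APU and a previously used APU look
identical to sketch_psp_running(), which is why the reset is tied to
unload rather than to init for that case.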

>
> Regards,
> Christian.
>
> >
> > Suggested-by: Alex Deucher <alexander.deucher@amd.com>
> > Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 860ac1f9e35d..80d00475bc9f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3680,6 +3680,15 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
> >                               "failed to release exclusive mode on fini\n");
> >       }
> >
> > +     /* Reset the device before entirely removing it to avoid load issues
> > +      * caused by firmware validation.
> > +      */
> > +     if ((adev->flags & AMD_IS_APU) && !adev->gmc.is_app_apu) {
> > +             r = amdgpu_asic_reset(adev);
> > +             if (r)
> > +                     dev_err(adev->dev, "asic reset on %s failed\n", __func__);
> > +     }
> > +
> >       return 0;
> >  }
> >
>


* Re: [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded
  2025-11-19 14:00     ` Alex Deucher
@ 2025-11-19 14:14       ` Christian König
  0 siblings, 0 replies; 6+ messages in thread
From: Christian König @ 2025-11-19 14:14 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rodrigo Siqueira, Alex Deucher, Mario Limonciello, Robert Beckett,
	amd-gfx, kernel-dev



On 11/19/25 15:00, Alex Deucher wrote:
> On Wed, Nov 19, 2025 at 4:29 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>>
>>
>> On 11/19/25 01:22, Rodrigo Siqueira wrote:
>>> When trying to unload amdgpu in the SteamDeck (TTY mode), the following
>>> set of errors happens and the system gets unstable:
>>>
>>> [..]
>>>  [drm] Initialized amdgpu 3.64.0 for 0000:04:00.0 on minor 0
>>>  amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
>>>  amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
>>> [..]
>>>  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
>>>  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
>>>  amdgpu 0000:04:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001E SMN_C2PMSG_82:0x00000000
>>>  amdgpu 0000:04:00.0: amdgpu: Failed to disable gfxoff!
>>> [..]
>>>
>>> When the driver initializes the GPU, the PSP validates all the firmware
>>> loaded, and after that, it is not possible to load any other firmware
>>> unless the device is reset. What is happening in the load/unload
>>> situation is that PSP halts the GC engine because it suspects that
>>> something is amiss. To address this issue, this commit ensures that the
>>> GPU is reset (mode 2 reset) in the unload sequence.
>>
>> Mhm doing that on unload sounds like a bad idea to me.
>>
>> We should rather do that on re-load to also cover the case of aborted VMs for example.
> 
> That's what we already do for dGPUs, but for APUs, there's not really
> a good way to detect this case on startup.  On dGPUs we check to see
> if the PSP is running, on APUs the PSP is always running because it's
> shared with the whole SoC.  Always resetting on init is not desirable
> as it adds latency and causes screen flicker.

Ah! Good point, we need the reasoning that the PSP is always running on APUs as a code comment here.

With that done looks fine to me as well.

Regards,
Christian.

> 
> Alex
> 
>>
>> Regards,
>> Christian.
>>
>>>
>>> Suggested-by: Alex Deucher <alexander.deucher@amd.com>
>>> Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 860ac1f9e35d..80d00475bc9f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3680,6 +3680,15 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
>>>                               "failed to release exclusive mode on fini\n");
>>>       }
>>>
>>> +     /* Reset the device before entirely removing it to avoid load issues
>>> +      * caused by firmware validation.
>>> +      */
>>> +     if ((adev->flags & AMD_IS_APU) && !adev->gmc.is_app_apu) {
>>> +             r = amdgpu_asic_reset(adev);
>>> +             if (r)
>>> +                     dev_err(adev->dev, "asic reset on %s failed\n", __func__);
>>> +     }
>>> +
>>>       return 0;
>>>  }
>>>
>>



end of thread, other threads:[~2025-11-19 14:14 UTC | newest]

Thread overview: 6+ messages
2025-11-19  0:22 [PATCH v4 0/2] Address amdgpu reload issues in APUs Rodrigo Siqueira
2025-11-19  0:22 ` [PATCH v4 1/2] drm/amdgpu: Fix GFX hang on SteamDeck when amdgpu is reloaded Rodrigo Siqueira
2025-11-19  9:20   ` Christian König
2025-11-19 14:00     ` Alex Deucher
2025-11-19 14:14       ` Christian König
2025-11-19  0:22 ` [PATCH v4 2/2] Revert "drm/amd: fix gfx hang on renoir in IGT reload test" Rodrigo Siqueira
