[PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2

* [PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
@ 2026-06-10  5:57 Jiqian Chen
  2026-06-10  6:59 ` Huang Rui
  2026-06-10  7:50 ` Christian König
  0 siblings, 2 replies; 6+ messages in thread
From: Jiqian Chen @ 2026-06-10  5:57 UTC (permalink / raw)
  To: Alex Deucher, Christian König
  Cc: amd-gfx, Huang Rui, Huang Trigger, Jiqian Chen

On Renoir APUs with gfx9, some test scenarios that run with ring_reset
disabled, such as accessing an unmapped invalid address, can trigger a
GPU job timeout. The driver then uses Mode2 reset to reset the GPU, but
after Mode2 the CPC and CPF are still stuck, which causes the compute
ring tests to fail. In addition, the HQDs of the MECs are still active,
so the MECs use stale HQDs if they are unhalted before the driver
restores the MQDs, which causes the compute IB tests to fail.

So add a sequence to reset CPC and CPF after Mode2, and deactivate the
HQDs of the MECs before unhalting the MECs and mapping the compute
queues.

Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
---
Hi all,

My board is a Renoir APU with gfx9 and smu12. I run a test case that
accesses an invalid address to trigger amdgpu_job_timedout() with
ring_reset disabled, so that the driver calls mode2 reset directly.
After mode2 reset, the compute ring tests and compute IB tests fail
randomly, on a random compute ring.
We checked the scan dump of the GPU and can see that the CPC and CPF
are still stuck, which may cause the compute ring tests to fail.
I added debug prints to the driver (gfx_v9_0_cp_resume) and found that
the HQDs of the MECs are still active, which may cause the MECs to use
stale HQDs when they are unhalted before the compute queues are mapped
(MQDs restored to HQDs).
So I am sending this patch to fix the problems above.
The patch makes two main changes:
One is to reset CPC and CPF before resuming KCQ.
The other is to deactivate the HQDs before unhalting the MECs.
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 40 ++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 47721d0c3781..dc0978bc312c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3944,7 +3944,8 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
 
 static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
 {
-	int r, i;
+	u32 tmp;
+	int r, i, j, k;
 	struct amdgpu_ring *ring;
 
 	if (!(adev->flags & AMD_IS_APU))
@@ -3967,6 +3968,43 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
 		gfx_v9_0_cp_gfx_enable(adev, false);
 	gfx_v9_0_cp_compute_enable(adev, false);
 
+	if ((adev->flags & AMD_IS_APU) &&
+		(adev->apu_flags & AMD_APU_IS_RENOIR) && amdgpu_in_reset(adev)) {
+		/*
+		 * CPC and CPF are still stuck after Mode2 reset, which makes the
+		 * later compute ring test fail and Mode2 reset loop indefinitely.
+		 */
+		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
+		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
+		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
+		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+		udelay(50);
+
+		tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
+				GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
+		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
+		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+		udelay(50);
+
+		/*
+		 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD so
+		 * that the MEC does not use a stale HQD if it is unhalted before the
+		 * MQD is restored. Otherwise, the later compute IB test may fail.
+		 */
+		for (i = 0; i < adev->gfx.mec.num_mec; i++) {
+			for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
+				for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
+					mutex_lock(&adev->srbm_mutex);
+					soc15_grbm_select(adev, i + 1, j, k, 0, 0);
+					WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
+					soc15_grbm_select(adev, 0, 0, 0, 0, 0);
+					mutex_unlock(&adev->srbm_mutex);
+				}
+			}
+		}
+	}
+
 	r = gfx_v9_0_kiq_resume(adev);
 	if (r)
 		return r;
-- 
2.39.5
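
The CPC/CPF part of the patch is an assert-then-deassert pulse on two bits of GRBM_SOFT_RESET, with a read-back after each write. The following is only a userspace model of that ordering, not driver code: `struct demo_gpu`, the `demo_*` helpers, the bit positions, and the "stuck" flags are all hypothetical stand-ins, and the model assumes a block recovers when its reset bit falls from 1 to 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for the demo only. The real layout of
 * GRBM_SOFT_RESET comes from the amdgpu register headers. */
#define DEMO_SOFT_RESET_CPC_MASK (1u << 0)
#define DEMO_SOFT_RESET_CPF_MASK (1u << 1)
#define DEMO_CP_RESET_MASK \
	(DEMO_SOFT_RESET_CPC_MASK | DEMO_SOFT_RESET_CPF_MASK)

/* Mock device: one register plus two flags standing in for hardware state. */
struct demo_gpu {
	uint32_t grbm_soft_reset;
	int cpc_stuck;
	int cpf_stuck;
};

static uint32_t demo_rreg(const struct demo_gpu *gpu)
{
	return gpu->grbm_soft_reset;
}

/* Model assumption: a block leaves its stuck state when its reset bit
 * goes from 1 to 0, i.e. when the reset is released. */
static void demo_wreg(struct demo_gpu *gpu, uint32_t val)
{
	uint32_t falling = gpu->grbm_soft_reset & ~val;

	if (falling & DEMO_SOFT_RESET_CPC_MASK)
		gpu->cpc_stuck = 0;
	if (falling & DEMO_SOFT_RESET_CPF_MASK)
		gpu->cpf_stuck = 0;
	gpu->grbm_soft_reset = val;
}

/* Same ordering as the patch: set both bits, write, read back, wait,
 * clear both bits, write, read back, wait. Other bits are preserved. */
static void demo_reset_cpc_cpf(struct demo_gpu *gpu)
{
	uint32_t tmp;

	assert(gpu != NULL);

	tmp = demo_rreg(gpu);
	tmp |= DEMO_CP_RESET_MASK;
	demo_wreg(gpu, tmp);
	tmp = demo_rreg(gpu);
	/* the patch waits with udelay(50) here */

	tmp &= ~DEMO_CP_RESET_MASK;
	demo_wreg(gpu, tmp);
	(void)demo_rreg(gpu);
	/* the patch waits with udelay(50) here as well */
}
```

Calling `demo_reset_cpc_cpf()` on a mock whose register already has unrelated bits set leaves those bits untouched, which is the reason for the read-modify-write instead of writing a constant.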



* Re: [PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-10  5:57 [PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2 Jiqian Chen
@ 2026-06-10  6:59 ` Huang Rui
  2026-06-10  7:58   ` Chen, Jiqian
  2026-06-10  7:50 ` Christian König
  1 sibling, 1 reply; 6+ messages in thread
From: Huang Rui @ 2026-06-10  6:59 UTC (permalink / raw)
  To: Jiqian Chen; +Cc: Alex Deucher, Christian König, amd-gfx, Huang Trigger

On Wed, Jun 10, 2026 at 01:57:36PM +0800, Jiqian Chen wrote:
> For Renoir APU with gfx9, in some test scenarios with disabling
> ring_reset, like accessing an unmapped invalid address, it can
> trigger a gpu job timeout event, then driver uses Mode2 reset
> to reset GPU, but after Mode2, the CPC and CPF are still stuck,
> that causes compute Ring tests fail. What's more, the HQDs of
> MECs are still active, that causes MECs use stale HQDs when MECs
> are unhalted before driver restore MQDs, then causes compute IB
> tests fail.
> 
> So, add sequences to reset CPC and CPF after Mode2, and de-active
> HQDs of MECs before unhalting MECs and mapping compute queues.
> 
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> ---
> Hi all,
> 
> My board is Renoir APU with gfx9, smu12. I run a testcase that
> accesses an invalid address to trigger a amdgpu_job_timedout()
> with disabling ring_reset, so that driver will call mode2 reset
> directly. After mode2 reset I found compute Ring tests and compute
> IB tests fail randomly on random compute ring.
> We checked the scan dump of GPU, we can see the CPC and CPF are
> still stuck, that may cause Compute Ring tests fail.
> I added printings in driver codes (gfx_v9_0_cp_resume), and found
> the HQDs of MECs are still active, that may cause MECs use stale
> HQDs when MECs are unhalted before mapping compute queues (restore
> MQDs to HQDs).
> So, I send this patch to fix above problems.
> There are two main changes of my patches:
> One is to reset CPC and CPF before resuming KCQ.
> Another is to disable HQDs before unhalting MECs.
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 40 ++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 47721d0c3781..dc0978bc312c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3944,7 +3944,8 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>  
>  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  {
> -	int r, i;
> +	u32 tmp;
> +	int r, i, j, k;
>  	struct amdgpu_ring *ring;
>  
>  	if (!(adev->flags & AMD_IS_APU))
> @@ -3967,6 +3968,43 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  		gfx_v9_0_cp_gfx_enable(adev, false);
>  	gfx_v9_0_cp_compute_enable(adev, false);
>  
> +	if ((adev->flags & AMD_IS_APU) &&
> +		(adev->apu_flags & AMD_APU_IS_RENOIR) && amdgpu_in_reset(adev)) {

This should not be limited to Renoir; I think it should apply to all
gfx9-based APUs such as Raven, Picasso, etc.

Could you use AMD_RESET_METHOD_MODE2 from "enum amd_reset_method" as the
check condition? This is an issue with mode2 reset.
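
A minimal sketch of the condition suggested here, written as a standalone model and not as driver code. The thread does not show how gfx_v9_0_cp_resume() would learn which reset method is in progress, so `struct demo_dev`, its fields, and `enum demo_reset_method` are hypothetical stand-ins for `adev->flags & AMD_IS_APU`, `amdgpu_in_reset(adev)`, and `enum amd_reset_method`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the reset methods relevant to this discussion. */
enum demo_reset_method {
	DEMO_RESET_METHOD_NONE,
	DEMO_RESET_METHOD_MODE1,
	DEMO_RESET_METHOD_MODE2,
};

/* Hypothetical device state; the real driver keeps this in amdgpu_device. */
struct demo_dev {
	bool is_apu;                         /* stands in for AMD_IS_APU */
	bool in_reset;                       /* stands in for amdgpu_in_reset() */
	enum demo_reset_method reset_method; /* method used for this reset */
};

/* Key the workaround on "APU recovering from a mode2 reset" instead of on
 * one specific APU, so Raven, Picasso and Renoir all take the same path. */
static bool demo_needs_post_mode2_fixup(const struct demo_dev *dev)
{
	assert(dev != NULL);

	return dev->is_apu && dev->in_reset &&
	       dev->reset_method == DEMO_RESET_METHOD_MODE2;
}
```

With this shape, the per-ASIC `AMD_APU_IS_RENOIR` test from the patch is no longer part of the decision; only the reset method is.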

> +		/*
> +		 * CPC and CPF are still stuck after Mode2 reset, that causes later
> +		 * compute ring test fail and then loop Mode2 reset infinitely
> +		 */
> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
> +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
> +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +		udelay(50);
> +
> +		tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
> +				GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
> +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +		udelay(50);

It would be better to move the register programming, i.e. the CPC/CPF
reset and the HQD_ACTIVE clearing below, into dedicated functions.
gfx_v9_0_cp_resume() is a high-level function.
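
As an illustration of that split, here is a hedged userspace sketch of the HQD clearing as its own helper. Nothing in it is the driver's API: `struct demo_gfx`, the `DEMO_*` sizes, and the `demo_*` functions are hypothetical, and the mock replaces `soc15_grbm_select()` and the `CP_HQD_ACTIVE` write with array indexing. The patch also takes `srbm_mutex` around each select/write/deselect step, which this model leaves out.

```c
#include <assert.h>

/* Hypothetical topology for the demo; the driver reads these from adev. */
#define DEMO_NUM_MEC 2
#define DEMO_NUM_PIPE_PER_MEC 4
#define DEMO_NUM_QUEUE_PER_PIPE 8

/* Mock of the banked register: the current selection picks one slot.
 * Index 0 of the first dimension is never selected by the helper below,
 * matching the "i + 1" offset in the patch. */
struct demo_gfx {
	unsigned int me, pipe, queue;
	int hqd_active[DEMO_NUM_MEC + 1][DEMO_NUM_PIPE_PER_MEC]
		      [DEMO_NUM_QUEUE_PER_PIPE];
};

static void demo_grbm_select(struct demo_gfx *gfx, unsigned int me,
			     unsigned int pipe, unsigned int queue)
{
	assert(me <= DEMO_NUM_MEC);
	assert(pipe < DEMO_NUM_PIPE_PER_MEC);
	assert(queue < DEMO_NUM_QUEUE_PER_PIPE);
	gfx->me = me;
	gfx->pipe = pipe;
	gfx->queue = queue;
}

/* Writes the slot addressed by the current selection. */
static void demo_write_hqd_active(struct demo_gfx *gfx, int val)
{
	gfx->hqd_active[gfx->me][gfx->pipe][gfx->queue] = val;
}

/* Fills every slot, including index 0, with the same value. */
static void demo_set_all(struct demo_gfx *gfx, int val)
{
	unsigned int me, p, q;

	for (me = 0; me <= DEMO_NUM_MEC; me++)
		for (p = 0; p < DEMO_NUM_PIPE_PER_MEC; p++)
			for (q = 0; q < DEMO_NUM_QUEUE_PER_PIPE; q++)
				gfx->hqd_active[me][p][q] = val;
	demo_grbm_select(gfx, 0, 0, 0);
}

/* Counts the slots of one ME index that are still marked active. */
static int demo_count_active(const struct demo_gfx *gfx, unsigned int me)
{
	unsigned int p, q;
	int n = 0;

	for (p = 0; p < DEMO_NUM_PIPE_PER_MEC; p++)
		for (q = 0; q < DEMO_NUM_QUEUE_PER_PIPE; q++)
			n += gfx->hqd_active[me][p][q] != 0;
	return n;
}

/* The dedicated helper: select each MEC queue, clear it, deselect. */
static void demo_deactivate_mec_hqds(struct demo_gfx *gfx)
{
	unsigned int i, j, k;

	for (i = 0; i < DEMO_NUM_MEC; i++) {
		for (j = 0; j < DEMO_NUM_PIPE_PER_MEC; j++) {
			for (k = 0; k < DEMO_NUM_QUEUE_PER_PIPE; k++) {
				demo_grbm_select(gfx, i + 1, j, k);
				demo_write_hqd_active(gfx, 0);
				demo_grbm_select(gfx, 0, 0, 0);
			}
		}
	}
}
```

In the driver, gfx_v9_0_cp_resume() would then only call such a helper (and a matching one for the CPC/CPF reset) under the reset condition, keeping the register-level detail out of the high-level resume path.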

Thanks,
Ray

> +
> +		/*
> +		 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
> +		 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
> +		 * Otherwise, later compute IB test may fail
> +		 */
> +		for (i = 0; i < adev->gfx.mec.num_mec; i++) {
> +			for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
> +				for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
> +					mutex_lock(&adev->srbm_mutex);
> +					soc15_grbm_select(adev, i + 1, j, k, 0, 0);
> +					WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
> +					soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> +					mutex_unlock(&adev->srbm_mutex);
> +				}
> +			}
> +		}
> +	}
> +
>  	r = gfx_v9_0_kiq_resume(adev);
>  	if (r)
>  		return r;
> -- 
> 2.39.5
> 


* Re: [PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-10  6:59 ` Huang Rui
@ 2026-06-10  7:58   ` Chen, Jiqian
  0 siblings, 0 replies; 6+ messages in thread
From: Chen, Jiqian @ 2026-06-10  7:58 UTC (permalink / raw)
  To: Huang, Ray
  Cc: Deucher, Alexander, Koenig, Christian,
	amd-gfx@lists.freedesktop.org, Huang, Trigger

On 6/10/26 14:59, Huang Rui wrote:
> On Wed, Jun 10, 2026 at 01:57:36PM +0800, Jiqian Chen wrote:
>> For Renoir APU with gfx9, in some test scenarios with disabling
>> ring_reset, like accessing an unmapped invalid address, it can
>> trigger a gpu job timeout event, then driver uses Mode2 reset
>> to reset GPU, but after Mode2, the CPC and CPF are still stuck,
>> that causes compute Ring tests fail. What's more, the HQDs of
>> MECs are still active, that causes MECs use stale HQDs when MECs
>> are unhalted before driver restore MQDs, then causes compute IB
>> tests fail.
>>
>> So, add sequences to reset CPC and CPF after Mode2, and de-active
>> HQDs of MECs before unhalting MECs and mapping compute queues.
>>
>> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
>> ---
>> Hi all,
>>
>> My board is Renoir APU with gfx9, smu12. I run a testcase that
>> accesses an invalid address to trigger a amdgpu_job_timedout()
>> with disabling ring_reset, so that driver will call mode2 reset
>> directly. After mode2 reset I found compute Ring tests and compute
>> IB tests fail randomly on random compute ring.
>> We checked the scan dump of GPU, we can see the CPC and CPF are
>> still stuck, that may cause Compute Ring tests fail.
>> I added printings in driver codes (gfx_v9_0_cp_resume), and found
>> the HQDs of MECs are still active, that may cause MECs use stale
>> HQDs when MECs are unhalted before mapping compute queues (restore
>> MQDs to HQDs).
>> So, I send this patch to fix above problems.
>> There are two main changes of my patches:
>> One is to reset CPC and CPF before resuming KCQ.
>> Another is to disable HQDs before unhalting MECs.
>> ---
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 40 ++++++++++++++++++++++++++-
>>  1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> index 47721d0c3781..dc0978bc312c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> @@ -3944,7 +3944,8 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>>  
>>  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>>  {
>> -	int r, i;
>> +	u32 tmp;
>> +	int r, i, j, k;
>>  	struct amdgpu_ring *ring;
>>  
>>  	if (!(adev->flags & AMD_IS_APU))
>> @@ -3967,6 +3968,43 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>>  		gfx_v9_0_cp_gfx_enable(adev, false);
>>  	gfx_v9_0_cp_compute_enable(adev, false);
>>  
>> +	if ((adev->flags & AMD_IS_APU) &&
>> +		(adev->apu_flags & AMD_APU_IS_RENOIR) && amdgpu_in_reset(adev)) {
> 
> It should be not only for Renoir, and I think it should be also for all
> gfx9 based APU such as Raven, Picasso, etc.
> 
> Could you use AMD_RESET_METHOD_MODE2 of "enum amd_reset_method" as the check
> condition? Because it is the issue of mode2 reset.
Thanks, I will address both suggestions in the next version.

> 
>> +		/*
>> +		 * CPC and CPF are still stuck after Mode2 reset, that causes later
>> +		 * compute ring test fail and then loop Mode2 reset infinitely
>> +		 */
>> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
>> +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
>> +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
>> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +		udelay(50);
>> +
>> +		tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
>> +				GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
>> +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
>> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +		udelay(50);
> 
> It would be better to use a specific function to implement the register
> programming like clearing CPC/CPF and also HQD_ACTIVE below.
> gfx_v9_0_cp_resume() is high level function.
> 
> Thanks,
> Ray
> 
>> +
>> +		/*
>> +		 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
>> +		 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
>> +		 * Otherwise, later compute IB test may fail
>> +		 */
>> +		for (i = 0; i < adev->gfx.mec.num_mec; i++) {
>> +			for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
>> +				for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
>> +					mutex_lock(&adev->srbm_mutex);
>> +					soc15_grbm_select(adev, i + 1, j, k, 0, 0);
>> +					WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
>> +					soc15_grbm_select(adev, 0, 0, 0, 0, 0);
>> +					mutex_unlock(&adev->srbm_mutex);
>> +				}
>> +			}
>> +		}
>> +	}
>> +
>>  	r = gfx_v9_0_kiq_resume(adev);
>>  	if (r)
>>  		return r;
>> -- 
>> 2.39.5
>>

-- 
Best regards,
Jiqian Chen.



* Re: [PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-10  5:57 [PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2 Jiqian Chen
  2026-06-10  6:59 ` Huang Rui
@ 2026-06-10  7:50 ` Christian König
  2026-06-10  8:01   ` Chen, Jiqian
  2026-06-10 13:42   ` Huang Rui
  1 sibling, 2 replies; 6+ messages in thread
From: Christian König @ 2026-06-10  7:50 UTC (permalink / raw)
  To: Jiqian Chen, Alex Deucher
  Cc: amd-gfx, Huang Rui, Huang Trigger, Timur Kristóf,
	Samuel Pitoiset, Tvrtko Ursulin

On 6/10/26 07:57, Jiqian Chen wrote:
> For Renoir APU with gfx9, in some test scenarios with disabling
> ring_reset, like accessing an unmapped invalid address, it can
> trigger a gpu job timeout event, then driver uses Mode2 reset
> to reset GPU, but after Mode2, the CPC and CPF are still stuck,
> that causes compute Ring tests fail. What's more, the HQDs of
> MECs are still active, that causes MECs use stale HQDs when MECs
> are unhalted before driver restore MQDs, then causes compute IB
> tests fail.
> 
> So, add sequences to reset CPC and CPF after Mode2, and de-active
> HQDs of MECs before unhalting MECs and mapping compute queues.
> 
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> ---
> Hi all,
> 
> My board is Renoir APU with gfx9, smu12. I run a testcase that
> accesses an invalid address to trigger a amdgpu_job_timedout()
> with disabling ring_reset, so that driver will call mode2 reset
> directly. After mode2 reset I found compute Ring tests and compute
> IB tests fail randomly on random compute ring.

Oh! It's really nice to see that.

We had quite a number of bug reports on this issue, but were never able to reproduce it reliably.

IIRC some Valve engineers ran into that as well, adding a few people on CC.

I can't judge if the proposed fix is technically correct, but it's good to see that there is some progress on this issue.

Thanks,
Christian.

> We checked the scan dump of GPU, we can see the CPC and CPF are
> still stuck, that may cause Compute Ring tests fail.
> I added printings in driver codes (gfx_v9_0_cp_resume), and found
> the HQDs of MECs are still active, that may cause MECs use stale
> HQDs when MECs are unhalted before mapping compute queues (restore
> MQDs to HQDs).
> So, I send this patch to fix above problems.
> There are two main changes of my patches:
> One is to reset CPC and CPF before resuming KCQ.
> Another is to disable HQDs before unhalting MECs.
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 40 ++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 47721d0c3781..dc0978bc312c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3944,7 +3944,8 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>  
>  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  {
> -	int r, i;
> +	u32 tmp;
> +	int r, i, j, k;
>  	struct amdgpu_ring *ring;
>  
>  	if (!(adev->flags & AMD_IS_APU))
> @@ -3967,6 +3968,43 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  		gfx_v9_0_cp_gfx_enable(adev, false);
>  	gfx_v9_0_cp_compute_enable(adev, false);
>  
> +	if ((adev->flags & AMD_IS_APU) &&
> +		(adev->apu_flags & AMD_APU_IS_RENOIR) && amdgpu_in_reset(adev)) {
> +		/*
> +		 * CPC and CPF are still stuck after Mode2 reset, that causes later
> +		 * compute ring test fail and then loop Mode2 reset infinitely
> +		 */
> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
> +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
> +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +		udelay(50);
> +
> +		tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
> +				GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
> +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +		udelay(50);
> +
> +		/*
> +		 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
> +		 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
> +		 * Otherwise, later compute IB test may fail
> +		 */
> +		for (i = 0; i < adev->gfx.mec.num_mec; i++) {
> +			for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
> +				for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
> +					mutex_lock(&adev->srbm_mutex);
> +					soc15_grbm_select(adev, i + 1, j, k, 0, 0);
> +					WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
> +					soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> +					mutex_unlock(&adev->srbm_mutex);
> +				}
> +			}
> +		}
> +	}
> +
>  	r = gfx_v9_0_kiq_resume(adev);
>  	if (r)
>  		return r;



* Re: [PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-10  7:50 ` Christian König
@ 2026-06-10  8:01   ` Chen, Jiqian
  2026-06-10 13:42   ` Huang Rui
  1 sibling, 0 replies; 6+ messages in thread
From: Chen, Jiqian @ 2026-06-10  8:01 UTC (permalink / raw)
  To: Koenig, Christian, Deucher, Alexander
  Cc: amd-gfx@lists.freedesktop.org, Huang, Ray, Huang, Trigger,
	Timur Kristóf, Samuel Pitoiset, Tvrtko Ursulin

On 6/10/26 15:50, Christian König wrote:
> On 6/10/26 07:57, Jiqian Chen wrote:
>> For Renoir APU with gfx9, in some test scenarios with disabling
>> ring_reset, like accessing an unmapped invalid address, it can
>> trigger a gpu job timeout event, then driver uses Mode2 reset
>> to reset GPU, but after Mode2, the CPC and CPF are still stuck,
>> that causes compute Ring tests fail. What's more, the HQDs of
>> MECs are still active, that causes MECs use stale HQDs when MECs
>> are unhalted before driver restore MQDs, then causes compute IB
>> tests fail.
>>
>> So, add sequences to reset CPC and CPF after Mode2, and de-active
>> HQDs of MECs before unhalting MECs and mapping compute queues.
>>
>> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
>> ---
>> Hi all,
>>
>> My board is Renoir APU with gfx9, smu12. I run a testcase that
>> accesses an invalid address to trigger a amdgpu_job_timedout()
>> with disabling ring_reset, so that driver will call mode2 reset
>> directly. After mode2 reset I found compute Ring tests and compute
>> IB tests fail randomly on random compute ring.
> 
> Oh! It's really nice to see that.
> 
> We had quite a number of bug reports on this issue, but were never able to reproduce it reliably.
I use one test case from amdgpu_test, with our own changes, to reproduce this issue.
If anyone needs the binary, I can share it.

> 
> IIRC some Valve engineers ran into that as well, adding a few people on CC.
> 
> I can't judge if the proposed fix is technically correct, but it's good to see that there is some progress on this issue.
Thank you!

> 
> Thanks,
> Christian.
> 
>> We checked the scan dump of GPU, we can see the CPC and CPF are
>> still stuck, that may cause Compute Ring tests fail.
>> I added printings in driver codes (gfx_v9_0_cp_resume), and found
>> the HQDs of MECs are still active, that may cause MECs use stale
>> HQDs when MECs are unhalted before mapping compute queues (restore
>> MQDs to HQDs).
>> So, I send this patch to fix above problems.
>> There are two main changes of my patches:
>> One is to reset CPC and CPF before resuming KCQ.
>> Another is to disable HQDs before unhalting MECs.
>> ---
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 40 ++++++++++++++++++++++++++-
>>  1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> index 47721d0c3781..dc0978bc312c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> @@ -3944,7 +3944,8 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>>  
>>  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>>  {
>> -	int r, i;
>> +	u32 tmp;
>> +	int r, i, j, k;
>>  	struct amdgpu_ring *ring;
>>  
>>  	if (!(adev->flags & AMD_IS_APU))
>> @@ -3967,6 +3968,43 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>>  		gfx_v9_0_cp_gfx_enable(adev, false);
>>  	gfx_v9_0_cp_compute_enable(adev, false);
>>  
>> +	if ((adev->flags & AMD_IS_APU) &&
>> +		(adev->apu_flags & AMD_APU_IS_RENOIR) && amdgpu_in_reset(adev)) {
>> +		/*
>> +		 * CPC and CPF are still stuck after Mode2 reset, that causes later
>> +		 * compute ring test fail and then loop Mode2 reset infinitely
>> +		 */
>> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
>> +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
>> +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
>> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +		udelay(50);
>> +
>> +		tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
>> +				GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
>> +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
>> +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +		udelay(50);
>> +
>> +		/*
>> +		 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
>> +		 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
>> +		 * Otherwise, later compute IB test may fail
>> +		 */
>> +		for (i = 0; i < adev->gfx.mec.num_mec; i++) {
>> +			for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
>> +				for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
>> +					mutex_lock(&adev->srbm_mutex);
>> +					soc15_grbm_select(adev, i + 1, j, k, 0, 0);
>> +					WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
>> +					soc15_grbm_select(adev, 0, 0, 0, 0, 0);
>> +					mutex_unlock(&adev->srbm_mutex);
>> +				}
>> +			}
>> +		}
>> +	}
>> +
>>  	r = gfx_v9_0_kiq_resume(adev);
>>  	if (r)
>>  		return r;
> 

-- 
Best regards,
Jiqian Chen.



* Re: [PATCH 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-10  7:50 ` Christian König
  2026-06-10  8:01   ` Chen, Jiqian
@ 2026-06-10 13:42   ` Huang Rui
  1 sibling, 0 replies; 6+ messages in thread
From: Huang Rui @ 2026-06-10 13:42 UTC (permalink / raw)
  To: Christian König
  Cc: Jiqian Chen, Alex Deucher, amd-gfx, Huang Trigger,
	Timur Kristóf, Samuel Pitoiset, Tvrtko Ursulin

On Wed, Jun 10, 2026 at 09:50:18AM +0200, Christian König wrote:
> On 6/10/26 07:57, Jiqian Chen wrote:
> > For Renoir APU with gfx9, in some test scenarios with disabling
> > ring_reset, like accessing an unmapped invalid address, it can
> > trigger a gpu job timeout event, then driver uses Mode2 reset
> > to reset GPU, but after Mode2, the CPC and CPF are still stuck,
> > that causes compute Ring tests fail. What's more, the HQDs of
> > MECs are still active, that causes MECs use stale HQDs when MECs
> > are unhalted before driver restore MQDs, then causes compute IB
> > tests fail.
> > 
> > So, add sequences to reset CPC and CPF after Mode2, and de-active
> > HQDs of MECs before unhalting MECs and mapping compute queues.
> > 
> > Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> > ---
> > Hi all,
> > 
> > My board is Renoir APU with gfx9, smu12. I run a testcase that
> > accesses an invalid address to trigger a amdgpu_job_timedout()
> > with disabling ring_reset, so that driver will call mode2 reset
> > directly. After mode2 reset I found compute Ring tests and compute
> > IB tests fail randomly on random compute ring.
> 
> Oh! It's really nice to see that.
> 
> We had quite a number of bug reports on this issue, but were never able to reproduce it reliably.
> 
> IIRC some Valve engineers ran into that as well, adding a few people on CC.
> 
> I can't judge if the proposed fix is technically correct, but it's good to see that there is some progress on this issue.

Thank you for the recognition. Jiqian is currently working with the
hardware designer to investigate this issue. Although the issue manifests
as a random loss of the EOP interrupt, once it occurs, the mode2 reset can
be repeatedly triggered by the compute IB test loop, causing the mode2
reset to never complete and the driver to become stuck. After applying this
patch, we are now able to pass hundreds of iterations of the mode2 reset
stress test.

Thanks,
Ray

> 
> Thanks,
> Christian.
> 
> > We checked the scan dump of GPU, we can see the CPC and CPF are
> > still stuck, that may cause Compute Ring tests fail.
> > I added printings in driver codes (gfx_v9_0_cp_resume), and found
> > the HQDs of MECs are still active, that may cause MECs use stale
> > HQDs when MECs are unhalted before mapping compute queues (restore
> > MQDs to HQDs).
> > So, I send this patch to fix above problems.
> > There are two main changes of my patches:
> > One is to reset CPC and CPF before resuming KCQ.
> > Another is to disable HQDs before unhalting MECs.
> > ---
> >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 40 ++++++++++++++++++++++++++-
> >  1 file changed, 39 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index 47721d0c3781..dc0978bc312c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -3944,7 +3944,8 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
> >  
> >  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
> >  {
> > -	int r, i;
> > +	u32 tmp;
> > +	int r, i, j, k;
> >  	struct amdgpu_ring *ring;
> >  
> >  	if (!(adev->flags & AMD_IS_APU))
> > @@ -3967,6 +3968,43 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
> >  		gfx_v9_0_cp_gfx_enable(adev, false);
> >  	gfx_v9_0_cp_compute_enable(adev, false);
> >  
> > +	if ((adev->flags & AMD_IS_APU) &&
> > +		(adev->apu_flags & AMD_APU_IS_RENOIR) && amdgpu_in_reset(adev)) {
> > +		/*
> > +		 * CPC and CPF are still stuck after Mode2 reset, that causes later
> > +		 * compute ring test fail and then loop Mode2 reset infinitely
> > +		 */
> > +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> > +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
> > +		tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
> > +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> > +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> > +		udelay(50);
> > +
> > +		tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
> > +				GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
> > +		WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> > +		tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> > +		udelay(50);
> > +
> > +		/*
> > +		 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
> > +		 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
> > +		 * Otherwise, later compute IB test may fail
> > +		 */
> > +		for (i = 0; i < adev->gfx.mec.num_mec; i++) {
> > +			for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
> > +				for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
> > +					mutex_lock(&adev->srbm_mutex);
> > +					soc15_grbm_select(adev, i + 1, j, k, 0, 0);
> > +					WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
> > +					soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> > +					mutex_unlock(&adev->srbm_mutex);
> > +				}
> > +			}
> > +		}
> > +	}
> > +
> >  	r = gfx_v9_0_kiq_resume(adev);
> >  	if (r)
> >  		return r;
> 

