[PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2

* [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
@ 2026-06-11  5:57 Jiqian Chen
  2026-06-11  6:26 ` Huang Rui
  2026-06-11 20:26 ` Timur Kristóf
  0 siblings, 2 replies; 9+ messages in thread
From: Jiqian Chen @ 2026-06-11  5:57 UTC (permalink / raw)
  To: Alex Deucher, Christian König
  Cc: amd-gfx, Timur Kristóf, Samuel Pitoiset, Tvrtko Ursulin,
	Huang Rui, Huang Trigger, Jiqian Chen

On Renoir APUs with gfx9, some test scenarios run with ring_reset
disabled (for example, accessing an unmapped, invalid address) can
trigger a GPU job timeout. The driver then uses a Mode2 reset to
reset the GPU, but afterwards the compute ring test and IB test
fail randomly. This is because the CPC and CPF are still stuck after
Mode2, which makes the compute ring test fail. In addition, the HQDs
of the MECs are still active, so the MECs use stale HQDs if they are
unhalted before the driver restores the MQDs, which makes the compute
IB tests fail.

So, add sequences to reset the CPC and CPF after Mode2, and deactivate
the HQDs of the MECs before unhalting the MECs.

Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
---
v1->v2 changes:
* Move my sequences into a new function, gfx_v9_0_cp_mode2_clear_state()
* Add a Mode2 reset method check to the if condition that calls my sequences

v1:
Hi all,

My board is a Renoir APU with gfx9 and smu12. I ran a test case that
accesses an invalid address to trigger amdgpu_job_timedout() with
ring_reset disabled, so that the driver calls Mode2 reset directly.
After the Mode2 reset, I found that compute ring tests and compute
IB tests fail randomly on random compute rings.

We checked the scan dump of the GPU and can see that the CPC and CPF
are still stuck, which caused the compute ring tests to fail.

I added prints to the driver code (gfx_v9_0_cp_resume) and found
that the HQDs of the MECs are still active, which may cause the MECs
to use stale HQDs when they are unhalted before the compute queues
are mapped (restoring MQDs to HQDs).

So I am sending this patch to fix the above problems.
There are two main changes in my patch:
One is to reset the CPC and CPF before resuming the KCQ.
The other is to disable the HQDs before unhalting the MECs.
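
For readers who want a picture of the two steps, here is a minimal
user-space model in C. It is not driver code: struct fake_gpu, the bit
positions, the queue topology and every fake_* helper are hypothetical,
and the "stuck" flag is only a stand-in for the hardware state seen in
the scan dump. The assumption that asserting both reset bits un-sticks
the CP blocks is part of the model, not a statement about the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions; the real ones come from the gfx9 headers. */
#define FAKE_SOFT_RESET_CPF (1u << 0)
#define FAKE_SOFT_RESET_CPC (1u << 1)
#define FAKE_CP_RESET_MASK  (FAKE_SOFT_RESET_CPF | FAKE_SOFT_RESET_CPC)

/* Hypothetical topology, sized only for the model. */
enum { FAKE_NUM_MEC = 2, FAKE_NUM_PIPE = 4, FAKE_NUM_QUEUE = 8 };

struct fake_gpu {
	uint32_t grbm_soft_reset;	/* stands in for GRBM_SOFT_RESET */
	bool cp_stuck;			/* stands in for the stuck CPC/CPF */
	uint32_t hqd_active[FAKE_NUM_MEC][FAKE_NUM_PIPE][FAKE_NUM_QUEUE];
};

/* Model assumption: asserting both reset bits un-sticks the CP blocks. */
static void fake_write_soft_reset(struct fake_gpu *gpu, uint32_t val)
{
	if ((val & FAKE_CP_RESET_MASK) == FAKE_CP_RESET_MASK)
		gpu->cp_stuck = false;
	gpu->grbm_soft_reset = val;
}

/* Step 1: set the CPC/CPF reset bits, then clear them, keeping other bits. */
static void fake_pulse_cp_reset(struct fake_gpu *gpu)
{
	uint32_t tmp = gpu->grbm_soft_reset;

	fake_write_soft_reset(gpu, tmp | FAKE_CP_RESET_MASK);
	/* The patch reads the register back and waits 50 us at this point. */
	fake_write_soft_reset(gpu, tmp & ~FAKE_CP_RESET_MASK);
}

/* Step 2: force the active flag of every MEC queue slot to zero. */
static void fake_clear_hqd_active(struct fake_gpu *gpu)
{
	int i, j, k;

	for (i = 0; i < FAKE_NUM_MEC; i++)
		for (j = 0; j < FAKE_NUM_PIPE; j++)
			for (k = 0; k < FAKE_NUM_QUEUE; k++)
				gpu->hqd_active[i][j][k] = 0;
}

/* Helper for checking the model: how many slots are still active. */
static unsigned int fake_count_active(const struct fake_gpu *gpu)
{
	unsigned int n = 0;
	int i, j, k;

	for (i = 0; i < FAKE_NUM_MEC; i++)
		for (j = 0; j < FAKE_NUM_PIPE; j++)
			for (k = 0; k < FAKE_NUM_QUEUE; k++)
				if (gpu->hqd_active[i][j][k])
					n++;
	return n;
}
```

Calling fake_pulse_cp_reset() and then fake_clear_hqd_active() on a
model that starts stuck with a few active slots leaves it un-stuck, with
unrelated reset bits preserved and no active slots, which is the state
the patch wants before the KIQ and KCQs are resumed.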
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 44 +++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 47721d0c3781..d3ef45aa299a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3942,6 +3942,46 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
 	return amdgpu_gfx_enable_kcq(adev, 0);
 }
 
+static void gfx_v9_0_cp_mode2_clear_state(struct amdgpu_device *adev)
+{
+	u32 tmp;
+	int i, j, k;
+
+	/*
+	 * CPC and CPF are still stuck after Mode2 reset, that causes later
+	 * compute ring test fail and then loop Mode2 reset infinitely
+	 */
+	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
+	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
+	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
+	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+	udelay(50);
+
+	tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
+			GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
+	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
+	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
+	udelay(50);
+
+	/*
+	 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
+	 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
+	 * Otherwise, later compute IB test may fail
+	 */
+	for (i = 0; i < adev->gfx.mec.num_mec; i++) {
+		for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
+			for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
+				mutex_lock(&adev->srbm_mutex);
+				soc15_grbm_select(adev, i + 1, j, k, 0, 0);
+				WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
+				soc15_grbm_select(adev, 0, 0, 0, 0, 0);
+				mutex_unlock(&adev->srbm_mutex);
+			}
+		}
+	}
+}
+
 static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
 {
 	int r, i;
@@ -3967,6 +4007,10 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
 		gfx_v9_0_cp_gfx_enable(adev, false);
 	gfx_v9_0_cp_compute_enable(adev, false);
 
+	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
+		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
+		gfx_v9_0_cp_mode2_clear_state(adev);
+
 	r = gfx_v9_0_kiq_resume(adev);
 	if (r)
 		return r;
-- 
2.39.5



* Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-11  5:57 [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2 Jiqian Chen
@ 2026-06-11  6:26 ` Huang Rui
  2026-06-11  7:09   ` Lazar, Lijo
  2026-06-11 20:26 ` Timur Kristóf
  1 sibling, 1 reply; 9+ messages in thread
From: Huang Rui @ 2026-06-11  6:26 UTC (permalink / raw)
  To: Jiqian Chen
  Cc: Alex Deucher, Christian König, amd-gfx, Timur Kristóf,
	Samuel Pitoiset, Tvrtko Ursulin, Huang Trigger

On Thu, Jun 11, 2026 at 01:57:15PM +0800, Jiqian Chen wrote:
> For Renoir APU with gfx9, in some test scenarios with disabling
> ring_reset, like accessing an unmapped invalid address, it can
> trigger a gpu job timeout event, then driver uses Mode2 reset
> to reset GPU, but after Mode2 compute Ring test and IB test fail
> randomly. It is because the CPC and CPF are still stuck after Mode2,
> that causes compute Ring test fail. What's more, the HQDs of
> MECs are still active, that causes MECs use stale HQDs when MECs
> are unhalted before driver restore MQDs, then causes compute IB
> tests fail.
> 
> So, add sequences to reset CPC and CPF after Mode2, and deactivate
> HQDs of MECs before unhalting MECs.
> 
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> ---
> v1->v2 changes:
> * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
> * Add reset Mode2 method check to the if condition that call my sequences
> 
> v1:
> Hi all,
> 
> My board is Renoir APU with gfx9, smu12. I run a testcase that
> accesses an invalid address to trigger a amdgpu_job_timedout()
> with disabling ring_reset, so that driver will call mode2 reset
> directly. After mode2 reset I found compute Ring tests and compute
> IB tests fail randomly on random compute ring.
> 
> We checked the scan dump of GPU, we can see the CPC and CPF are
> still stuck, that caused Compute Ring tests fail.
> 
> I added printings in driver codes (gfx_v9_0_cp_resume), and found
> the HQDs of MECs are still active, that may cause MECs use stale
> HQDs when MECs are unhalted before mapping compute queues (restoring
> MQDs to HQDs).
> 
> So, I send this patch to fix above problems.
> There are two main changes of my patch:
> One is to reset CPC and CPF before resuming KCQ.
> Another is to disable HQDs before unhalting MECs.
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 44 +++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 47721d0c3781..d3ef45aa299a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3942,6 +3942,46 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>  	return amdgpu_gfx_enable_kcq(adev, 0);
>  }
>  
> +static void gfx_v9_0_cp_mode2_clear_state(struct amdgpu_device *adev)
> +{
> +	u32 tmp;
> +	int i, j, k;
> +
> +	/*
> +	 * CPC and CPF are still stuck after Mode2 reset, that causes later
> +	 * compute ring test fail and then loop Mode2 reset infinitely
> +	 */
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	udelay(50);
> +
> +	tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
> +			GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	udelay(50);
> +
> +	/*
> +	 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
> +	 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
> +	 * Otherwise, later compute IB test may fail
> +	 */
> +	for (i = 0; i < adev->gfx.mec.num_mec; i++) {
> +		for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
> +			for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
> +				mutex_lock(&adev->srbm_mutex);
> +				soc15_grbm_select(adev, i + 1, j, k, 0, 0);
> +				WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);

I think we don't need to use WREG32_SOC15_RLC here, because SRIOV GPU won't
access this code path.

> +				soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> +				mutex_unlock(&adev->srbm_mutex);
> +			}
> +		}
> +	}
> +}
> +
>  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  {
>  	int r, i;
> @@ -3967,6 +4007,10 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  		gfx_v9_0_cp_gfx_enable(adev, false);
>  	gfx_v9_0_cp_compute_enable(adev, false);
>  
> +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
> +		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)

If we constrain the condition to a mode2 reset, does that mean we no longer
need to restrict it to APU?

Thanks,
Ray

> +		gfx_v9_0_cp_mode2_clear_state(adev);
> +
>  	r = gfx_v9_0_kiq_resume(adev);
>  	if (r)
>  		return r;
> -- 
> 2.39.5
> 


* Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-11  6:26 ` Huang Rui
@ 2026-06-11  7:09   ` Lazar, Lijo
  2026-06-11  7:42     ` Huang Rui
  0 siblings, 1 reply; 9+ messages in thread
From: Lazar, Lijo @ 2026-06-11  7:09 UTC (permalink / raw)
  To: Huang Rui, Jiqian Chen
  Cc: Alex Deucher, Christian König, amd-gfx, Timur Kristóf,
	Samuel Pitoiset, Tvrtko Ursulin, Huang Trigger



On 11-Jun-26 11:56 AM, Huang Rui wrote:
> On Thu, Jun 11, 2026 at 01:57:15PM +0800, Jiqian Chen wrote:
>> For Renoir APU with gfx9, in some test scenarios with disabling
>> ring_reset, like accessing an unmapped invalid address, it can
>> trigger a gpu job timeout event, then driver uses Mode2 reset
>> to reset GPU, but after Mode2 compute Ring test and IB test fail
>> randomly. It is because the CPC and CPF are still stuck after Mode2,
>> that causes compute Ring test fail. What's more, the HQDs of
>> MECs are still active, that causes MECs use stale HQDs when MECs
>> are unhalted before driver restore MQDs, then causes compute IB
>> tests fail.
>>
>> So, add sequences to reset CPC and CPF after Mode2, and deactivate
>> HQDs of MECs before unhalting MECs.
>>
>> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
>> ---
>> v1->v2 changes:
>> * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
>> * Add reset Mode2 method check to the if condition that call my sequences
>>
>> v1:
>> Hi all,
>>
>> My board is Renoir APU with gfx9, smu12. I run a testcase that
>> accesses an invalid address to trigger a amdgpu_job_timedout()
>> with disabling ring_reset, so that driver will call mode2 reset
>> directly. After mode2 reset I found compute Ring tests and compute
>> IB tests fail randomly on random compute ring.
>>
>> We checked the scan dump of GPU, we can see the CPC and CPF are
>> still stuck, that caused Compute Ring tests fail.
>>
>> I added printings in driver codes (gfx_v9_0_cp_resume), and found
>> the HQDs of MECs are still active, that may cause MECs use stale
>> HQDs when MECs are unhalted before mapping compute queues (restoring
>> MQDs to HQDs).
>>
>> So, I send this patch to fix above problems.
>> There are two main changes of my patch:
>> One is to reset CPC and CPF before resuming KCQ.
>> Another is to disable HQDs before unhalting MECs.
>> ---
>>   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 44 +++++++++++++++++++++++++++
>>   1 file changed, 44 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> index 47721d0c3781..d3ef45aa299a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> @@ -3942,6 +3942,46 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>>   	return amdgpu_gfx_enable_kcq(adev, 0);
>>   }
>>   
>> +static void gfx_v9_0_cp_mode2_clear_state(struct amdgpu_device *adev)
>> +{
>> +	u32 tmp;
>> +	int i, j, k;
>> +
>> +	/*
>> +	 * CPC and CPF are still stuck after Mode2 reset, that causes later
>> +	 * compute ring test fail and then loop Mode2 reset infinitely
>> +	 */
>> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
>> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
>> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
>> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +	udelay(50);
>> +
>> +	tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
>> +			GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
>> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
>> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +	udelay(50);
>> +
>> +	/*
>> +	 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
>> +	 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
>> +	 * Otherwise, later compute IB test may fail
>> +	 */
>> +	for (i = 0; i < adev->gfx.mec.num_mec; i++) {
>> +		for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
>> +			for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
>> +				mutex_lock(&adev->srbm_mutex);
>> +				soc15_grbm_select(adev, i + 1, j, k, 0, 0);
>> +				WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
> 
> I think we don't need to use WREG32_SOC15_RLC here, because SRIOV GPU won't
> access this code path.
> 
>> +				soc15_grbm_select(adev, 0, 0, 0, 0, 0);
>> +				mutex_unlock(&adev->srbm_mutex);
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>   static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>>   {
>>   	int r, i;
>> @@ -3967,6 +4007,10 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>>   		gfx_v9_0_cp_gfx_enable(adev, false);
>>   	gfx_v9_0_cp_compute_enable(adev, false);
>>   
>> +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
>> +		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
> 
> If we constrain the condition to a mode2 reset, does that mean we no longer
> need to restrict it to APU?
> 

This issue is not reported on aldebaran which also supports mode-2 reset.

Thanks,
Lijo

> Thanks,
> Ray
> 
>> +		gfx_v9_0_cp_mode2_clear_state(adev);
>> +
>>   	r = gfx_v9_0_kiq_resume(adev);
>>   	if (r)
>>   		return r;
>> -- 
>> 2.39.5
>>



* Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-11  7:09   ` Lazar, Lijo
@ 2026-06-11  7:42     ` Huang Rui
  0 siblings, 0 replies; 9+ messages in thread
From: Huang Rui @ 2026-06-11  7:42 UTC (permalink / raw)
  To: Lazar, Lijo
  Cc: Jiqian Chen, Alex Deucher, Christian König, amd-gfx,
	Timur Kristóf, Samuel Pitoiset, Tvrtko Ursulin,
	Huang Trigger

On Thu, Jun 11, 2026 at 12:39:32PM +0530, Lazar, Lijo wrote:
> 
> 
> On 11-Jun-26 11:56 AM, Huang Rui wrote:
> > On Thu, Jun 11, 2026 at 01:57:15PM +0800, Jiqian Chen wrote:
> > > For Renoir APU with gfx9, in some test scenarios with disabling
> > > ring_reset, like accessing an unmapped invalid address, it can
> > > trigger a gpu job timeout event, then driver uses Mode2 reset
> > > to reset GPU, but after Mode2 compute Ring test and IB test fail
> > > randomly. It is because the CPC and CPF are still stuck after Mode2,
> > > that causes compute Ring test fail. What's more, the HQDs of
> > > MECs are still active, that causes MECs use stale HQDs when MECs
> > > are unhalted before driver restore MQDs, then causes compute IB
> > > tests fail.
> > > 
> > > So, add sequences to reset CPC and CPF after Mode2, and deactivate
> > > HQDs of MECs before unhalting MECs.
> > > 
> > > Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> > > ---
> > > v1->v2 changes:
> > > * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
> > > * Add reset Mode2 method check to the if condition that call my sequences
> > > 
> > > v1:
> > > Hi all,
> > > 
> > > My board is Renoir APU with gfx9, smu12. I run a testcase that
> > > accesses an invalid address to trigger a amdgpu_job_timedout()
> > > with disabling ring_reset, so that driver will call mode2 reset
> > > directly. After mode2 reset I found compute Ring tests and compute
> > > IB tests fail randomly on random compute ring.
> > > 
> > > We checked the scan dump of GPU, we can see the CPC and CPF are
> > > still stuck, that caused Compute Ring tests fail.
> > > 
> > > I added printings in driver codes (gfx_v9_0_cp_resume), and found
> > > the HQDs of MECs are still active, that may cause MECs use stale
> > > HQDs when MECs are unhalted before mapping compute queues (restoring
> > > MQDs to HQDs).
> > > 
> > > So, I send this patch to fix above problems.
> > > There are two main changes of my patch:
> > > One is to reset CPC and CPF before resuming KCQ.
> > > Another is to disable HQDs before unhalting MECs.
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 44 +++++++++++++++++++++++++++
> > >   1 file changed, 44 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > index 47721d0c3781..d3ef45aa299a 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > @@ -3942,6 +3942,46 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
> > >   	return amdgpu_gfx_enable_kcq(adev, 0);
> > >   }
> > > +static void gfx_v9_0_cp_mode2_clear_state(struct amdgpu_device *adev)
> > > +{
> > > +	u32 tmp;
> > > +	int i, j, k;
> > > +
> > > +	/*
> > > +	 * CPC and CPF are still stuck after Mode2 reset, that causes later
> > > +	 * compute ring test fail and then loop Mode2 reset infinitely
> > > +	 */
> > > +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> > > +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
> > > +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
> > > +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> > > +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> > > +	udelay(50);
> > > +
> > > +	tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
> > > +			GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
> > > +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> > > +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> > > +	udelay(50);
> > > +
> > > +	/*
> > > +	 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
> > > +	 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
> > > +	 * Otherwise, later compute IB test may fail
> > > +	 */
> > > +	for (i = 0; i < adev->gfx.mec.num_mec; i++) {
> > > +		for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
> > > +			for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
> > > +				mutex_lock(&adev->srbm_mutex);
> > > +				soc15_grbm_select(adev, i + 1, j, k, 0, 0);
> > > +				WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
> > 
> > I think we don't need to use WREG32_SOC15_RLC here, because SRIOV GPU won't
> > access this code path.
> > 
> > > +				soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> > > +				mutex_unlock(&adev->srbm_mutex);
> > > +			}
> > > +		}
> > > +	}
> > > +}
> > > +
> > >   static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
> > >   {
> > >   	int r, i;
> > > @@ -3967,6 +4007,10 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
> > >   		gfx_v9_0_cp_gfx_enable(adev, false);
> > >   	gfx_v9_0_cp_compute_enable(adev, false);
> > > +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
> > > +		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
> > 
> > If we constrain the condition to a mode2 reset, does that mean we no longer
> > need to restrict it to APU?
> > 
> 
> This issue is not reported on aldebaran which also supports mode-2 reset.
> 

Nice catch, thanks Lijo. We should still keep APU flag.
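
To spell out the gating we are converging on, here is a tiny
stand-alone sketch. It is only an illustration: struct fake_dev and
enum fake_reset_method are hypothetical stand-ins for the
adev->flags & AMD_IS_APU test, amdgpu_in_reset() and
amdgpu_asic_reset_method() used in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver's reset method enum. */
enum fake_reset_method {
	FAKE_RESET_MODE1,
	FAKE_RESET_MODE2,
	FAKE_RESET_BACO,
};

/* Hypothetical stand-in for the few device properties the check reads. */
struct fake_dev {
	bool is_apu;			/* models adev->flags & AMD_IS_APU */
	bool in_reset;			/* models amdgpu_in_reset(adev) */
	enum fake_reset_method method;	/* models amdgpu_asic_reset_method(adev) */
};

/* The extra cleanup runs only for an APU that is recovering via Mode2. */
static bool fake_needs_mode2_cleanup(const struct fake_dev *dev)
{
	return dev->is_apu && dev->in_reset &&
	       dev->method == FAKE_RESET_MODE2;
}
```

With the APU term kept, a dGPU that also uses Mode2, such as the
aldebaran case Lijo mentions, does not take the new path.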

Thanks,
Ray

> Thanks,
> Lijo
> 
> > Thanks,
> > Ray
> > 
> > > +		gfx_v9_0_cp_mode2_clear_state(adev);
> > > +
> > >   	r = gfx_v9_0_kiq_resume(adev);
> > >   	if (r)
> > >   		return r;
> > > -- 
> > > 2.39.5
> > > 
> 


* Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-11  5:57 [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2 Jiqian Chen
  2026-06-11  6:26 ` Huang Rui
@ 2026-06-11 20:26 ` Timur Kristóf
  2026-06-12  6:56   ` Chen, Jiqian
  1 sibling, 1 reply; 9+ messages in thread
From: Timur Kristóf @ 2026-06-11 20:26 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Jiqian Chen
  Cc: amd-gfx, Samuel Pitoiset, Tvrtko Ursulin, Huang Rui,
	Huang Trigger, Jiqian Chen

On Thursday, June 11, 2026 7:57:15 AM Central European Summer Time Jiqian Chen 
wrote:
> For Renoir APU with gfx9, in some test scenarios with disabling
> ring_reset, like accessing an unmapped invalid address, it can
> trigger a gpu job timeout event, then driver uses Mode2 reset
> to reset GPU, but after Mode2 compute Ring test and IB test fail
> randomly. It is because the CPC and CPF are still stuck after Mode2,
> that causes compute Ring test fail. What's more, the HQDs of
> MECs are still active, that causes MECs use stale HQDs when MECs
> are unhalted before driver restore MQDs, then causes compute IB
> tests fail.
> 
> So, add sequences to reset CPC and CPF after Mode2, and deactivate
> HQDs of MECs before unhalting MECs.
> 
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> ---
> v1->v2 changes:
> * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
> * Add reset Mode2 method check to the if condition that call my sequences
> 
> v1:
> Hi all,
> 
> My board is Renoir APU with gfx9, smu12. I run a testcase that
> accesses an invalid address to trigger a amdgpu_job_timedout()
> with disabling ring_reset, so that driver will call mode2 reset
> directly. After mode2 reset I found compute Ring tests and compute
> IB tests fail randomly on random compute ring.
> 
> We checked the scan dump of GPU, we can see the CPC and CPF are
> still stuck, that caused Compute Ring tests fail.
> 
> I added printings in driver codes (gfx_v9_0_cp_resume), and found
> the HQDs of MECs are still active, that may cause MECs use stale
> HQDs when MECs are unhalted before mapping compute queues (restoring
> MQDs to HQDs).
> 
> So, I send this patch to fix above problems.
> There are two main changes of my patch:
> One is to reset CPC and CPF before resuming KCQ.
> Another is to disable HQDs before unhalting MECs.

Hi,

Indeed I've seen similar issues on other GPUs, as I've been looking into 
improving GPU recovery.

Instead of forcing the HQD_ACTIVE to zero, I suggest to deactivate the HQD 
before reset. We should introduce a gfx_v9_0_deactivate_hqd() function similar 
to what gfx_v8_0_deactivate_hqd() is doing, and call that from somewhere in 
gfx_v9_0_hw_fini() when disabling the compute queues.

In fact, it looks like it already deactivates HQD, but only for the KIQ and 
only when it isn't in reset or suspend. That looks wrong to me and I think it 
should do that for all compute queues (in addition to the KIQ) either 
unconditionally or before a mode2 reset.
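
As a rough illustration of the difference between deactivating a queue
and forcing its active flag to zero, here is a user-space sketch of a
request-and-poll sequence. Everything in it is an assumption made for
illustration: struct fake_hqd and the fake_* helpers are hypothetical,
gfx_v9_0_deactivate_hqd() does not exist in the patch, and this is not
a copy of gfx_v8_0_deactivate_hqd(), only the general shape of asking
the queue to drain and waiting with a timeout.

```c
#include <errno.h>
#include <stdint.h>

/* User-space stand-in for one hardware queue slot; not a driver type. */
struct fake_hqd {
	uint32_t active;		/* models the queue's active flag */
	uint32_t dequeue_request;	/* models a pending dequeue request */
	uint32_t rptr;
	uint32_t wptr;
	unsigned int drain_polls;	/* polls the mock needs before it idles */
};

/* Mock read: once a dequeue is requested, the slot idles after some polls. */
static uint32_t fake_read_active(struct fake_hqd *q)
{
	if (q->active && q->dequeue_request) {
		if (q->drain_polls == 0)
			q->active = 0;
		else
			q->drain_polls--;
	}
	return q->active;
}

/* Ask an active slot to drain, poll until it idles, then clear pointers. */
static int fake_deactivate_hqd(struct fake_hqd *q, uint32_t req,
			       unsigned int timeout_polls)
{
	unsigned int i;
	int r = 0;

	if (fake_read_active(q)) {
		q->dequeue_request = req;
		for (i = 0; i < timeout_polls; i++) {
			if (!fake_read_active(q))
				break;
			/* A driver would delay briefly between polls. */
		}
		if (i == timeout_polls)
			r = -ETIMEDOUT;
	}

	q->dequeue_request = 0;
	q->rptr = 0;
	q->wptr = 0;
	return r;
}
```

The point of the sketch is that the caller learns whether the queue
really went idle (return value 0) or did not (-ETIMEDOUT), which a plain
write of zero to the active flag cannot tell it.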

What do you think?

I don't have a Renoir APU yet but if you need help, I can try to see if I can 
reproduce something like this on a Vega 10 dGPU.

Best regards,
Timur

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 44 +++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 47721d0c3781..d3ef45aa299a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3942,6 +3942,46 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>  	return amdgpu_gfx_enable_kcq(adev, 0);
>  }
> 
> +static void gfx_v9_0_cp_mode2_clear_state(struct amdgpu_device *adev)
> +{
> +	u32 tmp;
> +	int i, j, k;
> +
> +	/*
> +	 * CPC and CPF are still stuck after Mode2 reset, that causes later
> +	 * compute ring test fail and then loop Mode2 reset infinitely
> +	 */
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	udelay(50);
> +
> +	tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
> +			GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
> +	udelay(50);
> +
> +	/*
> +	 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
> +	 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
> +	 * Otherwise, later compute IB test may fail
> +	 */
> +	for (i = 0; i < adev->gfx.mec.num_mec; i++) {
> +		for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
> +			for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
> +				mutex_lock(&adev->srbm_mutex);
> +				soc15_grbm_select(adev, i + 1, j, k, 0, 0);
> +				WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
> +				soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> +				mutex_unlock(&adev->srbm_mutex);
> +			}
> +		}
> +	}
> +}
> +
>  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  {
>  	int r, i;
> @@ -3967,6 +4007,10 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>  		gfx_v9_0_cp_gfx_enable(adev, false);
>  	gfx_v9_0_cp_compute_enable(adev, false);
> 
> +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
> +		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
> +		gfx_v9_0_cp_mode2_clear_state(adev);
> +
>  	r = gfx_v9_0_kiq_resume(adev);
>  	if (r)
>  		return r;






* Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-11 20:26 ` Timur Kristóf
@ 2026-06-12  6:56   ` Chen, Jiqian
  2026-06-12  7:48     ` Timur Kristóf
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Jiqian @ 2026-06-12  6:56 UTC (permalink / raw)
  To: Timur Kristóf, Deucher, Alexander, Koenig, Christian
  Cc: amd-gfx@lists.freedesktop.org, Samuel Pitoiset, Tvrtko Ursulin,
	Huang, Ray, Huang, Trigger

Hi Timur,

On 6/12/26 04:26, Timur Kristóf wrote:
> On Thursday, June 11, 2026 7:57:15 AM Central European Summer Time Jiqian Chen 
> wrote:
>> For Renoir APU with gfx9, in some test scenarios with disabling
>> ring_reset, like accessing an unmapped invalid address, it can
>> trigger a gpu job timeout event, then driver uses Mode2 reset
>> to reset GPU, but after Mode2 compute Ring test and IB test fail
>> randomly. It is because the CPC and CPF are still stuck after Mode2,
>> that causes compute Ring test fail. What's more, the HQDs of
>> MECs are still active, that causes MECs use stale HQDs when MECs
>> are unhalted before driver restore MQDs, then causes compute IB
>> tests fail.
>>
>> So, add sequences to reset CPC and CPF after Mode2, and deactivate
>> HQDs of MECs before unhalting MECs.
>>
>> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
>> ---
>> v1->v2 changes:
>> * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
>> * Add reset Mode2 method check to the if condition that call my sequences
>>
>> v1:
>> Hi all,
>>
>> My board is Renoir APU with gfx9, smu12. I run a testcase that
>> accesses an invalid address to trigger a amdgpu_job_timedout()
>> with disabling ring_reset, so that driver will call mode2 reset
>> directly. After mode2 reset I found compute Ring tests and compute
>> IB tests fail randomly on random compute ring.
>>
>> We checked the scan dump of GPU, we can see the CPC and CPF are
>> still stuck, that caused Compute Ring tests fail.
>>
>> I added printings in driver codes (gfx_v9_0_cp_resume), and found
>> the HQDs of MECs are still active, that may cause MECs use stale
>> HQDs when MECs are unhalted before mapping compute queues (restoring
>> MQDs to HQDs).
>>
>> So, I send this patch to fix above problems.
>> There are two main changes of my patch:
>> One is to reset CPC and CPF before resuming KCQ.
>> Another is to disable HQDs before unhalting MECs.
> 
> Hi,
> 
> Indeed I've seen similar issues on other GPUs, as I've been looking into 
> improving GPU recovery.
> 
> Instead of forcing the HQD_ACTIVE to zero, I suggest to deactivate the HQD 
> before reset. We should introduce a gfx_v9_0_deactivate_hqd() function similar 
> to what gfx_v8_0_deactivate_hqd() is doing, and call that from somewhere in 
> gfx_v9_0_hw_fini() when disabling the compute queues.
Makes sense, that looks like a more suitable place. I will try to move my sequences into gfx_v9_0_hw_fini() in the next version.

> 
> In fact, it looks like it already deactivates HQD, but only for the KIQ and 
> only when it isn't in reset or suspend. That looks wrong to me and I think it 
> should do that for all compute queues (in addition to the KIQ) either 
> unconditionally or before a mode2 reset.
So, you think the if condition checks are not needed?
	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
			amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
Since I only reproduced and verified this with Mode2 on an APU, I think keeping this check would be better.

> 
> What do you think?
> 
> I don't have a Renoir APU yet but if you need help, I can try to see if I can 
> reproduce something like this on a Vega 10 dGPU.
It seems the Vega 10 dGPU uses Mode1 or BACO reset, so I am not sure whether it has the same issue. When you say you have "seen similar issues on other GPUs", are they all APUs? What is the gfx version, and which reset method do they use? If not, I may look for the same hardware as yours to verify my changes.
I also tried another APU with gfx10, and this issue does not occur there.

> 
> Best regards,
> Timur
> 
>> ---
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 44 +++++++++++++++++++++++++++
>>  1 file changed, 44 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> index 47721d0c3781..d3ef45aa299a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
>> @@ -3942,6 +3942,46 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
>>  	return amdgpu_gfx_enable_kcq(adev, 0);
>>  }
>>
>> +static void gfx_v9_0_cp_mode2_clear_state(struct amdgpu_device *adev)
>> +{
>> +	u32 tmp;
>> +	int i, j, k;
>> +
>> +	/*
>> +	 * CPC and CPF are still stuck after Mode2 reset, that causes later
>> +	 * compute ring test fail and then loop Mode2 reset infinitely
>> +	 */
>> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPC, 1);
>> +	tmp = REG_SET_FIELD(tmp, GRBM_SOFT_RESET, SOFT_RESET_CPF, 1);
>> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
>> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +	udelay(50);
>> +
>> +	tmp &= ~(GRBM_SOFT_RESET__SOFT_RESET_CPC_MASK |
>> +			GRBM_SOFT_RESET__SOFT_RESET_CPF_MASK);
>> +	WREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET, tmp);
>> +	tmp = RREG32_SOC15(GC, 0, mmGRBM_SOFT_RESET);
>> +	udelay(50);
>> +
>> +	/*
>> +	 * CP_HQD_ACTIVE survives Mode2 reset. Deactivate every MEC HQD to
>> +	 * prevent MEC use stale HQD when MEC unhalted before restoring MQD.
>> +	 * Otherwise, later compute IB test may fail
>> +	 */
>> +	for (i = 0; i < adev->gfx.mec.num_mec; i++) {
>> +		for (j = 0; j < adev->gfx.mec.num_pipe_per_mec; j++) {
>> +			for (k = 0; k < adev->gfx.mec.num_queue_per_pipe; k++) {
>> +				mutex_lock(&adev->srbm_mutex);
>> +				soc15_grbm_select(adev, i + 1, j, k, 0, 0);
>> +				WREG32_SOC15_RLC(GC, 0, mmCP_HQD_ACTIVE, 0);
>> +				soc15_grbm_select(adev, 0, 0, 0, 0, 0);
>> +				mutex_unlock(&adev->srbm_mutex);
>> +			}
>> +		}
>> +	}
>> +}
>> +
>>  static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>>  {
>>  	int r, i;
>> @@ -3967,6 +4007,10 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
>>  	gfx_v9_0_cp_gfx_enable(adev, false);
>>  	gfx_v9_0_cp_compute_enable(adev, false);
>>
>> +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
>> +		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
>> +		gfx_v9_0_cp_mode2_clear_state(adev);
>> +
>>  	r = gfx_v9_0_kiq_resume(adev);
>>  	if (r)
>>  		return r;
> 

-- 
Best regards,
Jiqian Chen.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-12  6:56   ` Chen, Jiqian
@ 2026-06-12  7:48     ` Timur Kristóf
  2026-06-12  8:22       ` Chen, Jiqian
  0 siblings, 1 reply; 9+ messages in thread
From: Timur Kristóf @ 2026-06-12  7:48 UTC (permalink / raw)
  To: Deucher, Alexander, Koenig, Christian, Chen, Jiqian
  Cc: amd-gfx@lists.freedesktop.org, Samuel Pitoiset, Tvrtko Ursulin,
	Huang, Ray, Huang, Trigger

Hi Jiqian,

> > Indeed I've seen similar issues on other GPUs, as I've been looking into 
> > improving GPU recovery.
> > 
> > Instead of forcing the HQD_ACTIVE to zero, I suggest to deactivate the HQD
> >  before reset. We should introduce a gfx_v9_0_deactivate_hqd() function
> > similar to what gfx_v8_0_deactivate_hqd() is doing, and call that from
> > somewhere in gfx_v9_0_hw_fini() when disabling the compute queues.
> 
> Makes sense, that looks like a more suitable place. I will try to move my
> sequences into gfx_v9_0_hw_fini() in the next version.
 
Sounds good.

> > In fact, it looks like it already deactivates HQD, but only for the KIQ
> > and  only when it isn't in reset or suspend. That looks wrong to me and
> > I think it should do that for all compute queues (in addition to the KIQ)
> > either unconditionally or before a mode2 reset.
> 
> So, you think the if condition checks are not needed?
> 	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
> 			amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
> Since I only reproduced and verified when mode2 on APU, I think keeping this
> check would be better.

Yes, I think the checks may not be needed or need to be adjusted.
Additionally, the same sequence needs to be repeated for every compute ring.

> > I don't have a Renoir APU yet but if you need help, I can try to see if I
> > can  reproduce something like this on a Vega 10 dGPU.
> 
> It seems Vega 10 dGPU uses Mode1 or BACO reset. I am not sure if it has the
> same issue. When you "see similar issues on other GPUs", are they all APUs?
> What's the gfx version? And what reset method do they use? If they are not, I
> may find the same hardware as yours to verify my changes. I tried another APU
> with gfx10, and it does not have this issue.

You are correct that dGPUs don't use mode2 reset. I saw a similar issue while 
working on a patch series to improve GFX IP block soft reset on GFX8. I am 
testing that on a Carrizo APU as well as Fiji and Polaris 10 dGPUs.

The problem I saw is very similar to yours: compute rings fail to resume after 
the reset and are "stuck". I managed to solve that by making sure the HQD is 
deactivated before the reset and ensuring that the MQD is cleaned up after the 
reset.

Best regards,
Timur
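
The difference between forcing CP_HQD_ACTIVE to zero and deactivating the HQD, as discussed above, can be sketched in a small user-space simulation. This is only an illustration under stated assumptions, not driver code: struct mock_hqd, mock_read_active() and mock_deactivate_hqd() are hypothetical names, the "hardware" is a plain struct, and the request-then-poll shape is an assumption about what a gfx_v9_0_deactivate_hqd() helper might look like, not a copy of any existing function.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for one hardware queue descriptor (HQD) slot. */
struct mock_hqd {
	uint32_t active;          /* models CP_HQD_ACTIVE */
	uint32_t dequeue_request; /* models a dequeue request register */
	uint32_t pq_rptr;         /* models the queue read pointer */
	uint32_t pq_wptr;         /* models the queue write pointer */
	int polls_until_idle;     /* simulation knob: reads needed before the
				   * queue drains; negative means "never" */
};

/*
 * Simulated register read. Once a dequeue has been requested, the mock
 * hardware clears "active" after polls_until_idle further reads.
 */
static uint32_t mock_read_active(struct mock_hqd *q)
{
	if (q->active && q->dequeue_request && q->polls_until_idle >= 0) {
		if (q->polls_until_idle == 0)
			q->active = 0;
		else
			q->polls_until_idle--;
	}
	return q->active;
}

/*
 * Ask the queue to dequeue and wait for it to go idle, instead of
 * overwriting the active flag. Returns 0 on success or -ETIMEDOUT if the
 * queue is still active after "timeout" polls. The request and the queue
 * pointers are cleared in either case.
 */
static int mock_deactivate_hqd(struct mock_hqd *q, uint32_t req, int timeout)
{
	int i, r = 0;

	if (mock_read_active(q)) {
		q->dequeue_request = req;
		for (i = 0; i < timeout; i++) {
			if (!mock_read_active(q))
				break;
		}
		if (i == timeout)
			r = -ETIMEDOUT;
	}

	q->dequeue_request = 0;
	q->pq_rptr = 0;
	q->pq_wptr = 0;

	return r;
}
```

In this model a caller would run mock_deactivate_hqd() on every queue before the reset and treat -ETIMEDOUT as "the queue is stuck", which is information a plain write of zero to the active flag does not provide.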


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-12  7:48     ` Timur Kristóf
@ 2026-06-12  8:22       ` Chen, Jiqian
  2026-06-12 11:19         ` Timur Kristóf
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Jiqian @ 2026-06-12  8:22 UTC (permalink / raw)
  To: Timur Kristóf, Deucher, Alexander, Koenig, Christian
  Cc: amd-gfx@lists.freedesktop.org, Samuel Pitoiset, Tvrtko Ursulin,
	Huang, Ray, Huang, Trigger

On 6/12/26 15:48, Timur Kristóf wrote:
> Hi Jiqian,
> 
>>> Indeed I've seen similar issues on other GPUs, as I've been looking into 
>>> improving GPU recovery.
>>>
>>> Instead of forcing the HQD_ACTIVE to zero, I suggest to deactivate the HQD
>>>  before reset. We should introduce a gfx_v9_0_deactivate_hqd() function
>>> similar to what gfx_v8_0_deactivate_hqd() is doing, and call that from
>>> somewhere in gfx_v9_0_hw_fini() when disabling the compute queues.
>>
>> Makes sense, that looks like a more suitable place. I will try to move my
>> sequences into gfx_v9_0_hw_fini() in the next version.
>  
> Sounds good.
> 
>>> In fact, it looks like it already deactivates HQD, but only for the KIQ
>>> and  only when it isn't in reset or suspend. That looks wrong to me and
>>> I think it should do that for all compute queues (in addition to the KIQ)
>>> either unconditionally or before a mode2 reset.
>>
>> So, you think the if condition checks are not needed?
>> 	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
>> 			amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
>> Since I only reproduced and verified when mode2 on APU, I think keeping this
>> check would be better.
> 
> Yes, I think the checks may not be needed or need to be adjusted.
I am not sure if removing the checks can cause new issues in other APUs or dGPUs that don't have this issue.
Per our tests, GPUs that use Mode1 don't have this issue.
Is disabling HQD harmless even for GPUs that are not experiencing this issue?

> Additionally, the same sequence needs to be repeated for every compute ring.
Yes, my patch already does this for every compute ring.

> 
>>> I don't have a Renoir APU yet but if you need help, I can try to see if I
>>> can  reproduce something like this on a Vega 10 dGPU.
>>
>> It seems Vega 10 dGPU uses Mode1 or BACO reset. I am not sure if it has the
>> same issue. When you "see similar issues on other GPUs", are they all APUs?
>> What's the gfx version? And what reset method do they use? If they are not, I
>> may find the same hardware as yours to verify my changes. I tried another APU
>> with gfx10, and it does not have this issue.
> 
> You are correct that dGPUs don't use mode2 reset. I saw a similar issue while 
> working on a patch series to improve GFX IP block soft reset on GFX8. I am 
> testing that on a Carrizo APU as well as Fiji and Polaris 10 dGPUs.
> 
> The problem I saw is very similar to yours: compute rings fail to resume after 
> the reset and are "stuck". I managed to solve that by making sure the HQD is 
> deactivated before the reset and ensuring that the MQD is cleaned up after the 
> reset.
That's great that you have already solved the gfx8 issue.

> 
> Best regards,
> Timur
> 

-- 
Best regards,
Jiqian Chen.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-12  8:22       ` Chen, Jiqian
@ 2026-06-12 11:19         ` Timur Kristóf
  0 siblings, 0 replies; 9+ messages in thread
From: Timur Kristóf @ 2026-06-12 11:19 UTC (permalink / raw)
  To: Deucher, Alexander, Koenig, Christian, Chen, Jiqian
  Cc: amd-gfx@lists.freedesktop.org, Samuel Pitoiset, Tvrtko Ursulin,
	Huang, Ray, Huang, Trigger

Hi,

> > Yes, I think the checks may not be needed or need to be adjusted.
> 
> I am not sure if removing the checks can cause new issues in other APUs or
> dGPUs that don't have this issue. Per our tests, GPUs that use Mode1 don't
> have this issue.
> Is disabling HQD harmless even for GPUs that are not experiencing this
> issue?

It's OK if your patch only addresses the issue on the APU.
We can always revisit it later if/when someone needs this for other GPUs.

I suspect that if anyone were interested in using IP block soft reset on a 
Vega dGPU, then probably they'd find the exact same issue.

Best regards,
Timur



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-06-12 11:19 UTC | newest]

Thread overview: 9+ messages
2026-06-11  5:57 [PATCH v2 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2 Jiqian Chen
2026-06-11  6:26 ` Huang Rui
2026-06-11  7:09   ` Lazar, Lijo
2026-06-11  7:42     ` Huang Rui
2026-06-11 20:26 ` Timur Kristóf
2026-06-12  6:56   ` Chen, Jiqian
2026-06-12  7:48     ` Timur Kristóf
2026-06-12  8:22       ` Chen, Jiqian
2026-06-12 11:19         ` Timur Kristóf
