[PATCH v3 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2


* [PATCH v3 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
@ 2026-06-12  9:26 Jiqian Chen
  2026-06-12 12:46 ` Huang Rui
  2026-06-12 13:37 ` Timur Kristóf
  0 siblings, 2 replies; 3+ messages in thread
From: Jiqian Chen @ 2026-06-12  9:26 UTC (permalink / raw)
  To: Alex Deucher, Christian König
  Cc: amd-gfx, Timur Kristóf, Samuel Pitoiset, Tvrtko Ursulin,
	Huang Rui, Huang Trigger, Jiqian Chen

On Renoir APUs with gfx9, some test scenarios that run with ring_reset
disabled, such as accessing an unmapped invalid address, trigger a GPU
job timeout, after which the driver resets the GPU with Mode2. After
Mode2, the compute ring test and IB test fail randomly. This is because
the HQDs of the MECs stay active both before and after Mode2, so the
MECs use stale HQDs when they are unhalted before the driver restores
the MQDs. CPC and CPF then remain stuck after Mode2, which makes the
compute ring and IB tests fail.

So, add a sequence that deactivates the HQDs of the MECs in the suspend
IP function of the reset path.

v2: Move all sequences into a new function gfx_v9_0_cp_mode2_clear_state (Ray Huang)
    Check for the Mode2 reset method in the if condition (Ray Huang)
v3: Move all sequences before Mode2 instead of after Mode2 (Timur Kristóf)

Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
---
v2->v3 changes:
* Move all sequences before Mode2 instead of after Mode2, and add a new
  function gfx_v9_0_deactivate_kcq_hqd that runs the sequence to disable
  the compute HQDs.
  Resetting CPC and CPF is then no longer needed, since all sequences
  now run before Mode2 and CPC and CPF do not get stuck.

v1->v2 changes:
* Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
* Add a Mode2 reset method check to the if condition that calls my sequences

v1:
Hi all,

My board is a Renoir APU with gfx9 and smu12. I run a test case that
accesses an invalid address to trigger amdgpu_job_timedout() with
ring_reset disabled, so that the driver goes straight to a Mode2 reset.
After the Mode2 reset, compute ring tests and compute IB tests fail
randomly on random compute rings.

We checked the GPU scan dump and saw that CPC and CPF were still
stuck, which caused the compute ring tests to fail.

I added debug prints to the driver code (gfx_v9_0_cp_resume) and found
that the HQDs of the MECs were still active, which may cause the MECs
to use stale HQDs when they are unhalted before the compute queues are
mapped (restoring MQDs to HQDs).

So I am sending this patch to fix the problems above.
The patch makes two main changes:
One is to reset CPC and CPF before resuming KCQ.
The other is to disable HQDs before unhalting the MECs.
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 37 +++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 90bbddb45730..0c01701488e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -4071,6 +4071,39 @@ static int gfx_v9_0_hw_init(struct amdgpu_ip_block *ip_block)
 	return r;
 }
 
+static void gfx_v9_0_deactivate_kcq_hqd(struct amdgpu_device *adev)
+{
+	for (int i = 0; i < adev->gfx.num_compute_rings; i++) {
+		u32 tmp;
+		struct amdgpu_ring *ring = &adev->gfx.compute_ring[i];
+
+		mutex_lock(&adev->srbm_mutex);
+		soc15_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0, 0);
+		tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
+		/* disable the queue if it's active */
+		if (tmp & CP_HQD_ACTIVE__ACTIVE_MASK) {
+			int j;
+
+			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 1);
+			for (j = 0; j < adev->usec_timeout; j++) {
+				tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
+				if (!(tmp & CP_HQD_ACTIVE__ACTIVE_MASK))
+					break;
+				udelay(1);
+			}
+			if (j == AMDGPU_MAX_USEC_TIMEOUT) {
+				DRM_DEBUG("comp_%u_%u_%u dequeue request failed.\n",
+							ring->me, ring->pipe, ring->queue);
+				/* Manual disable if dequeue request times out */
+				WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
+			}
+			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 0);
+		}
+		soc15_grbm_select(adev, 0, 0, 0, 0, 0);
+		mutex_unlock(&adev->srbm_mutex);
+	}
+}
+
 static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
 {
 	struct amdgpu_device *adev = ip_block->adev;
@@ -4095,6 +4128,10 @@ static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
 		return 0;
 	}
 
+	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
+		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
+		gfx_v9_0_deactivate_kcq_hqd(adev);
+
 	/* Use deinitialize sequence from CAIL when unbinding device from driver,
 	 * otherwise KIQ is hanging when binding back
 	 */
-- 
2.39.5
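
The per-queue sequence in gfx_v9_0_deactivate_kcq_hqd above (request a
dequeue, poll until the queue goes idle, force-clear on timeout) can be
sketched in user space roughly as follows. This is only an illustration
under stated assumptions, not driver code: struct mock_hqd,
mock_read_active() and deactivate_hqd() are hypothetical names, the two
registers are modelled as plain struct fields, and a poll count stands in
for the udelay(1) based timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HQD_ACTIVE_MASK 0x1u

/* Hypothetical stand-in for one hardware queue descriptor slot. */
struct mock_hqd {
	uint32_t active;          /* models CP_HQD_ACTIVE */
	uint32_t dequeue_request; /* models CP_HQD_DEQUEUE_REQUEST */
	int polls_until_idle;     /* reads needed to drain; < 0 means never */
};

/* Models a register read: the queue drains only while a dequeue is pending. */
static uint32_t mock_read_active(struct mock_hqd *q)
{
	if (q->dequeue_request && q->polls_until_idle >= 0) {
		if (q->polls_until_idle == 0)
			q->active = 0;
		else
			q->polls_until_idle--;
	}
	return q->active;
}

/*
 * Deactivate one queue. Returns true if the queue was already idle or
 * drained within timeout_polls reads, false if it had to be force-cleared.
 */
static bool deactivate_hqd(struct mock_hqd *q, int timeout_polls)
{
	bool graceful = true;
	int j;

	assert(q && timeout_polls > 0);

	/* Nothing to do if the queue is not active. */
	if (!(mock_read_active(q) & HQD_ACTIVE_MASK))
		return true;

	/* Ask the queue to drain, then poll until it reports idle. */
	q->dequeue_request = 1;
	for (j = 0; j < timeout_polls; j++) {
		if (!(mock_read_active(q) & HQD_ACTIVE_MASK))
			break;
	}

	/* Timed out: clear the active bit by hand. */
	if (j == timeout_polls) {
		q->active = 0;
		graceful = false;
	}

	/* Always drop the dequeue request again. */
	q->dequeue_request = 0;
	return graceful;
}
```

With polls_until_idle = 3 and timeout_polls = 10 the queue drains and
deactivate_hqd() returns true. With polls_until_idle = -1 the loop runs out
and the force-clear path is taken, so it returns false with the queue
inactive. The sketch compares the loop counter against the same bound the
loop uses, so the timeout branch is taken for any timeout value.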



* Re: [PATCH v3 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-12  9:26 [PATCH v3 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2 Jiqian Chen
@ 2026-06-12 12:46 ` Huang Rui
  2026-06-12 13:37 ` Timur Kristóf
  1 sibling, 0 replies; 3+ messages in thread
From: Huang Rui @ 2026-06-12 12:46 UTC (permalink / raw)
  To: Jiqian Chen
  Cc: Alex Deucher, Christian König, amd-gfx, Timur Kristóf,
	Samuel Pitoiset, Tvrtko Ursulin, Huang Trigger

On Fri, Jun 12, 2026 at 05:26:54PM +0800, Jiqian Chen wrote:
> On Renoir APUs with gfx9, some test scenarios that run with ring_reset
> disabled, such as accessing an unmapped invalid address, trigger a GPU
> job timeout, after which the driver resets the GPU with Mode2. After
> Mode2, the compute ring test and IB test fail randomly. This is because
> the HQDs of the MECs stay active both before and after Mode2, so the
> MECs use stale HQDs when they are unhalted before the driver restores
> the MQDs. CPC and CPF then remain stuck after Mode2, which makes the
> compute ring and IB tests fail.
> 
> So, add a sequence that deactivates the HQDs of the MECs in the suspend
> IP function of the reset path.
> 
> v2: Move all sequences into a new function gfx_v9_0_cp_mode2_clear_state (Ray Huang)
>     Check for the Mode2 reset method in the if condition (Ray Huang)
> v3: Move all sequences before Mode2 instead of after Mode2 (Timur Kristóf)
> 
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>

Reviewed-by: Huang Rui <ray.huang@amd.com>

> ---
> v2->v3 changes:
> * Move all sequences before Mode2 instead of after Mode2, and add a new
>   function gfx_v9_0_deactivate_kcq_hqd that runs the sequence to disable
>   the compute HQDs.
>   Resetting CPC and CPF is then no longer needed, since all sequences
>   now run before Mode2 and CPC and CPF do not get stuck.
> 
> v1->v2 changes:
> * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
> * Add a Mode2 reset method check to the if condition that calls my sequences
> 
> v1:
> Hi all,
> 
> My board is a Renoir APU with gfx9 and smu12. I run a test case that
> accesses an invalid address to trigger amdgpu_job_timedout() with
> ring_reset disabled, so that the driver goes straight to a Mode2 reset.
> After the Mode2 reset, compute ring tests and compute IB tests fail
> randomly on random compute rings.
> 
> We checked the GPU scan dump and saw that CPC and CPF were still
> stuck, which caused the compute ring tests to fail.
> 
> I added debug prints to the driver code (gfx_v9_0_cp_resume) and found
> that the HQDs of the MECs were still active, which may cause the MECs
> to use stale HQDs when they are unhalted before the compute queues are
> mapped (restoring MQDs to HQDs).
> 
> So I am sending this patch to fix the problems above.
> The patch makes two main changes:
> One is to reset CPC and CPF before resuming KCQ.
> The other is to disable HQDs before unhalting the MECs.
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 37 +++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 90bbddb45730..0c01701488e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -4071,6 +4071,39 @@ static int gfx_v9_0_hw_init(struct amdgpu_ip_block *ip_block)
>  	return r;
>  }
>  
> +static void gfx_v9_0_deactivate_kcq_hqd(struct amdgpu_device *adev)
> +{
> +	for (int i = 0; i < adev->gfx.num_compute_rings; i++) {
> +		u32 tmp;
> +		struct amdgpu_ring *ring = &adev->gfx.compute_ring[i];
> +
> +		mutex_lock(&adev->srbm_mutex);
> +		soc15_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0, 0);
> +		tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
> +		/* disable the queue if it's active */
> +		if (tmp & CP_HQD_ACTIVE__ACTIVE_MASK) {
> +			int j;
> +
> +			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 1);
> +			for (j = 0; j < adev->usec_timeout; j++) {
> +				tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
> +				if (!(tmp & CP_HQD_ACTIVE__ACTIVE_MASK))
> +					break;
> +				udelay(1);
> +			}
> +			if (j == AMDGPU_MAX_USEC_TIMEOUT) {
> +				DRM_DEBUG("comp_%u_%u_%u dequeue request failed.\n",
> +							ring->me, ring->pipe, ring->queue);
> +				/* Manual disable if dequeue request times out */
> +				WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
> +			}
> +			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 0);
> +		}
> +		soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> +		mutex_unlock(&adev->srbm_mutex);
> +	}
> +}
> +
>  static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
>  {
>  	struct amdgpu_device *adev = ip_block->adev;
> @@ -4095,6 +4128,10 @@ static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
>  		return 0;
>  	}
>  
> +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
> +		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
> +		gfx_v9_0_deactivate_kcq_hqd(adev);
> +
>  	/* Use deinitialize sequence from CAIL when unbinding device from driver,
>  	 * otherwise KIQ is hanging when binding back
>  	 */
> -- 
> 2.39.5
> 


* Re: [PATCH v3 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
  2026-06-12  9:26 [PATCH v3 1/1] drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2 Jiqian Chen
  2026-06-12 12:46 ` Huang Rui
@ 2026-06-12 13:37 ` Timur Kristóf
  1 sibling, 0 replies; 3+ messages in thread
From: Timur Kristóf @ 2026-06-12 13:37 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Jiqian Chen
  Cc: amd-gfx, Samuel Pitoiset, Tvrtko Ursulin, Huang Rui,
	Huang Trigger, Jiqian Chen

On Friday, June 12, 2026 11:26:54 AM Central European Summer Time Jiqian Chen wrote:
> On Renoir APUs with gfx9, some test scenarios that run with ring_reset
> disabled, such as accessing an unmapped invalid address, trigger a GPU
> job timeout, after which the driver resets the GPU with Mode2. After
> Mode2, the compute ring test and IB test fail randomly. This is because
> the HQDs of the MECs stay active both before and after Mode2, so the
> MECs use stale HQDs when they are unhalted before the driver restores
> the MQDs. CPC and CPF then remain stuck after Mode2, which makes the
> compute ring and IB tests fail.
> 
> So, add a sequence that deactivates the HQDs of the MECs in the suspend
> IP function of the reset path.
> 
> v2: Move all sequences into a new function gfx_v9_0_cp_mode2_clear_state (Ray Huang)
>     Check for the Mode2 reset method in the if condition (Ray Huang)
> v3: Move all sequences before Mode2 instead of after Mode2 (Timur Kristóf)
> 
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>

Looks good, thank you!

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>

> ---
> v2->v3 changes:
> * Move all sequences before Mode2 instead of after Mode2, and add a new
>   function gfx_v9_0_deactivate_kcq_hqd that runs the sequence to disable
>   the compute HQDs.
>   Resetting CPC and CPF is then no longer needed, since all sequences
>   now run before Mode2 and CPC and CPF do not get stuck.
> 
> v1->v2 changes:
> * Move my sequences into a new function gfx_v9_0_cp_mode2_clear_state
> * Add a Mode2 reset method check to the if condition that calls my sequences
> 
> v1:
> Hi all,
> 
> My board is a Renoir APU with gfx9 and smu12. I run a test case that
> accesses an invalid address to trigger amdgpu_job_timedout() with
> ring_reset disabled, so that the driver goes straight to a Mode2 reset.
> After the Mode2 reset, compute ring tests and compute IB tests fail
> randomly on random compute rings.
> 
> We checked the GPU scan dump and saw that CPC and CPF were still
> stuck, which caused the compute ring tests to fail.
> 
> I added debug prints to the driver code (gfx_v9_0_cp_resume) and found
> that the HQDs of the MECs were still active, which may cause the MECs
> to use stale HQDs when they are unhalted before the compute queues are
> mapped (restoring MQDs to HQDs).
> 
> So I am sending this patch to fix the problems above.
> The patch makes two main changes:
> One is to reset CPC and CPF before resuming KCQ.
> The other is to disable HQDs before unhalting the MECs.
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 37 +++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 90bbddb45730..0c01701488e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -4071,6 +4071,39 @@ static int gfx_v9_0_hw_init(struct amdgpu_ip_block *ip_block)
>  	return r;
>  }
>  
> +static void gfx_v9_0_deactivate_kcq_hqd(struct amdgpu_device *adev)
> +{
> +	for (int i = 0; i < adev->gfx.num_compute_rings; i++) {
> +		u32 tmp;
> +		struct amdgpu_ring *ring = &adev->gfx.compute_ring[i];
> +
> +		mutex_lock(&adev->srbm_mutex);
> +		soc15_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0, 0);
> +		tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
> +		/* disable the queue if it's active */
> +		if (tmp & CP_HQD_ACTIVE__ACTIVE_MASK) {
> +			int j;
> +
> +			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 1);
> +			for (j = 0; j < adev->usec_timeout; j++) {
> +				tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
> +				if (!(tmp & CP_HQD_ACTIVE__ACTIVE_MASK))
> +					break;
> +				udelay(1);
> +			}
> +			if (j == AMDGPU_MAX_USEC_TIMEOUT) {
> +				DRM_DEBUG("comp_%u_%u_%u dequeue request failed.\n",
> +							ring->me, ring->pipe, ring->queue);
> +				/* Manual disable if dequeue request times out */
> +				WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
> +			}
> +			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 0);
> +		}
> +		soc15_grbm_select(adev, 0, 0, 0, 0, 0);
> +		mutex_unlock(&adev->srbm_mutex);
> +	}
> +}
> +
>  static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
>  {
>  	struct amdgpu_device *adev = ip_block->adev;
> @@ -4095,6 +4128,10 @@ static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
>  		return 0;
>  	}
>  
> +	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
> +		amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
> +		gfx_v9_0_deactivate_kcq_hqd(adev);
> +
>  	/* Use deinitialize sequence from CAIL when unbinding device from driver,
>  	 * otherwise KIQ is hanging when binding back
>  	 */




