* [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset
@ 2025-12-19 18:21 Alex Deucher
  2025-12-19 18:21 ` [PATCH 2/5] drm/amdgpu: avoid a warning in timedout job handler Alex Deucher
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

We need to check whether the fence was signaled without an
error, as the per-queue resets may have signaled the fence
while attempting to reset the queue.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
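Note for readers: a minimal user-space sketch of the distinction
this patch relies on. It assumes the documented return convention
of dma_fence_get_status() (0 while pending, 1 when signaled
without an error, a negative errno when signaled with an error).
All mock_* names are hypothetical stand-ins, not kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dma_fence; not the kernel type. */
struct mock_fence {
	bool signaled;
	int error;	/* 0 or a negative errno, set before signaling */
};

/* Boolean view: true once signaled, whether or not an error was set. */
static bool mock_fence_is_signaled(const struct mock_fence *f)
{
	return f->signaled;
}

/*
 * Assumed convention of dma_fence_get_status():
 * 0 = not signaled, 1 = signaled cleanly, <0 = signaled with error.
 */
static int mock_fence_get_status(const struct mock_fence *f)
{
	if (!f->signaled)
		return 0;
	return f->error ? f->error : 1;
}

/* The check from the patch: only a clean completion skips the HW reset. */
static bool mock_skip_hw_reset(const struct mock_fence *f)
{
	return mock_fence_get_status(f) > 0;
}

/* Walk through the three states a guilty job's fence can be in. */
static void mock_status_example(void)
{
	struct mock_fence pending = { false, 0 };
	struct mock_fence clean = { true, 0 };
	struct mock_fence failed = { true, -ETIME };

	assert(!mock_skip_hw_reset(&pending));
	assert(mock_skip_hw_reset(&clean));
	/* Signaled with an error: the boolean check alone would skip. */
	assert(mock_fence_is_signaled(&failed));
	assert(!mock_skip_hw_reset(&failed));
}
```

In this model, a fence that a failed per-queue reset signaled with
an error no longer short-circuits the adapter reset, whereas the
plain signaled check would have skipped it.
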
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 52a23fcaf617c..5d4fb20f719c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6539,7 +6539,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 *
 	 * job->base holds a reference to parent fence
 	 */
-	if (job && dma_fence_is_signaled(&job->hw_fence->base)) {
+	if (job && (dma_fence_get_status(&job->hw_fence->base) > 0)) {
 		job_signaled = true;
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
 		goto skip_hw_reset;
-- 
2.52.0



* [PATCH 2/5] drm/amdgpu: avoid a warning in timedout job handler
  2025-12-19 18:21 [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
@ 2025-12-19 18:21 ` Alex Deucher
  2025-12-19 18:21 ` [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset Alex Deucher
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Timur Kristóf

Only set an error on the fence if the fence is not
signalled.  We can end up with a warning if the
per queue reset path signals the fence and sets an error
as part of the reset, but fails to recover.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
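Note for readers: a minimal user-space sketch of the guard added
here. It assumes that setting an error on an already-signaled
fence is what triggers the warning mentioned above; the model
counts such calls instead of warning. All mock_* names are
hypothetical stand-ins, not kernel API, and -ECANCELED is only an
illustrative error value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dma_fence; not the kernel type. */
struct mock_fence {
	bool signaled;
	int error;	/* 0 or a negative errno */
	int warnings;	/* counts what would be a warning in the kernel */
};

/* 0 = not signaled, 1 = signaled cleanly, <0 = signaled with error. */
static int mock_fence_get_status(const struct mock_fence *f)
{
	if (!f->signaled)
		return 0;
	return f->error ? f->error : 1;
}

/* Assumption: setting an error on a signaled fence triggers a warning. */
static void mock_fence_set_error(struct mock_fence *f, int error)
{
	if (f->signaled)
		f->warnings++;
	f->error = error;
}

/* Guarded form from the patch: only mark fences that are still pending. */
static void mock_mark_timedout(struct mock_fence *f)
{
	if (mock_fence_get_status(f) == 0)
		mock_fence_set_error(f, -ETIME);
}

/* A pending fence gets -ETIME; an already-failed one is left alone. */
static void mock_timedout_example(void)
{
	struct mock_fence pending = { false, 0, 0 };
	struct mock_fence reset_failed = { true, -ECANCELED, 0 };

	mock_mark_timedout(&pending);
	mock_mark_timedout(&reset_failed);

	assert(pending.error == -ETIME && pending.warnings == 0);
	assert(reset_failed.error == -ECANCELED);
	assert(reset_failed.warnings == 0);
}
```

The guard also preserves whatever error the reset path already
recorded, rather than overwriting it with -ETIME.
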
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 67fde99724bad..7f5d01164897f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -147,7 +147,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
 	}
 
-	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+	if (dma_fence_get_status(&s_job->s_fence->finished) == 0)
+		dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
 
 	amdgpu_vm_put_task_info(ti);
 
-- 
2.52.0



* [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset
  2025-12-19 18:21 [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
  2025-12-19 18:21 ` [PATCH 2/5] drm/amdgpu: avoid a warning in timedout job handler Alex Deucher
@ 2025-12-19 18:21 ` Alex Deucher
  2025-12-19 19:36   ` Alex Deucher
  2025-12-19 18:21 ` [PATCH 4/5] drm/amdgpu/gfx9: rework pipeline sync packet sequence Alex Deucher
  2025-12-19 18:22 ` [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset Alex Deucher
  3 siblings, 1 reply; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Mark fences with errors before we reset the rings as
we may end up signalling fences as part of the reset
sequence.  The error needs to be set before the fence
is signalled.  On GC10 and newer, this isn't a problem
since we don't signal any fences.  On GC9, we need to
signal the fence after the reset to unblock the recovery
sequence.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 600e6bb98af7a..5defdebd7091e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -872,6 +872,10 @@ void amdgpu_ring_reset_helper_begin(struct amdgpu_ring *ring,
 	drm_sched_wqueue_stop(&ring->sched);
 	/* back up the non-guilty commands */
 	amdgpu_ring_backup_unprocessed_commands(ring, guilty_fence);
+	/* signal the guilty fence and set an error on all fences from the context */
+	if (guilty_fence)
+		amdgpu_fence_driver_guilty_force_completion(guilty_fence);
+
 }
 
 int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
@@ -885,9 +889,6 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
 	if (r)
 		return r;
 
-	/* signal the guilty fence and set an error on all fences from the context */
-	if (guilty_fence)
-		amdgpu_fence_driver_guilty_force_completion(guilty_fence);
 	/* Re-emit the non-guilty commands */
 	if (ring->ring_backup_entries_to_copy) {
 		amdgpu_ring_alloc_reemit(ring, ring->ring_backup_entries_to_copy);
-- 
2.52.0



* Re: [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset
  2025-12-19 18:21 ` [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset Alex Deucher
@ 2025-12-19 19:36   ` Alex Deucher
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 19:36 UTC (permalink / raw)
  To: Alex Deucher; +Cc: amd-gfx

On Fri, Dec 19, 2025 at 1:31 PM Alex Deucher <alexander.deucher@amd.com> wrote:
>
> Mark fences with errors before we reset the rings as
> we may end up signalling fences as part of the reset
> sequence.  The error needs to be set before the fence
> is signalled.  On GC10 and newer, this isn't a problem
> since we don't signal any fences.  On GC9, we need to
> signal the fence after the reset to unblock the recovery
> sequence.

This patch is no longer necessary and can be dropped.

Alex



* [PATCH 4/5] drm/amdgpu/gfx9: rework pipeline sync packet sequence
  2025-12-19 18:21 [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
  2025-12-19 18:21 ` [PATCH 2/5] drm/amdgpu: avoid a warning in timedout job handler Alex Deucher
  2025-12-19 18:21 ` [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset Alex Deucher
@ 2025-12-19 18:21 ` Alex Deucher
  2025-12-19 18:22 ` [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset Alex Deucher
  3 siblings, 0 replies; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Replace WAIT_REG_MEM with EVENT_WRITE flushes for all
shader types, followed by ACQUIRE_MEM.  That should
accomplish the same thing without having to wait on a
fence, which prevents any issues with pipeline syncs
during queue resets.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
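Note for readers: a small sketch of where the new emit_frame_size
numbers (13 for gfx, 9 for compute) come from. It assumes the
usual PM4 type-3 convention that PACKET3(op, n) is one header
dword followed by n + 1 payload dwords, which matches the number
of amdgpu_ring_write() calls in the helpers below. The function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* One header dword plus (count + 1) payload dwords. */
static unsigned int pkt3_dwords(unsigned int count)
{
	return count + 2;
}

/* Dwords emitted by the reworked pipeline sync for a given ring type. */
static unsigned int pipeline_sync_dwords(bool is_gfx_ring)
{
	/* PACKET3(PACKET3_EVENT_WRITE, 0) */
	unsigned int event_write = pkt3_dwords(0);
	/* PACKET3(PACKET3_ACQUIRE_MEM, 5) */
	unsigned int acquire_mem = pkt3_dwords(5);
	/* gfx flushes VS + PS + CS; compute flushes CS only */
	unsigned int flushes = is_gfx_ring ? 3 : 1;

	return flushes * event_write + acquire_mem;
}

/* Cross-check against the sizes used in the ring funcs tables. */
static void pipeline_sync_example(void)
{
	assert(pipeline_sync_dwords(true) == 13);
	assert(pipeline_sync_dwords(false) == 9);
}
```

The old sequence was a single 7-dword WAIT_REG_MEM on every ring
type, which is why both the gfx and compute entries change.
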
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 67 +++++++++++++++------------
 1 file changed, 37 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 7b012ca1153ea..0d8e797d59b8a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -5572,15 +5572,42 @@ static void gfx_v9_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr,
 	amdgpu_ring_write(ring, 0);
 }
 
-static void gfx_v9_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
+static void gfx_v9_0_ring_emit_event_write(struct amdgpu_ring *ring,
+					   uint32_t event_type,
+					   uint32_t event_index)
 {
-	int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
-	uint32_t seq = ring->fence_drv.sync_seq;
-	uint64_t addr = ring->fence_drv.gpu_addr;
+	amdgpu_ring_write(ring, PACKET3(PACKET3_EVENT_WRITE, 0));
+	amdgpu_ring_write(ring, EVENT_TYPE(event_type) |
+			  EVENT_INDEX(event_index));
+}
+
+static void gfx_v9_0_emit_mem_sync(struct amdgpu_ring *ring)
+{
+	const unsigned int cp_coher_cntl =
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_SH_ICACHE_ACTION_ENA(1) |
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_SH_KCACHE_ACTION_ENA(1) |
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TC_ACTION_ENA(1) |
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TCL1_ACTION_ENA(1) |
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TC_WB_ACTION_ENA(1);
 
-	gfx_v9_0_wait_reg_mem(ring, usepfp, 1, 0,
-			      lower_32_bits(addr), upper_32_bits(addr),
-			      seq, 0xffffffff, 4);
+	/* ACQUIRE_MEM -make one or more surfaces valid for use by the subsequent operations */
+	amdgpu_ring_write(ring, PACKET3(PACKET3_ACQUIRE_MEM, 5));
+	amdgpu_ring_write(ring, cp_coher_cntl); /* CP_COHER_CNTL */
+	amdgpu_ring_write(ring, 0xffffffff);  /* CP_COHER_SIZE */
+	amdgpu_ring_write(ring, 0xffffff);  /* CP_COHER_SIZE_HI */
+	amdgpu_ring_write(ring, 0); /* CP_COHER_BASE */
+	amdgpu_ring_write(ring, 0);  /* CP_COHER_BASE_HI */
+	amdgpu_ring_write(ring, 0x0000000A); /* POLL_INTERVAL */
+}
+
+static void gfx_v9_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
+{
+	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
+		gfx_v9_0_ring_emit_event_write(ring, VS_PARTIAL_FLUSH, 4);
+		gfx_v9_0_ring_emit_event_write(ring, PS_PARTIAL_FLUSH, 4);
+	}
+	gfx_v9_0_ring_emit_event_write(ring, CS_PARTIAL_FLUSH, 4);
+	gfx_v9_0_emit_mem_sync(ring);
 }
 
 static void gfx_v9_0_ring_emit_vm_flush(struct amdgpu_ring *ring,
@@ -7071,25 +7098,6 @@ static void gfx_v9_0_query_ras_error_count(struct amdgpu_device *adev,
 	gfx_v9_0_query_utc_edc_status(adev, err_data);
 }
 
-static void gfx_v9_0_emit_mem_sync(struct amdgpu_ring *ring)
-{
-	const unsigned int cp_coher_cntl =
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_SH_ICACHE_ACTION_ENA(1) |
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_SH_KCACHE_ACTION_ENA(1) |
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TC_ACTION_ENA(1) |
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TCL1_ACTION_ENA(1) |
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TC_WB_ACTION_ENA(1);
-
-	/* ACQUIRE_MEM -make one or more surfaces valid for use by the subsequent operations */
-	amdgpu_ring_write(ring, PACKET3(PACKET3_ACQUIRE_MEM, 5));
-	amdgpu_ring_write(ring, cp_coher_cntl); /* CP_COHER_CNTL */
-	amdgpu_ring_write(ring, 0xffffffff);  /* CP_COHER_SIZE */
-	amdgpu_ring_write(ring, 0xffffff);  /* CP_COHER_SIZE_HI */
-	amdgpu_ring_write(ring, 0); /* CP_COHER_BASE */
-	amdgpu_ring_write(ring, 0);  /* CP_COHER_BASE_HI */
-	amdgpu_ring_write(ring, 0x0000000A); /* POLL_INTERVAL */
-}
-
 static void gfx_v9_0_emit_wave_limit_cs(struct amdgpu_ring *ring,
 					uint32_t pipe, bool enable)
 {
@@ -7404,7 +7412,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_gfx = {
 	.set_wptr = gfx_v9_0_ring_set_wptr_gfx,
 	.emit_frame_size = /* totally 242 maximum if 16 IBs */
 		5 +  /* COND_EXEC */
-		7 +  /* PIPELINE_SYNC */
+		13 +  /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		2 + /* VM_FLUSH */
@@ -7460,7 +7468,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_sw_ring_funcs_gfx = {
 	.set_wptr = amdgpu_sw_ring_set_wptr_gfx,
 	.emit_frame_size = /* totally 242 maximum if 16 IBs */
 		5 +  /* COND_EXEC */
-		7 +  /* PIPELINE_SYNC */
+		13 +  /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		2 + /* VM_FLUSH */
@@ -7521,7 +7529,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_compute = {
 		20 + /* gfx_v9_0_ring_emit_gds_switch */
 		7 + /* gfx_v9_0_ring_emit_hdp_flush */
 		5 + /* hdp invalidate */
-		7 + /* gfx_v9_0_ring_emit_pipeline_sync */
+		9 + /* gfx_v9_0_ring_emit_pipeline_sync */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		8 + 8 + 8 + /* gfx_v9_0_ring_emit_fence x3 for user fence, vm fence */
@@ -7564,7 +7572,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_kiq = {
 		20 + /* gfx_v9_0_ring_emit_gds_switch */
 		7 + /* gfx_v9_0_ring_emit_hdp_flush */
 		5 + /* hdp invalidate */
-		7 + /* gfx_v9_0_ring_emit_pipeline_sync */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		8 + 8 + 8, /* gfx_v9_0_ring_emit_fence_kiq x3 for user fence, vm fence */
-- 
2.52.0



* [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset
  2025-12-19 18:21 [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
                   ` (2 preceding siblings ...)
  2025-12-19 18:21 ` [PATCH 4/5] drm/amdgpu/gfx9: rework pipeline sync packet sequence Alex Deucher
@ 2025-12-19 18:22 ` Alex Deucher
  2025-12-19 20:46   ` Timur Kristóf
  3 siblings, 1 reply; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Jiqian Chen

GFX ring resets work differently on pre-GFX10 hardware since
there is no MQD managed by the scheduler.
For a ring reset, you need to issue the reset via CP_VMID_RESET,
using KIQ or MMIO, and submit the following to the gfx ring to
complete the reset:
1. EOP packet with EXEC bit set
2. WAIT_REG_MEM to wait for the fence
3. Clear CP_VMID_RESET to 0
4. EVENT_WRITE ENABLE_LEGACY_PIPELINE
5. EOP packet with EXEC bit set
6. WAIT_REG_MEM to wait for the fence
Once those commands have completed, the reset should
be complete and the ring can accept new packets.

Tested-by: Jiqian Chen <Jiqian.Chen@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
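Note for readers: a small sketch that tabulates the six steps
above with the dword counts taken from the comments in
gfx_v9_0_reset_kgq(), and sums them to the size passed to
amdgpu_ring_alloc(). The struct and function names are
hypothetical; only the per-step sizes come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per packet submitted to the gfx ring during the reset. */
struct reset_step {
	const char *what;
	unsigned int dwords;
};

static const struct reset_step kgq_reset_steps[] = {
	{ "EOP fence with EXEC bit set",         8 },
	{ "WAIT_REG_MEM on the trailing fence",  7 },
	{ "WRITE_DATA clearing CP_VMID_RESET",   5 },
	{ "EVENT_WRITE ENABLE_LEGACY_PIPELINE",  2 },
	{ "EOP fence with EXEC bit set",         8 },
	{ "WAIT_REG_MEM on the trailing fence",  7 },
};

/* Total ring space needed for the whole completion sequence. */
static unsigned int kgq_reset_dwords(void)
{
	size_t n = sizeof(kgq_reset_steps) / sizeof(kgq_reset_steps[0]);
	unsigned int total = 0;

	for (size_t i = 0; i < n; i++)
		total += kgq_reset_steps[i].dwords;
	return total;
}

/* Matches amdgpu_ring_alloc(ring, 8 + 7 + 5 + 2 + 8 + 7). */
static void kgq_reset_example(void)
{
	assert(kgq_reset_dwords() == 8 + 7 + 5 + 2 + 8 + 7);
}
```

The reset request itself is a separate 5-dword write on the KIQ
ring and is not part of this budget.
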
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 92 ++++++++++++++++++++++++++-
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 0d8e797d59b8a..7e9d753f4a808 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -2411,8 +2411,10 @@ static int gfx_v9_0_sw_init(struct amdgpu_ip_block *ip_block)
 		amdgpu_get_soft_full_reset_mask(&adev->gfx.gfx_ring[0]);
 	adev->gfx.compute_supported_reset =
 		amdgpu_get_soft_full_reset_mask(&adev->gfx.compute_ring[0]);
-	if (!amdgpu_sriov_vf(adev) && !adev->debug_disable_gpu_ring_reset)
+	if (!amdgpu_sriov_vf(adev) && !adev->debug_disable_gpu_ring_reset) {
 		adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
+		adev->gfx.gfx_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
+	}
 
 	r = amdgpu_gfx_kiq_init(adev, GFX9_MEC_HPD_SIZE, 0);
 	if (r) {
@@ -7172,6 +7174,91 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
 	amdgpu_ring_insert_nop(ring, num_nop - 1);
 }
 
+static void gfx_v9_0_ring_emit_wreg_me(struct amdgpu_ring *ring,
+				       uint32_t reg,
+				       uint32_t val)
+{
+	uint32_t cmd = 0;
+
+	switch (ring->funcs->type) {
+	case AMDGPU_RING_TYPE_KIQ:
+		cmd = (1 << 16); /* no inc addr */
+		break;
+	default:
+		cmd = WR_CONFIRM;
+		break;
+	}
+	amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
+	amdgpu_ring_write(ring, cmd);
+	amdgpu_ring_write(ring, reg);
+	amdgpu_ring_write(ring, 0);
+	amdgpu_ring_write(ring, val);
+}
+
+static int gfx_v9_0_reset_kgq(struct amdgpu_ring *ring,
+			      unsigned int vmid,
+			      struct amdgpu_fence *timedout_fence)
+{
+	struct amdgpu_device *adev = ring->adev;
+	struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
+	struct amdgpu_ring *kiq_ring = &kiq->ring;
+	unsigned long flags;
+	u32 tmp;
+	int r;
+
+	amdgpu_ring_reset_helper_begin(ring, timedout_fence);
+
+	spin_lock_irqsave(&kiq->ring_lock, flags);
+
+	if (amdgpu_ring_alloc(kiq_ring, 5)) {
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
+	}
+
+	/* send the reset - 5 */
+	tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
+	gfx_v9_0_ring_emit_wreg(kiq_ring,
+				SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), tmp);
+	amdgpu_ring_commit(kiq_ring);
+	r = amdgpu_ring_test_ring(kiq_ring);
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+	if (r)
+		return r;
+
+	if (amdgpu_ring_alloc(ring, 8 + 7 + 5 + 2 + 8 + 7))
+		return -ENOMEM;
+	/* emit the fence to finish the reset - 8 */
+	ring->trail_seq++;
+	gfx_v9_0_ring_emit_fence(ring, ring->trail_fence_gpu_addr,
+				 ring->trail_seq, AMDGPU_FENCE_FLAG_EXEC);
+	/* wait for the fence - 7 */
+	gfx_v9_0_wait_reg_mem(ring, 0, 1, 0,
+			      lower_32_bits(ring->trail_fence_gpu_addr),
+			      upper_32_bits(ring->trail_fence_gpu_addr),
+			      ring->trail_seq, 0xffffffff, 4);
+	/* clear mmCP_VMID_RESET - 5 */
+	gfx_v9_0_ring_emit_wreg_me(ring,
+				   SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0);
+	/* event write ENABLE_LEGACY_PIPELINE - 2 */
+	gfx_v9_0_ring_emit_event_write(ring, ENABLE_LEGACY_PIPELINE, 0);
+	/* emit a regular fence - 8 */
+	ring->trail_seq++;
+	gfx_v9_0_ring_emit_fence(ring, ring->trail_fence_gpu_addr,
+				 ring->trail_seq, AMDGPU_FENCE_FLAG_EXEC);
+	/* wait for the fence - 7 */
+	gfx_v9_0_wait_reg_mem(ring, 1, 1, 0,
+			      lower_32_bits(ring->trail_fence_gpu_addr),
+			      upper_32_bits(ring->trail_fence_gpu_addr),
+			      ring->trail_seq, 0xffffffff, 4);
+	amdgpu_ring_commit(ring);
+	/* wait for the commands to complete */
+	r = amdgpu_ring_test_ring(ring);
+	if (r)
+		return r;
+
+	return amdgpu_ring_reset_helper_end(ring, timedout_fence);
+}
+
 static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
 			      unsigned int vmid,
 			      struct amdgpu_fence *timedout_fence)
@@ -7450,9 +7537,9 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_gfx = {
 	.emit_wreg = gfx_v9_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v9_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
 	.emit_cleaner_shader = gfx_v9_0_ring_emit_cleaner_shader,
+	.reset = gfx_v9_0_reset_kgq,
 	.begin_use = amdgpu_gfx_enforce_isolation_ring_begin_use,
 	.end_use = amdgpu_gfx_enforce_isolation_ring_end_use,
 };
@@ -7551,7 +7638,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_compute = {
 	.emit_wreg = gfx_v9_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v9_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
 	.emit_wave_limit = gfx_v9_0_emit_wave_limit,
 	.reset = gfx_v9_0_reset_kcq,
-- 
2.52.0



* Re: [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset
  2025-12-19 18:22 ` [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset Alex Deucher
@ 2025-12-19 20:46   ` Timur Kristóf
  0 siblings, 0 replies; 7+ messages in thread
From: Timur Kristóf @ 2025-12-19 20:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Jiqian Chen, Alex Deucher

On Friday, December 19, 2025, 12:22:00 Central Standard Time, Alex
Deucher wrote:
> GFX ring resets work differently on pre-GFX10 hardware since
> there is no MQD managed by the scheduler.
> For ring reset, you need to issue the reset via CP_VMID_RESET
> via KIQ or MMIO and submit the following to the gfx ring to
> complete the reset:
> 1. EOP packet with EXEC bit set
> 2. WAIT_REG_MEM to wait for the fence
> 3. Clear CP_VMID_RESET to 0
> 4. EVENT_WRITE ENABLE_LEGACY_PIPELINE
> 5. EOP packet with EXEC bit set
> 6. WAIT_REG_MEM to wait for the fence
> Once those commands have completed the reset should
> be complete and the ring can accept new packets.
> 
> Tested-by: Jiqian Chen <Jiqian.Chen@amd.com> (v1)
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Hi Alex,

Thank you for working on this.
For the entire series,
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>

I can't test it at the moment but can give it a try in January or so.

Best regards,
Timur

