* [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset
@ 2025-12-19 18:21 Alex Deucher
2025-12-19 18:21 ` [PATCH 2/5] drm/amdgpu: avoid a warning in timedout job handler Alex Deucher
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

We need to check if the fence was signaled without an
error as the per queue resets may have signalled the fence
while attempting to reset the queue.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 52a23fcaf617c..5d4fb20f719c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6539,7 +6539,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 *
 	 * job->base holds a reference to parent fence
 	 */
-	if (job && dma_fence_is_signaled(&job->hw_fence->base)) {
+	if (job && (dma_fence_get_status(&job->hw_fence->base) > 0)) {
 		job_signaled = true;
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
 		goto skip_hw_reset;
--
2.52.0
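
Editorial sketch, not part of the patch: dma_fence_get_status() returns 0
while a fence is unsignaled, 1 once it has signaled without an error, and a
negative errno if it completed with an error. The user-space model below only
illustrates why "> 0" differs from the old dma_fence_is_signaled() test.
struct model_fence and the skip_hw_reset_*() helpers are hypothetical
stand-ins, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct dma_fence. */
struct model_fence {
	bool signaled;
	int error;	/* 0, or a negative errno set before signaling */
};

/* Mirrors the documented return convention of dma_fence_get_status(). */
static int model_fence_get_status(const struct model_fence *f)
{
	assert(f->error <= 0);
	if (!f->signaled)
		return 0;
	return f->error ? f->error : 1;
}

/* Old check: any signaled fence skips the HW reset, even one with an error. */
static bool skip_hw_reset_old(const struct model_fence *f)
{
	return f->signaled;
}

/* New check: only a fence that signaled cleanly skips the HW reset. */
static bool skip_hw_reset_new(const struct model_fence *f)
{
	return model_fence_get_status(f) > 0;
}
```

With this model, a fence that a failed per-queue reset signaled with an error
no longer short-circuits the adapter reset, while a cleanly signaled fence
still does.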
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/5] drm/amdgpu: avoid a warning in timedout job handler
  2025-12-19 18:21 [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
@ 2025-12-19 18:21 ` Alex Deucher
  2025-12-19 18:21 ` [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset Alex Deucher
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Timur Kristóf

Only set an error on the fence if the fence is not
signalled. We can end up with a warning if the per
queue reset path signals the fence and sets an error
as part of the reset, but fails to recover.

Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 67fde99724bad..7f5d01164897f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -147,7 +147,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
 	}

-	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+	if (dma_fence_get_status(&s_job->s_fence->finished) == 0)
+		dma_fence_set_error(&s_job->s_fence->finished, -ETIME);

 	amdgpu_vm_put_task_info(ti);

--
2.52.0

^ permalink raw reply related [flat|nested] 7+ messages in thread
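
Editorial sketch, not part of the patch: dma_fence_set_error() triggers a
WARN_ON() when the fence has already signaled, which is the warning the guard
above avoids. The user-space model below counts such warnings instead of
splatting. struct model_fence, its warnings counter, and
model_timedout_mark() are hypothetical names, and -ECANCELED is only an
illustrative error value for what a reset path might have set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct dma_fence. */
struct model_fence {
	bool signaled;
	int error;		/* 0, or a negative errno */
	unsigned int warnings;	/* counts what would be WARN_ON() hits */
};

/* Same return convention as dma_fence_get_status(). */
static int model_fence_get_status(const struct model_fence *f)
{
	if (!f->signaled)
		return 0;
	return f->error ? f->error : 1;
}

/* Like dma_fence_set_error(): warns if the fence has already signaled. */
static void model_fence_set_error(struct model_fence *f, int error)
{
	assert(error < 0);
	if (f->signaled)
		f->warnings++;
	f->error = error;
}

/* Guarded variant, modeled on the new check in amdgpu_job_timedout(). */
static void model_timedout_mark(struct model_fence *f)
{
	if (model_fence_get_status(f) == 0)
		model_fence_set_error(f, -ETIME);
}
```

In this model, a fence that the reset path already signaled with an error
keeps that error and produces no warning, while a still-pending fence is
marked with -ETIME as before.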
* [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset
  2025-12-19 18:21 [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
  2025-12-19 18:21 ` [PATCH 2/5] drm/amdgpu: avoid a warning in timedout job handler Alex Deucher
@ 2025-12-19 18:21 ` Alex Deucher
  2025-12-19 19:36   ` Alex Deucher
  2025-12-19 18:21 ` [PATCH 4/5] drm/amdgpu/gfx9: rework pipeline sync packet sequence Alex Deucher
  2025-12-19 18:22 ` [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset Alex Deucher
  3 siblings, 1 reply; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Mark fences with errors before we reset the rings as
we may end up signalling fences as part of the reset
sequence. The error needs to be set before the fence
is signalled. On GC10 and newer, this isn't a problem
since we don't signal any fences. On GC9, we need to
signal the fence after the reset to unblock the recovery
sequence.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 600e6bb98af7a..5defdebd7091e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -872,6 +872,10 @@ void amdgpu_ring_reset_helper_begin(struct amdgpu_ring *ring,
 	drm_sched_wqueue_stop(&ring->sched);
 	/* back up the non-guilty commands */
 	amdgpu_ring_backup_unprocessed_commands(ring, guilty_fence);
+	/* signal the guilty fence and set an error on all fences from the context */
+	if (guilty_fence)
+		amdgpu_fence_driver_guilty_force_completion(guilty_fence);
+
 }

 int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
@@ -885,9 +889,6 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
 	if (r)
 		return r;

-	/* signal the guilty fence and set an error on all fences from the context */
-	if (guilty_fence)
-		amdgpu_fence_driver_guilty_force_completion(guilty_fence);
 	/* Re-emit the non-guilty commands */
 	if (ring->ring_backup_entries_to_copy) {
 		amdgpu_ring_alloc_reemit(ring, ring->ring_backup_entries_to_copy);
--
2.52.0

^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset
  2025-12-19 18:21 ` [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset Alex Deucher
@ 2025-12-19 19:36   ` Alex Deucher
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 19:36 UTC (permalink / raw)
  To: Alex Deucher; +Cc: amd-gfx

On Fri, Dec 19, 2025 at 1:31 PM Alex Deucher <alexander.deucher@amd.com> wrote:
>
> Mark fences with errors before we reset the rings as
> we may end up signalling fences as part of the reset
> sequence. The error needs to be set before the fence
> is signalled. On GC10 and newer, this isn't a problem
> since we don't signal any fences. On GC9, we need to
> signal the fence after the reset to unblock the recovery
> sequence.

This patch is no longer necessary and can be dropped.

Alex

>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 600e6bb98af7a..5defdebd7091e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -872,6 +872,10 @@ void amdgpu_ring_reset_helper_begin(struct amdgpu_ring *ring,
>  	drm_sched_wqueue_stop(&ring->sched);
>  	/* back up the non-guilty commands */
>  	amdgpu_ring_backup_unprocessed_commands(ring, guilty_fence);
> +	/* signal the guilty fence and set an error on all fences from the context */
> +	if (guilty_fence)
> +		amdgpu_fence_driver_guilty_force_completion(guilty_fence);
> +
>  }
>
>  int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
> @@ -885,9 +889,6 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
>  	if (r)
>  		return r;
>
> -	/* signal the guilty fence and set an error on all fences from the context */
> -	if (guilty_fence)
> -		amdgpu_fence_driver_guilty_force_completion(guilty_fence);
>  	/* Re-emit the non-guilty commands */
>  	if (ring->ring_backup_entries_to_copy) {
>  		amdgpu_ring_alloc_reemit(ring, ring->ring_backup_entries_to_copy);
> --
> 2.52.0
>

^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 4/5] drm/amdgpu/gfx9: rework pipeline sync packet sequence
  2025-12-19 18:21 [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
  2025-12-19 18:21 ` [PATCH 2/5] drm/amdgpu: avoid a warning in timedout job handler Alex Deucher
  2025-12-19 18:21 ` [PATCH 3/5] drm/amdgpu: mark fences with errors before ring reset Alex Deucher
@ 2025-12-19 18:21 ` Alex Deucher
  2025-12-19 18:22 ` [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset Alex Deucher
  3 siblings, 0 replies; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Replace WAIT_REG_MEM with EVENT_WRITE flushes for all
shader types and ACQUIRE_MEM. That should accomplish
the same thing and avoid having to wait on a fence,
preventing any issues with pipeline syncs during
queue resets.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 67 +++++++++++++++------------
 1 file changed, 37 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 7b012ca1153ea..0d8e797d59b8a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -5572,15 +5572,42 @@ static void gfx_v9_0_ring_emit_fence(struct amdgpu_ring *ring, u64 addr,
 	amdgpu_ring_write(ring, 0);
 }

-static void gfx_v9_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
+static void gfx_v9_0_ring_emit_event_write(struct amdgpu_ring *ring,
+					   uint32_t event_type,
+					   uint32_t event_index)
 {
-	int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
-	uint32_t seq = ring->fence_drv.sync_seq;
-	uint64_t addr = ring->fence_drv.gpu_addr;
+	amdgpu_ring_write(ring, PACKET3(PACKET3_EVENT_WRITE, 0));
+	amdgpu_ring_write(ring, EVENT_TYPE(event_type) |
+			  EVENT_INDEX(event_index));
+}
+
+static void gfx_v9_0_emit_mem_sync(struct amdgpu_ring *ring)
+{
+	const unsigned int cp_coher_cntl =
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_SH_ICACHE_ACTION_ENA(1) |
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_SH_KCACHE_ACTION_ENA(1) |
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TC_ACTION_ENA(1) |
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TCL1_ACTION_ENA(1) |
+			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TC_WB_ACTION_ENA(1);

-	gfx_v9_0_wait_reg_mem(ring, usepfp, 1, 0,
-			      lower_32_bits(addr), upper_32_bits(addr),
-			      seq, 0xffffffff, 4);
+	/* ACQUIRE_MEM -make one or more surfaces valid for use by the subsequent operations */
+	amdgpu_ring_write(ring, PACKET3(PACKET3_ACQUIRE_MEM, 5));
+	amdgpu_ring_write(ring, cp_coher_cntl); /* CP_COHER_CNTL */
+	amdgpu_ring_write(ring, 0xffffffff); /* CP_COHER_SIZE */
+	amdgpu_ring_write(ring, 0xffffff); /* CP_COHER_SIZE_HI */
+	amdgpu_ring_write(ring, 0); /* CP_COHER_BASE */
+	amdgpu_ring_write(ring, 0); /* CP_COHER_BASE_HI */
+	amdgpu_ring_write(ring, 0x0000000A); /* POLL_INTERVAL */
+}
+
+static void gfx_v9_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
+{
+	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
+		gfx_v9_0_ring_emit_event_write(ring, VS_PARTIAL_FLUSH, 4);
+		gfx_v9_0_ring_emit_event_write(ring, PS_PARTIAL_FLUSH, 4);
+	}
+	gfx_v9_0_ring_emit_event_write(ring, CS_PARTIAL_FLUSH, 4);
+	gfx_v9_0_emit_mem_sync(ring);
 }

 static void gfx_v9_0_ring_emit_vm_flush(struct amdgpu_ring *ring,
@@ -7071,25 +7098,6 @@ static void gfx_v9_0_query_ras_error_count(struct amdgpu_device *adev,
 	gfx_v9_0_query_utc_edc_status(adev, err_data);
 }

-static void gfx_v9_0_emit_mem_sync(struct amdgpu_ring *ring)
-{
-	const unsigned int cp_coher_cntl =
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_SH_ICACHE_ACTION_ENA(1) |
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_SH_KCACHE_ACTION_ENA(1) |
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TC_ACTION_ENA(1) |
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TCL1_ACTION_ENA(1) |
-			PACKET3_ACQUIRE_MEM_CP_COHER_CNTL_TC_WB_ACTION_ENA(1);
-
-	/* ACQUIRE_MEM -make one or more surfaces valid for use by the subsequent operations */
-	amdgpu_ring_write(ring, PACKET3(PACKET3_ACQUIRE_MEM, 5));
-	amdgpu_ring_write(ring, cp_coher_cntl); /* CP_COHER_CNTL */
-	amdgpu_ring_write(ring, 0xffffffff); /* CP_COHER_SIZE */
-	amdgpu_ring_write(ring, 0xffffff); /* CP_COHER_SIZE_HI */
-	amdgpu_ring_write(ring, 0); /* CP_COHER_BASE */
-	amdgpu_ring_write(ring, 0); /* CP_COHER_BASE_HI */
-	amdgpu_ring_write(ring, 0x0000000A); /* POLL_INTERVAL */
-}
-
 static void gfx_v9_0_emit_wave_limit_cs(struct amdgpu_ring *ring,
 					uint32_t pipe, bool enable)
 {
@@ -7404,7 +7412,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_gfx = {
 	.set_wptr = gfx_v9_0_ring_set_wptr_gfx,
 	.emit_frame_size = /* totally 242 maximum if 16 IBs */
 		5 + /* COND_EXEC */
-		7 + /* PIPELINE_SYNC */
+		13 + /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		2 + /* VM_FLUSH */
@@ -7460,7 +7468,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_sw_ring_funcs_gfx = {
 	.set_wptr = amdgpu_sw_ring_set_wptr_gfx,
 	.emit_frame_size = /* totally 242 maximum if 16 IBs */
 		5 + /* COND_EXEC */
-		7 + /* PIPELINE_SYNC */
+		13 + /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		2 + /* VM_FLUSH */
@@ -7521,7 +7529,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_compute = {
 		20 + /* gfx_v9_0_ring_emit_gds_switch */
 		7 + /* gfx_v9_0_ring_emit_hdp_flush */
 		5 + /* hdp invalidate */
-		7 + /* gfx_v9_0_ring_emit_pipeline_sync */
+		9 + /* gfx_v9_0_ring_emit_pipeline_sync */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		8 + 8 + 8 + /* gfx_v9_0_ring_emit_fence x3 for user fence, vm fence */
@@ -7564,7 +7572,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_kiq = {
 		20 + /* gfx_v9_0_ring_emit_gds_switch */
 		7 + /* gfx_v9_0_ring_emit_hdp_flush */
 		5 + /* hdp invalidate */
-		7 + /* gfx_v9_0_ring_emit_pipeline_sync */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		8 + 8 + 8, /* gfx_v9_0_ring_emit_fence_kiq x3 for user fence, vm fence */
--
2.52.0

^ permalink raw reply related [flat|nested] 7+ messages in thread
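
Editorial sketch, not part of the patch: the new emit_frame_size entries (13
for the gfx rings, 9 for compute) follow from the packets that the reworked
gfx_v9_0_ring_emit_pipeline_sync() writes. A PACKET3 with count field n is one
header dword plus n + 1 payload dwords. The user-space arithmetic below
reproduces those totals; enum model_ring_type and the model_*() helpers are
hypothetical names used only for illustration.

```c
#include <assert.h>

/* Hypothetical ring classification for this sketch. */
enum model_ring_type { MODEL_RING_GFX, MODEL_RING_COMPUTE };

/* Dwords occupied by a PACKET3 whose count field is n: header + (n + 1). */
static unsigned int model_packet3_dwords(unsigned int count)
{
	assert(count <= 0x3fff);	/* the count field is 14 bits wide */
	return count + 2;
}

/* Dwords emitted by the reworked pipeline sync for a given ring type. */
static unsigned int model_pipeline_sync_dwords(enum model_ring_type type)
{
	unsigned int n = 0;

	if (type == MODEL_RING_GFX) {
		n += model_packet3_dwords(0);	/* EVENT_WRITE VS_PARTIAL_FLUSH */
		n += model_packet3_dwords(0);	/* EVENT_WRITE PS_PARTIAL_FLUSH */
	}
	n += model_packet3_dwords(0);		/* EVENT_WRITE CS_PARTIAL_FLUSH */
	n += model_packet3_dwords(5);		/* ACQUIRE_MEM */
	return n;
}
```

Each EVENT_WRITE costs 2 dwords and ACQUIRE_MEM costs 7, so the gfx path is
3 * 2 + 7 and the compute path is 2 + 7.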
* [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset
  2025-12-19 18:21 [PATCH 1/5] drm/amdgpu: use dma_fence_get_status() for adapter reset Alex Deucher
                   ` (2 preceding siblings ...)
  2025-12-19 18:21 ` [PATCH 4/5] drm/amdgpu/gfx9: rework pipeline sync packet sequence Alex Deucher
@ 2025-12-19 18:22 ` Alex Deucher
  2025-12-19 20:46   ` Timur Kristóf
  3 siblings, 1 reply; 7+ messages in thread
From: Alex Deucher @ 2025-12-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Jiqian Chen

GFX ring resets work differently on pre-GFX10 hardware since
there is no MQD managed by the scheduler.
For ring reset, you need to issue the reset via CP_VMID_RESET
via KIQ or MMIO and submit the following to the gfx ring to
complete the reset:
1. EOP packet with EXEC bit set
2. WAIT_REG_MEM to wait for the fence
3. Clear CP_VMID_RESET to 0
4. EVENT_WRITE ENABLE_LEGACY_PIPELINE
5. EOP packet with EXEC bit set
6. WAIT_REG_MEM to wait for the fence
Once those commands have completed the reset should
be complete and the ring can accept new packets.

Tested-by: Jiqian Chen <Jiqian.Chen@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 92 ++++++++++++++++++++++++++-
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 0d8e797d59b8a..7e9d753f4a808 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -2411,8 +2411,10 @@ static int gfx_v9_0_sw_init(struct amdgpu_ip_block *ip_block)
 		amdgpu_get_soft_full_reset_mask(&adev->gfx.gfx_ring[0]);
 	adev->gfx.compute_supported_reset =
 		amdgpu_get_soft_full_reset_mask(&adev->gfx.compute_ring[0]);
-	if (!amdgpu_sriov_vf(adev) && !adev->debug_disable_gpu_ring_reset)
+	if (!amdgpu_sriov_vf(adev) && !adev->debug_disable_gpu_ring_reset) {
 		adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
+		adev->gfx.gfx_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
+	}

 	r = amdgpu_gfx_kiq_init(adev, GFX9_MEC_HPD_SIZE, 0);
 	if (r) {
@@ -7172,6 +7174,91 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
 	amdgpu_ring_insert_nop(ring, num_nop - 1);
 }

+static void gfx_v9_0_ring_emit_wreg_me(struct amdgpu_ring *ring,
+				       uint32_t reg,
+				       uint32_t val)
+{
+	uint32_t cmd = 0;
+
+	switch (ring->funcs->type) {
+	case AMDGPU_RING_TYPE_KIQ:
+		cmd = (1 << 16); /* no inc addr */
+		break;
+	default:
+		cmd = WR_CONFIRM;
+		break;
+	}
+	amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
+	amdgpu_ring_write(ring, cmd);
+	amdgpu_ring_write(ring, reg);
+	amdgpu_ring_write(ring, 0);
+	amdgpu_ring_write(ring, val);
+}
+
+static int gfx_v9_0_reset_kgq(struct amdgpu_ring *ring,
+			      unsigned int vmid,
+			      struct amdgpu_fence *timedout_fence)
+{
+	struct amdgpu_device *adev = ring->adev;
+	struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
+	struct amdgpu_ring *kiq_ring = &kiq->ring;
+	unsigned long flags;
+	u32 tmp;
+	int r;
+
+	amdgpu_ring_reset_helper_begin(ring, timedout_fence);
+
+	spin_lock_irqsave(&kiq->ring_lock, flags);
+
+	if (amdgpu_ring_alloc(kiq_ring, 5)) {
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
+	}
+
+	/* send the reset - 5 */
+	tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
+	gfx_v9_0_ring_emit_wreg(kiq_ring,
+				SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), tmp);
+	amdgpu_ring_commit(kiq_ring);
+	r = amdgpu_ring_test_ring(kiq_ring);
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+	if (r)
+		return r;
+
+	if (amdgpu_ring_alloc(ring, 8 + 7 + 5 + 2 + 8 + 7))
+		return -ENOMEM;
+	/* emit the fence to finish the reset - 8 */
+	ring->trail_seq++;
+	gfx_v9_0_ring_emit_fence(ring, ring->trail_fence_gpu_addr,
+				 ring->trail_seq, AMDGPU_FENCE_FLAG_EXEC);
+	/* wait for the fence - 7 */
+	gfx_v9_0_wait_reg_mem(ring, 0, 1, 0,
+			      lower_32_bits(ring->trail_fence_gpu_addr),
+			      upper_32_bits(ring->trail_fence_gpu_addr),
+			      ring->trail_seq, 0xffffffff, 4);
+	/* clear mmCP_VMID_RESET - 5 */
+	gfx_v9_0_ring_emit_wreg_me(ring,
+				   SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0);
+	/* event write ENABLE_LEGACY_PIPELINE - 2 */
+	gfx_v9_0_ring_emit_event_write(ring, ENABLE_LEGACY_PIPELINE, 0);
+	/* emit a regular fence - 8 */
+	ring->trail_seq++;
+	gfx_v9_0_ring_emit_fence(ring, ring->trail_fence_gpu_addr,
+				 ring->trail_seq, AMDGPU_FENCE_FLAG_EXEC);
+	/* wait for the fence - 7 */
+	gfx_v9_0_wait_reg_mem(ring, 1, 1, 0,
+			      lower_32_bits(ring->trail_fence_gpu_addr),
+			      upper_32_bits(ring->trail_fence_gpu_addr),
+			      ring->trail_seq, 0xffffffff, 4);
+	amdgpu_ring_commit(ring);
+	/* wait for the commands to complete */
+	r = amdgpu_ring_test_ring(ring);
+	if (r)
+		return r;
+
+	return amdgpu_ring_reset_helper_end(ring, timedout_fence);
+}
+
 static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
 			      unsigned int vmid,
 			      struct amdgpu_fence *timedout_fence)
@@ -7450,9 +7537,9 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_gfx = {
 	.emit_wreg = gfx_v9_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v9_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
 	.emit_cleaner_shader = gfx_v9_0_ring_emit_cleaner_shader,
+	.reset = gfx_v9_0_reset_kgq,
 	.begin_use = amdgpu_gfx_enforce_isolation_ring_begin_use,
 	.end_use = amdgpu_gfx_enforce_isolation_ring_end_use,
 };
@@ -7551,7 +7638,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_compute = {
 	.emit_wreg = gfx_v9_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v9_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
 	.emit_wave_limit = gfx_v9_0_emit_wave_limit,
 	.reset = gfx_v9_0_reset_kcq,
--
2.52.0

^ permalink raw reply related [flat|nested] 7+ messages in thread
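
Editorial sketch, not part of the patch: the six steps listed in the commit
message map one to one onto the dword budget passed to amdgpu_ring_alloc() in
gfx_v9_0_reset_kgq() (8 + 7 + 5 + 2 + 8 + 7). The user-space table below just
restates that mapping and sums it, using the per-step sizes from the code
comments in the patch. struct model_reset_step and model_kgq_reset_dwords()
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One command submitted to the gfx ring and its size in dwords. */
struct model_reset_step {
	const char *what;
	unsigned int dwords;
};

/* Steps 1-6 from the commit message, sized as in the patch's comments. */
static const struct model_reset_step model_kgq_reset_steps[] = {
	{ "EOP fence with EXEC bit set",          8 },
	{ "WAIT_REG_MEM on the trailing fence",   7 },
	{ "WRITE_DATA clearing CP_VMID_RESET",    5 },
	{ "EVENT_WRITE ENABLE_LEGACY_PIPELINE",   2 },
	{ "EOP fence with EXEC bit set",          8 },
	{ "WAIT_REG_MEM on the trailing fence",   7 },
};

/* Total ring space the sequence needs, in dwords. */
static unsigned int model_kgq_reset_dwords(void)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < sizeof(model_kgq_reset_steps) /
			sizeof(model_kgq_reset_steps[0]); i++) {
		assert(model_kgq_reset_steps[i].dwords > 0);
		total += model_kgq_reset_steps[i].dwords;
	}
	return total;
}
```

The CP_VMID_RESET request itself goes through the KIQ ring beforehand and is
budgeted separately (5 dwords) in the patch.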
* Re: [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset
  2025-12-19 18:22 ` [PATCH 5/5] drm/amdgpu/gfx9: Implement KGQ ring reset Alex Deucher
@ 2025-12-19 20:46   ` Timur Kristóf
  0 siblings, 0 replies; 7+ messages in thread
From: Timur Kristóf @ 2025-12-19 20:46 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Jiqian Chen, Alex Deucher

On Friday, December 19, 2025 12:22:00 Central Standard Time Alex Deucher wrote:
> GFX ring resets work differently on pre-GFX10 hardware since
> there is no MQD managed by the scheduler.
> For ring reset, you need to issue the reset via CP_VMID_RESET
> via KIQ or MMIO and submit the following to the gfx ring to
> complete the reset:
> 1. EOP packet with EXEC bit set
> 2. WAIT_REG_MEM to wait for the fence
> 3. Clear CP_VMID_RESET to 0
> 4. EVENT_WRITE ENABLE_LEGACY_PIPELINE
> 5. EOP packet with EXEC bit set
> 6. WAIT_REG_MEM to wait for the fence
> Once those commands have completed the reset should
> be complete and the ring can accept new packets.
>
> Tested-by: Jiqian Chen <Jiqian.Chen@amd.com> (v1)
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Hi Alex,

Thank you for working on this.
For the entire series,
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>

I can't test it at the moment but can give it a try in January or so.

Best regards,
Timur

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 92 ++++++++++++++++++++++++++-
>  1 file changed, 89 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 0d8e797d59b8a..7e9d753f4a808 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -2411,8 +2411,10 @@ static int gfx_v9_0_sw_init(struct amdgpu_ip_block *ip_block)
>  		amdgpu_get_soft_full_reset_mask(&adev->gfx.gfx_ring[0]);
>  	adev->gfx.compute_supported_reset =
>  		amdgpu_get_soft_full_reset_mask(&adev->gfx.compute_ring[0]);
> -	if (!amdgpu_sriov_vf(adev) && !adev->debug_disable_gpu_ring_reset)
> +	if (!amdgpu_sriov_vf(adev) && !adev->debug_disable_gpu_ring_reset) {
>  		adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
> +		adev->gfx.gfx_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
> +	}
>
>  	r = amdgpu_gfx_kiq_init(adev, GFX9_MEC_HPD_SIZE, 0);
>  	if (r) {
> @@ -7172,6 +7174,91 @@ static void gfx_v9_ring_insert_nop(struct amdgpu_ring *ring, uint32_t num_nop)
>  	amdgpu_ring_insert_nop(ring, num_nop - 1);
>  }
>
> +static void gfx_v9_0_ring_emit_wreg_me(struct amdgpu_ring *ring,
> +				       uint32_t reg,
> +				       uint32_t val)
> +{
> +	uint32_t cmd = 0;
> +
> +	switch (ring->funcs->type) {
> +	case AMDGPU_RING_TYPE_KIQ:
> +		cmd = (1 << 16); /* no inc addr */
> +		break;
> +	default:
> +		cmd = WR_CONFIRM;
> +		break;
> +	}
> +	amdgpu_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
> +	amdgpu_ring_write(ring, cmd);
> +	amdgpu_ring_write(ring, reg);
> +	amdgpu_ring_write(ring, 0);
> +	amdgpu_ring_write(ring, val);
> +}
> +
> +static int gfx_v9_0_reset_kgq(struct amdgpu_ring *ring,
> +			      unsigned int vmid,
> +			      struct amdgpu_fence *timedout_fence)
> +{
> +	struct amdgpu_device *adev = ring->adev;
> +	struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> +	struct amdgpu_ring *kiq_ring = &kiq->ring;
> +	unsigned long flags;
> +	u32 tmp;
> +	int r;
> +
> +	amdgpu_ring_reset_helper_begin(ring, timedout_fence);
> +
> +	spin_lock_irqsave(&kiq->ring_lock, flags);
> +
> +	if (amdgpu_ring_alloc(kiq_ring, 5)) {
> +		spin_unlock_irqrestore(&kiq->ring_lock, flags);
> +		return -ENOMEM;
> +	}
> +
> +	/* send the reset - 5 */
> +	tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
> +	gfx_v9_0_ring_emit_wreg(kiq_ring,
> +				SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), tmp);
> +	amdgpu_ring_commit(kiq_ring);
> +	r = amdgpu_ring_test_ring(kiq_ring);
> +	spin_unlock_irqrestore(&kiq->ring_lock, flags);
> +	if (r)
> +		return r;
> +
> +	if (amdgpu_ring_alloc(ring, 8 + 7 + 5 + 2 + 8 + 7))
> +		return -ENOMEM;
> +	/* emit the fence to finish the reset - 8 */
> +	ring->trail_seq++;
> +	gfx_v9_0_ring_emit_fence(ring, ring->trail_fence_gpu_addr,
> +				 ring->trail_seq, AMDGPU_FENCE_FLAG_EXEC);
> +	/* wait for the fence - 7 */
> +	gfx_v9_0_wait_reg_mem(ring, 0, 1, 0,
> +			      lower_32_bits(ring->trail_fence_gpu_addr),
> +			      upper_32_bits(ring->trail_fence_gpu_addr),
> +			      ring->trail_seq, 0xffffffff, 4);
> +	/* clear mmCP_VMID_RESET - 5 */
> +	gfx_v9_0_ring_emit_wreg_me(ring,
> +				   SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0);
> +	/* event write ENABLE_LEGACY_PIPELINE - 2 */
> +	gfx_v9_0_ring_emit_event_write(ring, ENABLE_LEGACY_PIPELINE, 0);
> +	/* emit a regular fence - 8 */
> +	ring->trail_seq++;
> +	gfx_v9_0_ring_emit_fence(ring, ring->trail_fence_gpu_addr,
> +				 ring->trail_seq, AMDGPU_FENCE_FLAG_EXEC);
> +	/* wait for the fence - 7 */
> +	gfx_v9_0_wait_reg_mem(ring, 1, 1, 0,
> +			      lower_32_bits(ring->trail_fence_gpu_addr),
> +			      upper_32_bits(ring->trail_fence_gpu_addr),
> +			      ring->trail_seq, 0xffffffff, 4);
> +	amdgpu_ring_commit(ring);
> +	/* wait for the commands to complete */
> +	r = amdgpu_ring_test_ring(ring);
> +	if (r)
> +		return r;
> +
> +	return amdgpu_ring_reset_helper_end(ring, timedout_fence);
> +}
> +
>  static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
>  			      unsigned int vmid,
>  			      struct amdgpu_fence *timedout_fence)
> @@ -7450,9 +7537,9 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_gfx = {
>  	.emit_wreg = gfx_v9_0_ring_emit_wreg,
>  	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
>  	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
> -	.soft_recovery = gfx_v9_0_ring_soft_recovery,
>  	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
>  	.emit_cleaner_shader = gfx_v9_0_ring_emit_cleaner_shader,
> +	.reset = gfx_v9_0_reset_kgq,
>  	.begin_use = amdgpu_gfx_enforce_isolation_ring_begin_use,
>  	.end_use = amdgpu_gfx_enforce_isolation_ring_end_use,
>  };
> @@ -7551,7 +7638,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_compute = {
>  	.emit_wreg = gfx_v9_0_ring_emit_wreg,
>  	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
>  	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
> -	.soft_recovery = gfx_v9_0_ring_soft_recovery,
>  	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
>  	.emit_wave_limit = gfx_v9_0_emit_wave_limit,
>  	.reset = gfx_v9_0_reset_kcq,

^ permalink raw reply [flat|nested] 7+ messages in thread