[PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support"

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support"
@ 2025-05-19 18:22 Alex Deucher
  2025-05-19 18:22 ` [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Alex Deucher @ 2025-05-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

This reverts commit b7a1a0ef12b81957584fef7b61e2d5ec049c7209.

A user reported stuttering under heavy gfx load with this commit.
I suspect it's due to the fact that the gfx contexts are shared
between the pipes so if there is alot of load on one pipe, we could
end up stalling waiting for a context.

On top of that, disabling the second pipe may fix the reliability
of vmid resets.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3519
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 75ea071744eb5..14cbd1f08eb5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -4781,7 +4781,7 @@ static int gfx_v10_0_sw_init(struct amdgpu_ip_block *ip_block)
 	case IP_VERSION(10, 3, 3):
 	case IP_VERSION(10, 3, 7):
 		adev->gfx.me.num_me = 1;
-		adev->gfx.me.num_pipe_per_me = 2;
+		adev->gfx.me.num_pipe_per_me = 1;
 		adev->gfx.me.num_queue_per_pipe = 2;
 		adev->gfx.mec.num_mec = 2;
 		adev->gfx.mec.num_pipe_per_mec = 4;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction
  2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
@ 2025-05-19 18:22 ` Alex Deucher
  2025-05-20 13:09   ` Alex Deucher
  2025-05-19 18:22 ` [PATCH 3/8] drm/amdgpu: rework gfx9 queue reset Alex Deucher
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Alex Deucher @ 2025-05-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Christian König, Christian König, Alex Deucher

From: Christian König <ckoenig.leichtzumerken@gmail.com>

Stopping the scheduler for queue reset is generally a good idea because
it prevents any worker from touching the ring buffer.

But using amdgpu_fence_driver_force_completion() before restarting it was
a really bad idea because it marked fences as failed while the work was
potentially still running.

Stop doing that and cleanup the comment a bit.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index acb21fc8b3ce5..a0fab947143b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	} else if (amdgpu_gpu_recovery && ring->funcs->reset) {
 		bool is_guilty;
 
-		dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
-		/* stop the scheduler, but don't mess with the
-		 * bad job yet because if ring reset fails
-		 * we'll fall back to full GPU reset.
+		dev_err(adev->dev, "Starting %s ring reset\n",
+			s_job->sched->name);
+
+		/*
+		 * Stop the scheduler to prevent anybody else from touching the
+		 * ring buffer.
 		 */
 		drm_sched_wqueue_stop(&ring->sched);
 
@@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 
 		r = amdgpu_ring_reset(ring, job->vmid);
 		if (!r) {
-			if (amdgpu_ring_sched_ready(ring))
-				drm_sched_stop(&ring->sched, s_job);
-			if (is_guilty) {
+			if (is_guilty)
 				atomic_inc(&ring->adev->gpu_reset_counter);
-				amdgpu_fence_driver_force_completion(ring);
-			}
-			if (amdgpu_ring_sched_ready(ring))
-				drm_sched_start(&ring->sched, 0);
-			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
-			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+			drm_sched_wqueue_start(&ring->sched);
+			dev_err(adev->dev, "Ring %s reset succeeded\n",
+				ring->sched.name);
+			drm_dev_wedged_event(adev_to_drm(adev),
+					     DRM_WEDGE_RECOVERY_NONE);
 			goto exit;
 		}
-		dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
+		dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
 	}
 	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/8] drm/amdgpu: rework gfx9 queue reset
  2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
  2025-05-19 18:22 ` [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
@ 2025-05-19 18:22 ` Alex Deucher
  2025-05-19 18:22 ` [PATCH 4/8] drm/amdgpu: rework gfx7 " Alex Deucher
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Alex Deucher @ 2025-05-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Christian König, Christian König, Alex Deucher

From: Christian König <ckoenig.leichtzumerken@gmail.com>

Testing this feature turned out that it was a bit unstable. The
CP_VMID_RESET register takes the VMID which all submissions from should
be canceled.

Unlike Windows Linux uses per process VMIDs instead of per engine VMIDs
for the simple reason that we don't have enough. So resetting one VMID
only killed the submissions of one specific process.

Fortunately that turned out to be exactly what we want to have.

So clear the CP_VMID_RESET register between every context switch between
applications when we do the pipeline sync to avoid trouble if multiple
VMIDs are used on the ring right behind each other.

Use the same pipeline sync function in the reset handler and issue an IB
test instead of a ring test after the queue reset to provide a longer
timeout and additional fence value should there be additional work on
the ring after the one aborted.

Also drop the soft recovery since that pretty much does the same thing as
CP_VMID_RESET, just on a lower level and with less chance of succeeding.

This now survives a stress test running over night sending a broken
submission ever 45 seconds and recovering fine from each of them.

v2: fix up pipeline_sync count, only emit vmid reset on gfx (Alex)

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 50 +++++++++++----------------
 2 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 836ea081088af..af79a03abc110 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -278,6 +278,7 @@ extern int amdgpu_user_queue;
 #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS	        3000
 #define AMDGPU_MAX_USEC_TIMEOUT			100000	/* 100 ms */
 #define AMDGPU_FENCE_JIFFIES_TIMEOUT		(HZ / 2)
+#define AMDGPU_QUEUE_RESET_TIMEOUT		(HZ / 10)
 #define AMDGPU_DEBUGFS_MAX_COMPONENTS		32
 #define AMDGPUFB_CONN_LIMIT			4
 #define AMDGPU_BIOS_NUM_SCRATCH			16
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index d377a7c57d5e1..cfcedfc8aa6e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -5565,6 +5565,19 @@ static void gfx_v9_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
 	int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
 	uint32_t seq = ring->fence_drv.sync_seq;
 	uint64_t addr = ring->fence_drv.gpu_addr;
+	struct amdgpu_device *adev = ring->adev;
+
+	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
+		amdgpu_ring_emit_reg_wait(ring,
+					  SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET),
+					  0, 0xffff);
+		amdgpu_ring_emit_wreg(ring,
+				      SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET),
+				      0);
+		amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
+				       ring->fence_drv.sync_seq,
+				       AMDGPU_FENCE_FLAG_EXEC);
+	}
 
 	gfx_v9_0_wait_reg_mem(ring, usepfp, 1, 0,
 			      lower_32_bits(addr), upper_32_bits(addr),
@@ -5896,20 +5909,6 @@ static void gfx_v9_0_ring_emit_reg_write_reg_wait(struct amdgpu_ring *ring,
 							   ref, mask);
 }
 
-static void gfx_v9_0_ring_soft_recovery(struct amdgpu_ring *ring, unsigned vmid)
-{
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t value = 0;
-
-	value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
-	value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
-	value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
-	value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
-	amdgpu_gfx_rlc_enter_safe_mode(adev, 0);
-	WREG32_SOC15(GC, 0, mmSQ_CMD, value);
-	amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
-}
-
 static void gfx_v9_0_set_gfx_eop_interrupt_state(struct amdgpu_device *adev,
 						 enum amdgpu_interrupt_state state)
 {
@@ -7185,16 +7184,12 @@ static int gfx_v9_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	if (r)
 		return r;
 
-	if (amdgpu_ring_alloc(ring, 7 + 7 + 5))
+	if (amdgpu_ring_alloc(ring, 7 + 7 + 5 + 7))
 		return -ENOMEM;
-	gfx_v9_0_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
-				 ring->fence_drv.sync_seq, AMDGPU_FENCE_FLAG_EXEC);
-	gfx_v9_0_ring_emit_reg_wait(ring,
-				    SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffff);
-	gfx_v9_0_ring_emit_wreg(ring,
-				SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0);
+	gfx_v9_0_ring_emit_pipeline_sync(ring);
+	amdgpu_ring_commit(ring);
 
-	return amdgpu_ring_test_ring(ring);
+	return gfx_v9_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
 }
 
 static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
@@ -7437,7 +7432,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_gfx = {
 	.set_wptr = gfx_v9_0_ring_set_wptr_gfx,
 	.emit_frame_size = /* totally 242 maximum if 16 IBs */
 		5 +  /* COND_EXEC */
-		7 +  /* PIPELINE_SYNC */
+		7 + 7 + 5 + 8 +  /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		2 + /* VM_FLUSH */
@@ -7475,7 +7470,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_gfx = {
 	.emit_wreg = gfx_v9_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v9_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
 	.reset = gfx_v9_0_reset_kgq,
 	.emit_cleaner_shader = gfx_v9_0_ring_emit_cleaner_shader,
@@ -7494,7 +7488,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_sw_ring_funcs_gfx = {
 	.set_wptr = amdgpu_sw_ring_set_wptr_gfx,
 	.emit_frame_size = /* totally 242 maximum if 16 IBs */
 		5 +  /* COND_EXEC */
-		7 +  /* PIPELINE_SYNC */
+		7 + 7 + 5 + 8 + /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		2 + /* VM_FLUSH */
@@ -7533,7 +7527,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_sw_ring_funcs_gfx = {
 	.emit_wreg = gfx_v9_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v9_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
 	.patch_cntl = gfx_v9_0_ring_patch_cntl,
 	.patch_de = gfx_v9_0_ring_patch_de_meta,
@@ -7555,7 +7548,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_compute = {
 		20 + /* gfx_v9_0_ring_emit_gds_switch */
 		7 + /* gfx_v9_0_ring_emit_hdp_flush */
 		5 + /* hdp invalidate */
-		7 + /* gfx_v9_0_ring_emit_pipeline_sync */
+		7 +  /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		8 + 8 + 8 + /* gfx_v9_0_ring_emit_fence x3 for user fence, vm fence */
@@ -7577,7 +7570,6 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_compute = {
 	.emit_wreg = gfx_v9_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v9_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v9_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v9_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
 	.emit_wave_limit = gfx_v9_0_emit_wave_limit,
 	.reset = gfx_v9_0_reset_kcq,
@@ -7598,7 +7590,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_kiq = {
 		20 + /* gfx_v9_0_ring_emit_gds_switch */
 		7 + /* gfx_v9_0_ring_emit_hdp_flush */
 		5 + /* hdp invalidate */
-		7 + /* gfx_v9_0_ring_emit_pipeline_sync */
+		7 + /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		8 + 8 + 8, /* gfx_v9_0_ring_emit_fence_kiq x3 for user fence, vm fence */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/8] drm/amdgpu: rework gfx7 queue reset
  2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
  2025-05-19 18:22 ` [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
  2025-05-19 18:22 ` [PATCH 3/8] drm/amdgpu: rework gfx9 queue reset Alex Deucher
@ 2025-05-19 18:22 ` Alex Deucher
  2025-05-19 18:22 ` [PATCH 5/8] drm/amdgpu: rework gfx8 " Alex Deucher
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Alex Deucher @ 2025-05-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Christian König, Christian König, Alex Deucher

From: Christian König <ckoenig.leichtzumerken@gmail.com>

Apply the same changes to gfx7 as done to gfx9.

Untested and probably needs some more work.

v2: fix up pipeline_sync count, only emit vmid reset on gfx (Alex)

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c | 89 ++++++++++++---------------
 1 file changed, 40 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
index da0534ff1271a..fc2f1fc26a3d2 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
@@ -3095,6 +3095,33 @@ static int gfx_v7_0_cp_resume(struct amdgpu_device *adev)
 	return 0;
 }
 
+static void gfx_v7_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
+				  int mem_space, int opt, uint32_t addr0,
+				  uint32_t addr1, uint32_t ref, uint32_t mask,
+				  uint32_t inv)
+{
+	amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
+	amdgpu_ring_write(ring,
+			  /* memory (1) or register (0) */
+			  (WAIT_REG_MEM_MEM_SPACE(mem_space) |
+			   WAIT_REG_MEM_OPERATION(opt) | /* wait */
+			   WAIT_REG_MEM_FUNCTION(3) |  /* equal */
+			   WAIT_REG_MEM_ENGINE(eng_sel)));
+
+	WARN_ON(mem_space && addr0 & 0x3); /* Dword align */
+	amdgpu_ring_write(ring, addr0);
+	amdgpu_ring_write(ring, addr1);
+	amdgpu_ring_write(ring, ref);
+	amdgpu_ring_write(ring, mask);
+	amdgpu_ring_write(ring, inv); /* poll interval */
+}
+
+static void gfx_v7_0_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
+					uint32_t val, uint32_t mask)
+{
+	gfx_v7_0_wait_reg_mem(ring, 0, 0, 0, reg, 0, val, mask, 0x20);
+}
+
 /**
  * gfx_v7_0_ring_emit_pipeline_sync - cik vm flush using the CP
  *
@@ -3109,6 +3136,13 @@ static void gfx_v7_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
 	uint32_t seq = ring->fence_drv.sync_seq;
 	uint64_t addr = ring->fence_drv.gpu_addr;
 
+	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
+		gfx_v7_0_ring_emit_reg_wait(ring, mmCP_VMID_RESET, 0, 0xffff);
+		amdgpu_ring_emit_wreg(ring, mmCP_VMID_RESET, 0);
+		amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
+				       ring->fence_drv.sync_seq,
+				       AMDGPU_FENCE_FLAG_EXEC);
+	}
 	amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
 	amdgpu_ring_write(ring, (WAIT_REG_MEM_MEM_SPACE(1) | /* memory */
 				 WAIT_REG_MEM_FUNCTION(3) | /* equal */
@@ -3998,18 +4032,6 @@ static void gfx_v7_0_ring_emit_gds_switch(struct amdgpu_ring *ring,
 	amdgpu_ring_write(ring, (1 << (oa_size + oa_base)) - (1 << oa_base));
 }
 
-static void gfx_v7_0_ring_soft_recovery(struct amdgpu_ring *ring, unsigned vmid)
-{
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t value = 0;
-
-	value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
-	value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
-	value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
-	value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
-	WREG32(mmSQ_CMD, value);
-}
-
 static uint32_t wave_read_ind(struct amdgpu_device *adev, uint32_t simd, uint32_t wave, uint32_t address)
 {
 	WREG32(mmSQ_IND_INDEX,
@@ -4884,34 +4906,6 @@ static void gfx_v7_0_emit_mem_sync_compute(struct amdgpu_ring *ring)
 	amdgpu_ring_write(ring, 0x0000000A);	/* poll interval */
 }
 
-static void gfx_v7_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
-				  int mem_space, int opt, uint32_t addr0,
-				  uint32_t addr1, uint32_t ref, uint32_t mask,
-				  uint32_t inv)
-{
-	amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
-	amdgpu_ring_write(ring,
-			  /* memory (1) or register (0) */
-			  (WAIT_REG_MEM_MEM_SPACE(mem_space) |
-			   WAIT_REG_MEM_OPERATION(opt) | /* wait */
-			   WAIT_REG_MEM_FUNCTION(3) |  /* equal */
-			   WAIT_REG_MEM_ENGINE(eng_sel)));
-
-	if (mem_space)
-		BUG_ON(addr0 & 0x3); /* Dword align */
-	amdgpu_ring_write(ring, addr0);
-	amdgpu_ring_write(ring, addr1);
-	amdgpu_ring_write(ring, ref);
-	amdgpu_ring_write(ring, mask);
-	amdgpu_ring_write(ring, inv); /* poll interval */
-}
-
-static void gfx_v7_0_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
-					uint32_t val, uint32_t mask)
-{
-	gfx_v7_0_wait_reg_mem(ring, 0, 0, 0, reg, 0, val, mask, 0x20);
-}
-
 static int gfx_v7_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 {
 	struct amdgpu_device *adev = ring->adev;
@@ -4944,14 +4938,13 @@ static int gfx_v7_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	if (r)
 		return r;
 
-	if (amdgpu_ring_alloc(ring, 7 + 12 + 5))
+	if (amdgpu_ring_alloc(ring, 7 + 12 + 5 + 7 + 4))
 		return -ENOMEM;
-	gfx_v7_0_ring_emit_fence_gfx(ring, ring->fence_drv.gpu_addr,
-				     ring->fence_drv.sync_seq, AMDGPU_FENCE_FLAG_EXEC);
-	gfx_v7_0_ring_emit_reg_wait(ring, mmCP_VMID_RESET, 0, 0xffff);
-	gfx_v7_0_ring_emit_wreg(ring, mmCP_VMID_RESET, 0);
 
-	return amdgpu_ring_test_ring(ring);
+	gfx_v7_0_ring_emit_pipeline_sync(ring);
+	amdgpu_ring_commit(ring);
+
+	return gfx_v7_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
 }
 
 static const struct amd_ip_funcs gfx_v7_0_ip_funcs = {
@@ -4984,7 +4977,7 @@ static const struct amdgpu_ring_funcs gfx_v7_0_ring_funcs_gfx = {
 		7 + /* gfx_v7_0_ring_emit_hdp_flush */
 		5 + /* hdp invalidate */
 		12 + 12 + 12 + /* gfx_v7_0_ring_emit_fence_gfx x3 for user fence, vm fence */
-		7 + 4 + /* gfx_v7_0_ring_emit_pipeline_sync */
+		7 + 12 + 5 + 7 + 4 + /* gfx_v7_0_ring_emit_pipeline_sync */
 		CIK_FLUSH_GPU_TLB_NUM_WREG * 5 + 7 + 6 + /* gfx_v7_0_ring_emit_vm_flush */
 		3 + 4 + /* gfx_v7_ring_emit_cntxcntl including vgt flush*/
 		5, /* SURFACE_SYNC */
@@ -5001,7 +4994,6 @@ static const struct amdgpu_ring_funcs gfx_v7_0_ring_funcs_gfx = {
 	.pad_ib = amdgpu_ring_generic_pad_ib,
 	.emit_cntxcntl = gfx_v7_ring_emit_cntxcntl,
 	.emit_wreg = gfx_v7_0_ring_emit_wreg,
-	.soft_recovery = gfx_v7_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v7_0_emit_mem_sync,
 	.reset = gfx_v7_0_reset_kgq,
 };
@@ -5034,7 +5026,6 @@ static const struct amdgpu_ring_funcs gfx_v7_0_ring_funcs_compute = {
 	.insert_nop = amdgpu_ring_insert_nop,
 	.pad_ib = amdgpu_ring_generic_pad_ib,
 	.emit_wreg = gfx_v7_0_ring_emit_wreg,
-	.soft_recovery = gfx_v7_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v7_0_emit_mem_sync_compute,
 };
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/8] drm/amdgpu: rework gfx8 queue reset
  2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
                   ` (2 preceding siblings ...)
  2025-05-19 18:22 ` [PATCH 4/8] drm/amdgpu: rework gfx7 " Alex Deucher
@ 2025-05-19 18:22 ` Alex Deucher
  2025-05-19 18:22 ` [PATCH 6/8] drm/amdgpu: rework gfx10 " Alex Deucher
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Alex Deucher @ 2025-05-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Christian König, Christian König, Alex Deucher

From: Christian König <ckoenig.leichtzumerken@gmail.com>

Apply the same changes to gfx8 as done to gfx9.

Untested and probably needs some more work.

v2: fix up pipeline_sync count, only emit vmid reset on gfx (Alex)

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 90 ++++++++++++---------------
 1 file changed, 41 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index 5ee2237d8ee8f..f09f029ecd9d5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -6140,12 +6140,47 @@ static void gfx_v8_0_ring_emit_fence_gfx(struct amdgpu_ring *ring, u64 addr,
 
 }
 
+static void gfx_v8_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
+				  int mem_space, int opt, uint32_t addr0,
+				  uint32_t addr1, uint32_t ref, uint32_t mask,
+				  uint32_t inv)
+{
+	amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
+	amdgpu_ring_write(ring,
+			  /* memory (1) or register (0) */
+			  (WAIT_REG_MEM_MEM_SPACE(mem_space) |
+			   WAIT_REG_MEM_OPERATION(opt) | /* wait */
+			   WAIT_REG_MEM_FUNCTION(3) |  /* equal */
+			   WAIT_REG_MEM_ENGINE(eng_sel)));
+
+	WARN_ON(mem_space && addr0 & 0x3); /* Dword align */
+	amdgpu_ring_write(ring, addr0);
+	amdgpu_ring_write(ring, addr1);
+	amdgpu_ring_write(ring, ref);
+	amdgpu_ring_write(ring, mask);
+	amdgpu_ring_write(ring, inv); /* poll interval */
+}
+
+static void gfx_v8_0_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
+					uint32_t val, uint32_t mask)
+{
+	gfx_v8_0_wait_reg_mem(ring, 0, 0, 0, reg, 0, val, mask, 0x20);
+}
+
 static void gfx_v8_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
 {
 	int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
 	uint32_t seq = ring->fence_drv.sync_seq;
 	uint64_t addr = ring->fence_drv.gpu_addr;
 
+	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
+		gfx_v8_0_ring_emit_reg_wait(ring, mmCP_VMID_RESET, 0, 0xffff);
+		amdgpu_ring_emit_wreg(ring, mmCP_VMID_RESET, 0);
+		amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
+				       ring->fence_drv.sync_seq,
+				       AMDGPU_FENCE_FLAG_EXEC);
+	}
+
 	amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
 	amdgpu_ring_write(ring, (WAIT_REG_MEM_MEM_SPACE(1) | /* memory */
 				 WAIT_REG_MEM_FUNCTION(3) | /* equal */
@@ -6339,46 +6374,6 @@ static void gfx_v8_0_ring_emit_wreg(struct amdgpu_ring *ring, uint32_t reg,
 	amdgpu_ring_write(ring, val);
 }
 
-static void gfx_v8_0_wait_reg_mem(struct amdgpu_ring *ring, int eng_sel,
-				  int mem_space, int opt, uint32_t addr0,
-				  uint32_t addr1, uint32_t ref, uint32_t mask,
-				  uint32_t inv)
-{
-	amdgpu_ring_write(ring, PACKET3(PACKET3_WAIT_REG_MEM, 5));
-	amdgpu_ring_write(ring,
-			  /* memory (1) or register (0) */
-			  (WAIT_REG_MEM_MEM_SPACE(mem_space) |
-			   WAIT_REG_MEM_OPERATION(opt) | /* wait */
-			   WAIT_REG_MEM_FUNCTION(3) |  /* equal */
-			   WAIT_REG_MEM_ENGINE(eng_sel)));
-
-	if (mem_space)
-		BUG_ON(addr0 & 0x3); /* Dword align */
-	amdgpu_ring_write(ring, addr0);
-	amdgpu_ring_write(ring, addr1);
-	amdgpu_ring_write(ring, ref);
-	amdgpu_ring_write(ring, mask);
-	amdgpu_ring_write(ring, inv); /* poll interval */
-}
-
-static void gfx_v8_0_ring_emit_reg_wait(struct amdgpu_ring *ring, uint32_t reg,
-					uint32_t val, uint32_t mask)
-{
-	gfx_v8_0_wait_reg_mem(ring, 0, 0, 0, reg, 0, val, mask, 0x20);
-}
-
-static void gfx_v8_0_ring_soft_recovery(struct amdgpu_ring *ring, unsigned vmid)
-{
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t value = 0;
-
-	value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
-	value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
-	value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
-	value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
-	WREG32(mmSQ_CMD, value);
-}
-
 static void gfx_v8_0_set_gfx_eop_interrupt_state(struct amdgpu_device *adev,
 						 enum amdgpu_interrupt_state state)
 {
@@ -6875,14 +6870,13 @@ static int gfx_v8_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	if (r)
 		return r;
 
-	if (amdgpu_ring_alloc(ring, 7 + 12 + 5))
+	if (amdgpu_ring_alloc(ring, 7 + 12 + 5 + 7))
 		return -ENOMEM;
-	gfx_v8_0_ring_emit_fence_gfx(ring, ring->fence_drv.gpu_addr,
-				     ring->fence_drv.sync_seq, AMDGPU_FENCE_FLAG_EXEC);
-	gfx_v8_0_ring_emit_reg_wait(ring, mmCP_VMID_RESET, 0, 0xffff);
-	gfx_v8_0_ring_emit_wreg(ring, mmCP_VMID_RESET, 0);
 
-	return amdgpu_ring_test_ring(ring);
+	gfx_v8_0_ring_emit_pipeline_sync(ring);
+	amdgpu_ring_commit(ring);
+
+	return gfx_v8_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
 }
 
 static const struct amd_ip_funcs gfx_v8_0_ip_funcs = {
@@ -6916,7 +6910,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_gfx = {
 	.set_wptr = gfx_v8_0_ring_set_wptr_gfx,
 	.emit_frame_size = /* maximum 215dw if count 16 IBs in */
 		5 +  /* COND_EXEC */
-		7 +  /* PIPELINE_SYNC */
+		7 + 7 + 5 + 12 + /* PIPELINE_SYNC */
 		VI_FLUSH_GPU_TLB_NUM_WREG * 5 + 9 + /* VM_FLUSH */
 		12 +  /* FENCE for VM_FLUSH */
 		20 + /* GDS switch */
@@ -6948,7 +6942,6 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_gfx = {
 	.emit_cntxcntl = gfx_v8_ring_emit_cntxcntl,
 	.init_cond_exec = gfx_v8_0_ring_emit_init_cond_exec,
 	.emit_wreg = gfx_v8_0_ring_emit_wreg,
-	.soft_recovery = gfx_v8_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v8_0_emit_mem_sync,
 	.reset = gfx_v8_0_reset_kgq,
 };
@@ -6983,7 +6976,6 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_compute = {
 	.insert_nop = amdgpu_ring_insert_nop,
 	.pad_ib = amdgpu_ring_generic_pad_ib,
 	.emit_wreg = gfx_v8_0_ring_emit_wreg,
-	.soft_recovery = gfx_v8_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v8_0_emit_mem_sync_compute,
 	.emit_wave_limit = gfx_v8_0_emit_wave_limit,
 };
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 6/8] drm/amdgpu: rework gfx10 queue reset
  2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
                   ` (3 preceding siblings ...)
  2025-05-19 18:22 ` [PATCH 5/8] drm/amdgpu: rework gfx8 " Alex Deucher
@ 2025-05-19 18:22 ` Alex Deucher
  2025-05-19 18:22 ` [PATCH 7/8] drm/amdgpu: rework gfx11 " Alex Deucher
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Alex Deucher @ 2025-05-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Christian König, Christian König, Alex Deucher

From: Christian König <ckoenig.leichtzumerken@gmail.com>

Apply the same changes to gfx10 as done to gfx9.

The general idea to reset the whole kernel queue and then asking the kiq
to map it again didn't worked at all. Background is that we don't use per
application kernel queues for gfx10 on Linux for performance reasons.

So instead use the gfx9 approach here as well and only reset all
submissions from a specific VMID instead of the whole queue.

Navi 10 seems to be stable, but Navi 2x still shows hangs during over
night testing. This needs more investigation, but the result is clearly
better than before.

v2: fix up pipeline_sync count, only emit vmid reset on gfx (Alex)

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 60 ++++++++------------------
 1 file changed, 19 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 14cbd1f08eb5c..52e12fb083d5f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -8746,6 +8746,19 @@ static void gfx_v10_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
 	int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
 	uint32_t seq = ring->fence_drv.sync_seq;
 	uint64_t addr = ring->fence_drv.gpu_addr;
+	struct amdgpu_device *adev = ring->adev;
+
+	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
+		amdgpu_ring_emit_reg_wait(ring,
+					  SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET),
+					  0, 0xffff);
+		amdgpu_ring_emit_wreg(ring,
+				      SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET),
+				      0);
+		amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
+				       ring->fence_drv.sync_seq,
+				       AMDGPU_FENCE_FLAG_EXEC);
+	}
 
 	gfx_v10_0_wait_reg_mem(ring, usepfp, 1, 0, lower_32_bits(addr),
 			       upper_32_bits(addr), seq, 0xffffffff, 4);
@@ -9046,21 +9059,6 @@ static void gfx_v10_0_ring_emit_reg_write_reg_wait(struct amdgpu_ring *ring,
 							   ref, mask);
 }
 
-static void gfx_v10_0_ring_soft_recovery(struct amdgpu_ring *ring,
-					 unsigned int vmid)
-{
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t value = 0;
-
-	value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
-	value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
-	value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
-	value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
-	amdgpu_gfx_rlc_enter_safe_mode(adev, 0);
-	WREG32_SOC15(GC, 0, mmSQ_CMD, value);
-	amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
-}
-
 static void
 gfx_v10_0_set_gfx_eop_interrupt_state(struct amdgpu_device *adev,
 				      uint32_t me, uint32_t pipe,
@@ -9529,38 +9527,21 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	struct amdgpu_ring *kiq_ring = &kiq->ring;
 	unsigned long flags;
 	u32 tmp;
-	u64 addr;
 	int r;
 
 	if (amdgpu_sriov_vf(adev))
 		return -EINVAL;
 
-	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
-		return -EINVAL;
-
 	spin_lock_irqsave(&kiq->ring_lock, flags);
 
-	if (amdgpu_ring_alloc(kiq_ring, 5 + 7 + 7 + kiq->pmf->map_queues_size)) {
+	if (amdgpu_ring_alloc(kiq_ring, 5)) {
 		spin_unlock_irqrestore(&kiq->ring_lock, flags);
 		return -ENOMEM;
 	}
 
-	addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
-		offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
 	tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
-	if (ring->pipe == 0)
-		tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
-	else
-		tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE1_QUEUES, 1 << ring->queue);
-
 	gfx_v10_0_ring_emit_wreg(kiq_ring,
 				 SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), tmp);
-	gfx_v10_0_wait_reg_mem(kiq_ring, 0, 1, 0,
-			       lower_32_bits(addr), upper_32_bits(addr),
-			       0, 1, 0x20);
-	gfx_v10_0_ring_emit_reg_wait(kiq_ring,
-				     SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffffffff);
-	kiq->pmf->kiq_map_queues(kiq_ring, ring);
 	amdgpu_ring_commit(kiq_ring);
 
 	spin_unlock_irqrestore(&kiq->ring_lock, flags);
@@ -9569,13 +9550,12 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	if (r)
 		return r;
 
-	r = gfx_v10_0_kgq_init_queue(ring, true);
-	if (r) {
-		DRM_ERROR("fail to init kgq\n");
-		return r;
-	}
+	if (amdgpu_ring_alloc(ring, 7 + 7 + 5 + 7))
+		return -ENOMEM;
+	gfx_v10_0_ring_emit_pipeline_sync(ring);
+	amdgpu_ring_commit(ring);
 
-	return amdgpu_ring_test_ring(ring);
+	return gfx_v10_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
 }
 
 static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
@@ -9882,7 +9862,6 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
 	.emit_wreg = gfx_v10_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v10_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v10_0_emit_mem_sync,
 	.reset = gfx_v10_0_reset_kgq,
 	.emit_cleaner_shader = gfx_v10_0_ring_emit_cleaner_shader,
@@ -9923,7 +9902,6 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_compute = {
 	.emit_wreg = gfx_v10_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v10_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v10_0_emit_mem_sync,
 	.reset = gfx_v10_0_reset_kcq,
 	.emit_cleaner_shader = gfx_v10_0_ring_emit_cleaner_shader,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 7/8] drm/amdgpu: rework gfx11 queue reset
  2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
                   ` (4 preceding siblings ...)
  2025-05-19 18:22 ` [PATCH 6/8] drm/amdgpu: rework gfx10 " Alex Deucher
@ 2025-05-19 18:22 ` Alex Deucher
  2025-05-19 18:22 ` [PATCH 8/8] drm/amdgpu: rework gfx12 " Alex Deucher
  2025-05-20  8:23 ` [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Christian König
  7 siblings, 0 replies; 15+ messages in thread
From: Alex Deucher @ 2025-05-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Apply the same changes to gfx11 as done to gfx10.

Background is that we don't use per application kernel queues for
gfx11 on Linux for performance reasons.

So instead use the gfx10 approach here as well and only reset all
submissions from a specific VMID instead of the whole queue.

v2: fix up pipeline_sync count, only emit vmid reset on gfx (Alex)

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 116 +++++--------------------
 1 file changed, 22 insertions(+), 94 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index afd6d59164bfa..db69b76d6ab25 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -5936,7 +5936,19 @@ static void gfx_v11_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
 	int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
 	uint32_t seq = ring->fence_drv.sync_seq;
 	uint64_t addr = ring->fence_drv.gpu_addr;
+	struct amdgpu_device *adev = ring->adev;
 
+	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
+		amdgpu_ring_emit_reg_wait(ring,
+					  SOC15_REG_OFFSET(GC, 0, regCP_VMID_RESET),
+					  0, 0xffff);
+		amdgpu_ring_emit_wreg(ring,
+				      SOC15_REG_OFFSET(GC, 0, regCP_VMID_RESET),
+				      0);
+		amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
+				       ring->fence_drv.sync_seq,
+				       AMDGPU_FENCE_FLAG_EXEC);
+	}
 	gfx_v11_0_wait_reg_mem(ring, usepfp, 1, 0, lower_32_bits(addr),
 			       upper_32_bits(addr), seq, 0xffffffff, 4);
 }
@@ -6278,21 +6290,6 @@ static void gfx_v11_0_ring_emit_reg_write_reg_wait(struct amdgpu_ring *ring,
 			       ref, mask, 0x20);
 }
 
-static void gfx_v11_0_ring_soft_recovery(struct amdgpu_ring *ring,
-					 unsigned vmid)
-{
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t value = 0;
-
-	value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
-	value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
-	value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
-	value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
-	amdgpu_gfx_rlc_enter_safe_mode(adev, 0);
-	WREG32_SOC15(GC, 0, regSQ_CMD, value);
-	amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
-}
-
 static void
 gfx_v11_0_set_gfx_eop_interrupt_state(struct amdgpu_device *adev,
 				      uint32_t me, uint32_t pipe,
@@ -6750,92 +6747,25 @@ static bool gfx_v11_pipe_reset_support(struct amdgpu_device *adev)
 	return false;
 }
 
-
-static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
-{
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t reset_pipe = 0, clean_pipe = 0;
-	int r;
-
-	if (!gfx_v11_pipe_reset_support(adev))
-		return -EOPNOTSUPP;
-
-	gfx_v11_0_set_safe_mode(adev, 0);
-	mutex_lock(&adev->srbm_mutex);
-	soc21_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
-
-	switch (ring->pipe) {
-	case 0:
-		reset_pipe = REG_SET_FIELD(reset_pipe, CP_ME_CNTL,
-					   PFP_PIPE0_RESET, 1);
-		reset_pipe = REG_SET_FIELD(reset_pipe, CP_ME_CNTL,
-					   ME_PIPE0_RESET, 1);
-		clean_pipe = REG_SET_FIELD(clean_pipe, CP_ME_CNTL,
-					   PFP_PIPE0_RESET, 0);
-		clean_pipe = REG_SET_FIELD(clean_pipe, CP_ME_CNTL,
-					   ME_PIPE0_RESET, 0);
-		break;
-	case 1:
-		reset_pipe = REG_SET_FIELD(reset_pipe, CP_ME_CNTL,
-					   PFP_PIPE1_RESET, 1);
-		reset_pipe = REG_SET_FIELD(reset_pipe, CP_ME_CNTL,
-					   ME_PIPE1_RESET, 1);
-		clean_pipe = REG_SET_FIELD(clean_pipe, CP_ME_CNTL,
-					   PFP_PIPE1_RESET, 0);
-		clean_pipe = REG_SET_FIELD(clean_pipe, CP_ME_CNTL,
-					   ME_PIPE1_RESET, 0);
-		break;
-	default:
-		break;
-	}
-
-	WREG32_SOC15(GC, 0, regCP_ME_CNTL, reset_pipe);
-	WREG32_SOC15(GC, 0, regCP_ME_CNTL, clean_pipe);
-
-	r = (RREG32(SOC15_REG_OFFSET(GC, 0, regCP_GFX_RS64_INSTR_PNTR1)) << 2) -
-						RS64_FW_UC_START_ADDR_LO;
-	soc21_grbm_select(adev, 0, 0, 0, 0);
-	mutex_unlock(&adev->srbm_mutex);
-	gfx_v11_0_unset_safe_mode(adev, 0);
-
-	dev_info(adev->dev, "The ring %s pipe reset to the ME firmware start PC: %s\n", ring->name,
-			r == 0 ? "successfully" : "failed");
-	/* FIXME: Sometimes driver can't cache the ME firmware start PC correctly,
-	 * so the pipe reset status relies on the later gfx ring test result.
-	 */
-	return 0;
-}
-
 static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 {
 	struct amdgpu_device *adev = ring->adev;
+	u32 tmp;
 	int r;
 
 	if (amdgpu_sriov_vf(adev))
 		return -EINVAL;
 
-	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
-	if (r) {
-
-		dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
-		r = gfx_v11_reset_gfx_pipe(ring);
-		if (r)
-			return r;
-	}
-
-	r = gfx_v11_0_kgq_init_queue(ring, true);
-	if (r) {
-		dev_err(adev->dev, "failed to init kgq\n");
-		return r;
-	}
-
-	r = amdgpu_mes_map_legacy_queue(adev, ring);
-	if (r) {
-		dev_err(adev->dev, "failed to remap kgq\n");
+	tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
+	r = amdgpu_mes_wreg(adev, SOC15_REG_OFFSET(GC, 0, regCP_VMID_RESET), tmp);
+	if (r)
 		return r;
-	}
+	if (amdgpu_ring_alloc(ring, 7 + 7 + 5 + 8))
+		return -ENOMEM;
+	gfx_v11_0_ring_emit_pipeline_sync(ring);
+	amdgpu_ring_commit(ring);
 
-	return amdgpu_ring_test_ring(ring);
+	return gfx_v11_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
 }
 
 static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
@@ -7196,7 +7126,7 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_gfx = {
 		5 + /* update_spm_vmid */
 		5 + /* COND_EXEC */
 		22 + /* SET_Q_PREEMPTION_MODE */
-		7 + /* PIPELINE_SYNC */
+		7 + 7 + 5 + 8 + /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		4 + /* VM_FLUSH */
@@ -7231,7 +7161,6 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_gfx = {
 	.emit_wreg = gfx_v11_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v11_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v11_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v11_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v11_0_emit_mem_sync,
 	.reset = gfx_v11_0_reset_kgq,
 	.emit_cleaner_shader = gfx_v11_0_ring_emit_cleaner_shader,
@@ -7273,7 +7202,6 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_compute = {
 	.emit_wreg = gfx_v11_0_ring_emit_wreg,
 	.emit_reg_wait = gfx_v11_0_ring_emit_reg_wait,
 	.emit_reg_write_reg_wait = gfx_v11_0_ring_emit_reg_write_reg_wait,
-	.soft_recovery = gfx_v11_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v11_0_emit_mem_sync,
 	.reset = gfx_v11_0_reset_kcq,
 	.emit_cleaner_shader = gfx_v11_0_ring_emit_cleaner_shader,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 8/8] drm/amdgpu: rework gfx12 queue reset
  2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
                   ` (5 preceding siblings ...)
  2025-05-19 18:22 ` [PATCH 7/8] drm/amdgpu: rework gfx11 " Alex Deucher
@ 2025-05-19 18:22 ` Alex Deucher
  2025-05-20  8:23 ` [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Christian König
  7 siblings, 0 replies; 15+ messages in thread
From: Alex Deucher @ 2025-05-19 18:22 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Apply the same changes to gfx12 as done to gfx10.

Background is that we don't use per application kernel queues for
gfx12 on Linux for performance reasons.

So instead use the gfx10 approach here as well and only reset all
submissions from a specific VMID instead of the whole queue.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 102 ++++++-------------------
 1 file changed, 23 insertions(+), 79 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index f09d96bfee16d..ffecbe217d09d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -4479,6 +4479,19 @@ static void gfx_v12_0_ring_emit_pipeline_sync(struct amdgpu_ring *ring)
 	int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
 	uint32_t seq = ring->fence_drv.sync_seq;
 	uint64_t addr = ring->fence_drv.gpu_addr;
+	struct amdgpu_device *adev = ring->adev;
+
+	if (ring->funcs->type == AMDGPU_RING_TYPE_GFX) {
+		amdgpu_ring_emit_reg_wait(ring,
+					  SOC15_REG_OFFSET(GC, 0, regCP_VMID_RESET),
+					  0, 0xffff);
+		amdgpu_ring_emit_wreg(ring,
+				      SOC15_REG_OFFSET(GC, 0, regCP_VMID_RESET),
+				      0);
+		amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
+				       ring->fence_drv.sync_seq,
+				       AMDGPU_FENCE_FLAG_EXEC);
+	}
 
 	gfx_v12_0_wait_reg_mem(ring, usepfp, 1, 0, lower_32_bits(addr),
 			       upper_32_bits(addr), seq, 0xffffffff, 4);
@@ -5251,91 +5264,22 @@ static bool gfx_v12_pipe_reset_support(struct amdgpu_device *adev)
 	return false;
 }
 
-static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
-{
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t reset_pipe = 0, clean_pipe = 0;
-	int r;
-
-	if (!gfx_v12_pipe_reset_support(adev))
-		return -EOPNOTSUPP;
-
-	gfx_v12_0_set_safe_mode(adev, 0);
-	mutex_lock(&adev->srbm_mutex);
-	soc24_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
-
-	switch (ring->pipe) {
-	case 0:
-		reset_pipe = REG_SET_FIELD(reset_pipe, CP_ME_CNTL,
-					   PFP_PIPE0_RESET, 1);
-		reset_pipe = REG_SET_FIELD(reset_pipe, CP_ME_CNTL,
-					   ME_PIPE0_RESET, 1);
-		clean_pipe = REG_SET_FIELD(clean_pipe, CP_ME_CNTL,
-					   PFP_PIPE0_RESET, 0);
-		clean_pipe = REG_SET_FIELD(clean_pipe, CP_ME_CNTL,
-					   ME_PIPE0_RESET, 0);
-		break;
-	case 1:
-		reset_pipe = REG_SET_FIELD(reset_pipe, CP_ME_CNTL,
-					   PFP_PIPE1_RESET, 1);
-		reset_pipe = REG_SET_FIELD(reset_pipe, CP_ME_CNTL,
-					   ME_PIPE1_RESET, 1);
-		clean_pipe = REG_SET_FIELD(clean_pipe, CP_ME_CNTL,
-					   PFP_PIPE1_RESET, 0);
-		clean_pipe = REG_SET_FIELD(clean_pipe, CP_ME_CNTL,
-					   ME_PIPE1_RESET, 0);
-		break;
-	default:
-		break;
-	}
-
-	WREG32_SOC15(GC, 0, regCP_ME_CNTL, reset_pipe);
-	WREG32_SOC15(GC, 0, regCP_ME_CNTL, clean_pipe);
-
-	r = (RREG32(SOC15_REG_OFFSET(GC, 0, regCP_GFX_RS64_INSTR_PNTR1)) << 2) -
-					RS64_FW_UC_START_ADDR_LO;
-	soc24_grbm_select(adev, 0, 0, 0, 0);
-	mutex_unlock(&adev->srbm_mutex);
-	gfx_v12_0_unset_safe_mode(adev, 0);
-
-	dev_info(adev->dev, "The ring %s pipe reset: %s\n", ring->name,
-			r == 0 ? "successfully" : "failed");
-	/* Sometimes the ME start pc counter can't cache correctly, so the
-	 * PC check only as a reference and pipe reset result rely on the
-	 * later ring test.
-	 */
-	return 0;
-}
-
 static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 {
 	struct amdgpu_device *adev = ring->adev;
+	u32 tmp;
 	int r;
 
-	if (amdgpu_sriov_vf(adev))
-		return -EINVAL;
-
-	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
-	if (r) {
-		dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
-		r = gfx_v12_reset_gfx_pipe(ring);
-		if (r)
-			return r;
-	}
-
-	r = gfx_v12_0_kgq_init_queue(ring, true);
-	if (r) {
-		dev_err(adev->dev, "failed to init kgq\n");
-		return r;
-	}
-
-	r = amdgpu_mes_map_legacy_queue(adev, ring);
-	if (r) {
-		dev_err(adev->dev, "failed to remap kgq\n");
+	tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
+	r = amdgpu_mes_wreg(adev, SOC15_REG_OFFSET(GC, 0, regCP_VMID_RESET), tmp);
+	if (r)
 		return r;
-	}
+	if (amdgpu_ring_alloc(ring, 7 + 7 + 5 + 8))
+		return -ENOMEM;
+	gfx_v12_0_ring_emit_pipeline_sync(ring);
+	amdgpu_ring_commit(ring);
 
-	return amdgpu_ring_test_ring(ring);
+	return gfx_v12_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
 }
 
 static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
@@ -5495,7 +5439,7 @@ static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_gfx = {
 	.set_wptr = gfx_v12_0_ring_set_wptr_gfx,
 	.emit_frame_size = /* totally 242 maximum if 16 IBs */
 		5 + /* COND_EXEC */
-		7 + /* PIPELINE_SYNC */
+		7 + 7 + 5 + 8 + /* PIPELINE_SYNC */
 		SOC15_FLUSH_GPU_TLB_NUM_WREG * 5 +
 		SOC15_FLUSH_GPU_TLB_NUM_REG_WAIT * 7 +
 		2 + /* VM_FLUSH */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support"
  2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
                   ` (6 preceding siblings ...)
  2025-05-19 18:22 ` [PATCH 8/8] drm/amdgpu: rework gfx12 " Alex Deucher
@ 2025-05-20  8:23 ` Christian König
  7 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2025-05-20  8:23 UTC (permalink / raw)
  To: Alex Deucher, amd-gfx

On 5/19/25 20:22, Alex Deucher wrote:
> This reverts commit b7a1a0ef12b81957584fef7b61e2d5ec049c7209.
> 
> A user reported stuttering under heavy gfx load with this commit.
> I suspect it's due to the fact that the gfx contexts are shared
> between the pipes so if there is alot of load on one pipe, we could
> end up stalling waiting for a context.
> 
> On top of that, disabling the second pipe may fix the reliability
> of vmid resets.
> 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3519
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com> for this one, #7 and #8 in this series.

For the rest I think it would be better if you add your rb to the patches I think.

Thanks a lot for looking into it,
Christian.

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 75ea071744eb5..14cbd1f08eb5c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -4781,7 +4781,7 @@ static int gfx_v10_0_sw_init(struct amdgpu_ip_block *ip_block)
>  	case IP_VERSION(10, 3, 3):
>  	case IP_VERSION(10, 3, 7):
>  		adev->gfx.me.num_me = 1;
> -		adev->gfx.me.num_pipe_per_me = 2;
> +		adev->gfx.me.num_pipe_per_me = 1;
>  		adev->gfx.me.num_queue_per_pipe = 2;
>  		adev->gfx.mec.num_mec = 2;
>  		adev->gfx.mec.num_pipe_per_mec = 4;


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction
  2025-05-19 18:22 ` [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
@ 2025-05-20 13:09   ` Alex Deucher
  2025-05-20 13:49     ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Deucher @ 2025-05-20 13:09 UTC (permalink / raw)
  To: Alex Deucher; +Cc: amd-gfx, Christian König, Christian König

On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher@amd.com> wrote:
>
> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>
> Stopping the scheduler for queue reset is generally a good idea because
> it prevents any worker from touching the ring buffer.
>
> But using amdgpu_fence_driver_force_completion() before restarting it was
> a really bad idea because it marked fences as failed while the work was
> potentially still running.
>
> Stop doing that and cleanup the comment a bit.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index acb21fc8b3ce5..a0fab947143b5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
>                 bool is_guilty;
>
> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
> -               /* stop the scheduler, but don't mess with the
> -                * bad job yet because if ring reset fails
> -                * we'll fall back to full GPU reset.
> +               dev_err(adev->dev, "Starting %s ring reset\n",
> +                       s_job->sched->name);
> +
> +               /*
> +                * Stop the scheduler to prevent anybody else from touching the
> +                * ring buffer.
>                  */
>                 drm_sched_wqueue_stop(&ring->sched);
>
> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>
>                 r = amdgpu_ring_reset(ring, job->vmid);
>                 if (!r) {
> -                       if (amdgpu_ring_sched_ready(ring))
> -                               drm_sched_stop(&ring->sched, s_job);
> -                       if (is_guilty) {
> +                       if (is_guilty)
>                                 atomic_inc(&ring->adev->gpu_reset_counter);
> -                               amdgpu_fence_driver_force_completion(ring);

Do we still need this in the case of rings where we reset the entire
queue?  E.g., compute or VCN?  In which case we should move this to
the ring->reset callback.

Alex

> -                       }
> -                       if (amdgpu_ring_sched_ready(ring))
> -                               drm_sched_start(&ring->sched, 0);
> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> +                       drm_sched_wqueue_start(&ring->sched);
> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
> +                               ring->sched.name);
> +                       drm_dev_wedged_event(adev_to_drm(adev),
> +                                            DRM_WEDGE_RECOVERY_NONE);
>                         goto exit;
>                 }
> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
>         }
>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction
  2025-05-20 13:09   ` Alex Deucher
@ 2025-05-20 13:49     ` Christian König
  2025-05-20 16:38       ` Alex Deucher
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2025-05-20 13:49 UTC (permalink / raw)
  To: Alex Deucher, Alex Deucher; +Cc: amd-gfx, Christian König

On 5/20/25 15:09, Alex Deucher wrote:
> On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher@amd.com> wrote:
>>
>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>
>> Stopping the scheduler for queue reset is generally a good idea because
>> it prevents any worker from touching the ring buffer.
>>
>> But using amdgpu_fence_driver_force_completion() before restarting it was
>> a really bad idea because it marked fences as failed while the work was
>> potentially still running.
>>
>> Stop doing that and cleanup the comment a bit.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index acb21fc8b3ce5..a0fab947143b5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
>>                 bool is_guilty;
>>
>> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
>> -               /* stop the scheduler, but don't mess with the
>> -                * bad job yet because if ring reset fails
>> -                * we'll fall back to full GPU reset.
>> +               dev_err(adev->dev, "Starting %s ring reset\n",
>> +                       s_job->sched->name);
>> +
>> +               /*
>> +                * Stop the scheduler to prevent anybody else from touching the
>> +                * ring buffer.
>>                  */
>>                 drm_sched_wqueue_stop(&ring->sched);
>>
>> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>
>>                 r = amdgpu_ring_reset(ring, job->vmid);
>>                 if (!r) {
>> -                       if (amdgpu_ring_sched_ready(ring))
>> -                               drm_sched_stop(&ring->sched, s_job);
>> -                       if (is_guilty) {
>> +                       if (is_guilty)
>>                                 atomic_inc(&ring->adev->gpu_reset_counter);
>> -                               amdgpu_fence_driver_force_completion(ring);
> 
> Do we still need this in the case of rings where we reset the entire
> queue?  E.g., compute or VCN?  In which case we should move this to
> the ring->reset callback.

No, it shouldn't be necessary in the first place as long as the rings still execute their fence packages.

And that should be the case at least for both graphics and compute.

Forcing completion only makes sense when the whole ASIC was resetted and nothing executed any more.

Regards,
Christian.


> 
> Alex
> 
>> -                       }
>> -                       if (amdgpu_ring_sched_ready(ring))
>> -                               drm_sched_start(&ring->sched, 0);
>> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
>> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>> +                       drm_sched_wqueue_start(&ring->sched);
>> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
>> +                               ring->sched.name);
>> +                       drm_dev_wedged_event(adev_to_drm(adev),
>> +                                            DRM_WEDGE_RECOVERY_NONE);
>>                         goto exit;
>>                 }
>> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
>> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
>>         }
>>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>
>> --
>> 2.49.0
>>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction
  2025-05-20 13:49     ` Christian König
@ 2025-05-20 16:38       ` Alex Deucher
  2025-05-21  9:03         ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Deucher @ 2025-05-20 16:38 UTC (permalink / raw)
  To: Christian König; +Cc: Alex Deucher, amd-gfx, Christian König

On Tue, May 20, 2025 at 9:49 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/20/25 15:09, Alex Deucher wrote:
> > On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher@amd.com> wrote:
> >>
> >> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >>
> >> Stopping the scheduler for queue reset is generally a good idea because
> >> it prevents any worker from touching the ring buffer.
> >>
> >> But using amdgpu_fence_driver_force_completion() before restarting it was
> >> a really bad idea because it marked fences as failed while the work was
> >> potentially still running.
> >>
> >> Stop doing that and cleanup the comment a bit.
> >>
> >> Signed-off-by: Christian König <christian.koenig@amd.com>
> >> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
> >>  1 file changed, 13 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> index acb21fc8b3ce5..a0fab947143b5 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
> >>                 bool is_guilty;
> >>
> >> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
> >> -               /* stop the scheduler, but don't mess with the
> >> -                * bad job yet because if ring reset fails
> >> -                * we'll fall back to full GPU reset.
> >> +               dev_err(adev->dev, "Starting %s ring reset\n",
> >> +                       s_job->sched->name);
> >> +
> >> +               /*
> >> +                * Stop the scheduler to prevent anybody else from touching the
> >> +                * ring buffer.
> >>                  */
> >>                 drm_sched_wqueue_stop(&ring->sched);
> >>
> >> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>
> >>                 r = amdgpu_ring_reset(ring, job->vmid);
> >>                 if (!r) {
> >> -                       if (amdgpu_ring_sched_ready(ring))
> >> -                               drm_sched_stop(&ring->sched, s_job);
> >> -                       if (is_guilty) {
> >> +                       if (is_guilty)
> >>                                 atomic_inc(&ring->adev->gpu_reset_counter);
> >> -                               amdgpu_fence_driver_force_completion(ring);
> >
> > Do we still need this in the case of rings where we reset the entire
> > queue?  E.g., compute or VCN?  In which case we should move this to
> > the ring->reset callback.
>
> No, it shouldn't be necessary in the first place as long as the rings still execute their fence packages.
>
> And that should be the case at least for both graphics and compute.
>
> Forcing completion only makes sense when the whole ASIC was resetted and nothing executed any more.

This seems to result in a deadlock if you reset the entire queue
rather than just the vmid.   I.e., if you test with just this patch
and not any of the following patches.  In that case, the queue is
reset so none of the fences are signaled.

Alex


>
> Regards,
> Christian.
>
>
> >
> > Alex
> >
> >> -                       }
> >> -                       if (amdgpu_ring_sched_ready(ring))
> >> -                               drm_sched_start(&ring->sched, 0);
> >> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> >> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> >> +                       drm_sched_wqueue_start(&ring->sched);
> >> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
> >> +                               ring->sched.name);
> >> +                       drm_dev_wedged_event(adev_to_drm(adev),
> >> +                                            DRM_WEDGE_RECOVERY_NONE);
> >>                         goto exit;
> >>                 }
> >> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
> >> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
> >>         }
> >>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> >>
> >> --
> >> 2.49.0
> >>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction
  2025-05-20 16:38       ` Alex Deucher
@ 2025-05-21  9:03         ` Christian König
  2025-05-21 13:28           ` Alex Deucher
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2025-05-21  9:03 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Alex Deucher, amd-gfx, Christian König

On 5/20/25 18:38, Alex Deucher wrote:
> On Tue, May 20, 2025 at 9:49 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 5/20/25 15:09, Alex Deucher wrote:
>>> On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher@amd.com> wrote:
>>>>
>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>
>>>> Stopping the scheduler for queue reset is generally a good idea because
>>>> it prevents any worker from touching the ring buffer.
>>>>
>>>> But using amdgpu_fence_driver_force_completion() before restarting it was
>>>> a really bad idea because it marked fences as failed while the work was
>>>> potentially still running.
>>>>
>>>> Stop doing that and cleanup the comment a bit.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index acb21fc8b3ce5..a0fab947143b5 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
>>>>                 bool is_guilty;
>>>>
>>>> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
>>>> -               /* stop the scheduler, but don't mess with the
>>>> -                * bad job yet because if ring reset fails
>>>> -                * we'll fall back to full GPU reset.
>>>> +               dev_err(adev->dev, "Starting %s ring reset\n",
>>>> +                       s_job->sched->name);
>>>> +
>>>> +               /*
>>>> +                * Stop the scheduler to prevent anybody else from touching the
>>>> +                * ring buffer.
>>>>                  */
>>>>                 drm_sched_wqueue_stop(&ring->sched);
>>>>
>>>> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>
>>>>                 r = amdgpu_ring_reset(ring, job->vmid);
>>>>                 if (!r) {
>>>> -                       if (amdgpu_ring_sched_ready(ring))
>>>> -                               drm_sched_stop(&ring->sched, s_job);
>>>> -                       if (is_guilty) {
>>>> +                       if (is_guilty)
>>>>                                 atomic_inc(&ring->adev->gpu_reset_counter);
>>>> -                               amdgpu_fence_driver_force_completion(ring);
>>>
>>> Do we still need this in the case of rings where we reset the entire
>>> queue?  E.g., compute or VCN?  In which case we should move this to
>>> the ring->reset callback.
>>
>> No, it shouldn't be necessary in the first place as long as the rings still execute their fence packages.
>>
>> And that should be the case at least for both graphics and compute.
>>
>> Forcing completion only makes sense when the whole ASIC was resetted and nothing executed any more.
> 
> This seems to result in a deadlock if you reset the entire queue
> rather than just the vmid.   I.e., if you test with just this patch
> and not any of the following patches.  In that case, the queue is
> reset so none of the fences are signaled.

That is just because of a specific behavior of the GFX/Compute engine.

When fences are issued while a reset is ongoing the CP writes the fence value not to the requested location, but rather to fence_addr + 4. See amdgpu_debugfs_fence_info_show for more details.

That's why I cleared the reset before issuing the fence command in the follow up patches.

Key point is that the stuff still executes and telling the core os that it can release the memory by force completing the fences is a really bad idea in that case.

Regards,
Christian.

> 
> Alex
> 
> 
>>
>> Regards,
>> Christian.
>>
>>
>>>
>>> Alex
>>>
>>>> -                       }
>>>> -                       if (amdgpu_ring_sched_ready(ring))
>>>> -                               drm_sched_start(&ring->sched, 0);
>>>> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
>>>> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>>>> +                       drm_sched_wqueue_start(&ring->sched);
>>>> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
>>>> +                               ring->sched.name);
>>>> +                       drm_dev_wedged_event(adev_to_drm(adev),
>>>> +                                            DRM_WEDGE_RECOVERY_NONE);
>>>>                         goto exit;
>>>>                 }
>>>> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
>>>> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
>>>>         }
>>>>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>>>
>>>> --
>>>> 2.49.0
>>>>
>>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction
  2025-05-21  9:03         ` Christian König
@ 2025-05-21 13:28           ` Alex Deucher
  2025-05-22  8:24             ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Deucher @ 2025-05-21 13:28 UTC (permalink / raw)
  To: Christian König; +Cc: Alex Deucher, amd-gfx, Christian König

On Wed, May 21, 2025 at 5:03 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/20/25 18:38, Alex Deucher wrote:
> > On Tue, May 20, 2025 at 9:49 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >> On 5/20/25 15:09, Alex Deucher wrote:
> >>> On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher@amd.com> wrote:
> >>>>
> >>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
> >>>>
> >>>> Stopping the scheduler for queue reset is generally a good idea because
> >>>> it prevents any worker from touching the ring buffer.
> >>>>
> >>>> But using amdgpu_fence_driver_force_completion() before restarting it was
> >>>> a really bad idea because it marked fences as failed while the work was
> >>>> potentially still running.
> >>>>
> >>>> Stop doing that and cleanup the comment a bit.
> >>>>
> >>>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> >>>> ---
> >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
> >>>>  1 file changed, 13 insertions(+), 14 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>> index acb21fc8b3ce5..a0fab947143b5 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>>> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
> >>>>                 bool is_guilty;
> >>>>
> >>>> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
> >>>> -               /* stop the scheduler, but don't mess with the
> >>>> -                * bad job yet because if ring reset fails
> >>>> -                * we'll fall back to full GPU reset.
> >>>> +               dev_err(adev->dev, "Starting %s ring reset\n",
> >>>> +                       s_job->sched->name);
> >>>> +
> >>>> +               /*
> >>>> +                * Stop the scheduler to prevent anybody else from touching the
> >>>> +                * ring buffer.
> >>>>                  */
> >>>>                 drm_sched_wqueue_stop(&ring->sched);
> >>>>
> >>>> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>>
> >>>>                 r = amdgpu_ring_reset(ring, job->vmid);
> >>>>                 if (!r) {
> >>>> -                       if (amdgpu_ring_sched_ready(ring))
> >>>> -                               drm_sched_stop(&ring->sched, s_job);
> >>>> -                       if (is_guilty) {
> >>>> +                       if (is_guilty)
> >>>>                                 atomic_inc(&ring->adev->gpu_reset_counter);
> >>>> -                               amdgpu_fence_driver_force_completion(ring);
> >>>
> >>> Do we still need this in the case of rings where we reset the entire
> >>> queue?  E.g., compute or VCN?  In which case we should move this to
> >>> the ring->reset callback.
> >>
> >> No, it shouldn't be necessary in the first place as long as the rings still execute their fence packages.
> >>
> >> And that should be the case at least for both graphics and compute.
> >>
> >> Forcing completion only makes sense when the whole ASIC was resetted and nothing executed any more.
> >
> > This seems to result in a deadlock if you reset the entire queue
> > rather than just the vmid.   I.e., if you test with just this patch
> > and not any of the following patches.  In that case, the queue is
> > reset so none of the fences are signaled.
>
> That is just because of a specific behavior of the GFX/Compute engine.
>
> When fences are issued while a reset is ongoing the CP writes the fence value not to the requested location, but rather to fence_addr + 4. See amdgpu_debugfs_fence_info_show for more details.
>
> That's why I cleared the reset before issuing the fence command in the follow up patches.
>
> Key point is that the stuff still executes and telling the core os that it can release the memory by force completing the fences is a really bad idea in that case.

But the current code (without this patch set), at least for gfx10 and
newer and VCN, doesn't write the fence, it just resets the entire ring
so the fence never gets written.  So either we need to keep this code
here, or we need to write the fence in the ring reset callbacks I
think.

Alex

>
> Regards,
> Christian.
>
> >
> > Alex
> >
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>
> >>>
> >>> Alex
> >>>
> >>>> -                       }
> >>>> -                       if (amdgpu_ring_sched_ready(ring))
> >>>> -                               drm_sched_start(&ring->sched, 0);
> >>>> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> >>>> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> >>>> +                       drm_sched_wqueue_start(&ring->sched);
> >>>> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
> >>>> +                               ring->sched.name);
> >>>> +                       drm_dev_wedged_event(adev_to_drm(adev),
> >>>> +                                            DRM_WEDGE_RECOVERY_NONE);
> >>>>                         goto exit;
> >>>>                 }
> >>>> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
> >>>> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
> >>>>         }
> >>>>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> >>>>
> >>>> --
> >>>> 2.49.0
> >>>>
> >>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction
  2025-05-21 13:28           ` Alex Deucher
@ 2025-05-22  8:24             ` Christian König
  0 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2025-05-22  8:24 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Alex Deucher, amd-gfx, Christian König

On 5/21/25 15:28, Alex Deucher wrote:
> On Wed, May 21, 2025 at 5:03 AM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 5/20/25 18:38, Alex Deucher wrote:
>>> On Tue, May 20, 2025 at 9:49 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>>
>>>> On 5/20/25 15:09, Alex Deucher wrote:
>>>>> On Mon, May 19, 2025 at 2:30 PM Alex Deucher <alexander.deucher@amd.com> wrote:
>>>>>>
>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>>
>>>>>> Stopping the scheduler for queue reset is generally a good idea because
>>>>>> it prevents any worker from touching the ring buffer.
>>>>>>
>>>>>> But using amdgpu_fence_driver_force_completion() before restarting it was
>>>>>> a really bad idea because it marked fences as failed while the work was
>>>>>> potentially still running.
>>>>>>
>>>>>> Stop doing that and cleanup the comment a bit.
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 ++++++++++++-------------
>>>>>>  1 file changed, 13 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> index acb21fc8b3ce5..a0fab947143b5 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>> @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>>         } else if (amdgpu_gpu_recovery && ring->funcs->reset) {
>>>>>>                 bool is_guilty;
>>>>>>
>>>>>> -               dev_err(adev->dev, "Starting %s ring reset\n", s_job->sched->name);
>>>>>> -               /* stop the scheduler, but don't mess with the
>>>>>> -                * bad job yet because if ring reset fails
>>>>>> -                * we'll fall back to full GPU reset.
>>>>>> +               dev_err(adev->dev, "Starting %s ring reset\n",
>>>>>> +                       s_job->sched->name);
>>>>>> +
>>>>>> +               /*
>>>>>> +                * Stop the scheduler to prevent anybody else from touching the
>>>>>> +                * ring buffer.
>>>>>>                  */
>>>>>>                 drm_sched_wqueue_stop(&ring->sched);
>>>>>>
>>>>>> @@ -157,19 +159,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>>
>>>>>>                 r = amdgpu_ring_reset(ring, job->vmid);
>>>>>>                 if (!r) {
>>>>>> -                       if (amdgpu_ring_sched_ready(ring))
>>>>>> -                               drm_sched_stop(&ring->sched, s_job);
>>>>>> -                       if (is_guilty) {
>>>>>> +                       if (is_guilty)
>>>>>>                                 atomic_inc(&ring->adev->gpu_reset_counter);
>>>>>> -                               amdgpu_fence_driver_force_completion(ring);
>>>>>
>>>>> Do we still need this in the case of rings where we reset the entire
>>>>> queue?  E.g., compute or VCN?  In which case we should move this to
>>>>> the ring->reset callback.
>>>>
>>>> No, it shouldn't be necessary in the first place as long as the rings still execute their fence packages.
>>>>
>>>> And that should be the case at least for both graphics and compute.
>>>>
>>>> Forcing completion only makes sense when the whole ASIC was resetted and nothing executed any more.
>>>
>>> This seems to result in a deadlock if you reset the entire queue
>>> rather than just the vmid.   I.e., if you test with just this patch
>>> and not any of the following patches.  In that case, the queue is
>>> reset so none of the fences are signaled.
>>
>> That is just because of a specific behavior of the GFX/Compute engine.
>>
>> When fences are issued while a reset is ongoing the CP writes the fence value not to the requested location, but rather to fence_addr + 4. See amdgpu_debugfs_fence_info_show for more details.
>>
>> That's why I cleared the reset before issuing the fence command in the follow up patches.
>>
>> Key point is that the stuff still executes and telling the core os that it can release the memory by force completing the fences is a really bad idea in that case.
> 
> But the current code (without this patch set), at least for gfx10 and
> newer and VCN, doesn't write the fence, it just resets the entire ring

Exactly that is *not* happening. See what the CP does is to kill/skip all submissions for the specified VMID. But submissions for other VMIDs seem to still be executed.

It's just that fence values written in VMID 0 don't end up at their specified location, but rather at addr + 4. I have only tested this on gfx9 and gfx10 (Navi 10), but I strongly expect that all CP generations work like this.

So the kernel doesn't recognize that those submissions have finished, but if we force finish them it can be that we end up using freed up memory because the submissions are still executing.

> so the fence never gets written.  So either we need to keep this code
> here, or we need to write the fence in the ring reset callbacks I
> think.

VCN is of course a different story, we could certainly add the force finish there if necessary.

Regards,
Christian.


> 
> Alex
> 
>>
>> Regards,
>> Christian.
>>
>>>
>>> Alex
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>>> -                       }
>>>>>> -                       if (amdgpu_ring_sched_ready(ring))
>>>>>> -                               drm_sched_start(&ring->sched, 0);
>>>>>> -                       dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
>>>>>> -                       drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>>>>>> +                       drm_sched_wqueue_start(&ring->sched);
>>>>>> +                       dev_err(adev->dev, "Ring %s reset succeeded\n",
>>>>>> +                               ring->sched.name);
>>>>>> +                       drm_dev_wedged_event(adev_to_drm(adev),
>>>>>> +                                            DRM_WEDGE_RECOVERY_NONE);
>>>>>>                         goto exit;
>>>>>>                 }
>>>>>> -               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
>>>>>> +               dev_err(adev->dev, "Ring %s reset failed\n", ring->sched.name);
>>>>>>         }
>>>>>>         dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>>>>>>
>>>>>> --
>>>>>> 2.49.0
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-05-22  8:24 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-19 18:22 [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Alex Deucher
2025-05-19 18:22 ` [PATCH 2/8] drm/amdgpu: rework queue reset scheduler interaction Alex Deucher
2025-05-20 13:09   ` Alex Deucher
2025-05-20 13:49     ` Christian König
2025-05-20 16:38       ` Alex Deucher
2025-05-21  9:03         ` Christian König
2025-05-21 13:28           ` Alex Deucher
2025-05-22  8:24             ` Christian König
2025-05-19 18:22 ` [PATCH 3/8] drm/amdgpu: rework gfx9 queue reset Alex Deucher
2025-05-19 18:22 ` [PATCH 4/8] drm/amdgpu: rework gfx7 " Alex Deucher
2025-05-19 18:22 ` [PATCH 5/8] drm/amdgpu: rework gfx8 " Alex Deucher
2025-05-19 18:22 ` [PATCH 6/8] drm/amdgpu: rework gfx10 " Alex Deucher
2025-05-19 18:22 ` [PATCH 7/8] drm/amdgpu: rework gfx11 " Alex Deucher
2025-05-19 18:22 ` [PATCH 8/8] drm/amdgpu: rework gfx12 " Alex Deucher
2025-05-20  8:23 ` [PATCH 1/8] Revert "drm/amd/amdgpu: add pipe1 hardware support" Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.