AMD-GFX Archive on lore.kernel.org
* [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues
@ 2026-04-14  8:58 Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 2/8] drm/amdgpu/gfx11: Refactor compute pipe reset and add HQD cleanup Jesse Zhang
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Jesse Zhang @ 2026-04-14  8:58 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, Christian Koenig, Jesse Zhang, Alex Deucher,
	Jesse Zhang

Introduce a shared mutex and common helpers to serialize MEC pipe reset
sequences between KGD (DRM scheduler) and KFD (AMDKFD) paths. This
prevents races where one path could stop/start schedulers or reprogram
hardware while the other is in the middle of a pipe reset, potentially
leading to queue map/unmap corruption or HQD state mismatches.

The change adds:

  - mec_pipe_reset_mutex to struct amdgpu_gfx, initialized during
    device init.

  - amdgpu_gfx_mec_pipe_reset_prepare(): stops DRM schedulers and KFD
    scheduling for all compute rings on a given (xcc_id, me, pipe)
    tuple, backing up unprocessed commands except for an optional
    guilty queue that is already handled via the KGD ring reset path.

  - amdgpu_gfx_mec_pipe_restart_schedulers(): restarts all schedulers
    and KFD scheduling for the affected pipe.

  - amdgpu_gfx_mec_pipe_reset_recover_queues(): re-initializes and
    remaps each KCQ on the pipe, optionally using a timed-out fence for
    the guilty queue and collateral fences for others, then completes
    the ring reset helper sequence.

  - amdgpu_gfx_mec_pipe_reset_run(): the core orchestration routine
    that takes the mutex, invokes prepare, performs the HW pipe reset
    via either a KFD or KGD callback, restarts schedulers on error, and
    recovers queues.

The implementation correctly handles single and multi-XCC configurations
by offsetting into the compute_ring array per partition. The special
queue value AMDGPU_MEC_PIPE_RESET_NO_QUEUE allows KFD-initiated resets
where no single DRM KCQ is identified as the timeout victim.

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 196 +++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  35 ++++
 3 files changed, 232 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fbdf458758d6..62d573b6135f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3742,6 +3742,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 		amdgpu_sync_create(&adev->isolation[i].active);
 		amdgpu_sync_create(&adev->isolation[i].prev);
 	}
+	mutex_init(&adev->gfx.mec_pipe_reset_mutex);
 	mutex_init(&adev->gfx.userq_sch_mutex);
 	mutex_init(&adev->gfx.workload_profile_mutex);
 	mutex_init(&adev->vcn.workload_profile_mutex);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 2956e45c9254..8118a91f6b64 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -24,6 +24,7 @@
  */
 
 #include <linux/firmware.h>
+#include <linux/lockdep.h>
 #include <linux/pm_runtime.h>
 
 #include "amdgpu.h"
@@ -69,6 +70,201 @@ void amdgpu_queue_mask_bit_to_mec_queue(struct amdgpu_device *adev, int bit,
 
 }
 
+static bool amdgpu_gfx_ring_on_mec_pipe(struct amdgpu_ring *ring, u32 me, u32 pipe)
+{
+	if (!ring || !ring->funcs || ring->funcs->type != AMDGPU_RING_TYPE_COMPUTE)
+		return false;
+	if (ring->no_scheduler)
+		return false;
+
+	return ring->me == me && ring->pipe == pipe;
+}
+
+/* Same layout as amdgpu_gfx_run_cleaner_shader(): block of num_compute_rings per XCC. */
+static unsigned int amdgpu_gfx_mec_pipe_compute_ring_base(struct amdgpu_device *adev,
+							 u32 xcc_id)
+{
+	int num_xcc = adev->gfx.xcc_mask ? NUM_XCC(adev->gfx.xcc_mask) : 1;
+
+	if (num_xcc <= 1)
+		return 0;
+	return xcc_id * adev->gfx.num_compute_rings;
+}
+
+/**
+ * amdgpu_gfx_mec_pipe_reset_prepare - stop schedulers before MEC pipe reset HW
+ *
+ * Backs up ring state for KCQs on (@xcc_id, @me, @pipe), stops their DRM
+ * schedulers, and stops KFD scheduling for the node. The MEC queue at
+ * @guilty_queue is skipped when it is not AMDGPU_MEC_PIPE_RESET_NO_QUEUE
+ * (already backed up by amdgpu_ring_reset_helper_begin() on the KGD path).
+ *
+ * Caller must hold &adev->gfx.mec_pipe_reset_mutex (e.g. via
+ * amdgpu_gfx_mec_pipe_reset_run()).
+ */
+void amdgpu_gfx_mec_pipe_reset_prepare(struct amdgpu_device *adev,
+				       u32 xcc_id, u32 me, u32 pipe,
+				       u32 guilty_queue)
+{
+	struct amdgpu_ring *ring;
+	unsigned int j, base;
+	bool backup_all = (guilty_queue == AMDGPU_MEC_PIPE_RESET_NO_QUEUE);
+
+	lockdep_assert_held(&adev->gfx.mec_pipe_reset_mutex);
+
+	base = amdgpu_gfx_mec_pipe_compute_ring_base(adev, xcc_id);
+	for (j = 0; j < adev->gfx.num_compute_rings; j++) {
+		ring = &adev->gfx.compute_ring[base + j];
+		if (!amdgpu_gfx_ring_on_mec_pipe(ring, me, pipe))
+			continue;
+		if (backup_all || ring->queue != guilty_queue)
+			amdgpu_ring_backup_unprocessed_commands(ring, NULL);
+		if (amdgpu_ring_sched_ready(ring))
+			drm_sched_wqueue_stop(&ring->sched);
+	}
+}
+
+void amdgpu_gfx_mec_pipe_restart_schedulers(struct amdgpu_device *adev,
+					    u32 me, u32 pipe, u32 xcc_id)
+{
+	struct amdgpu_ring *ring;
+	unsigned int j, base;
+
+	lockdep_assert_held(&adev->gfx.mec_pipe_reset_mutex);
+
+	base = amdgpu_gfx_mec_pipe_compute_ring_base(adev, xcc_id);
+	for (j = 0; j < adev->gfx.num_compute_rings; j++) {
+		ring = &adev->gfx.compute_ring[base + j];
+		if (!amdgpu_gfx_ring_on_mec_pipe(ring, me, pipe))
+			continue;
+		if (amdgpu_ring_sched_ready(ring))
+			drm_sched_wqueue_start(&ring->sched);
+	}
+}
+
+/**
+ * amdgpu_gfx_mec_pipe_reset_recover_queues - re-init KCQs after MEC pipe reset
+ *
+ * Re-inits and remaps every kernel compute queue on (@xcc_id, @me, @pipe),
+ * restarts schedulers, then amdgpu_ring_reset_helper_end() per ring.
+ * @guilty_queue: MEC queue index of the timed-out KCQ, or
+ * AMDGPU_MEC_PIPE_RESET_NO_QUEUE when every ring uses the collateral fence;
+ * @timedout_fence must then be NULL.
+ * @kcq_init: optional IP hook for kcq_init + MES remap.
+ *
+ * Caller must hold &adev->gfx.mec_pipe_reset_mutex (e.g. via
+ * amdgpu_gfx_mec_pipe_reset_run()).
+ */
+int amdgpu_gfx_mec_pipe_reset_recover_queues(struct amdgpu_device *adev,
+					     u32 xcc_id, u32 me, u32 pipe,
+					     u32 guilty_queue,
+					     struct amdgpu_fence *timedout_fence,
+					     amdgpu_gfx_kcq_init_queue_t kcq_init)
+{
+	struct amdgpu_fence collateral_reemit = {};
+	struct amdgpu_ring *ring;
+	unsigned int j, base;
+	int err = 0;
+	bool has_guilty = (guilty_queue != AMDGPU_MEC_PIPE_RESET_NO_QUEUE);
+
+	lockdep_assert_held(&adev->gfx.mec_pipe_reset_mutex);
+
+	if (has_guilty && !timedout_fence)
+		return -EINVAL;
+
+	collateral_reemit.context = (u64)-1;
+
+	base = amdgpu_gfx_mec_pipe_compute_ring_base(adev, xcc_id);
+	if (kcq_init) {
+		for (j = 0; j < adev->gfx.num_compute_rings; j++) {
+			ring = &adev->gfx.compute_ring[base + j];
+			if (!amdgpu_gfx_ring_on_mec_pipe(ring, me, pipe))
+				continue;
+
+			err = kcq_init(ring, true);
+			if (err)
+				goto err_sched;
+			err = amdgpu_mes_map_legacy_queue(adev, ring, 0);
+			if (err)
+				goto err_sched;
+		}
+	}
+
+	amdgpu_gfx_mec_pipe_restart_schedulers(adev, me, pipe, xcc_id);
+
+	for (j = 0; j < adev->gfx.num_compute_rings; j++) {
+		ring = &adev->gfx.compute_ring[base + j];
+		if (!amdgpu_gfx_ring_on_mec_pipe(ring, me, pipe))
+			continue;
+
+		err = amdgpu_ring_reset_helper_end(
+			ring,
+			(timedout_fence && ring->queue == guilty_queue) ?
+				timedout_fence :
+				&collateral_reemit);
+		if (err) {
+			dev_err(adev->dev,
+				"ring %s failed to recover after MEC pipe reset (%d)\n",
+				ring->name, err);
+			return err;
+		}
+	}
+
+	return 0;
+
+err_sched:
+	amdgpu_gfx_mec_pipe_restart_schedulers(adev, me, pipe, xcc_id);
+	return err;
+}
+
+/**
+ * amdgpu_gfx_mec_pipe_reset_run - coordinate MEC pipe reset between KGD and KFD
+ *
+ * Takes &adev->gfx.mec_pipe_reset_mutex for the full prepare → pipe HW/reset →
+ * recover sequence so KFD and KGD cannot interleave scheduler stop/start,
+ * MES map/unmap, or HQD programming on the same device.
+ *
+ * @queue: MEC queue index (required when @kcq_pipe_reset is used).
+ * AMDGPU_MEC_PIPE_RESET_NO_QUEUE is only valid with @kfd_pipe_reset (KFD path;
+ * pass @timedout_fence NULL). At least one of @kcq_pipe_reset or @kfd_pipe_reset
+ * must be non-NULL.
+ * If both are provided, only @kfd_pipe_reset is invoked.
+ *
+ * Returns: 0 on success, or a negative error code.
+ */
+int amdgpu_gfx_mec_pipe_reset_run(struct amdgpu_device *adev,
+				  u32 xcc_id, u32 me, u32 pipe, u32 queue,
+				  struct amdgpu_fence *timedout_fence,
+				  amdgpu_gfx_kcq_mec_pipe_reset_t kcq_pipe_reset,
+				  amdgpu_gfx_kfd_mec_pipe_reset_t kfd_pipe_reset,
+				  amdgpu_gfx_kcq_init_queue_t kcq_init)
+{
+	int err;
+
+	if (!kcq_pipe_reset && !kfd_pipe_reset)
+		return -EINVAL;
+
+	mutex_lock(&adev->gfx.mec_pipe_reset_mutex);
+	amdgpu_gfx_mec_pipe_reset_prepare(adev, xcc_id, me, pipe, queue);
+
+	if (kfd_pipe_reset)
+		err = kfd_pipe_reset(adev, xcc_id, me, pipe);
+	else
+		err = kcq_pipe_reset(adev, me, pipe, queue);
+
+	if (err) {
+		amdgpu_gfx_mec_pipe_restart_schedulers(adev, me, pipe, xcc_id);
+		mutex_unlock(&adev->gfx.mec_pipe_reset_mutex);
+		return err;
+	}
+
+	err = amdgpu_gfx_mec_pipe_reset_recover_queues(adev, xcc_id, me, pipe,
+							queue, timedout_fence,
+							kcq_init);
+	mutex_unlock(&adev->gfx.mec_pipe_reset_mutex);
+	return err;
+}
+
 bool amdgpu_gfx_is_mec_queue_enabled(struct amdgpu_device *adev,
 				     int xcc_id, int mec, int pipe, int queue)
 {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index a0cf0a3b41da..a1f13262d782 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -527,6 +527,9 @@ struct amdgpu_gfx {
 	const void			*cleaner_shader_ptr;
 	bool				enable_cleaner_shader;
 	struct amdgpu_isolation_work	enforce_isolation[MAX_XCP];
+	/* Serialize MEC pipe reset prep/HW/recover between KGD and KFD */
+	struct mutex			mec_pipe_reset_mutex;
+
 	/* Mutex for synchronizing KFD scheduler operations */
 	struct mutex                    userq_sch_mutex;
 	u64				userq_sch_req_count[MAX_XCP];
@@ -603,6 +606,38 @@ int amdgpu_gfx_mec_queue_to_bit(struct amdgpu_device *adev, int mec,
 				int pipe, int queue);
 void amdgpu_queue_mask_bit_to_mec_queue(struct amdgpu_device *adev, int bit,
 				 int *mec, int *pipe, int *queue);
+
+/*
+ * Pass @queue == AMDGPU_MEC_PIPE_RESET_NO_QUEUE when no DRM KCQ is the timeout
+ * victim (e.g. KFD-driven pipe reset); all queues on the pipe are backed up in
+ * prepare and recover uses collateral fences only.
+ */
+#define AMDGPU_MEC_PIPE_RESET_NO_QUEUE		U32_MAX
+
+typedef int (*amdgpu_gfx_kcq_init_queue_t)(struct amdgpu_ring *ring, bool clear);
+typedef int (*amdgpu_gfx_kcq_mec_pipe_reset_t)(struct amdgpu_device *adev,
+					       u32 me, u32 pipe, u32 queue);
+typedef int (*amdgpu_gfx_kfd_mec_pipe_reset_t)(struct amdgpu_device *adev,
+					       u32 xcc_id, u32 me, u32 pipe);
+
+int amdgpu_gfx_mec_pipe_reset_run(struct amdgpu_device *adev,
+				  u32 xcc_id, u32 me, u32 pipe, u32 queue,
+				  struct amdgpu_fence *timedout_fence,
+				  amdgpu_gfx_kcq_mec_pipe_reset_t kcq_pipe_reset,
+				  amdgpu_gfx_kfd_mec_pipe_reset_t kfd_pipe_reset,
+				  amdgpu_gfx_kcq_init_queue_t kcq_init);
+
+void amdgpu_gfx_mec_pipe_reset_prepare(struct amdgpu_device *adev,
+				       u32 xcc_id, u32 me, u32 pipe,
+				       u32 guilty_queue);
+void amdgpu_gfx_mec_pipe_restart_schedulers(struct amdgpu_device *adev,
+					    u32 me, u32 pipe, u32 xcc_id);
+int amdgpu_gfx_mec_pipe_reset_recover_queues(
+	struct amdgpu_device *adev,
+	u32 xcc_id, u32 me, u32 pipe,
+	u32 guilty_queue,
+	struct amdgpu_fence *timedout_fence,
+	amdgpu_gfx_kcq_init_queue_t kcq_init);
 bool amdgpu_gfx_is_mec_queue_enabled(struct amdgpu_device *adev, int xcc_id,
 				     int mec, int pipe, int queue);
 bool amdgpu_gfx_is_high_priority_compute_queue(struct amdgpu_device *adev,
-- 
2.49.0



* [PATCH v3 2/8] drm/amdgpu/gfx11: Refactor compute pipe reset and add HQD cleanup
  2026-04-14  8:58 [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues Jesse Zhang
@ 2026-04-14  8:58 ` Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 3/8] drm/amdgpu/gfx11: Fall back to pipe reset if per-queue reset ring test fails Jesse Zhang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Jesse Zhang @ 2026-04-14  8:58 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, Christian Koenig, Jesse Zhang, Manu Rastogi,
	Alex Deucher, Jesse Zhang

Refactor gfx_v11_0_reset_compute_pipe() to accept explicit me, pipe, and
queue parameters instead of deriving them from the ring structure. This
enables the function to be used in generic pipe reset flows.

Introduce gfx_v11_0_clear_hqds_on_mec_pipe() to properly clear
CP_HQD_ACTIVE and CP_HQD_DEQUEUE_REQUEST for all queues on a given MEC
pipe while the pipe reset is asserted, ensuring the HQDs are torn down
correctly before deasserting reset.

Switch the KCQ reset path to use the common MEC pipe reset helper
amdgpu_gfx_mec_pipe_reset_run(), which coordinates the reset sequence
including KFD suspend/resume to avoid conflicts with user mode queues.

Suggested-by: Manu Rastogi <manu.rastogi@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 177 +++++++++++++++----------
 1 file changed, 109 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index ae39b9e1f7d6..e29e8e620699 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -6906,11 +6906,39 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
 	return amdgpu_ring_reset_helper_end(ring, timedout_fence);
 }
 
-static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
+/*
+ * With MEC pipe reset asserted, clear CP_HQD_ACTIVE / CP_HQD_DEQUEUE_REQUEST for
+ * every queue on (me, pipe). HQDs must be torn down while pipe reset stays
+ * asserted; only then clear the pipe reset bit.
+ * Caller must hold adev->srbm_mutex.
+ */
+static void gfx_v11_0_clear_hqds_on_mec_pipe(struct amdgpu_device *adev, u32 me,
+					     u32 pipe)
 {
+	unsigned int q;
+	int j;
 
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t reset_pipe = 0, clean_pipe = 0;
+	for (q = 0; q < adev->gfx.mec.num_queue_per_pipe; q++) {
+		soc21_grbm_select(adev, me, pipe, q, 0);
+		/* Start from a clean HQD dequeue state before forcing HQD inactive. */
+		WREG32_SOC15(GC, 0, regCP_HQD_ACTIVE, 0);
+		if (RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) & 1) {
+			WREG32_SOC15(GC, 0, regCP_HQD_DEQUEUE_REQUEST, 1);
+			for (j = 0; j < adev->usec_timeout; j++) {
+				if (!(RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) & 1))
+					break;
+				udelay(1);
+			}
+		}
+
+		WREG32_SOC15(GC, 0, regCP_HQD_DEQUEUE_REQUEST, 0);
+	}
+}
+
+static int gfx_v11_0_reset_compute_pipe(struct amdgpu_device *adev,
+					   u32 me, u32 pipe, u32 queue)
+{
+	uint32_t reset_val, clean_val;
 	int r;
 
 	if (!gfx_v11_pipe_reset_support(adev))
@@ -6918,109 +6946,115 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
 
 	gfx_v11_0_set_safe_mode(adev, 0);
 	mutex_lock(&adev->srbm_mutex);
-	soc21_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
-
-	reset_pipe = RREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL);
-	clean_pipe = reset_pipe;
+	soc21_grbm_select(adev, me, pipe, queue, 0);
 
 	if (adev->gfx.rs64_enable) {
+		reset_val = RREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL);
+		clean_val = reset_val;
 
-		switch (ring->pipe) {
+		switch (pipe) {
 		case 0:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE0_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE0_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE0_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE0_RESET, 0);
 			break;
 		case 1:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE1_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE1_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE1_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE1_RESET, 0);
 			break;
 		case 2:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE2_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE2_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE2_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE2_RESET, 0);
 			break;
 		case 3:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE3_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE3_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE3_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE3_RESET, 0);
 			break;
 		default:
 			break;
 		}
-		WREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL, reset_pipe);
-		WREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL, clean_pipe);
+		WREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL, reset_val);
+		gfx_v11_0_clear_hqds_on_mec_pipe(adev, me, pipe);
+		soc21_grbm_select(adev, me, pipe, queue, 0);
+		WREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL, clean_val);
 		r = (RREG32_SOC15(GC, 0, regCP_MEC_RS64_INSTR_PNTR) << 2) -
 					RS64_FW_UC_START_ADDR_LO;
 	} else {
-		if (ring->me == 1) {
-			switch (ring->pipe) {
+		reset_val = RREG32_SOC15(GC, 0, regCP_MEC_CNTL);
+		clean_val = reset_val;
+
+		if (me == 1) {
+			switch (pipe) {
 			case 0:
-				reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE0_RESET, 1);
-				clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE0_RESET, 0);
+				reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+							  MEC_ME1_PIPE0_RESET, 1);
+				clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+							  MEC_ME1_PIPE0_RESET, 0);
 				break;
 			case 1:
-				reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE1_RESET, 1);
-				clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE1_RESET, 0);
+				reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+							  MEC_ME1_PIPE1_RESET, 1);
+				clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+							  MEC_ME1_PIPE1_RESET, 0);
 				break;
 			case 2:
-				reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE2_RESET, 1);
-				clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE2_RESET, 0);
+				reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+							  MEC_ME1_PIPE2_RESET, 1);
+				clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+							  MEC_ME1_PIPE2_RESET, 0);
 				break;
 			case 3:
-				reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE3_RESET, 1);
-				clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE3_RESET, 0);
+				reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+							  MEC_ME1_PIPE3_RESET, 1);
+				clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+							  MEC_ME1_PIPE3_RESET, 0);
 				break;
 			default:
 				break;
 			}
 			/* mec1 fw pc: CP_MEC1_INSTR_PNTR */
 		} else {
-			switch (ring->pipe) {
+			switch (pipe) {
 			case 0:
-				reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME2_PIPE0_RESET, 1);
-				clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME2_PIPE0_RESET, 0);
+				reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+							  MEC_ME2_PIPE0_RESET, 1);
+				clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+							  MEC_ME2_PIPE0_RESET, 0);
 				break;
 			case 1:
-				reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME2_PIPE1_RESET, 1);
-				clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME2_PIPE1_RESET, 0);
+				reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+							  MEC_ME2_PIPE1_RESET, 1);
+				clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+							  MEC_ME2_PIPE1_RESET, 0);
 				break;
 			case 2:
-				reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME2_PIPE2_RESET, 1);
-				clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME2_PIPE2_RESET, 0);
+				reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+							  MEC_ME2_PIPE2_RESET, 1);
+				clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+							  MEC_ME2_PIPE2_RESET, 0);
 				break;
 			case 3:
-				reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME2_PIPE3_RESET, 1);
-				clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME2_PIPE3_RESET, 0);
+				reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+							  MEC_ME2_PIPE3_RESET, 1);
+				clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+							  MEC_ME2_PIPE3_RESET, 0);
 				break;
 			default:
 				break;
 			}
 			/* mec2 fw pc: CP:CP_MEC2_INSTR_PNTR */
 		}
-		WREG32_SOC15(GC, 0, regCP_MEC_CNTL, reset_pipe);
-		WREG32_SOC15(GC, 0, regCP_MEC_CNTL, clean_pipe);
+		WREG32_SOC15(GC, 0, regCP_MEC_CNTL, reset_val);
+		gfx_v11_0_clear_hqds_on_mec_pipe(adev, me, pipe);
+		soc21_grbm_select(adev, me, pipe, queue, 0);
+		WREG32_SOC15(GC, 0, regCP_MEC_CNTL, clean_val);
 		r = RREG32(SOC15_REG_OFFSET(GC, 0, regCP_MEC1_INSTR_PNTR));
 	}
 
@@ -7028,8 +7062,8 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_ring *ring)
 	mutex_unlock(&adev->srbm_mutex);
 	gfx_v11_0_unset_safe_mode(adev, 0);
 
-	dev_info(adev->dev, "The ring %s pipe resets to MEC FW start PC: %s\n", ring->name,
-			r == 0 ? "successfully" : "failed");
+	dev_dbg(adev->dev, "MEC pipe me%u pipe%u queue%u resets to MEC FW start PC: %s\n",
+		me, pipe, queue, r == 0 ? "successfully" : "failed");
 	/*FIXME:Sometimes driver can't cache the MEC firmware start PC correctly, so the pipe
 	 * reset status relies on the compute ring test result.
 	 */
@@ -7048,9 +7082,16 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
 	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true, 0);
 	if (r) {
 		dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
-		r = gfx_v11_0_reset_compute_pipe(ring);
-		if (r)
-			return r;
+
+		amdgpu_amdkfd_suspend(adev, true);
+		r = amdgpu_gfx_mec_pipe_reset_run(adev,
+						     ring->xcc_id, ring->me, ring->pipe,
+						     ring->queue, timedout_fence,
+						     gfx_v11_0_reset_compute_pipe,
+						     NULL,
+						     gfx_v11_0_kcq_init_queue);
+		amdgpu_amdkfd_resume(adev, true);
+		return r;
 	}
 
 	r = gfx_v11_0_kcq_init_queue(ring, true);
-- 
2.49.0



* [PATCH v3 3/8] drm/amdgpu/gfx11: Fall back to pipe reset if per-queue reset ring test fails
  2026-04-14  8:58 [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 2/8] drm/amdgpu/gfx11: Refactor compute pipe reset and add HQD cleanup Jesse Zhang
@ 2026-04-14  8:58 ` Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 4/8] drm/amdgpu/gfx11: enable per-pipe reset support for compute queues Jesse Zhang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Jesse Zhang @ 2026-04-14  8:58 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Christian Koenig, Jesse Zhang, Jesse Zhang

After a per-queue reset via MES, verify that the queue is functional by
performing a ring test. If the test fails, fall back to a full pipe reset
to ensure proper recovery.

This adds a fallback path similar to the one already present when the
initial per-queue reset attempt fails, improving the robustness of KCQ
reset handling on GFX11 hardware.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index e29e8e620699..fbef19ed46f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -7075,14 +7075,18 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
 			       struct amdgpu_fence *timedout_fence)
 {
 	struct amdgpu_device *adev = ring->adev;
+	int reset_mode = AMDGPU_RESET_TYPE_PER_QUEUE;
 	int r = 0;
 
 	amdgpu_ring_reset_helper_begin(ring, timedout_fence);
 
 	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true, 0);
+
+pipe_reset:
 	if (r) {
 		dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
 
+		reset_mode = AMDGPU_RESET_TYPE_PER_PIPE;
 		amdgpu_amdkfd_suspend(adev, true);
 		r = amdgpu_gfx_mec_pipe_reset_run(adev,
 						     ring->xcc_id, ring->me, ring->pipe,
@@ -7105,6 +7109,13 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
 		return r;
 	}
 
+	if (reset_mode == AMDGPU_RESET_TYPE_PER_QUEUE) {
+		r = amdgpu_ring_reset_helper_end(ring, timedout_fence);
+		if (r)
+			goto pipe_reset;
+		return 0;
+	}
+
 	return amdgpu_ring_reset_helper_end(ring, timedout_fence);
 }
 
-- 
2.49.0



* [PATCH v3 4/8] drm/amdgpu/gfx11: enable per-pipe reset support for compute queues
  2026-04-14  8:58 [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 2/8] drm/amdgpu/gfx11: Refactor compute pipe reset and add HQD cleanup Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 3/8] drm/amdgpu/gfx11: Fall back to pipe reset if per-queue reset ring test fails Jesse Zhang
@ 2026-04-14  8:58 ` Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 5/8] drm/amdgpu/gfx12: Refactor compute pipe reset and add HQD cleanup Jesse Zhang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Jesse Zhang @ 2026-04-14  8:58 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, Christian Koenig, Jesse Zhang, Alex Deucher,
	Jesse Zhang

This allows the driver to fall back to pipe-level reset when per-queue
reset fails, improving recovery success for hung compute or graphics
rings.

v2: replace both gfx_v11_compute_pipe_reset_support() and
gfx_v11_pipe_reset_support() with amdgpu_ring_is_reset_type_supported() (Alex)

Suggested-by:  Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index fbef19ed46f9..d2e8c50f8fdb 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -1851,6 +1851,8 @@ static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block)
 		    !adev->debug_disable_gpu_ring_reset) {
 			adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
 			adev->gfx.gfx_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
+			if (adev->gfx.mec_fw_version >= 2670)
+				adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_PIPE;
 		}
 		break;
 	default:
@@ -1858,6 +1860,7 @@ static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block)
 		    !adev->debug_disable_gpu_ring_reset) {
 			adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
 			adev->gfx.gfx_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
+			adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_PIPE;
 		}
 		break;
 	}
@@ -6807,13 +6810,6 @@ static void gfx_v11_0_emit_mem_sync(struct amdgpu_ring *ring)
 	amdgpu_ring_write(ring, gcr_cntl); /* GCR_CNTL */
 }
 
-static bool gfx_v11_pipe_reset_support(struct amdgpu_device *adev)
-{
-	/* Disable the pipe reset until the CPFW fully support it.*/
-	dev_warn_once(adev->dev, "The CPFW hasn't support pipe reset yet.\n");
-	return false;
-}
-
 
 static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
 {
@@ -6821,9 +6817,6 @@ static int gfx_v11_reset_gfx_pipe(struct amdgpu_ring *ring)
 	uint32_t reset_pipe = 0, clean_pipe = 0;
 	int r;
 
-	if (!gfx_v11_pipe_reset_support(adev))
-		return -EOPNOTSUPP;
-
 	gfx_v11_0_set_safe_mode(adev, 0);
 	mutex_lock(&adev->srbm_mutex);
 	soc21_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
@@ -6884,6 +6877,10 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring,
 	if (r) {
 
 		dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
+		if (!amdgpu_ring_is_reset_type_supported(ring,
+						AMDGPU_RESET_TYPE_PER_PIPE))
+			return r;
+
 		r = gfx_v11_reset_gfx_pipe(ring);
 		if (r)
 			return r;
@@ -6941,9 +6938,6 @@ static int gfx_v11_0_reset_compute_pipe(struct amdgpu_device *adev,
 	uint32_t reset_val, clean_val;
 	int r;
 
-	if (!gfx_v11_pipe_reset_support(adev))
-		return -EOPNOTSUPP;
-
 	gfx_v11_0_set_safe_mode(adev, 0);
 	mutex_lock(&adev->srbm_mutex);
 	soc21_grbm_select(adev, me, pipe, queue, 0);
@@ -7085,8 +7079,11 @@ static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring,
 pipe_reset:
 	if (r) {
 		dev_warn(adev->dev, "fail(%d) to reset kcq and try pipe reset\n", r);
-
 		reset_mode = AMDGPU_RESET_TYPE_PER_PIPE;
+		if (!amdgpu_ring_is_reset_type_supported(ring,
+					AMDGPU_RESET_TYPE_PER_PIPE))
+			return r;
+
 		amdgpu_amdkfd_suspend(adev, true);
 		r = amdgpu_gfx_mec_pipe_reset_run(adev,
 						     ring->xcc_id, ring->me, ring->pipe,
-- 
2.49.0



* [PATCH v3 5/8] drm/amdgpu/gfx12: Refactor compute pipe reset and add HQD cleanup
  2026-04-14  8:58 [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues Jesse Zhang
                   ` (2 preceding siblings ...)
  2026-04-14  8:58 ` [PATCH v3 4/8] drm/amdgpu/gfx11: enable per-pipe reset support for compute queues Jesse Zhang
@ 2026-04-14  8:58 ` Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 6/8] drm/amdgpu/gfx12: Fall back to pipe reset if per-queue reset ring test fails Jesse Zhang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Jesse Zhang @ 2026-04-14  8:58 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, Christian Koenig, Jesse Zhang, Manu Rastogi,
	Alex Deucher, Jesse Zhang

Refactor gfx_v12_0_reset_compute_pipe() to accept explicit me, pipe, and
queue parameters instead of deriving them from the ring structure. This
enables the function to be used in generic pipe reset flows.

Introduce gfx_v12_0_clear_hqds_on_mec_pipe() to properly clear
CP_HQD_ACTIVE and CP_HQD_DEQUEUE_REQUEST for all queues on a given MEC
pipe while the pipe reset is asserted, ensuring the HQDs are torn down
correctly before deasserting reset.

Switch the KCQ reset path to use the common MEC pipe reset helper
amdgpu_gfx_mec_pipe_reset_run(), which coordinates the reset sequence
including KFD suspend/resume to avoid conflicts with user mode queues.

Suggested-by: Manu Rastogi <manu.rastogi@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 133 ++++++++++++++++---------
 1 file changed, 85 insertions(+), 48 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index a418ae609c36..676a655d1cb6 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -5355,10 +5355,38 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
 	return amdgpu_ring_reset_helper_end(ring, timedout_fence);
 }
 
-static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
+/*
+ * With MEC pipe reset asserted, clear CP_HQD_ACTIVE / CP_HQD_DEQUEUE_REQUEST for
+ * every queue on (me, pipe). HQDs must be torn down while pipe reset stays
+ * asserted; only then clear the pipe reset bit.
+ * Caller must hold adev->srbm_mutex.
+ */
+static void gfx_v12_0_clear_hqds_on_mec_pipe(struct amdgpu_device *adev, u32 me,
+					     u32 pipe)
 {
-	struct amdgpu_device *adev = ring->adev;
-	uint32_t reset_pipe = 0, clean_pipe = 0;
+	unsigned int q;
+	int j;
+
+	for (q = 0; q < adev->gfx.mec.num_queue_per_pipe; q++) {
+		soc24_grbm_select(adev, me, pipe, q, 0);
+		/* Start from a clean HQD dequeue state before forcing HQD inactive. */
+		WREG32_SOC15(GC, 0, regCP_HQD_ACTIVE, 0);
+		if (RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) & 1) {
+			WREG32_SOC15(GC, 0, regCP_HQD_DEQUEUE_REQUEST, 1);
+			for (j = 0; j < adev->usec_timeout; j++) {
+				if (!(RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) & 1))
+					break;
+				udelay(1);
+			}
+		}
+		WREG32_SOC15(GC, 0, regCP_HQD_DEQUEUE_REQUEST, 0);
+	}
+}
+
+static int gfx_v12_0_reset_compute_pipe(struct amdgpu_device *adev,
+					   u32 me, u32 pipe, u32 queue)
+{
+	uint32_t reset_val, clean_val;
 	int r = 0;
 
 	if (!gfx_v12_pipe_reset_support(adev))
@@ -5366,75 +5394,78 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_ring *ring)
 
 	gfx_v12_0_set_safe_mode(adev, 0);
 	mutex_lock(&adev->srbm_mutex);
-	soc24_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
-
-	reset_pipe = RREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL);
-	clean_pipe = reset_pipe;
-
+	soc24_grbm_select(adev, me, pipe, queue, 0);
 	if (adev->gfx.rs64_enable) {
-		switch (ring->pipe) {
+		reset_val = RREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL);
+		clean_val = reset_val;
+
+		switch (pipe) {
 		case 0:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE0_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE0_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE0_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE0_RESET, 0);
 			break;
 		case 1:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE1_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE1_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE1_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE1_RESET, 0);
 			break;
 		case 2:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE2_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE2_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE2_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE2_RESET, 0);
 			break;
 		case 3:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE3_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_RS64_CNTL,
-						   MEC_PIPE3_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE3_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_RS64_CNTL,
+						  MEC_PIPE3_RESET, 0);
 			break;
 		default:
 			break;
 		}
-		WREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL, reset_pipe);
-		WREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL, clean_pipe);
+		WREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL, reset_val);
+		gfx_v12_0_clear_hqds_on_mec_pipe(adev, me, pipe);
+		soc24_grbm_select(adev, me, pipe, queue, 0);
+		WREG32_SOC15(GC, 0, regCP_MEC_RS64_CNTL, clean_val);
 		r = (RREG32_SOC15(GC, 0, regCP_MEC_RS64_INSTR_PNTR) << 2) -
 				RS64_FW_UC_START_ADDR_LO;
 	} else {
-		switch (ring->pipe) {
+		reset_val = RREG32_SOC15(GC, 0, regCP_MEC_CNTL);
+		clean_val = reset_val;
+
+		switch (pipe) {
 		case 0:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE0_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE0_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+						  MEC_ME1_PIPE0_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+						  MEC_ME1_PIPE0_RESET, 0);
 			break;
 		case 1:
-			reset_pipe = REG_SET_FIELD(reset_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE1_RESET, 1);
-			clean_pipe = REG_SET_FIELD(clean_pipe, CP_MEC_CNTL,
-							   MEC_ME1_PIPE1_RESET, 0);
+			reset_val = REG_SET_FIELD(reset_val, CP_MEC_CNTL,
+						  MEC_ME1_PIPE1_RESET, 1);
+			clean_val = REG_SET_FIELD(clean_val, CP_MEC_CNTL,
+						  MEC_ME1_PIPE1_RESET, 0);
 			break;
 		default:
-		break;
+			break;
 		}
-		WREG32_SOC15(GC, 0, regCP_MEC_CNTL, reset_pipe);
-		WREG32_SOC15(GC, 0, regCP_MEC_CNTL, clean_pipe);
-		/* Doesn't find the F32 MEC instruction pointer register, and suppose
-		 * the driver won't run into the F32 mode.
-		 */
+
+		WREG32_SOC15(GC, 0, regCP_MEC_CNTL, reset_val);
+		gfx_v12_0_clear_hqds_on_mec_pipe(adev, me, pipe);
+		soc24_grbm_select(adev, me, pipe, queue, 0);
+		WREG32_SOC15(GC, 0, regCP_MEC_CNTL, clean_val);
 	}
 
 	soc24_grbm_select(adev, 0, 0, 0, 0);
 	mutex_unlock(&adev->srbm_mutex);
 	gfx_v12_0_unset_safe_mode(adev, 0);
 
-	dev_info(adev->dev, "The ring %s pipe resets: %s\n", ring->name,
-			r == 0 ? "successfully" : "failed");
-	/* Need the ring test to verify the pipe reset result.*/
+	dev_dbg(adev->dev, "MEC pipe me%u pipe%u queue%u resets to MEC FW start PC: %s\n",
+		me, pipe, queue, r == 0 ? "successfully" : "failed");
 	return 0;
 }
 
@@ -5450,9 +5481,15 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
 	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true, 0);
 	if (r) {
 		dev_warn(adev->dev, "fail(%d) to reset kcq  and try pipe reset\n", r);
-		r = gfx_v12_0_reset_compute_pipe(ring);
-		if (r)
-			return r;
+		amdgpu_amdkfd_suspend(adev, true);
+		r = amdgpu_gfx_mec_pipe_reset_run(adev,
+						     ring->xcc_id, ring->me, ring->pipe,
+						     ring->queue, timedout_fence,
+						     gfx_v12_0_reset_compute_pipe,
+						     NULL,
+						     gfx_v12_0_kcq_init_queue);
+		amdgpu_amdkfd_resume(adev, true);
+		return r;
 	}
 
 	r = gfx_v12_0_kcq_init_queue(ring, true);
-- 
2.49.0



* [PATCH v3 6/8] drm/amdgpu/gfx12: Fall back to pipe reset if per-queue reset ring test fails
  2026-04-14  8:58 [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues Jesse Zhang
                   ` (3 preceding siblings ...)
  2026-04-14  8:58 ` [PATCH v3 5/8] drm/amdgpu/gfx12: Refactor compute pipe reset and add HQD cleanup Jesse Zhang
@ 2026-04-14  8:58 ` Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 7/8] drm/amdgpu/gfx12: enable per-pipe reset support for compute queues Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 8/8] drm/amdgpu/gfx_v12_0: set gfx.rs64_enable from PFP header on GFX12 Jesse Zhang
  6 siblings, 0 replies; 8+ messages in thread
From: Jesse Zhang @ 2026-04-14  8:58 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Christian Koenig, Jesse Zhang, Jesse Zhang

After a per-queue reset via MES, verify that the queue is functional by
performing a ring test. If the test fails, fall back to a full pipe reset
to ensure proper recovery.

This adds a fallback path similar to the one already taken when the
initial per-queue reset attempt fails, improving the robustness of KCQ
reset handling on GFX12 hardware.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 676a655d1cb6..f90354e2ab3f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -5474,13 +5474,17 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
 			       struct amdgpu_fence *timedout_fence)
 {
 	struct amdgpu_device *adev = ring->adev;
+	int reset_mode = AMDGPU_RESET_TYPE_PER_QUEUE;
 	int r;
 
 	amdgpu_ring_reset_helper_begin(ring, timedout_fence);
 
 	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, true, 0);
+
+pipe_reset:
 	if (r) {
 		dev_warn(adev->dev, "fail(%d) to reset kcq  and try pipe reset\n", r);
+		reset_mode = AMDGPU_RESET_TYPE_PER_PIPE;
 		amdgpu_amdkfd_suspend(adev, true);
 		r = amdgpu_gfx_mec_pipe_reset_run(adev,
 						     ring->xcc_id, ring->me, ring->pipe,
@@ -5503,6 +5507,13 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
 		return r;
 	}
 
+	if (reset_mode == AMDGPU_RESET_TYPE_PER_QUEUE) {
+		if (amdgpu_ring_reset_helper_end(ring, timedout_fence))
+			goto pipe_reset;
+		else
+			return 0;
+	}
+
 	return amdgpu_ring_reset_helper_end(ring, timedout_fence);
 }
 
-- 
2.49.0



* [PATCH v3 7/8] drm/amdgpu/gfx12: enable per-pipe reset support for compute queues
  2026-04-14  8:58 [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues Jesse Zhang
                   ` (4 preceding siblings ...)
  2026-04-14  8:58 ` [PATCH v3 6/8] drm/amdgpu/gfx12: Fall back to pipe reset if per-queue reset ring test fails Jesse Zhang
@ 2026-04-14  8:58 ` Jesse Zhang
  2026-04-14  8:58 ` [PATCH v3 8/8] drm/amdgpu/gfx_v12_0: set gfx.rs64_enable from PFP header on GFX12 Jesse Zhang
  6 siblings, 0 replies; 8+ messages in thread
From: Jesse Zhang @ 2026-04-14  8:58 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, Christian Koenig, Jesse Zhang, Alex Deucher,
	Jesse Zhang

This allows the driver to fall back to pipe-level reset when per-queue
reset fails, improving recovery success for hung compute or graphics
rings.

V2: replace gfx_v12_pipe_reset_support() with
amdgpu_ring_is_reset_type_supported() (Alex)

Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index f90354e2ab3f..2dcdee1eef1c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -1557,6 +1557,8 @@ static int gfx_v12_0_sw_init(struct amdgpu_ip_block *ip_block)
 		    !adev->debug_disable_gpu_ring_reset) {
 			adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
 			adev->gfx.gfx_supported_reset |= AMDGPU_RESET_TYPE_PER_QUEUE;
+			if (adev->gfx.mec_fw_version >= 3190)
+				adev->gfx.compute_supported_reset |= AMDGPU_RESET_TYPE_PER_PIPE;
 		}
 		break;
 	default:
@@ -5257,22 +5259,12 @@ static void gfx_v12_ip_dump(struct amdgpu_ip_block *ip_block)
 	amdgpu_gfx_off_ctrl(adev, true);
 }
 
-static bool gfx_v12_pipe_reset_support(struct amdgpu_device *adev)
-{
-	/* Disable the pipe reset until the CPFW fully support it.*/
-	dev_warn_once(adev->dev, "The CPFW hasn't support pipe reset yet.\n");
-	return false;
-}
-
 static int gfx_v12_reset_gfx_pipe(struct amdgpu_ring *ring)
 {
 	struct amdgpu_device *adev = ring->adev;
 	uint32_t reset_pipe = 0, clean_pipe = 0;
 	int r;
 
-	if (!gfx_v12_pipe_reset_support(adev))
-		return -EOPNOTSUPP;
-
 	gfx_v12_0_set_safe_mode(adev, 0);
 	mutex_lock(&adev->srbm_mutex);
 	soc24_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
@@ -5333,6 +5325,10 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring,
 	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, use_mmio, 0);
 	if (r) {
 		dev_warn(adev->dev, "reset via MES failed and try pipe reset %d\n", r);
+		if (!amdgpu_ring_is_reset_type_supported(ring,
+					AMDGPU_RESET_TYPE_PER_PIPE))
+			return r;
+
 		r = gfx_v12_reset_gfx_pipe(ring);
 		if (r)
 			return r;
@@ -5389,9 +5385,6 @@ static int gfx_v12_0_reset_compute_pipe(struct amdgpu_device *adev,
 	uint32_t reset_val, clean_val;
 	int r = 0;
 
-	if (!gfx_v12_pipe_reset_support(adev))
-		return -EOPNOTSUPP;
-
 	gfx_v12_0_set_safe_mode(adev, 0);
 	mutex_lock(&adev->srbm_mutex);
 	soc24_grbm_select(adev, me, pipe, queue, 0);
@@ -5485,6 +5478,10 @@ static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring,
 	if (r) {
 		dev_warn(adev->dev, "fail(%d) to reset kcq  and try pipe reset\n", r);
 		reset_mode = AMDGPU_RESET_TYPE_PER_PIPE;
+		if (!amdgpu_ring_is_reset_type_supported(ring,
+						AMDGPU_RESET_TYPE_PER_PIPE))
+			return r;
+
 		amdgpu_amdkfd_suspend(adev, true);
 		r = amdgpu_gfx_mec_pipe_reset_run(adev,
 						     ring->xcc_id, ring->me, ring->pipe,
-- 
2.49.0



* [PATCH v3 8/8] drm/amdgpu/gfx_v12_0: set gfx.rs64_enable from PFP header on GFX12
  2026-04-14  8:58 [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues Jesse Zhang
                   ` (5 preceding siblings ...)
  2026-04-14  8:58 ` [PATCH v3 7/8] drm/amdgpu/gfx12: enable per-pipe reset support for compute queues Jesse Zhang
@ 2026-04-14  8:58 ` Jesse Zhang
  6 siblings, 0 replies; 8+ messages in thread
From: Jesse Zhang @ 2026-04-14  8:58 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, Christian Koenig, Jesse Zhang, Jesse Zhang

gfx_v12_0_init_microcode() always loads RS64 CP ucode but never sets
adev->gfx.rs64_enable, so it stays false and code that branches on it
(e.g. the MEC pipe reset) incorrectly takes the legacy CP_MEC_CNTL path.

Match GFX11: derive RS64 mode from the PFP firmware header (v2.0) via
amdgpu_ucode_hdr_version(). Log at debug when RS64 is enabled.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 2dcdee1eef1c..a88c8bc4be64 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -602,6 +602,13 @@ static int gfx_v12_0_init_microcode(struct amdgpu_device *adev)
 				   "amdgpu/%s_pfp.bin", ucode_prefix);
 	if (err)
 		goto out;
+
+	adev->gfx.rs64_enable = amdgpu_ucode_hdr_version(
+				(union amdgpu_firmware_header *)
+				adev->gfx.pfp_fw->data, 2, 0);
+	if (adev->gfx.rs64_enable)
+		dev_dbg(adev->dev, "CP RS64 enable\n");
+
 	amdgpu_gfx_cp_init_microcode(adev, AMDGPU_UCODE_ID_CP_RS64_PFP);
 	amdgpu_gfx_cp_init_microcode(adev, AMDGPU_UCODE_ID_CP_RS64_PFP_P0_STACK);
 
-- 
2.49.0



end of thread, other threads:[~2026-04-14  9:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-14  8:58 [PATCH v3 1/8] drm/amdgpu: add coordinated MEC pipe reset for GFX compute queues Jesse Zhang
2026-04-14  8:58 ` [PATCH v3 2/8] drm/amdgpu/gfx11: Refactor compute pipe reset and add HQD cleanup Jesse Zhang
2026-04-14  8:58 ` [PATCH v3 3/8] drm/amdgpu/gfx11: Fall back to pipe reset if per-queue reset ring test fails Jesse Zhang
2026-04-14  8:58 ` [PATCH v3 4/8] drm/amdgpu/gfx11: enable per-pipe reset support for compute queues Jesse Zhang
2026-04-14  8:58 ` [PATCH v3 5/8] drm/amdgpu/gfx12: Refactor compute pipe reset and add HQD cleanup Jesse Zhang
2026-04-14  8:58 ` [PATCH v3 6/8] drm/amdgpu/gfx12: Fall back to pipe reset if per-queue reset ring test fails Jesse Zhang
2026-04-14  8:58 ` [PATCH v3 7/8] drm/amdgpu/gfx12: enable per-pipe reset support for compute queues Jesse Zhang
2026-04-14  8:58 ` [PATCH v3 8/8] drm/amdgpu/gfx_v12_0: set gfx.rs64_enable from PFP header on GFX12 Jesse Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox