[PATCH 00/34] GC per queue reset

* [PATCH 00/34] GC per queue reset
@ 2024-07-18 14:06 Alex Deucher
  2024-07-18 14:07 ` [PATCH 01/34] drm/amdgpu/mes: add API for legacy " Alex Deucher
                   ` (35 more replies)
  0 siblings, 36 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:06 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

This adds preliminary support for GC per queue reset.  In this
case, only the jobs currently in the queue are lost.  If this
fails, we fall back to a full adapter reset.
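
As a rough illustration of the fallback described above (a userspace
sketch under assumptions, not the driver's code): recovery first tries
the per-queue hook, if the ring has one, and only performs a full
adapter reset when that hook is missing or fails. Every `sketch_*`
name is hypothetical, and the counters stand in for the real reset work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a ring with an optional per-queue reset hook. */
struct sketch_ring {
	/* NULL when the hardware block has no per-queue reset support. */
	int (*reset)(struct sketch_ring *ring, unsigned int vmid);
	unsigned int queue_resets;	/* successful per-queue resets */
	unsigned int adapter_resets;	/* full adapter resets */
};

enum sketch_recovery {
	SKETCH_RECOVERED_QUEUE,
	SKETCH_RECOVERED_ADAPTER,
};

/* Try the cheap per-queue reset first; fall back to a full adapter reset. */
static enum sketch_recovery sketch_recover(struct sketch_ring *ring,
					   unsigned int vmid)
{
	assert(ring != NULL);

	if (ring->reset && ring->reset(ring, vmid) == 0) {
		ring->queue_resets++;
		return SKETCH_RECOVERED_QUEUE;
	}

	/* Hook missing or failed: everything on the adapter is reset. */
	ring->adapter_resets++;
	return SKETCH_RECOVERED_ADAPTER;
}

/* Two toy hooks used to exercise both paths. */
static int sketch_reset_ok(struct sketch_ring *ring, unsigned int vmid)
{
	(void)ring;
	(void)vmid;
	return 0;
}

static int sketch_reset_fail(struct sketch_ring *ring, unsigned int vmid)
{
	(void)ring;
	(void)vmid;
	return -1;
}
```

With `sketch_reset_ok` installed only `queue_resets` is incremented;
with `sketch_reset_fail`, or with no hook at all, the adapter path runs.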

Alex Deucher (19):
  drm/amdgpu/mes: add API for legacy queue reset
  drm/amdgpu/mes11: add API for legacy queue reset
  drm/amdgpu/mes12: add API for legacy queue reset
  drm/amdgpu/mes: add API for user queue reset
  drm/amdgpu/mes11: add API for user queue reset
  drm/amdgpu/mes12: add API for user queue reset
  drm/amdgpu: add new ring reset callback
  drm/amdgpu: add per ring reset support (v2)
  drm/amdgpu/gfx11: add ring reset callbacks
  drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
  drm/amdgpu/gfx10: add ring reset callbacks
  drm/amdgpu/gfx10: rework reset sequence
  drm/amdgpu/gfx9: add ring reset callback
  drm/amdgpu/gfx9.4.3: add ring reset callback
  drm/amdgpu/gfx12: add ring reset callbacks
  drm/amdgpu/gfx12: fallback to driver reset compute queue directly
  drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
  drm/amdgpu/gfx11: add a mutex for the gfx semaphore
  drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()

Jiadong Zhu (13):
  drm/amdgpu/gfx11: wait for reset done before remap
  drm/amdgpu/gfx10: remap queue after reset successfully
  drm/amdgpu/gfx10: wait for reset done before remap
  drm/amdgpu/gfx9: remap queue after reset successfully
  drm/amdgpu/gfx9: wait for reset done before remap
  drm/amdgpu/gfx9.4.3: remap queue after reset successfully
  drm/amdgpu/gfx_9.4.3: wait for reset done before remap
  drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
  drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
  drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
  drm/amdgpu/mes: modify mes api for mmio queue reset
  drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
  drm/amdgpu/mes11: implement mmio queue reset for gfx11

Prike Liang (2):
  drm/amdgpu: increase the reset counter for the queue reset
  drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
 14 files changed, 930 insertions(+), 32 deletions(-)

-- 
2.45.2



* [PATCH 01/34] drm/amdgpu/mes: add API for legacy queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 02/34] drm/amdgpu/mes11: " Alex Deucher
                   ` (34 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add API for resetting kernel queues.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 24 ++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 16 ++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index e499d6ba306b..1739aa11cbd2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -819,6 +819,30 @@ int amdgpu_mes_unmap_legacy_queue(struct amdgpu_device *adev,
 	return r;
 }
 
+int amdgpu_mes_reset_legacy_queue(struct amdgpu_device *adev,
+				  struct amdgpu_ring *ring,
+				  unsigned int vmid)
+{
+	struct mes_reset_legacy_queue_input queue_input;
+	int r;
+
+	memset(&queue_input, 0, sizeof(queue_input));
+
+	queue_input.queue_type = ring->funcs->type;
+	queue_input.doorbell_offset = ring->doorbell_index;
+	queue_input.pipe_id = ring->pipe;
+	queue_input.queue_id = ring->queue;
+	queue_input.mqd_addr = amdgpu_bo_gpu_offset(ring->mqd_obj);
+	queue_input.wptr_addr = ring->wptr_gpu_addr;
+	queue_input.vmid = vmid;
+
+	r = adev->mes.funcs->reset_legacy_queue(&adev->mes, &queue_input);
+	if (r)
+		DRM_ERROR("failed to reset legacy queue\n");
+
+	return r;
+}
+
 uint32_t amdgpu_mes_rreg(struct amdgpu_device *adev, uint32_t reg)
 {
 	struct mes_misc_op_input op_input;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index e11051271f71..4456956c325b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -279,6 +279,16 @@ struct mes_resume_gang_input {
 	uint64_t	gang_context_addr;
 };
 
+struct mes_reset_legacy_queue_input {
+	uint32_t                           queue_type;
+	uint32_t                           doorbell_offset;
+	uint32_t                           pipe_id;
+	uint32_t                           queue_id;
+	uint64_t                           mqd_addr;
+	uint64_t                           wptr_addr;
+	uint32_t                           vmid;
+};
+
 enum mes_misc_opcode {
 	MES_MISC_OP_WRITE_REG,
 	MES_MISC_OP_READ_REG,
@@ -347,6 +357,9 @@ struct amdgpu_mes_funcs {
 
 	int (*misc_op)(struct amdgpu_mes *mes,
 		       struct mes_misc_op_input *input);
+
+	int (*reset_legacy_queue)(struct amdgpu_mes *mes,
+				  struct mes_reset_legacy_queue_input *input);
 };
 
 #define amdgpu_mes_kiq_hw_init(adev) (adev)->mes.kiq_hw_init((adev))
@@ -381,6 +394,9 @@ int amdgpu_mes_unmap_legacy_queue(struct amdgpu_device *adev,
 				  struct amdgpu_ring *ring,
 				  enum amdgpu_unmap_queues_action action,
 				  u64 gpu_addr, u64 seq);
+int amdgpu_mes_reset_legacy_queue(struct amdgpu_device *adev,
+				  struct amdgpu_ring *ring,
+				  unsigned int vmid);
 
 uint32_t amdgpu_mes_rreg(struct amdgpu_device *adev, uint32_t reg);
 int amdgpu_mes_wreg(struct amdgpu_device *adev,
-- 
2.45.2



* [PATCH 02/34] drm/amdgpu/mes11: add API for legacy queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
  2024-07-18 14:07 ` [PATCH 01/34] drm/amdgpu/mes: add API for legacy " Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 03/34] drm/amdgpu/mes12: " Alex Deucher
                   ` (33 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add API for resetting kernel queues.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
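Reviewer-style note, not part of the commit: the packet built by this
patch takes one of two shapes depending on the queue type. The sketch
below mirrors that branch with simplified, hypothetical `sketch_*`
types; the real field layout comes from the MES API headers, which
are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the driver and firmware types. */
enum sketch_queue_type {
	SKETCH_QUEUE_GFX,
	SKETCH_QUEUE_COMPUTE,
};

struct sketch_reset_input {
	enum sketch_queue_type queue_type;
	uint32_t doorbell_offset;
	uint32_t pipe_id;
	uint32_t queue_id;
	uint32_t vmid;
	uint64_t mqd_addr;
	uint64_t wptr_addr;
};

struct sketch_reset_pkt {
	unsigned int reset_queue_only : 1;
	unsigned int reset_legacy_gfx : 1;
	uint32_t doorbell_offset;	/* used on the non-gfx path */
	/* The "_lp" fields are only filled in for gfx queues. */
	uint32_t pipe_id_lp;
	uint32_t queue_id_lp;
	uint32_t vmid_id_lp;
	uint32_t doorbell_offset_lp;
	uint64_t mqd_mc_addr_lp;
	uint64_t wptr_addr_lp;
};

static void sketch_fill_reset_pkt(struct sketch_reset_pkt *pkt,
				  const struct sketch_reset_input *in)
{
	assert(pkt != NULL && in != NULL);

	/* Start from all zeroes so unused fields are well defined. */
	memset(pkt, 0, sizeof(*pkt));

	if (in->queue_type == SKETCH_QUEUE_GFX) {
		/* Gfx queues are described in full via the "_lp" fields. */
		pkt->reset_legacy_gfx = 1;
		pkt->pipe_id_lp = in->pipe_id;
		pkt->queue_id_lp = in->queue_id;
		pkt->mqd_mc_addr_lp = in->mqd_addr;
		pkt->doorbell_offset_lp = in->doorbell_offset;
		pkt->wptr_addr_lp = in->wptr_addr;
		pkt->vmid_id_lp = in->vmid;
	} else {
		/* Other queues are identified by their doorbell alone. */
		pkt->reset_queue_only = 1;
		pkt->doorbell_offset = in->doorbell_offset;
	}
}
```

Zeroing the packet first means the fields of the path not taken stay
zero rather than carrying stale stack data.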
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 33 ++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 27d54ec82208..f611183e1ebf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -587,6 +587,38 @@ static int mes_v11_0_set_hw_resources_1(struct amdgpu_mes *mes)
 			offsetof(union MESAPI_SET_HW_RESOURCES_1, api_status));
 }
 
+static int mes_v11_0_reset_legacy_queue(struct amdgpu_mes *mes,
+					struct mes_reset_legacy_queue_input *input)
+{
+	union MESAPI__RESET mes_reset_queue_pkt;
+
+	memset(&mes_reset_queue_pkt, 0, sizeof(mes_reset_queue_pkt));
+
+	mes_reset_queue_pkt.header.type = MES_API_TYPE_SCHEDULER;
+	mes_reset_queue_pkt.header.opcode = MES_SCH_API_RESET;
+	mes_reset_queue_pkt.header.dwsize = API_FRAME_SIZE_IN_DWORDS;
+
+	mes_reset_queue_pkt.queue_type =
+		convert_to_mes_queue_type(input->queue_type);
+
+	if (mes_reset_queue_pkt.queue_type == MES_QUEUE_TYPE_GFX) {
+		mes_reset_queue_pkt.reset_legacy_gfx = 1;
+		mes_reset_queue_pkt.pipe_id_lp = input->pipe_id;
+		mes_reset_queue_pkt.queue_id_lp = input->queue_id;
+		mes_reset_queue_pkt.mqd_mc_addr_lp = input->mqd_addr;
+		mes_reset_queue_pkt.doorbell_offset_lp = input->doorbell_offset;
+		mes_reset_queue_pkt.wptr_addr_lp = input->wptr_addr;
+		mes_reset_queue_pkt.vmid_id_lp = input->vmid;
+	} else {
+		mes_reset_queue_pkt.reset_queue_only = 1;
+		mes_reset_queue_pkt.doorbell_offset = input->doorbell_offset;
+	}
+
+	return mes_v11_0_submit_pkt_and_poll_completion(mes,
+			&mes_reset_queue_pkt, sizeof(mes_reset_queue_pkt),
+			offsetof(union MESAPI__RESET, api_status));
+}
+
 static const struct amdgpu_mes_funcs mes_v11_0_funcs = {
 	.add_hw_queue = mes_v11_0_add_hw_queue,
 	.remove_hw_queue = mes_v11_0_remove_hw_queue,
@@ -595,6 +627,7 @@ static const struct amdgpu_mes_funcs mes_v11_0_funcs = {
 	.suspend_gang = mes_v11_0_suspend_gang,
 	.resume_gang = mes_v11_0_resume_gang,
 	.misc_op = mes_v11_0_misc_op,
+	.reset_legacy_queue = mes_v11_0_reset_legacy_queue,
 };
 
 static int mes_v11_0_allocate_ucode_buffer(struct amdgpu_device *adev,
-- 
2.45.2



* [PATCH 03/34] drm/amdgpu/mes12: add API for legacy queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
  2024-07-18 14:07 ` [PATCH 01/34] drm/amdgpu/mes: add API for legacy " Alex Deucher
  2024-07-18 14:07 ` [PATCH 02/34] drm/amdgpu/mes11: " Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 04/34] drm/amdgpu/mes: add API for user " Alex Deucher
                   ` (32 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add API for resetting kernel queues.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 33 ++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
index c9f74231ad59..14b8c88fb0e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
@@ -634,6 +634,38 @@ static void mes_v12_0_enable_unmapped_doorbell_handling(
 	WREG32_SOC15(GC, 0, regCP_UNMAPPED_DOORBELL, data);
 }
 
+static int mes_v12_0_reset_legacy_queue(struct amdgpu_mes *mes,
+					struct mes_reset_legacy_queue_input *input)
+{
+	union MESAPI__RESET mes_reset_queue_pkt;
+
+	memset(&mes_reset_queue_pkt, 0, sizeof(mes_reset_queue_pkt));
+
+	mes_reset_queue_pkt.header.type = MES_API_TYPE_SCHEDULER;
+	mes_reset_queue_pkt.header.opcode = MES_SCH_API_RESET;
+	mes_reset_queue_pkt.header.dwsize = API_FRAME_SIZE_IN_DWORDS;
+
+	mes_reset_queue_pkt.queue_type =
+		convert_to_mes_queue_type(input->queue_type);
+
+	if (mes_reset_queue_pkt.queue_type == MES_QUEUE_TYPE_GFX) {
+		mes_reset_queue_pkt.reset_legacy_gfx = 1;
+		mes_reset_queue_pkt.pipe_id_lp = input->pipe_id;
+		mes_reset_queue_pkt.queue_id_lp = input->queue_id;
+		mes_reset_queue_pkt.mqd_mc_addr_lp = input->mqd_addr;
+		mes_reset_queue_pkt.doorbell_offset_lp = input->doorbell_offset;
+		mes_reset_queue_pkt.wptr_addr_lp = input->wptr_addr;
+		mes_reset_queue_pkt.vmid_id_lp = input->vmid;
+	} else {
+		mes_reset_queue_pkt.reset_queue_only = 1;
+		mes_reset_queue_pkt.doorbell_offset = input->doorbell_offset;
+	}
+
+	return mes_v12_0_submit_pkt_and_poll_completion(mes,
+			&mes_reset_queue_pkt, sizeof(mes_reset_queue_pkt),
+			offsetof(union MESAPI__RESET, api_status));
+}
+
 static const struct amdgpu_mes_funcs mes_v12_0_funcs = {
 	.add_hw_queue = mes_v12_0_add_hw_queue,
 	.remove_hw_queue = mes_v12_0_remove_hw_queue,
@@ -642,6 +674,7 @@ static const struct amdgpu_mes_funcs mes_v12_0_funcs = {
 	.suspend_gang = mes_v12_0_suspend_gang,
 	.resume_gang = mes_v12_0_resume_gang,
 	.misc_op = mes_v12_0_misc_op,
+	.reset_legacy_queue = mes_v12_0_reset_legacy_queue,
 };
 
 static int mes_v12_0_allocate_ucode_buffer(struct amdgpu_device *adev,
-- 
2.45.2



* [PATCH 04/34] drm/amdgpu/mes: add API for user queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (2 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 03/34] drm/amdgpu/mes12: " Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 05/34] drm/amdgpu/mes11: " Alex Deucher
                   ` (31 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add API for resetting user queues.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 43 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  9 ++++++
 2 files changed, 52 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 1739aa11cbd2..b3d6a9fa6775 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -774,6 +774,49 @@ int amdgpu_mes_remove_hw_queue(struct amdgpu_device *adev, int queue_id)
 	return 0;
 }
 
+int amdgpu_mes_reset_hw_queue(struct amdgpu_device *adev, int queue_id)
+{
+	unsigned long flags;
+	struct amdgpu_mes_queue *queue;
+	struct amdgpu_mes_gang *gang;
+	struct mes_reset_queue_input queue_input;
+	int r;
+
+	/*
+	 * Avoid taking any other locks under MES lock to avoid circular
+	 * lock dependencies.
+	 */
+	amdgpu_mes_lock(&adev->mes);
+
+	/* look up the queue in the idr */
+	spin_lock_irqsave(&adev->mes.queue_id_lock, flags);
+
+	queue = idr_find(&adev->mes.queue_id_idr, queue_id);
+	if (!queue) {
+		spin_unlock_irqrestore(&adev->mes.queue_id_lock, flags);
+		amdgpu_mes_unlock(&adev->mes);
+		DRM_ERROR("queue id %d doesn't exist\n", queue_id);
+		return -EINVAL;
+	}
+	spin_unlock_irqrestore(&adev->mes.queue_id_lock, flags);
+
+	DRM_DEBUG("try to reset queue, doorbell off = 0x%llx\n",
+		  queue->doorbell_off);
+
+	gang = queue->gang;
+	queue_input.doorbell_offset = queue->doorbell_off;
+	queue_input.gang_context_addr = gang->gang_ctx_gpu_addr;
+
+	r = adev->mes.funcs->reset_hw_queue(&adev->mes, &queue_input);
+	if (r)
+		DRM_ERROR("failed to reset hardware queue, queue id = %d\n",
+			  queue_id);
+
+	amdgpu_mes_unlock(&adev->mes);
+
+	return r;
+}
+
 int amdgpu_mes_map_legacy_queue(struct amdgpu_device *adev,
 				struct amdgpu_ring *ring)
 {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index 4456956c325b..771b63db1846 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -248,6 +248,11 @@ struct mes_remove_queue_input {
 	uint64_t	gang_context_addr;
 };
 
+struct mes_reset_queue_input {
+	uint32_t	doorbell_offset;
+	uint64_t	gang_context_addr;
+};
+
 struct mes_map_legacy_queue_input {
 	uint32_t                           queue_type;
 	uint32_t                           doorbell_offset;
@@ -360,6 +365,9 @@ struct amdgpu_mes_funcs {
 
 	int (*reset_legacy_queue)(struct amdgpu_mes *mes,
 				  struct mes_reset_legacy_queue_input *input);
+
+	int (*reset_hw_queue)(struct amdgpu_mes *mes,
+			      struct mes_reset_queue_input *input);
 };
 
 #define amdgpu_mes_kiq_hw_init(adev) (adev)->mes.kiq_hw_init((adev))
@@ -387,6 +395,7 @@ int amdgpu_mes_add_hw_queue(struct amdgpu_device *adev, int gang_id,
 			    struct amdgpu_mes_queue_properties *qprops,
 			    int *queue_id);
 int amdgpu_mes_remove_hw_queue(struct amdgpu_device *adev, int queue_id);
+int amdgpu_mes_reset_hw_queue(struct amdgpu_device *adev, int queue_id);
 
 int amdgpu_mes_map_legacy_queue(struct amdgpu_device *adev,
 				struct amdgpu_ring *ring);
-- 
2.45.2



* [PATCH 05/34] drm/amdgpu/mes11: add API for user queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (3 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 04/34] drm/amdgpu/mes: add API for user " Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 06/34] drm/amdgpu/mes12: " Alex Deucher
                   ` (30 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add API for resetting user queues.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index f611183e1ebf..bf8fb6a1becb 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -350,6 +350,26 @@ static int mes_v11_0_remove_hw_queue(struct amdgpu_mes *mes,
 			offsetof(union MESAPI__REMOVE_QUEUE, api_status));
 }
 
+static int mes_v11_0_reset_hw_queue(struct amdgpu_mes *mes,
+				    struct mes_reset_queue_input *input)
+{
+	union MESAPI__RESET mes_reset_queue_pkt;
+
+	memset(&mes_reset_queue_pkt, 0, sizeof(mes_reset_queue_pkt));
+
+	mes_reset_queue_pkt.header.type = MES_API_TYPE_SCHEDULER;
+	mes_reset_queue_pkt.header.opcode = MES_SCH_API_RESET;
+	mes_reset_queue_pkt.header.dwsize = API_FRAME_SIZE_IN_DWORDS;
+
+	mes_reset_queue_pkt.doorbell_offset = input->doorbell_offset;
+	mes_reset_queue_pkt.gang_context_addr = input->gang_context_addr;
+	/*mes_reset_queue_pkt.reset_queue_only = 1;*/
+
+	return mes_v11_0_submit_pkt_and_poll_completion(mes,
+			&mes_reset_queue_pkt, sizeof(mes_reset_queue_pkt),
+			offsetof(union MESAPI__RESET, api_status));
+}
+
 static int mes_v11_0_map_legacy_queue(struct amdgpu_mes *mes,
 				      struct mes_map_legacy_queue_input *input)
 {
@@ -628,6 +648,7 @@ static const struct amdgpu_mes_funcs mes_v11_0_funcs = {
 	.resume_gang = mes_v11_0_resume_gang,
 	.misc_op = mes_v11_0_misc_op,
 	.reset_legacy_queue = mes_v11_0_reset_legacy_queue,
+	.reset_hw_queue = mes_v11_0_reset_hw_queue,
 };
 
 static int mes_v11_0_allocate_ucode_buffer(struct amdgpu_device *adev,
-- 
2.45.2



* [PATCH 06/34] drm/amdgpu/mes12: add API for user queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (4 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 05/34] drm/amdgpu/mes11: " Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 07/34] drm/amdgpu: add new ring reset callback Alex Deucher
                   ` (29 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add API for resetting user queues.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
index 14b8c88fb0e0..aea6225df539 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
@@ -334,6 +334,26 @@ static int mes_v12_0_remove_hw_queue(struct amdgpu_mes *mes,
 			offsetof(union MESAPI__REMOVE_QUEUE, api_status));
 }
 
+static int mes_v12_0_reset_hw_queue(struct amdgpu_mes *mes,
+				    struct mes_reset_queue_input *input)
+{
+	union MESAPI__RESET mes_reset_queue_pkt;
+
+	memset(&mes_reset_queue_pkt, 0, sizeof(mes_reset_queue_pkt));
+
+	mes_reset_queue_pkt.header.type = MES_API_TYPE_SCHEDULER;
+	mes_reset_queue_pkt.header.opcode = MES_SCH_API_RESET;
+	mes_reset_queue_pkt.header.dwsize = API_FRAME_SIZE_IN_DWORDS;
+
+	mes_reset_queue_pkt.doorbell_offset = input->doorbell_offset;
+	mes_reset_queue_pkt.gang_context_addr = input->gang_context_addr;
+	/*mes_reset_queue_pkt.reset_queue_only = 1;*/
+
+	return mes_v12_0_submit_pkt_and_poll_completion(mes,
+			&mes_reset_queue_pkt, sizeof(mes_reset_queue_pkt),
+			offsetof(union MESAPI__RESET, api_status));
+}
+
 static int mes_v12_0_map_legacy_queue(struct amdgpu_mes *mes,
 				      struct mes_map_legacy_queue_input *input)
 {
@@ -675,6 +695,7 @@ static const struct amdgpu_mes_funcs mes_v12_0_funcs = {
 	.resume_gang = mes_v12_0_resume_gang,
 	.misc_op = mes_v12_0_misc_op,
 	.reset_legacy_queue = mes_v12_0_reset_legacy_queue,
+	.reset_hw_queue = mes_v12_0_reset_hw_queue,
 };
 
 static int mes_v12_0_allocate_ucode_buffer(struct amdgpu_device *adev,
-- 
2.45.2



* [PATCH 07/34] drm/amdgpu: add new ring reset callback
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (5 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 06/34] drm/amdgpu/mes12: " Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 08/34] drm/amdgpu: add per ring reset support (v2) Alex Deucher
                   ` (28 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Use this to reset just a single ring.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 582053f1cd56..c7f15edeb367 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -235,6 +235,7 @@ struct amdgpu_ring_funcs {
 	void (*patch_cntl)(struct amdgpu_ring *ring, unsigned offset);
 	void (*patch_ce)(struct amdgpu_ring *ring, unsigned offset);
 	void (*patch_de)(struct amdgpu_ring *ring, unsigned offset);
+	int (*reset)(struct amdgpu_ring *ring, unsigned int vmid);
 };
 
 struct amdgpu_ring {
@@ -334,6 +335,7 @@ struct amdgpu_ring {
 #define amdgpu_ring_patch_cntl(r, o) ((r)->funcs->patch_cntl((r), (o)))
 #define amdgpu_ring_patch_ce(r, o) ((r)->funcs->patch_ce((r), (o)))
 #define amdgpu_ring_patch_de(r, o) ((r)->funcs->patch_de((r), (o)))
+#define amdgpu_ring_reset(r, v) ((r)->funcs->reset((r), (v)))
 
 unsigned int amdgpu_ring_max_ibs(enum amdgpu_ring_type type);
 int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned ndw);
-- 
2.45.2



* [PATCH 08/34] drm/amdgpu: add per ring reset support (v2)
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (6 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 07/34] drm/amdgpu: add new ring reset callback Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 09/34] drm/amdgpu: increase the reset counter for the queue reset Alex Deucher
                   ` (27 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

If a specific job is hung, try to reset just the
ring associated with the job.

v2: move to amdgpu_job.c

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e238f2832f65..5d4883df64d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -72,6 +72,23 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 
 	dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
 
+	/* attempt a per ring reset */
+	if (amdgpu_gpu_recovery &&
+	    ring->funcs->reset) {
+		if (amdgpu_ring_sched_ready(ring))
+			drm_sched_stop(&ring->sched, s_job);
+		r = amdgpu_ring_reset(ring, job->vmid);
+		if (!r) {
+			/* XXX: these are required for subsequent jobs to work */
+			amdgpu_fence_driver_clear_job_fences(ring);
+			amdgpu_fence_driver_force_completion(ring);
+			drm_sched_increase_karma(s_job);
+			if (amdgpu_ring_sched_ready(ring))
+				drm_sched_start(&ring->sched, true);
+			goto exit;
+		}
+	}
+
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
 		struct amdgpu_reset_context reset_context;
 		memset(&reset_context, 0, sizeof(reset_context));
-- 
2.45.2



* [PATCH 09/34] drm/amdgpu: increase the reset counter for the queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (7 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 08/34] drm/amdgpu: add per ring reset support (v2) Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 10/34] drm/amdgpu/gfx11: add ring reset callbacks Alex Deucher
                   ` (26 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Prike Liang, Alex Deucher

From: Prike Liang <Prike.Liang@amd.com>

Increment the reset counter after a successful queue reset, for use
by amdgpu_cs_query_reset_state().

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
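Reviewer-style note, not part of the commit: one way a reset counter
can let a client notice a reset is for each context to snapshot the
counter and compare it later. The sketch below assumes exactly that
scheme; whether the driver and amdgpu_cs_query_reset_state() work this
way is not shown in this patch, and all `sketch_*` names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device holding a monotonically increasing reset counter. */
struct sketch_device {
	atomic_uint reset_counter;
};

/* Hypothetical client context remembering the last counter value it saw. */
struct sketch_context {
	unsigned int reset_counter_seen;
};

static void sketch_device_init(struct sketch_device *dev)
{
	assert(dev != NULL);
	atomic_init(&dev->reset_counter, 0);
}

static void sketch_context_init(struct sketch_context *ctx,
				struct sketch_device *dev)
{
	ctx->reset_counter_seen = atomic_load(&dev->reset_counter);
}

/* Called once per successful reset, whether per-queue or full adapter. */
static void sketch_note_reset(struct sketch_device *dev)
{
	atomic_fetch_add(&dev->reset_counter, 1);
}

/* Returns true if a reset happened since the context last checked. */
static bool sketch_context_saw_reset(struct sketch_context *ctx,
				     struct sketch_device *dev)
{
	unsigned int now = atomic_load(&dev->reset_counter);
	bool changed = now != ctx->reset_counter_seen;

	ctx->reset_counter_seen = now;
	return changed;
}
```

Under this scheme a queue reset that did not bump the counter would be
invisible to the client, which is the gap the one-line change closes.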
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 5d4883df64d8..7107c4d3a3b6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -80,6 +80,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		r = amdgpu_ring_reset(ring, job->vmid);
 		if (!r) {
 			/* XXX: these are required for subsequent jobs to work */
+			atomic_inc(&ring->adev->gpu_reset_counter);
 			amdgpu_fence_driver_clear_job_fences(ring);
 			amdgpu_fence_driver_force_completion(ring);
 			drm_sched_increase_karma(s_job);
-- 
2.45.2



* [PATCH 10/34] drm/amdgpu/gfx11: add ring reset callbacks
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (8 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 09/34] drm/amdgpu: increase the reset counter for the queue reset Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 11/34] drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2) Alex Deucher
                   ` (25 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add ring reset callbacks for gfx and compute.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index ce5cb60b8628..56606c6dbb15 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -6520,6 +6520,22 @@ static void gfx_v11_0_emit_mem_sync(struct amdgpu_ring *ring)
 	amdgpu_ring_write(ring, gcr_cntl); /* GCR_CNTL */
 }
 
+static int gfx_v11_0_reset_ring(struct amdgpu_ring *ring, unsigned int vmid)
+{
+	int r;
+
+	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid);
+	if (r)
+		return r;
+
+	/* reset the ring */
+	ring->wptr = 0;
+	*ring->wptr_cpu_addr = 0;
+	amdgpu_ring_clear_ring(ring);
+
+	return amdgpu_ring_test_ring(ring);
+}
+
 static void gfx_v11_ip_print(void *handle, struct drm_printer *p)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -6721,6 +6737,7 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_gfx = {
 	.emit_reg_write_reg_wait = gfx_v11_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v11_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v11_0_emit_mem_sync,
+	.reset = gfx_v11_0_reset_ring,
 };
 
 static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_compute = {
@@ -6758,6 +6775,7 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_compute = {
 	.emit_reg_write_reg_wait = gfx_v11_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v11_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v11_0_emit_mem_sync,
+	.reset = gfx_v11_0_reset_ring,
 };
 
 static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_kiq = {
-- 
2.45.2



* [PATCH 11/34] drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (9 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 10/34] drm/amdgpu/gfx11: add ring reset callbacks Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 12/34] drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue() Alex Deucher
                   ` (24 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Prike Liang, Alex Deucher

From: Prike Liang <Prike.Liang@amd.com>

Resetting a kernel compute queue (KCQ) through the MES FW always
fails, possibly because the KIQ fails to process the KCQ unmap.
Until the MES FW works properly, fall back to having the driver
dequeue the queue and reset the SPI directly. Also rework the ring
reset function so that each ring type is reset in its own function.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
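Reviewer-style note, not part of the commit: the new `reset` argument
forces the init helpers away from the first-time setup path even when
no adapter-wide reset or suspend is in progress. The sketch below
assumes the other path restores a descriptor saved at first init; that
branch is not visible in the hunks here, so treat it as an assumption.
All `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MQD_WORDS 4

/* Hypothetical queue with a live descriptor and a saved copy of it. */
struct sketch_queue {
	unsigned int mqd[SKETCH_MQD_WORDS];	/* live queue descriptor */
	unsigned int backup[SKETCH_MQD_WORDS];	/* copy saved at first init */
	bool in_reset;				/* adapter-wide reset in progress */
	bool in_suspend;			/* device is suspended */
};

static void sketch_init_queue(struct sketch_queue *q, bool reset)
{
	size_t i;

	assert(q != NULL);

	if (!reset && !q->in_reset && !q->in_suspend) {
		/* First bring-up: build the descriptor and save a copy. */
		for (i = 0; i < SKETCH_MQD_WORDS; i++)
			q->mqd[i] = (unsigned int)i + 1;
		memcpy(q->backup, q->mqd, sizeof(q->backup));
	} else {
		/* Reset or resume: restore the saved descriptor instead. */
		memcpy(q->mqd, q->backup, sizeof(q->mqd));
	}
}
```

Passing `true` after a per-queue reset therefore discards whatever
state the hung queue left in the descriptor and reuses the known-good
copy.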
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 84 ++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 56606c6dbb15..22073a839922 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -3966,13 +3966,13 @@ static int gfx_v11_0_gfx_mqd_init(struct amdgpu_device *adev, void *m,
 	return 0;
 }
 
-static int gfx_v11_0_gfx_init_queue(struct amdgpu_ring *ring)
+static int gfx_v11_0_gfx_init_queue(struct amdgpu_ring *ring, bool reset)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v11_gfx_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.gfx_ring[0];
 
-	if (!amdgpu_in_reset(adev) && !adev->in_suspend) {
+	if (!reset && !amdgpu_in_reset(adev) && !adev->in_suspend) {
 		memset((void *)mqd, 0, sizeof(*mqd));
 		mutex_lock(&adev->srbm_mutex);
 		soc21_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
@@ -4008,7 +4008,7 @@ static int gfx_v11_0_cp_async_gfx_ring_resume(struct amdgpu_device *adev)
 
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v11_0_gfx_init_queue(ring);
+			r = gfx_v11_0_gfx_init_queue(ring, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -4303,13 +4303,13 @@ static int gfx_v11_0_kiq_init_queue(struct amdgpu_ring *ring)
 	return 0;
 }
 
-static int gfx_v11_0_kcq_init_queue(struct amdgpu_ring *ring)
+static int gfx_v11_0_kcq_init_queue(struct amdgpu_ring *ring, bool reset)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v11_compute_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.compute_ring[0];
 
-	if (!amdgpu_in_reset(adev) && !adev->in_suspend) {
+	if (!reset && !amdgpu_in_reset(adev) && !adev->in_suspend) {
 		memset((void *)mqd, 0, sizeof(*mqd));
 		mutex_lock(&adev->srbm_mutex);
 		soc21_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
@@ -4373,7 +4373,7 @@ static int gfx_v11_0_kcq_resume(struct amdgpu_device *adev)
 			goto done;
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v11_0_kcq_init_queue(ring);
+			r = gfx_v11_0_kcq_init_queue(ring, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -6520,18 +6520,76 @@ static void gfx_v11_0_emit_mem_sync(struct amdgpu_ring *ring)
 	amdgpu_ring_write(ring, gcr_cntl); /* GCR_CNTL */
 }
 
-static int gfx_v11_0_reset_ring(struct amdgpu_ring *ring, unsigned int vmid)
+static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 {
+	struct amdgpu_device *adev = ring->adev;
 	int r;
 
 	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid);
 	if (r)
 		return r;
 
-	/* reset the ring */
-	ring->wptr = 0;
-	*ring->wptr_cpu_addr = 0;
-	amdgpu_ring_clear_ring(ring);
+	r = amdgpu_bo_reserve(ring->mqd_obj, false);
+	if (unlikely(r != 0)){
+		dev_err(adev->dev, "fail to resv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
+	if (!r) {
+		r = gfx_v11_0_gfx_init_queue(ring, true);
+		amdgpu_bo_kunmap(ring->mqd_obj);
+		ring->mqd_ptr = NULL;
+	}
+	amdgpu_bo_unreserve(ring->mqd_obj);
+	if (r){
+		dev_err(adev->dev, "fail to unresv mqd_obj\n");
+		return r;
+	}
+
+	r = amdgpu_mes_map_legacy_queue(adev, ring);
+	if (r) {
+		dev_err(adev->dev, "failed to remap kgq\n");
+		return r;
+	}
+
+	return amdgpu_ring_test_ring(ring);
+}
+
+static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
+{
+	struct amdgpu_device *adev = ring->adev;
+	int r;
+
+	gfx_v11_0_set_safe_mode(adev, 0);
+	mutex_lock(&adev->srbm_mutex);
+	soc21_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
+	WREG32_SOC15(GC, 0, regCP_HQD_DEQUEUE_REQUEST, 0x2);
+	WREG32_SOC15(GC, 0, regSPI_COMPUTE_QUEUE_RESET, 0x1);
+	soc21_grbm_select(adev, 0, 0, 0, 0);
+	mutex_unlock(&adev->srbm_mutex);
+	gfx_v11_0_unset_safe_mode(adev, 0);
+
+	r = amdgpu_bo_reserve(ring->mqd_obj, false);
+	if (unlikely(r != 0)){
+		dev_err(adev->dev, "fail to resv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
+	if (!r) {
+		r = gfx_v11_0_kcq_init_queue(ring, true);
+		amdgpu_bo_kunmap(ring->mqd_obj);
+		ring->mqd_ptr = NULL;
+	}
+	amdgpu_bo_unreserve(ring->mqd_obj);
+	if (r) {
+		dev_err(adev->dev, "fail to unresv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_mes_map_legacy_queue(adev, ring);
+	if (r) {
+		dev_err(adev->dev, "failed to remap kcq\n");
+		return r;
+	}
 
 	return amdgpu_ring_test_ring(ring);
 }
@@ -6737,7 +6795,7 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_gfx = {
 	.emit_reg_write_reg_wait = gfx_v11_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v11_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v11_0_emit_mem_sync,
-	.reset = gfx_v11_0_reset_ring,
+	.reset = gfx_v11_0_reset_kgq,
 };
 
 static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_compute = {
@@ -6775,7 +6833,7 @@ static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_compute = {
 	.emit_reg_write_reg_wait = gfx_v11_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v11_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v11_0_emit_mem_sync,
-	.reset = gfx_v11_0_reset_ring,
+	.reset = gfx_v11_0_reset_kcq,
 };
 
 static const struct amdgpu_ring_funcs gfx_v11_0_ring_funcs_kiq = {
-- 
2.45.2
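
Note on the new "reset" argument: passing reset=true makes the
init_queue helpers skip the first-time initialization branch even
when no adapter-wide reset or suspend is in progress. The body of
the other branch is not part of the hunks above, so the following
user-space sketch only assumes that it restores the MQD from a
saved copy and rewinds the ring. All names (toy_mqd, toy_ring,
toy_init_queue) are hypothetical and not driver API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the MQD and the ring state. */
struct toy_mqd {
	unsigned int regs[4];
};

struct toy_ring {
	struct toy_mqd mqd;	/* live descriptor, may be clobbered by a hang */
	struct toy_mqd backup;	/* copy saved at first-time init */
	unsigned int wptr;
};

/*
 * reset == true forces the restore path even when no adapter-wide
 * reset is in progress, mirroring the "!reset && ..." condition.
 */
static void toy_init_queue(struct toy_ring *ring, bool reset,
			   bool in_adapter_reset)
{
	if (!reset && !in_adapter_reset) {
		/* first-time init: build the descriptor, then save it */
		memset(&ring->mqd, 0, sizeof(ring->mqd));
		ring->mqd.regs[0] = 0x1;	/* pretend programmed value */
		ring->backup = ring->mqd;
	} else {
		/* assumed restore path: reuse the saved copy, rewind the ring */
		ring->mqd = ring->backup;
		ring->wptr = 0;
	}
}
```

With this split, a per-queue reset can reuse the descriptor that
was built at driver load instead of rebuilding it from scratch.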



* [PATCH 12/34] drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (10 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 11/34] drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2) Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 13/34] drm/amdgpu/gfx11: wait for reset done before remap Alex Deucher
                   ` (23 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Rename to gfx_v11_0_kgq_init_queue() to better align with
the other naming in the file.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 22073a839922..9be58725c251 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -3966,7 +3966,7 @@ static int gfx_v11_0_gfx_mqd_init(struct amdgpu_device *adev, void *m,
 	return 0;
 }
 
-static int gfx_v11_0_gfx_init_queue(struct amdgpu_ring *ring, bool reset)
+static int gfx_v11_0_kgq_init_queue(struct amdgpu_ring *ring, bool reset)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v11_gfx_mqd *mqd = ring->mqd_ptr;
@@ -4008,7 +4008,7 @@ static int gfx_v11_0_cp_async_gfx_ring_resume(struct amdgpu_device *adev)
 
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v11_0_gfx_init_queue(ring, false);
+			r = gfx_v11_0_kgq_init_queue(ring, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -6536,7 +6536,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	}
 	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 	if (!r) {
-		r = gfx_v11_0_gfx_init_queue(ring, true);
+		r = gfx_v11_0_kgq_init_queue(ring, true);
 		amdgpu_bo_kunmap(ring->mqd_obj);
 		ring->mqd_ptr = NULL;
 	}
-- 
2.45.2



* [PATCH 13/34] drm/amdgpu/gfx11: wait for reset done before remap
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (11 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 12/34] drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue() Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 14/34] drm/amdgpu/gfx10: add ring reset callbacks Alex Deucher
                   ` (22 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

There is a race condition in which the CP firmware modifies
the MQD during the reset sequence after the driver has updated
it for remapping. We have to wait until CP_HQD_ACTIVE becomes
false before remapping the queue.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 9be58725c251..1b2de8e81ccd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -6558,16 +6558,29 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 static int gfx_v11_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
 {
 	struct amdgpu_device *adev = ring->adev;
-	int r;
+	int i, r = 0;
 
 	gfx_v11_0_set_safe_mode(adev, 0);
 	mutex_lock(&adev->srbm_mutex);
 	soc21_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
 	WREG32_SOC15(GC, 0, regCP_HQD_DEQUEUE_REQUEST, 0x2);
 	WREG32_SOC15(GC, 0, regSPI_COMPUTE_QUEUE_RESET, 0x1);
+
+	/* make sure dequeue is complete*/
+	for (i = 0; i < adev->usec_timeout; i++) {
+		if (!(RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) & 1))
+			break;
+		udelay(1);
+	}
+	if (i >= adev->usec_timeout)
+		r = -ETIMEDOUT;
 	soc21_grbm_select(adev, 0, 0, 0, 0);
 	mutex_unlock(&adev->srbm_mutex);
 	gfx_v11_0_unset_safe_mode(adev, 0);
+	if (r) {
+		dev_err(adev->dev, "fail to wait on hqd deactive\n");
+		return r;
+	}
 
 	r = amdgpu_bo_reserve(ring->mqd_obj, false);
 	if (unlikely(r != 0)){
-- 
2.45.2
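
The wait added above is a bounded poll: read the status register
until bit 0 clears, and give up with -ETIMEDOUT once the budget is
exhausted. A minimal user-space sketch of that pattern follows.
The callback type and the fake register are hypothetical stand-ins
for RREG32_SOC15(), and the udelay(1) between reads is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a register read such as RREG32_SOC15(). */
typedef uint32_t (*read_reg_fn)(void *ctx);

/*
 * Poll bit 0 of a status register until it clears or the budget
 * runs out. Returns 0 on success, -ETIMEDOUT if the bit is still
 * set after max_polls reads.
 */
static int wait_for_inactive(read_reg_fn read_reg, void *ctx, int max_polls)
{
	int i;

	assert(max_polls > 0);
	for (i = 0; i < max_polls; i++) {
		if (!(read_reg(ctx) & 1))
			return 0;
		/* the driver sleeps 1 us here; omitted in this sketch */
	}
	return -ETIMEDOUT;
}

/* Fake register that reports "active" for a fixed number of reads. */
struct fake_hqd {
	int reads_until_idle;
};

static uint32_t fake_hqd_read(void *ctx)
{
	struct fake_hqd *hqd = ctx;

	if (hqd->reads_until_idle > 0) {
		hqd->reads_until_idle--;
		return 1;
	}
	return 0;
}
```

In the patch the error is recorded in r and returned only after
the GRBM selection, srbm_mutex and safe mode have been released,
so the timeout path does not leak any of them.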



* [PATCH 14/34] drm/amdgpu/gfx10: add ring reset callbacks
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (12 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 13/34] drm/amdgpu/gfx11: wait for reset done before remap Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 15/34] drm/amdgpu/gfx10: remap queue after reset successfully Alex Deucher
                   ` (21 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add ring reset callbacks for gfx and compute.

v2: fix gfx handling
v3: wait for KIQ to complete

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 91 ++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 32c0cc52861c..3d0446337751 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -9454,6 +9454,95 @@ static void gfx_v10_0_emit_mem_sync(struct amdgpu_ring *ring)
 	amdgpu_ring_write(ring, gcr_cntl); /* GCR_CNTL */
 }
 
+static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
+{
+	struct amdgpu_device *adev = ring->adev;
+	struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
+	struct amdgpu_ring *kiq_ring = &kiq->ring;
+	unsigned long flags;
+	u32 tmp;
+	u64 addr;
+	int r;
+
+	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
+		return -EINVAL;
+
+	spin_lock_irqsave(&kiq->ring_lock, flags);
+
+	if (amdgpu_ring_alloc(kiq_ring, 5 + 7 + 7 + kiq->pmf->map_queues_size)) {
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
+	}
+
+	addr = amdgpu_bo_gpu_offset(ring->mqd_obj) +
+		offsetof(struct v10_gfx_mqd, cp_gfx_hqd_active);
+	tmp = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
+	if (ring->pipe == 0)
+		tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE0_QUEUES, 1 << ring->queue);
+	else
+		tmp = REG_SET_FIELD(tmp, CP_VMID_RESET, PIPE1_QUEUES, 1 << ring->queue);
+
+	gfx_v10_0_ring_emit_wreg(kiq_ring,
+				 SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), tmp);
+	gfx_v10_0_wait_reg_mem(kiq_ring, 0, 1, 0,
+			       lower_32_bits(addr), upper_32_bits(addr),
+			       0, 1, 0x20);
+	gfx_v10_0_ring_emit_reg_wait(kiq_ring,
+				     SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffffffff);
+	kiq->pmf->kiq_map_queues(kiq_ring, ring);
+	amdgpu_ring_commit(kiq_ring);
+
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+
+	r = amdgpu_ring_test_ring(kiq_ring);
+	if (r)
+		return r;
+
+	/* reset the ring */
+	ring->wptr = 0;
+	*ring->wptr_cpu_addr = 0;
+	amdgpu_ring_clear_ring(ring);
+
+	return amdgpu_ring_test_ring(ring);
+}
+
+static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
+			       unsigned int vmid)
+{
+	struct amdgpu_device *adev = ring->adev;
+	struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
+	struct amdgpu_ring *kiq_ring = &kiq->ring;
+	unsigned long flags;
+	int r;
+
+	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
+		return -EINVAL;
+
+	spin_lock_irqsave(&kiq->ring_lock, flags);
+
+	if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->unmap_queues_size)) {
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
+	}
+
+	kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
+				   0, 0);
+	amdgpu_ring_commit(kiq_ring);
+
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+
+	r = amdgpu_ring_test_ring(kiq_ring);
+	if (r)
+		return r;
+
+	/* reset the ring */
+	ring->wptr = 0;
+	*ring->wptr_cpu_addr = 0;
+	amdgpu_ring_clear_ring(ring);
+
+	return amdgpu_ring_test_ring(ring);
+}
+
 static void gfx_v10_ip_print(void *handle, struct drm_printer *p)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -9657,6 +9746,7 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
 	.emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v10_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v10_0_emit_mem_sync,
+	.reset = gfx_v10_0_reset_kgq,
 };
 
 static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_compute = {
@@ -9693,6 +9783,7 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_compute = {
 	.emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v10_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v10_0_emit_mem_sync,
+	.reset = gfx_v10_0_reset_kcq,
 };
 
 static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_kiq = {
-- 
2.45.2
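
The KGQ path above builds the CP_VMID_RESET value from a per-VMID
request bit plus a per-pipe queue bit. The sketch below shows the
same read-modify-write field update in plain C. The field
positions are invented for illustration (the HYP_ prefix marks
them as hypothetical); the real layout comes from the GC register
headers, and set_field() is only an approximation of what
REG_SET_FIELD() does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout, NOT the real CP_VMID_RESET layout. */
#define HYP_RESET_REQUEST_SHIFT	0
#define HYP_RESET_REQUEST_MASK	0x0000ffffu
#define HYP_PIPE0_QUEUES_SHIFT	16
#define HYP_PIPE0_QUEUES_MASK	0x00ff0000u
#define HYP_PIPE1_QUEUES_SHIFT	24
#define HYP_PIPE1_QUEUES_MASK	0xff000000u

/* Clear the field in reg, then insert val shifted into place. */
static uint32_t set_field(uint32_t reg, uint32_t mask, unsigned int shift,
			  uint32_t val)
{
	return (reg & ~mask) | ((val << shift) & mask);
}

/* One request bit per VMID, one queue bit in the field of the given pipe. */
static uint32_t build_vmid_reset(unsigned int vmid, unsigned int pipe,
				 unsigned int queue)
{
	uint32_t tmp;

	assert(vmid < 16 && queue < 8);
	tmp = set_field(0, HYP_RESET_REQUEST_MASK, HYP_RESET_REQUEST_SHIFT,
			1u << vmid);
	if (pipe == 0)
		tmp = set_field(tmp, HYP_PIPE0_QUEUES_MASK,
				HYP_PIPE0_QUEUES_SHIFT, 1u << queue);
	else
		tmp = set_field(tmp, HYP_PIPE1_QUEUES_MASK,
				HYP_PIPE1_QUEUES_SHIFT, 1u << queue);
	return tmp;
}
```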



* [PATCH 15/34] drm/amdgpu/gfx10: remap queue after reset successfully
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (13 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 14/34] drm/amdgpu/gfx10: add ring reset callbacks Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 16/34] drm/amdgpu/gfx10: wait for reset done before remap Alex Deucher
                   ` (20 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

The KIQ unmap_queues command only dequeues the queue.
We have to map the queue back with a clean MQD.

v2: fix up error handling (Alex)

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 46 ++++++++++++++++++++------
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 3d0446337751..5cc0d22d1e2f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -7086,13 +7086,13 @@ static int gfx_v10_0_kiq_init_queue(struct amdgpu_ring *ring)
 	return 0;
 }
 
-static int gfx_v10_0_kcq_init_queue(struct amdgpu_ring *ring)
+static int gfx_v10_0_kcq_init_queue(struct amdgpu_ring *ring, bool restore)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v10_compute_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.compute_ring[0];
 
-	if (!amdgpu_in_reset(adev) && !adev->in_suspend) {
+	if (!restore && !amdgpu_in_reset(adev) && !adev->in_suspend) {
 		memset((void *)mqd, 0, sizeof(*mqd));
 		mutex_lock(&adev->srbm_mutex);
 		nv_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
@@ -7154,7 +7154,7 @@ static int gfx_v10_0_kcq_resume(struct amdgpu_device *adev)
 			goto done;
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v10_0_kcq_init_queue(ring);
+			r = gfx_v10_0_kcq_init_queue(ring, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -9521,25 +9521,49 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
 	spin_lock_irqsave(&kiq->ring_lock, flags);
 
 	if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->unmap_queues_size)) {
-		spin_unlock_irqrestore(&kiq->ring_lock, flags);
-		return -ENOMEM;
+		r = -ENOMEM;
+		goto out;
 	}
 
 	kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
 				   0, 0);
 	amdgpu_ring_commit(kiq_ring);
 
-	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+	r = amdgpu_ring_test_ring(kiq_ring);
+	if (r)
+		goto out;
+
+	r = amdgpu_bo_reserve(ring->mqd_obj, false);
+	if (unlikely(r != 0)){
+		dev_err(adev->dev, "fail to resv mqd_obj\n");
+		goto out;
+	}
+	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
+	if (!r) {
+		r = gfx_v10_0_kcq_init_queue(ring, true);
+		amdgpu_bo_kunmap(ring->mqd_obj);
+		ring->mqd_ptr = NULL;
+	}
+	amdgpu_bo_unreserve(ring->mqd_obj);
+	if (r){
+		dev_err(adev->dev, "fail to unresv mqd_obj\n");
+		goto out;
+	}
+
+	if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->map_queues_size)) {
+		r = -ENOMEM;
+		goto out;
+	}
+	kiq->pmf->kiq_map_queues(kiq_ring, ring);
+	amdgpu_ring_commit(kiq_ring);
 
 	r = amdgpu_ring_test_ring(kiq_ring);
+
+out:
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
 	if (r)
 		return r;
 
-	/* reset the ring */
-	ring->wptr = 0;
-	*ring->wptr_cpu_addr = 0;
-	amdgpu_ring_clear_ring(ring);
-
 	return amdgpu_ring_test_ring(ring);
 }
 
-- 
2.45.2



* [PATCH 16/34] drm/amdgpu/gfx10: wait for reset done before remap
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (14 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 15/34] drm/amdgpu/gfx10: remap queue after reset successfully Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 17/34] drm/amdgpu/gfx10: rework reset sequence Alex Deucher
                   ` (19 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

There is a race condition in which the CP firmware modifies
the MQD during the reset sequence after the driver has updated
it for remapping. We have to wait until CP_HQD_ACTIVE becomes
false before remapping the queue.

v2: fix KIQ locking (Alex)

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 41 +++++++++++++++++++-------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 5cc0d22d1e2f..e9d93bf909db 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -9513,7 +9513,7 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
 	struct amdgpu_ring *kiq_ring = &kiq->ring;
 	unsigned long flags;
-	int r;
+	int i, r;
 
 	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
 		return -EINVAL;
@@ -9521,8 +9521,8 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
 	spin_lock_irqsave(&kiq->ring_lock, flags);
 
 	if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->unmap_queues_size)) {
-		r = -ENOMEM;
-		goto out;
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
 	}
 
 	kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
@@ -9530,13 +9530,33 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
 	amdgpu_ring_commit(kiq_ring);
 
 	r = amdgpu_ring_test_ring(kiq_ring);
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
 	if (r)
-		goto out;
+		return r;
+
+	/* make sure dequeue is complete*/
+	gfx_v10_0_set_safe_mode(adev, 0);
+	mutex_lock(&adev->srbm_mutex);
+	nv_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
+	for (i = 0; i < adev->usec_timeout; i++) {
+		if (!(RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE) & 1))
+			break;
+		udelay(1);
+	}
+	if (i >= adev->usec_timeout)
+		r = -ETIMEDOUT;
+	nv_grbm_select(adev, 0, 0, 0, 0);
+	mutex_unlock(&adev->srbm_mutex);
+	gfx_v10_0_unset_safe_mode(adev, 0);
+	if (r) {
+		dev_err(adev->dev, "fail to wait on hqd deactive\n");
+		return r;
+	}
 
 	r = amdgpu_bo_reserve(ring->mqd_obj, false);
 	if (unlikely(r != 0)){
 		dev_err(adev->dev, "fail to resv mqd_obj\n");
-		goto out;
+		return r;
 	}
 	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 	if (!r) {
@@ -9545,21 +9565,20 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
 		ring->mqd_ptr = NULL;
 	}
 	amdgpu_bo_unreserve(ring->mqd_obj);
-	if (r){
+	if (r) {
 		dev_err(adev->dev, "fail to unresv mqd_obj\n");
-		goto out;
+		return r;
 	}
 
+	spin_lock_irqsave(&kiq->ring_lock, flags);
 	if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->map_queues_size)) {
-		r = -ENOMEM;
-		goto out;
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
 	}
 	kiq->pmf->kiq_map_queues(kiq_ring, ring);
 	amdgpu_ring_commit(kiq_ring);
 
 	r = amdgpu_ring_test_ring(kiq_ring);
-
-out:
 	spin_unlock_irqrestore(&kiq->ring_lock, flags);
 	if (r)
 		return r;
-- 
2.45.2
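
The locking change above narrows the KIQ ring lock to the two
ring submissions and drops it in between. A likely reason is that
the steps in the middle (mutex_lock() and the BO reservation) may
sleep, which is not allowed while a spinlock is held with
interrupts disabled. The toy model below checks that invariant
with assertions. Every name in it (toy_lock, toy_dev, submit,
may_sleep, reset_queue) is hypothetical and only mirrors the shape
of the reworked function.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that only tracks whether it is held. */
struct toy_lock {
	bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct toy_dev {
	struct toy_lock ring_lock;
	int unmap_err;	/* injected result of the unmap submission */
	int sleepy_err;	/* injected result of the sleeping middle step */
	int map_err;	/* injected result of the map submission */
	int steps;	/* number of steps that actually ran */
};

/* Ring submission: must run with the lock held. */
static int submit(struct toy_dev *d, int err)
{
	assert(d->ring_lock.held);
	d->steps++;
	return err;
}

/* Step that may sleep: must run with the lock released. */
static int may_sleep(struct toy_dev *d)
{
	assert(!d->ring_lock.held);
	d->steps++;
	return d->sleepy_err;
}

static int reset_queue(struct toy_dev *d)
{
	int r;

	toy_lock_acquire(&d->ring_lock);
	r = submit(d, d->unmap_err);		/* unmap with reset */
	toy_lock_release(&d->ring_lock);
	if (r)
		return r;

	r = may_sleep(d);			/* wait for idle, re-init MQD */
	if (r)
		return r;

	toy_lock_acquire(&d->ring_lock);
	r = submit(d, d->map_err);		/* map the queue back */
	toy_lock_release(&d->ring_lock);
	return r;
}
```

Because every early return happens after the matching release,
the single "out:" label from the previous revision is no longer
needed.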



* [PATCH 17/34] drm/amdgpu/gfx10: rework reset sequence
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (15 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 16/34] drm/amdgpu/gfx10: wait for reset done before remap Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 18/34] drm/amdgpu/gfx9: add ring reset callback Alex Deucher
                   ` (18 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Re-initialize the MQD in the KGQ reset path instead of clearing
the ring directly, to match the other GFX IPs.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index e9d93bf909db..b833943faa53 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6748,13 +6748,13 @@ static int gfx_v10_0_gfx_mqd_init(struct amdgpu_device *adev, void *m,
 	return 0;
 }
 
-static int gfx_v10_0_gfx_init_queue(struct amdgpu_ring *ring)
+static int gfx_v10_0_kgq_init_queue(struct amdgpu_ring *ring, bool reset)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v10_gfx_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.gfx_ring[0];
 
-	if (!amdgpu_in_reset(adev) && !adev->in_suspend) {
+	if (!reset && !amdgpu_in_reset(adev) && !adev->in_suspend) {
 		memset((void *)mqd, 0, sizeof(*mqd));
 		mutex_lock(&adev->srbm_mutex);
 		nv_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
@@ -6806,7 +6806,7 @@ static int gfx_v10_0_cp_async_gfx_ring_resume(struct amdgpu_device *adev)
 
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v10_0_gfx_init_queue(ring);
+			r = gfx_v10_0_kgq_init_queue(ring, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -9498,10 +9498,22 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	if (r)
 		return r;
 
-	/* reset the ring */
-	ring->wptr = 0;
-	*ring->wptr_cpu_addr = 0;
-	amdgpu_ring_clear_ring(ring);
+	r = amdgpu_bo_reserve(ring->mqd_obj, false);
+	if (unlikely(r != 0)){
+		DRM_ERROR("fail to resv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
+	if (!r) {
+		r = gfx_v10_0_kgq_init_queue(ring, true);
+		amdgpu_bo_kunmap(ring->mqd_obj);
+		ring->mqd_ptr = NULL;
+	}
+	amdgpu_bo_unreserve(ring->mqd_obj);
+	if (r){
+		DRM_ERROR("fail to unresv mqd_obj\n");
+		return r;
+	}
 
 	return amdgpu_ring_test_ring(ring);
 }
-- 
2.45.2



* [PATCH 18/34] drm/amdgpu/gfx9: add ring reset callback
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (16 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 17/34] drm/amdgpu/gfx10: rework reset sequence Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 19/34] drm/amdgpu/gfx9: remap queue after reset successfully Alex Deucher
                   ` (17 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add ring reset callback for compute.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 38 +++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 675a1a8e2515..78495df29b5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -7100,6 +7100,43 @@ static void gfx_v9_0_emit_wave_limit(struct amdgpu_ring *ring, bool enable)
 	}
 }
 
+static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
+			      unsigned int vmid)
+{
+	struct amdgpu_device *adev = ring->adev;
+	struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
+	struct amdgpu_ring *kiq_ring = &kiq->ring;
+	unsigned long flags;
+	int r;
+
+	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
+		return -EINVAL;
+
+	spin_lock_irqsave(&kiq->ring_lock, flags);
+
+	if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->unmap_queues_size)) {
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
+	}
+
+	kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
+				   0, 0);
+	amdgpu_ring_commit(kiq_ring);
+
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+
+	r = amdgpu_ring_test_ring(kiq_ring);
+	if (r)
+		return r;
+
+	/* reset the ring */
+	ring->wptr = 0;
+	*ring->wptr_cpu_addr = 0;
+	amdgpu_ring_clear_ring(ring);
+
+	return amdgpu_ring_test_ring(ring);
+}
+
 static void gfx_v9_ip_print(void *handle, struct drm_printer *p)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
@@ -7346,6 +7383,7 @@ static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_compute = {
 	.soft_recovery = gfx_v9_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_0_emit_mem_sync,
 	.emit_wave_limit = gfx_v9_0_emit_wave_limit,
+	.reset = gfx_v9_0_reset_kcq,
 };
 
 static const struct amdgpu_ring_funcs gfx_v9_0_ring_funcs_kiq = {
-- 
2.45.2



* [PATCH 19/34] drm/amdgpu/gfx9: remap queue after reset successfully
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (17 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 18/34] drm/amdgpu/gfx9: add ring reset callback Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 20/34] drm/amdgpu/gfx9: wait for reset done before remap Alex Deucher
                   ` (16 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

The KIQ unmap_queues command only dequeues the queue.
We have to map the queue back with a clean MQD.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 36 ++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 78495df29b5c..3a819c6923c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3742,7 +3742,7 @@ static int gfx_v9_0_kiq_init_queue(struct amdgpu_ring *ring)
 	return 0;
 }
 
-static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
+static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring, bool restore)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v9_mqd *mqd = ring->mqd_ptr;
@@ -3754,8 +3754,8 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
 	 */
 	tmp_mqd = (struct v9_mqd *)adev->gfx.mec.mqd_backup[mqd_idx];
 
-	if (!tmp_mqd->cp_hqd_pq_control ||
-	    (!amdgpu_in_reset(adev) && !adev->in_suspend)) {
+	if (!restore && (!tmp_mqd->cp_hqd_pq_control ||
+	    (!amdgpu_in_reset(adev) && !adev->in_suspend))) {
 		memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
 		((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
 		((struct v9_mqd_allocation *)mqd)->dynamic_rb_mask = 0xFFFFFFFF;
@@ -3819,7 +3819,7 @@ static int gfx_v9_0_kcq_resume(struct amdgpu_device *adev)
 			goto done;
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v9_0_kcq_init_queue(ring);
+			r = gfx_v9_0_kcq_init_queue(ring, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -7129,11 +7129,29 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
 	if (r)
 		return r;
 
-	/* reset the ring */
-	ring->wptr = 0;
-	*ring->wptr_cpu_addr = 0;
-	amdgpu_ring_clear_ring(ring);
-
+	r = amdgpu_bo_reserve(ring->mqd_obj, false);
+	if (unlikely(r != 0)){
+		DRM_ERROR("fail to resv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
+	if (!r) {
+		r = gfx_v9_0_kcq_init_queue(ring, true);
+		amdgpu_bo_kunmap(ring->mqd_obj);
+		ring->mqd_ptr = NULL;
+	}
+	amdgpu_bo_unreserve(ring->mqd_obj);
+	if (r){
+		DRM_ERROR("fail to unresv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_ring_alloc(kiq_ring, kiq->pmf->map_queues_size);
+	kiq->pmf->kiq_map_queues(kiq_ring, ring);
+	r = amdgpu_ring_test_ring(kiq_ring);
+	if (r){
+		DRM_ERROR("fail to remap queue\n");
+		return r;
+	}
 	return amdgpu_ring_test_ring(ring);
 }
 
-- 
2.45.2



* [PATCH 20/34] drm/amdgpu/gfx9: wait for reset done before remap
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (18 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 19/34] drm/amdgpu/gfx9: remap queue after reset successfully Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 21/34] drm/amdgpu/gfx9.4.3: add ring reset callback Alex Deucher
                   ` (15 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

There is a race condition in which the CP firmware modifies
the MQD during the reset sequence after the driver has updated
it for remapping. We have to wait until CP_HQD_ACTIVE becomes
false before remapping the queue.

v2: fix KIQ locking (Alex)

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 35 +++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 3a819c6923c6..fdc3fb636e02 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -7107,7 +7107,7 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
 	struct amdgpu_ring *kiq_ring = &kiq->ring;
 	unsigned long flags;
-	int r;
+	int i, r;
 
 	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
 		return -EINVAL;
@@ -7129,9 +7129,28 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
 	if (r)
 		return r;
 
+	/* make sure dequeue is complete*/
+	gfx_v9_0_set_safe_mode(adev, 0);
+	mutex_lock(&adev->srbm_mutex);
+	soc15_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0, 0);
+	for (i = 0; i < adev->usec_timeout; i++) {
+		if (!(RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE) & 1))
+			break;
+		udelay(1);
+	}
+	if (i >= adev->usec_timeout)
+		r = -ETIMEDOUT;
+	soc15_grbm_select(adev, 0, 0, 0, 0, 0);
+	mutex_unlock(&adev->srbm_mutex);
+	gfx_v9_0_unset_safe_mode(adev, 0);
+	if (r) {
+		dev_err(adev->dev, "fail to wait on hqd deactive\n");
+		return r;
+	}
+
 	r = amdgpu_bo_reserve(ring->mqd_obj, false);
 	if (unlikely(r != 0)){
-		DRM_ERROR("fail to resv mqd_obj\n");
+		dev_err(adev->dev, "fail to resv mqd_obj\n");
 		return r;
 	}
 	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
@@ -7141,14 +7160,20 @@ static int gfx_v9_0_reset_kcq(struct amdgpu_ring *ring,
 		ring->mqd_ptr = NULL;
 	}
 	amdgpu_bo_unreserve(ring->mqd_obj);
-	if (r){
-		DRM_ERROR("fail to unresv mqd_obj\n");
+	if (r) {
+		dev_err(adev->dev, "fail to unresv mqd_obj\n");
 		return r;
 	}
+	spin_lock_irqsave(&kiq->ring_lock, flags);
 	r = amdgpu_ring_alloc(kiq_ring, kiq->pmf->map_queues_size);
+	if (r) {
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
+	}
 	kiq->pmf->kiq_map_queues(kiq_ring, ring);
 	r = amdgpu_ring_test_ring(kiq_ring);
-	if (r){
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+	if (r) {
 		DRM_ERROR("fail to remap queue\n");
 		return r;
 	}
-- 
2.45.2



* [PATCH 21/34] drm/amdgpu/gfx9.4.3: add ring reset callback
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add ring reset callback for compute.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 38 +++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index 98fe6c40da64..6cf90ebdbad1 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -3263,6 +3263,43 @@ static void gfx_v9_4_3_emit_wave_limit(struct amdgpu_ring *ring, bool enable)
 	}
 }
 
+static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
+				unsigned int vmid)
+{
+	struct amdgpu_device *adev = ring->adev;
+	struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
+	struct amdgpu_ring *kiq_ring = &kiq->ring;
+	unsigned long flags;
+	int r;
+
+	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
+		return -EINVAL;
+
+	spin_lock_irqsave(&kiq->ring_lock, flags);
+
+	if (amdgpu_ring_alloc(kiq_ring, kiq->pmf->unmap_queues_size)) {
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
+	}
+
+	kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
+				   0, 0);
+	amdgpu_ring_commit(kiq_ring);
+
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+
+	r = amdgpu_ring_test_ring(kiq_ring);
+	if (r)
+		return r;
+
+	/* reset the ring */
+	ring->wptr = 0;
+	*ring->wptr_cpu_addr = 0;
+	amdgpu_ring_clear_ring(ring);
+
+	return amdgpu_ring_test_ring(ring);
+}
+
 enum amdgpu_gfx_cp_ras_mem_id {
 	AMDGPU_GFX_CP_MEM1 = 1,
 	AMDGPU_GFX_CP_MEM2,
@@ -4235,6 +4272,7 @@ static const struct amdgpu_ring_funcs gfx_v9_4_3_ring_funcs_compute = {
 	.soft_recovery = gfx_v9_4_3_ring_soft_recovery,
 	.emit_mem_sync = gfx_v9_4_3_emit_mem_sync,
 	.emit_wave_limit = gfx_v9_4_3_emit_wave_limit,
+	.reset = gfx_v9_4_3_reset_kcq,
 };
 
 static const struct amdgpu_ring_funcs gfx_v9_4_3_ring_funcs_kiq = {
-- 
2.45.2



* [PATCH 22/34] drm/amdgpu/gfx9.4.3: remap queue after reset successfully
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

The KIQ unmap_queues command only dequeues the queue.
We have to map the queue back with a clean MQD.
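
As a rough illustration of the MQD handling this change relies on, the
sketch below models the init-versus-restore decision controlled by the
new restore flag. All toy_* names and the struct layout are
hypothetical, and the restore path (copying a clean backup over the
live MQD) is an assumption, since that branch is not visible in the
hunks of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a memory queue descriptor (MQD). */
struct toy_mqd {
	unsigned int pq_control;	/* non-zero once the queue has been set up */
	unsigned int rptr;
	unsigned int wptr;
};

/* Hypothetical driver state; the names are illustrative only. */
struct toy_state {
	bool in_reset;		/* full adapter reset in progress */
	bool in_suspend;	/* suspend/resume in progress */
	struct toy_mqd backup;	/* clean copy saved at first init */
};

/*
 * Returns true if the MQD was built from scratch and false if it was
 * restored from the backup copy. The condition has the same shape as
 * the one in the patch: restore == true always takes the restore path.
 */
static bool toy_init_mqd(struct toy_state *st, struct toy_mqd *mqd, bool restore)
{
	assert(st && mqd);

	if (!restore && (!st->backup.pq_control ||
			 (!st->in_reset && !st->in_suspend))) {
		memset(mqd, 0, sizeof(*mqd));
		mqd->pq_control = 1;	/* placeholder for the real programming */
		st->backup = *mqd;	/* remember the clean state */
		return true;
	}

	/* Assumed restore path: copy the clean backup over the live MQD. */
	*mqd = st->backup;
	return false;
}
```

In this model a per-queue reset calls toy_init_mqd() with restore set
to true, so stale read/write pointers left by the hung queue are
discarded before the queue is mapped again.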

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 36 ++++++++++++++++++-------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index 6cf90ebdbad1..394790d00385 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -1917,7 +1917,7 @@ static int gfx_v9_4_3_xcc_kiq_init_queue(struct amdgpu_ring *ring, int xcc_id)
 	return 0;
 }
 
-static int gfx_v9_4_3_xcc_kcq_init_queue(struct amdgpu_ring *ring, int xcc_id)
+static int gfx_v9_4_3_xcc_kcq_init_queue(struct amdgpu_ring *ring, int xcc_id, bool restore)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v9_mqd *mqd = ring->mqd_ptr;
@@ -1929,8 +1929,8 @@ static int gfx_v9_4_3_xcc_kcq_init_queue(struct amdgpu_ring *ring, int xcc_id)
 	 */
 	tmp_mqd = (struct v9_mqd *)adev->gfx.mec.mqd_backup[mqd_idx];
 
-	if (!tmp_mqd->cp_hqd_pq_control ||
-	    (!amdgpu_in_reset(adev) && !adev->in_suspend)) {
+	if (!restore && (!tmp_mqd->cp_hqd_pq_control ||
+	    (!amdgpu_in_reset(adev) && !adev->in_suspend))) {
 		memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
 		((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
 		((struct v9_mqd_allocation *)mqd)->dynamic_rb_mask = 0xFFFFFFFF;
@@ -2015,7 +2015,7 @@ static int gfx_v9_4_3_xcc_kcq_resume(struct amdgpu_device *adev, int xcc_id)
 			goto done;
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v9_4_3_xcc_kcq_init_queue(ring, xcc_id);
+			r = gfx_v9_4_3_xcc_kcq_init_queue(ring, xcc_id, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -3292,11 +3292,29 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
 	if (r)
 		return r;
 
-	/* reset the ring */
-	ring->wptr = 0;
-	*ring->wptr_cpu_addr = 0;
-	amdgpu_ring_clear_ring(ring);
-
+	r = amdgpu_bo_reserve(ring->mqd_obj, false);
+	if (unlikely(r != 0)){
+		DRM_ERROR("fail to resv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
+	if (!r) {
+		r = gfx_v9_4_3_xcc_kcq_init_queue(ring, ring->xcc_id, true);
+		amdgpu_bo_kunmap(ring->mqd_obj);
+		ring->mqd_ptr = NULL;
+	}
+	amdgpu_bo_unreserve(ring->mqd_obj);
+	if (r){
+		DRM_ERROR("fail to unresv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_ring_alloc(kiq_ring, kiq->pmf->map_queues_size);
+	kiq->pmf->kiq_map_queues(kiq_ring, ring);
+	r = amdgpu_ring_test_ring(kiq_ring);
+	if (r){
+		DRM_ERROR("fail to remap queue\n");
+		return r;
+	}
 	return amdgpu_ring_test_ring(ring);
 }
 
-- 
2.45.2



* [PATCH 23/34] drm/amdgpu/gfx_9.4.3: wait for reset done before remap
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

There is a race condition in which the CP firmware modifies
the MQD during the reset sequence after the driver has
updated it for remapping. We have to wait until
CP_HQD_ACTIVE becomes false before remapping the queue.
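
The wait described here amounts to a bounded poll on a status bit. The
sketch below is a minimal user-space model of that idea, not the driver
code: the toy_* names are hypothetical, a function pointer stands in
for the MMIO read, and the per-iteration delay used in the patch
(udelay(1)) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical register reader; a real driver would read MMIO here. */
typedef uint32_t (*toy_read_fn)(void *ctx);

/*
 * Poll until bit 0 of the register clears or max_polls reads have
 * been made. Returns 0 on success and -ETIMEDOUT otherwise.
 */
static int toy_wait_inactive(toy_read_fn read_reg, void *ctx,
			     unsigned int max_polls)
{
	unsigned int i;

	assert(read_reg);

	for (i = 0; i < max_polls; i++) {
		if (!(read_reg(ctx) & 1))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Test double: reports "active" until *remaining reaches zero. */
static uint32_t toy_fake_active(void *ctx)
{
	unsigned int *remaining = ctx;

	if (*remaining == 0)
		return 0;
	(*remaining)--;
	return 1;
}
```

Returning the timeout to the caller, rather than only logging it, lets
the caller skip the remap when the queue never went inactive.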

v2: fix KIQ locking (Alex)

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 37 +++++++++++++++++++++----
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index 394790d00385..717320d92e68 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -3270,7 +3270,7 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[ring->xcc_id];
 	struct amdgpu_ring *kiq_ring = &kiq->ring;
 	unsigned long flags;
-	int r;
+	int r, i;
 
 	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
 		return -EINVAL;
@@ -3292,9 +3292,28 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
 	if (r)
 		return r;
 
+	/* make sure dequeue is complete */
+	gfx_v9_4_3_xcc_set_safe_mode(adev, ring->xcc_id);
+	mutex_lock(&adev->srbm_mutex);
+	soc15_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0, GET_INST(GC, ring->xcc_id));
+	for (i = 0; i < adev->usec_timeout; i++) {
+		if (!(RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) & 1))
+			break;
+		udelay(1);
+	}
+	if (i >= adev->usec_timeout)
+		r = -ETIMEDOUT;
+	soc15_grbm_select(adev, 0, 0, 0, 0, GET_INST(GC, ring->xcc_id));
+	mutex_unlock(&adev->srbm_mutex);
+	gfx_v9_4_3_xcc_unset_safe_mode(adev, ring->xcc_id);
+	if (r) {
+		dev_err(adev->dev, "fail to wait on hqd deactive\n");
+		return r;
+	}
+
 	r = amdgpu_bo_reserve(ring->mqd_obj, false);
 	if (unlikely(r != 0)){
-		DRM_ERROR("fail to resv mqd_obj\n");
+		dev_err(adev->dev, "fail to resv mqd_obj\n");
 		return r;
 	}
 	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
@@ -3304,15 +3323,21 @@ static int gfx_v9_4_3_reset_kcq(struct amdgpu_ring *ring,
 		ring->mqd_ptr = NULL;
 	}
 	amdgpu_bo_unreserve(ring->mqd_obj);
-	if (r){
-		DRM_ERROR("fail to unresv mqd_obj\n");
+	if (r) {
+		dev_err(adev->dev, "fail to unresv mqd_obj\n");
 		return r;
 	}
+	spin_lock_irqsave(&kiq->ring_lock, flags);
 	r = amdgpu_ring_alloc(kiq_ring, kiq->pmf->map_queues_size);
+	if (r) {
+		spin_unlock_irqrestore(&kiq->ring_lock, flags);
+		return -ENOMEM;
+	}
 	kiq->pmf->kiq_map_queues(kiq_ring, ring);
 	r = amdgpu_ring_test_ring(kiq_ring);
-	if (r){
-		DRM_ERROR("fail to remap queue\n");
+	spin_unlock_irqrestore(&kiq->ring_lock, flags);
+	if (r) {
+		dev_err(adev->dev, "fail to remap queue\n");
 		return r;
 	}
 	return amdgpu_ring_test_ring(ring);
-- 
2.45.2



* [PATCH 24/34] drm/amdgpu/gfx12: add ring reset callbacks
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Add ring reset callbacks for gfx and compute.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 63b073fd4dc7..9ed6c8ba6b33 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -5135,6 +5135,22 @@ static void gfx_v12_ip_dump(void *handle)
 	amdgpu_gfx_off_ctrl(adev, true);
 }
 
+static int gfx_v12_0_reset_ring(struct amdgpu_ring *ring, unsigned int vmid)
+{
+	int r;
+
+	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid);
+	if (r)
+		return r;
+
+	/* reset the ring */
+	ring->wptr = 0;
+	*ring->wptr_cpu_addr = 0;
+	amdgpu_ring_clear_ring(ring);
+
+	return amdgpu_ring_test_ring(ring);
+}
+
 static const struct amd_ip_funcs gfx_v12_0_ip_funcs = {
 	.name = "gfx_v12_0",
 	.early_init = gfx_v12_0_early_init,
@@ -5197,6 +5213,7 @@ static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_gfx = {
 	.emit_reg_write_reg_wait = gfx_v12_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v12_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v12_0_emit_mem_sync,
+	.reset = gfx_v12_0_reset_ring,
 };
 
 static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_compute = {
@@ -5231,6 +5248,7 @@ static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_compute = {
 	.emit_reg_write_reg_wait = gfx_v12_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v12_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v12_0_emit_mem_sync,
+	.reset = gfx_v12_0_reset_ring,
 };
 
 static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_kiq = {
-- 
2.45.2



* [PATCH 25/34] drm/amdgpu/gfx12: fallback to driver reset compute queue directly
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Resetting a kernel compute queue through the MES FW always fails,
possibly because the KIQ fails to process the KCQ unmap. Until the
MES FW works properly, fall back to having the driver issue the
dequeue request and reset the SPI directly. Also rework the ring reset
function so that each ring type (KGQ and KCQ) is reset in its own function.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 93 ++++++++++++++++++++++----
 1 file changed, 79 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 9ed6c8ba6b33..c4193fa2fea4 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -2910,13 +2910,13 @@ static int gfx_v12_0_gfx_mqd_init(struct amdgpu_device *adev, void *m,
 	return 0;
 }
 
-static int gfx_v12_0_gfx_init_queue(struct amdgpu_ring *ring)
+static int gfx_v12_0_kgq_init_queue(struct amdgpu_ring *ring, bool reset)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v12_gfx_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.gfx_ring[0];
 
-	if (!amdgpu_in_reset(adev) && !adev->in_suspend) {
+	if (!reset && !amdgpu_in_reset(adev) && !adev->in_suspend) {
 		memset((void *)mqd, 0, sizeof(*mqd));
 		mutex_lock(&adev->srbm_mutex);
 		soc24_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
@@ -2952,7 +2952,7 @@ static int gfx_v12_0_cp_async_gfx_ring_resume(struct amdgpu_device *adev)
 
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v12_0_gfx_init_queue(ring);
+			r = gfx_v12_0_kgq_init_queue(ring, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -3256,13 +3256,13 @@ static int gfx_v12_0_kiq_init_queue(struct amdgpu_ring *ring)
 	return 0;
 }
 
-static int gfx_v12_0_kcq_init_queue(struct amdgpu_ring *ring)
+static int gfx_v12_0_kcq_init_queue(struct amdgpu_ring *ring, bool reset)
 {
 	struct amdgpu_device *adev = ring->adev;
 	struct v12_compute_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.compute_ring[0];
 
-	if (!amdgpu_in_reset(adev) && !adev->in_suspend) {
+	if (!reset && !amdgpu_in_reset(adev) && !adev->in_suspend) {
 		memset((void *)mqd, 0, sizeof(*mqd));
 		mutex_lock(&adev->srbm_mutex);
 		soc24_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
@@ -3326,7 +3326,7 @@ static int gfx_v12_0_kcq_resume(struct amdgpu_device *adev)
 			goto done;
 		r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
 		if (!r) {
-			r = gfx_v12_0_kcq_init_queue(ring);
+			r = gfx_v12_0_kcq_init_queue(ring, false);
 			amdgpu_bo_kunmap(ring->mqd_obj);
 			ring->mqd_ptr = NULL;
 		}
@@ -5135,18 +5135,83 @@ static void gfx_v12_ip_dump(void *handle)
 	amdgpu_gfx_off_ctrl(adev, true);
 }
 
-static int gfx_v12_0_reset_ring(struct amdgpu_ring *ring, unsigned int vmid)
+static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 {
+	struct amdgpu_device *adev = ring->adev;
 	int r;
 
 	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid);
-	if (r)
+	if (r) {
+		dev_err(adev->dev, "reset via MES failed %d\n", r);
 		return r;
+	}
 
-	/* reset the ring */
-	ring->wptr = 0;
-	*ring->wptr_cpu_addr = 0;
-	amdgpu_ring_clear_ring(ring);
+	r = amdgpu_bo_reserve(ring->mqd_obj, false);
+	if (unlikely(r != 0)){
+		dev_err(adev->dev, "fail to resv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
+	if (!r) {
+		r = gfx_v12_0_kgq_init_queue(ring, true);
+		amdgpu_bo_kunmap(ring->mqd_obj);
+		ring->mqd_ptr = NULL;
+	}
+	amdgpu_bo_unreserve(ring->mqd_obj);
+	if (r){
+		DRM_ERROR("fail to unresv mqd_obj\n");
+		return r;
+	}
+
+	r = amdgpu_mes_map_legacy_queue(adev, ring);
+	if (r) {
+		dev_err(adev->dev, "failed to remap kgq\n");
+		return r;
+	}
+
+	return amdgpu_ring_test_ring(ring);
+}
+
+static int gfx_v12_0_reset_kcq(struct amdgpu_ring *ring, unsigned int vmid)
+{
+	struct amdgpu_device *adev = ring->adev;
+	int r, i;
+
+	gfx_v12_0_set_safe_mode(adev, 0);
+	mutex_lock(&adev->srbm_mutex);
+	soc24_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
+	WREG32_SOC15(GC, 0, regCP_HQD_DEQUEUE_REQUEST, 0x2);
+	WREG32_SOC15(GC, 0, regSPI_COMPUTE_QUEUE_RESET, 0x1);
+	for (i = 0; i < adev->usec_timeout; i++) {
+		if (!(RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) & 1))
+			break;
+		udelay(1);
+	}
+	soc24_grbm_select(adev, 0, 0, 0, 0);
+	mutex_unlock(&adev->srbm_mutex);
+	gfx_v12_0_unset_safe_mode(adev, 0);
+
+	r = amdgpu_bo_reserve(ring->mqd_obj, false);
+	if (unlikely(r != 0)){
+		DRM_ERROR("fail to resv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_bo_kmap(ring->mqd_obj, (void **)&ring->mqd_ptr);
+	if (!r) {
+		r = gfx_v12_0_kcq_init_queue(ring, true);
+		amdgpu_bo_kunmap(ring->mqd_obj);
+		ring->mqd_ptr = NULL;
+	}
+	amdgpu_bo_unreserve(ring->mqd_obj);
+	if (r){
+		DRM_ERROR("fail to unresv mqd_obj\n");
+		return r;
+	}
+	r = amdgpu_mes_map_legacy_queue(adev, ring);
+	if (r) {
+		dev_err(adev->dev, "failed to remap kcq\n");
+		return r;
+	}
 
 	return amdgpu_ring_test_ring(ring);
 }
@@ -5213,7 +5278,7 @@ static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_gfx = {
 	.emit_reg_write_reg_wait = gfx_v12_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v12_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v12_0_emit_mem_sync,
-	.reset = gfx_v12_0_reset_ring,
+	.reset = gfx_v12_0_reset_kgq,
 };
 
 static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_compute = {
@@ -5248,7 +5313,7 @@ static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_compute = {
 	.emit_reg_write_reg_wait = gfx_v12_0_ring_emit_reg_write_reg_wait,
 	.soft_recovery = gfx_v12_0_ring_soft_recovery,
 	.emit_mem_sync = gfx_v12_0_emit_mem_sync,
-	.reset = gfx_v12_0_reset_ring,
+	.reset = gfx_v12_0_reset_kcq,
 };
 
 static const struct amdgpu_ring_funcs gfx_v12_0_ring_funcs_kiq = {
-- 
2.45.2



* [PATCH 26/34] drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

Add a kiq_reset_hw_queue callback to kiq_pm4_funcs.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 86d3fa7eef90..6fe77e483bb7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -138,6 +138,10 @@ struct kiq_pm4_funcs {
 	void (*kiq_invalidate_tlbs)(struct amdgpu_ring *kiq_ring,
 				uint16_t pasid, uint32_t flush_type,
 				bool all_hub);
+	void (*kiq_reset_hw_queue)(struct amdgpu_ring *kiq_ring,
+				   uint32_t queue_type, uint32_t me_id,
+				   uint32_t pipe_id, uint32_t queue_id,
+				   uint32_t xcc_id, uint32_t vmid);
 	/* Packet sizes */
 	int set_resources_size;
 	int map_queues_size;
-- 
2.45.2



* [PATCH 27/34] drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

Use MMIO to do the queue reset. Enter safe mode
while writing the registers.
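
The ordering used for an MMIO reset (enter safe mode, select the
queue's register bank, request the reset, then undo both steps in
reverse) can be sketched with a toy register file. Everything named
toy_* is hypothetical; the model clears the active bit immediately,
whereas real hardware has to be polled, and the srbm_mutex locking is
omitted.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_QUEUES 4

/* Hypothetical banked register file: one "active" register per queue. */
struct toy_hw {
	unsigned int selected;			/* currently selected queue */
	uint32_t hqd_active[TOY_NUM_QUEUES];	/* per-queue banked register */
	int safe_mode;				/* nesting depth of safe mode */
};

static void toy_select(struct toy_hw *hw, unsigned int queue)
{
	assert(queue < TOY_NUM_QUEUES);
	hw->selected = queue;
}

/* Toy model: a reset request clears the selected queue's active bit. */
static void toy_request_reset(struct toy_hw *hw)
{
	hw->hqd_active[hw->selected] = 0;
}

/*
 * Reset one queue: enter safe mode, select the queue, request the
 * reset, then restore the default selection and leave safe mode.
 * Returns 0 if the queue is inactive afterwards, -1 otherwise.
 */
static int toy_reset_queue(struct toy_hw *hw, unsigned int queue)
{
	int ret;

	hw->safe_mode++;
	toy_select(hw, queue);
	toy_request_reset(hw);
	ret = (hw->hqd_active[hw->selected] & 1) ? -1 : 0;
	toy_select(hw, 0);	/* always restore the default selection */
	hw->safe_mode--;
	return ret;
}
```

Because the registers are banked per queue, only the selected queue is
affected, and restoring the default selection keeps later register
accesses from landing in the wrong bank.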

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 37 +++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index fdc3fb636e02..1a04f52ce0a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -893,6 +893,8 @@ static int gfx_v9_0_ras_error_inject(struct amdgpu_device *adev,
 static void gfx_v9_0_reset_ras_error_count(struct amdgpu_device *adev);
 static void gfx_v9_0_update_spm_vmid_internal(struct amdgpu_device *adev,
 					      unsigned int vmid);
+static void gfx_v9_0_set_safe_mode(struct amdgpu_device *adev, int xcc_id);
+static void gfx_v9_0_unset_safe_mode(struct amdgpu_device *adev, int xcc_id);
 
 static void gfx_v9_0_kiq_set_resources(struct amdgpu_ring *kiq_ring,
 				uint64_t queue_mask)
@@ -1004,12 +1006,47 @@ static void gfx_v9_0_kiq_invalidate_tlbs(struct amdgpu_ring *kiq_ring,
 			PACKET3_INVALIDATE_TLBS_FLUSH_TYPE(flush_type));
 }
 
+
+static void gfx_v9_0_kiq_reset_hw_queue(struct amdgpu_ring *kiq_ring, uint32_t queue_type,
+					uint32_t me_id, uint32_t pipe_id, uint32_t queue_id,
+					uint32_t xcc_id, uint32_t vmid)
+{
+	struct amdgpu_device *adev = kiq_ring->adev;
+	unsigned i;
+
+	/* enter safe mode */
+	gfx_v9_0_set_safe_mode(adev, xcc_id);
+	mutex_lock(&adev->srbm_mutex);
+	soc15_grbm_select(adev, me_id, pipe_id, queue_id, 0, 0);
+
+	if (queue_type == AMDGPU_RING_TYPE_COMPUTE) {
+		WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 0x2);
+		WREG32_SOC15(GC, 0, mmSPI_COMPUTE_QUEUE_RESET, 0x1);
+		/* wait until the dequeue takes effect */
+		for (i = 0; i < adev->usec_timeout; i++) {
+			if (!(RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE) & 1))
+				break;
+			udelay(1);
+		}
+		if (i >= adev->usec_timeout)
+			dev_err(adev->dev, "fail to wait on hqd deactive\n");
+	} else {
+		dev_err(adev->dev, "reset queue_type(%d) not supported\n", queue_type);
+	}
+
+	soc15_grbm_select(adev, 0, 0, 0, 0, 0);
+	mutex_unlock(&adev->srbm_mutex);
+	/* exit safe mode */
+	gfx_v9_0_unset_safe_mode(adev, xcc_id);
+}
+
 static const struct kiq_pm4_funcs gfx_v9_0_kiq_pm4_funcs = {
 	.kiq_set_resources = gfx_v9_0_kiq_set_resources,
 	.kiq_map_queues = gfx_v9_0_kiq_map_queues,
 	.kiq_unmap_queues = gfx_v9_0_kiq_unmap_queues,
 	.kiq_query_status = gfx_v9_0_kiq_query_status,
 	.kiq_invalidate_tlbs = gfx_v9_0_kiq_invalidate_tlbs,
+	.kiq_reset_hw_queue = gfx_v9_0_kiq_reset_hw_queue,
 	.set_resources_size = 8,
 	.map_queues_size = 7,
 	.unmap_queues_size = 6,
-- 
2.45.2



* [PATCH 28/34] drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

Use MMIO to do the queue reset. Enter safe mode
before writing the MMIO registers.

v2: set register instance offset according to xcc id.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 36 +++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
index 717320d92e68..267d5998bb80 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
@@ -71,6 +71,8 @@ static void gfx_v9_4_3_set_gds_init(struct amdgpu_device *adev);
 static void gfx_v9_4_3_set_rlc_funcs(struct amdgpu_device *adev);
 static int gfx_v9_4_3_get_cu_info(struct amdgpu_device *adev,
 				struct amdgpu_cu_info *cu_info);
+static void gfx_v9_4_3_xcc_set_safe_mode(struct amdgpu_device *adev, int xcc_id);
+static void gfx_v9_4_3_xcc_unset_safe_mode(struct amdgpu_device *adev, int xcc_id);
 
 static void gfx_v9_4_3_kiq_set_resources(struct amdgpu_ring *kiq_ring,
 				uint64_t queue_mask)
@@ -182,12 +184,46 @@ static void gfx_v9_4_3_kiq_invalidate_tlbs(struct amdgpu_ring *kiq_ring,
 			PACKET3_INVALIDATE_TLBS_FLUSH_TYPE(flush_type));
 }
 
+static void gfx_v9_4_3_kiq_reset_hw_queue(struct amdgpu_ring *kiq_ring, uint32_t queue_type,
+					  uint32_t me_id, uint32_t pipe_id, uint32_t queue_id,
+					  uint32_t xcc_id, uint32_t vmid)
+{
+	struct amdgpu_device *adev = kiq_ring->adev;
+	unsigned i;
+
+	/* enter safe mode */
+	gfx_v9_4_3_xcc_set_safe_mode(adev, xcc_id);
+	mutex_lock(&adev->srbm_mutex);
+	soc15_grbm_select(adev, me_id, pipe_id, queue_id, 0, xcc_id);
+
+	if (queue_type == AMDGPU_RING_TYPE_COMPUTE) {
+		WREG32_SOC15(GC, GET_INST(GC, xcc_id), regCP_HQD_DEQUEUE_REQUEST, 0x2);
+		WREG32_SOC15(GC, GET_INST(GC, xcc_id), regSPI_COMPUTE_QUEUE_RESET, 0x1);
+		/* wait until the dequeue takes effect */
+		for (i = 0; i < adev->usec_timeout; i++) {
+			if (!(RREG32_SOC15(GC, GET_INST(GC, xcc_id), regCP_HQD_ACTIVE) & 1))
+				break;
+			udelay(1);
+		}
+		if (i >= adev->usec_timeout)
+			dev_err(adev->dev, "fail to wait on hqd deactive\n");
+	} else {
+		dev_err(adev->dev, "reset queue_type(%d) not supported\n", queue_type);
+	}
+
+	soc15_grbm_select(adev, 0, 0, 0, 0, 0);
+	mutex_unlock(&adev->srbm_mutex);
+	/* exit safe mode */
+	gfx_v9_4_3_xcc_unset_safe_mode(adev, xcc_id);
+}
+
 static const struct kiq_pm4_funcs gfx_v9_4_3_kiq_pm4_funcs = {
 	.kiq_set_resources = gfx_v9_4_3_kiq_set_resources,
 	.kiq_map_queues = gfx_v9_4_3_kiq_map_queues,
 	.kiq_unmap_queues = gfx_v9_4_3_kiq_unmap_queues,
 	.kiq_query_status = gfx_v9_4_3_kiq_query_status,
 	.kiq_invalidate_tlbs = gfx_v9_4_3_kiq_invalidate_tlbs,
+	.kiq_reset_hw_queue = gfx_v9_4_3_kiq_reset_hw_queue,
 	.set_resources_size = 8,
 	.map_queues_size = 7,
 	.unmap_queues_size = 6,
-- 
2.45.2



* [PATCH 29/34] drm/amdgpu/mes: modify mes api for mmio queue reset
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

Add me/pipe/queue parameters for queue reset input.

v2: fix build (Alex)

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c |  3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h | 14 +++++++++++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c  |  2 +-
 4 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index b3d6a9fa6775..950c26ee3fb8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -864,7 +864,8 @@ int amdgpu_mes_unmap_legacy_queue(struct amdgpu_device *adev,
 
 int amdgpu_mes_reset_legacy_queue(struct amdgpu_device *adev,
 				  struct amdgpu_ring *ring,
-				  unsigned int vmid)
+				  unsigned int vmid,
+				  bool use_mmio)
 {
 	struct mes_reset_legacy_queue_input queue_input;
 	int r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index 771b63db1846..e6a4ef643967 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -251,6 +251,13 @@ struct mes_remove_queue_input {
 struct mes_reset_queue_input {
 	uint32_t	doorbell_offset;
 	uint64_t	gang_context_addr;
+	bool		use_mmio;
+	uint32_t	queue_type;
+	uint32_t	me_id;
+	uint32_t	pipe_id;
+	uint32_t	queue_id;
+	uint32_t	xcc_id;
+	uint32_t	vmid;
 };
 
 struct mes_map_legacy_queue_input {
@@ -287,6 +294,8 @@ struct mes_resume_gang_input {
 struct mes_reset_legacy_queue_input {
 	uint32_t                           queue_type;
 	uint32_t                           doorbell_offset;
+	bool                               use_mmio;
+	uint32_t                           me_id;
 	uint32_t                           pipe_id;
 	uint32_t                           queue_id;
 	uint64_t                           mqd_addr;
@@ -396,6 +405,8 @@ int amdgpu_mes_add_hw_queue(struct amdgpu_device *adev, int gang_id,
 			    int *queue_id);
 int amdgpu_mes_remove_hw_queue(struct amdgpu_device *adev, int queue_id);
 int amdgpu_mes_reset_hw_queue(struct amdgpu_device *adev, int queue_id);
+int amdgpu_mes_reset_hw_queue_mmio(struct amdgpu_device *adev, int queue_type,
+				   int me_id, int pipe_id, int queue_id, int vmid);
 
 int amdgpu_mes_map_legacy_queue(struct amdgpu_device *adev,
 				struct amdgpu_ring *ring);
@@ -405,7 +416,8 @@ int amdgpu_mes_unmap_legacy_queue(struct amdgpu_device *adev,
 				  u64 gpu_addr, u64 seq);
 int amdgpu_mes_reset_legacy_queue(struct amdgpu_device *adev,
 				  struct amdgpu_ring *ring,
-				  unsigned int vmid);
+				  unsigned int vmid,
+				  bool use_mmio);
 
 uint32_t amdgpu_mes_rreg(struct amdgpu_device *adev, uint32_t reg);
 int amdgpu_mes_wreg(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 1b2de8e81ccd..348bc1b1784a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -6525,7 +6525,7 @@ static int gfx_v11_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	struct amdgpu_device *adev = ring->adev;
 	int r;
 
-	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid);
+	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
 	if (r)
 		return r;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index c4193fa2fea4..ba121491f5a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -5140,7 +5140,7 @@ static int gfx_v12_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
 	struct amdgpu_device *adev = ring->adev;
 	int r;
 
-	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid);
+	r = amdgpu_mes_reset_legacy_queue(ring->adev, ring, vmid, false);
 	if (r) {
 		dev_err(adev->dev, "reset via MES failed %d\n", r);
 		return r;
-- 
2.45.2



* [PATCH 30/34] drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (28 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 29/34] drm/amdgpu/mes: modify mes api for mmio queue reset Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 31/34] drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL Alex Deucher
                   ` (5 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

The reset_queue API can be used from either KFD or KGD.

v2: add use_mmio parameter for mes_reset_legacy_queue.

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 950c26ee3fb8..0bc6ce26ce45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -817,6 +817,24 @@ int amdgpu_mes_reset_hw_queue(struct amdgpu_device *adev, int queue_id)
 	return 0;
 }
 
+int amdgpu_mes_reset_hw_queue_mmio(struct amdgpu_device *adev, int queue_type,
+				   int me_id, int pipe_id, int queue_id, int vmid)
+{
+	struct mes_reset_queue_input queue_input = { .queue_type = queue_type };
+	int r;
+
+	queue_input.use_mmio = true;
+	queue_input.me_id = me_id;
+	queue_input.pipe_id = pipe_id;
+	queue_input.queue_id = queue_id;
+	queue_input.vmid = vmid;
+	r = adev->mes.funcs->reset_hw_queue(&adev->mes, &queue_input);
+	if (r)
+		DRM_ERROR("failed to reset hardware queue by mmio, queue id = %d\n",
+			  queue_id);
+	return r;
+}
+
 int amdgpu_mes_map_legacy_queue(struct amdgpu_device *adev,
 				struct amdgpu_ring *ring)
 {
@@ -874,11 +892,13 @@ int amdgpu_mes_reset_legacy_queue(struct amdgpu_device *adev,
 
 	queue_input.queue_type = ring->funcs->type;
 	queue_input.doorbell_offset = ring->doorbell_index;
+	queue_input.me_id = ring->me;
 	queue_input.pipe_id = ring->pipe;
 	queue_input.queue_id = ring->queue;
 	queue_input.mqd_addr = amdgpu_bo_gpu_offset(ring->mqd_obj);
 	queue_input.wptr_addr = ring->wptr_gpu_addr;
 	queue_input.vmid = vmid;
+	queue_input.use_mmio = use_mmio;
 
 	r = adev->mes.funcs->reset_legacy_queue(&adev->mes, &queue_input);
 	if (r)
-- 
2.45.2
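
For readers following along, the use_mmio dispatch in this patch can be
sketched in plain userspace C roughly as below.  This is only an
illustration under stated assumptions: struct reset_input,
make_mmio_request() and backend_reset() are hypothetical stand-ins, not
the amdgpu definitions.  The sketch zeroes the request before filling it
so that every field the backend reads, queue_type included, has a
defined value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct mes_reset_queue_input. */
struct reset_input {
	bool     use_mmio;
	uint32_t queue_type;
	uint32_t me_id;
	uint32_t pipe_id;
	uint32_t queue_id;
	uint32_t vmid;
};

enum reset_path { RESET_PATH_FIRMWARE = 0, RESET_PATH_MMIO = 1 };

/* Hypothetical backend: pick the reset path from the use_mmio flag. */
enum reset_path backend_reset(const struct reset_input *in)
{
	assert(in != NULL);
	return in->use_mmio ? RESET_PATH_MMIO : RESET_PATH_FIRMWARE;
}

/* Build a fully initialized request for the MMIO path. */
struct reset_input make_mmio_request(uint32_t queue_type, uint32_t me_id,
				     uint32_t pipe_id, uint32_t queue_id,
				     uint32_t vmid)
{
	struct reset_input in;

	memset(&in, 0, sizeof(in));	/* no field is left indeterminate */
	in.use_mmio = true;
	in.queue_type = queue_type;
	in.me_id = me_id;
	in.pipe_id = pipe_id;
	in.queue_id = queue_id;
	in.vmid = vmid;
	return in;
}
```

A caller would build the request with make_mmio_request() and hand it to
backend_reset(); clearing use_mmio on the same request selects the
firmware path instead.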



* [PATCH 31/34] drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (29 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 30/34] drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 32/34] drm/amdgpu/gfx11: add a mutex for the gfx semaphore Alex Deucher
                   ` (4 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

Need to enter safe mode before touching GC MMIO.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 348bc1b1784a..9bd42533ce61 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4763,6 +4763,8 @@ static int gfx_v11_0_soft_reset(void *handle)
 	int r, i, j, k;
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
 
+	gfx_v11_0_set_safe_mode(adev, 0);
+
 	tmp = RREG32_SOC15(GC, 0, regCP_INT_CNTL);
 	tmp = REG_SET_FIELD(tmp, CP_INT_CNTL, CMP_BUSY_INT_ENABLE, 0);
 	tmp = REG_SET_FIELD(tmp, CP_INT_CNTL, CNTX_BUSY_INT_ENABLE, 0);
@@ -4770,8 +4772,6 @@ static int gfx_v11_0_soft_reset(void *handle)
 	tmp = REG_SET_FIELD(tmp, CP_INT_CNTL, GFX_IDLE_INT_ENABLE, 0);
 	WREG32_SOC15(GC, 0, regCP_INT_CNTL, tmp);
 
-	gfx_v11_0_set_safe_mode(adev, 0);
-
 	mutex_lock(&adev->srbm_mutex);
 	for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
 		for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) {
-- 
2.45.2
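
The ordering this patch enforces (enter safe mode, then read, modify and
write back CP_INT_CNTL) can be sketched in standalone C as follows.  The
bit positions, struct demo_dev and both helpers are assumptions made up
for the sketch; the real masks and shifts come from the GC register
headers, and demo_set_field() only imitates what REG_SET_FIELD() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define DEMO_CMP_BUSY_SHIFT	18
#define DEMO_CMP_BUSY_MASK	(1u << DEMO_CMP_BUSY_SHIFT)
#define DEMO_CNTX_BUSY_SHIFT	19
#define DEMO_CNTX_BUSY_MASK	(1u << DEMO_CNTX_BUSY_SHIFT)
#define DEMO_CNTX_EMPTY_SHIFT	20
#define DEMO_CNTX_EMPTY_MASK	(1u << DEMO_CNTX_EMPTY_SHIFT)

/* Hypothetical device model: one register plus a safe-mode flag. */
struct demo_dev {
	bool     safe_mode;
	uint32_t cp_int_cntl;
};

/* Clear the masked field, then OR in the new value at the field's shift. */
uint32_t demo_set_field(uint32_t reg, uint32_t mask, unsigned int shift,
			uint32_t val)
{
	assert(shift < 32);
	return (reg & ~mask) | ((val << shift) & mask);
}

/* Enter safe mode first, then do a single read-modify-write. */
void demo_mask_busy_interrupts(struct demo_dev *dev)
{
	uint32_t tmp;

	dev->safe_mode = true;		/* must happen before the access */
	tmp = dev->cp_int_cntl;		/* read */
	tmp = demo_set_field(tmp, DEMO_CMP_BUSY_MASK, DEMO_CMP_BUSY_SHIFT, 0);
	tmp = demo_set_field(tmp, DEMO_CNTX_BUSY_MASK, DEMO_CNTX_BUSY_SHIFT, 0);
	tmp = demo_set_field(tmp, DEMO_CNTX_EMPTY_MASK, DEMO_CNTX_EMPTY_SHIFT, 0);
	dev->cp_int_cntl = tmp;		/* write back once */
}
```

Only the three modelled enable bits change; every other bit of the
register value is carried through untouched.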



* [PATCH 32/34] drm/amdgpu/gfx11: add a mutex for the gfx semaphore
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (30 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 31/34] drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 33/34] drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex() Alex Deucher
                   ` (3 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

This will be used in more places in the future so
add a mutex.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |  2 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 10 +++++++---
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index bcacf2e35eba..dcffd57da8db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4055,6 +4055,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	mutex_init(&adev->notifier_lock);
 	mutex_init(&adev->pm.stable_pstate_ctx_lock);
 	mutex_init(&adev->benchmark_mutex);
+	mutex_init(&adev->gfx.reset_sem_mutex);
 
 	amdgpu_device_init_apu_flags(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 6fe77e483bb7..17b945b545b4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -444,6 +444,8 @@ struct amdgpu_gfx {
 	uint32_t			*ip_dump_core;
 	uint32_t			*ip_dump_compute_queues;
 	uint32_t			*ip_dump_gfx_queues;
+
+	struct mutex			reset_sem_mutex;
 };
 
 struct amdgpu_gfx_ras_reg_entry {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 9bd42533ce61..37af298d0918 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4725,10 +4725,12 @@ static int gfx_v11_0_wait_for_idle(void *handle)
 }
 
 static int gfx_v11_0_request_gfx_index_mutex(struct amdgpu_device *adev,
-					     int req)
+					     bool req)
 {
 	u32 i, tmp, val;
 
+	if (req)
+		mutex_lock(&adev->gfx.reset_sem_mutex);
 	for (i = 0; i < adev->usec_timeout; i++) {
 		/* Request with MeId=2, PipeId=0 */
 		tmp = REG_SET_FIELD(0, CP_GFX_INDEX_MUTEX, REQUEST, req);
@@ -4749,6 +4751,8 @@ static int gfx_v11_0_request_gfx_index_mutex(struct amdgpu_device *adev,
 		}
 		udelay(1);
 	}
+	if (!req)
+		mutex_unlock(&adev->gfx.reset_sem_mutex);
 
 	if (i >= adev->usec_timeout)
 		return -EINVAL;
@@ -4796,7 +4800,7 @@ static int gfx_v11_0_soft_reset(void *handle)
 	mutex_unlock(&adev->srbm_mutex);
 
 	/* Try to acquire the gfx mutex before access to CP_VMID_RESET */
-	r = gfx_v11_0_request_gfx_index_mutex(adev, 1);
+	r = gfx_v11_0_request_gfx_index_mutex(adev, true);
 	if (r) {
 		DRM_ERROR("Failed to acquire the gfx mutex during soft reset\n");
 		return r;
@@ -4811,7 +4815,7 @@ static int gfx_v11_0_soft_reset(void *handle)
 	RREG32_SOC15(GC, 0, regCP_VMID_RESET);
 
 	/* release the gfx mutex */
-	r = gfx_v11_0_request_gfx_index_mutex(adev, 0);
+	r = gfx_v11_0_request_gfx_index_mutex(adev, false);
 	if (r) {
 		DRM_ERROR("Failed to release the gfx mutex during soft reset\n");
 		return r;
-- 
2.45.2
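
The pairing introduced here (take the software lock before requesting
the hardware semaphore, drop it after releasing the semaphore) can be
modelled in single-threaded C as below.  Everything in the sketch is
hypothetical: struct demo_gfx, the depth counter standing in for
reset_sem_mutex, and the fixed poll count.  It also drops the lock again
when an acquire times out, which is one possible variant and not
something this patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the lock plus the hardware semaphore. */
struct demo_gfx {
	int  lock_depth;	/* stand-in for reset_sem_mutex: 1 while held */
	bool hw_grants;		/* does the fake hardware answer requests? */
	bool hw_owned;		/* current state of the fake semaphore */
};

static void demo_lock(struct demo_gfx *g)
{
	assert(g->lock_depth == 0);
	g->lock_depth = 1;
}

static void demo_unlock(struct demo_gfx *g)
{
	assert(g->lock_depth == 1);
	g->lock_depth = 0;
}

/* req == true acquires, req == false releases, as in the patch. */
int demo_request_index_mutex(struct demo_gfx *g, bool req)
{
	int i, timeout = 100;

	if (req)
		demo_lock(g);
	for (i = 0; i < timeout; i++) {
		if (g->hw_grants) {
			g->hw_owned = req;
			break;
		}
	}
	if (!req)
		demo_unlock(g);

	if (i >= timeout) {
		/* Variant (assumption): do not keep the lock on a failed acquire. */
		if (req)
			demo_unlock(g);
		return -EINVAL;
	}
	return 0;
}
```

With this shape, a caller that bails out on a failed acquire does not
leave the software lock held behind it.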



* [PATCH 33/34] drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (31 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 32/34] drm/amdgpu/gfx11: add a mutex for the gfx semaphore Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 14:07 ` [PATCH 34/34] drm/amdgpu/mes11: implement mmio queue reset for gfx11 Alex Deucher
                   ` (2 subsequent siblings)
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher

It will be used by the queue reset code.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 4 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 37af298d0918..20be1b9ecdc3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4724,8 +4724,8 @@ static int gfx_v11_0_wait_for_idle(void *handle)
 	return -ETIMEDOUT;
 }
 
-static int gfx_v11_0_request_gfx_index_mutex(struct amdgpu_device *adev,
-					     bool req)
+int gfx_v11_0_request_gfx_index_mutex(struct amdgpu_device *adev,
+				      bool req)
 {
 	u32 i, tmp, val;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h
index 10cfc29c27c9..157a5c812259 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h
@@ -26,4 +26,7 @@
 
 extern const struct amdgpu_ip_block_version gfx_v11_0_ip_block;
 
+int gfx_v11_0_request_gfx_index_mutex(struct amdgpu_device *adev,
+				      bool req);
+
 #endif
-- 
2.45.2



* [PATCH 34/34] drm/amdgpu/mes11: implement mmio queue reset for gfx11
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (32 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 33/34] drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex() Alex Deucher
@ 2024-07-18 14:07 ` Alex Deucher
  2024-07-18 16:29 ` [PATCH 00/34] GC per queue reset Friedrich Vock
  2024-07-18 16:54 ` Alex Deucher
  35 siblings, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 14:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Jiadong Zhu, Alex Deucher

From: Jiadong Zhu <Jiadong.Zhu@amd.com>

Implement MMIO queue reset for graphics and compute queues.

v2: use amdgpu_gfx_rlc funcs to enter/exit safe mode.
v3: use gfx_v11_0_request_gfx_index_mutex()

Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 78 ++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index bf8fb6a1becb..fb617f6cef13 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -26,6 +26,7 @@
 #include "amdgpu.h"
 #include "soc15_common.h"
 #include "soc21.h"
+#include "gfx_v11_0.h"
 #include "gc/gc_11_0_0_offset.h"
 #include "gc/gc_11_0_0_sh_mask.h"
 #include "gc/gc_11_0_0_default.h"
@@ -350,9 +351,81 @@ static int mes_v11_0_remove_hw_queue(struct amdgpu_mes *mes,
 			offsetof(union MESAPI__REMOVE_QUEUE, api_status));
 }
 
+static int mes_v11_0_reset_queue_mmio(struct amdgpu_mes *mes, uint32_t queue_type,
+				      uint32_t me_id, uint32_t pipe_id,
+				      uint32_t queue_id, uint32_t vmid)
+{
+	struct amdgpu_device *adev = mes->adev;
+	uint32_t value;
+	int i, r = 0;
+
+	amdgpu_gfx_rlc_enter_safe_mode(adev, 0);
+
+	if (queue_type == AMDGPU_RING_TYPE_GFX) {
+		dev_info(adev->dev, "reset gfx queue (%d:%d:%d: vmid:%d)\n",
+			 me_id, pipe_id, queue_id, vmid);
+
+		gfx_v11_0_request_gfx_index_mutex(adev, true);
+		/* all se allow writes */
+		WREG32_SOC15(GC, 0, regGRBM_GFX_INDEX,
+			     (uint32_t)(0x1 << GRBM_GFX_INDEX__SE_BROADCAST_WRITES__SHIFT));
+		value = REG_SET_FIELD(0, CP_VMID_RESET, RESET_REQUEST, 1 << vmid);
+		if (pipe_id == 0)
+			value = REG_SET_FIELD(value, CP_VMID_RESET, PIPE0_QUEUES, 1 << queue_id);
+		else
+			value = REG_SET_FIELD(value, CP_VMID_RESET, PIPE1_QUEUES, 1 << queue_id);
+		WREG32_SOC15(GC, 0, regCP_VMID_RESET, value);
+		gfx_v11_0_request_gfx_index_mutex(adev, false);
+
+		mutex_lock(&adev->srbm_mutex);
+		soc21_grbm_select(adev, me_id, pipe_id, queue_id, 0);
+		/* wait until the dequeue takes effect */
+		for (i = 0; i < adev->usec_timeout; i++) {
+			if (!(RREG32_SOC15(GC, 0, regCP_GFX_HQD_ACTIVE) & 1))
+				break;
+			udelay(1);
+		}
+		if (i >= adev->usec_timeout) {
+			dev_err(adev->dev, "failed to wait on gfx hqd deactive\n");
+			r = -ETIMEDOUT;
+		}
+
+		soc21_grbm_select(adev, 0, 0, 0, 0);
+		mutex_unlock(&adev->srbm_mutex);
+	} else if (queue_type == AMDGPU_RING_TYPE_COMPUTE) {
+		dev_info(adev->dev, "reset compute queue (%d:%d:%d)\n",
+			 me_id, pipe_id, queue_id);
+		mutex_lock(&adev->srbm_mutex);
+		soc21_grbm_select(adev, me_id, pipe_id, queue_id, 0);
+		WREG32_SOC15(GC, 0, regCP_HQD_DEQUEUE_REQUEST, 0x2);
+		WREG32_SOC15(GC, 0, regSPI_COMPUTE_QUEUE_RESET, 0x1);
+
+		/* wait until the dequeue takes effect */
+		for (i = 0; i < adev->usec_timeout; i++) {
+			if (!(RREG32_SOC15(GC, 0, regCP_HQD_ACTIVE) & 1))
+				break;
+			udelay(1);
+		}
+		if (i >= adev->usec_timeout) {
+			dev_err(adev->dev, "failed to wait on hqd deactive\n");
+			r = -ETIMEDOUT;
+		}
+		soc21_grbm_select(adev, 0, 0, 0, 0);
+		mutex_unlock(&adev->srbm_mutex);
+	}
+
+	amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
+	return r;
+}
+
 static int mes_v11_0_reset_hw_queue(struct amdgpu_mes *mes,
 				    struct mes_reset_queue_input *input)
 {
 	union MESAPI__RESET mes_reset_queue_pkt;
 
+	if (input->use_mmio)
+		return mes_v11_0_reset_queue_mmio(mes, input->queue_type,
+						  input->me_id, input->pipe_id,
+						  input->queue_id, input->vmid);
+
 	memset(&mes_reset_queue_pkt, 0, sizeof(mes_reset_queue_pkt));
@@ -612,6 +685,11 @@ static int mes_v11_0_reset_legacy_queue(struct amdgpu_mes *mes,
 {
 	union MESAPI__RESET mes_reset_queue_pkt;
 
+	if (input->use_mmio)
+		return mes_v11_0_reset_queue_mmio(mes, input->queue_type,
+						  input->me_id, input->pipe_id,
+						  input->queue_id, input->vmid);
+
 	memset(&mes_reset_queue_pkt, 0, sizeof(mes_reset_queue_pkt));
 
 	mes_reset_queue_pkt.header.type = MES_API_TYPE_SCHEDULER;
-- 
2.45.2
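
Both wait loops in mes_v11_0_reset_queue_mmio() are bounded polls: read
the HQD active bit, stop as soon as it clears, and report a timeout
otherwise.  A minimal userspace sketch of that pattern follows.  The
read callback and the fake register are hypothetical; they stand in for
RREG32_SOC15(), and max_polls stands in for adev->usec_timeout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register accessor, replacing RREG32_SOC15() in the sketch. */
typedef uint32_t (*demo_read_fn)(void *ctx);

/* Poll bit 0 until it clears or max_polls reads have been made. */
int demo_wait_inactive(demo_read_fn read_active, void *ctx,
		       unsigned int max_polls)
{
	unsigned int i;

	assert(read_active != NULL);
	for (i = 0; i < max_polls; i++) {
		if (!(read_active(ctx) & 1))
			return 0;
		/* a driver would udelay(1) between reads */
	}
	return -ETIMEDOUT;
}

/* Fake HQD_ACTIVE register: reports active for *ctx more reads. */
uint32_t demo_fake_hqd_active(void *ctx)
{
	unsigned int *remaining = ctx;

	if (*remaining == 0)
		return 0;
	(*remaining)--;
	return 1;
}
```

Returning from inside the loop removes the need for the separate
"i >= timeout" check, at the cost of differing from the loop shape used
in the patch.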



* Re: [PATCH 00/34] GC per queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (33 preceding siblings ...)
  2024-07-18 14:07 ` [PATCH 34/34] drm/amdgpu/mes11: implement mmio queue reset for gfx11 Alex Deucher
@ 2024-07-18 16:29 ` Friedrich Vock
  2024-07-19 13:39   ` Alex Deucher
  2024-07-18 16:54 ` Alex Deucher
  35 siblings, 1 reply; 42+ messages in thread
From: Friedrich Vock @ 2024-07-18 16:29 UTC (permalink / raw)
  To: Alex Deucher, amd-gfx

Hi,

On 18.07.24 16:06, Alex Deucher wrote:
> This adds preliminary support for GC per queue reset.  In this
> case, only the jobs currently in the queue are lost.  If this
> fails, we fall back to a full adapter reset.

First of all, thank you so much for working on this! It's great to
finally see progress in making GPU resets better.

I've just taken this patchset (together with your other
patchsets[1][2][3]) for a quick spin on my
Navi21 with the GPU reset tests[4] I had written a while ago - the
current patchset sadly seems to have some regressions WRT recovery there.

I ran the tests under my Plasma Wayland session once - this triggered a
list double-add in drm_sched_stop (calltrace follows):

? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? exc_invalid_op (arch/x86/kernel/traps.c:266)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
process_one_work (kernel/workqueue.c:2633)
worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
? __pfx_worker_thread (kernel/workqueue.c:2733)
kthread (kernel/kthread.c:388)
? __pfx_kthread (kernel/kthread.c:341)
ret_from_fork (arch/x86/kernel/process.c:147)
? __pfx_kthread (kernel/kthread.c:341)
ret_from_fork_asm (arch/x86/entry/entry_64.S:251)

When running the tests without a desktop environment active, the
double-add disappeared, but the GPU reset still didn't go well - the TTY
remained frozen and the kernel log contained a few messages like:

[drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out

which I guess means at least the display subsystem is hung.

Hope this info is enough to repro/investigate.

Thanks,
Friedrich

[1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@amd.com/T/#t
[2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@amd.com/T/#t
[3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
[4] https://gitlab.steamos.cloud/holo/HangTestSuite

>
> Alex Deucher (19):
>    drm/amdgpu/mes: add API for legacy queue reset
>    drm/amdgpu/mes11: add API for legacy queue reset
>    drm/amdgpu/mes12: add API for legacy queue reset
>    drm/amdgpu/mes: add API for user queue reset
>    drm/amdgpu/mes11: add API for user queue reset
>    drm/amdgpu/mes12: add API for user queue reset
>    drm/amdgpu: add new ring reset callback
>    drm/amdgpu: add per ring reset support (v2)
>    drm/amdgpu/gfx11: add ring reset callbacks
>    drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>    drm/amdgpu/gfx10: add ring reset callbacks
>    drm/amdgpu/gfx10: rework reset sequence
>    drm/amdgpu/gfx9: add ring reset callback
>    drm/amdgpu/gfx9.4.3: add ring reset callback
>    drm/amdgpu/gfx12: add ring reset callbacks
>    drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>    drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>    drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>    drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>
> Jiadong Zhu (13):
>    drm/amdgpu/gfx11: wait for reset done before remap
>    drm/amdgpu/gfx10: remap queue after reset successfully
>    drm/amdgpu/gfx10: wait for reset done before remap
>    drm/amdgpu/gfx9: remap queue after reset successfully
>    drm/amdgpu/gfx9: wait for reset done before remap
>    drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>    drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>    drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>    drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>    drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>    drm/amdgpu/mes: modify mes api for mmio queue reset
>    drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>    drm/amdgpu/mes11: implement mmio queue reset for gfx11
>
> Prike Liang (2):
>    drm/amdgpu: increase the reset counter for the queue reset
>    drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>   14 files changed, 930 insertions(+), 32 deletions(-)
>


* Re: [PATCH 00/34] GC per queue reset
  2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
                   ` (34 preceding siblings ...)
  2024-07-18 16:29 ` [PATCH 00/34] GC per queue reset Friedrich Vock
@ 2024-07-18 16:54 ` Alex Deucher
  2024-07-23  8:50   ` Christopher Snowhill
  35 siblings, 1 reply; 42+ messages in thread
From: Alex Deucher @ 2024-07-18 16:54 UTC (permalink / raw)
  To: Alex Deucher; +Cc: amd-gfx

On Thu, Jul 18, 2024 at 10:15 AM Alex Deucher <alexander.deucher@amd.com> wrote:
>
> This adds preliminary support for GC per queue reset.  In this
> case, only the jobs currently in the queue are lost.  If this
> fails, we fall back to a full adapter reset.

Also available here via git:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset

Alex

>
> Alex Deucher (19):
>   drm/amdgpu/mes: add API for legacy queue reset
>   drm/amdgpu/mes11: add API for legacy queue reset
>   drm/amdgpu/mes12: add API for legacy queue reset
>   drm/amdgpu/mes: add API for user queue reset
>   drm/amdgpu/mes11: add API for user queue reset
>   drm/amdgpu/mes12: add API for user queue reset
>   drm/amdgpu: add new ring reset callback
>   drm/amdgpu: add per ring reset support (v2)
>   drm/amdgpu/gfx11: add ring reset callbacks
>   drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>   drm/amdgpu/gfx10: add ring reset callbacks
>   drm/amdgpu/gfx10: rework reset sequence
>   drm/amdgpu/gfx9: add ring reset callback
>   drm/amdgpu/gfx9.4.3: add ring reset callback
>   drm/amdgpu/gfx12: add ring reset callbacks
>   drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>   drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>   drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>   drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>
> Jiadong Zhu (13):
>   drm/amdgpu/gfx11: wait for reset done before remap
>   drm/amdgpu/gfx10: remap queue after reset successfully
>   drm/amdgpu/gfx10: wait for reset done before remap
>   drm/amdgpu/gfx9: remap queue after reset successfully
>   drm/amdgpu/gfx9: wait for reset done before remap
>   drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>   drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>   drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>   drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>   drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>   drm/amdgpu/mes: modify mes api for mmio queue reset
>   drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>   drm/amdgpu/mes11: implement mmio queue reset for gfx11
>
> Prike Liang (2):
>   drm/amdgpu: increase the reset counter for the queue reset
>   drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>  14 files changed, 930 insertions(+), 32 deletions(-)
>
> --
> 2.45.2
>


* Re: [PATCH 00/34] GC per queue reset
  2024-07-18 16:29 ` [PATCH 00/34] GC per queue reset Friedrich Vock
@ 2024-07-19 13:39   ` Alex Deucher
  2024-07-19 22:52     ` Alex Deucher
  2024-07-24  9:20     ` Zhu, Jiadong
  0 siblings, 2 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-19 13:39 UTC (permalink / raw)
  To: Friedrich Vock; +Cc: Alex Deucher, amd-gfx

On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock@gmx.de> wrote:
>
> Hi,
>
> On 18.07.24 16:06, Alex Deucher wrote:
> > This adds preliminary support for GC per queue reset.  In this
> > case, only the jobs currently in the queue are lost.  If this
> > fails, we fall back to a full adapter reset.
>
> First of all, thank you so much for working on this! It's great to
> finally see progress in making GPU resets better.
>
> I've just taken this patchset (together with your other
> patchsets[1][2][3]) for a quick spin on my
> Navi21 with the GPU reset tests[4] I had written a while ago - the
> current patchset sadly seems to have some regressions WRT recovery there.
>
> I ran the tests under my Plasma Wayland session once - this triggered a
> list double-add in drm_sched_stop (calltrace follows):

I think this should fix the double add:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 7107c4d3a3b6..555d3b671bdb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                                drm_sched_start(&ring->sched, true);
                        goto exit;
                }
+               if (amdgpu_ring_sched_ready(ring))
+                       drm_sched_start(&ring->sched, true);
        }

        if (amdgpu_device_should_recover_gpu(ring->adev)) {


>
> ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
> ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> ? exc_invalid_op (arch/x86/kernel/traps.c:266)
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
> amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
> amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
> drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
> process_one_work (kernel/workqueue.c:2633)
> worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
> ? __pfx_worker_thread (kernel/workqueue.c:2733)
> kthread (kernel/kthread.c:388)
> ? __pfx_kthread (kernel/kthread.c:341)
> ret_from_fork (arch/x86/kernel/process.c:147)
> ? __pfx_kthread (kernel/kthread.c:341)
> ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
>
> When running the tests without a desktop environment active, the
> double-add disappeared, but the GPU reset still didn't go well - the TTY
> remained frozen and the kernel log contained a few messages like:
>
> [drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out

I don't think the display hardware is hung, I think it's a fence
signalling issue after the reset.  We are investigating some
limitations we are seeing in the handling of fences.

>
> which I guess means at least the display subsystem is hung.
>
> Hope this info is enough to repro/investigate.

Thanks for testing!

Alex

>
> Thanks,
> Friedrich
>
> [1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@amd.com/T/#t
> [2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@amd.com/T/#t
> [3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
> [4] https://gitlab.steamos.cloud/holo/HangTestSuite
>
>
> >
> > Alex Deucher (19):
> >    drm/amdgpu/mes: add API for legacy queue reset
> >    drm/amdgpu/mes11: add API for legacy queue reset
> >    drm/amdgpu/mes12: add API for legacy queue reset
> >    drm/amdgpu/mes: add API for user queue reset
> >    drm/amdgpu/mes11: add API for user queue reset
> >    drm/amdgpu/mes12: add API for user queue reset
> >    drm/amdgpu: add new ring reset callback
> >    drm/amdgpu: add per ring reset support (v2)
> >    drm/amdgpu/gfx11: add ring reset callbacks
> >    drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
> >    drm/amdgpu/gfx10: add ring reset callbacks
> >    drm/amdgpu/gfx10: rework reset sequence
> >    drm/amdgpu/gfx9: add ring reset callback
> >    drm/amdgpu/gfx9.4.3: add ring reset callback
> >    drm/amdgpu/gfx12: add ring reset callbacks
> >    drm/amdgpu/gfx12: fallback to driver reset compute queue directly
> >    drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
> >    drm/amdgpu/gfx11: add a mutex for the gfx semaphore
> >    drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
> >
> > Jiadong Zhu (13):
> >    drm/amdgpu/gfx11: wait for reset done before remap
> >    drm/amdgpu/gfx10: remap queue after reset successfully
> >    drm/amdgpu/gfx10: wait for reset done before remap
> >    drm/amdgpu/gfx9: remap queue after reset successfully
> >    drm/amdgpu/gfx9: wait for reset done before remap
> >    drm/amdgpu/gfx9.4.3: remap queue after reset successfully
> >    drm/amdgpu/gfx_9.4.3: wait for reset done before remap
> >    drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
> >    drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
> >    drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
> >    drm/amdgpu/mes: modify mes api for mmio queue reset
> >    drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
> >    drm/amdgpu/mes11: implement mmio queue reset for gfx11
> >
> > Prike Liang (2):
> >    drm/amdgpu: increase the reset counter for the queue reset
> >    drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
> >
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
> >   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
> >   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
> >   drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
> >   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
> >   drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
> >   14 files changed, 930 insertions(+), 32 deletions(-)
> >
>
>
>


* Re: [PATCH 00/34] GC per queue reset
  2024-07-19 13:39   ` Alex Deucher
@ 2024-07-19 22:52     ` Alex Deucher
  2024-07-24  9:20     ` Zhu, Jiadong
  1 sibling, 0 replies; 42+ messages in thread
From: Alex Deucher @ 2024-07-19 22:52 UTC (permalink / raw)
  To: Friedrich Vock; +Cc: Alex Deucher, amd-gfx

On Fri, Jul 19, 2024 at 9:39 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock@gmx.de> wrote:
> >
> > Hi,
> >
> > On 18.07.24 16:06, Alex Deucher wrote:
> > > This adds preliminary support for GC per queue reset.  In this
> > > case, only the jobs currently in the queue are lost.  If this
> > > fails, we fall back to a full adapter reset.
> >
> > First of all, thank you so much for working on this! It's great to
> > finally see progress in making GPU resets better.
> >
> > I've just taken this patchset (together with your other
> > patchsets[1][2][3]) for a quick spin on my
> > Navi21 with the GPU reset tests[4] I had written a while ago - the
> > current patchset sadly seems to have some regressions WRT recovery there.
> >
> > I ran the tests under my Plasma Wayland session once - this triggered a
> > list double-add in drm_sched_stop (calltrace follows):
>
> I think this should fix the double add:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 7107c4d3a3b6..555d3b671bdb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
>                                 drm_sched_start(&ring->sched, true);
>                         goto exit;
>                 }
> +               if (amdgpu_ring_sched_ready(ring))
> +                       drm_sched_start(&ring->sched, true);
>         }
>
>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>
>

FWIW, I think I've fixed a lot of this up.  There are a lot of patches
to resend, so for now just grab the latest from:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset
This updated patch is the important one:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/ddad62a355b7650f05862c88c2f7979e938071f2

Alex
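
As a rough illustration of what the diff above appears to address: if the
per-ring reset fails while the scheduler is still stopped, the full-adapter
fallback stops it a second time, which would match the double add reported
in drm_sched_stop. The sketch below is a standalone userspace model under
that assumption, not driver code. Every name in it (model_sched,
model_job_timedout, and so on) is hypothetical and does not exist in the
kernel, and the real amdgpu_job_timedout() does considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a scheduler; not the kernel's drm_gpu_scheduler. */
struct model_sched {
	bool stopped;      /* true between a stop and the matching start */
	int  double_stops; /* stop calls made while already stopped */
};

enum model_outcome { MODEL_QUEUE_RESET, MODEL_FULL_RESET };

static void model_sched_stop(struct model_sched *s)
{
	assert(s != NULL);
	/* Stopping twice in a row stands in for the list double-add. */
	if (s->stopped)
		s->double_stops++;
	s->stopped = true;
}

static void model_sched_start(struct model_sched *s)
{
	assert(s != NULL);
	s->stopped = false;
}

/* Stand-in for a full adapter reset: it stops and restarts the scheduler. */
static void model_full_reset(struct model_sched *s)
{
	model_sched_stop(s);
	model_sched_start(s);
}

/*
 * Timeout handler model: try the per-queue reset first, then fall back.
 * restart_on_failure mirrors the two lines added by the diff (assumption).
 */
static enum model_outcome model_job_timedout(struct model_sched *s,
					     bool queue_reset_ok,
					     bool restart_on_failure)
{
	model_sched_stop(s);
	if (queue_reset_ok) {
		/* Only the jobs in this queue are lost. */
		model_sched_start(s);
		return MODEL_QUEUE_RESET;
	}
	if (restart_on_failure)
		model_sched_start(s);
	model_full_reset(s);
	return MODEL_FULL_RESET;
}
```

Calling model_job_timedout() with queue_reset_ok = false and
restart_on_failure = false leaves double_stops at 1, while passing
restart_on_failure = true keeps it at 0, which is the behaviour the added
drm_sched_start() call seems intended to restore.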

> >
> > ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
> > ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? exc_invalid_op (arch/x86/kernel/traps.c:266)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
> > amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
> > amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
> > drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
> > process_one_work (kernel/workqueue.c:2633)
> > worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
> > ? __pfx_worker_thread (kernel/workqueue.c:2733)
> > kthread (kernel/kthread.c:388)
> > ? __pfx_kthread (kernel/kthread.c:341)
> > ret_from_fork (arch/x86/kernel/process.c:147)
> > ? __pfx_kthread (kernel/kthread.c:341)
> > ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
> >
> > When running the tests without a desktop environment active, the
> > double-add disappeared, but the GPU reset still didn't go well - the TTY
> > remained frozen and the kernel log contained a few messages like:
> >
> > [drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out
>
> I don't think the display hardware is hung, I think it's a fence
> signalling issue after the reset.  We are investigating some
> limitations we are seeing in the handling of fences.
>
> >
> > which I guess means at least the display subsystem is hung.
> >
> > Hope this info is enough to repro/investigate.
>
> Thanks for testing!
>
> Alex
>
> >
> > Thanks,
> > Friedrich
> >
> > [1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@amd.com/T/#t
> > [2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@amd.com/T/#t
> > [3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
> > [4] https://gitlab.steamos.cloud/holo/HangTestSuite
> >
> >
> > >
> > > Alex Deucher (19):
> > >    drm/amdgpu/mes: add API for legacy queue reset
> > >    drm/amdgpu/mes11: add API for legacy queue reset
> > >    drm/amdgpu/mes12: add API for legacy queue reset
> > >    drm/amdgpu/mes: add API for user queue reset
> > >    drm/amdgpu/mes11: add API for user queue reset
> > >    drm/amdgpu/mes12: add API for user queue reset
> > >    drm/amdgpu: add new ring reset callback
> > >    drm/amdgpu: add per ring reset support (v2)
> > >    drm/amdgpu/gfx11: add ring reset callbacks
> > >    drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
> > >    drm/amdgpu/gfx10: add ring reset callbacks
> > >    drm/amdgpu/gfx10: rework reset sequence
> > >    drm/amdgpu/gfx9: add ring reset callback
> > >    drm/amdgpu/gfx9.4.3: add ring reset callback
> > >    drm/amdgpu/gfx12: add ring reset callbacks
> > >    drm/amdgpu/gfx12: fallback to driver reset compute queue directly
> > >    drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
> > >    drm/amdgpu/gfx11: add a mutex for the gfx semaphore
> > >    drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
> > >
> > > Jiadong Zhu (13):
> > >    drm/amdgpu/gfx11: wait for reset done before remap
> > >    drm/amdgpu/gfx10: remap queue after reset successfully
> > >    drm/amdgpu/gfx10: wait for reset done before remap
> > >    drm/amdgpu/gfx9: remap queue after reset successfully
> > >    drm/amdgpu/gfx9: wait for reset done before remap
> > >    drm/amdgpu/gfx9.4.3: remap queue after reset successfully
> > >    drm/amdgpu/gfx_9.4.3: wait for reset done before remap
> > >    drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
> > >    drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
> > >    drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
> > >    drm/amdgpu/mes: modify mes api for mmio queue reset
> > >    drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
> > >    drm/amdgpu/mes11: implement mmio queue reset for gfx11
> > >
> > > Prike Liang (2):
> > >    drm/amdgpu: increase the reset counter for the queue reset
> > >    drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
> > >
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
> > >   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
> > >   drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
> > >   14 files changed, 930 insertions(+), 32 deletions(-)
> > >
> >
> >
> >


* Re: [PATCH 00/34] GC per queue reset
  2024-07-18 16:54 ` Alex Deucher
@ 2024-07-23  8:50   ` Christopher Snowhill
  0 siblings, 0 replies; 42+ messages in thread
From: Christopher Snowhill @ 2024-07-23  8:50 UTC (permalink / raw)
  To: Alex Deucher; +Cc: amd-gfx

Alex Deucher <alexdeucher@gmail.com> writes:

> On Thu, Jul 18, 2024 at 10:15 AM Alex Deucher <alexander.deucher@amd.com> wrote:
>>
>> This adds preliminary support for GC per queue reset.  In this
>> case, only the jobs currently in the queue are lost.  If this
>> fails, we fall back to a full adapter reset.
>
> Also available here via git:
> https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next-queue-reset

Just tested this, after encountering the double-add crash trying to
reset after a GPU hang. It doesn't seem to gracefully recover from this
particular GPU hang, but at least now it resets properly. Still not
going to attempt to run it against KDE / Plasma 6.1.3 on Arch, as that
loves to hang if there's any Xwayland involved in the GPU reset event.

However, under labwc-git with my own PR applied to it, it recovers okay,
though Xwayland eventually crashes and is restarted by labwc. Here's a
dmesg log excerpt of the reset and recovery event:

[  189.830630] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=52410, emitted seq=52412
[  189.830642] amdgpu 0000:0a:00.0: amdgpu: Process information: process Stray-Win64-Shi pid 11560 thread vkd3d_queue pid 11719
[  190.099191] amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
[  190.457702] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State
[  190.459418] amdgpu 0000:0a:00.0: amdgpu: Dumping IP State Completed
[  190.459420] amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
[  190.459423] amdgpu 0000:0a:00.0: amdgpu: GPU mode1 reset
[  190.459483] amdgpu 0000:0a:00.0: amdgpu: GPU smu mode1 reset
[  190.967464] amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
[  190.967824] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[  190.967912] [drm] VRAM is lost due to GPU reset!
[  190.967914] amdgpu 0000:0a:00.0: amdgpu: PSP is resuming...
[  191.042264] amdgpu 0000:0a:00.0: amdgpu: reserve 0xa00000 from 0x82fd000000 for PSP TMR
[  191.143003] amdgpu 0000:0a:00.0: amdgpu: RAS: optional ras ta ucode is not available
[  191.156566] amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  191.156572] amdgpu 0000:0a:00.0: amdgpu: SMU is resuming...
[  191.156576] amdgpu 0000:0a:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413e00 (65.62.0)
[  191.156580] amdgpu 0000:0a:00.0: amdgpu: SMU driver if version not matched
[  191.156609] amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
[  191.215750] amdgpu 0000:0a:00.0: amdgpu: SMU is resumed successfully!
[  191.217023] [drm] DMUB hardware initialized: version=0x02020020
[  191.530005] [drm] kiq ring mec 2 pipe 1 q 0
[  191.532863] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  191.532866] amdgpu 0000:0a:00.0: amdgpu: ring gfx_0.1.0 uses VM inv eng 1 on hub 0
[  191.532867] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 4 on hub 0
[  191.532869] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 5 on hub 0
[  191.532870] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[  191.532871] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[  191.532872] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[  191.532874] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[  191.532875] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[  191.532876] amdgpu 0000:0a:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[  191.532878] amdgpu 0000:0a:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 12 on hub 0
[  191.532879] amdgpu 0000:0a:00.0: amdgpu: ring sdma0 uses VM inv eng 13 on hub 0
[  191.532880] amdgpu 0000:0a:00.0: amdgpu: ring sdma1 uses VM inv eng 14 on hub 0
[  191.532881] amdgpu 0000:0a:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[  191.532883] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[  191.532884] amdgpu 0000:0a:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[  191.532885] amdgpu 0000:0a:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[  191.536522] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow start
[  191.555443] amdgpu 0000:0a:00.0: amdgpu: recover vram bo from shadow done
[  191.555471] amdgpu 0000:0a:00.0: amdgpu: GPU reset(2) succeeded!
[  191.555663] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

Yes, I can reliably hang my gfx ring if I run Stray with the -dx12 switch
applied. The hang happens in-game, though, not on the title screen.


> Alex
>
>>
>> Alex Deucher (19):
>>   drm/amdgpu/mes: add API for legacy queue reset
>>   drm/amdgpu/mes11: add API for legacy queue reset
>>   drm/amdgpu/mes12: add API for legacy queue reset
>>   drm/amdgpu/mes: add API for user queue reset
>>   drm/amdgpu/mes11: add API for user queue reset
>>   drm/amdgpu/mes12: add API for user queue reset
>>   drm/amdgpu: add new ring reset callback
>>   drm/amdgpu: add per ring reset support (v2)
>>   drm/amdgpu/gfx11: add ring reset callbacks
>>   drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>>   drm/amdgpu/gfx10: add ring reset callbacks
>>   drm/amdgpu/gfx10: rework reset sequence
>>   drm/amdgpu/gfx9: add ring reset callback
>>   drm/amdgpu/gfx9.4.3: add ring reset callback
>>   drm/amdgpu/gfx12: add ring reset callbacks
>>   drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>>   drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>>   drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>>   drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>>
>> Jiadong Zhu (13):
>>   drm/amdgpu/gfx11: wait for reset done before remap
>>   drm/amdgpu/gfx10: remap queue after reset successfully
>>   drm/amdgpu/gfx10: wait for reset done before remap
>>   drm/amdgpu/gfx9: remap queue after reset successfully
>>   drm/amdgpu/gfx9: wait for reset done before remap
>>   drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>>   drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>>   drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>>   drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>>   drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>>   drm/amdgpu/mes: modify mes api for mmio queue reset
>>   drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>>   drm/amdgpu/mes11: implement mmio queue reset for gfx11
>>
>> Prike Liang (2):
>>   drm/amdgpu: increase the reset counter for the queue reset
>>   drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>>  drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>>  14 files changed, 930 insertions(+), 32 deletions(-)
>>
>> --
>> 2.45.2
>>


* RE: [PATCH 00/34] GC per queue reset
  2024-07-19 13:39   ` Alex Deucher
  2024-07-19 22:52     ` Alex Deucher
@ 2024-07-24  9:20     ` Zhu, Jiadong
  2024-07-25  7:44       ` Friedrich Vock
  1 sibling, 1 reply; 42+ messages in thread
From: Zhu, Jiadong @ 2024-07-24  9:20 UTC (permalink / raw)
  To: Alex Deucher, Friedrich Vock
  Cc: Deucher, Alexander, amd-gfx@lists.freedesktop.org

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Alex
> Deucher
> Sent: Friday, July 19, 2024 9:40 PM
> To: Friedrich Vock <friedrich.vock@gmx.de>
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>;
> amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 00/34] GC per queue reset
>
> On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock@gmx.de>
> wrote:
> >
> > Hi,
> >
> > On 18.07.24 16:06, Alex Deucher wrote:
> > > This adds preliminary support for GC per queue reset.  In this case,
> > > only the jobs currently in the queue are lost.  If this fails, we
> > > fall back to a full adapter reset.
> >
> > First of all, thank you so much for working on this! It's great to
> > finally see progress in making GPU resets better.
> >
> > I've just taken this patchset (together with your other
> > patchsets[1][2][3]) for a quick spin on my
> > Navi21 with the GPU reset tests[4] I had written a while ago - the
> > > current patchset sadly seems to have some regressions WRT recovery there.
> >
> > I ran the tests under my Plasma Wayland session once - this triggered
> > a list double-add in drm_sched_stop (calltrace follows):
>
> I think this should fix the double add:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 7107c4d3a3b6..555d3b671bdb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
>                                 drm_sched_start(&ring->sched, true);
>                         goto exit;
>                 }
> +               if (amdgpu_ring_sched_ready(ring))
> +                       drm_sched_start(&ring->sched, true);
>         }
>
>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>
>
> >
> > ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
> > ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? exc_invalid_op (arch/x86/kernel/traps.c:266)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
> > drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
> > amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
> > amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
> > drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
> > process_one_work (kernel/workqueue.c:2633)
> > worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
> > ? __pfx_worker_thread (kernel/workqueue.c:2733)
> > kthread (kernel/kthread.c:388)
> > ? __pfx_kthread (kernel/kthread.c:341)
> > ret_from_fork (arch/x86/kernel/process.c:147)
> > ? __pfx_kthread (kernel/kthread.c:341)
> > ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
> >
> > When running the tests without a desktop environment active, the
> > double-add disappeared, but the GPU reset still didn't go well - the
> > TTY remained frozen and the kernel log contained a few messages like:
> >
> > [drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out

Hi Friedrich, we cannot reproduce the flip_done timeout on a dGPU.
Could you check whether the hang test runs on the integrated GPU or on the
dGPU? If it runs on the iGPU, could you try disabling the iGPU in the BIOS
to see whether that helps? Thanks.

Thanks,
Jiadong

> I don't think the display hardware is hung, I think it's a fence signalling issue
> after the reset.  We are investigating some limitations we are seeing in the
> handling of fences.
>
> >
> > which I guess means at least the display subsystem is hung.
> >
> > Hope this info is enough to repro/investigate.
>
> Thanks for testing!
>
> Alex
>
> >
> > Thanks,
> > Friedrich
> >
> > [1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@amd.com/T/#t
> > [2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@amd.com/T/#t
> > [3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
> > [4] https://gitlab.steamos.cloud/holo/HangTestSuite
> >
> >
> > >
> > > Alex Deucher (19):
> > >    drm/amdgpu/mes: add API for legacy queue reset
> > >    drm/amdgpu/mes11: add API for legacy queue reset
> > >    drm/amdgpu/mes12: add API for legacy queue reset
> > >    drm/amdgpu/mes: add API for user queue reset
> > >    drm/amdgpu/mes11: add API for user queue reset
> > >    drm/amdgpu/mes12: add API for user queue reset
> > >    drm/amdgpu: add new ring reset callback
> > >    drm/amdgpu: add per ring reset support (v2)
> > >    drm/amdgpu/gfx11: add ring reset callbacks
> > >    drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
> > >    drm/amdgpu/gfx10: add ring reset callbacks
> > >    drm/amdgpu/gfx10: rework reset sequence
> > >    drm/amdgpu/gfx9: add ring reset callback
> > >    drm/amdgpu/gfx9.4.3: add ring reset callback
> > >    drm/amdgpu/gfx12: add ring reset callbacks
> > >    drm/amdgpu/gfx12: fallback to driver reset compute queue directly
> > >    drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
> > >    drm/amdgpu/gfx11: add a mutex for the gfx semaphore
> > >    drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
> > >
> > > Jiadong Zhu (13):
> > >    drm/amdgpu/gfx11: wait for reset done before remap
> > >    drm/amdgpu/gfx10: remap queue after reset successfully
> > >    drm/amdgpu/gfx10: wait for reset done before remap
> > >    drm/amdgpu/gfx9: remap queue after reset successfully
> > >    drm/amdgpu/gfx9: wait for reset done before remap
> > >    drm/amdgpu/gfx9.4.3: remap queue after reset successfully
> > >    drm/amdgpu/gfx_9.4.3: wait for reset done before remap
> > >    drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
> > >    drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
> > >    drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
> > >    drm/amdgpu/mes: modify mes api for mmio queue reset
> > >    drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
> > >    drm/amdgpu/mes11: implement mmio queue reset for gfx11
> > >
> > > Prike Liang (2):
> > >    drm/amdgpu: increase the reset counter for the queue reset
> > >    drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
> > >
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
> > >   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
> > >   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
> > >   drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
> > >   14 files changed, 930 insertions(+), 32 deletions(-)
> > >
> >
> >
> >


* Re: [PATCH 00/34] GC per queue reset
  2024-07-24  9:20     ` Zhu, Jiadong
@ 2024-07-25  7:44       ` Friedrich Vock
  0 siblings, 0 replies; 42+ messages in thread
From: Friedrich Vock @ 2024-07-25  7:44 UTC (permalink / raw)
  To: Zhu, Jiadong, Alex Deucher
  Cc: Deucher, Alexander, amd-gfx@lists.freedesktop.org

On 24.07.24 11:20, Zhu, Jiadong wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Alex
>> Deucher
>> Sent: Friday, July 19, 2024 9:40 PM
>> To: Friedrich Vock <friedrich.vock@gmx.de>
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 00/34] GC per queue reset
>>
>> On Thu, Jul 18, 2024 at 1:00 PM Friedrich Vock <friedrich.vock@gmx.de>
>> wrote:
>>>
>>> Hi,
>>>
>>> On 18.07.24 16:06, Alex Deucher wrote:
>>>> This adds preliminary support for GC per queue reset.  In this case,
>>>> only the jobs currently in the queue are lost.  If this fails, we
>>>> fall back to a full adapter reset.
>>>
>>> First of all, thank you so much for working on this! It's great to
>>> finally see progress in making GPU resets better.
>>>
>>> I've just taken this patchset (together with your other
>>> patchsets[1][2][3]) for a quick spin on my
>>> Navi21 with the GPU reset tests[4] I had written a while ago - the
>>> current patchset sadly seems to have some regressions WRT recovery there.
>>>
>>> I ran the tests under my Plasma Wayland session once - this triggered
>>> a list double-add in drm_sched_stop (calltrace follows):
>>
>> I think this should fix the double add:
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 7107c4d3a3b6..555d3b671bdb 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -88,6 +88,8 @@ static enum drm_gpu_sched_stat
>> amdgpu_job_timedout(struct drm_sched_job *s_job)
>>                                  drm_sched_start(&ring->sched, true);
>>                          goto exit;
>>                  }
>> +               if (amdgpu_ring_sched_ready(ring))
>> +                       drm_sched_start(&ring->sched, true);
>>          }
>>
>>          if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>
>>
>>>
>>> ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
>>> ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154)
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175)
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> ? exc_invalid_op (arch/x86/kernel/traps.c:266)
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:568)
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> ? __list_add_valid_or_report (lib/list_debug.c:35 (discriminator 1))
>>> drm_sched_stop (./include/linux/list.h:151 ./include/linux/list.h:169 drivers/gpu/drm/scheduler/sched_main.c:617)
>>> amdgpu_device_gpu_recover (drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5808)
>>> amdgpu_job_timedout (drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:103)
>>> drm_sched_job_timedout (drivers/gpu/drm/scheduler/sched_main.c:569)
>>> process_one_work (kernel/workqueue.c:2633)
>>> worker_thread (kernel/workqueue.c:2700 (discriminator 2) kernel/workqueue.c:2787 (discriminator 2))
>>> ? __pfx_worker_thread (kernel/workqueue.c:2733)
>>> kthread (kernel/kthread.c:388)
>>> ? __pfx_kthread (kernel/kthread.c:341)
>>> ret_from_fork (arch/x86/kernel/process.c:147)
>>> ? __pfx_kthread (kernel/kthread.c:341)
>>> ret_from_fork_asm (arch/x86/entry/entry_64.S:251)
>>>
>>> When running the tests without a desktop environment active, the
>>> double-add disappeared, but the GPU reset still didn't go well - the
>>> TTY remained frozen and the kernel log contained a few messages like:
>>>
>>> [drm] *ERROR* [CRTC:90:crtc-0] flip_done timed out
>
> Hi Friedrich, we cannot reproduce the flip_done timed out on dgpu.
> could you have a check if the hangtest runs on integrated gpu or the dgpu. If it runs on igpu, could you have a try to disable igpu in bios to see if it works. Thanks.

Hi,

I double-checked with the iGPU disabled in BIOS and can still reproduce.
In case it matters, note that I had a typo in my original message: I'm
testing on Navi22, not 21 - sorry about that.

It also seems that the issue occurs on plain amd-staging-drm-next
without the per-queue reset patches, so this is actually an earlier,
unrelated regression.

I'll try bisecting later and will open a separate GitLab issue for this.

Regards,
Friedrich

>
> Thanks,
> Jiadong
>
>> I don't think the display hardware is hung, I think it's a fence signalling issue
>> after the reset.  We are investigating some limitations we are seeing in the
>> handling of fences.
>>
>>>
>>> which I guess means at least the display subsystem is hung.
>>>
>>> Hope this info is enough to repro/investigate.
>>
>> Thanks for testing!
>>
>> Alex
>>
>>>
>>> Thanks,
>>> Friedrich
>>>
>>> [1] https://lore.kernel.org/amd-gfx/20240717203740.14059-1-alexander.deucher@amd.com/T/#t
>>> [2] https://lore.kernel.org/amd-gfx/20240717203847.14600-1-alexander.deucher@amd.com/T/#t
>>> [3] https://lore.kernel.org/amd-gfx/230ee72e-4f7f-4894-a789-2e1e5788344f@amd.com/T/#t
>>> [4] https://gitlab.steamos.cloud/holo/HangTestSuite
>>>
>>>
>>>>
>>>> Alex Deucher (19):
>>>>     drm/amdgpu/mes: add API for legacy queue reset
>>>>     drm/amdgpu/mes11: add API for legacy queue reset
>>>>     drm/amdgpu/mes12: add API for legacy queue reset
>>>>     drm/amdgpu/mes: add API for user queue reset
>>>>     drm/amdgpu/mes11: add API for user queue reset
>>>>     drm/amdgpu/mes12: add API for user queue reset
>>>>     drm/amdgpu: add new ring reset callback
>>>>     drm/amdgpu: add per ring reset support (v2)
>>>>     drm/amdgpu/gfx11: add ring reset callbacks
>>>>     drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
>>>>     drm/amdgpu/gfx10: add ring reset callbacks
>>>>     drm/amdgpu/gfx10: rework reset sequence
>>>>     drm/amdgpu/gfx9: add ring reset callback
>>>>     drm/amdgpu/gfx9.4.3: add ring reset callback
>>>>     drm/amdgpu/gfx12: add ring reset callbacks
>>>>     drm/amdgpu/gfx12: fallback to driver reset compute queue directly
>>>>     drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL
>>>>     drm/amdgpu/gfx11: add a mutex for the gfx semaphore
>>>>     drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex()
>>>>
>>>> Jiadong Zhu (13):
>>>>     drm/amdgpu/gfx11: wait for reset done before remap
>>>>     drm/amdgpu/gfx10: remap queue after reset successfully
>>>>     drm/amdgpu/gfx10: wait for reset done before remap
>>>>     drm/amdgpu/gfx9: remap queue after reset successfully
>>>>     drm/amdgpu/gfx9: wait for reset done before remap
>>>>     drm/amdgpu/gfx9.4.3: remap queue after reset successfully
>>>>     drm/amdgpu/gfx_9.4.3: wait for reset done before remap
>>>>     drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue
>>>>     drm/amdgpu/gfx9: implement reset_hw_queue for gfx9
>>>>     drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3
>>>>     drm/amdgpu/mes: modify mes api for mmio queue reset
>>>>     drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
>>>>     drm/amdgpu/mes11: implement mmio queue reset for gfx11
>>>>
>>>> Prike Liang (2):
>>>>     drm/amdgpu: increase the reset counter for the queue reset
>>>>     drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
>>>>
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h    |   6 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  18 +++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c    |  88 ++++++++++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h    |  37 +++++
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     | 158 ++++++++++++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c     | 117 +++++++++++++--
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v11_0.h     |   3 +
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c     |  95 ++++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 126 +++++++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c    | 125 +++++++++++++++-
>>>>    drivers/gpu/drm/amd/amdgpu/mes_v11_0.c     | 132 +++++++++++++++++
>>>>    drivers/gpu/drm/amd/amdgpu/mes_v12_0.c     |  54 +++++++
>>>>    14 files changed, 930 insertions(+), 32 deletions(-)
>>>>
>>>
>>>
>>>



end of thread, other threads:[~2024-07-25  7:49 UTC | newest]

Thread overview: 42+ messages
2024-07-18 14:06 [PATCH 00/34] GC per queue reset Alex Deucher
2024-07-18 14:07 ` [PATCH 01/34] drm/amdgpu/mes: add API for legacy " Alex Deucher
2024-07-18 14:07 ` [PATCH 02/34] drm/amdgpu/mes11: " Alex Deucher
2024-07-18 14:07 ` [PATCH 03/34] drm/amdgpu/mes12: " Alex Deucher
2024-07-18 14:07 ` [PATCH 04/34] drm/amdgpu/mes: add API for user " Alex Deucher
2024-07-18 14:07 ` [PATCH 05/34] drm/amdgpu/mes11: " Alex Deucher
2024-07-18 14:07 ` [PATCH 06/34] drm/amdgpu/mes12: " Alex Deucher
2024-07-18 14:07 ` [PATCH 07/34] drm/amdgpu: add new ring reset callback Alex Deucher
2024-07-18 14:07 ` [PATCH 08/34] drm/amdgpu: add per ring reset support (v2) Alex Deucher
2024-07-18 14:07 ` [PATCH 09/34] drm/amdgpu: increase the reset counter for the queue reset Alex Deucher
2024-07-18 14:07 ` [PATCH 10/34] drm/amdgpu/gfx11: add ring reset callbacks Alex Deucher
2024-07-18 14:07 ` [PATCH 11/34] drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2) Alex Deucher
2024-07-18 14:07 ` [PATCH 12/34] drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue() Alex Deucher
2024-07-18 14:07 ` [PATCH 13/34] drm/amdgpu/gfx11: wait for reset done before remap Alex Deucher
2024-07-18 14:07 ` [PATCH 14/34] drm/amdgpu/gfx10: add ring reset callbacks Alex Deucher
2024-07-18 14:07 ` [PATCH 15/34] drm/amdgpu/gfx10: remap queue after reset successfully Alex Deucher
2024-07-18 14:07 ` [PATCH 16/34] drm/amdgpu/gfx10: wait for reset done before remap Alex Deucher
2024-07-18 14:07 ` [PATCH 17/34] drm/amdgpu/gfx10: rework reset sequence Alex Deucher
2024-07-18 14:07 ` [PATCH 18/34] drm/amdgpu/gfx9: add ring reset callback Alex Deucher
2024-07-18 14:07 ` [PATCH 19/34] drm/amdgpu/gfx9: remap queue after reset successfully Alex Deucher
2024-07-18 14:07 ` [PATCH 20/34] drm/amdgpu/gfx9: wait for reset done before remap Alex Deucher
2024-07-18 14:07 ` [PATCH 21/34] drm/amdgpu/gfx9.4.3: add ring reset callback Alex Deucher
2024-07-18 14:07 ` [PATCH 22/34] drm/amdgpu/gfx9.4.3: remap queue after reset successfully Alex Deucher
2024-07-18 14:07 ` [PATCH 23/34] drm/amdgpu/gfx_9.4.3: wait for reset done before remap Alex Deucher
2024-07-18 14:07 ` [PATCH 24/34] drm/amdgpu/gfx12: add ring reset callbacks Alex Deucher
2024-07-18 14:07 ` [PATCH 25/34] drm/amdgpu/gfx12: fallback to driver reset compute queue directly Alex Deucher
2024-07-18 14:07 ` [PATCH 26/34] drm/amdgpu/gfx: add a new kiq_pm4_funcs callback for reset_hw_queue Alex Deucher
2024-07-18 14:07 ` [PATCH 27/34] drm/amdgpu/gfx9: implement reset_hw_queue for gfx9 Alex Deucher
2024-07-18 14:07 ` [PATCH 28/34] drm/amdgpu/gfx9.4.3: implement reset_hw_queue for gfx9.4.3 Alex Deucher
2024-07-18 14:07 ` [PATCH 29/34] drm/amdgpu/mes: modify mes api for mmio queue reset Alex Deucher
2024-07-18 14:07 ` [PATCH 30/34] drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio Alex Deucher
2024-07-18 14:07 ` [PATCH 31/34] drm/amdgpu/gfx11: enter safe mode before touching CP_INT_CNTL Alex Deucher
2024-07-18 14:07 ` [PATCH 32/34] drm/amdgpu/gfx11: add a mutex for the gfx semaphore Alex Deucher
2024-07-18 14:07 ` [PATCH 33/34] drm/amdgpu/gfx11: export gfx_v11_0_request_gfx_index_mutex() Alex Deucher
2024-07-18 14:07 ` [PATCH 34/34] drm/amdgpu/mes11: implement mmio queue reset for gfx11 Alex Deucher
2024-07-18 16:29 ` [PATCH 00/34] GC per queue reset Friedrich Vock
2024-07-19 13:39   ` Alex Deucher
2024-07-19 22:52     ` Alex Deucher
2024-07-24  9:20     ` Zhu, Jiadong
2024-07-25  7:44       ` Friedrich Vock
2024-07-18 16:54 ` Alex Deucher
2024-07-23  8:50   ` Christopher Snowhill
