AMD-GFX Archive on lore.kernel.org
* [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset
@ 2024-05-28 17:23 Yunxiang Li
  2024-05-28 17:23 ` [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
                   ` (10 more replies)
  0 siblings, 11 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

If another thread accesses the GPU while it is being reset, the reset
can fail. This is especially problematic on SRIOV, since the host may
reset the GPU even if the guest is not yet ready.

There is code in place that tries to prevent stray access, but bugs
have crept in over time, making it unreliable. This series addresses
those bugs.

Likun Gao (1):
  drm/amd/amdgpu: remove unnecessary flush when enable gart

Yunxiang Li (9):
  drm/amdgpu: add skip_hw_access checks for sriov
  drm/amdgpu: fix sriov host flr handler
  drm/amdgpu: abort fence poll if reset is started
  drm/amdgpu/kfd: remove is_hws_hang and is_resetting
  drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover
  drm/amdgpu: use helper in amdgpu_gart_unbind
  drm/amdgpu: fix locking scope when flushing tlb
  drm/amdgpu: fix missing reset domain locks
  Revert "drm/amdgpu: Queue KFD reset workitem in VF FED"

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      |  9 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c       | 66 ++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c   |  2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  8 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c       |  7 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h       |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c      | 25 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  2 +
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c        |  3 -
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c        |  3 -
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        |  3 -
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c        |  3 -
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c        |  4 -
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c        |  2 +-
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c        |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         | 37 ++++-----
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c         | 37 ++++-----
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  6 --
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 -
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 79 ++++++++-----------
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 11 ++-
 .../gpu/drm/amd/amdkfd/kfd_packet_manager.c   |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  4 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    | 13 ++-
 27 files changed, 164 insertions(+), 177 deletions(-)

-- 
2.34.1



* [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:36   ` Christian König
  2024-05-28 17:23 ` [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler Yunxiang Li
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

Accessing registers via the host is missing the skip_hw_access check
and the lockdep check that comes with it.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3d5f58e76f2d..3cf8416f8cb0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -977,6 +977,9 @@ u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device *adev, u32 offset, u32 v, u32 f
 		return 0;
 	}
 
+	if (amdgpu_device_skip_hw_access(adev))
+		return 0;
+
 	reg_access_ctrl = &adev->gfx.rlc.reg_access_ctrl[xcc_id];
 	scratch_reg0 = (void __iomem *)adev->rmmio + 4 * reg_access_ctrl->scratch_reg0;
 	scratch_reg1 = (void __iomem *)adev->rmmio + 4 * reg_access_ctrl->scratch_reg1;
@@ -1047,6 +1050,9 @@ void amdgpu_sriov_wreg(struct amdgpu_device *adev,
 {
 	u32 rlcg_flag;
 
+	if (amdgpu_device_skip_hw_access(adev))
+		return;
+
 	if (!amdgpu_sriov_runtime(adev) &&
 		amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags, hwip, true, &rlcg_flag)) {
 		amdgpu_virt_rlcg_reg_rw(adev, offset, value, rlcg_flag, xcc_id);
@@ -1064,6 +1070,9 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
 {
 	u32 rlcg_flag;
 
+	if (amdgpu_device_skip_hw_access(adev))
+		return 0;
+
 	if (!amdgpu_sriov_runtime(adev) &&
 		amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags, hwip, false, &rlcg_flag))
 		return amdgpu_virt_rlcg_reg_rw(adev, offset, 0, rlcg_flag, xcc_id);
-- 
2.34.1



* [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
  2024-05-28 17:23 ` [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:41   ` Christian König
  2024-05-28 17:23 ` [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started Yunxiang Li
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

We send back the ready-to-reset message before we stop anything, which
is wrong. Move it to the point where we are actually ready for the FLR
to happen.

In the current state, since it takes tens of seconds to stop
everything, it is very likely that the host will give up waiting and
reset the GPU before we send ready, so the behavior would be the same
as before. But this gets rid of the hack with reset_domain locking and
also lets us know how slow the reset actually is on the host. The
pre-reset speed can thus be improved later.
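The ordering this patch establishes can be sketched as a small mock handshake (illustrative only; send_msg/peek_msg stand in for the PF/VF mailbox, and the host side is collapsed into a synchronous call):

```c
/* Sketch of the fixed FLR ordering: stop GPU work first, only then
 * signal READY_TO_RESET, then poll for the host's FLR-complete
 * notification. None of these names are real driver symbols. */
#include <stdbool.h>

enum msg { MSG_NONE, MSG_READY_TO_RESET, MSG_FLR_CMPL };

static enum msg mailbox = MSG_NONE;	/* stand-in for the mailbox */
static bool gpu_work_stopped;

static void send_msg(enum msg m) { mailbox = m; }
static enum msg peek_msg(void)   { return mailbox; }

/* Host side (mocked): performs the FLR once the guest is ready. */
static void host_do_flr(void)
{
	if (peek_msg() == MSG_READY_TO_RESET)
		send_msg(MSG_FLR_CMPL);
}

/* Guest side: the ordering amdgpu_device_reset_sriov() now follows. */
int guest_handle_host_flr(void)
{
	gpu_work_stopped = true;	/* 1. stop all GPU activity     */
	send_msg(MSG_READY_TO_RESET);	/* 2. tell the host we're ready */
	host_do_flr();			/* (mock: host runs the FLR)    */
	if (peek_msg() != MSG_FLR_CMPL)	/* 3. wait for FLR completion   */
		return -1;
	return 0;
}
```

The key point is only the sequencing: before this patch, step 2 happened before step 1 had finished.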

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 14 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 ++
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 37 ++++++++--------------
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 37 ++++++++--------------
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 ----
 6 files changed, 46 insertions(+), 52 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index bf1a6593dc5e..eb77b4ec3cb4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5069,6 +5069,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	struct amdgpu_hive_info *hive = NULL;
 
 	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
+		amdgpu_virt_ready_to_reset(adev);
+		amdgpu_virt_wait_reset(adev);
 		clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
 		r = amdgpu_virt_request_full_gpu(adev, true);
 	} else {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3cf8416f8cb0..44450507c140 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -152,6 +152,20 @@ void amdgpu_virt_request_init_data(struct amdgpu_device *adev)
 		DRM_WARN("host doesn't support REQ_INIT_DATA handshake\n");
 }
 
+/**
+ * amdgpu_virt_ready_to_reset() - send ready to reset to host
+ * @adev:	amdgpu device.
+ * Send ready to reset message to GPU hypervisor to signal we have stopped GPU
+ * activity and is ready for host FLR
+ */
+void amdgpu_virt_ready_to_reset(struct amdgpu_device *adev)
+{
+	struct amdgpu_virt *virt = &adev->virt;
+
+	if (virt->ops && virt->ops->ready_to_reset)
+		virt->ops->ready_to_reset(adev);
+}
+
 /**
  * amdgpu_virt_wait_reset() - wait for reset gpu completed
  * @adev:	amdgpu device.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 642f1fd287d8..66de5380d9a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -88,6 +88,7 @@ struct amdgpu_virt_ops {
 	int (*rel_full_gpu)(struct amdgpu_device *adev, bool init);
 	int (*req_init_data)(struct amdgpu_device *adev);
 	int (*reset_gpu)(struct amdgpu_device *adev);
+	void (*ready_to_reset)(struct amdgpu_device *adev);
 	int (*wait_reset)(struct amdgpu_device *adev);
 	void (*trans_msg)(struct amdgpu_device *adev, enum idh_request req,
 			  u32 data1, u32 data2, u32 data3);
@@ -345,6 +346,7 @@ int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
 int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
 int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
 void amdgpu_virt_request_init_data(struct amdgpu_device *adev);
+void amdgpu_virt_ready_to_reset(struct amdgpu_device *adev);
 int amdgpu_virt_wait_reset(struct amdgpu_device *adev);
 int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
 void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index f4c47492e0cd..3fdd1fc84723 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -249,38 +249,28 @@ static int xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device *adev,
 	return 0;
 }
 
-static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
+static void xgpu_ai_ready_to_reset(struct amdgpu_device *adev)
 {
-	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
-	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
-	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
-
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (atomic_cmpxchg(&adev->reset_domain->in_gpu_reset, 0, 1) != 0)
-		return;
-
-	down_write(&adev->reset_domain->sem);
-
-	amdgpu_virt_fini_data_exchange(adev);
-
 	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
+}
 
+static int xgpu_ai_wait_reset(struct amdgpu_device *adev)
+{
+	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
 	do {
 		if (xgpu_ai_mailbox_peek_msg(adev) == IDH_FLR_NOTIFICATION_CMPL)
-			goto flr_done;
-
+			return 0;
 		msleep(10);
 		timeout -= 10;
 	} while (timeout > 1);
-
 	dev_warn(adev->dev, "waiting IDH_FLR_NOTIFICATION_CMPL timeout\n");
+	return -ETIME;
+}
 
-flr_done:
-	atomic_set(&adev->reset_domain->in_gpu_reset, 0);
-	up_write(&adev->reset_domain->sem);
+static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
+{
+	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
@@ -417,7 +407,8 @@ const struct amdgpu_virt_ops xgpu_ai_virt_ops = {
 	.req_full_gpu	= xgpu_ai_request_full_gpu_access,
 	.rel_full_gpu	= xgpu_ai_release_full_gpu_access,
 	.reset_gpu = xgpu_ai_request_reset,
-	.wait_reset = NULL,
+	.ready_to_reset = xgpu_ai_ready_to_reset,
+	.wait_reset = xgpu_ai_wait_reset,
 	.trans_msg = xgpu_ai_mailbox_trans_msg,
 	.req_init_data  = xgpu_ai_request_init_data,
 	.ras_poison_handler = xgpu_ai_ras_poison_handler,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 37b49a5ed2a1..cd6ec1afff2b 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -282,38 +282,28 @@ static int xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device *adev,
 	return 0;
 }
 
-static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
+static void xgpu_nv_ready_to_reset(struct amdgpu_device *adev)
 {
-	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
-	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
-	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
-
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (atomic_cmpxchg(&adev->reset_domain->in_gpu_reset, 0, 1) != 0)
-		return;
-
-	down_write(&adev->reset_domain->sem);
-
-	amdgpu_virt_fini_data_exchange(adev);
-
 	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
+}
 
+static int xgpu_nv_wait_reset(struct amdgpu_device *adev)
+{
+	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
 	do {
 		if (xgpu_nv_mailbox_peek_msg(adev) == IDH_FLR_NOTIFICATION_CMPL)
-			goto flr_done;
-
+			return 0;
 		msleep(10);
 		timeout -= 10;
 	} while (timeout > 1);
-
 	dev_warn(adev->dev, "waiting IDH_FLR_NOTIFICATION_CMPL timeout\n");
+	return -ETIME;
+}
 
-flr_done:
-	atomic_set(&adev->reset_domain->in_gpu_reset, 0);
-	up_write(&adev->reset_domain->sem);
+static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
+{
+	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
@@ -455,7 +445,8 @@ const struct amdgpu_virt_ops xgpu_nv_virt_ops = {
 	.rel_full_gpu	= xgpu_nv_release_full_gpu_access,
 	.req_init_data  = xgpu_nv_request_init_data,
 	.reset_gpu = xgpu_nv_request_reset,
-	.wait_reset = NULL,
+	.ready_to_reset = xgpu_nv_ready_to_reset,
+	.wait_reset = xgpu_nv_wait_reset,
 	.trans_msg = xgpu_nv_mailbox_trans_msg,
 	.ras_poison_handler = xgpu_nv_ras_poison_handler,
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 78cd07744ebe..e1d63bed84bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -515,12 +515,6 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 
-	/* wait until RCV_MSG become 3 */
-	if (xgpu_vi_poll_msg(adev, IDH_FLR_NOTIFICATION_CMPL)) {
-		pr_err("failed to receive FLR_CMPL\n");
-		return;
-	}
-
 	/* Trigger recovery due to world switch failure */
 	if (amdgpu_device_should_recover_gpu(adev)) {
 		struct amdgpu_reset_context reset_context;
-- 
2.34.1



* [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
  2024-05-28 17:23 ` [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
  2024-05-28 17:23 ` [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:38   ` Christian König
  2024-05-28 17:23 ` [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

If a reset is triggered, there is no point in waiting for the fence
anymore; it just makes the reset code wait a long time for the
reset_domain read lock to be dropped.

This also makes our reply to a host FLR fast enough that the host
doesn't time out.
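The change to the polling wait can be modeled in a few lines of userspace C (fence_value/in_reset stand in for amdgpu_fence_read()/amdgpu_in_reset(); the udelay is elided, so only the control flow is real):

```c
/* Minimal model of this patch: snapshot the reset state on entry and
 * abort the poll as soon as a reset *starts*, instead of burning the
 * whole timeout holding the reset_domain read lock. */
#include <stdint.h>

static uint32_t fence_seq;	/* stand-in for the ring's fence value */
static int reset_flag;		/* stand-in for amdgpu_in_reset()      */

static int in_reset(void) { return reset_flag; }
static uint32_t fence_value(void) { return fence_seq; }

/* Returns remaining timeout (>0) on success, 0 on timeout or abort. */
long fence_wait_polling(uint32_t wait_seq, long timeout)
{
	int was_in_reset = in_reset();	/* snapshot, as in the patch */

	while ((int32_t)(wait_seq - fence_value()) > 0 && timeout > 0) {
		timeout -= 2;		/* models the 2us udelay step */
		if (!was_in_reset && in_reset())
			return 0;	/* a reset started: stop waiting */
	}
	return timeout > 0 ? timeout : 0;
}
```

Note the snapshot: a wait issued from *within* an ongoing reset still runs to completion or timeout; only a reset that begins mid-wait aborts the poll.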

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 7 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   | 3 ++-
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c    | 2 +-
 5 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 10832b470448..3c04f034d43e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -376,10 +376,12 @@ signed long amdgpu_fence_wait_polling(struct amdgpu_ring *ring,
 				      uint32_t wait_seq,
 				      signed long timeout)
 {
-
+	int in_reset = amdgpu_in_reset(ring->adev);
 	while ((int32_t)(wait_seq - amdgpu_fence_read(ring)) > 0 && timeout > 0) {
 		udelay(2);
 		timeout -= 2;
+		if (!in_reset && amdgpu_in_reset(ring->adev))
+			return 0;
 	}
 	return timeout > 0 ? timeout : 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 8c6b0987919f..dd22fd93f572 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -32,14 +32,17 @@
 #define AMDGPU_MES_MAX_NUM_OF_QUEUES_PER_PROCESS 1024
 #define AMDGPU_ONE_DOORBELL_SIZE 8
 
-signed long amdgpu_mes_fence_wait_polling(u64 *fence,
+signed long amdgpu_mes_fence_wait_polling(struct amdgpu_device *adev,
+					  u64 *fence,
 					  u64 wait_seq,
 					  signed long timeout)
 {
-
+	int in_reset = amdgpu_in_reset(adev);
 	while ((s64)(wait_seq - *fence) > 0 && timeout > 0) {
 		udelay(2);
 		timeout -= 2;
+		if (!in_reset && amdgpu_in_reset(adev))
+			return 0;
 	}
 	return timeout > 0 ? timeout : 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index b99a2b3cffe3..064cb3995a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -340,7 +340,8 @@ struct amdgpu_mes_funcs {
 #define amdgpu_mes_kiq_hw_init(adev) (adev)->mes.kiq_hw_init((adev))
 #define amdgpu_mes_kiq_hw_fini(adev) (adev)->mes.kiq_hw_fini((adev))
 
-signed long amdgpu_mes_fence_wait_polling(u64 *fence,
+signed long amdgpu_mes_fence_wait_polling(struct amdgpu_device *adev,
+					  u64 *fence,
 					  u64 wait_seq,
 					  signed long timeout);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 96629d8130b8..38edd60c6789 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -212,7 +212,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	else
 		dev_dbg(adev->dev, "MES msg=%d was emitted\n", x_pkt->header.opcode);
 
-	r = amdgpu_mes_fence_wait_polling(fence_ptr, (u64)1, timeout);
+	r = amdgpu_mes_fence_wait_polling(adev, fence_ptr, (u64)1, timeout);
 	amdgpu_device_wb_free(adev, fence_offset);
 	if (r < 1) {
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
index c5a03b79f07e..73430b9c4b27 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
@@ -202,7 +202,7 @@ static int mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	else
 		dev_dbg(adev->dev, "MES msg=%d was emitted\n", x_pkt->header.opcode);
 
-	r = amdgpu_mes_fence_wait_polling(fence_ptr, (u64)1, timeout);
+	r = amdgpu_mes_fence_wait_polling(adev, fence_ptr, (u64)1, timeout);
 	amdgpu_device_wb_free(adev, fence_offset);
 
 	if (r < 1) {
-- 
2.34.1



* [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                   ` (2 preceding siblings ...)
  2024-05-28 17:23 ` [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:41   ` Christian König
  2024-05-29 23:04   ` Felix Kuehling
  2024-05-28 17:23 ` [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

is_hws_hang and is_resetting serve pretty much the same purpose, and
both duplicate the work of the reset_domain lock; just check that lock
directly instead. This also eliminates a few bugs, listed below, and
gets rid of dqm->ops.pre_reset.

kfd_hws_hang did not need to avoid scheduling another reset. If the
ongoing reset decided to skip the GPU reset we would have a bad time;
otherwise the extra reset gets cancelled anyway.

remove_queue_mes forgot to check the is_resetting flag, unlike the
pre-MES path unmap_queues_cpsch, so it did not correctly block hw
access during reset.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 -
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 79 ++++++++-----------
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 11 ++-
 .../gpu/drm/amd/amdkfd/kfd_packet_manager.c   |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  4 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |  4 +-
 7 files changed, 45 insertions(+), 59 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index fba9b9a258a5..3e0f46d60de5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -935,7 +935,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 	for (i = 0; i < kfd->num_nodes; i++) {
 		node = kfd->nodes[i];
 		kfd_smi_event_update_gpu_reset(node, false);
-		node->dqm->ops.pre_reset(node->dqm);
 	}
 
 	kgd2kfd_suspend(kfd, false);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 4721b2fccd06..3a2dc31279a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -35,6 +35,7 @@
 #include "cik_regs.h"
 #include "kfd_kernel_queue.h"
 #include "amdgpu_amdkfd.h"
+#include "amdgpu_reset.h"
 #include "mes_api_def.h"
 #include "kfd_debug.h"
 
@@ -155,14 +156,7 @@ static void kfd_hws_hang(struct device_queue_manager *dqm)
 	/*
 	 * Issue a GPU reset if HWS is unresponsive
 	 */
-	dqm->is_hws_hang = true;
-
-	/* It's possible we're detecting a HWS hang in the
-	 * middle of a GPU reset. No need to schedule another
-	 * reset in this case.
-	 */
-	if (!dqm->is_resetting)
-		schedule_work(&dqm->hw_exception_work);
+	schedule_work(&dqm->hw_exception_work);
 }
 
 static int convert_to_mes_queue_type(int queue_type)
@@ -194,7 +188,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	int r, queue_type;
 	uint64_t wptr_addr_off;
 
-	if (dqm->is_hws_hang)
+	if (!down_read_trylock(&adev->reset_domain->sem))
 		return -EIO;
 
 	memset(&queue_input, 0x0, sizeof(struct mes_add_queue_input));
@@ -245,6 +239,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	amdgpu_mes_lock(&adev->mes);
 	r = adev->mes.funcs->add_hw_queue(&adev->mes, &queue_input);
 	amdgpu_mes_unlock(&adev->mes);
+	up_read(&adev->reset_domain->sem);
 	if (r) {
 		dev_err(adev->dev, "failed to add hardware queue to MES, doorbell=0x%x\n",
 			q->properties.doorbell_off);
@@ -262,7 +257,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	int r;
 	struct mes_remove_queue_input queue_input;
 
-	if (dqm->is_hws_hang)
+	if (!down_read_trylock(&adev->reset_domain->sem))
 		return -EIO;
 
 	memset(&queue_input, 0x0, sizeof(struct mes_remove_queue_input));
@@ -272,6 +267,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	amdgpu_mes_lock(&adev->mes);
 	r = adev->mes.funcs->remove_hw_queue(&adev->mes, &queue_input);
 	amdgpu_mes_unlock(&adev->mes);
+	up_read(&adev->reset_domain->sem);
 
 	if (r) {
 		dev_err(adev->dev, "failed to remove hardware queue from MES, doorbell=0x%x\n",
@@ -1468,20 +1464,13 @@ static int stop_nocpsch(struct device_queue_manager *dqm)
 	}
 
 	if (dqm->dev->adev->asic_type == CHIP_HAWAII)
-		pm_uninit(&dqm->packet_mgr, false);
+		pm_uninit(&dqm->packet_mgr);
 	dqm->sched_running = false;
 	dqm_unlock(dqm);
 
 	return 0;
 }
 
-static void pre_reset(struct device_queue_manager *dqm)
-{
-	dqm_lock(dqm);
-	dqm->is_resetting = true;
-	dqm_unlock(dqm);
-}
-
 static int allocate_sdma_queue(struct device_queue_manager *dqm,
 				struct queue *q, const uint32_t *restore_sdma_id)
 {
@@ -1669,8 +1658,6 @@ static int start_cpsch(struct device_queue_manager *dqm)
 	init_interrupts(dqm);
 
 	/* clear hang status when driver try to start the hw scheduler */
-	dqm->is_hws_hang = false;
-	dqm->is_resetting = false;
 	dqm->sched_running = true;
 
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
@@ -1700,7 +1687,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
 fail_allocate_vidmem:
 fail_set_sched_resources:
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
-		pm_uninit(&dqm->packet_mgr, false);
+		pm_uninit(&dqm->packet_mgr);
 fail_packet_manager_init:
 	dqm_unlock(dqm);
 	return retval;
@@ -1708,22 +1695,17 @@ static int start_cpsch(struct device_queue_manager *dqm)
 
 static int stop_cpsch(struct device_queue_manager *dqm)
 {
-	bool hanging;
-
 	dqm_lock(dqm);
 	if (!dqm->sched_running) {
 		dqm_unlock(dqm);
 		return 0;
 	}
 
-	if (!dqm->is_hws_hang) {
-		if (!dqm->dev->kfd->shared_resources.enable_mes)
-			unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD, false);
-		else
-			remove_all_queues_mes(dqm);
-	}
+	if (!dqm->dev->kfd->shared_resources.enable_mes)
+		unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD, false);
+	else
+		remove_all_queues_mes(dqm);
 
-	hanging = dqm->is_hws_hang || dqm->is_resetting;
 	dqm->sched_running = false;
 
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
@@ -1731,7 +1713,7 @@ static int stop_cpsch(struct device_queue_manager *dqm)
 
 	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
-		pm_uninit(&dqm->packet_mgr, hanging);
+		pm_uninit(&dqm->packet_mgr);
 	dqm_unlock(dqm);
 
 	return 0;
@@ -1957,24 +1939,24 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 {
 	struct device *dev = dqm->dev->adev->dev;
 	struct mqd_manager *mqd_mgr;
-	int retval = 0;
+	int retval;
 
 	if (!dqm->sched_running)
 		return 0;
-	if (dqm->is_hws_hang || dqm->is_resetting)
-		return -EIO;
 	if (!dqm->active_runlist)
-		return retval;
+		return 0;
+	if (!down_read_trylock(&dqm->dev->adev->reset_domain->sem))
+		return -EIO;
 
 	if (grace_period != USE_DEFAULT_GRACE_PERIOD) {
 		retval = pm_update_grace_period(&dqm->packet_mgr, grace_period);
 		if (retval)
-			return retval;
+			goto out;
 	}
 
 	retval = pm_send_unmap_queue(&dqm->packet_mgr, filter, filter_param, reset);
 	if (retval)
-		return retval;
+		goto out;
 
 	*dqm->fence_addr = KFD_FENCE_INIT;
 	pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
@@ -1985,7 +1967,7 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 	if (retval) {
 		dev_err(dev, "The cp might be in an unrecoverable state due to an unsuccessful queues preemption\n");
 		kfd_hws_hang(dqm);
-		return retval;
+		goto out;
 	}
 
 	/* In the current MEC firmware implementation, if compute queue
@@ -2001,7 +1983,8 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 		while (halt_if_hws_hang)
 			schedule();
 		kfd_hws_hang(dqm);
-		return -ETIME;
+		retval = -ETIME;
+		goto out;
 	}
 
 	/* We need to reset the grace period value for this device */
@@ -2014,6 +1997,8 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 	pm_release_ib(&dqm->packet_mgr);
 	dqm->active_runlist = false;
 
+out:
+	up_read(&dqm->dev->adev->reset_domain->sem);
 	return retval;
 }
 
@@ -2040,13 +2025,13 @@ static int execute_queues_cpsch(struct device_queue_manager *dqm,
 {
 	int retval;
 
-	if (dqm->is_hws_hang)
+	if (!down_read_trylock(&dqm->dev->adev->reset_domain->sem))
 		return -EIO;
 	retval = unmap_queues_cpsch(dqm, filter, filter_param, grace_period, false);
-	if (retval)
-		return retval;
-
-	return map_queues_cpsch(dqm);
+	if (!retval)
+		retval = map_queues_cpsch(dqm);
+	up_read(&dqm->dev->adev->reset_domain->sem);
+	return retval;
 }
 
 static int wait_on_destroy_queue(struct device_queue_manager *dqm,
@@ -2427,10 +2412,12 @@ static int process_termination_cpsch(struct device_queue_manager *dqm,
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
 		retval = execute_queues_cpsch(dqm, filter, 0, USE_DEFAULT_GRACE_PERIOD);
 
-	if ((!dqm->is_hws_hang) && (retval || qpd->reset_wavefronts)) {
+	if ((retval || qpd->reset_wavefronts) &&
+	    down_read_trylock(&dqm->dev->adev->reset_domain->sem)) {
 		pr_warn("Resetting wave fronts (cpsch) on dev %p\n", dqm->dev);
 		dbgdev_wave_reset_wavefronts(dqm->dev, qpd->pqm->process);
 		qpd->reset_wavefronts = false;
+		up_read(&dqm->dev->adev->reset_domain->sem);
 	}
 
 	/* Lastly, free mqd resources.
@@ -2537,7 +2524,6 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
 		dqm->ops.initialize = initialize_cpsch;
 		dqm->ops.start = start_cpsch;
 		dqm->ops.stop = stop_cpsch;
-		dqm->ops.pre_reset = pre_reset;
 		dqm->ops.destroy_queue = destroy_queue_cpsch;
 		dqm->ops.update_queue = update_queue;
 		dqm->ops.register_process = register_process;
@@ -2558,7 +2544,6 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
 		/* initialize dqm for no cp scheduling */
 		dqm->ops.start = start_nocpsch;
 		dqm->ops.stop = stop_nocpsch;
-		dqm->ops.pre_reset = pre_reset;
 		dqm->ops.create_queue = create_queue_nocpsch;
 		dqm->ops.destroy_queue = destroy_queue_nocpsch;
 		dqm->ops.update_queue = update_queue;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index fcc0ee67f544..3b9b8eabaacc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -152,7 +152,6 @@ struct device_queue_manager_ops {
 	int	(*initialize)(struct device_queue_manager *dqm);
 	int	(*start)(struct device_queue_manager *dqm);
 	int	(*stop)(struct device_queue_manager *dqm);
-	void	(*pre_reset)(struct device_queue_manager *dqm);
 	void	(*uninitialize)(struct device_queue_manager *dqm);
 	int	(*create_kernel_queue)(struct device_queue_manager *dqm,
 					struct kernel_queue *kq,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
index 32c926986dbb..3ea75a9d86ec 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
@@ -32,6 +32,7 @@
 #include "kfd_device_queue_manager.h"
 #include "kfd_pm4_headers.h"
 #include "kfd_pm4_opcodes.h"
+#include "amdgpu_reset.h"
 
 #define PM4_COUNT_ZERO (((1 << 15) - 1) << 16)
 
@@ -196,15 +197,17 @@ static bool kq_initialize(struct kernel_queue *kq, struct kfd_node *dev,
 }
 
 /* Uninitialize a kernel queue and free all its memory usages. */
-static void kq_uninitialize(struct kernel_queue *kq, bool hanging)
+static void kq_uninitialize(struct kernel_queue *kq)
 {
-	if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && !hanging)
+	if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && down_read_trylock(&kq->dev->adev->reset_domain->sem)) {
 		kq->mqd_mgr->destroy_mqd(kq->mqd_mgr,
 					kq->queue->mqd,
 					KFD_PREEMPT_TYPE_WAVEFRONT_RESET,
 					KFD_UNMAP_LATENCY_MS,
 					kq->queue->pipe,
 					kq->queue->queue);
+		up_read(&kq->dev->adev->reset_domain->sem);
+	}
 	else if (kq->queue->properties.type == KFD_QUEUE_TYPE_DIQ)
 		kfd_gtt_sa_free(kq->dev, kq->fence_mem_obj);
 
@@ -344,9 +347,9 @@ struct kernel_queue *kernel_queue_init(struct kfd_node *dev,
 	return NULL;
 }
 
-void kernel_queue_uninit(struct kernel_queue *kq, bool hanging)
+void kernel_queue_uninit(struct kernel_queue *kq)
 {
-	kq_uninitialize(kq, hanging);
+	kq_uninitialize(kq);
 	kfree(kq);
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
index 7332ad94eab8..a05d5c1097a8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
@@ -263,10 +263,10 @@ int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm)
 	return 0;
 }
 
-void pm_uninit(struct packet_manager *pm, bool hanging)
+void pm_uninit(struct packet_manager *pm)
 {
 	mutex_destroy(&pm->lock);
-	kernel_queue_uninit(pm->priv_queue, hanging);
+	kernel_queue_uninit(pm->priv_queue);
 	pm->priv_queue = NULL;
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index c51e908f6f19..2b3ec92981e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1301,7 +1301,7 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev);
 void device_queue_manager_uninit(struct device_queue_manager *dqm);
 struct kernel_queue *kernel_queue_init(struct kfd_node *dev,
 					enum kfd_queue_type type);
-void kernel_queue_uninit(struct kernel_queue *kq, bool hanging);
+void kernel_queue_uninit(struct kernel_queue *kq);
 int kfd_dqm_evict_pasid(struct device_queue_manager *dqm, u32 pasid);
 
 /* Process Queue Manager */
@@ -1407,7 +1407,7 @@ extern const struct packet_manager_funcs kfd_v9_pm_funcs;
 extern const struct packet_manager_funcs kfd_aldebaran_pm_funcs;
 
 int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm);
-void pm_uninit(struct packet_manager *pm, bool hanging);
+void pm_uninit(struct packet_manager *pm);
 int pm_send_set_resources(struct packet_manager *pm,
 				struct scheduling_resources *res);
 int pm_send_runlist(struct packet_manager *pm, struct list_head *dqm_queues);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 6bf79c435f2e..86ea610b16f3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -434,7 +434,7 @@ int pqm_create_queue(struct process_queue_manager *pqm,
 err_create_queue:
 	uninit_queue(q);
 	if (kq)
-		kernel_queue_uninit(kq, false);
+		kernel_queue_uninit(kq);
 	kfree(pqn);
 err_allocate_pqn:
 	/* check if queues list is empty unregister process from device */
@@ -481,7 +481,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid)
 		/* destroy kernel queue (DIQ) */
 		dqm = pqn->kq->dev->dqm;
 		dqm->ops.destroy_kernel_queue(dqm, pqn->kq, &pdd->qpd);
-		kernel_queue_uninit(pqn->kq, false);
+		kernel_queue_uninit(pqn->kq);
 	}
 
 	if (pqn->q) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                   ` (3 preceding siblings ...)
  2024-05-28 17:23 ` [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:43   ` Christian König
  2024-05-28 17:23 ` [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

From: Likun Gao <Likun.Gao@amd.com>

Remove the HDP flush for GC v11/12 when enabling GART.
Remove the TLB flush for GC v10/11/12 when enabling GART.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 4 ----
 5 files changed, 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index aba0a51be960..5740f94e3e44 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4395,13 +4395,10 @@ static int gfx_v11_0_gfxhub_enable(struct amdgpu_device *adev)
 	if (r)
 		return r;
 
-	adev->hdp.funcs->flush_hdp(adev, NULL);
-
 	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
 		false : true;
 
 	adev->gfxhub.funcs->set_fault_enable_default(adev, value);
-	amdgpu_gmc_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 1ef9de41d193..5048b6eef9da 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -3207,13 +3207,10 @@ static int gfx_v12_0_gfxhub_enable(struct amdgpu_device *adev)
 	if (r)
 		return r;
 
-	adev->hdp.funcs->flush_hdp(adev, NULL);
-
 	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
 		false : true;
 
 	adev->gfxhub.funcs->set_fault_enable_default(adev, value);
-	amdgpu_gmc_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index d933e19e0cf5..3e0ebe25a80f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -974,9 +974,6 @@ static int gmc_v10_0_gart_enable(struct amdgpu_device *adev)
 	if (!adev->in_s0ix)
 		adev->gfxhub.funcs->set_fault_enable_default(adev, value);
 	adev->mmhub.funcs->set_fault_enable_default(adev, value);
-	gmc_v10_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
-	if (!adev->in_s0ix)
-		gmc_v10_0_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
 
 	DRM_INFO("PCIE GART of %uM enabled (table at 0x%016llX).\n",
 		 (unsigned int)(adev->gmc.gart_size >> 20),
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 527dc917e049..cadbe55f0c8f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -891,9 +891,6 @@ static int gmc_v11_0_gart_enable(struct amdgpu_device *adev)
 	if (r)
 		return r;
 
-	/* Flush HDP after it is initialized */
-	adev->hdp.funcs->flush_hdp(adev, NULL);
-
 	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
 		false : true;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index e2c6ec3cc4f3..a677aca69a06 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -861,14 +861,10 @@ static int gmc_v12_0_gart_enable(struct amdgpu_device *adev)
 	if (r)
 		return r;
 
-	/* Flush HDP after it is initialized */
-	adev->hdp.funcs->flush_hdp(adev, NULL);
-
 	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
 		false : true;
 
 	adev->mmhub.funcs->set_fault_enable_default(adev, value);
-	gmc_v12_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
 
 	dev_info(adev->dev, "PCIE GART of %uM enabled (table at 0x%016llX).\n",
 		 (unsigned)(adev->gmc.gart_size >> 20),
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                   ` (4 preceding siblings ...)
  2024-05-28 17:23 ` [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:45   ` Christian König
  2024-05-28 17:23 ` [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

At this point the GART is not yet set up, so there is no point in
invalidating the TLB here, and doing so could even be harmful.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index 44367f03316f..0760e70402ec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -200,8 +200,6 @@ void amdgpu_gtt_mgr_recover(struct amdgpu_gtt_mgr *mgr)
 		amdgpu_ttm_recover_gart(node->base.bo);
 	}
 	spin_unlock(&mgr->lock);
-
-	amdgpu_gart_invalidate_tlb(adev);
 }
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                   ` (5 preceding siblings ...)
  2024-05-28 17:23 ` [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:46   ` Christian König
  2024-05-28 17:23 ` [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

When the amdgpu_gart_invalidate_tlb helper was introduced, this part
was left out of the conversion. Use the helper here to avoid the code
duplication.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index c623e23049d1..eb172388d99e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -325,10 +325,7 @@ void amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
 			page_base += AMDGPU_GPU_PAGE_SIZE;
 		}
 	}
-	mb();
-	amdgpu_device_flush_hdp(adev, NULL);
-	for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
-		amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
+	amdgpu_gart_invalidate_tlb(adev);
 
 	drm_dev_exit(idx);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                   ` (6 preceding siblings ...)
  2024-05-28 17:23 ` [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:49   ` Christian König
  2024-05-28 17:23 ` [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks Yunxiang Li
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

Which method is used to flush the TLB does not depend on whether a
reset is in progress. We should skip the flush altogether if the GPU is
about to be reset, so put both paths under the reset_domain read lock.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 66 +++++++++++++------------
 1 file changed, 34 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 603c0738fd03..4edd10b10a92 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -684,12 +684,17 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
 	struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
 	unsigned int ndw;
-	signed long r;
+	int r;
 	uint32_t seq;
 
-	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
-	    !down_read_trylock(&adev->reset_domain->sem)) {
+	/*
+	 * A GPU reset should flush all TLBs anyway, so no need to do
+	 * this while one is ongoing.
+	 */
+	if (!down_read_trylock(&adev->reset_domain->sem))
+		return 0;
 
+	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready) {
 		if (adev->gmc.flush_tlb_needs_extra_type_2)
 			adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
 								 2, all_hub,
@@ -703,43 +708,40 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
 		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
 							 flush_type, all_hub,
 							 inst);
-		return 0;
-	}
+		r = 0;
+	} else {
+		/* 2 dwords flush + 8 dwords fence */
+		ndw = kiq->pmf->invalidate_tlbs_size + 8;
 
-	/* 2 dwords flush + 8 dwords fence */
-	ndw = kiq->pmf->invalidate_tlbs_size + 8;
+		if (adev->gmc.flush_tlb_needs_extra_type_2)
+			ndw += kiq->pmf->invalidate_tlbs_size;
 
-	if (adev->gmc.flush_tlb_needs_extra_type_2)
-		ndw += kiq->pmf->invalidate_tlbs_size;
+		if (adev->gmc.flush_tlb_needs_extra_type_0)
+			ndw += kiq->pmf->invalidate_tlbs_size;
 
-	if (adev->gmc.flush_tlb_needs_extra_type_0)
-		ndw += kiq->pmf->invalidate_tlbs_size;
+		spin_lock(&adev->gfx.kiq[inst].ring_lock);
+		amdgpu_ring_alloc(ring, ndw);
+		if (adev->gmc.flush_tlb_needs_extra_type_2)
+			kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 2, all_hub);
 
-	spin_lock(&adev->gfx.kiq[inst].ring_lock);
-	amdgpu_ring_alloc(ring, ndw);
-	if (adev->gmc.flush_tlb_needs_extra_type_2)
-		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 2, all_hub);
+		if (flush_type == 2 && adev->gmc.flush_tlb_needs_extra_type_0)
+			kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 0, all_hub);
 
-	if (flush_type == 2 && adev->gmc.flush_tlb_needs_extra_type_0)
-		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 0, all_hub);
+		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, flush_type, all_hub);
+		r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
+		if (r) {
+			amdgpu_ring_undo(ring);
+			spin_unlock(&adev->gfx.kiq[inst].ring_lock);
+			goto error_unlock_reset;
+		}
 
-	kiq->pmf->kiq_invalidate_tlbs(ring, pasid, flush_type, all_hub);
-	r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
-	if (r) {
-		amdgpu_ring_undo(ring);
+		amdgpu_ring_commit(ring);
 		spin_unlock(&adev->gfx.kiq[inst].ring_lock);
-		goto error_unlock_reset;
-	}
-
-	amdgpu_ring_commit(ring);
-	spin_unlock(&adev->gfx.kiq[inst].ring_lock);
-	r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);
-	if (r < 1) {
-		dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);
-		r = -ETIME;
-		goto error_unlock_reset;
+		if (amdgpu_fence_wait_polling(ring, seq, usec_timeout) < 1) {
+			dev_err(adev->dev, "timeout waiting for kiq fence\n");
+			r = -ETIME;
+		}
 	}
-	r = 0;
 
 error_unlock_reset:
 	up_read(&adev->reset_domain->sem);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                   ` (7 preceding siblings ...)
  2024-05-28 17:23 ` [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-29  6:55   ` Christian König
  2024-05-30 22:02   ` Felix Kuehling
  2024-05-28 17:23 ` [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED" Yunxiang Li
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
  10 siblings, 2 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

These functions are missing the reset domain lock.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c               | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c                | 8 ++++++--
 drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index eb172388d99e..ddc5e9972da8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -34,6 +34,7 @@
 #include <asm/set_memory.h>
 #endif
 #include "amdgpu.h"
+#include "amdgpu_reset.h"
 #include <drm/drm_drv.h>
 #include <drm/ttm/ttm_tt.h>
 
@@ -401,13 +402,14 @@ void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev)
 {
 	int i;
 
-	if (!adev->gart.ptr)
+	if (!adev->gart.ptr || !down_read_trylock(&adev->reset_domain->sem))
 		return;
 
 	mb();
 	amdgpu_device_flush_hdp(adev, NULL);
 	for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
 		amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
+	up_read(&adev->reset_domain->sem);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e4742b65032d..52a3170d15b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -307,8 +307,12 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 		dev_dbg(adev->dev, "Skip scheduling IBs in ring(%s)",
 			ring->name);
 	} else {
-		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
-				       &fence);
+		r = -ETIME;
+		if (down_read_trylock(&adev->reset_domain->sem)) {
+			r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
+					       job, &fence);
+			up_read(&adev->reset_domain->sem);
+		}
 		if (r)
 			dev_err(adev->dev,
 				"Error scheduling IBs (%d) in ring(%s)", r,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 86ea610b16f3..21f5a1fb3bf8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -28,6 +28,7 @@
 #include "kfd_priv.h"
 #include "kfd_kernel_queue.h"
 #include "amdgpu_amdkfd.h"
+#include "amdgpu_reset.h"
 
 static inline struct process_queue_node *get_queue_by_qid(
 			struct process_queue_manager *pqm, unsigned int qid)
@@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
 		return;
 
 	dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
-	if (dev->kfd->shared_resources.enable_mes)
-		amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
+	if (dev->kfd->shared_resources.enable_mes &&
+	    down_read_trylock(&dev->adev->reset_domain->sem)) {
+		amdgpu_mes_flush_shader_debugger(dev->adev,
+						 pdd->proc_ctx_gpu_addr);
+		up_read(&dev->adev->reset_domain->sem);
+	}
 	pdd->already_dequeued = true;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED"
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                   ` (8 preceding siblings ...)
  2024-05-28 17:23 ` [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks Yunxiang Li
@ 2024-05-28 17:23 ` Yunxiang Li
  2024-05-28 19:04   ` Skvortsov, Victor
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
  10 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-28 17:23 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang,
	Yunxiang Li

This reverts commit 2149ee697a7a3091a16447c647d4a30f7468553a.

The issue is already fixed by
  fa5a7f2ccb7e ("drm/amdgpu: Fix two reset triggered in a row")

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 44450507c140..4bacbf1db9e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -616,7 +616,7 @@ static void amdgpu_virt_update_vf2pf_work_item(struct work_struct *work)
 		    amdgpu_sriov_runtime(adev)) {
 			amdgpu_ras_set_fed(adev, true);
 			if (amdgpu_reset_domain_schedule(adev->reset_domain,
-							  &adev->kfd.reset_work))
+							  &adev->virt.flr_work))
 				return;
 			else
 				dev_err(adev->dev, "Failed to queue work! at %s", __func__);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED"
  2024-05-28 17:23 ` [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED" Yunxiang Li
@ 2024-05-28 19:04   ` Skvortsov, Victor
  0 siblings, 0 replies; 52+ messages in thread
From: Skvortsov, Victor @ 2024-05-28 19:04 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Gao, Likun, Zhang, Hawking,
	Li, Yunxiang (Teddy)


Nack to the revert. The FLR sequence is defined as follows (host-initiated reset):

1) host sends FLR_NOTIFICATION
2) Guest gets interrupt and queues FLR work item
3) Guest sends READY_TO_RESET
4) Host sends FLR_NOTIFICATION_COMPLETION
5) Guest starts recovery

In RAS FED, guest interrupts are disabled, so the guest won't receive #1; consequently #2 and #4 will break.

It doesn't make sense to re-use this sequence as-is in the FED scenario. On the
other hand, the KFD reset work item performs the guest-initiated reset:

1) Guest waits for mailbox to work (handles the FED disable mailbox)
2) Guest sends REQ_GPU_RESET_ACCESS
3) Host acks back
4) Guest starts recovery

We should keep this commit until a proper guest FED reset workitem is implemented.

Thanks,
Victor


> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Yunxiang Li
> Sent: Tuesday, May 28, 2024 1:24 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Gao, Likun <Likun.Gao@amd.com>; Zhang,
> Hawking <Hawking.Zhang@amd.com>; Li, Yunxiang (Teddy)
> <Yunxiang.Li@amd.com>
> Subject: [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in
> VF FED"
>
> This reverts commit 2149ee697a7a3091a16447c647d4a30f7468553a.
>
> The issue is already fixed by
>   fa5a7f2ccb7e ("drm/amdgpu: Fix two reset triggered in a row")
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 44450507c140..4bacbf1db9e5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -616,7 +616,7 @@ static void
> amdgpu_virt_update_vf2pf_work_item(struct work_struct *work)
>                     amdgpu_sriov_runtime(adev)) {
>                         amdgpu_ras_set_fed(adev, true);
>                         if (amdgpu_reset_domain_schedule(adev->reset_domain,
> -                                                         &adev->kfd.reset_work))
> +                                                         &adev->virt.flr_work))
>                                 return;
>                         else
>                                 dev_err(adev->dev, "Failed to queue work! at %s", __func__);
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov
  2024-05-28 17:23 ` [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
@ 2024-05-29  6:36   ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-29  6:36 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

On 28.05.24 at 19:23, Yunxiang Li wrote:
> Accessing registers via host is missing the check for skip_hw_access and
> the lockdep check that comes with it.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 3d5f58e76f2d..3cf8416f8cb0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -977,6 +977,9 @@ u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device *adev, u32 offset, u32 v, u32 f
>   		return 0;
>   	}
>   
> +	if (amdgpu_device_skip_hw_access(adev))
> +		return 0;
> +

The only other caller of this function is amdgpu_device_xcc_rreg(), and 
that one already has the amdgpu_device_skip_hw_access() call, so this 
one may be redundant.

On the other hand, an extra check doesn't really hurt us.

So either way the patch is Reviewed-by: Christian König 
<christian.koenig@amd.com>

Regards,
Christian.


>   	reg_access_ctrl = &adev->gfx.rlc.reg_access_ctrl[xcc_id];
>   	scratch_reg0 = (void __iomem *)adev->rmmio + 4 * reg_access_ctrl->scratch_reg0;
>   	scratch_reg1 = (void __iomem *)adev->rmmio + 4 * reg_access_ctrl->scratch_reg1;
> @@ -1047,6 +1050,9 @@ void amdgpu_sriov_wreg(struct amdgpu_device *adev,
>   {
>   	u32 rlcg_flag;
>   
> +	if (amdgpu_device_skip_hw_access(adev))
> +		return;
> +
>   	if (!amdgpu_sriov_runtime(adev) &&
>   		amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags, hwip, true, &rlcg_flag)) {
>   		amdgpu_virt_rlcg_reg_rw(adev, offset, value, rlcg_flag, xcc_id);
> @@ -1064,6 +1070,9 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
>   {
>   	u32 rlcg_flag;
>   
> +	if (amdgpu_device_skip_hw_access(adev))
> +		return 0;
> +
>   	if (!amdgpu_sriov_runtime(adev) &&
>   		amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags, hwip, false, &rlcg_flag))
>   		return amdgpu_virt_rlcg_reg_rw(adev, offset, 0, rlcg_flag, xcc_id);


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-28 17:23 ` [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started Yunxiang Li
@ 2024-05-29  6:38   ` Christian König
  2024-05-29 13:22     ` Li, Yunxiang (Teddy)
  0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2024-05-29  6:38 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

On 28.05.24 at 19:23, Yunxiang Li wrote:
> If a reset is triggered, there's no point in waiting for the fence
> anymore; it just makes the reset code wait a long time for the
> reset_domain read lock to be dropped.
>
> This also makes our reply to host FLR fast enough so the host doesn't
> timeout.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   | 7 +++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   | 3 ++-
>   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c    | 2 +-
>   drivers/gpu/drm/amd/amdgpu/mes_v12_0.c    | 2 +-
>   5 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 10832b470448..3c04f034d43e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -376,10 +376,12 @@ signed long amdgpu_fence_wait_polling(struct amdgpu_ring *ring,
>   				      uint32_t wait_seq,
>   				      signed long timeout)
>   {
> -
> +	int in_reset = amdgpu_in_reset(ring->adev);
>   	while ((int32_t)(wait_seq - amdgpu_fence_read(ring)) > 0 && timeout > 0) {
>   		udelay(2);
>   		timeout -= 2;
> +		if (!in_reset && amdgpu_in_reset(ring->adev))

Clear NAK to that approach. This is just a pretty unstable hack.

It's perfectly possible that the reset has already started before we 
enter the function.

Regards,
Christian.

> +			return 0;
>   	}
>   	return timeout > 0 ? timeout : 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index 8c6b0987919f..dd22fd93f572 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -32,14 +32,17 @@
>   #define AMDGPU_MES_MAX_NUM_OF_QUEUES_PER_PROCESS 1024
>   #define AMDGPU_ONE_DOORBELL_SIZE 8
>   
> -signed long amdgpu_mes_fence_wait_polling(u64 *fence,
> +signed long amdgpu_mes_fence_wait_polling(struct amdgpu_device *adev,
> +					  u64 *fence,
>   					  u64 wait_seq,
>   					  signed long timeout)
>   {
> -
> +	int in_reset = amdgpu_in_reset(adev);
>   	while ((s64)(wait_seq - *fence) > 0 && timeout > 0) {
>   		udelay(2);
>   		timeout -= 2;
> +		if (!in_reset && amdgpu_in_reset(adev))
> +			return 0;
>   	}
>   	return timeout > 0 ? timeout : 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index b99a2b3cffe3..064cb3995a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -340,7 +340,8 @@ struct amdgpu_mes_funcs {
>   #define amdgpu_mes_kiq_hw_init(adev) (adev)->mes.kiq_hw_init((adev))
>   #define amdgpu_mes_kiq_hw_fini(adev) (adev)->mes.kiq_hw_fini((adev))
>   
> -signed long amdgpu_mes_fence_wait_polling(u64 *fence,
> +signed long amdgpu_mes_fence_wait_polling(struct amdgpu_device *adev,
> +					  u64 *fence,
>   					  u64 wait_seq,
>   					  signed long timeout);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> index 96629d8130b8..38edd60c6789 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> @@ -212,7 +212,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
>   	else
>   		dev_dbg(adev->dev, "MES msg=%d was emitted\n", x_pkt->header.opcode);
>   
> -	r = amdgpu_mes_fence_wait_polling(fence_ptr, (u64)1, timeout);
> +	r = amdgpu_mes_fence_wait_polling(adev, fence_ptr, (u64)1, timeout);
>   	amdgpu_device_wb_free(adev, fence_offset);
>   	if (r < 1) {
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
> index c5a03b79f07e..73430b9c4b27 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
> @@ -202,7 +202,7 @@ static int mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
>   	else
>   		dev_dbg(adev->dev, "MES msg=%d was emitted\n", x_pkt->header.opcode);
>   
> -	r = amdgpu_mes_fence_wait_polling(fence_ptr, (u64)1, timeout);
> +	r = amdgpu_mes_fence_wait_polling(adev, fence_ptr, (u64)1, timeout);
>   	amdgpu_device_wb_free(adev, fence_offset);
>   
>   	if (r < 1) {


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler
  2024-05-28 17:23 ` [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler Yunxiang Li
@ 2024-05-29  6:41   ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-29  6:41 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

Am 28.05.24 um 19:23 schrieb Yunxiang Li:
> We send back the ready to reset message before we stop anything. This is
> wrong. Move it to when we are actually ready for the FLR to happen.
>
> In the current state, since we take tens of seconds to stop everything,
> it is very likely that the host would give up waiting and reset the GPU
> before we send ready, so the behavior would be the same as before. But
> this gets rid of the hack with reset_domain locking and also lets us
> know how slow the reset actually is on the host. The pre-reset speed can
> thus be improved later.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>

It looks like a nice cleanup to me, but that is absolutely not my field 
of expertise.

But feel free to add an Acked-by: Christian König <christian.koenig@amd.com>

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 14 ++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 ++
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 37 ++++++++--------------
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 37 ++++++++--------------
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 ----
>   6 files changed, 46 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index bf1a6593dc5e..eb77b4ec3cb4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5069,6 +5069,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   	struct amdgpu_hive_info *hive = NULL;
>   
>   	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
> +		amdgpu_virt_ready_to_reset(adev);
> +		amdgpu_virt_wait_reset(adev);
>   		clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
>   		r = amdgpu_virt_request_full_gpu(adev, true);
>   	} else {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 3cf8416f8cb0..44450507c140 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -152,6 +152,20 @@ void amdgpu_virt_request_init_data(struct amdgpu_device *adev)
>   		DRM_WARN("host doesn't support REQ_INIT_DATA handshake\n");
>   }
>   
> +/**
> + * amdgpu_virt_ready_to_reset() - send ready to reset to host
> + * @adev:	amdgpu device.
> + * Send the ready-to-reset message to the GPU hypervisor to signal that we
> + * have stopped GPU activity and are ready for the host FLR
> + */
> +void amdgpu_virt_ready_to_reset(struct amdgpu_device *adev)
> +{
> +	struct amdgpu_virt *virt = &adev->virt;
> +
> +	if (virt->ops && virt->ops->ready_to_reset)
> +		virt->ops->ready_to_reset(adev);
> +}
> +
>   /**
>    * amdgpu_virt_wait_reset() - wait for reset gpu completed
>    * @adev:	amdgpu device.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index 642f1fd287d8..66de5380d9a1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -88,6 +88,7 @@ struct amdgpu_virt_ops {
>   	int (*rel_full_gpu)(struct amdgpu_device *adev, bool init);
>   	int (*req_init_data)(struct amdgpu_device *adev);
>   	int (*reset_gpu)(struct amdgpu_device *adev);
> +	void (*ready_to_reset)(struct amdgpu_device *adev);
>   	int (*wait_reset)(struct amdgpu_device *adev);
>   	void (*trans_msg)(struct amdgpu_device *adev, enum idh_request req,
>   			  u32 data1, u32 data2, u32 data3);
> @@ -345,6 +346,7 @@ int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
>   int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
>   int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
>   void amdgpu_virt_request_init_data(struct amdgpu_device *adev);
> +void amdgpu_virt_ready_to_reset(struct amdgpu_device *adev);
>   int amdgpu_virt_wait_reset(struct amdgpu_device *adev);
>   int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
>   void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index f4c47492e0cd..3fdd1fc84723 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -249,38 +249,28 @@ static int xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device *adev,
>   	return 0;
>   }
>   
> -static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
> +static void xgpu_ai_ready_to_reset(struct amdgpu_device *adev)
>   {
> -	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> -	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
> -	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
> -
> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
> -	 * otherwise the mailbox msg will be ruined/reseted by
> -	 * the VF FLR.
> -	 */
> -	if (atomic_cmpxchg(&adev->reset_domain->in_gpu_reset, 0, 1) != 0)
> -		return;
> -
> -	down_write(&adev->reset_domain->sem);
> -
> -	amdgpu_virt_fini_data_exchange(adev);
> -
>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
> +}
>   
> +static int xgpu_ai_wait_reset(struct amdgpu_device *adev)
> +{
> +	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>   	do {
>   		if (xgpu_ai_mailbox_peek_msg(adev) == IDH_FLR_NOTIFICATION_CMPL)
> -			goto flr_done;
> -
> +			return 0;
>   		msleep(10);
>   		timeout -= 10;
>   	} while (timeout > 1);
> -
>   	dev_warn(adev->dev, "waiting IDH_FLR_NOTIFICATION_CMPL timeout\n");
> +	return -ETIME;
> +}
>   
> -flr_done:
> -	atomic_set(&adev->reset_domain->in_gpu_reset, 0);
> -	up_write(&adev->reset_domain->sem);
> +static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
> +{
> +	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> +	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   
>   	/* Trigger recovery for world switch failure if no TDR */
>   	if (amdgpu_device_should_recover_gpu(adev)
> @@ -417,7 +407,8 @@ const struct amdgpu_virt_ops xgpu_ai_virt_ops = {
>   	.req_full_gpu	= xgpu_ai_request_full_gpu_access,
>   	.rel_full_gpu	= xgpu_ai_release_full_gpu_access,
>   	.reset_gpu = xgpu_ai_request_reset,
> -	.wait_reset = NULL,
> +	.ready_to_reset = xgpu_ai_ready_to_reset,
> +	.wait_reset = xgpu_ai_wait_reset,
>   	.trans_msg = xgpu_ai_mailbox_trans_msg,
>   	.req_init_data  = xgpu_ai_request_init_data,
>   	.ras_poison_handler = xgpu_ai_ras_poison_handler,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index 37b49a5ed2a1..cd6ec1afff2b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -282,38 +282,28 @@ static int xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device *adev,
>   	return 0;
>   }
>   
> -static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
> +static void xgpu_nv_ready_to_reset(struct amdgpu_device *adev)
>   {
> -	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> -	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
> -	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
> -
> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
> -	 * otherwise the mailbox msg will be ruined/reseted by
> -	 * the VF FLR.
> -	 */
> -	if (atomic_cmpxchg(&adev->reset_domain->in_gpu_reset, 0, 1) != 0)
> -		return;
> -
> -	down_write(&adev->reset_domain->sem);
> -
> -	amdgpu_virt_fini_data_exchange(adev);
> -
>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
> +}
>   
> +static int xgpu_nv_wait_reset(struct amdgpu_device *adev)
> +{
> +	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>   	do {
>   		if (xgpu_nv_mailbox_peek_msg(adev) == IDH_FLR_NOTIFICATION_CMPL)
> -			goto flr_done;
> -
> +			return 0;
>   		msleep(10);
>   		timeout -= 10;
>   	} while (timeout > 1);
> -
>   	dev_warn(adev->dev, "waiting IDH_FLR_NOTIFICATION_CMPL timeout\n");
> +	return -ETIME;
> +}
>   
> -flr_done:
> -	atomic_set(&adev->reset_domain->in_gpu_reset, 0);
> -	up_write(&adev->reset_domain->sem);
> +static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
> +{
> +	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> +	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   
>   	/* Trigger recovery for world switch failure if no TDR */
>   	if (amdgpu_device_should_recover_gpu(adev)
> @@ -455,7 +445,8 @@ const struct amdgpu_virt_ops xgpu_nv_virt_ops = {
>   	.rel_full_gpu	= xgpu_nv_release_full_gpu_access,
>   	.req_init_data  = xgpu_nv_request_init_data,
>   	.reset_gpu = xgpu_nv_request_reset,
> -	.wait_reset = NULL,
> +	.ready_to_reset = xgpu_nv_ready_to_reset,
> +	.wait_reset = xgpu_nv_wait_reset,
>   	.trans_msg = xgpu_nv_mailbox_trans_msg,
>   	.ras_poison_handler = xgpu_nv_ras_poison_handler,
>   };
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index 78cd07744ebe..e1d63bed84bf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -515,12 +515,6 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>   	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   
> -	/* wait until RCV_MSG become 3 */
> -	if (xgpu_vi_poll_msg(adev, IDH_FLR_NOTIFICATION_CMPL)) {
> -		pr_err("failed to receive FLR_CMPL\n");
> -		return;
> -	}
> -
>   	/* Trigger recovery due to world switch failure */
>   	if (amdgpu_device_should_recover_gpu(adev)) {
>   		struct amdgpu_reset_context reset_context;


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting
  2024-05-28 17:23 ` [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
@ 2024-05-29  6:41   ` Christian König
  2024-05-29 23:04   ` Felix Kuehling
  1 sibling, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-29  6:41 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

Am 28.05.24 um 19:23 schrieb Yunxiang Li:
> is_hws_hang and is_resetting serve pretty much the same purpose, and
> both duplicate the work of the reset_domain lock; just check that
> directly instead. This also eliminates a few bugs listed below and gets
> rid of dqm->ops.pre_reset.
>
> kfd_hws_hang does not need to avoid scheduling another reset. If the
> ongoing reset decides to skip the GPU reset we have a bad time anyway;
> otherwise the extra reset will simply get cancelled.
>
> remove_queue_mes forgot to check the is_resetting flag, unlike the
> pre-MES path unmap_queue_cpsch, so it did not correctly block hw access
> during reset.

Sounds like the correct approach to me as well, but Felix needs to take 
a look at this.

Christian.

>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 -
>   .../drm/amd/amdkfd/kfd_device_queue_manager.c | 79 ++++++++-----------
>   .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
>   drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 11 ++-
>   .../gpu/drm/amd/amdkfd/kfd_packet_manager.c   |  4 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  4 +-
>   .../amd/amdkfd/kfd_process_queue_manager.c    |  4 +-
>   7 files changed, 45 insertions(+), 59 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index fba9b9a258a5..3e0f46d60de5 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -935,7 +935,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>   	for (i = 0; i < kfd->num_nodes; i++) {
>   		node = kfd->nodes[i];
>   		kfd_smi_event_update_gpu_reset(node, false);
> -		node->dqm->ops.pre_reset(node->dqm);
>   	}
>   
>   	kgd2kfd_suspend(kfd, false);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 4721b2fccd06..3a2dc31279a4 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -35,6 +35,7 @@
>   #include "cik_regs.h"
>   #include "kfd_kernel_queue.h"
>   #include "amdgpu_amdkfd.h"
> +#include "amdgpu_reset.h"
>   #include "mes_api_def.h"
>   #include "kfd_debug.h"
>   
> @@ -155,14 +156,7 @@ static void kfd_hws_hang(struct device_queue_manager *dqm)
>   	/*
>   	 * Issue a GPU reset if HWS is unresponsive
>   	 */
> -	dqm->is_hws_hang = true;
> -
> -	/* It's possible we're detecting a HWS hang in the
> -	 * middle of a GPU reset. No need to schedule another
> -	 * reset in this case.
> -	 */
> -	if (!dqm->is_resetting)
> -		schedule_work(&dqm->hw_exception_work);
> +	schedule_work(&dqm->hw_exception_work);
>   }
>   
>   static int convert_to_mes_queue_type(int queue_type)
> @@ -194,7 +188,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
>   	int r, queue_type;
>   	uint64_t wptr_addr_off;
>   
> -	if (dqm->is_hws_hang)
> +	if (!down_read_trylock(&adev->reset_domain->sem))
>   		return -EIO;
>   
>   	memset(&queue_input, 0x0, sizeof(struct mes_add_queue_input));
> @@ -245,6 +239,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
>   	amdgpu_mes_lock(&adev->mes);
>   	r = adev->mes.funcs->add_hw_queue(&adev->mes, &queue_input);
>   	amdgpu_mes_unlock(&adev->mes);
> +	up_read(&adev->reset_domain->sem);
>   	if (r) {
>   		dev_err(adev->dev, "failed to add hardware queue to MES, doorbell=0x%x\n",
>   			q->properties.doorbell_off);
> @@ -262,7 +257,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,
>   	int r;
>   	struct mes_remove_queue_input queue_input;
>   
> -	if (dqm->is_hws_hang)
> +	if (!down_read_trylock(&adev->reset_domain->sem))
>   		return -EIO;
>   
>   	memset(&queue_input, 0x0, sizeof(struct mes_remove_queue_input));
> @@ -272,6 +267,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,
>   	amdgpu_mes_lock(&adev->mes);
>   	r = adev->mes.funcs->remove_hw_queue(&adev->mes, &queue_input);
>   	amdgpu_mes_unlock(&adev->mes);
> +	up_read(&adev->reset_domain->sem);
>   
>   	if (r) {
>   		dev_err(adev->dev, "failed to remove hardware queue from MES, doorbell=0x%x\n",
> @@ -1468,20 +1464,13 @@ static int stop_nocpsch(struct device_queue_manager *dqm)
>   	}
>   
>   	if (dqm->dev->adev->asic_type == CHIP_HAWAII)
> -		pm_uninit(&dqm->packet_mgr, false);
> +		pm_uninit(&dqm->packet_mgr);
>   	dqm->sched_running = false;
>   	dqm_unlock(dqm);
>   
>   	return 0;
>   }
>   
> -static void pre_reset(struct device_queue_manager *dqm)
> -{
> -	dqm_lock(dqm);
> -	dqm->is_resetting = true;
> -	dqm_unlock(dqm);
> -}
> -
>   static int allocate_sdma_queue(struct device_queue_manager *dqm,
>   				struct queue *q, const uint32_t *restore_sdma_id)
>   {
> @@ -1669,8 +1658,6 @@ static int start_cpsch(struct device_queue_manager *dqm)
>   	init_interrupts(dqm);
>   
>   	/* clear hang status when driver try to start the hw scheduler */
> -	dqm->is_hws_hang = false;
> -	dqm->is_resetting = false;
>   	dqm->sched_running = true;
>   
>   	if (!dqm->dev->kfd->shared_resources.enable_mes)
> @@ -1700,7 +1687,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>   fail_allocate_vidmem:
>   fail_set_sched_resources:
>   	if (!dqm->dev->kfd->shared_resources.enable_mes)
> -		pm_uninit(&dqm->packet_mgr, false);
> +		pm_uninit(&dqm->packet_mgr);
>   fail_packet_manager_init:
>   	dqm_unlock(dqm);
>   	return retval;
> @@ -1708,22 +1695,17 @@ static int start_cpsch(struct device_queue_manager *dqm)
>   
>   static int stop_cpsch(struct device_queue_manager *dqm)
>   {
> -	bool hanging;
> -
>   	dqm_lock(dqm);
>   	if (!dqm->sched_running) {
>   		dqm_unlock(dqm);
>   		return 0;
>   	}
>   
> -	if (!dqm->is_hws_hang) {
> -		if (!dqm->dev->kfd->shared_resources.enable_mes)
> -			unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD, false);
> -		else
> -			remove_all_queues_mes(dqm);
> -	}
> +	if (!dqm->dev->kfd->shared_resources.enable_mes)
> +		unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD, false);
> +	else
> +		remove_all_queues_mes(dqm);
>   
> -	hanging = dqm->is_hws_hang || dqm->is_resetting;
>   	dqm->sched_running = false;
>   
>   	if (!dqm->dev->kfd->shared_resources.enable_mes)
> @@ -1731,7 +1713,7 @@ static int stop_cpsch(struct device_queue_manager *dqm)
>   
>   	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
>   	if (!dqm->dev->kfd->shared_resources.enable_mes)
> -		pm_uninit(&dqm->packet_mgr, hanging);
> +		pm_uninit(&dqm->packet_mgr);
>   	dqm_unlock(dqm);
>   
>   	return 0;
> @@ -1957,24 +1939,24 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>   {
>   	struct device *dev = dqm->dev->adev->dev;
>   	struct mqd_manager *mqd_mgr;
> -	int retval = 0;
> +	int retval;
>   
>   	if (!dqm->sched_running)
>   		return 0;
> -	if (dqm->is_hws_hang || dqm->is_resetting)
> -		return -EIO;
>   	if (!dqm->active_runlist)
> -		return retval;
> +		return 0;
> +	if (!down_read_trylock(&dqm->dev->adev->reset_domain->sem))
> +		return -EIO;
>   
>   	if (grace_period != USE_DEFAULT_GRACE_PERIOD) {
>   		retval = pm_update_grace_period(&dqm->packet_mgr, grace_period);
>   		if (retval)
> -			return retval;
> +			goto out;
>   	}
>   
>   	retval = pm_send_unmap_queue(&dqm->packet_mgr, filter, filter_param, reset);
>   	if (retval)
> -		return retval;
> +		goto out;
>   
>   	*dqm->fence_addr = KFD_FENCE_INIT;
>   	pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
> @@ -1985,7 +1967,7 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>   	if (retval) {
>   		dev_err(dev, "The cp might be in an unrecoverable state due to an unsuccessful queues preemption\n");
>   		kfd_hws_hang(dqm);
> -		return retval;
> +		goto out;
>   	}
>   
>   	/* In the current MEC firmware implementation, if compute queue
> @@ -2001,7 +1983,8 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>   		while (halt_if_hws_hang)
>   			schedule();
>   		kfd_hws_hang(dqm);
> -		return -ETIME;
> +		retval = -ETIME;
> +		goto out;
>   	}
>   
>   	/* We need to reset the grace period value for this device */
> @@ -2014,6 +1997,8 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>   	pm_release_ib(&dqm->packet_mgr);
>   	dqm->active_runlist = false;
>   
> +out:
> +	up_read(&dqm->dev->adev->reset_domain->sem);
>   	return retval;
>   }
>   
> @@ -2040,13 +2025,13 @@ static int execute_queues_cpsch(struct device_queue_manager *dqm,
>   {
>   	int retval;
>   
> -	if (dqm->is_hws_hang)
> +	if (!down_read_trylock(&dqm->dev->adev->reset_domain->sem))
>   		return -EIO;
>   	retval = unmap_queues_cpsch(dqm, filter, filter_param, grace_period, false);
> -	if (retval)
> -		return retval;
> -
> -	return map_queues_cpsch(dqm);
> +	if (!retval)
> +		retval = map_queues_cpsch(dqm);
> +	up_read(&dqm->dev->adev->reset_domain->sem);
> +	return retval;
>   }
>   
>   static int wait_on_destroy_queue(struct device_queue_manager *dqm,
> @@ -2427,10 +2412,12 @@ static int process_termination_cpsch(struct device_queue_manager *dqm,
>   	if (!dqm->dev->kfd->shared_resources.enable_mes)
>   		retval = execute_queues_cpsch(dqm, filter, 0, USE_DEFAULT_GRACE_PERIOD);
>   
> -	if ((!dqm->is_hws_hang) && (retval || qpd->reset_wavefronts)) {
> +	if ((retval || qpd->reset_wavefronts) &&
> +	    down_read_trylock(&dqm->dev->adev->reset_domain->sem)) {
>   		pr_warn("Resetting wave fronts (cpsch) on dev %p\n", dqm->dev);
>   		dbgdev_wave_reset_wavefronts(dqm->dev, qpd->pqm->process);
>   		qpd->reset_wavefronts = false;
> +		up_read(&dqm->dev->adev->reset_domain->sem);
>   	}
>   
>   	/* Lastly, free mqd resources.
> @@ -2537,7 +2524,6 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
>   		dqm->ops.initialize = initialize_cpsch;
>   		dqm->ops.start = start_cpsch;
>   		dqm->ops.stop = stop_cpsch;
> -		dqm->ops.pre_reset = pre_reset;
>   		dqm->ops.destroy_queue = destroy_queue_cpsch;
>   		dqm->ops.update_queue = update_queue;
>   		dqm->ops.register_process = register_process;
> @@ -2558,7 +2544,6 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
>   		/* initialize dqm for no cp scheduling */
>   		dqm->ops.start = start_nocpsch;
>   		dqm->ops.stop = stop_nocpsch;
> -		dqm->ops.pre_reset = pre_reset;
>   		dqm->ops.create_queue = create_queue_nocpsch;
>   		dqm->ops.destroy_queue = destroy_queue_nocpsch;
>   		dqm->ops.update_queue = update_queue;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> index fcc0ee67f544..3b9b8eabaacc 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> @@ -152,7 +152,6 @@ struct device_queue_manager_ops {
>   	int	(*initialize)(struct device_queue_manager *dqm);
>   	int	(*start)(struct device_queue_manager *dqm);
>   	int	(*stop)(struct device_queue_manager *dqm);
> -	void	(*pre_reset)(struct device_queue_manager *dqm);
>   	void	(*uninitialize)(struct device_queue_manager *dqm);
>   	int	(*create_kernel_queue)(struct device_queue_manager *dqm,
>   					struct kernel_queue *kq,
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
> index 32c926986dbb..3ea75a9d86ec 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
> @@ -32,6 +32,7 @@
>   #include "kfd_device_queue_manager.h"
>   #include "kfd_pm4_headers.h"
>   #include "kfd_pm4_opcodes.h"
> +#include "amdgpu_reset.h"
>   
>   #define PM4_COUNT_ZERO (((1 << 15) - 1) << 16)
>   
> @@ -196,15 +197,17 @@ static bool kq_initialize(struct kernel_queue *kq, struct kfd_node *dev,
>   }
>   
>   /* Uninitialize a kernel queue and free all its memory usages. */
> -static void kq_uninitialize(struct kernel_queue *kq, bool hanging)
> +static void kq_uninitialize(struct kernel_queue *kq)
>   {
> -	if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && !hanging)
> +	if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && down_read_trylock(&kq->dev->adev->reset_domain->sem)) {
>   		kq->mqd_mgr->destroy_mqd(kq->mqd_mgr,
>   					kq->queue->mqd,
>   					KFD_PREEMPT_TYPE_WAVEFRONT_RESET,
>   					KFD_UNMAP_LATENCY_MS,
>   					kq->queue->pipe,
>   					kq->queue->queue);
> +		up_read(&kq->dev->adev->reset_domain->sem);
> +	}
>   	else if (kq->queue->properties.type == KFD_QUEUE_TYPE_DIQ)
>   		kfd_gtt_sa_free(kq->dev, kq->fence_mem_obj);
>   
> @@ -344,9 +347,9 @@ struct kernel_queue *kernel_queue_init(struct kfd_node *dev,
>   	return NULL;
>   }
>   
> -void kernel_queue_uninit(struct kernel_queue *kq, bool hanging)
> +void kernel_queue_uninit(struct kernel_queue *kq)
>   {
> -	kq_uninitialize(kq, hanging);
> +	kq_uninitialize(kq);
>   	kfree(kq);
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> index 7332ad94eab8..a05d5c1097a8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> @@ -263,10 +263,10 @@ int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm)
>   	return 0;
>   }
>   
> -void pm_uninit(struct packet_manager *pm, bool hanging)
> +void pm_uninit(struct packet_manager *pm)
>   {
>   	mutex_destroy(&pm->lock);
> -	kernel_queue_uninit(pm->priv_queue, hanging);
> +	kernel_queue_uninit(pm->priv_queue);
>   	pm->priv_queue = NULL;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index c51e908f6f19..2b3ec92981e8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1301,7 +1301,7 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev);
>   void device_queue_manager_uninit(struct device_queue_manager *dqm);
>   struct kernel_queue *kernel_queue_init(struct kfd_node *dev,
>   					enum kfd_queue_type type);
> -void kernel_queue_uninit(struct kernel_queue *kq, bool hanging);
> +void kernel_queue_uninit(struct kernel_queue *kq);
>   int kfd_dqm_evict_pasid(struct device_queue_manager *dqm, u32 pasid);
>   
>   /* Process Queue Manager */
> @@ -1407,7 +1407,7 @@ extern const struct packet_manager_funcs kfd_v9_pm_funcs;
>   extern const struct packet_manager_funcs kfd_aldebaran_pm_funcs;
>   
>   int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm);
> -void pm_uninit(struct packet_manager *pm, bool hanging);
> +void pm_uninit(struct packet_manager *pm);
>   int pm_send_set_resources(struct packet_manager *pm,
>   				struct scheduling_resources *res);
>   int pm_send_runlist(struct packet_manager *pm, struct list_head *dqm_queues);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 6bf79c435f2e..86ea610b16f3 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -434,7 +434,7 @@ int pqm_create_queue(struct process_queue_manager *pqm,
>   err_create_queue:
>   	uninit_queue(q);
>   	if (kq)
> -		kernel_queue_uninit(kq, false);
> +		kernel_queue_uninit(kq);
>   	kfree(pqn);
>   err_allocate_pqn:
>   	/* check if queues list is empty unregister process from device */
> @@ -481,7 +481,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid)
>   		/* destroy kernel queue (DIQ) */
>   		dqm = pqn->kq->dev->dqm;
>   		dqm->ops.destroy_kernel_queue(dqm, pqn->kq, &pdd->qpd);
> -		kernel_queue_uninit(pqn->kq, false);
> +		kernel_queue_uninit(pqn->kq);
>   	}
>   
>   	if (pqn->q) {


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart
  2024-05-28 17:23 ` [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
@ 2024-05-29  6:43   ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-29  6:43 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

Am 28.05.24 um 19:23 schrieb Yunxiang Li:
> From: Likun Gao <Likun.Gao@amd.com>
>
> Remove the HDP flush for GC v11/v12 when enabling GART.
> Remove the TLB flush for GC v10/v11/v12 when enabling GART.

Maybe add something like "That is done for each GART mapping when it is 
created.".

>
> Signed-off-by: Likun Gao <Likun.Gao@amd.com>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>

With the commit message improved the patch is Reviewed-by: Christian 
König <christian.koenig@amd.com>.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 ---
>   drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 3 ---
>   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ---
>   drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 3 ---
>   drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 4 ----
>   5 files changed, 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index aba0a51be960..5740f94e3e44 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -4395,13 +4395,10 @@ static int gfx_v11_0_gfxhub_enable(struct amdgpu_device *adev)
>   	if (r)
>   		return r;
>   
> -	adev->hdp.funcs->flush_hdp(adev, NULL);
> -
>   	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
>   		false : true;
>   
>   	adev->gfxhub.funcs->set_fault_enable_default(adev, value);
> -	amdgpu_gmc_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> index 1ef9de41d193..5048b6eef9da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> @@ -3207,13 +3207,10 @@ static int gfx_v12_0_gfxhub_enable(struct amdgpu_device *adev)
>   	if (r)
>   		return r;
>   
> -	adev->hdp.funcs->flush_hdp(adev, NULL);
> -
>   	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
>   		false : true;
>   
>   	adev->gfxhub.funcs->set_fault_enable_default(adev, value);
> -	amdgpu_gmc_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index d933e19e0cf5..3e0ebe25a80f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -974,9 +974,6 @@ static int gmc_v10_0_gart_enable(struct amdgpu_device *adev)
>   	if (!adev->in_s0ix)
>   		adev->gfxhub.funcs->set_fault_enable_default(adev, value);
>   	adev->mmhub.funcs->set_fault_enable_default(adev, value);
> -	gmc_v10_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> -	if (!adev->in_s0ix)
> -		gmc_v10_0_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
>   
>   	DRM_INFO("PCIE GART of %uM enabled (table at 0x%016llX).\n",
>   		 (unsigned int)(adev->gmc.gart_size >> 20),
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> index 527dc917e049..cadbe55f0c8f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> @@ -891,9 +891,6 @@ static int gmc_v11_0_gart_enable(struct amdgpu_device *adev)
>   	if (r)
>   		return r;
>   
> -	/* Flush HDP after it is initialized */
> -	adev->hdp.funcs->flush_hdp(adev, NULL);
> -
>   	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
>   		false : true;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> index e2c6ec3cc4f3..a677aca69a06 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> @@ -861,14 +861,10 @@ static int gmc_v12_0_gart_enable(struct amdgpu_device *adev)
>   	if (r)
>   		return r;
>   
> -	/* Flush HDP after it is initialized */
> -	adev->hdp.funcs->flush_hdp(adev, NULL);
> -
>   	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
>   		false : true;
>   
>   	adev->mmhub.funcs->set_fault_enable_default(adev, value);
> -	gmc_v12_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
>   
>   	dev_info(adev->dev, "PCIE GART of %uM enabled (table at 0x%016llX).\n",
>   		 (unsigned)(adev->gmc.gart_size >> 20),


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover
  2024-05-28 17:23 ` [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
@ 2024-05-29  6:45   ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-29  6:45 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

Am 28.05.24 um 19:23 schrieb Yunxiang Li:
> At this point the GART is not set up yet, so there is no point in
> invalidating the TLB here and it could even be harmful.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 2 --
>   1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> index 44367f03316f..0760e70402ec 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
> @@ -200,8 +200,6 @@ void amdgpu_gtt_mgr_recover(struct amdgpu_gtt_mgr *mgr)
>   		amdgpu_ttm_recover_gart(node->base.bo);
>   	}
>   	spin_unlock(&mgr->lock);
> -
> -	amdgpu_gart_invalidate_tlb(adev);
>   }
>   
>   /**


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind
  2024-05-28 17:23 ` [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
@ 2024-05-29  6:46   ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-29  6:46 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

On 28.05.24 at 19:23, Yunxiang Li wrote:
> When the amdgpu_gart_invalidate_tlb helper was introduced, this part
> was left out of the conversion. Avoid the code duplication here.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> index c623e23049d1..eb172388d99e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> @@ -325,10 +325,7 @@ void amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
>   			page_base += AMDGPU_GPU_PAGE_SIZE;
>   		}
>   	}
> -	mb();
> -	amdgpu_device_flush_hdp(adev, NULL);
> -	for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
> -		amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
> +	amdgpu_gart_invalidate_tlb(adev);
>   
>   	drm_dev_exit(idx);
>   }


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb
  2024-05-28 17:23 ` [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
@ 2024-05-29  6:49   ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-29  6:49 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

On 28.05.24 at 19:23, Yunxiang Li wrote:
> Which method is used to flush the TLB does not depend on whether a
> reset is in progress or not. We should skip the flush altogether if the
> GPU is about to be reset, so put both paths under the reset_domain read lock.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

Maybe add CC: stable?

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 66 +++++++++++++------------
>   1 file changed, 34 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 603c0738fd03..4edd10b10a92 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -684,12 +684,17 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
>   	struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
>   	struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
>   	unsigned int ndw;
> -	signed long r;
> +	int r;
>   	uint32_t seq;
>   
> -	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> -	    !down_read_trylock(&adev->reset_domain->sem)) {
> +	/*
> +	 * A GPU reset should flush all TLBs anyway, so no need to do
> +	 * this while one is ongoing.
> +	 */
> +	if (!down_read_trylock(&adev->reset_domain->sem))
> +		return 0;
>   
> +	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready) {
>   		if (adev->gmc.flush_tlb_needs_extra_type_2)
>   			adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
>   								 2, all_hub,
> @@ -703,43 +708,40 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
>   		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
>   							 flush_type, all_hub,
>   							 inst);
> -		return 0;
> -	}
> +		r = 0;
> +	} else {
> +		/* 2 dwords flush + 8 dwords fence */
> +		ndw = kiq->pmf->invalidate_tlbs_size + 8;
>   
> -	/* 2 dwords flush + 8 dwords fence */
> -	ndw = kiq->pmf->invalidate_tlbs_size + 8;
> +		if (adev->gmc.flush_tlb_needs_extra_type_2)
> +			ndw += kiq->pmf->invalidate_tlbs_size;
>   
> -	if (adev->gmc.flush_tlb_needs_extra_type_2)
> -		ndw += kiq->pmf->invalidate_tlbs_size;
> +		if (adev->gmc.flush_tlb_needs_extra_type_0)
> +			ndw += kiq->pmf->invalidate_tlbs_size;
>   
> -	if (adev->gmc.flush_tlb_needs_extra_type_0)
> -		ndw += kiq->pmf->invalidate_tlbs_size;
> +		spin_lock(&adev->gfx.kiq[inst].ring_lock);
> +		amdgpu_ring_alloc(ring, ndw);
> +		if (adev->gmc.flush_tlb_needs_extra_type_2)
> +			kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 2, all_hub);
>   
> -	spin_lock(&adev->gfx.kiq[inst].ring_lock);
> -	amdgpu_ring_alloc(ring, ndw);
> -	if (adev->gmc.flush_tlb_needs_extra_type_2)
> -		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 2, all_hub);
> +		if (flush_type == 2 && adev->gmc.flush_tlb_needs_extra_type_0)
> +			kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 0, all_hub);
>   
> -	if (flush_type == 2 && adev->gmc.flush_tlb_needs_extra_type_0)
> -		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 0, all_hub);
> +		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, flush_type, all_hub);
> +		r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
> +		if (r) {
> +			amdgpu_ring_undo(ring);
> +			spin_unlock(&adev->gfx.kiq[inst].ring_lock);
> +			goto error_unlock_reset;
> +		}
>   
> -	kiq->pmf->kiq_invalidate_tlbs(ring, pasid, flush_type, all_hub);
> -	r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
> -	if (r) {
> -		amdgpu_ring_undo(ring);
> +		amdgpu_ring_commit(ring);
>   		spin_unlock(&adev->gfx.kiq[inst].ring_lock);
> -		goto error_unlock_reset;
> -	}
> -
> -	amdgpu_ring_commit(ring);
> -	spin_unlock(&adev->gfx.kiq[inst].ring_lock);
> -	r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);
> -	if (r < 1) {
> -		dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);
> -		r = -ETIME;
> -		goto error_unlock_reset;
> +		if (amdgpu_fence_wait_polling(ring, seq, usec_timeout) < 1) {
> +			dev_err(adev->dev, "timeout waiting for kiq fence\n");
> +			r = -ETIME;
> +		}
>   	}
> -	r = 0;
>   
>   error_unlock_reset:
>   	up_read(&adev->reset_domain->sem);


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
  2024-05-28 17:23 ` [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks Yunxiang Li
@ 2024-05-29  6:55   ` Christian König
  2024-05-30 22:02   ` Felix Kuehling
  1 sibling, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-29  6:55 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang



On 28.05.24 at 19:23, Yunxiang Li wrote:
> These functions are missing the reset domain lock.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c               | 4 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c                | 8 ++++++--
>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
>   3 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> index eb172388d99e..ddc5e9972da8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> @@ -34,6 +34,7 @@
>   #include <asm/set_memory.h>
>   #endif
>   #include "amdgpu.h"
> +#include "amdgpu_reset.h"
>   #include <drm/drm_drv.h>
>   #include <drm/ttm/ttm_tt.h>
>   
> @@ -401,13 +402,14 @@ void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev)
>   {
>   	int i;
>   
> -	if (!adev->gart.ptr)
> +	if (!adev->gart.ptr || !down_read_trylock(&adev->reset_domain->sem))
>   		return;
>   
>   	mb();
>   	amdgpu_device_flush_hdp(adev, NULL);
>   	for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
>   		amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
> +	up_read(&adev->reset_domain->sem);

That is clearly incorrect. amdgpu_gmc_flush_gpu_tlb() already takes the 
lock by itself.

But we might want to add this to amdgpu_gmc_flush_gpu_tlb() if the ring 
is NULL.

>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index e4742b65032d..52a3170d15b7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -307,8 +307,12 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>   		dev_dbg(adev->dev, "Skip scheduling IBs in ring(%s)",
>   			ring->name);
>   	} else {
> -		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
> -				       &fence);
> +		r = -ETIME;
> +		if (down_read_trylock(&adev->reset_domain->sem)) {
> +			r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
> +					       job, &fence);
> +			up_read(&adev->reset_domain->sem);
> +		}

Job submission is blocked by stopping the scheduler. If you add that
here you create a circular dependency.

>   		if (r)
>   			dev_err(adev->dev,
>   				"Error scheduling IBs (%d) in ring(%s)", r,
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 86ea610b16f3..21f5a1fb3bf8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -28,6 +28,7 @@
>   #include "kfd_priv.h"
>   #include "kfd_kernel_queue.h"
>   #include "amdgpu_amdkfd.h"
> +#include "amdgpu_reset.h"
>   
>   static inline struct process_queue_node *get_queue_by_qid(
>   			struct process_queue_manager *pqm, unsigned int qid)
> @@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
>   		return;
>   
>   	dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> -	if (dev->kfd->shared_resources.enable_mes)
> -		amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
> +	if (dev->kfd->shared_resources.enable_mes &&
> +	    down_read_trylock(&dev->adev->reset_domain->sem)) {
> +		amdgpu_mes_flush_shader_debugger(dev->adev,
> +						 pdd->proc_ctx_gpu_addr);
> +		up_read(&dev->adev->reset_domain->sem);
> +	}

That one is most likely correct, but Felix and/or somebody else from the 
KFD team needs to take a look.

Thanks,
Christian.

>   	pdd->already_dequeued = true;
>   }
>   


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29  6:38   ` Christian König
@ 2024-05-29 13:22     ` Li, Yunxiang (Teddy)
  2024-05-29 13:31       ` Christian König
  0 siblings, 1 reply; 52+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-05-29 13:22 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx@lists.freedesktop.org

[Public]

> It's perfectly possible that the reset has already started before we enter the function.

Yeah, this could and does happen, but it just means we are back to the old behavior. I guess I could use "can I take the read side lock?" to test whether the function is called outside of reset; would that be acceptable?

So like:
int not_in_reset = down_read_trylock(&adev->reset_domain->sem);
while (...) {
  if (not_in_reset && amdgpu_in_reset(adev))
      break;
}
if (not_in_reset)
   up_read(&adev->reset_domain->sem);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29 13:22     ` Li, Yunxiang (Teddy)
@ 2024-05-29 13:31       ` Christian König
  2024-05-29 13:44         ` Li, Yunxiang (Teddy)
  0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2024-05-29 13:31 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx@lists.freedesktop.org

On 29.05.24 at 15:22, Li, Yunxiang (Teddy) wrote:
> [Public]
>
>> It's perfectly possible that the reset has already started before we enter the function.
> Yeah, this could and does happen, but it just means we are back to the old behavior. I guess I could use "can I take the read side lock?" to test if the function is called outside of reset or not, would that be acceptable?
>
> So like:
> int not_in_reset = down_read_trylock(&adev->reset_domain->sem);
> while (...) {
>    if (not_in_reset && amdgpu_in_reset(adev))
>        break;
> }
> if (not_in_reset)
>     up_read(&adev->reset_domain->sem);

I don't think trying to add some reset handling here makes sense in the 
first place.

Part of the reset/recovery procedure is to signal all fences, and that
includes the one we are waiting for here.

So this wait should return immediately in a reset anyway.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29 13:31       ` Christian König
@ 2024-05-29 13:44         ` Li, Yunxiang (Teddy)
  2024-05-29 13:55           ` Christian König
  0 siblings, 1 reply; 52+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-05-29 13:44 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx@lists.freedesktop.org

[AMD Official Use Only - AMD Internal Distribution Only]

> I don't think trying to add some reset handling here makes sense in the first place.
> Part of the reset/recovery procedure is to signal all fences, and that includes the one we are waiting for here.
> So this wait should return immediately in a reset anyway.

As far as I can tell, these fence_ptrs that get polled are not packaged into a fence object, and in practice I see tens of seconds of waiting before they time out and the reset can begin. Also, after the reset there is often a long wait, up to 2 minutes, for all the tlb_fence_work to time out (not addressed by this patch; I still haven't figured out what's going on there)

Teddy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29 13:44         ` Li, Yunxiang (Teddy)
@ 2024-05-29 13:55           ` Christian König
  2024-05-29 14:31             ` Li, Yunxiang (Teddy)
  0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2024-05-29 13:55 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx@lists.freedesktop.org

On 29.05.24 at 15:44, Li, Yunxiang (Teddy) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> I don't think trying to add some reset handling here makes sense in the first place.
>> Part of the reset/recovery procedure is to signal all fences, and that includes the one we are waiting for here.
>> So this wait should return immediately in a reset anyway.
> As far as I can tell, these fence_ptrs that get polled are not packaged into a fence object, and in practice I see tens of seconds of waiting before they time out and the reset can begin. Also, after the reset there is often a long wait, up to 2 minutes, for all the tlb_fence_work to time out (not addressed by this patch; I still haven't figured out what's going on there)

The problem is that we don't force-complete the non-scheduler rings,
e.g. MES, KIQ, etc.

Try to remove this check here from the loop in 
amdgpu_device_pre_asic_reset():

                 if (!amdgpu_ring_sched_ready(ring))
                         continue;

Regards,
Christian.


>
> Teddy


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29 13:55           ` Christian König
@ 2024-05-29 14:31             ` Li, Yunxiang (Teddy)
  2024-05-29 14:35               ` Christian König
  0 siblings, 1 reply; 52+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-05-29 14:31 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx@lists.freedesktop.org

[Public]

> The problem is that we don't force complete the non scheduler rings, e.g. MES,
> KIQ etc...
>
> Try to remove this check here from the loop in
> amdgpu_device_pre_asic_reset():
>
>                  if (!amdgpu_ring_sched_ready(ring))
>                          continue;

Ah, I see. Though I don't think this would work for the MES case, since each submission grabs its own wb address rather than using the ring's.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29 14:31             ` Li, Yunxiang (Teddy)
@ 2024-05-29 14:35               ` Christian König
  2024-05-29 14:48                 ` Li, Yunxiang (Teddy)
  0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2024-05-29 14:35 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx@lists.freedesktop.org

On 29.05.24 at 16:31, Li, Yunxiang (Teddy) wrote:
> [Public]
>
>> The problem is that we don't force complete the non scheduler rings, e.g. MES,
>> KIQ etc...
>>
>> Try to remove this check here from the loop in
>> amdgpu_device_pre_asic_reset():
>>
>>                   if (!amdgpu_ring_sched_ready(ring))
>>                           continue;
> Ah, I see. Though I don't think this would work for the MES case, since each submission grabs its own wb address rather than using the ring's.

Yeah, I know. That's one of the reasons why, on the patch that added
this, I pointed out that the behavior is actually completely broken.

If you run into issues with the MES because of this then please suggest 
a revert of that patch.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29 14:35               ` Christian König
@ 2024-05-29 14:48                 ` Li, Yunxiang (Teddy)
  2024-05-29 15:19                   ` Christian König
  0 siblings, 1 reply; 52+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-05-29 14:48 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx@lists.freedesktop.org

[AMD Official Use Only - AMD Internal Distribution Only]

> Yeah, I know. That's one of the reason I've pointed out on the patch adding
> that that this behavior is actually completely broken.
>
> If you run into issues with the MES because of this then please suggest a
> revert of that patch.

I think it just needs to be improved to allow this force-signal behavior. The current behavior is slow/inconvenient, but the old behavior is wrong, since MES will continue processing submissions even when one submission fails. So with just one fence location there's no way to tell whether a command failed or not.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29 14:48                 ` Li, Yunxiang (Teddy)
@ 2024-05-29 15:19                   ` Christian König
  2024-05-31 14:44                     ` Liu, Shaoyun
  0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2024-05-29 15:19 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), Koenig, Christian,
	amd-gfx@lists.freedesktop.org

On 29.05.24 at 16:48, Li, Yunxiang (Teddy) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> Yeah, I know. That's one of the reason I've pointed out on the patch adding
>> that that this behavior is actually completely broken.
>>
>> If you run into issues with the MES because of this then please suggest a
>> revert of that patch.
> I think it just needs to be improved to allow this force-signal behavior. The current behavior is slow/inconvenient, but the old behavior is wrong, since MES will continue processing submissions even when one submission fails. So with just one fence location there's no way to tell whether a command failed or not.

No, the MES behavior is broken. When a submission fails it should stop
processing or signal through some other mechanism that the operation
didn't complete.

Just not writing the fence and continuing results in tons of problems, 
from the TLB fence all the way to the ring buffer and reset handling.

This is a hard requirement and really can't be changed.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting
  2024-05-28 17:23 ` [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
  2024-05-29  6:41   ` Christian König
@ 2024-05-29 23:04   ` Felix Kuehling
  2024-05-30  0:06     ` Li, Yunxiang (Teddy)
  1 sibling, 1 reply; 52+ messages in thread
From: Felix Kuehling @ 2024-05-29 23:04 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang



On 2024-05-28 13:23, Yunxiang Li wrote:
> is_hws_hang and is_resetting serve pretty much the same purpose and
> both duplicate the work of the reset_domain lock, so just check that
> directly instead. This also eliminates a few bugs listed below and gets
> rid of dqm->ops.pre_reset.
> 
> kfd_hws_hang did not need to avoid scheduling another reset. If the
> ongoing reset decided to skip the GPU reset we have a bad time;
> otherwise the extra reset will get cancelled anyway.
> 
> remove_queue_mes forgot to check the is_resetting flag, unlike the
> pre-MES path unmap_queue_cpsch, so it did not correctly block HW
> access during reset.
> 
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>

The patch looks good to me. It's been years since I worked on HWS hang and GPU reset handling in KFD, and at the time the reset domain stuff didn't exist. The result of this patch looks a lot cleaner, which is good. If there are regressions, they are hopefully not too hard to fix.

One thing I could see going wrong is that down_read_trylock(&dqm->dev->adev->reset_domain->sem) will not fail immediately when the reset is scheduled. So there may be multiple attempts at HW access that detect an error or time out, which may get the HW into a worse state or delay the actual reset.

At a minimum, I'd recommend testing this with /sys/kernel/debug/hang_hws on a pre-MES GPU, while some ROCm workload is running.

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 -
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 79 ++++++++-----------
>  .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
>  drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 11 ++-
>  .../gpu/drm/amd/amdkfd/kfd_packet_manager.c   |  4 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  4 +-
>  .../amd/amdkfd/kfd_process_queue_manager.c    |  4 +-
>  7 files changed, 45 insertions(+), 59 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index fba9b9a258a5..3e0f46d60de5 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -935,7 +935,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>  	for (i = 0; i < kfd->num_nodes; i++) {
>  		node = kfd->nodes[i];
>  		kfd_smi_event_update_gpu_reset(node, false);
> -		node->dqm->ops.pre_reset(node->dqm);
>  	}
>  
>  	kgd2kfd_suspend(kfd, false);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 4721b2fccd06..3a2dc31279a4 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -35,6 +35,7 @@
>  #include "cik_regs.h"
>  #include "kfd_kernel_queue.h"
>  #include "amdgpu_amdkfd.h"
> +#include "amdgpu_reset.h"
>  #include "mes_api_def.h"
>  #include "kfd_debug.h"
>  
> @@ -155,14 +156,7 @@ static void kfd_hws_hang(struct device_queue_manager *dqm)
>  	/*
>  	 * Issue a GPU reset if HWS is unresponsive
>  	 */
> -	dqm->is_hws_hang = true;
> -
> -	/* It's possible we're detecting a HWS hang in the
> -	 * middle of a GPU reset. No need to schedule another
> -	 * reset in this case.
> -	 */
> -	if (!dqm->is_resetting)
> -		schedule_work(&dqm->hw_exception_work);
> +	schedule_work(&dqm->hw_exception_work);
>  }
>  
>  static int convert_to_mes_queue_type(int queue_type)
> @@ -194,7 +188,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
>  	int r, queue_type;
>  	uint64_t wptr_addr_off;
>  
> -	if (dqm->is_hws_hang)
> +	if (!down_read_trylock(&adev->reset_domain->sem))
>  		return -EIO;
>  
>  	memset(&queue_input, 0x0, sizeof(struct mes_add_queue_input));
> @@ -245,6 +239,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
>  	amdgpu_mes_lock(&adev->mes);
>  	r = adev->mes.funcs->add_hw_queue(&adev->mes, &queue_input);
>  	amdgpu_mes_unlock(&adev->mes);
> +	up_read(&adev->reset_domain->sem);
>  	if (r) {
>  		dev_err(adev->dev, "failed to add hardware queue to MES, doorbell=0x%x\n",
>  			q->properties.doorbell_off);
> @@ -262,7 +257,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,
>  	int r;
>  	struct mes_remove_queue_input queue_input;
>  
> -	if (dqm->is_hws_hang)
> +	if (!down_read_trylock(&adev->reset_domain->sem))
>  		return -EIO;
>  
>  	memset(&queue_input, 0x0, sizeof(struct mes_remove_queue_input));
> @@ -272,6 +267,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,
>  	amdgpu_mes_lock(&adev->mes);
>  	r = adev->mes.funcs->remove_hw_queue(&adev->mes, &queue_input);
>  	amdgpu_mes_unlock(&adev->mes);
> +	up_read(&adev->reset_domain->sem);
>  
>  	if (r) {
>  		dev_err(adev->dev, "failed to remove hardware queue from MES, doorbell=0x%x\n",
> @@ -1468,20 +1464,13 @@ static int stop_nocpsch(struct device_queue_manager *dqm)
>  	}
>  
>  	if (dqm->dev->adev->asic_type == CHIP_HAWAII)
> -		pm_uninit(&dqm->packet_mgr, false);
> +		pm_uninit(&dqm->packet_mgr);
>  	dqm->sched_running = false;
>  	dqm_unlock(dqm);
>  
>  	return 0;
>  }
>  
> -static void pre_reset(struct device_queue_manager *dqm)
> -{
> -	dqm_lock(dqm);
> -	dqm->is_resetting = true;
> -	dqm_unlock(dqm);
> -}
> -
>  static int allocate_sdma_queue(struct device_queue_manager *dqm,
>  				struct queue *q, const uint32_t *restore_sdma_id)
>  {
> @@ -1669,8 +1658,6 @@ static int start_cpsch(struct device_queue_manager *dqm)
>  	init_interrupts(dqm);
>  
>  	/* clear hang status when driver try to start the hw scheduler */
> -	dqm->is_hws_hang = false;
> -	dqm->is_resetting = false;
>  	dqm->sched_running = true;
>  
>  	if (!dqm->dev->kfd->shared_resources.enable_mes)
> @@ -1700,7 +1687,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
>  fail_allocate_vidmem:
>  fail_set_sched_resources:
>  	if (!dqm->dev->kfd->shared_resources.enable_mes)
> -		pm_uninit(&dqm->packet_mgr, false);
> +		pm_uninit(&dqm->packet_mgr);
>  fail_packet_manager_init:
>  	dqm_unlock(dqm);
>  	return retval;
> @@ -1708,22 +1695,17 @@ static int start_cpsch(struct device_queue_manager *dqm)
>  
>  static int stop_cpsch(struct device_queue_manager *dqm)
>  {
> -	bool hanging;
> -
>  	dqm_lock(dqm);
>  	if (!dqm->sched_running) {
>  		dqm_unlock(dqm);
>  		return 0;
>  	}
>  
> -	if (!dqm->is_hws_hang) {
> -		if (!dqm->dev->kfd->shared_resources.enable_mes)
> -			unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD, false);
> -		else
> -			remove_all_queues_mes(dqm);
> -	}
> +	if (!dqm->dev->kfd->shared_resources.enable_mes)
> +		unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD, false);
> +	else
> +		remove_all_queues_mes(dqm);
>  
> -	hanging = dqm->is_hws_hang || dqm->is_resetting;
>  	dqm->sched_running = false;
>  
>  	if (!dqm->dev->kfd->shared_resources.enable_mes)
> @@ -1731,7 +1713,7 @@ static int stop_cpsch(struct device_queue_manager *dqm)
>  
>  	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
>  	if (!dqm->dev->kfd->shared_resources.enable_mes)
> -		pm_uninit(&dqm->packet_mgr, hanging);
> +		pm_uninit(&dqm->packet_mgr);
>  	dqm_unlock(dqm);
>  
>  	return 0;
> @@ -1957,24 +1939,24 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>  {
>  	struct device *dev = dqm->dev->adev->dev;
>  	struct mqd_manager *mqd_mgr;
> -	int retval = 0;
> +	int retval;
>  
>  	if (!dqm->sched_running)
>  		return 0;
> -	if (dqm->is_hws_hang || dqm->is_resetting)
> -		return -EIO;
>  	if (!dqm->active_runlist)
> -		return retval;
> +		return 0;
> +	if (!down_read_trylock(&dqm->dev->adev->reset_domain->sem))
> +		return -EIO;
>  
>  	if (grace_period != USE_DEFAULT_GRACE_PERIOD) {
>  		retval = pm_update_grace_period(&dqm->packet_mgr, grace_period);
>  		if (retval)
> -			return retval;
> +			goto out;
>  	}
>  
>  	retval = pm_send_unmap_queue(&dqm->packet_mgr, filter, filter_param, reset);
>  	if (retval)
> -		return retval;
> +		goto out;
>  
>  	*dqm->fence_addr = KFD_FENCE_INIT;
>  	pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
> @@ -1985,7 +1967,7 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>  	if (retval) {
>  		dev_err(dev, "The cp might be in an unrecoverable state due to an unsuccessful queues preemption\n");
>  		kfd_hws_hang(dqm);
> -		return retval;
> +		goto out;
>  	}
>  
>  	/* In the current MEC firmware implementation, if compute queue
> @@ -2001,7 +1983,8 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>  		while (halt_if_hws_hang)
>  			schedule();
>  		kfd_hws_hang(dqm);
> -		return -ETIME;
> +		retval = -ETIME;
> +		goto out;
>  	}
>  
>  	/* We need to reset the grace period value for this device */
> @@ -2014,6 +1997,8 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>  	pm_release_ib(&dqm->packet_mgr);
>  	dqm->active_runlist = false;
>  
> +out:
> +	up_read(&dqm->dev->adev->reset_domain->sem);
>  	return retval;
>  }
>  
> @@ -2040,13 +2025,13 @@ static int execute_queues_cpsch(struct device_queue_manager *dqm,
>  {
>  	int retval;
>  
> -	if (dqm->is_hws_hang)
> +	if (!down_read_trylock(&dqm->dev->adev->reset_domain->sem))
>  		return -EIO;
>  	retval = unmap_queues_cpsch(dqm, filter, filter_param, grace_period, false);
> -	if (retval)
> -		return retval;
> -
> -	return map_queues_cpsch(dqm);
> +	if (!retval)
> +		retval = map_queues_cpsch(dqm);
> +	up_read(&dqm->dev->adev->reset_domain->sem);
> +	return retval;
>  }
>  
>  static int wait_on_destroy_queue(struct device_queue_manager *dqm,
> @@ -2427,10 +2412,12 @@ static int process_termination_cpsch(struct device_queue_manager *dqm,
>  	if (!dqm->dev->kfd->shared_resources.enable_mes)
>  		retval = execute_queues_cpsch(dqm, filter, 0, USE_DEFAULT_GRACE_PERIOD);
>  
> -	if ((!dqm->is_hws_hang) && (retval || qpd->reset_wavefronts)) {
> +	if ((retval || qpd->reset_wavefronts) &&
> +	    down_read_trylock(&dqm->dev->adev->reset_domain->sem)) {
>  		pr_warn("Resetting wave fronts (cpsch) on dev %p\n", dqm->dev);
>  		dbgdev_wave_reset_wavefronts(dqm->dev, qpd->pqm->process);
>  		qpd->reset_wavefronts = false;
> +		up_read(&dqm->dev->adev->reset_domain->sem);
>  	}
>  
>  	/* Lastly, free mqd resources.
> @@ -2537,7 +2524,6 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
>  		dqm->ops.initialize = initialize_cpsch;
>  		dqm->ops.start = start_cpsch;
>  		dqm->ops.stop = stop_cpsch;
> -		dqm->ops.pre_reset = pre_reset;
>  		dqm->ops.destroy_queue = destroy_queue_cpsch;
>  		dqm->ops.update_queue = update_queue;
>  		dqm->ops.register_process = register_process;
> @@ -2558,7 +2544,6 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
>  		/* initialize dqm for no cp scheduling */
>  		dqm->ops.start = start_nocpsch;
>  		dqm->ops.stop = stop_nocpsch;
> -		dqm->ops.pre_reset = pre_reset;
>  		dqm->ops.create_queue = create_queue_nocpsch;
>  		dqm->ops.destroy_queue = destroy_queue_nocpsch;
>  		dqm->ops.update_queue = update_queue;
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> index fcc0ee67f544..3b9b8eabaacc 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> @@ -152,7 +152,6 @@ struct device_queue_manager_ops {
>  	int	(*initialize)(struct device_queue_manager *dqm);
>  	int	(*start)(struct device_queue_manager *dqm);
>  	int	(*stop)(struct device_queue_manager *dqm);
> -	void	(*pre_reset)(struct device_queue_manager *dqm);
>  	void	(*uninitialize)(struct device_queue_manager *dqm);
>  	int	(*create_kernel_queue)(struct device_queue_manager *dqm,
>  					struct kernel_queue *kq,
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
> index 32c926986dbb..3ea75a9d86ec 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
> @@ -32,6 +32,7 @@
>  #include "kfd_device_queue_manager.h"
>  #include "kfd_pm4_headers.h"
>  #include "kfd_pm4_opcodes.h"
> +#include "amdgpu_reset.h"
>  
>  #define PM4_COUNT_ZERO (((1 << 15) - 1) << 16)
>  
> @@ -196,15 +197,17 @@ static bool kq_initialize(struct kernel_queue *kq, struct kfd_node *dev,
>  }
>  
>  /* Uninitialize a kernel queue and free all its memory usages. */
> -static void kq_uninitialize(struct kernel_queue *kq, bool hanging)
> +static void kq_uninitialize(struct kernel_queue *kq)
>  {
> -	if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && !hanging)
> +	if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && down_read_trylock(&kq->dev->adev->reset_domain->sem)) {
>  		kq->mqd_mgr->destroy_mqd(kq->mqd_mgr,
>  					kq->queue->mqd,
>  					KFD_PREEMPT_TYPE_WAVEFRONT_RESET,
>  					KFD_UNMAP_LATENCY_MS,
>  					kq->queue->pipe,
>  					kq->queue->queue);
> +		up_read(&kq->dev->adev->reset_domain->sem);
> +	}
>  	else if (kq->queue->properties.type == KFD_QUEUE_TYPE_DIQ)
>  		kfd_gtt_sa_free(kq->dev, kq->fence_mem_obj);
>  
> @@ -344,9 +347,9 @@ struct kernel_queue *kernel_queue_init(struct kfd_node *dev,
>  	return NULL;
>  }
>  
> -void kernel_queue_uninit(struct kernel_queue *kq, bool hanging)
> +void kernel_queue_uninit(struct kernel_queue *kq)
>  {
> -	kq_uninitialize(kq, hanging);
> +	kq_uninitialize(kq);
>  	kfree(kq);
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> index 7332ad94eab8..a05d5c1097a8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
> @@ -263,10 +263,10 @@ int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm)
>  	return 0;
>  }
>  
> -void pm_uninit(struct packet_manager *pm, bool hanging)
> +void pm_uninit(struct packet_manager *pm)
>  {
>  	mutex_destroy(&pm->lock);
> -	kernel_queue_uninit(pm->priv_queue, hanging);
> +	kernel_queue_uninit(pm->priv_queue);
>  	pm->priv_queue = NULL;
>  }
>  
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index c51e908f6f19..2b3ec92981e8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1301,7 +1301,7 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev);
>  void device_queue_manager_uninit(struct device_queue_manager *dqm);
>  struct kernel_queue *kernel_queue_init(struct kfd_node *dev,
>  					enum kfd_queue_type type);
> -void kernel_queue_uninit(struct kernel_queue *kq, bool hanging);
> +void kernel_queue_uninit(struct kernel_queue *kq);
>  int kfd_dqm_evict_pasid(struct device_queue_manager *dqm, u32 pasid);
>  
>  /* Process Queue Manager */
> @@ -1407,7 +1407,7 @@ extern const struct packet_manager_funcs kfd_v9_pm_funcs;
>  extern const struct packet_manager_funcs kfd_aldebaran_pm_funcs;
>  
>  int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm);
> -void pm_uninit(struct packet_manager *pm, bool hanging);
> +void pm_uninit(struct packet_manager *pm);
>  int pm_send_set_resources(struct packet_manager *pm,
>  				struct scheduling_resources *res);
>  int pm_send_runlist(struct packet_manager *pm, struct list_head *dqm_queues);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 6bf79c435f2e..86ea610b16f3 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -434,7 +434,7 @@ int pqm_create_queue(struct process_queue_manager *pqm,
>  err_create_queue:
>  	uninit_queue(q);
>  	if (kq)
> -		kernel_queue_uninit(kq, false);
> +		kernel_queue_uninit(kq);
>  	kfree(pqn);
>  err_allocate_pqn:
>  	/* check if queues list is empty unregister process from device */
> @@ -481,7 +481,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid)
>  		/* destroy kernel queue (DIQ) */
>  		dqm = pqn->kq->dev->dqm;
>  		dqm->ops.destroy_kernel_queue(dqm, pqn->kq, &pdd->qpd);
> -		kernel_queue_uninit(pqn->kq, false);
> +		kernel_queue_uninit(pqn->kq);
>  	}
>  
>  	if (pqn->q) {

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting
  2024-05-29 23:04   ` Felix Kuehling
@ 2024-05-30  0:06     ` Li, Yunxiang (Teddy)
  0 siblings, 0 replies; 52+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-05-30  0:06 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Gao, Likun, Zhang, Hawking


> One thing I could see going wrong is, that down_read_trylock(&dqm->dev-
> >adev->reset_domain->sem) will not fail immediately when the reset is
> scheduled. So there may be multipe attempts at HW access that detect an
> error or time out, which may get the HW into a worse state or delay the actual
> reset.

I suppose we can always check amdgpu_in_reset first before we do down_read_trylock; this would prevent new readers from coming in while the reset thread is waiting on current readers to finish. With the rwsem alone I suppose there's a chance that the writer would be starved?

Teddy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset
  2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                   ` (9 preceding siblings ...)
  2024-05-28 17:23 ` [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED" Yunxiang Li
@ 2024-05-30 21:47 ` Yunxiang Li
  2024-05-30 21:47   ` [PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
                     ` (7 more replies)
  10 siblings, 8 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:47 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Yunxiang Li

If another thread accesses the GPU while it is being reset, the reset
could fail. This is especially problematic on SRIOV, since the host may
reset the GPU even if the guest is not yet ready.

There is code in place that tries to prevent stray access, but over
time bugs have crept in, making it unreliable. This series addresses
those bugs.

v3: dropped:
      drm/amdgpu: abort fence poll if reset is started
      Revert "drm/amdgpu: Queue KFD reset workitem in VF FED"
    updated:
      drm/amdgpu: fix sriov host flr handler
      drm/amdgpu: fix missing reset domain locks

Likun Gao (1):
  drm/amd/amdgpu: remove unnecessary flush when enable gart

Yunxiang Li (7):
  drm/amdgpu: add skip_hw_access checks for sriov
  drm/amdgpu: fix sriov host flr handler
  drm/amdgpu/kfd: remove is_hws_hang and is_resetting
  drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover
  drm/amdgpu: use helper in amdgpu_gart_unbind
  drm/amdgpu: fix locking scope when flushing tlb
  drm/amdgpu: fix missing reset domain locks

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c      | 11 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c       | 66 ++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c   |  2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c      | 23 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  2 +
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c        |  3 -
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c        |  3 -
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c        |  3 -
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c        |  3 -
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c        |  4 -
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         | 39 ++++-----
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c         | 39 ++++-----
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  6 --
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 -
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 79 ++++++++-----------
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 11 ++-
 .../gpu/drm/amd/amdkfd/kfd_packet_manager.c   |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  4 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    | 13 ++-
 21 files changed, 151 insertions(+), 168 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
@ 2024-05-30 21:47   ` Yunxiang Li
  2024-05-30 21:47   ` [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler Yunxiang Li
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:47 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Yunxiang Li

Accessing registers via the host is missing the check for skip_hw_access
and the lockdep check that comes with it.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3d5f58e76f2d..3cf8416f8cb0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -977,6 +977,9 @@ u32 amdgpu_virt_rlcg_reg_rw(struct amdgpu_device *adev, u32 offset, u32 v, u32 f
 		return 0;
 	}
 
+	if (amdgpu_device_skip_hw_access(adev))
+		return 0;
+
 	reg_access_ctrl = &adev->gfx.rlc.reg_access_ctrl[xcc_id];
 	scratch_reg0 = (void __iomem *)adev->rmmio + 4 * reg_access_ctrl->scratch_reg0;
 	scratch_reg1 = (void __iomem *)adev->rmmio + 4 * reg_access_ctrl->scratch_reg1;
@@ -1047,6 +1050,9 @@ void amdgpu_sriov_wreg(struct amdgpu_device *adev,
 {
 	u32 rlcg_flag;
 
+	if (amdgpu_device_skip_hw_access(adev))
+		return;
+
 	if (!amdgpu_sriov_runtime(adev) &&
 		amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags, hwip, true, &rlcg_flag)) {
 		amdgpu_virt_rlcg_reg_rw(adev, offset, value, rlcg_flag, xcc_id);
@@ -1064,6 +1070,9 @@ u32 amdgpu_sriov_rreg(struct amdgpu_device *adev,
 {
 	u32 rlcg_flag;
 
+	if (amdgpu_device_skip_hw_access(adev))
+		return 0;
+
 	if (!amdgpu_sriov_runtime(adev) &&
 		amdgpu_virt_get_rlcg_reg_access_flag(adev, acc_flags, hwip, false, &rlcg_flag))
 		return amdgpu_virt_rlcg_reg_rw(adev, offset, 0, rlcg_flag, xcc_id);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
  2024-05-30 21:47   ` [PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
@ 2024-05-30 21:47   ` Yunxiang Li
  2024-06-05  1:12     ` Deng, Emily
  2024-05-30 21:48   ` [PATCH v3 3/8] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:47 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Yunxiang Li, haijun.chang,
	emily.deng

We send back the ready-to-reset message before we stop anything. This is
wrong. Move it to the point where we are actually ready for the FLR to
happen.

In the current state, since we take tens of seconds to stop everything,
it is very likely that the host would give up waiting and reset the GPU
before we send ready, so the behavior would be the same as before. But
this gets rid of the hack with reset_domain locking and also lets us
know how slow the reset actually is on the host side. The pre-reset
speed can thus be improved later.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v3: still call amdgpu_virt_fini_data_exchange right away; it could take
    a while for the reset to grab its lock and call this function in
    pre_reset, so during this time the thread would read garbage.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 14 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 ++
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 39 +++++++++-------------
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 39 +++++++++-------------
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 ----
 6 files changed, 50 insertions(+), 52 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index bf1a6593dc5e..eb77b4ec3cb4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5069,6 +5069,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	struct amdgpu_hive_info *hive = NULL;
 
 	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
+		amdgpu_virt_ready_to_reset(adev);
+		amdgpu_virt_wait_reset(adev);
 		clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
 		r = amdgpu_virt_request_full_gpu(adev, true);
 	} else {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 3cf8416f8cb0..44450507c140 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -152,6 +152,20 @@ void amdgpu_virt_request_init_data(struct amdgpu_device *adev)
 		DRM_WARN("host doesn't support REQ_INIT_DATA handshake\n");
 }
 
+/**
+ * amdgpu_virt_ready_to_reset() - send ready to reset to host
+ * @adev:	amdgpu device.
+ * Send ready to reset message to GPU hypervisor to signal that we have
+ * stopped GPU activity and are ready for the host FLR
+ */
+void amdgpu_virt_ready_to_reset(struct amdgpu_device *adev)
+{
+	struct amdgpu_virt *virt = &adev->virt;
+
+	if (virt->ops && virt->ops->ready_to_reset)
+		virt->ops->ready_to_reset(adev);
+}
+
 /**
  * amdgpu_virt_wait_reset() - wait for reset gpu completed
  * @adev:	amdgpu device.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 642f1fd287d8..66de5380d9a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -88,6 +88,7 @@ struct amdgpu_virt_ops {
 	int (*rel_full_gpu)(struct amdgpu_device *adev, bool init);
 	int (*req_init_data)(struct amdgpu_device *adev);
 	int (*reset_gpu)(struct amdgpu_device *adev);
+	void (*ready_to_reset)(struct amdgpu_device *adev);
 	int (*wait_reset)(struct amdgpu_device *adev);
 	void (*trans_msg)(struct amdgpu_device *adev, enum idh_request req,
 			  u32 data1, u32 data2, u32 data3);
@@ -345,6 +346,7 @@ int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
 int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
 int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
 void amdgpu_virt_request_init_data(struct amdgpu_device *adev);
+void amdgpu_virt_ready_to_reset(struct amdgpu_device *adev);
 int amdgpu_virt_wait_reset(struct amdgpu_device *adev);
 int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
 void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index f4c47492e0cd..6b71ee85ee65 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -249,38 +249,30 @@ static int xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device *adev,
 	return 0;
 }
 
-static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
+static void xgpu_ai_ready_to_reset(struct amdgpu_device *adev)
 {
-	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
-	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
-	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
-
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (atomic_cmpxchg(&adev->reset_domain->in_gpu_reset, 0, 1) != 0)
-		return;
-
-	down_write(&adev->reset_domain->sem);
-
-	amdgpu_virt_fini_data_exchange(adev);
-
 	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
+}
 
+static int xgpu_ai_wait_reset(struct amdgpu_device *adev)
+{
+	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
 	do {
 		if (xgpu_ai_mailbox_peek_msg(adev) == IDH_FLR_NOTIFICATION_CMPL)
-			goto flr_done;
-
+			return 0;
 		msleep(10);
 		timeout -= 10;
 	} while (timeout > 1);
-
 	dev_warn(adev->dev, "waiting IDH_FLR_NOTIFICATION_CMPL timeout\n");
+	return -ETIME;
+}
 
-flr_done:
-	atomic_set(&adev->reset_domain->in_gpu_reset, 0);
-	up_write(&adev->reset_domain->sem);
+static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
+{
+	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
+
+	amdgpu_virt_fini_data_exchange(adev);
 
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
@@ -417,7 +409,8 @@ const struct amdgpu_virt_ops xgpu_ai_virt_ops = {
 	.req_full_gpu	= xgpu_ai_request_full_gpu_access,
 	.rel_full_gpu	= xgpu_ai_release_full_gpu_access,
 	.reset_gpu = xgpu_ai_request_reset,
-	.wait_reset = NULL,
+	.ready_to_reset = xgpu_ai_ready_to_reset,
+	.wait_reset = xgpu_ai_wait_reset,
 	.trans_msg = xgpu_ai_mailbox_trans_msg,
 	.req_init_data  = xgpu_ai_request_init_data,
 	.ras_poison_handler = xgpu_ai_ras_poison_handler,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 37b49a5ed2a1..22af30a15a5f 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -282,38 +282,30 @@ static int xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device *adev,
 	return 0;
 }
 
-static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
+static void xgpu_nv_ready_to_reset(struct amdgpu_device *adev)
 {
-	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
-	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
-	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
-
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (atomic_cmpxchg(&adev->reset_domain->in_gpu_reset, 0, 1) != 0)
-		return;
-
-	down_write(&adev->reset_domain->sem);
-
-	amdgpu_virt_fini_data_exchange(adev);
-
 	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
+}
 
+static int xgpu_nv_wait_reset(struct amdgpu_device *adev)
+{
+	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
 	do {
 		if (xgpu_nv_mailbox_peek_msg(adev) == IDH_FLR_NOTIFICATION_CMPL)
-			goto flr_done;
-
+			return 0;
 		msleep(10);
 		timeout -= 10;
 	} while (timeout > 1);
-
 	dev_warn(adev->dev, "waiting IDH_FLR_NOTIFICATION_CMPL timeout\n");
+	return -ETIME;
+}
 
-flr_done:
-	atomic_set(&adev->reset_domain->in_gpu_reset, 0);
-	up_write(&adev->reset_domain->sem);
+static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
+{
+	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
+
+	amdgpu_virt_fini_data_exchange(adev);
 
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
@@ -455,7 +447,8 @@ const struct amdgpu_virt_ops xgpu_nv_virt_ops = {
 	.rel_full_gpu	= xgpu_nv_release_full_gpu_access,
 	.req_init_data  = xgpu_nv_request_init_data,
 	.reset_gpu = xgpu_nv_request_reset,
-	.wait_reset = NULL,
+	.ready_to_reset = xgpu_nv_ready_to_reset,
+	.wait_reset = xgpu_nv_wait_reset,
 	.trans_msg = xgpu_nv_mailbox_trans_msg,
 	.ras_poison_handler = xgpu_nv_ras_poison_handler,
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 78cd07744ebe..e1d63bed84bf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -515,12 +515,6 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 
-	/* wait until RCV_MSG become 3 */
-	if (xgpu_vi_poll_msg(adev, IDH_FLR_NOTIFICATION_CMPL)) {
-		pr_err("failed to receive FLR_CMPL\n");
-		return;
-	}
-
 	/* Trigger recovery due to world switch failure */
 	if (amdgpu_device_should_recover_gpu(adev)) {
 		struct amdgpu_reset_context reset_context;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 3/8] drm/amdgpu/kfd: remove is_hws_hang and is_resetting
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
  2024-05-30 21:47   ` [PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
  2024-05-30 21:47   ` [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler Yunxiang Li
@ 2024-05-30 21:48   ` Yunxiang Li
  2024-05-30 21:48   ` [PATCH v3 4/8] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:48 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Yunxiang Li, Felix Kuehling

is_hws_hang and is_resetting serve pretty much the same purpose and
both duplicate the work of the reset_domain lock, so just check that
directly instead. This also eliminates a few bugs listed below and gets
rid of dqm->ops.pre_reset.

kfd_hws_hang did not need to avoid scheduling another reset. If the
ongoing reset decided to skip the GPU reset we would have a bad time
anyway; otherwise the extra reset will simply get cancelled.

remove_queue_mes forgot to check the is_resetting flag, unlike the
pre-MES path unmap_queues_cpsch, so it did not correctly block HW
access during reset.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 -
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 79 ++++++++-----------
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c | 11 ++-
 .../gpu/drm/amd/amdkfd/kfd_packet_manager.c   |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  4 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |  4 +-
 7 files changed, 45 insertions(+), 59 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index fba9b9a258a5..3e0f46d60de5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -935,7 +935,6 @@ int kgd2kfd_pre_reset(struct kfd_dev *kfd)
 	for (i = 0; i < kfd->num_nodes; i++) {
 		node = kfd->nodes[i];
 		kfd_smi_event_update_gpu_reset(node, false);
-		node->dqm->ops.pre_reset(node->dqm);
 	}
 
 	kgd2kfd_suspend(kfd, false);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 4721b2fccd06..3a2dc31279a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -35,6 +35,7 @@
 #include "cik_regs.h"
 #include "kfd_kernel_queue.h"
 #include "amdgpu_amdkfd.h"
+#include "amdgpu_reset.h"
 #include "mes_api_def.h"
 #include "kfd_debug.h"
 
@@ -155,14 +156,7 @@ static void kfd_hws_hang(struct device_queue_manager *dqm)
 	/*
 	 * Issue a GPU reset if HWS is unresponsive
 	 */
-	dqm->is_hws_hang = true;
-
-	/* It's possible we're detecting a HWS hang in the
-	 * middle of a GPU reset. No need to schedule another
-	 * reset in this case.
-	 */
-	if (!dqm->is_resetting)
-		schedule_work(&dqm->hw_exception_work);
+	schedule_work(&dqm->hw_exception_work);
 }
 
 static int convert_to_mes_queue_type(int queue_type)
@@ -194,7 +188,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	int r, queue_type;
 	uint64_t wptr_addr_off;
 
-	if (dqm->is_hws_hang)
+	if (!down_read_trylock(&adev->reset_domain->sem))
 		return -EIO;
 
 	memset(&queue_input, 0x0, sizeof(struct mes_add_queue_input));
@@ -245,6 +239,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	amdgpu_mes_lock(&adev->mes);
 	r = adev->mes.funcs->add_hw_queue(&adev->mes, &queue_input);
 	amdgpu_mes_unlock(&adev->mes);
+	up_read(&adev->reset_domain->sem);
 	if (r) {
 		dev_err(adev->dev, "failed to add hardware queue to MES, doorbell=0x%x\n",
 			q->properties.doorbell_off);
@@ -262,7 +257,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	int r;
 	struct mes_remove_queue_input queue_input;
 
-	if (dqm->is_hws_hang)
+	if (!down_read_trylock(&adev->reset_domain->sem))
 		return -EIO;
 
 	memset(&queue_input, 0x0, sizeof(struct mes_remove_queue_input));
@@ -272,6 +267,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	amdgpu_mes_lock(&adev->mes);
 	r = adev->mes.funcs->remove_hw_queue(&adev->mes, &queue_input);
 	amdgpu_mes_unlock(&adev->mes);
+	up_read(&adev->reset_domain->sem);
 
 	if (r) {
 		dev_err(adev->dev, "failed to remove hardware queue from MES, doorbell=0x%x\n",
@@ -1468,20 +1464,13 @@ static int stop_nocpsch(struct device_queue_manager *dqm)
 	}
 
 	if (dqm->dev->adev->asic_type == CHIP_HAWAII)
-		pm_uninit(&dqm->packet_mgr, false);
+		pm_uninit(&dqm->packet_mgr);
 	dqm->sched_running = false;
 	dqm_unlock(dqm);
 
 	return 0;
 }
 
-static void pre_reset(struct device_queue_manager *dqm)
-{
-	dqm_lock(dqm);
-	dqm->is_resetting = true;
-	dqm_unlock(dqm);
-}
-
 static int allocate_sdma_queue(struct device_queue_manager *dqm,
 				struct queue *q, const uint32_t *restore_sdma_id)
 {
@@ -1669,8 +1658,6 @@ static int start_cpsch(struct device_queue_manager *dqm)
 	init_interrupts(dqm);
 
 	/* clear hang status when driver try to start the hw scheduler */
-	dqm->is_hws_hang = false;
-	dqm->is_resetting = false;
 	dqm->sched_running = true;
 
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
@@ -1700,7 +1687,7 @@ static int start_cpsch(struct device_queue_manager *dqm)
 fail_allocate_vidmem:
 fail_set_sched_resources:
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
-		pm_uninit(&dqm->packet_mgr, false);
+		pm_uninit(&dqm->packet_mgr);
 fail_packet_manager_init:
 	dqm_unlock(dqm);
 	return retval;
@@ -1708,22 +1695,17 @@ static int start_cpsch(struct device_queue_manager *dqm)
 
 static int stop_cpsch(struct device_queue_manager *dqm)
 {
-	bool hanging;
-
 	dqm_lock(dqm);
 	if (!dqm->sched_running) {
 		dqm_unlock(dqm);
 		return 0;
 	}
 
-	if (!dqm->is_hws_hang) {
-		if (!dqm->dev->kfd->shared_resources.enable_mes)
-			unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD, false);
-		else
-			remove_all_queues_mes(dqm);
-	}
+	if (!dqm->dev->kfd->shared_resources.enable_mes)
+		unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD, false);
+	else
+		remove_all_queues_mes(dqm);
 
-	hanging = dqm->is_hws_hang || dqm->is_resetting;
 	dqm->sched_running = false;
 
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
@@ -1731,7 +1713,7 @@ static int stop_cpsch(struct device_queue_manager *dqm)
 
 	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
-		pm_uninit(&dqm->packet_mgr, hanging);
+		pm_uninit(&dqm->packet_mgr);
 	dqm_unlock(dqm);
 
 	return 0;
@@ -1957,24 +1939,24 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 {
 	struct device *dev = dqm->dev->adev->dev;
 	struct mqd_manager *mqd_mgr;
-	int retval = 0;
+	int retval;
 
 	if (!dqm->sched_running)
 		return 0;
-	if (dqm->is_hws_hang || dqm->is_resetting)
-		return -EIO;
 	if (!dqm->active_runlist)
-		return retval;
+		return 0;
+	if (!down_read_trylock(&dqm->dev->adev->reset_domain->sem))
+		return -EIO;
 
 	if (grace_period != USE_DEFAULT_GRACE_PERIOD) {
 		retval = pm_update_grace_period(&dqm->packet_mgr, grace_period);
 		if (retval)
-			return retval;
+			goto out;
 	}
 
 	retval = pm_send_unmap_queue(&dqm->packet_mgr, filter, filter_param, reset);
 	if (retval)
-		return retval;
+		goto out;
 
 	*dqm->fence_addr = KFD_FENCE_INIT;
 	pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
@@ -1985,7 +1967,7 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 	if (retval) {
 		dev_err(dev, "The cp might be in an unrecoverable state due to an unsuccessful queues preemption\n");
 		kfd_hws_hang(dqm);
-		return retval;
+		goto out;
 	}
 
 	/* In the current MEC firmware implementation, if compute queue
@@ -2001,7 +1983,8 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 		while (halt_if_hws_hang)
 			schedule();
 		kfd_hws_hang(dqm);
-		return -ETIME;
+		retval = -ETIME;
+		goto out;
 	}
 
 	/* We need to reset the grace period value for this device */
@@ -2014,6 +1997,8 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
 	pm_release_ib(&dqm->packet_mgr);
 	dqm->active_runlist = false;
 
+out:
+	up_read(&dqm->dev->adev->reset_domain->sem);
 	return retval;
 }
 
@@ -2040,13 +2025,13 @@ static int execute_queues_cpsch(struct device_queue_manager *dqm,
 {
 	int retval;
 
-	if (dqm->is_hws_hang)
+	if (!down_read_trylock(&dqm->dev->adev->reset_domain->sem))
 		return -EIO;
 	retval = unmap_queues_cpsch(dqm, filter, filter_param, grace_period, false);
-	if (retval)
-		return retval;
-
-	return map_queues_cpsch(dqm);
+	if (!retval)
+		retval = map_queues_cpsch(dqm);
+	up_read(&dqm->dev->adev->reset_domain->sem);
+	return retval;
 }
 
 static int wait_on_destroy_queue(struct device_queue_manager *dqm,
@@ -2427,10 +2412,12 @@ static int process_termination_cpsch(struct device_queue_manager *dqm,
 	if (!dqm->dev->kfd->shared_resources.enable_mes)
 		retval = execute_queues_cpsch(dqm, filter, 0, USE_DEFAULT_GRACE_PERIOD);
 
-	if ((!dqm->is_hws_hang) && (retval || qpd->reset_wavefronts)) {
+	if ((retval || qpd->reset_wavefronts) &&
+	    down_read_trylock(&dqm->dev->adev->reset_domain->sem)) {
 		pr_warn("Resetting wave fronts (cpsch) on dev %p\n", dqm->dev);
 		dbgdev_wave_reset_wavefronts(dqm->dev, qpd->pqm->process);
 		qpd->reset_wavefronts = false;
+		up_read(&dqm->dev->adev->reset_domain->sem);
 	}
 
 	/* Lastly, free mqd resources.
@@ -2537,7 +2524,6 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
 		dqm->ops.initialize = initialize_cpsch;
 		dqm->ops.start = start_cpsch;
 		dqm->ops.stop = stop_cpsch;
-		dqm->ops.pre_reset = pre_reset;
 		dqm->ops.destroy_queue = destroy_queue_cpsch;
 		dqm->ops.update_queue = update_queue;
 		dqm->ops.register_process = register_process;
@@ -2558,7 +2544,6 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev)
 		/* initialize dqm for no cp scheduling */
 		dqm->ops.start = start_nocpsch;
 		dqm->ops.stop = stop_nocpsch;
-		dqm->ops.pre_reset = pre_reset;
 		dqm->ops.create_queue = create_queue_nocpsch;
 		dqm->ops.destroy_queue = destroy_queue_nocpsch;
 		dqm->ops.update_queue = update_queue;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index fcc0ee67f544..3b9b8eabaacc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -152,7 +152,6 @@ struct device_queue_manager_ops {
 	int	(*initialize)(struct device_queue_manager *dqm);
 	int	(*start)(struct device_queue_manager *dqm);
 	int	(*stop)(struct device_queue_manager *dqm);
-	void	(*pre_reset)(struct device_queue_manager *dqm);
 	void	(*uninitialize)(struct device_queue_manager *dqm);
 	int	(*create_kernel_queue)(struct device_queue_manager *dqm,
 					struct kernel_queue *kq,
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
index 32c926986dbb..3ea75a9d86ec 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
@@ -32,6 +32,7 @@
 #include "kfd_device_queue_manager.h"
 #include "kfd_pm4_headers.h"
 #include "kfd_pm4_opcodes.h"
+#include "amdgpu_reset.h"
 
 #define PM4_COUNT_ZERO (((1 << 15) - 1) << 16)
 
@@ -196,15 +197,17 @@ static bool kq_initialize(struct kernel_queue *kq, struct kfd_node *dev,
 }
 
 /* Uninitialize a kernel queue and free all its memory usages. */
-static void kq_uninitialize(struct kernel_queue *kq, bool hanging)
+static void kq_uninitialize(struct kernel_queue *kq)
 {
-	if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && !hanging)
+	if (kq->queue->properties.type == KFD_QUEUE_TYPE_HIQ && down_read_trylock(&kq->dev->adev->reset_domain->sem)) {
 		kq->mqd_mgr->destroy_mqd(kq->mqd_mgr,
 					kq->queue->mqd,
 					KFD_PREEMPT_TYPE_WAVEFRONT_RESET,
 					KFD_UNMAP_LATENCY_MS,
 					kq->queue->pipe,
 					kq->queue->queue);
+		up_read(&kq->dev->adev->reset_domain->sem);
+	}
 	else if (kq->queue->properties.type == KFD_QUEUE_TYPE_DIQ)
 		kfd_gtt_sa_free(kq->dev, kq->fence_mem_obj);
 
@@ -344,9 +347,9 @@ struct kernel_queue *kernel_queue_init(struct kfd_node *dev,
 	return NULL;
 }
 
-void kernel_queue_uninit(struct kernel_queue *kq, bool hanging)
+void kernel_queue_uninit(struct kernel_queue *kq)
 {
-	kq_uninitialize(kq, hanging);
+	kq_uninitialize(kq);
 	kfree(kq);
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
index 7332ad94eab8..a05d5c1097a8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c
@@ -263,10 +263,10 @@ int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm)
 	return 0;
 }
 
-void pm_uninit(struct packet_manager *pm, bool hanging)
+void pm_uninit(struct packet_manager *pm)
 {
 	mutex_destroy(&pm->lock);
-	kernel_queue_uninit(pm->priv_queue, hanging);
+	kernel_queue_uninit(pm->priv_queue);
 	pm->priv_queue = NULL;
 }
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index c51e908f6f19..2b3ec92981e8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1301,7 +1301,7 @@ struct device_queue_manager *device_queue_manager_init(struct kfd_node *dev);
 void device_queue_manager_uninit(struct device_queue_manager *dqm);
 struct kernel_queue *kernel_queue_init(struct kfd_node *dev,
 					enum kfd_queue_type type);
-void kernel_queue_uninit(struct kernel_queue *kq, bool hanging);
+void kernel_queue_uninit(struct kernel_queue *kq);
 int kfd_dqm_evict_pasid(struct device_queue_manager *dqm, u32 pasid);
 
 /* Process Queue Manager */
@@ -1407,7 +1407,7 @@ extern const struct packet_manager_funcs kfd_v9_pm_funcs;
 extern const struct packet_manager_funcs kfd_aldebaran_pm_funcs;
 
 int pm_init(struct packet_manager *pm, struct device_queue_manager *dqm);
-void pm_uninit(struct packet_manager *pm, bool hanging);
+void pm_uninit(struct packet_manager *pm);
 int pm_send_set_resources(struct packet_manager *pm,
 				struct scheduling_resources *res);
 int pm_send_runlist(struct packet_manager *pm, struct list_head *dqm_queues);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 6bf79c435f2e..86ea610b16f3 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -434,7 +434,7 @@ int pqm_create_queue(struct process_queue_manager *pqm,
 err_create_queue:
 	uninit_queue(q);
 	if (kq)
-		kernel_queue_uninit(kq, false);
+		kernel_queue_uninit(kq);
 	kfree(pqn);
 err_allocate_pqn:
 	/* check if queues list is empty unregister process from device */
@@ -481,7 +481,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid)
 		/* destroy kernel queue (DIQ) */
 		dqm = pqn->kq->dev->dqm;
 		dqm->ops.destroy_kernel_queue(dqm, pqn->kq, &pdd->qpd);
-		kernel_queue_uninit(pqn->kq, false);
+		kernel_queue_uninit(pqn->kq);
 	}
 
 	if (pqn->q) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 4/8] drm/amd/amdgpu: remove unnecessary flush when enable gart
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                     ` (2 preceding siblings ...)
  2024-05-30 21:48   ` [PATCH v3 3/8] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
@ 2024-05-30 21:48   ` Yunxiang Li
  2024-05-30 21:48   ` [PATCH v3 5/8] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:48 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Likun Gao, Yunxiang Li

From: Likun Gao <Likun.Gao@amd.com>

Remove the HDP flush for GC v11/12 when enabling GART.
Remove the TLB flush for GC v10/11/12 when enabling GART.
The flush is already done for each GART mapping when it is created.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 4 ----
 5 files changed, 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index aba0a51be960..5740f94e3e44 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4395,13 +4395,10 @@ static int gfx_v11_0_gfxhub_enable(struct amdgpu_device *adev)
 	if (r)
 		return r;
 
-	adev->hdp.funcs->flush_hdp(adev, NULL);
-
 	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
 		false : true;
 
 	adev->gfxhub.funcs->set_fault_enable_default(adev, value);
-	amdgpu_gmc_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 1ef9de41d193..5048b6eef9da 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -3207,13 +3207,10 @@ static int gfx_v12_0_gfxhub_enable(struct amdgpu_device *adev)
 	if (r)
 		return r;
 
-	adev->hdp.funcs->flush_hdp(adev, NULL);
-
 	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
 		false : true;
 
 	adev->gfxhub.funcs->set_fault_enable_default(adev, value);
-	amdgpu_gmc_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index d933e19e0cf5..3e0ebe25a80f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -974,9 +974,6 @@ static int gmc_v10_0_gart_enable(struct amdgpu_device *adev)
 	if (!adev->in_s0ix)
 		adev->gfxhub.funcs->set_fault_enable_default(adev, value);
 	adev->mmhub.funcs->set_fault_enable_default(adev, value);
-	gmc_v10_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
-	if (!adev->in_s0ix)
-		gmc_v10_0_flush_gpu_tlb(adev, 0, AMDGPU_GFXHUB(0), 0);
 
 	DRM_INFO("PCIE GART of %uM enabled (table at 0x%016llX).\n",
 		 (unsigned int)(adev->gmc.gart_size >> 20),
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 527dc917e049..cadbe55f0c8f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -891,9 +891,6 @@ static int gmc_v11_0_gart_enable(struct amdgpu_device *adev)
 	if (r)
 		return r;
 
-	/* Flush HDP after it is initialized */
-	adev->hdp.funcs->flush_hdp(adev, NULL);
-
 	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
 		false : true;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index e2c6ec3cc4f3..a677aca69a06 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -861,14 +861,10 @@ static int gmc_v12_0_gart_enable(struct amdgpu_device *adev)
 	if (r)
 		return r;
 
-	/* Flush HDP after it is initialized */
-	adev->hdp.funcs->flush_hdp(adev, NULL);
-
 	value = (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_ALWAYS) ?
 		false : true;
 
 	adev->mmhub.funcs->set_fault_enable_default(adev, value);
-	gmc_v12_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
 
 	dev_info(adev->dev, "PCIE GART of %uM enabled (table at 0x%016llX).\n",
 		 (unsigned)(adev->gmc.gart_size >> 20),
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 5/8] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                     ` (3 preceding siblings ...)
  2024-05-30 21:48   ` [PATCH v3 4/8] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
@ 2024-05-30 21:48   ` Yunxiang Li
  2024-05-30 21:48   ` [PATCH v3 6/8] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:48 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Yunxiang Li

At this point the GART is not set up yet, so there is no point in
invalidating the TLB here and it could even be harmful.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index 44367f03316f..0760e70402ec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -200,8 +200,6 @@ void amdgpu_gtt_mgr_recover(struct amdgpu_gtt_mgr *mgr)
 		amdgpu_ttm_recover_gart(node->base.bo);
 	}
 	spin_unlock(&mgr->lock);
-
-	amdgpu_gart_invalidate_tlb(adev);
 }
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 6/8] drm/amdgpu: use helper in amdgpu_gart_unbind
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                     ` (4 preceding siblings ...)
  2024-05-30 21:48   ` [PATCH v3 5/8] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
@ 2024-05-30 21:48   ` Yunxiang Li
  2024-05-30 21:48   ` [PATCH v3 7/8] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
  2024-05-30 21:48   ` [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks Yunxiang Li
  7 siblings, 0 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:48 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Yunxiang Li

When the amdgpu_gart_invalidate_tlb helper was introduced, this part
was left out of the conversion. Avoid the code duplication here.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index c623e23049d1..eb172388d99e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -325,10 +325,7 @@ void amdgpu_gart_unbind(struct amdgpu_device *adev, uint64_t offset,
 			page_base += AMDGPU_GPU_PAGE_SIZE;
 		}
 	}
-	mb();
-	amdgpu_device_flush_hdp(adev, NULL);
-	for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
-		amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
+	amdgpu_gart_invalidate_tlb(adev);
 
 	drm_dev_exit(idx);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 7/8] drm/amdgpu: fix locking scope when flushing tlb
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                     ` (5 preceding siblings ...)
  2024-05-30 21:48   ` [PATCH v3 6/8] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
@ 2024-05-30 21:48   ` Yunxiang Li
  2024-05-30 21:48   ` [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks Yunxiang Li
  7 siblings, 0 replies; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:48 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Yunxiang Li, stable

The method used to flush the TLB does not depend on whether a reset is
in progress. We should skip the flush altogether if the GPU is about
to be reset, so put both paths under the reset_domain read lock.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
CC: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 66 +++++++++++++------------
 1 file changed, 34 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 603c0738fd03..4edd10b10a92 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -684,12 +684,17 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
 	struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
 	unsigned int ndw;
-	signed long r;
+	int r;
 	uint32_t seq;
 
-	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
-	    !down_read_trylock(&adev->reset_domain->sem)) {
+	/*
+	 * A GPU reset should flush all TLBs anyway, so no need to do
+	 * this while one is ongoing.
+	 */
+	if (!down_read_trylock(&adev->reset_domain->sem))
+		return 0;
 
+	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready) {
 		if (adev->gmc.flush_tlb_needs_extra_type_2)
 			adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
 								 2, all_hub,
@@ -703,43 +708,40 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
 		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
 							 flush_type, all_hub,
 							 inst);
-		return 0;
-	}
+		r = 0;
+	} else {
+		/* 2 dwords flush + 8 dwords fence */
+		ndw = kiq->pmf->invalidate_tlbs_size + 8;
 
-	/* 2 dwords flush + 8 dwords fence */
-	ndw = kiq->pmf->invalidate_tlbs_size + 8;
+		if (adev->gmc.flush_tlb_needs_extra_type_2)
+			ndw += kiq->pmf->invalidate_tlbs_size;
 
-	if (adev->gmc.flush_tlb_needs_extra_type_2)
-		ndw += kiq->pmf->invalidate_tlbs_size;
+		if (adev->gmc.flush_tlb_needs_extra_type_0)
+			ndw += kiq->pmf->invalidate_tlbs_size;
 
-	if (adev->gmc.flush_tlb_needs_extra_type_0)
-		ndw += kiq->pmf->invalidate_tlbs_size;
+		spin_lock(&adev->gfx.kiq[inst].ring_lock);
+		amdgpu_ring_alloc(ring, ndw);
+		if (adev->gmc.flush_tlb_needs_extra_type_2)
+			kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 2, all_hub);
 
-	spin_lock(&adev->gfx.kiq[inst].ring_lock);
-	amdgpu_ring_alloc(ring, ndw);
-	if (adev->gmc.flush_tlb_needs_extra_type_2)
-		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 2, all_hub);
+		if (flush_type == 2 && adev->gmc.flush_tlb_needs_extra_type_0)
+			kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 0, all_hub);
 
-	if (flush_type == 2 && adev->gmc.flush_tlb_needs_extra_type_0)
-		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, 0, all_hub);
+		kiq->pmf->kiq_invalidate_tlbs(ring, pasid, flush_type, all_hub);
+		r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
+		if (r) {
+			amdgpu_ring_undo(ring);
+			spin_unlock(&adev->gfx.kiq[inst].ring_lock);
+			goto error_unlock_reset;
+		}
 
-	kiq->pmf->kiq_invalidate_tlbs(ring, pasid, flush_type, all_hub);
-	r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
-	if (r) {
-		amdgpu_ring_undo(ring);
+		amdgpu_ring_commit(ring);
 		spin_unlock(&adev->gfx.kiq[inst].ring_lock);
-		goto error_unlock_reset;
-	}
-
-	amdgpu_ring_commit(ring);
-	spin_unlock(&adev->gfx.kiq[inst].ring_lock);
-	r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);
-	if (r < 1) {
-		dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);
-		r = -ETIME;
-		goto error_unlock_reset;
+		if (amdgpu_fence_wait_polling(ring, seq, usec_timeout) < 1) {
+			dev_err(adev->dev, "timeout waiting for kiq fence\n");
+			r = -ETIME;
+		}
 	}
-	r = 0;
 
 error_unlock_reset:
 	up_read(&adev->reset_domain->sem);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks
  2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
                     ` (6 preceding siblings ...)
  2024-05-30 21:48   ` [PATCH v3 7/8] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
@ 2024-05-30 21:48   ` Yunxiang Li
  2024-05-31  6:50     ` Christian König
  7 siblings, 1 reply; 52+ messages in thread
From: Yunxiang Li @ 2024-05-30 21:48 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alexander.Deucher, christian.koenig, Yunxiang Li, felix.kuehling

These functions are missing the reset domain lock.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v3: only bracket amdgpu_device_flush_hdp with the read lock,
    amdgpu_gmc_flush_gpu_tlb already takes the lock 

 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c               | 6 +++++-
 drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index eb172388d99e..256b95232de5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -34,6 +34,7 @@
 #include <asm/set_memory.h>
 #endif
 #include "amdgpu.h"
+#include "amdgpu_reset.h"
 #include <drm/drm_drv.h>
 #include <drm/ttm/ttm_tt.h>
 
@@ -405,7 +406,10 @@ void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev)
 		return;
 
 	mb();
-	amdgpu_device_flush_hdp(adev, NULL);
+	if (down_read_trylock(&adev->reset_domain->sem)) {
+		amdgpu_device_flush_hdp(adev, NULL);
+		up_read(&adev->reset_domain->sem);
+	}
 	for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
 		amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 86ea610b16f3..21f5a1fb3bf8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -28,6 +28,7 @@
 #include "kfd_priv.h"
 #include "kfd_kernel_queue.h"
 #include "amdgpu_amdkfd.h"
+#include "amdgpu_reset.h"
 
 static inline struct process_queue_node *get_queue_by_qid(
 			struct process_queue_manager *pqm, unsigned int qid)
@@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
 		return;
 
 	dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
-	if (dev->kfd->shared_resources.enable_mes)
-		amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
+	if (dev->kfd->shared_resources.enable_mes &&
+	    down_read_trylock(&dev->adev->reset_domain->sem)) {
+		amdgpu_mes_flush_shader_debugger(dev->adev,
+						 pdd->proc_ctx_gpu_addr);
+		up_read(&dev->adev->reset_domain->sem);
+	}
 	pdd->already_dequeued = true;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
  2024-05-28 17:23 ` [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks Yunxiang Li
  2024-05-29  6:55   ` Christian König
@ 2024-05-30 22:02   ` Felix Kuehling
  2024-05-30 22:35     ` Li, Yunxiang (Teddy)
  2024-05-31  6:52     ` Christian König
  1 sibling, 2 replies; 52+ messages in thread
From: Felix Kuehling @ 2024-05-30 22:02 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx
  Cc: Alexander.Deucher, christian.koenig, Likun.Gao, Hawking.Zhang

On 2024-05-28 13:23, Yunxiang Li wrote:
> These functions are missing the lock for reset domain.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c               | 4 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c                | 8 ++++++--
>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
>   3 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> index eb172388d99e..ddc5e9972da8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> @@ -34,6 +34,7 @@
>   #include <asm/set_memory.h>
>   #endif
>   #include "amdgpu.h"
> +#include "amdgpu_reset.h"
>   #include <drm/drm_drv.h>
>   #include <drm/ttm/ttm_tt.h>
>   
> @@ -401,13 +402,14 @@ void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev)
>   {
>   	int i;
>   
> -	if (!adev->gart.ptr)
> +	if (!adev->gart.ptr || !down_read_trylock(&adev->reset_domain->sem))
>   		return;
>   
>   	mb();
>   	amdgpu_device_flush_hdp(adev, NULL);
>   	for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
>   		amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
> +	up_read(&adev->reset_domain->sem);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index e4742b65032d..52a3170d15b7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -307,8 +307,12 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>   		dev_dbg(adev->dev, "Skip scheduling IBs in ring(%s)",
>   			ring->name);
>   	} else {
> -		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
> -				       &fence);
> +		r = -ETIME;
> +		if (down_read_trylock(&adev->reset_domain->sem)) {
> +			r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
> +					       job, &fence);
> +			up_read(&adev->reset_domain->sem);
> +		}
>   		if (r)
>   			dev_err(adev->dev,
>   				"Error scheduling IBs (%d) in ring(%s)", r,
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 86ea610b16f3..21f5a1fb3bf8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -28,6 +28,7 @@
>   #include "kfd_priv.h"
>   #include "kfd_kernel_queue.h"
>   #include "amdgpu_amdkfd.h"
> +#include "amdgpu_reset.h"
>   
>   static inline struct process_queue_node *get_queue_by_qid(
>   			struct process_queue_manager *pqm, unsigned int qid)
> @@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
>   		return;
>   
>   	dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> -	if (dev->kfd->shared_resources.enable_mes)
> -		amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
> +	if (dev->kfd->shared_resources.enable_mes &&
> +	    down_read_trylock(&dev->adev->reset_domain->sem)) {
> +		amdgpu_mes_flush_shader_debugger(dev->adev,
> +						 pdd->proc_ctx_gpu_addr);
> +		

It's not clear to me what's the requirement for reset domain locking 
around MES calls. We have a lot more of them in 
kfd_device_queue_manager.c (mostly calling adev->mes.funcs->... 
directly). Do they all need to be wrapped individually?

Regards,
   Felix


> up_read(&dev->adev->reset_domain->sem);
> +	}
>   	pdd->already_dequeued = true;
>   }
>   

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
  2024-05-30 22:02   ` Felix Kuehling
@ 2024-05-30 22:35     ` Li, Yunxiang (Teddy)
  2024-05-31  6:52     ` Christian König
  1 sibling, 0 replies; 52+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-05-30 22:35 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Gao, Likun, Zhang, Hawking

[AMD Official Use Only - AMD Internal Distribution Only]

> It's not clear to me what's the requirement for reset domain locking around
> MES calls. We have a lot more of them in kfd_device_queue_manager.c
> (mostly calling adev->mes.funcs->...
> directly). Do they all need to be wrapped individually?

If they can be called between the start of a HW reset and the end of HW re-init, or, in the case of SRIOV, between requesting full access and releasing full access, then we need to lock them.

Teddy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks
  2024-05-30 21:48   ` [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks Yunxiang Li
@ 2024-05-31  6:50     ` Christian König
  0 siblings, 0 replies; 52+ messages in thread
From: Christian König @ 2024-05-31  6:50 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx; +Cc: Alexander.Deucher, felix.kuehling

Am 30.05.24 um 23:48 schrieb Yunxiang Li:
> These functions are missing the lock for reset domain.

Please separate the GART changes from the KFD changes. Apart from that 
looks good to me.

Thanks,
Christian.

>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
> v3: only bracket amdgpu_device_flush_hdp with the read lock,
>      amdgpu_gmc_flush_gpu_tlb already takes the lock
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c               | 6 +++++-
>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
>   2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> index eb172388d99e..256b95232de5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
> @@ -34,6 +34,7 @@
>   #include <asm/set_memory.h>
>   #endif
>   #include "amdgpu.h"
> +#include "amdgpu_reset.h"
>   #include <drm/drm_drv.h>
>   #include <drm/ttm/ttm_tt.h>
>   
> @@ -405,7 +406,10 @@ void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev)
>   		return;
>   
>   	mb();
> -	amdgpu_device_flush_hdp(adev, NULL);
> +	if (down_read_trylock(&adev->reset_domain->sem)) {
> +		amdgpu_device_flush_hdp(adev, NULL);
> +		up_read(&adev->reset_domain->sem);
> +	}
>   	for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
>   		amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
>   }
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 86ea610b16f3..21f5a1fb3bf8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -28,6 +28,7 @@
>   #include "kfd_priv.h"
>   #include "kfd_kernel_queue.h"
>   #include "amdgpu_amdkfd.h"
> +#include "amdgpu_reset.h"
>   
>   static inline struct process_queue_node *get_queue_by_qid(
>   			struct process_queue_manager *pqm, unsigned int qid)
> @@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
>   		return;
>   
>   	dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
> -	if (dev->kfd->shared_resources.enable_mes)
> -		amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
> +	if (dev->kfd->shared_resources.enable_mes &&
> +	    down_read_trylock(&dev->adev->reset_domain->sem)) {
> +		amdgpu_mes_flush_shader_debugger(dev->adev,
> +						 pdd->proc_ctx_gpu_addr);
> +		up_read(&dev->adev->reset_domain->sem);
> +	}
>   	pdd->already_dequeued = true;
>   }
>   


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
  2024-05-30 22:02   ` Felix Kuehling
  2024-05-30 22:35     ` Li, Yunxiang (Teddy)
@ 2024-05-31  6:52     ` Christian König
  2024-05-31 15:47       ` Felix Kuehling
  1 sibling, 1 reply; 52+ messages in thread
From: Christian König @ 2024-05-31  6:52 UTC (permalink / raw)
  To: Felix Kuehling, Yunxiang Li, amd-gfx
  Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang

Am 31.05.24 um 00:02 schrieb Felix Kuehling:
> On 2024-05-28 13:23, Yunxiang Li wrote:
>> These functions are missing the lock for reset domain.
>>
>> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c               | 4 +++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c                | 8 ++++++--
>>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
>>   3 files changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> index eb172388d99e..ddc5e9972da8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>> @@ -34,6 +34,7 @@
>>   #include <asm/set_memory.h>
>>   #endif
>>   #include "amdgpu.h"
>> +#include "amdgpu_reset.h"
>>   #include <drm/drm_drv.h>
>>   #include <drm/ttm/ttm_tt.h>
>>   @@ -401,13 +402,14 @@ void amdgpu_gart_invalidate_tlb(struct 
>> amdgpu_device *adev)
>>   {
>>       int i;
>>   -    if (!adev->gart.ptr)
>> +    if (!adev->gart.ptr || 
>> !down_read_trylock(&adev->reset_domain->sem))
>>           return;
>>         mb();
>>       amdgpu_device_flush_hdp(adev, NULL);
>>       for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
>>           amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
>> +    up_read(&adev->reset_domain->sem);
>>   }
>>     /**
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index e4742b65032d..52a3170d15b7 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -307,8 +307,12 @@ static struct dma_fence *amdgpu_job_run(struct 
>> drm_sched_job *sched_job)
>>           dev_dbg(adev->dev, "Skip scheduling IBs in ring(%s)",
>>               ring->name);
>>       } else {
>> -        r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
>> -                       &fence);
>> +        r = -ETIME;
>> +        if (down_read_trylock(&adev->reset_domain->sem)) {
>> +            r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>> +                           job, &fence);
>> +            up_read(&adev->reset_domain->sem);
>> +        }
>>           if (r)
>>               dev_err(adev->dev,
>>                   "Error scheduling IBs (%d) in ring(%s)", r,
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> index 86ea610b16f3..21f5a1fb3bf8 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>> @@ -28,6 +28,7 @@
>>   #include "kfd_priv.h"
>>   #include "kfd_kernel_queue.h"
>>   #include "amdgpu_amdkfd.h"
>> +#include "amdgpu_reset.h"
>>     static inline struct process_queue_node *get_queue_by_qid(
>>               struct process_queue_manager *pqm, unsigned int qid)
>> @@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct 
>> kfd_process_device *pdd)
>>           return;
>>         dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
>> -    if (dev->kfd->shared_resources.enable_mes)
>> -        amdgpu_mes_flush_shader_debugger(dev->adev, 
>> pdd->proc_ctx_gpu_addr);
>> +    if (dev->kfd->shared_resources.enable_mes &&
>> + down_read_trylock(&dev->adev->reset_domain->sem)) {
>> +        amdgpu_mes_flush_shader_debugger(dev->adev,
>> +                         pdd->proc_ctx_gpu_addr);
>> +
>
> It's not clear to me what's the requirement for reset domain locking 
> around MES calls. We have a lot more of them in 
> kfd_device_queue_manager.c (mostly calling adev->mes.funcs->... 
> directly). Do they all need to be wrapped individually?

Whenever you call a MES function (or any other function directly 
interacting with the rings or the HW registers) you need to make sure 
that at least the read side of the reset lock is held.

Regards,
Christian.

>
> Regards,
>   Felix
>
>
>> +        up_read(&dev->adev->reset_domain->sem);
>> +    }
>>       pdd->already_dequeued = true;
>>   }


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-29 15:19                   ` Christian König
@ 2024-05-31 14:44                     ` Liu, Shaoyun
  2024-06-03 10:58                       ` Christian König
  0 siblings, 1 reply; 52+ messages in thread
From: Liu, Shaoyun @ 2024-05-31 14:44 UTC (permalink / raw)
  To: Christian König, Li, Yunxiang (Teddy), Koenig, Christian,
	amd-gfx@lists.freedesktop.org, Deucher, Alexander, Xiao, Hua

[AMD Official Use Only - AMD Internal Distribution Only]

Hi, Christian

I think we had a discussion about this before. Alex also has a change that allows the driver to use a different write-back address for the fence for each submission, for the original issue.
From the MES point of view, MES updates the fence when the API call completes successfully, so if the API (e.g. remove_queue) fails due to another component's issue (e.g. a CP hang), MES will not update the fence in this situation, but MES itself still works and can respond to other commands (e.g. read_reg). Alex's change allows the driver to check the fence for each API call without mixing them up. If you expect MES to stop responding to further commands after one API call fails, that will introduce a compatibility issue, since this design already ships in customer products and MES also needs to work for Windows. Also, MES always needs to respond to some commands like RESET etc., which might make things worse if we need to change the logic.

One possible solution is for MES to trigger an interrupt that indicates, with the seq number, which submission has failed. In this case the driver can learn of the failed submission to MES in time and make its own decision about what to do next. What do you think about this?

Regards
Shaoyun.liu

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
Sent: Wednesday, May 29, 2024 11:19 AM
To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

On 29.05.24 at 16:48, Li, Yunxiang (Teddy) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> Yeah, I know. That's one of the reasons I pointed out on the patch
>> adding it that this behavior is actually completely broken.
>>
>> If you run into issues with the MES because of this then please
>> suggest a revert of that patch.
>> I think it just needs to be improved to allow this force-signal behavior. The current behavior is slow/inconvenient, but the old behavior is wrong, since MES will continue processing submissions even when one submission fails. So with just one fence location there's no way to tell whether a command failed or not.

No, the MES behavior is broken. When a submission fails it should stop processing, or signal through some other mechanism that the operation didn't complete.

Just not writing the fence and continuing results in tons of problems, from the TLB fence all the way to the ring buffer and reset handling.

This is a hard requirement and really can't be changed.

Regards,
Christian.


* Re: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
  2024-05-31  6:52     ` Christian König
@ 2024-05-31 15:47       ` Felix Kuehling
  2024-06-04 12:52         ` Li, Yunxiang (Teddy)
  0 siblings, 1 reply; 52+ messages in thread
From: Felix Kuehling @ 2024-05-31 15:47 UTC (permalink / raw)
  To: Christian König, Yunxiang Li, amd-gfx
  Cc: Alexander.Deucher, Likun.Gao, Hawking.Zhang


On 2024-05-31 2:52, Christian König wrote:
> On 31.05.24 at 00:02, Felix Kuehling wrote:
>> On 2024-05-28 13:23, Yunxiang Li wrote:
>>> These functions are missing the lock for reset domain.
>>>
>>> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c               | 4 +++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c                | 8 ++++++--
>>>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
>>>   3 files changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> index eb172388d99e..ddc5e9972da8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> @@ -34,6 +34,7 @@
>>>   #include <asm/set_memory.h>
>>>   #endif
>>>   #include "amdgpu.h"
>>> +#include "amdgpu_reset.h"
>>>   #include <drm/drm_drv.h>
>>>   #include <drm/ttm/ttm_tt.h>
>>>   @@ -401,13 +402,14 @@ void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev)
>>>   {
>>>       int i;
>>>   -    if (!adev->gart.ptr)
>>> +    if (!adev->gart.ptr || !down_read_trylock(&adev->reset_domain->sem))
>>>           return;
>>>         mb();
>>>       amdgpu_device_flush_hdp(adev, NULL);
>>>       for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
>>>           amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
>>> +    up_read(&adev->reset_domain->sem);
>>>   }
>>>     /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index e4742b65032d..52a3170d15b7 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -307,8 +307,12 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>>>           dev_dbg(adev->dev, "Skip scheduling IBs in ring(%s)",
>>>               ring->name);
>>>       } else {
>>> -        r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
>>> -                       &fence);
>>> +        r = -ETIME;
>>> +        if (down_read_trylock(&adev->reset_domain->sem)) {
>>> +            r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>> +                           job, &fence);
>>> +            up_read(&adev->reset_domain->sem);
>>> +        }
>>>           if (r)
>>>               dev_err(adev->dev,
>>>                   "Error scheduling IBs (%d) in ring(%s)", r,
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> index 86ea610b16f3..21f5a1fb3bf8 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> @@ -28,6 +28,7 @@
>>>   #include "kfd_priv.h"
>>>   #include "kfd_kernel_queue.h"
>>>   #include "amdgpu_amdkfd.h"
>>> +#include "amdgpu_reset.h"
>>>     static inline struct process_queue_node *get_queue_by_qid(
>>>               struct process_queue_manager *pqm, unsigned int qid)
>>> @@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
>>>           return;
>>>         dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
>>> -    if (dev->kfd->shared_resources.enable_mes)
>>> -        amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
>>> +    if (dev->kfd->shared_resources.enable_mes &&
>>> +        down_read_trylock(&dev->adev->reset_domain->sem)) {
>>> +        amdgpu_mes_flush_shader_debugger(dev->adev,
>>> +                         pdd->proc_ctx_gpu_addr);
>>> +
>>
>> It's not clear to me what's the requirement for reset domain locking around MES calls. We have a lot more of them in kfd_device_queue_manager.c (mostly calling adev->mes.funcs->... directly). Do they all need to be wrapped individually?
> 
> Whenever you call a MES function (or any other function directly interacting with the rings or the HW registers) you need to make sure that at least the read side of the reset lock is held.

Having to do that for each caller of amdgpu_mes functions seems error-prone.

Would it make sense to wrap that inside amdgpu_mes_lock/unlock? Maybe turn it into amdgpu_mes_trylock/unlock and make sure that all the amdgpu_mes functions that take that lock can fail and return an error code. Add an attribute so the compiler can flag callers that ignore the return values. This would make it easier to let the compiler spot places that don't handle errors due to reset-lock failures.

Regards,
  Felix

> 
> Regards,
> Christian.
> 
>>
>> Regards,
>>   Felix
>>
>>
>>> +        up_read(&dev->adev->reset_domain->sem);
>>> +    }
>>>       pdd->already_dequeued = true;
>>>   }
> 


* Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-05-31 14:44                     ` Liu, Shaoyun
@ 2024-06-03 10:58                       ` Christian König
  2024-06-03 18:28                         ` Liu, Shaoyun
  0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2024-06-03 10:58 UTC (permalink / raw)
  To: Liu, Shaoyun, Christian König, Li, Yunxiang (Teddy),
	amd-gfx@lists.freedesktop.org, Deucher, Alexander, Xiao, Hua

Hi Shaoyun,

yes, my thinking goes in the same direction. The basic problem here
is that we are trying to stuff two different pieces of information
into the same variable.

The first piece is whether the commands have been read by the MES
from the ring buffer. This information is necessary for normal ring
buffer and reset handling, e.g. preventing ring buffer overflows,
ordering of commands, lockups during reset, etc.

The second piece is whether a certain operation was successful or
not. For example, this is necessary to get signaled back whether a
queue map/unmap operation was successful, or whether the CP is not
responding or any other error has happened, etc.

Another issue is that while it is in general a good idea to have the
firmware work in a way where errors are reported instead of
completely stopping all processing, here we run into trouble because
the driver usually assumes that work can be scheduled on the ring
buffer and that subsequent work is processed only when everything
previously submitted has completed successfully.

So as an initial fix for the issue we see, I sent Alex a patch on
Friday to partially revert his change to use an individual writeback
for each submission. Instead we will submit an additional
QUERY_STATUS command after the real command and let that one write
the fence value. This way the fence value is always written,
independent of the result of the operation.

Additionally, we need to insert something like a dependency between
submissions, e.g. when you have commands A, B and C on the ring and C
can only execute when A was successful, then we need to somehow tell
the MES that. The only other alternative is to not schedule commands
behind each other on the ring, and that in turn is a bad idea from a
performance point of view.

Regards,
Christian.

On 31.05.24 at 16:44, Liu, Shaoyun wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi, Christian
>
> I think we had a discussion about this before. Alex also has a change that allows the driver to use a different write-back address for the fence for each submission, for the original issue.
>  From the MES point of view, MES updates the fence when the API call completes successfully, so if the API (e.g. remove_queue) fails due to another component's issue (e.g. a CP hang), MES will not update the fence in this situation, but MES itself still works and can respond to other commands (e.g. read_reg). Alex's change allows the driver to check the fence for each API call without mixing them up. If you expect MES to stop responding to further commands after one API call fails, that will introduce a compatibility issue, since this design already ships in customer products and MES also needs to work for Windows. Also, MES always needs to respond to some commands like RESET etc., which might make things worse if we need to change the logic.
>
> One possible solution is for MES to trigger an interrupt that indicates, with the seq number, which submission has failed. In this case the driver can learn of the failed submission to MES in time and make its own decision about what to do next. What do you think about this?
>
> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Christian König
> Sent: Wednesday, May 29, 2024 11:19 AM
> To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
>
On 29.05.24 at 16:48, Li, Yunxiang (Teddy) wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>>> Yeah, I know. That's one of the reasons I pointed out on the patch
>>> adding it that this behavior is actually completely broken.
>>>
>>> If you run into issues with the MES because of this then please
>>> suggest a revert of that patch.
>> I think it just needs to be improved to allow this force-signal behavior. The current behavior is slow/inconvenient, but the old behavior is wrong, since MES will continue processing submissions even when one submission fails. So with just one fence location there's no way to tell whether a command failed or not.
> No, the MES behavior is broken. When a submission fails it should stop processing, or signal through some other mechanism that the operation didn't complete.
>
> Just not writing the fence and continuing results in tons of problems, from the TLB fence all the way to the ring buffer and reset handling.
>
> This is a hard requirement and really can't be changed.
>
> Regards,
> Christian.



* RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-06-03 10:58                       ` Christian König
@ 2024-06-03 18:28                         ` Liu, Shaoyun
  2024-06-04  8:07                           ` Christian König
  0 siblings, 1 reply; 52+ messages in thread
From: Liu, Shaoyun @ 2024-06-03 18:28 UTC (permalink / raw)
  To: Koenig, Christian, Christian König, Li,  Yunxiang (Teddy),
	amd-gfx@lists.freedesktop.org, Deucher, Alexander, Xiao, Hua


Thanks, Christian, for the detailed explanation.
I checked your patch; you try to use the query_scheduler_status package to check command completion. It may not work as expected, since this API queries the status of MES itself, so MES can update the fence address with the expected seq value while the command itself (e.g. remove_queue for MES, after which MES sends unmap_queue to the KIQ internally) still fails.
For MES, the driver always polls for command completion; do you think it's an acceptable solution for MES to set a specific failure value (e.g. -1) in the fence address to indicate the failure of the operation? But that would be similar to letting the driver poll for completion until timeout. MES internally also needs to wait for a timeout on some commands that it sends to the CP (e.g. 2 seconds for the unmap_queue command). I'm actually a little bit confused here: does the driver use a lock to ensure there is only one submission to MES at any time?
 MES can also trigger an interrupt on failure if the driver side requires us to do so; the payload will have the seq number to indicate which submission caused the failure. That might require more code changes on the driver side. Please let me know what's preferred from the driver side.

Regards
Shaoyun.liu

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Monday, June 3, 2024 6:59 AM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Xiao, Hua <Hua.Xiao@amd.com>
Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

Hi Shaoyun,

yes, my thinking goes in the same direction. The basic problem here is that we are trying to stuff two different pieces of information into the same variable.

The first piece is whether the commands have been read by the MES from the ring buffer. This information is necessary for normal ring buffer and reset handling, e.g. preventing ring buffer overflows, ordering of commands, lockups during reset, etc.

The second piece is whether a certain operation was successful or not. For example, this is necessary to get signaled back whether a queue map/unmap operation was successful, or whether the CP is not responding or any other error has happened, etc.

Another issue is that while it is in general a good idea to have the firmware work in a way where errors are reported instead of completely stopping all processing, here we run into trouble because the driver usually assumes that work can be scheduled on the ring buffer and that subsequent work is processed only when everything previously submitted has completed successfully.

So as an initial fix for the issue we see, I sent Alex a patch on Friday to partially revert his change to use an individual writeback for each submission. Instead we will submit an additional QUERY_STATUS command after the real command and let that one write the fence value. This way the fence value is always written, independent of the result of the operation.

Additionally, we need to insert something like a dependency between submissions, e.g. when you have commands A, B and C on the ring and C can only execute when A was successful, then we need to somehow tell the MES that. The only other alternative is to not schedule commands behind each other on the ring, and that in turn is a bad idea from a performance point of view.

Regards,
Christian.

On 31.05.24 at 16:44, Liu, Shaoyun wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi, Christian
>
> I think we had a discussion about this before. Alex also has a change that allows the driver to use a different write-back address for the fence for each submission, for the original issue.
>  From the MES point of view, MES updates the fence when the API call completes successfully, so if the API (e.g. remove_queue) fails due to another component's issue (e.g. a CP hang), MES will not update the fence in this situation, but MES itself still works and can respond to other commands (e.g. read_reg). Alex's change allows the driver to check the fence for each API call without mixing them up. If you expect MES to stop responding to further commands after one API call fails, that will introduce a compatibility issue, since this design already ships in customer products and MES also needs to work for Windows. Also, MES always needs to respond to some commands like RESET etc., which might make things worse if we need to change the logic.
>
> One possible solution is for MES to trigger an interrupt that indicates, with the seq number, which submission has failed. In this case the driver can learn of the failed submission to MES in time and make its own decision about what to do next. What do you think about this?
>
> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Christian König
> Sent: Wednesday, May 29, 2024 11:19 AM
> To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is
> started
>
> On 29.05.24 at 16:48, Li, Yunxiang (Teddy) wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>>> Yeah, I know. That's one of the reasons I pointed out on the patch
>>> adding it that this behavior is actually completely broken.
>>>
>>> If you run into issues with the MES because of this then please
>>> suggest a revert of that patch.
>> I think it just needs to be improved to allow this force-signal behavior. The current behavior is slow/inconvenient, but the old behavior is wrong, since MES will continue processing submissions even when one submission fails. So with just one fence location there's no way to tell whether a command failed or not.
> No, the MES behavior is broken. When a submission fails it should stop processing, or signal through some other mechanism that the operation didn't complete.
>
> Just not writing the fence and continuing results in tons of problems, from the TLB fence all the way to the ring buffer and reset handling.
>
> This is a hard requirement and really can't be changed.
>
> Regards,
> Christian.



* Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-06-03 18:28                         ` Liu, Shaoyun
@ 2024-06-04  8:07                           ` Christian König
  2024-06-05 12:32                             ` Liu, Shaoyun
  0 siblings, 1 reply; 52+ messages in thread
From: Christian König @ 2024-06-04  8:07 UTC (permalink / raw)
  To: Liu, Shaoyun, Christian König, Li, Yunxiang (Teddy),
	amd-gfx@lists.freedesktop.org, Deucher, Alexander, Xiao, Hua

Hi Shaoyun,

see inline.

On 03.06.24 at 20:28, Liu, Shaoyun wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Thanks, Christian, for the detailed explanation.
> I checked your patch; you try to use the query_scheduler_status package to check command completion. It may not work as expected, since this API queries the status of MES itself, so MES can update the fence address with the expected seq value while the command itself (e.g. remove_queue for MES, after which MES sends unmap_queue to the KIQ internally) still fails.

And that is exactly the desired behavior.

See, the fence value is for ring buffer processing and for getting
feedback, in the case of a reset for example, on whether the MES has
processed the commands.

Whether that processing was successful or not *must* be completely
irrelevant for writing the fence value.

> For MES, the driver always polls for command completion

No, exactly that's what we want to avoid as much as possible.

Polling means that we throw away tons of CPU cycles, and especially
on fault handling and TLB flushing that is an absolutely unacceptable
performance loss.

>   Do you think it's an acceptable solution for MES to set a specific failure value (e.g. -1) in the fence address to indicate the failure of the operation? But that would be similar to letting the driver poll for completion until timeout. MES internally also needs to wait for a timeout on some commands that it sends to the CP (e.g. 2 seconds for the unmap_queue command).

No, what we should really do is separate the fence and the result
values, and then give each operation an input dependency.

> I'm actually a little bit confused here: does the driver use a lock to ensure there is only one submission to MES at any time?

No, exactly that's what we try to avoid. Otherwise we wouldn't need a
ring buffer in the first place.

>   MES can also trigger an interrupt on failure if the driver side requires us to do so; the payload will have the seq number to indicate which submission caused the failure. That might require more code changes on the driver side. Please let me know what's preferred from the driver side.

I can come up with a more detailed explanation of the driver 
requirements when I'm back from sick leave.

Regards,
Christian.

>
> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Monday, June 3, 2024 6:59 AM
> To: Liu, Shaoyun <Shaoyun.Liu@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Xiao, Hua <Hua.Xiao@amd.com>
> Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
>
> Hi Shaoyun,
>
> yes, my thinking goes in the same direction. The basic problem here is that we are trying to stuff two different pieces of information into the same variable.
>
> The first piece is whether the commands have been read by the MES from the ring buffer. This information is necessary for normal ring buffer and reset handling, e.g. preventing ring buffer overflows, ordering of commands, lockups during reset, etc.
>
> The second piece is whether a certain operation was successful or not. For example, this is necessary to get signaled back whether a queue map/unmap operation was successful, or whether the CP is not responding or any other error has happened, etc.
>
> Another issue is that while it is in general a good idea to have the firmware work in a way where errors are reported instead of completely stopping all processing, here we run into trouble because the driver usually assumes that work can be scheduled on the ring buffer and that subsequent work is processed only when everything previously submitted has completed successfully.
>
> So as an initial fix for the issue we see, I sent Alex a patch on Friday to partially revert his change to use an individual writeback for each submission. Instead we will submit an additional QUERY_STATUS command after the real command and let that one write the fence value. This way the fence value is always written, independent of the result of the operation.
>
> Additionally, we need to insert something like a dependency between submissions, e.g. when you have commands A, B and C on the ring and C can only execute when A was successful, then we need to somehow tell the MES that. The only other alternative is to not schedule commands behind each other on the ring, and that in turn is a bad idea from a performance point of view.
>
> Regards,
> Christian.
>
On 31.05.24 at 16:44, Liu, Shaoyun wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Hi, Christian
>>
>> I think we had a discussion about this before. Alex also has a change that allows the driver to use a different write-back address for the fence for each submission, for the original issue.
>>   From the MES point of view, MES updates the fence when the API call completes successfully, so if the API (e.g. remove_queue) fails due to another component's issue (e.g. a CP hang), MES will not update the fence in this situation, but MES itself still works and can respond to other commands (e.g. read_reg). Alex's change allows the driver to check the fence for each API call without mixing them up. If you expect MES to stop responding to further commands after one API call fails, that will introduce a compatibility issue, since this design already ships in customer products and MES also needs to work for Windows. Also, MES always needs to respond to some commands like RESET etc., which might make things worse if we need to change the logic.
>>
>> One possible solution is for MES to trigger an interrupt that indicates, with the seq number, which submission has failed. In this case the driver can learn of the failed submission to MES in time and make its own decision about what to do next. What do you think about this?
>>
>> Regards
>> Shaoyun.liu
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>> Christian König
>> Sent: Wednesday, May 29, 2024 11:19 AM
>> To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is
>> started
>>
> On 29.05.24 at 16:48, Li, Yunxiang (Teddy) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>> Yeah, I know. That's one of the reasons I pointed out on the patch
>>>> adding it that this behavior is actually completely broken.
>>>>
>>>> If you run into issues with the MES because of this then please
>>>> suggest a revert of that patch.
>>> I think it just needs to be improved to allow this force-signal behavior. The current behavior is slow/inconvenient, but the old behavior is wrong, since MES will continue processing submissions even when one submission fails. So with just one fence location there's no way to tell whether a command failed or not.
>> No, the MES behavior is broken. When a submission fails it should stop processing, or signal through some other mechanism that the operation didn't complete.
>>
>> Just not writing the fence and continuing results in tons of problems, from the TLB fence all the way to the ring buffer and reset handling.
>>
>> This is a hard requirement and really can't be changed.
>>
>> Regards,
>> Christian.



* RE: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks
  2024-05-31 15:47       ` Felix Kuehling
@ 2024-06-04 12:52         ` Li, Yunxiang (Teddy)
  0 siblings, 0 replies; 52+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-06-04 12:52 UTC (permalink / raw)
  To: Kuehling, Felix, Koenig, Christian, amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Gao, Likun, Zhang, Hawking


The trouble with taking the read-side lock in the MES helper functions is that we use a lot of them during reset, under the write lock. So either we need to duplicate the helper functions, or we get inconsistencies where a random subset of the helper functions take the lock themselves while for the other helper functions you need to take the lock outside. "Always take the lock before accessing MES" seems to be the least bad option.

Teddy


* RE: [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler
  2024-05-30 21:47   ` [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler Yunxiang Li
@ 2024-06-05  1:12     ` Deng, Emily
  0 siblings, 0 replies; 52+ messages in thread
From: Deng, Emily @ 2024-06-05  1:12 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx@lists.freedesktop.org
  Cc: Deucher, Alexander, Koenig, Christian, Chang, HaiJun


Reviewed-by: Emily Deng <Emily.Deng@amd.com>

>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Friday, May 31, 2024 5:48 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>;
>Chang, HaiJun <HaiJun.Chang@amd.com>; Deng, Emily
><Emily.Deng@amd.com>
>Subject: [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler
>
>We send back the ready-to-reset message before we stop anything. This is
>wrong. Move it to when we are actually ready for the FLR to happen.
>
>In the current state, since we take tens of seconds to stop everything, it is
>very likely that the host would give up waiting and reset the GPU before we
>send ready, so it would be the same as before. But this gets rid of the hack
>with reset_domain locking and also lets us know how slow the reset actually
>is on the host. The pre-reset speed can thus be improved later.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
>v3: still call amdgpu_virt_fini_data_exchange right away; it could take
>    a while for the reset to grab its lock and call this function in
>    pre_reset, so during this time the thread would read garbage.
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 14 ++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 ++
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 39 +++++++++-------------
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 39 +++++++++-------------
> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 ----
> 6 files changed, 50 insertions(+), 52 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index bf1a6593dc5e..eb77b4ec3cb4 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5069,6 +5069,8 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
>       struct amdgpu_hive_info *hive = NULL;
>
>       if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
>+              amdgpu_virt_ready_to_reset(adev);
>+              amdgpu_virt_wait_reset(adev);
>               clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
>               r = amdgpu_virt_request_full_gpu(adev, true);
>       } else {
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>index 3cf8416f8cb0..44450507c140 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
>@@ -152,6 +152,20 @@ void amdgpu_virt_request_init_data(struct
>amdgpu_device *adev)
>               DRM_WARN("host doesn't support REQ_INIT_DATA handshake\n");
> }
>
>+/**
>+ * amdgpu_virt_ready_to_reset() - send ready to reset to host
>+ * @adev:     amdgpu device.
>+ * Send ready to reset message to GPU hypervisor to signal we have
>+ * stopped GPU activity and are ready for host FLR
>+ */
>+void amdgpu_virt_ready_to_reset(struct amdgpu_device *adev)
>+{
>+      struct amdgpu_virt *virt = &adev->virt;
>+
>+      if (virt->ops && virt->ops->ready_to_reset)
>+              virt->ops->ready_to_reset(adev);
>+}
>+
> /**
>  * amdgpu_virt_wait_reset() - wait for reset gpu completed
>  * @adev:     amdgpu device.
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>index 642f1fd287d8..66de5380d9a1 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>@@ -88,6 +88,7 @@ struct amdgpu_virt_ops {
>       int (*rel_full_gpu)(struct amdgpu_device *adev, bool init);
>       int (*req_init_data)(struct amdgpu_device *adev);
>       int (*reset_gpu)(struct amdgpu_device *adev);
>+      void (*ready_to_reset)(struct amdgpu_device *adev);
>       int (*wait_reset)(struct amdgpu_device *adev);
>       void (*trans_msg)(struct amdgpu_device *adev, enum idh_request
>req,
>                         u32 data1, u32 data2, u32 data3);
>@@ -345,6 +346,7 @@ int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
> int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
> int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
> void amdgpu_virt_request_init_data(struct amdgpu_device *adev);
>+void amdgpu_virt_ready_to_reset(struct amdgpu_device *adev);
> int amdgpu_virt_wait_reset(struct amdgpu_device *adev);
> int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
> void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>index f4c47492e0cd..6b71ee85ee65 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>@@ -249,38 +249,30 @@ static int xgpu_ai_set_mailbox_ack_irq(struct
>amdgpu_device *adev,
>       return 0;
> }
>
>-static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>+static void xgpu_ai_ready_to_reset(struct amdgpu_device *adev)
> {
>-      struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt,
>flr_work);
>-      struct amdgpu_device *adev = container_of(virt, struct
>amdgpu_device, virt);
>-      int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>-
>-      /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>-       * otherwise the mailbox msg will be ruined/reseted by
>-       * the VF FLR.
>-       */
>-      if (atomic_cmpxchg(&adev->reset_domain->in_gpu_reset, 0, 1) != 0)
>-              return;
>-
>-      down_write(&adev->reset_domain->sem);
>-
>-      amdgpu_virt_fini_data_exchange(adev);
>-
>       xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>+}
>
>+static int xgpu_ai_wait_reset(struct amdgpu_device *adev)
>+{
>+      int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>       do {
>               if (xgpu_ai_mailbox_peek_msg(adev) ==
>IDH_FLR_NOTIFICATION_CMPL)
>-                      goto flr_done;
>-
>+                      return 0;
>               msleep(10);
>               timeout -= 10;
>       } while (timeout > 1);
>-
>       dev_warn(adev->dev, "waiting IDH_FLR_NOTIFICATION_CMPL
>timeout\n");
>+      return -ETIME;
>+}
>
>-flr_done:
>-      atomic_set(&adev->reset_domain->in_gpu_reset, 0);
>-      up_write(&adev->reset_domain->sem);
>+static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>+{
>+      struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
>+      struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>+
>+      amdgpu_virt_fini_data_exchange(adev);
>
>       /* Trigger recovery for world switch failure if no TDR */
>       if (amdgpu_device_should_recover_gpu(adev)
>@@ -417,7 +409,8 @@ const struct amdgpu_virt_ops xgpu_ai_virt_ops = {
>       .req_full_gpu   = xgpu_ai_request_full_gpu_access,
>       .rel_full_gpu   = xgpu_ai_release_full_gpu_access,
>       .reset_gpu = xgpu_ai_request_reset,
>-      .wait_reset = NULL,
>+      .ready_to_reset = xgpu_ai_ready_to_reset,
>+      .wait_reset = xgpu_ai_wait_reset,
>       .trans_msg = xgpu_ai_mailbox_trans_msg,
>       .req_init_data  = xgpu_ai_request_init_data,
>       .ras_poison_handler = xgpu_ai_ras_poison_handler,
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>index 37b49a5ed2a1..22af30a15a5f 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>@@ -282,38 +282,30 @@ static int xgpu_nv_set_mailbox_ack_irq(struct
>amdgpu_device *adev,
>       return 0;
> }
>
>-static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>+static void xgpu_nv_ready_to_reset(struct amdgpu_device *adev)
> {
>-      struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt,
>flr_work);
>-      struct amdgpu_device *adev = container_of(virt, struct
>amdgpu_device, virt);
>-      int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>-
>-      /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>-       * otherwise the mailbox msg will be ruined/reseted by
>-       * the VF FLR.
>-       */
>-      if (atomic_cmpxchg(&adev->reset_domain->in_gpu_reset, 0, 1) != 0)
>-              return;
>-
>-      down_write(&adev->reset_domain->sem);
>-
>-      amdgpu_virt_fini_data_exchange(adev);
>-
>       xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>+}
>
>+static int xgpu_nv_wait_reset(struct amdgpu_device *adev)
>+{
>+      int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>       do {
>               if (xgpu_nv_mailbox_peek_msg(adev) ==
>IDH_FLR_NOTIFICATION_CMPL)
>-                      goto flr_done;
>-
>+                      return 0;
>               msleep(10);
>               timeout -= 10;
>       } while (timeout > 1);
>-
>       dev_warn(adev->dev, "waiting IDH_FLR_NOTIFICATION_CMPL
>timeout\n");
>+      return -ETIME;
>+}
>
>-flr_done:
>-      atomic_set(&adev->reset_domain->in_gpu_reset, 0);
>-      up_write(&adev->reset_domain->sem);
>+static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>+{
>+      struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
>+      struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>+
>+      amdgpu_virt_fini_data_exchange(adev);
>
>       /* Trigger recovery for world switch failure if no TDR */
>       if (amdgpu_device_should_recover_gpu(adev)
>@@ -455,7 +447,8 @@ const struct amdgpu_virt_ops xgpu_nv_virt_ops = {
>       .rel_full_gpu   = xgpu_nv_release_full_gpu_access,
>       .req_init_data  = xgpu_nv_request_init_data,
>       .reset_gpu = xgpu_nv_request_reset,
>-      .wait_reset = NULL,
>+      .ready_to_reset = xgpu_nv_ready_to_reset,
>+      .wait_reset = xgpu_nv_wait_reset,
>       .trans_msg = xgpu_nv_mailbox_trans_msg,
>       .ras_poison_handler = xgpu_nv_ras_poison_handler,
> };
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>index 78cd07744ebe..e1d63bed84bf 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>@@ -515,12 +515,6 @@ static void xgpu_vi_mailbox_flr_work(struct
>work_struct *work)
>       struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt,
>flr_work);
>       struct amdgpu_device *adev = container_of(virt, struct
>amdgpu_device, virt);
>
>-      /* wait until RCV_MSG become 3 */
>-      if (xgpu_vi_poll_msg(adev, IDH_FLR_NOTIFICATION_CMPL)) {
>-              pr_err("failed to receive FLR_CMPL\n");
>-              return;
>-      }
>-
>       /* Trigger recovery due to world switch failure */
>       if (amdgpu_device_should_recover_gpu(adev)) {
>               struct amdgpu_reset_context reset_context;
>--
>2.34.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started
  2024-06-04  8:07                           ` Christian König
@ 2024-06-05 12:32                             ` Liu, Shaoyun
  0 siblings, 0 replies; 52+ messages in thread
From: Liu, Shaoyun @ 2024-06-05 12:32 UTC (permalink / raw)
  To: Koenig, Christian, Christian König, Li,  Yunxiang (Teddy),
	amd-gfx@lists.freedesktop.org, Deucher, Alexander, Xiao, Hua

[AMD Official Use Only - AMD Internal Distribution Only]

Hi, Christian
If you just want to know the status of MES, then this approach is OK. My original thinking was that the driver might also need to know the status of the functionality it requires, e.g. after calling remove_queue, whether the CP has actually unmapped it from the HQD. But after thinking about it again, you are right: if the driver wants to know the CP status, it can use QUERY_STATUS from the mes misc_op interface.
Please keep one thing in mind: currently MES doesn't generate an interrupt after it successfully executes a command; it just updates the specified fence value at the specified address, so the driver side needs to poll it to check whether the command finished successfully or not. We can discuss this when you come up with a new design.
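The fence-based completion scheme described above can be sketched as follows. This is a minimal userspace illustration, not the real amdgpu/MES API: all names, the writeback layout, and the 10ms poll interval are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Sketch: the firmware writes a monotonically increasing fence value
 * to a shared writeback location; the driver polls that location
 * until the expected sequence number appears or a timeout expires.
 */
struct mes_fence {
	volatile uint64_t *wptr;	/* shared writeback address */
};

/* "Firmware" side: signal completion by writing the sequence number. */
static void mes_fw_signal(struct mes_fence *f, uint64_t seq)
{
	*f->wptr = seq;
}

/*
 * "Driver" side: poll until @seq is visible or @timeout_ms elapses.
 * Returns 0 on completion, -1 (-ETIME in the kernel) on timeout.
 * A real driver would msleep(10) between iterations.
 */
static int mes_poll_fence(struct mes_fence *f, uint64_t seq, int timeout_ms)
{
	do {
		if (*f->wptr >= seq)
			return 0;
		timeout_ms -= 10;	/* stands in for msleep(10) */
	} while (timeout_ms > 0);
	return -1;
}
```

This mirrors the xgpu_*_wait_reset loops in the patch above: a bounded poll that either observes the expected value or gives up with a timeout error.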

Regards
Shaoyun.liu


-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Tuesday, June 4, 2024 4:07 AM
To: Liu, Shaoyun <Shaoyun.Liu@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Xiao, Hua <Hua.Xiao@amd.com>
Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

Hi Shaoyun,

see inline.

On 03.06.24 at 20:28, Liu, Shaoyun wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Thanks Christian for the detail explanation.
> I checked your patch; you try to use the query_scheduler_status package to check command completion. It may not work as expected, since what this API queries is the status of MES itself, so MES can update the fence address with the expected seq value while the command itself (e.g. remove_queue for MES, for which MES then sends unmap_queue to KIQ internally) still fails.

And that is exactly the desired behavior.

See, the fence value is for ring buffer processing and for getting feedback, for example in the case of a reset, on whether the MES has processed the commands.

Whether that processing was successful or not *must* be completely irrelevant for writing the fence value.

> For MES, the driver always polls for command completion

No, exactly that's what we want to avoid as much as possible.

Polling means that we throw away tons of CPU cycles, and especially for fault handling and TLB flushing that is an absolutely unacceptable performance loss.

>   , do you think it's an acceptable solution for MES to set a specific failure value (e.g. -1) in the fence address to indicate the failure of the operation? But that would be similar to letting the driver poll for completion until timeout. MES internally also needs to wait on a timeout for some commands it sends to the CP (e.g. 2 seconds for the unmap_queue command).

No, what we should really do is separate the fence and the result values, and then give each operation an input dependency.
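The separation Christian suggests can be sketched like this. All names here are hypothetical, not the actual amdgpu interface: each submission gets a fence slot recording that the MES has processed the command, plus an independent result slot, so "failed" and "not yet executed" become distinguishable.

```c
#include <stdint.h>

/* Sketch only: separate fence ("command was processed") from result
 * ("operation succeeded or failed"). */
#define RING_SIZE 8

struct mes_submission {
	uint64_t fence_val;	/* written when the command is processed */
	int32_t  result;	/* 0 on success, negative errno on failure */
};

struct mes_ring {
	struct mes_submission slots[RING_SIZE];
};

/* "Firmware" side: always advance the fence, record the result separately. */
static void mes_fw_complete(struct mes_ring *r, uint64_t seq, int32_t result)
{
	struct mes_submission *s = &r->slots[seq % RING_SIZE];

	s->result = result;
	s->fence_val = seq;	/* written last: fence implies result is valid */
}

/*
 * "Driver" side: once the fence shows the command was processed, the
 * result can be read without confusing "failed" with "still pending".
 * Returns -11 (-EAGAIN) while the command has not been processed yet.
 */
static int mes_check_result(struct mes_ring *r, uint64_t seq)
{
	struct mes_submission *s = &r->slots[seq % RING_SIZE];

	if (s->fence_val < seq)
		return -11;
	return s->result;
}
```

With this split, the fence keeps its ring-buffer bookkeeping role (it is always written, as Christian requires), while failures are reported through the separate result slot.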

> I'm actually a little bit confused here: does the driver use a lock to ensure there is only one submission to MES at any time?

No, exactly that's what we try to avoid. Otherwise we wouldn't need a ring buffer in the first place.

>   MES can also trigger an interrupt on failure if the driver side requires it; the payload will have the seq number to indicate which submission caused the failure, though that might require more code changes on the driver side. Please let me know what's preferred from the driver side.

I can come up with a more detailed explanation of the driver requirements when I'm back from sick leave.

Regards,
Christian.

>
> Regards
> Shaoyun.liu
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Monday, June 3, 2024 6:59 AM
> To: Liu, Shaoyun <Shaoyun.Liu@amd.com>; Christian König
> <ckoenig.leichtzumerken@gmail.com>; Li, Yunxiang (Teddy)
> <Yunxiang.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher,
> Alexander <Alexander.Deucher@amd.com>; Xiao, Hua <Hua.Xiao@amd.com>
> Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is
> started
>
> Hi Shaoyun,
>
> yes, my thinking goes in the same direction. The basic problem here is that we are trying to stuff two different pieces of information into the same variable.
>
> The first piece of information is whether the commands have been read by the MES from the ring buffer. This information is necessary for normal ring buffer and reset handling, e.g. it prevents ring buffer overflow, preserves ordering of commands, avoids lockups during reset, etc.
>
> The second is whether a certain operation was successful or not. For example, this is necessary to get signaled back whether a queue map/unmap operation was successful, or whether the CP is not responding or any other error has happened, etc.
>
> Another issue is that while it is in general a good idea to have the firmware work in a way where errors are reported instead of completely stopping all processing, here we run into trouble because the driver usually assumes that work can be scheduled on the ring buffer and that subsequent work is processed only when everything previously submitted has completed successfully.
>
> So as an initial fix for the issue we see, I sent Alex a patch on Friday to partially revert his change to use an individual writeback for each submission. Instead we will submit an additional QUERY_STATUS command after the real command and let that one write the fence value. This way the fence value is always written, independent of the result of the operation.
>
> Additional to that, we need to insert something like a dependency between submissions, e.g. when you have commands A, B and C on the ring and C can only execute when A was successful, then we need to somehow tell the MES that. The only other alternative is to not schedule commands behind each other on the ring, and that in turn is a bad idea from a performance point of view.
>
> Regards,
> Christian.
>
On 31.05.24 at 16:44, Liu, Shaoyun wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Hi, Christian
>>
>> I think we had a discussion about this before. Alex also has a change that allows the driver to use a different writeback address for the fence for each submission, for the original issue.
>>   From the MES point of view, MES will update the fence when the API completes successfully, so if the API (e.g. remove_queue) fails due to another component's issue (e.g. a CP hang), MES will not update the fence in this situation, but MES itself still works and can respond to other commands (e.g. read_reg). Alex's change allows the driver to check the fence for each API without them interfering with each other. If you expect MES to stop responding to further commands after one API fails, that will introduce a compatibility issue, since this design already exists in products shipped to customers and MES also needs to work for Windows. Also, MES always needs to respond to some commands like RESET etc., which might make things worse if we need to change the logic.
>>
>> One possible solution is that MES can trigger an interrupt to indicate which submission has failed, with the seq number. In this case the driver can learn of the failed MES submission in time and make its own decision about what to do next. What do you think about this?
>>
>> Regards
>> Shaoyun.liu
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
>> Christian König
>> Sent: Wednesday, May 29, 2024 11:19 AM
>> To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset
>> is started
>>
On 29.05.24 at 16:48, Li, Yunxiang (Teddy) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>> Yeah, I know. That's one of the reason I've pointed out on the
>>>> patch adding that that this behavior is actually completely broken.
>>>>
>>>> If you run into issues with the MES because of this then please
>>>> suggest a revert of that patch.
>>> I think it just needs to be improved to allow this force-signal behavior. The current behavior is slow/inconvenient, but the old behavior is wrong, since MES will continue processing submissions even when one submission failed. So with just one fence location there's no way to tell if a command failed or not.
>> No, the MES behavior is broken. When a submission fails it should stop processing, or signal through some other mechanism that the operation didn't complete.
>>
>> Just not writing the fence and continuing results in tons of problems, from the TLB fence all the way to the ring buffer and reset handling.
>>
>> This is a hard requirement and really can't be changed.
>>
>> Regards,
>> Christian.


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2024-06-05 12:32 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-28 17:23 [PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-05-28 17:23 ` [PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-05-29  6:36   ` Christian König
2024-05-28 17:23 ` [PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-05-29  6:41   ` Christian König
2024-05-28 17:23 ` [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started Yunxiang Li
2024-05-29  6:38   ` Christian König
2024-05-29 13:22     ` Li, Yunxiang (Teddy)
2024-05-29 13:31       ` Christian König
2024-05-29 13:44         ` Li, Yunxiang (Teddy)
2024-05-29 13:55           ` Christian König
2024-05-29 14:31             ` Li, Yunxiang (Teddy)
2024-05-29 14:35               ` Christian König
2024-05-29 14:48                 ` Li, Yunxiang (Teddy)
2024-05-29 15:19                   ` Christian König
2024-05-31 14:44                     ` Liu, Shaoyun
2024-06-03 10:58                       ` Christian König
2024-06-03 18:28                         ` Liu, Shaoyun
2024-06-04  8:07                           ` Christian König
2024-06-05 12:32                             ` Liu, Shaoyun
2024-05-28 17:23 ` [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-05-29  6:41   ` Christian König
2024-05-29 23:04   ` Felix Kuehling
2024-05-30  0:06     ` Li, Yunxiang (Teddy)
2024-05-28 17:23 ` [PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
2024-05-29  6:43   ` Christian König
2024-05-28 17:23 ` [PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-05-29  6:45   ` Christian König
2024-05-28 17:23 ` [PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-05-29  6:46   ` Christian König
2024-05-28 17:23 ` [PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-05-29  6:49   ` Christian König
2024-05-28 17:23 ` [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks Yunxiang Li
2024-05-29  6:55   ` Christian König
2024-05-30 22:02   ` Felix Kuehling
2024-05-30 22:35     ` Li, Yunxiang (Teddy)
2024-05-31  6:52     ` Christian König
2024-05-31 15:47       ` Felix Kuehling
2024-06-04 12:52         ` Li, Yunxiang (Teddy)
2024-05-28 17:23 ` [PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED" Yunxiang Li
2024-05-28 19:04   ` Skvortsov, Victor
2024-05-30 21:47 ` [PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset Yunxiang Li
2024-05-30 21:47   ` [PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov Yunxiang Li
2024-05-30 21:47   ` [PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler Yunxiang Li
2024-06-05  1:12     ` Deng, Emily
2024-05-30 21:48   ` [PATCH v3 3/8] drm/amdgpu/kfd: remove is_hws_hang and is_resetting Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 4/8] drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 5/8] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 6/8] drm/amdgpu: use helper in amdgpu_gart_unbind Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 7/8] drm/amdgpu: fix locking scope when flushing tlb Yunxiang Li
2024-05-30 21:48   ` [PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks Yunxiang Li
2024-05-31  6:50     ` Christian König
