* [PATCH 0/4] Improve the dev coredump
@ 2024-08-16 7:54 Trigger.Huang
2024-08-16 7:54 ` [PATCH 1/4] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Trigger.Huang @ 2024-08-16 7:54 UTC (permalink / raw)
To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang
From: Trigger Huang <Trigger.Huang@amd.com>
The current dev coredump implementation sometimes cannot fully satisfy customers' requirements:
1, dev coredump is under the control of gpu_recovery. Consider the following application scenarios:
1), A customer may need to do the core dump with gpu_recovery disabled. This can be used for GPU hang debugging.
2), A customer may need to disable the core dump with gpu_recovery enabled. This can be used for quick GPU recovery, especially for lightweight hangs that can be handled by soft recovery or a per-ring reset.
3), A customer may need to enable the core dump with gpu_recovery enabled. This can be used to recover the GPU while still recording the core dump for later inspection in stress tests or system health checks.
It is not easy to support all of these scenarios with amdgpu_gpu_recovery alone.
2, When a job timeout happens, the GPU status dump only takes place after a number of other operations, such as soft reset. The concern is that by then the dumped status is no longer close to the GPU's real error state.
So we introduce a new solution:
1, A new parameter, gpu_coredump, is added to decouple the coredump from GPU reset
2, The coredump is done immediately after a job timeout
Trigger Huang (4):
drm/amdgpu: Add gpu_coredump parameter
drm/amdgpu: Use gpu_coredump to control core dump
drm/amdgpu: skip printing vram_lost if needed
drm/amdgpu: Do core dump immediately when job tmo
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
.../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 19 +++---
.../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 64 +++++++++++++++++++
6 files changed, 89 insertions(+), 15 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/4] drm/amdgpu: Add gpu_coredump parameter
2024-08-16 7:54 [PATCH 0/4] Improve the dev coredump Trigger.Huang
@ 2024-08-16 7:54 ` Trigger.Huang
2024-08-16 13:55 ` Alex Deucher
2024-08-16 7:54 ` [PATCH 2/4] drm/amdgpu: Use gpu_coredump to control core dump Trigger.Huang
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Trigger.Huang @ 2024-08-16 7:54 UTC (permalink / raw)
To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang
From: Trigger Huang <Trigger.Huang@amd.com>
Add a new, separate parameter to control the GPU coredump procedure. This
can be used to decouple the coredump procedure from the GPU recovery procedure
V2: enable gpu_coredump by default (Alex)
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 937de21a7142..4dd465ad14af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;
extern int amdgpu_lbpw;
extern int amdgpu_compute_multipipe;
extern int amdgpu_gpu_recovery;
+extern int amdgpu_gpu_coredump;
extern int amdgpu_emu_mode;
extern uint amdgpu_smu_memory_pool_size;
extern int amdgpu_smu_pptable_id;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index b9529948f2b2..06bd20d83db7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -178,6 +178,7 @@ uint amdgpu_force_long_training;
int amdgpu_lbpw = -1;
int amdgpu_compute_multipipe = -1;
int amdgpu_gpu_recovery = -1; /* auto */
+int amdgpu_gpu_coredump = 1;
int amdgpu_emu_mode;
uint amdgpu_smu_memory_pool_size;
int amdgpu_smu_pptable_id = -1;
@@ -556,6 +557,13 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+/**
+ * DOC: gpu_coredump (int)
+ * Set to enable GPU coredump mechanism (1 = enable, 0 = disable). The default is 1
+ */
+MODULE_PARM_DESC(gpu_coredump, "Enable GPU coredump mechanism, (1 = enable(default), 0 = disable)");
+module_param_named(gpu_coredump, amdgpu_gpu_coredump, int, 0444);
+
/**
* DOC: emu_mode (int)
* Set value 1 to enable emulation mode. This is only needed when running on an emulator. The default is 0 (disabled).
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/4] drm/amdgpu: Use gpu_coredump to control core dump
2024-08-16 7:54 [PATCH 0/4] Improve the dev coredump Trigger.Huang
2024-08-16 7:54 ` [PATCH 1/4] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
@ 2024-08-16 7:54 ` Trigger.Huang
2024-08-16 7:54 ` [PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed Trigger.Huang
2024-08-16 7:54 ` [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo Trigger.Huang
3 siblings, 0 replies; 10+ messages in thread
From: Trigger.Huang @ 2024-08-16 7:54 UTC (permalink / raw)
To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang
From: Trigger Huang <Trigger.Huang@amd.com>
Do the dev core dump only if gpu_coredump is enabled
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 482db4ebcc4b..9885d0606b0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5340,7 +5340,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
}
}
- if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
+ if (amdgpu_gpu_coredump && (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))) {
dev_info(tmp_adev->dev, "Dumping IP State\n");
/* Trigger ip dump before we reset the asic */
for (i = 0; i < tmp_adev->num_ip_blocks; i++)
@@ -5444,7 +5444,7 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
- if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
+ if (amdgpu_gpu_coredump && (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
amdgpu_coredump(tmp_adev, vram_lost, reset_context);
if (vram_lost) {
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed
2024-08-16 7:54 [PATCH 0/4] Improve the dev coredump Trigger.Huang
2024-08-16 7:54 ` [PATCH 1/4] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
2024-08-16 7:54 ` [PATCH 2/4] drm/amdgpu: Use gpu_coredump to control core dump Trigger.Huang
@ 2024-08-16 7:54 ` Trigger.Huang
2024-08-16 13:52 ` Alex Deucher
2024-08-16 7:54 ` [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo Trigger.Huang
3 siblings, 1 reply; 10+ messages in thread
From: Trigger.Huang @ 2024-08-16 7:54 UTC (permalink / raw)
To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang
From: Trigger Huang <Trigger.Huang@amd.com>
The vram lost status can only be obtained after a GPU reset occurs, but
sometimes a dev core dump can happen before the GPU reset. So a new
argument is added to tell the dev core dump implementation whether to
skip printing the vram_lost status in the dump.
This patch also decouples the core dump function from the GPU reset
function, by replacing the argument amdgpu_reset_context
with amdgpu_job to specify the context for the core dump.
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 19 ++++++++++---------
.../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h | 6 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
3 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index cf2b4dd4d865..a860f52d8bb0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -28,8 +28,9 @@
#include "atom.h"
#ifndef CONFIG_DEV_COREDUMP
-void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
- struct amdgpu_reset_context *reset_context)
+void amdgpu_coredump(struct amdgpu_device *adev, bool skip_vram_check,
+ bool vram_lost, struct amdgpu_job *job)
+
{
}
#else
@@ -315,7 +316,7 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
}
}
- if (coredump->reset_vram_lost)
+ if (!(coredump->skip_vram_check) && coredump->reset_vram_lost)
drm_printf(&p, "VRAM is lost due to GPU reset!\n");
return count - iter.remain;
@@ -326,12 +327,11 @@ static void amdgpu_devcoredump_free(void *data)
kfree(data);
}
-void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
- struct amdgpu_reset_context *reset_context)
+void amdgpu_coredump(struct amdgpu_device *adev, bool skip_vram_check,
+ bool vram_lost, struct amdgpu_job *job)
{
- struct amdgpu_coredump_info *coredump;
struct drm_device *dev = adev_to_drm(adev);
- struct amdgpu_job *job = reset_context->job;
+ struct amdgpu_coredump_info *coredump;
struct drm_sched_job *s_job;
coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT);
@@ -341,11 +341,12 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
return;
}
+ coredump->skip_vram_check = skip_vram_check;
coredump->reset_vram_lost = vram_lost;
- if (reset_context->job && reset_context->job->vm) {
+ if (job && job->vm) {
struct amdgpu_task_info *ti;
- struct amdgpu_vm *vm = reset_context->job->vm;
+ struct amdgpu_vm *vm = job->vm;
ti = amdgpu_vm_get_task_info_vm(vm);
if (ti) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
index 52459512cb2b..c4e522e49251 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
@@ -26,7 +26,6 @@
#define __AMDGPU_DEV_COREDUMP_H__
#include "amdgpu.h"
-#include "amdgpu_reset.h"
#ifdef CONFIG_DEV_COREDUMP
@@ -36,12 +35,13 @@ struct amdgpu_coredump_info {
struct amdgpu_device *adev;
struct amdgpu_task_info reset_task_info;
struct timespec64 reset_time;
+ bool skip_vram_check;
bool reset_vram_lost;
struct amdgpu_ring *ring;
};
#endif
-void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
- struct amdgpu_reset_context *reset_context);
+void amdgpu_coredump(struct amdgpu_device *adev, bool skip_vram_check,
+ bool vram_lost, struct amdgpu_job *job);
#endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9885d0606b0a..825cc62cd75d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5445,7 +5445,7 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
if (amdgpu_gpu_coredump && (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
- amdgpu_coredump(tmp_adev, vram_lost, reset_context);
+ amdgpu_coredump(tmp_adev, false, vram_lost, reset_context->job);
if (vram_lost) {
DRM_INFO("VRAM is lost due to GPU reset!\n");
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo
2024-08-16 7:54 [PATCH 0/4] Improve the dev coredump Trigger.Huang
` (2 preceding siblings ...)
2024-08-16 7:54 ` [PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed Trigger.Huang
@ 2024-08-16 7:54 ` Trigger.Huang
2024-08-16 13:58 ` Alex Deucher
3 siblings, 1 reply; 10+ messages in thread
From: Trigger.Huang @ 2024-08-16 7:54 UTC (permalink / raw)
To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang
From: Trigger Huang <Trigger.Huang@amd.com>
Do the coredump immediately after a job timeout to get a closer
representation of the GPU's error status.
V2: Skip printing vram_lost, as the GPU reset has not
happened yet (Alex)
Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 64 +++++++++++++++++++++++++
1 file changed, 64 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index c6a1783fc9ef..009551335d05 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,9 +28,63 @@
#include <drm/drm_drv.h>
#include "amdgpu.h"
+#include "amdgpu_dev_coredump.h"
+#include "amdgpu_xgmi.h"
#include "amdgpu_trace.h"
#include "amdgpu_reset.h"
+static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
+ struct amdgpu_job *job)
+{
+ int i = 0;
+
+ dev_info(adev->dev, "Dumping IP State\n");
+ for (i = 0; i < adev->num_ip_blocks; i++) {
+ if (adev->ip_blocks[i].version->funcs->dump_ip_state)
+ adev->ip_blocks[i].version->funcs
+ ->dump_ip_state((void *)adev);
+ }
+ dev_info(adev->dev, "Dumping IP State Completed\n");
+
+ amdgpu_coredump(adev, true, false, job);
+}
+
+static void amdgpu_job_core_dump(struct amdgpu_device *adev, struct amdgpu_job *job)
+{
+ struct list_head device_list, *device_list_handle = NULL;
+ struct amdgpu_device *tmp_adev = NULL;
+ struct amdgpu_hive_info *hive = NULL;
+
+ if (!amdgpu_sriov_vf(adev))
+ hive = amdgpu_get_xgmi_hive(adev);
+ if (hive)
+ mutex_lock(&hive->hive_lock);
+ /*
+ * Reuse the logic in amdgpu_device_gpu_recover() to build list of
+ * devices for core dump
+ */
+ INIT_LIST_HEAD(&device_list);
+ if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
+ list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
+ list_add_tail(&tmp_adev->reset_list, &device_list);
+ if (!list_is_first(&adev->reset_list, &device_list))
+ list_rotate_to_front(&adev->reset_list, &device_list);
+ device_list_handle = &device_list;
+ } else {
+ list_add_tail(&adev->reset_list, &device_list);
+ device_list_handle = &device_list;
+ }
+
+ /* Do the coredump for each device */
+ list_for_each_entry(tmp_adev, device_list_handle, reset_list)
+ amdgpu_job_do_core_dump(tmp_adev, job);
+
+ if (hive) {
+ mutex_unlock(&hive->hive_lock);
+ amdgpu_put_xgmi_hive(hive);
+ }
+}
+
static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
{
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
@@ -48,6 +102,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
return DRM_GPU_SCHED_STAT_ENODEV;
}
+ /*
+ * Do the coredump immediately after a job timeout to get a closer
+ * representation of GPU's error status.
+ */
+ if (amdgpu_gpu_coredump)
+ amdgpu_job_core_dump(adev, job);
adev->job_hang = true;
@@ -101,6 +161,10 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
reset_context.src = AMDGPU_RESET_SRC_JOB;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ /* To avoid a double coredump, as we have already done it */
+ if (amdgpu_gpu_coredump)
+ set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
+
r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
if (r)
dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed
2024-08-16 7:54 ` [PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed Trigger.Huang
@ 2024-08-16 13:52 ` Alex Deucher
2024-08-19 9:30 ` Huang, Trigger
0 siblings, 1 reply; 10+ messages in thread
From: Alex Deucher @ 2024-08-16 13:52 UTC (permalink / raw)
To: Trigger.Huang; +Cc: amd-gfx, sunil.khatri, alexander.deucher
On Fri, Aug 16, 2024 at 3:55 AM <Trigger.Huang@amd.com> wrote:
>
> From: Trigger Huang <Trigger.Huang@amd.com>
>
> The vm lost status can only be obtained after a GPU reset occurs, but
> sometimes a dev core dump can be happened before GPU reset. So a new
> argument is added to tell the dev core dump implementation whether to
> skip printing the vram_lost status in the dump.
> And this patch is also trying to decouple the core dump function from
> the GPU reset function, by replacing the argument amdgpu_reset_context
> with amdgpu_job to specify the context for core dump.
>
> Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> Suggested-by: Alex Deucher <alexander.deucher@amd.com>
> ---
> .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 19 ++++++++++---------
> .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h | 6 +++---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> 3 files changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> index cf2b4dd4d865..a860f52d8bb0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> @@ -28,8 +28,9 @@
> #include "atom.h"
>
> #ifndef CONFIG_DEV_COREDUMP
> -void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> - struct amdgpu_reset_context *reset_context)
> +void amdgpu_coredump(struct amdgpu_device *adev, bool skip_vram_check,
> + bool vram_lost, struct amdgpu_job *job)
> +
> {
> }
> #else
> @@ -315,7 +316,7 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
> }
> }
>
> - if (coredump->reset_vram_lost)
> + if (!(coredump->skip_vram_check) && coredump->reset_vram_lost)
> drm_printf(&p, "VRAM is lost due to GPU reset!\n");
You might want to say something like:
drm_printf(&p, "VRAM lost status skipped\n");
in the skip case, so we know it was skipped and users don't assume
VRAM wasn't lost.
Alex
>
> return count - iter.remain;
> @@ -326,12 +327,11 @@ static void amdgpu_devcoredump_free(void *data)
> kfree(data);
> }
>
> -void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> - struct amdgpu_reset_context *reset_context)
> +void amdgpu_coredump(struct amdgpu_device *adev, bool skip_vram_check,
> + bool vram_lost, struct amdgpu_job *job)
> {
> - struct amdgpu_coredump_info *coredump;
> struct drm_device *dev = adev_to_drm(adev);
> - struct amdgpu_job *job = reset_context->job;
> + struct amdgpu_coredump_info *coredump;
> struct drm_sched_job *s_job;
>
> coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT);
> @@ -341,11 +341,12 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> return;
> }
>
> + coredump->skip_vram_check = skip_vram_check;
> coredump->reset_vram_lost = vram_lost;
>
> - if (reset_context->job && reset_context->job->vm) {
> + if (job && job->vm) {
> struct amdgpu_task_info *ti;
> - struct amdgpu_vm *vm = reset_context->job->vm;
> + struct amdgpu_vm *vm = job->vm;
>
> ti = amdgpu_vm_get_task_info_vm(vm);
> if (ti) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
> index 52459512cb2b..c4e522e49251 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
> @@ -26,7 +26,6 @@
> #define __AMDGPU_DEV_COREDUMP_H__
>
> #include "amdgpu.h"
> -#include "amdgpu_reset.h"
>
> #ifdef CONFIG_DEV_COREDUMP
>
> @@ -36,12 +35,13 @@ struct amdgpu_coredump_info {
> struct amdgpu_device *adev;
> struct amdgpu_task_info reset_task_info;
> struct timespec64 reset_time;
> + bool skip_vram_check;
> bool reset_vram_lost;
> struct amdgpu_ring *ring;
> };
> #endif
>
> -void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> - struct amdgpu_reset_context *reset_context);
> +void amdgpu_coredump(struct amdgpu_device *adev, bool skip_vram_check,
> + bool vram_lost, struct amdgpu_job *job);
>
> #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9885d0606b0a..825cc62cd75d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5445,7 +5445,7 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
> vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
>
> if (amdgpu_gpu_coredump && (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
> - amdgpu_coredump(tmp_adev, vram_lost, reset_context);
> + amdgpu_coredump(tmp_adev, false, vram_lost, reset_context->job);
>
> if (vram_lost) {
> DRM_INFO("VRAM is lost due to GPU reset!\n");
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/4] drm/amdgpu: Add gpu_coredump parameter
2024-08-16 7:54 ` [PATCH 1/4] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
@ 2024-08-16 13:55 ` Alex Deucher
0 siblings, 0 replies; 10+ messages in thread
From: Alex Deucher @ 2024-08-16 13:55 UTC (permalink / raw)
To: Trigger.Huang; +Cc: amd-gfx, sunil.khatri, alexander.deucher
On Fri, Aug 16, 2024 at 3:55 AM <Trigger.Huang@amd.com> wrote:
>
> From: Trigger Huang <Trigger.Huang@amd.com>
>
> Add new separate parameter to control GPU coredump procedure. This can
> be used to decouple the coredump procedure from gpu recovery procedure
>
> V2: enable gpu_coredump by default (Alex)
>
> Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
I still don't think we'd want to turn this off. Does it ever hurt
anything? It seems like someone would accidentally turn this off,
then run into a hang and regret it because they didn't get a dump.
Alex
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 937de21a7142..4dd465ad14af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;
> extern int amdgpu_lbpw;
> extern int amdgpu_compute_multipipe;
> extern int amdgpu_gpu_recovery;
> +extern int amdgpu_gpu_coredump;
> extern int amdgpu_emu_mode;
> extern uint amdgpu_smu_memory_pool_size;
> extern int amdgpu_smu_pptable_id;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index b9529948f2b2..06bd20d83db7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;
> int amdgpu_lbpw = -1;
> int amdgpu_compute_multipipe = -1;
> int amdgpu_gpu_recovery = -1; /* auto */
> +int amdgpu_gpu_coredump = 1;
> int amdgpu_emu_mode;
> uint amdgpu_smu_memory_pool_size;
> int amdgpu_smu_pptable_id = -1;
> @@ -556,6 +557,13 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
> MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
> module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>
> +/**
> + * DOC: gpu_coredump (int)
> + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable). The default is 1
> + */
> +MODULE_PARM_DESC(gpu_coredump, "Enable GPU coredump mechanism, (1 = enable(default), 0 = disable)");
> +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int, 0444);
> +
> /**
> * DOC: emu_mode (int)
> * Set value 1 to enable emulation mode. This is only needed when running on an emulator. The default is 0 (disabled).
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo
2024-08-16 7:54 ` [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo Trigger.Huang
@ 2024-08-16 13:58 ` Alex Deucher
2024-08-19 9:37 ` Huang, Trigger
0 siblings, 1 reply; 10+ messages in thread
From: Alex Deucher @ 2024-08-16 13:58 UTC (permalink / raw)
To: Trigger.Huang; +Cc: amd-gfx, sunil.khatri, alexander.deucher
On Fri, Aug 16, 2024 at 4:05 AM <Trigger.Huang@amd.com> wrote:
>
> From: Trigger Huang <Trigger.Huang@amd.com>
>
> Do the coredump immediately after a job timeout to get a closer
> representation of GPU's error status.
>
> V2: This will skip printing vram_lost as the GPU reset is not
> happened yet (Alex)
>
> Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 64 +++++++++++++++++++++++++
> 1 file changed, 64 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index c6a1783fc9ef..009551335d05 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -28,9 +28,63 @@
> #include <drm/drm_drv.h>
>
> #include "amdgpu.h"
> +#include "amdgpu_dev_coredump.h"
> +#include "amdgpu_xgmi.h"
> #include "amdgpu_trace.h"
> #include "amdgpu_reset.h"
>
> +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
> + struct amdgpu_job *job)
> +{
> + int i = 0;
> +
> + dev_info(adev->dev, "Dumping IP State\n");
> + for (i = 0; i < adev->num_ip_blocks; i++) {
> + if (adev->ip_blocks[i].version->funcs->dump_ip_state)
> + adev->ip_blocks[i].version->funcs
> + ->dump_ip_state((void *)adev);
> + }
> + dev_info(adev->dev, "Dumping IP State Completed\n");
> +
> + amdgpu_coredump(adev, true, false, job);
> +}
> +
> +static void amdgpu_job_core_dump(struct amdgpu_device *adev, struct amdgpu_job *job)
> +{
> + struct list_head device_list, *device_list_handle = NULL;
> + struct amdgpu_device *tmp_adev = NULL;
> + struct amdgpu_hive_info *hive = NULL;
> +
> + if (!amdgpu_sriov_vf(adev))
> + hive = amdgpu_get_xgmi_hive(adev);
> + if (hive)
> + mutex_lock(&hive->hive_lock);
> + /*
> + * Reuse the logic in amdgpu_device_gpu_recover() to build list of
> + * devices for core dump
> + */
> + INIT_LIST_HEAD(&device_list);
> + if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
> + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
> + list_add_tail(&tmp_adev->reset_list, &device_list);
> + if (!list_is_first(&adev->reset_list, &device_list))
> + list_rotate_to_front(&adev->reset_list, &device_list);
> + device_list_handle = &device_list;
> + } else {
> + list_add_tail(&adev->reset_list, &device_list);
> + device_list_handle = &device_list;
> + }
> +
> + /* Do the coredump for each device */
> + list_for_each_entry(tmp_adev, device_list_handle, reset_list)
> + amdgpu_job_do_core_dump(tmp_adev, job);
> +
> + if (hive) {
> + mutex_unlock(&hive->hive_lock);
> + amdgpu_put_xgmi_hive(hive);
> + }
> +}
> +
> static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> {
> struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> @@ -48,6 +102,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> return DRM_GPU_SCHED_STAT_ENODEV;
> }
>
> + /*
> + * Do the coredump immediately after a job timeout to get a closer
> + * representation of GPU's error status.
> + */
> + if (amdgpu_gpu_coredump)
> + amdgpu_job_core_dump(adev, job);
>
> adev->job_hang = true;
>
> @@ -101,6 +161,10 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> reset_context.src = AMDGPU_RESET_SRC_JOB;
> clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>
> + /* To avoid a double coredump, as we have already done it */
> + if (amdgpu_gpu_coredump)
> + set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> +
I could go either way on this hunk. Do you see any problems with a
double core dump? We are effectively doing two resets...
Alex
> r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
> if (r)
> dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed
2024-08-16 13:52 ` Alex Deucher
@ 2024-08-19 9:30 ` Huang, Trigger
0 siblings, 0 replies; 10+ messages in thread
From: Huang, Trigger @ 2024-08-19 9:30 UTC (permalink / raw)
To: Alex Deucher
Cc: amd-gfx@lists.freedesktop.org, Khatri, Sunil, Deucher, Alexander
[AMD Official Use Only - AMD Internal Distribution Only]
> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Friday, August 16, 2024 9:52 PM
> To: Huang, Trigger <Trigger.Huang@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil <Sunil.Khatri@amd.com>;
> Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed
>
> On Fri, Aug 16, 2024 at 3:55 AM <Trigger.Huang@amd.com> wrote:
> >
> > From: Trigger Huang <Trigger.Huang@amd.com>
> >
> > The vm lost status can only be obtained after a GPU reset occurs, but
> > sometimes a dev core dump can be happened before GPU reset. So a new
> > argument is added to tell the dev core dump implementation whether to
> > skip printing the vram_lost status in the dump.
> > And this patch is also trying to decouple the core dump function from
> > the GPU reset function, by replacing the argument amdgpu_reset_context
> > with amdgpu_job to specify the context for core dump.
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> > Suggested-by: Alex Deucher <alexander.deucher@amd.com>
> > ---
> > .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 19
> > ++++++++++--------- .../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
> | 6 +++---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> > 3 files changed, 14 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> > index cf2b4dd4d865..a860f52d8bb0 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
> > @@ -28,8 +28,9 @@
> > #include "atom.h"
> >
> > #ifndef CONFIG_DEV_COREDUMP
> > -void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > - struct amdgpu_reset_context *reset_context)
> > +void amdgpu_coredump(struct amdgpu_device *adev, bool
> skip_vram_check,
> > + bool vram_lost, struct amdgpu_job *job)
> > +
> > {
> > }
> > #else
> > @@ -315,7 +316,7 @@ amdgpu_devcoredump_read(char *buffer, loff_t
> offset, size_t count,
> > }
> > }
> >
> > - if (coredump->reset_vram_lost)
> > + if (!(coredump->skip_vram_check) && coredump->reset_vram_lost)
> > drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>
> You might want to say something like:
> drm_printf(&p, "VRAM lost status skipped\n"); in the skip case so we know
> that we skipped it so users don't assume it wasn't lost.
Sounds good, I will add it.
Regards,
Trigger
>
> Alex
>
> >
> > return count - iter.remain;
> > @@ -326,12 +327,11 @@ static void amdgpu_devcoredump_free(void
> *data)
> > kfree(data);
> > }
> >
> > -void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > - struct amdgpu_reset_context *reset_context)
> > +void amdgpu_coredump(struct amdgpu_device *adev, bool
> skip_vram_check,
> > + bool vram_lost, struct amdgpu_job *job)
> > {
> > - struct amdgpu_coredump_info *coredump;
> > struct drm_device *dev = adev_to_drm(adev);
> > - struct amdgpu_job *job = reset_context->job;
> > + struct amdgpu_coredump_info *coredump;
> > struct drm_sched_job *s_job;
> >
> > coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT); @@ -341,11
> > +341,12 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool
> vram_lost,
> > return;
> > }
> >
> > + coredump->skip_vram_check = skip_vram_check;
> > coredump->reset_vram_lost = vram_lost;
> >
> > - if (reset_context->job && reset_context->job->vm) {
> > + if (job && job->vm) {
> > struct amdgpu_task_info *ti;
> > - struct amdgpu_vm *vm = reset_context->job->vm;
> > + struct amdgpu_vm *vm = job->vm;
> >
> > ti = amdgpu_vm_get_task_info_vm(vm);
> > if (ti) {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
> > index 52459512cb2b..c4e522e49251 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h
> > @@ -26,7 +26,6 @@
> > #define __AMDGPU_DEV_COREDUMP_H__
> >
> > #include "amdgpu.h"
> > -#include "amdgpu_reset.h"
> >
> > #ifdef CONFIG_DEV_COREDUMP
> >
> > @@ -36,12 +35,13 @@ struct amdgpu_coredump_info {
> > struct amdgpu_device *adev;
> > struct amdgpu_task_info reset_task_info;
> > struct timespec64 reset_time;
> > + bool skip_vram_check;
> > bool reset_vram_lost;
> > struct amdgpu_ring *ring;
> > };
> > #endif
> >
> > -void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > - struct amdgpu_reset_context *reset_context);
> > +void amdgpu_coredump(struct amdgpu_device *adev, bool skip_vram_check,
> > + bool vram_lost, struct amdgpu_job *job);
> >
> > #endif
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 9885d0606b0a..825cc62cd75d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -5445,7 +5445,7 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
> > 			vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
> >
> > 		if (amdgpu_gpu_coredump && (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
> > -			amdgpu_coredump(tmp_adev, vram_lost, reset_context);
> > +			amdgpu_coredump(tmp_adev, false, vram_lost, reset_context->job);
> >
> > 		if (vram_lost) {
> > 			DRM_INFO("VRAM is lost due to GPU reset!\n");
> > --
> > 2.34.1
> >
* RE: [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo
2024-08-16 13:58 ` Alex Deucher
@ 2024-08-19 9:37 ` Huang, Trigger
0 siblings, 0 replies; 10+ messages in thread
From: Huang, Trigger @ 2024-08-19 9:37 UTC (permalink / raw)
To: Alex Deucher
Cc: amd-gfx@lists.freedesktop.org, Khatri, Sunil, Deucher, Alexander
[AMD Official Use Only - AMD Internal Distribution Only]
> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Friday, August 16, 2024 9:59 PM
> To: Huang, Trigger <Trigger.Huang@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil <Sunil.Khatri@amd.com>;
> Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo
>
> On Fri, Aug 16, 2024 at 4:05 AM <Trigger.Huang@amd.com> wrote:
> >
> > From: Trigger Huang <Trigger.Huang@amd.com>
> >
> > Do the coredump immediately after a job timeout to get a closer
> > representation of GPU's error status.
> >
> > V2: This will skip printing vram_lost as the GPU reset has not
> > happened yet (Alex)
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 64 +++++++++++++++++++++++++
> > 1 file changed, 64 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index c6a1783fc9ef..009551335d05 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -28,9 +28,63 @@
> > #include <drm/drm_drv.h>
> >
> > #include "amdgpu.h"
> > +#include "amdgpu_dev_coredump.h"
> > +#include "amdgpu_xgmi.h"
> > #include "amdgpu_trace.h"
> > #include "amdgpu_reset.h"
> >
> > +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
> > +				    struct amdgpu_job *job)
> > +{
> > +	int i = 0;
> > +
> > +	dev_info(adev->dev, "Dumping IP State\n");
> > +	for (i = 0; i < adev->num_ip_blocks; i++)
> > +		if (adev->ip_blocks[i].version->funcs->dump_ip_state)
> > +			adev->ip_blocks[i].version->funcs
> > +				->dump_ip_state((void *)adev);
> > +	dev_info(adev->dev, "Dumping IP State Completed\n");
> > +
> > +	amdgpu_coredump(adev, true, false, job);
> > +}
> > +
> > +static void amdgpu_job_core_dump(struct amdgpu_device *adev,
> > +				 struct amdgpu_job *job)
> > +{
> > + struct list_head device_list, *device_list_handle = NULL;
> > + struct amdgpu_device *tmp_adev = NULL;
> > + struct amdgpu_hive_info *hive = NULL;
> > +
> > + if (!amdgpu_sriov_vf(adev))
> > + hive = amdgpu_get_xgmi_hive(adev);
> > + if (hive)
> > + mutex_lock(&hive->hive_lock);
> > +	/*
> > +	 * Reuse the logic in amdgpu_device_gpu_recover() to build the list of
> > +	 * devices for core dump
> > +	 */
> > + INIT_LIST_HEAD(&device_list);
> > +	if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
> > + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
> > + list_add_tail(&tmp_adev->reset_list, &device_list);
> > + if (!list_is_first(&adev->reset_list, &device_list))
> > + list_rotate_to_front(&adev->reset_list, &device_list);
> > + device_list_handle = &device_list;
> > + } else {
> > + list_add_tail(&adev->reset_list, &device_list);
> > + device_list_handle = &device_list;
> > + }
> > +
> > + /* Do the coredump for each device */
> > + list_for_each_entry(tmp_adev, device_list_handle, reset_list)
> > + amdgpu_job_do_core_dump(tmp_adev, job);
> > +
> > + if (hive) {
> > + mutex_unlock(&hive->hive_lock);
> > + amdgpu_put_xgmi_hive(hive);
> > + }
> > +}
> > +
> >  static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >  {
> >  	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> > @@ -48,6 +102,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> > return DRM_GPU_SCHED_STAT_ENODEV;
> > }
> >
> > + /*
> > + * Do the coredump immediately after a job timeout to get a closer
> > + * representation of GPU's error status.
> > + */
> > + if (amdgpu_gpu_coredump)
> > + amdgpu_job_core_dump(adev, job);
> >
> > adev->job_hang = true;
> >
> > @@ -101,6 +161,10 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> > reset_context.src = AMDGPU_RESET_SRC_JOB;
> > 		clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
> >
> > + /* To avoid a double coredump, as we have already done it */
> > + if (amdgpu_gpu_coredump)
> > +			set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> > +
>
> I could go either way on this hunk. Do you see any problems with a double
> core dump? We are effectively doing two resets...
I prefer to do it once, as we have already captured a close representation of the GPU's error status.
What's more, dumping the IP state again would cost several milliseconds and require extra MMIO accesses to read all the registers.
Regards,
Trigger
>
> Alex
>
> > r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
> > 		if (r)
> > 			dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
> > --
> > 2.34.1
> >
Thread overview: 10+ messages (newest: 2024-08-19 9:37 UTC)
2024-08-16 7:54 [PATCH 0/4] Improve the dev coredump Trigger.Huang
2024-08-16 7:54 ` [PATCH 1/4] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
2024-08-16 13:55 ` Alex Deucher
2024-08-16 7:54 ` [PATCH 2/4] drm/amdgpu: Use gpu_coredump to control core dump Trigger.Huang
2024-08-16 7:54 ` [PATCH 3/4] drm/amdgpu: skip printing vram_lost if needed Trigger.Huang
2024-08-16 13:52 ` Alex Deucher
2024-08-19 9:30 ` Huang, Trigger
2024-08-16 7:54 ` [PATCH 4/4] drm/amdgpu: Do core dump immediately when job tmo Trigger.Huang
2024-08-16 13:58 ` Alex Deucher
2024-08-19 9:37 ` Huang, Trigger