AMD-GFX Archive on lore.kernel.org
* [PATCH 0/3] Improve the dev coredump
@ 2024-08-15 11:38 Trigger.Huang
  2024-08-15 11:38 ` [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Trigger.Huang @ 2024-08-15 11:38 UTC (permalink / raw)
  To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang

From: Trigger Huang <Trigger.Huang@amd.com>

The current dev coredump implementation sometimes cannot fully satisfy customers' requirements because:
1. Dev coredump is enabled under the control of gpu_recovery, so a customer cannot capture a dev coredump with gpu_recovery disabled.
2. When a job timeout happens, the GPU status dump only happens after many other operations, such as soft reset. The concern is that the dumped status is then no longer close to the GPU's real error state.

So we introduce a new solution:
1. A new parameter, gpu_coredump, is added to decouple the coredump from GPU reset.
2. The coredump is done immediately after a job timeout.
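The intent of the two points above can be sketched as a minimal standalone model (hypothetical names, heavily simplified from the actual driver code; this is not the kernel implementation, just an illustration of the ordering and the decoupled knobs):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two independent module parameters. */
static int gpu_recovery = 0;  /* -1 = auto, 0 = disable, 1 = enable */
static int gpu_coredump = 0;  /*  0 = disable, 1 = enable */

struct timeout_actions {
	bool did_coredump;
	bool did_reset;
};

/* Simplified sketch of a job-timeout handler with the series applied:
 * the coredump is taken first (closest to the error state), and it is
 * controlled independently of whether GPU recovery then runs. */
static struct timeout_actions handle_job_timeout(void)
{
	struct timeout_actions a = { false, false };

	if (gpu_coredump)
		a.did_coredump = true;  /* dump IP state / registers now */
	if (gpu_recovery)
		a.did_reset = true;     /* then, optionally, reset the GPU */
	return a;
}
```

With a single gpu_recovery knob, "dump without reset" would not be expressible; the extra parameter makes the two actions orthogonal.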

Trigger Huang (3):
  drm/amdgpu: Add gpu_coredump parameter
  drm/amdgpu: introduce new API for GPU core dump
  drm/amdgpu: Change the timing of doing coredump

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++++++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  8 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 10 +++
 4 files changed, 97 insertions(+), 13 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
  2024-08-15 11:38 [PATCH 0/3] Improve the dev coredump Trigger.Huang
@ 2024-08-15 11:38 ` Trigger.Huang
  2024-08-15 16:01   ` Alex Deucher
  2024-08-15 11:38 ` [PATCH 2/3] drm/amdgpu: introduce new API for GPU core dump Trigger.Huang
  2024-08-15 11:38 ` [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump Trigger.Huang
  2 siblings, 1 reply; 11+ messages in thread
From: Trigger.Huang @ 2024-08-15 11:38 UTC (permalink / raw)
  To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang

From: Trigger Huang <Trigger.Huang@amd.com>

Add a new, separate parameter to control the GPU coredump procedure. This
can be used to decouple the coredump procedure from the GPU recovery
procedure.

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 937de21a7142..4dd465ad14af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;
 extern int amdgpu_lbpw;
 extern int amdgpu_compute_multipipe;
 extern int amdgpu_gpu_recovery;
+extern int amdgpu_gpu_coredump;
 extern int amdgpu_emu_mode;
 extern uint amdgpu_smu_memory_pool_size;
 extern int amdgpu_smu_pptable_id;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index b9529948f2b2..c5d357420236 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -178,6 +178,7 @@ uint amdgpu_force_long_training;
 int amdgpu_lbpw = -1;
 int amdgpu_compute_multipipe = -1;
 int amdgpu_gpu_recovery = -1; /* auto */
+int amdgpu_gpu_coredump;
 int amdgpu_emu_mode;
 uint amdgpu_smu_memory_pool_size;
 int amdgpu_smu_pptable_id = -1;
@@ -556,6 +557,13 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
 
+/**
+ * DOC: gpu_coredump (int)
+ * Set to enable GPU coredump mechanism (1 = enable, 0 = disable). The default is 0
+ */
+MODULE_PARM_DESC(gpu_coredump, "Enable GPU coredump mechanism, (1 = enable, 0 = disable (default))");
+module_param_named(gpu_coredump, amdgpu_gpu_coredump, int, 0444);
+
 /**
  * DOC: emu_mode (int)
  * Set value 1 to enable emulation mode. This is only needed when running on an emulator. The default is 0 (disabled).
-- 
2.34.1



* [PATCH 2/3] drm/amdgpu: introduce new API for GPU core dump
  2024-08-15 11:38 [PATCH 0/3] Improve the dev coredump Trigger.Huang
  2024-08-15 11:38 ` [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
@ 2024-08-15 11:38 ` Trigger.Huang
  2024-08-15 11:38 ` [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump Trigger.Huang
  2 siblings, 0 replies; 11+ messages in thread
From: Trigger.Huang @ 2024-08-15 11:38 UTC (permalink / raw)
  To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang

From: Trigger Huang <Trigger.Huang@amd.com>

Put the IP-state dump and the registration to dev_coredumpm together in one helper

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 73 ++++++++++++++++++++++
 2 files changed, 75 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4dd465ad14af..b64284324961 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1401,6 +1401,8 @@ bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
 int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job *job,
 			      struct amdgpu_reset_context *reset_context);
+void amdgpu_device_gpu_coredump(struct amdgpu_device *adev,
+				struct amdgpu_job *job);
 void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
 int amdgpu_device_pci_reset(struct amdgpu_device *adev);
 bool amdgpu_device_need_post(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a6b8d0ba4758..69d6a5b7ca34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -145,6 +145,8 @@ const char *amdgpu_asic_name[] = {
 };
 
 static inline void amdgpu_device_stop_pending_resets(struct amdgpu_device *adev);
+static void amdgpu_device_gpu_coredump_single(struct amdgpu_device *adev,
+					      struct amdgpu_job *job);
 
 /**
  * DOC: pcie_replay_count
@@ -5965,6 +5967,77 @@ static void amdgpu_device_partner_bandwidth(struct amdgpu_device *adev,
 	}
 }
 
+static void amdgpu_device_gpu_coredump_single(struct amdgpu_device *adev,
+				struct amdgpu_job *job)
+{
+	struct amdgpu_reset_context reset_context;
+	bool vram_lost;
+	int i = 0;
+
+	dev_info(adev->dev, "Dumping IP State\n");
+	for (i = 0; i < adev->num_ip_blocks; i++) {
+		if (adev->ip_blocks[i].version->funcs->dump_ip_state)
+			adev->ip_blocks[i].version->funcs
+				->dump_ip_state((void *)adev);
+	}
+	dev_info(adev->dev, "Dumping IP State Completed\n");
+
+	vram_lost = amdgpu_device_check_vram_lost(adev);
+
+	memset(&reset_context, 0, sizeof(reset_context));
+	reset_context.job = job;
+
+	amdgpu_coredump(adev, vram_lost, &reset_context);
+}
+
+/**
+ * amdgpu_device_gpu_coredump - Dump GPU status
+ *
+ * @adev: amdgpu_device pointer
+ * @job:  dump for specific job
+ *
+ * Attempt to dump all the current GPU status, like registers, ring buffer, etc.
+ * This function is mostly invoked when something goes wrong on the GPU,
+ * for example, job timeout.
+ */
+void amdgpu_device_gpu_coredump(struct amdgpu_device *adev,
+				struct amdgpu_job *job)
+{
+	struct list_head device_list, *device_list_handle =  NULL;
+	struct amdgpu_device *tmp_adev = NULL;
+	struct amdgpu_hive_info *hive = NULL;
+
+	if (!amdgpu_sriov_vf(adev))
+		hive = amdgpu_get_xgmi_hive(adev);
+	if (hive)
+		mutex_lock(&hive->hive_lock);
+
+	/*
+	 * Reuse the logic in amdgpu_device_gpu_recover() to build list of
+	 * devices for coredump
+	 */
+	INIT_LIST_HEAD(&device_list);
+	if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
+		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
+			list_add_tail(&tmp_adev->reset_list, &device_list);
+		if (!list_is_first(&adev->reset_list, &device_list))
+			list_rotate_to_front(&adev->reset_list, &device_list);
+		device_list_handle = &device_list;
+	} else {
+		list_add_tail(&adev->reset_list, &device_list);
+		device_list_handle = &device_list;
+	}
+
+	/* Do the coredump for each device */
+	list_for_each_entry(tmp_adev, device_list_handle, reset_list)
+		amdgpu_device_gpu_coredump_single(tmp_adev, job);
+
+	if (hive) {
+		mutex_unlock(&hive->hive_lock);
+		amdgpu_put_xgmi_hive(hive);
+	}
+}
+
 /**
  * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
  *
-- 
2.34.1



* [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump
  2024-08-15 11:38 [PATCH 0/3] Improve the dev coredump Trigger.Huang
  2024-08-15 11:38 ` [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
  2024-08-15 11:38 ` [PATCH 2/3] drm/amdgpu: introduce new API for GPU core dump Trigger.Huang
@ 2024-08-15 11:38 ` Trigger.Huang
  2024-08-15 16:09   ` Alex Deucher
  2 siblings, 1 reply; 11+ messages in thread
From: Trigger.Huang @ 2024-08-15 11:38 UTC (permalink / raw)
  To: amd-gfx; +Cc: sunil.khatri, alexander.deucher, Trigger Huang

From: Trigger Huang <Trigger.Huang@amd.com>

Do the coredump immediately after a job timeout to get a closer
representation of the GPU's error status. For other code paths that
need to do the coredump, keep the original logic unchanged, except:
1. All coredump operations are now under the control of the
amdgpu_gpu_coredump parameter.
2. The IP dump and the registration to the dev coredump system are
done together.

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++-------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 10 ++++++++++
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 69d6a5b7ca34..a8eb76d82921 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5341,15 +5341,9 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 			}
 		}
 
-		if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
-			dev_info(tmp_adev->dev, "Dumping IP State\n");
-			/* Trigger ip dump before we reset the asic */
-			for (i = 0; i < tmp_adev->num_ip_blocks; i++)
-				if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
-					tmp_adev->ip_blocks[i].version->funcs
-						->dump_ip_state((void *)tmp_adev);
-			dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
-		}
+		if (amdgpu_gpu_coredump &&
+		    (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
+			amdgpu_device_gpu_coredump_single(tmp_adev, job);
 
 		if (need_full_reset)
 			r = amdgpu_device_ip_suspend(adev);
@@ -5444,10 +5438,6 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 					goto out;
 
 				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
-
-				if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
-					amdgpu_coredump(tmp_adev, vram_lost, reset_context);
-
 				if (vram_lost) {
 					DRM_INFO("VRAM is lost due to GPU reset!\n");
 					amdgpu_inc_vram_lost(tmp_adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index c6a1783fc9ef..63869820c334 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -48,6 +48,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		return DRM_GPU_SCHED_STAT_ENODEV;
 	}
 
+	/*
+	 * Do the coredump immediately after a job timeout to get a closer
+	 * representation of GPU's error status.
+	 */
+	if (amdgpu_gpu_coredump)
+		amdgpu_device_gpu_coredump(adev, job);
 
 	adev->job_hang = true;
 
@@ -101,6 +107,10 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		reset_context.src = AMDGPU_RESET_SRC_JOB;
 		clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
 
+		/* To avoid a double coredump, as we have already done it */
+		if (amdgpu_gpu_coredump)
+			set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
+
 		r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
 		if (r)
 			dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
-- 
2.34.1



* Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
  2024-08-15 11:38 ` [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
@ 2024-08-15 16:01   ` Alex Deucher
  2024-08-16  6:36     ` Huang, Trigger
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Deucher @ 2024-08-15 16:01 UTC (permalink / raw)
  To: Trigger.Huang; +Cc: amd-gfx, sunil.khatri, alexander.deucher

On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@amd.com> wrote:
>
> From: Trigger Huang <Trigger.Huang@amd.com>
>
> Add new separate parameter to control GPU coredump procedure. This can
> be used to decouple the coredump procedure from gpu recovery procedure
>
> Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
>  2 files changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 937de21a7142..4dd465ad14af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;
>  extern int amdgpu_lbpw;
>  extern int amdgpu_compute_multipipe;
>  extern int amdgpu_gpu_recovery;
> +extern int amdgpu_gpu_coredump;
>  extern int amdgpu_emu_mode;
>  extern uint amdgpu_smu_memory_pool_size;
>  extern int amdgpu_smu_pptable_id;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index b9529948f2b2..c5d357420236 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;
>  int amdgpu_lbpw = -1;
>  int amdgpu_compute_multipipe = -1;
>  int amdgpu_gpu_recovery = -1; /* auto */
> +int amdgpu_gpu_coredump;
>  int amdgpu_emu_mode;
>  uint amdgpu_smu_memory_pool_size;
>  int amdgpu_smu_pptable_id = -1;
> @@ -556,6 +557,13 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>  MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
>  module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>
> +/**
> + * DOC: gpu_coredump (int)
> + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable). The default is 0
> + */
> +MODULE_PARM_DESC(gpu_coredump, "Enable GPU coredump mechanism, (1 = enable, 0 = disable (default))");
> +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int, 0444);

I don't think we need a separate parameter for this, although if we
do, this would need to be enabled by default.  If it needs to be
decoupled from reset, that's fine, but I don't see the need for a
separate knob.

Alex

> +
>  /**
>   * DOC: emu_mode (int)
>   * Set value 1 to enable emulation mode. This is only needed when running on an emulator. The default is 0 (disabled).
> --
> 2.34.1
>


* Re: [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump
  2024-08-15 11:38 ` [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump Trigger.Huang
@ 2024-08-15 16:09   ` Alex Deucher
  2024-08-16  6:37     ` Huang, Trigger
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Deucher @ 2024-08-15 16:09 UTC (permalink / raw)
  To: Trigger.Huang; +Cc: amd-gfx, sunil.khatri, alexander.deucher

On Thu, Aug 15, 2024 at 7:50 AM <Trigger.Huang@amd.com> wrote:
>
> From: Trigger Huang <Trigger.Huang@amd.com>
>
> Do the coredump immediately after a job timeout to get a closer
> representation of GPU's error status. For other code paths that
> need to do the coredump, keep the original logic unchanged, except:
> 1,All the coredump operations will be under the control of parameter
> amdgpu_gpu_coredump
> 2, Do the ip dump and register to dev codedump system together.
>
> Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++-------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 10 ++++++++++
>  2 files changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 69d6a5b7ca34..a8eb76d82921 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5341,15 +5341,9 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>                         }
>                 }
>
> -               if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
> -                       dev_info(tmp_adev->dev, "Dumping IP State\n");
> -                       /* Trigger ip dump before we reset the asic */
> -                       for (i = 0; i < tmp_adev->num_ip_blocks; i++)
> -                               if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
> -                                       tmp_adev->ip_blocks[i].version->funcs
> -                                               ->dump_ip_state((void *)tmp_adev);
> -                       dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
> -               }
> +               if (amdgpu_gpu_coredump &&
> +                   (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
> +                       amdgpu_device_gpu_coredump_single(tmp_adev, job);
>
>                 if (need_full_reset)
>                         r = amdgpu_device_ip_suspend(adev);
> @@ -5444,10 +5438,6 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>                                         goto out;
>
>                                 vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
> -
> -                               if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
> -                                       amdgpu_coredump(tmp_adev, vram_lost, reset_context);

This needs to stay here; otherwise, we won't know the status of vram_lost.

> -
>                                 if (vram_lost) {
>                                         DRM_INFO("VRAM is lost due to GPU reset!\n");
>                                         amdgpu_inc_vram_lost(tmp_adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index c6a1783fc9ef..63869820c334 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -48,6 +48,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                 return DRM_GPU_SCHED_STAT_ENODEV;
>         }
>
> +       /*
> +        * Do the coredump immediately after a job timeout to get a closer
> +        * representation of GPU's error status.
> +        */
> +       if (amdgpu_gpu_coredump)
> +               amdgpu_device_gpu_coredump(adev, job);

The problem with doing this here is that we miss core dumps for GPU
resets that happen for reasons besides a user job timeout.  E.g.,
resets from KFD or a hang due to bad programming in a kernel context.

If you want to keep this here, I'd suggest something like:

if (!amdgpu_gpu_recovery)
     amdgpu_core_dump();

That way you'll get a dump in both cases.  Maybe add a flag to skip
printing vram_lost in this case since it happens before reset so it's
irrelevant.
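The suggestion can be modeled as a hypothetical standalone sketch (the flag name and the core_dump() helper are illustrative stand-ins, not the real driver API; the actual driver would carry such a flag in reset_context->flags):

```c
#include <assert.h>
#include <stdbool.h>

static int gpu_recovery = 0;

/* Illustrative flag asking the dump to omit vram_lost reporting. */
#define SKIP_VRAM_LOST_PRINT (1u << 0)

static unsigned int last_dump_flags;
static int dump_count;

/* Stand-in for a coredump helper. */
static void core_dump(unsigned int flags)
{
	last_dump_flags = flags;
	dump_count++;
}

/* When recovery is disabled, dump right at job timeout. No reset has
 * happened yet, so vram_lost is meaningless here and is skipped; when
 * recovery is enabled, the dump stays in the reset path instead, where
 * vram_lost is known. */
static void on_job_timeout(void)
{
	if (!gpu_recovery)
		core_dump(SKIP_VRAM_LOST_PRINT);
}
```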

Alex

>
>         adev->job_hang = true;
>
> @@ -101,6 +107,10 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                 reset_context.src = AMDGPU_RESET_SRC_JOB;
>                 clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>
> +               /* To avoid a double coredump, as we have already done it */
> +               if (amdgpu_gpu_coredump)
> +                       set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> +
>                 r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
>                 if (r)
>                         dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
> --
> 2.34.1
>


* RE: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
  2024-08-15 16:01   ` Alex Deucher
@ 2024-08-16  6:36     ` Huang, Trigger
  2024-08-16  6:39       ` Huang, Trigger
  2024-08-16 13:43       ` Alex Deucher
  0 siblings, 2 replies; 11+ messages in thread
From: Huang, Trigger @ 2024-08-16  6:36 UTC (permalink / raw)
  To: Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Khatri,  Sunil, Deucher, Alexander

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Friday, August 16, 2024 12:02 AM
> To: Huang, Trigger <Trigger.Huang@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil <Sunil.Khatri@amd.com>;
> Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
>
> On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@amd.com> wrote:
> >
> > From: Trigger Huang <Trigger.Huang@amd.com>
> >
> > Add new separate parameter to control GPU coredump procedure. This can
> > be used to decouple the coredump procedure from gpu recovery procedure
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
> >  2 files changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 937de21a7142..4dd465ad14af 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;  extern
> > int amdgpu_lbpw;  extern int amdgpu_compute_multipipe;  extern int
> > amdgpu_gpu_recovery;
> > +extern int amdgpu_gpu_coredump;
> >  extern int amdgpu_emu_mode;
> >  extern uint amdgpu_smu_memory_pool_size;  extern int
> > amdgpu_smu_pptable_id; diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index b9529948f2b2..c5d357420236 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;  int
> amdgpu_lbpw
> > = -1;  int amdgpu_compute_multipipe = -1;  int amdgpu_gpu_recovery =
> > -1; /* auto */
> > +int amdgpu_gpu_coredump;
> >  int amdgpu_emu_mode;
> >  uint amdgpu_smu_memory_pool_size;
> >  int amdgpu_smu_pptable_id = -1;
> > @@ -556,6 +557,13 @@ module_param_named(compute_multipipe,
> > amdgpu_compute_multipipe, int, 0444);
> MODULE_PARM_DESC(gpu_recovery,
> > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 =
> > auto)");  module_param_named(gpu_recovery, amdgpu_gpu_recovery, int,
> > 0444);
> >
> > +/**
> > + * DOC: gpu_coredump (int)
> > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable).
> > +The default is 0  */ MODULE_PARM_DESC(gpu_coredump, "Enable GPU
> > +coredump mechanism, (1 = enable, 0 = disable (default))");
> > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int,
> 0444);
>
> I don't think we need a separate parameter for this, although if we do, this
> would need to be enabled by default.  If it needs to be decoupled from reset,
> that's fine, but I don't see the need for a separate knob.
>
> Alex

Hi Alex,
It is fine to enable it by default.
There are several application scenarios that I can think of:
        1. A customer may need to do the coredump with gpu_recovery disabled. This can be used for GPU hang debugging.
        2. A customer may need to disable the coredump with gpu_recovery enabled. This can be used for quick GPU recovery, especially for lightweight hangs that can be handled by soft recovery or a per-ring reset.
        3. A customer may need to enable the coredump with gpu_recovery enabled. This can be used to recover the GPU while still recording a coredump for later inspection in stress tests or system health checks.
It seems difficult to support all these scenarios with only amdgpu_gpu_recovery, right?
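The scenarios listed above are just the combinations of two independent switches, which a single knob cannot cover; as a hypothetical sketch (simplified, not driver code):

```c
#include <assert.h>
#include <stdbool.h>

struct policy {
	bool recover;  /* reset the GPU after a hang */
	bool dump;     /* record a dev coredump */
};

/* With two independent knobs, all four (recover, dump) combinations are
 * expressible. With gpu_recovery alone, "dump without reset" (scenario 1)
 * and "reset without dump" (scenario 2) could not both be configured. */
static struct policy decide(int gpu_recovery, int gpu_coredump)
{
	struct policy p;

	p.recover = gpu_recovery != 0;
	p.dump = gpu_coredump != 0;
	return p;
}
```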

Regards,
Trigger
>
> > +
> >  /**
> >   * DOC: emu_mode (int)
> >   * Set value 1 to enable emulation mode. This is only needed when running
> on an emulator. The default is 0 (disabled).
> > --
> > 2.34.1
> >


* RE: [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump
  2024-08-15 16:09   ` Alex Deucher
@ 2024-08-16  6:37     ` Huang, Trigger
  0 siblings, 0 replies; 11+ messages in thread
From: Huang, Trigger @ 2024-08-16  6:37 UTC (permalink / raw)
  To: Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Khatri,  Sunil, Deucher, Alexander

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Friday, August 16, 2024 12:09 AM
> To: Huang, Trigger <Trigger.Huang@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil <Sunil.Khatri@amd.com>;
> Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump
>
> On Thu, Aug 15, 2024 at 7:50 AM <Trigger.Huang@amd.com> wrote:
> >
> > From: Trigger Huang <Trigger.Huang@amd.com>
> >
> > Do the coredump immediately after a job timeout to get a closer
> > representation of GPU's error status. For other code paths that need
> > to do the coredump, keep the original logic unchanged, except:
> > 1,All the coredump operations will be under the control of parameter
> > amdgpu_gpu_coredump 2, Do the ip dump and register to dev codedump
> > system together.
> >
> > Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++-------------
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 10 ++++++++++
> >  2 files changed, 13 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 69d6a5b7ca34..a8eb76d82921 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -5341,15 +5341,9 @@ int amdgpu_device_pre_asic_reset(struct
> amdgpu_device *adev,
> >                         }
> >                 }
> >
> > -               if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
> > -                       dev_info(tmp_adev->dev, "Dumping IP State\n");
> > -                       /* Trigger ip dump before we reset the asic */
> > -                       for (i = 0; i < tmp_adev->num_ip_blocks; i++)
> > -                               if (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
> > -                                       tmp_adev->ip_blocks[i].version->funcs
> > -                                               ->dump_ip_state((void *)tmp_adev);
> > -                       dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
> > -               }
> > +               if (amdgpu_gpu_coredump &&
> > +                   (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)))
> > +                       amdgpu_device_gpu_coredump_single(tmp_adev,
> > + job);
> >
> >                 if (need_full_reset)
> >                         r = amdgpu_device_ip_suspend(adev); @@
> > -5444,10 +5438,6 @@ int amdgpu_do_asic_reset(struct list_head
> *device_list_handle,
> >                                         goto out;
> >
> >                                 vram_lost =
> > amdgpu_device_check_vram_lost(tmp_adev);
> > -
> > -                               if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context-
> >flags))
> > -                                       amdgpu_coredump(tmp_adev, vram_lost,
> reset_context);
>
> This needs to stay here, otherwise, we won't know the status of vram_lost.
>

Ok, understood. It seems vram_lost is the only dependency between the coredump and GPU reset. Will not change the logic here.

> > -
> >                                 if (vram_lost) {
> >                                         DRM_INFO("VRAM is lost due to GPU reset!\n");
> >
> > amdgpu_inc_vram_lost(tmp_adev); diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index c6a1783fc9ef..63869820c334 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -48,6 +48,12 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> >                 return DRM_GPU_SCHED_STAT_ENODEV;
> >         }
> >
> > +       /*
> > +        * Do the coredump immediately after a job timeout to get a closer
> > +        * representation of GPU's error status.
> > +        */
> > +       if (amdgpu_gpu_coredump)
> > +               amdgpu_device_gpu_coredump(adev, job);
>
> The problem with doing this here is that we miss core dumps for GPU resets
> that happen for reasons besides a user job timeout.  E.g., resets from KFD or a
> hang due to bad programming in a kernel context.

Ok, I will not change the logic in the code paths for other reset reasons, but I want to add the master switch, amdgpu_gpu_coredump, as a control when doing the dump in those paths.

>
> If you want to keep this here, I'd suggest something like:
>
> if (!amdgpu_gpu_recovery)
>      amdgu_core_dump();
>
> That way you'll get a dump in both cases.  Maybe add a flag to skip printing
> vram_lost in this case since it happens before reset so it's irrelevant.
>
> Alex

Good idea, let me add a flag to skip printing vram_lost in the job timeout scenario, but still print vram_lost in other code paths, like a KFD hang.
And as described before, there are several application scenarios, and it seems amdgpu_gpu_recovery alone cannot support all of them.

Regards,
Trigger
>
> >
> >         adev->job_hang = true;
> >
> > @@ -101,6 +107,10 @@ static enum drm_gpu_sched_stat
> amdgpu_job_timedout(struct drm_sched_job *s_job)
> >                 reset_context.src = AMDGPU_RESET_SRC_JOB;
> >                 clear_bit(AMDGPU_NEED_FULL_RESET,
> > &reset_context.flags);
> >
> > +               /* To avoid a double coredump, as we have already done it */
> > +               if (amdgpu_gpu_coredump)
> > +                       set_bit(AMDGPU_SKIP_COREDUMP,
> > + &reset_context.flags);
> > +
> >                 r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
> >                 if (r)
> >                         dev_err(adev->dev, "GPU Recovery Failed:
> > %d\n", r);
> > --
> > 2.34.1
> >


* RE: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
  2024-08-16  6:36     ` Huang, Trigger
@ 2024-08-16  6:39       ` Huang, Trigger
  2024-08-16 13:43       ` Alex Deucher
  1 sibling, 0 replies; 11+ messages in thread
From: Huang, Trigger @ 2024-08-16  6:39 UTC (permalink / raw)
  To: Huang, Trigger, Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Khatri,  Sunil, Deucher, Alexander

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of
> Huang, Trigger
> Sent: Friday, August 16, 2024 2:36 PM
> To: Alex Deucher <alexdeucher@gmail.com>
> Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil <Sunil.Khatri@amd.com>;
> Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: RE: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: Alex Deucher <alexdeucher@gmail.com>
> > Sent: Friday, August 16, 2024 12:02 AM
> > To: Huang, Trigger <Trigger.Huang@amd.com>
> > Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil
> > <Sunil.Khatri@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>
> > Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
> >
> > On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@amd.com> wrote:
> > >
> > > From: Trigger Huang <Trigger.Huang@amd.com>
> > >
> > > Add new separate parameter to control GPU coredump procedure. This
> > > can be used to decouple the coredump procedure from gpu recovery
> > > procedure
> > >
> > > Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
> > >  2 files changed, 9 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > index 937de21a7142..4dd465ad14af 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;  extern
> > > int amdgpu_lbpw;  extern int amdgpu_compute_multipipe;  extern int
> > > amdgpu_gpu_recovery;
> > > +extern int amdgpu_gpu_coredump;
> > >  extern int amdgpu_emu_mode;
> > >  extern uint amdgpu_smu_memory_pool_size;  extern int
> > > amdgpu_smu_pptable_id; diff --git
> > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index b9529948f2b2..c5d357420236 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;  int
> > amdgpu_lbpw
> > > = -1;  int amdgpu_compute_multipipe = -1;  int amdgpu_gpu_recovery =
> > > -1; /* auto */
> > > +int amdgpu_gpu_coredump;
> > >  int amdgpu_emu_mode;
> > >  uint amdgpu_smu_memory_pool_size;
> > >  int amdgpu_smu_pptable_id = -1;
> > > @@ -556,6 +557,13 @@ module_param_named(compute_multipipe,
> > > amdgpu_compute_multipipe, int, 0444);
> > MODULE_PARM_DESC(gpu_recovery,
> > > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 =
> > > auto)");  module_param_named(gpu_recovery, amdgpu_gpu_recovery,
> int,
> > > 0444);
> > >
> > > +/**
> > > + * DOC: gpu_coredump (int)
> > > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable).
> > > +The default is 0  */ MODULE_PARM_DESC(gpu_coredump, "Enable GPU
> > > +coredump mechanism, (1 = enable, 0 = disable (default))");
> > > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int,
> > 0444);
> >
> > I don't think we need a separate parameter for this, although if we
> > do, this would need to be enabled by default.  If it needs to be
> > decoupled from reset, that's fine, but I don't see the need for a separate
> knob.
> >
> > Alex
>
> Hi Alex,
> It is fine to enable it by default
> There are several application scenarios that I can think of.
>         1, Customer may need to do the core dump with gpu_recovery disabled.
> This can be used for GPU hang debug
>         2, Customer may need to disable the core dump with gpu_recovery
> enabled. This can be used for quick GPU recovery, especially for some
> lightweight hangs that can be processed by soft recovery or per ring reset.
>         3, Customer may need to enable the core dump with gpu_recovery
> enabled. This can be used for GPU recovery but record the core dump for
> further check in stress test or system health check.
> It seems not easy to support all the scenarios by only using
> amdgpu_gpu_coredump, right?

Sorry, this is a typo; here I meant amdgpu_gpu_recovery, not amdgpu_gpu_coredump.

>
> Regards,
> Trigger
> >
> > > +
> > >  /**
> > >   * DOC: emu_mode (int)
> > >   * Set value 1 to enable emulation mode. This is only needed when
> > > running
> > on an emulator. The default is 0 (disabled).
> > > --
> > > 2.34.1
> > >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
  2024-08-16  6:36     ` Huang, Trigger
  2024-08-16  6:39       ` Huang, Trigger
@ 2024-08-16 13:43       ` Alex Deucher
  2024-08-19  9:29         ` Huang, Trigger
  1 sibling, 1 reply; 11+ messages in thread
From: Alex Deucher @ 2024-08-16 13:43 UTC (permalink / raw)
  To: Huang, Trigger
  Cc: amd-gfx@lists.freedesktop.org, Khatri, Sunil, Deucher, Alexander

On Fri, Aug 16, 2024 at 2:36 AM Huang, Trigger <Trigger.Huang@amd.com> wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: Alex Deucher <alexdeucher@gmail.com>
> > Sent: Friday, August 16, 2024 12:02 AM
> > To: Huang, Trigger <Trigger.Huang@amd.com>
> > Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil <Sunil.Khatri@amd.com>;
> > Deucher, Alexander <Alexander.Deucher@amd.com>
> > Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
> >
> > On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@amd.com> wrote:
> > >
> > > From: Trigger Huang <Trigger.Huang@amd.com>
> > >
> > > Add new separate parameter to control GPU coredump procedure. This can
> > > be used to decouple the coredump procedure from gpu recovery procedure
> > >
> > > Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
> > >  2 files changed, 9 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > index 937de21a7142..4dd465ad14af 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;  extern
> > > int amdgpu_lbpw;  extern int amdgpu_compute_multipipe;  extern int
> > > amdgpu_gpu_recovery;
> > > +extern int amdgpu_gpu_coredump;
> > >  extern int amdgpu_emu_mode;
> > >  extern uint amdgpu_smu_memory_pool_size;  extern int
> > > amdgpu_smu_pptable_id; diff --git
> > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index b9529948f2b2..c5d357420236 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;  int
> > amdgpu_lbpw
> > > = -1;  int amdgpu_compute_multipipe = -1;  int amdgpu_gpu_recovery =
> > > -1; /* auto */
> > > +int amdgpu_gpu_coredump;
> > >  int amdgpu_emu_mode;
> > >  uint amdgpu_smu_memory_pool_size;
> > >  int amdgpu_smu_pptable_id = -1;
> > > @@ -556,6 +557,13 @@ module_param_named(compute_multipipe,
> > > amdgpu_compute_multipipe, int, 0444);
> > MODULE_PARM_DESC(gpu_recovery,
> > > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 =
> > > auto)");  module_param_named(gpu_recovery, amdgpu_gpu_recovery, int,
> > > 0444);
> > >
> > > +/**
> > > + * DOC: gpu_coredump (int)
> > > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable).
> > > +The default is 0  */ MODULE_PARM_DESC(gpu_coredump, "Enable GPU
> > > +coredump mechanism, (1 = enable, 0 = disable (default))");
> > > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int,
> > 0444);
> >
> > I don't think we need a separate parameter for this, although if we do, this
> > would need to be enabled by default.  If it needs to be decoupled from reset,
> > that's fine, but I don't see the need for a separate knob.
> >
> > Alex
>
> Hi Alex,
> It is fine to enable it by default
> There are several application scenarios that I can think of.
>         1, Customer may need to do the core dump with gpu_recovery disabled. This can be used for GPU hang debug
>         2, Customer may need to disable the core dump with gpu_recovery enabled. This can be used for quick GPU recovery, especially for some lightweight hangs that can be processed by soft recovery or per ring reset.
>         3, Customer may need to enable the core dump with gpu_recovery enabled. This can be used for GPU recovery but record the core dump for further check in stress test or system health check.
> It seems not easy to support all the scenarios by only using amdgpu_gpu_coredump, right?

We always want devcoredump enabled by default for full adapter resets,
otherwise it kind of defeats the purpose of the feature.  Do we want
devcoredumps for cases where we can recover via soft recovery or per
queue reset?  If we mainly care about full adapter reset then we can
do something like:

1. in amdgpu_job_timedout(), we can do:
if (!amdgpu_gpu_recovery)
    amdgpu_dev_coredump()
between per ring reset and full adapter reset.  That way it won't get
called for soft recovery or per queue reset.

2. leave the current dev_coredump code in place for the case when
recovery is enabled.

If we want it for soft recovery, queue reset, and full adapter reset
alike, then we just unconditionally call it in amdgpu_job_timedout().
If the finer-grained resets don't work, we'll get two dumps, but I
think that's probably ok.

Alex

>
> Regards,
> Trigger
> >
> > > +
> > >  /**
> > >   * DOC: emu_mode (int)
> > >   * Set value 1 to enable emulation mode. This is only needed when running
> > on an emulator. The default is 0 (disabled).
> > > --
> > > 2.34.1
> > >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
  2024-08-16 13:43       ` Alex Deucher
@ 2024-08-19  9:29         ` Huang, Trigger
  0 siblings, 0 replies; 11+ messages in thread
From: Huang, Trigger @ 2024-08-19  9:29 UTC (permalink / raw)
  To: Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Khatri,  Sunil, Deucher, Alexander

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Friday, August 16, 2024 9:43 PM
> To: Huang, Trigger <Trigger.Huang@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil <Sunil.Khatri@amd.com>;
> Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
>
> On Fri, Aug 16, 2024 at 2:36 AM Huang, Trigger <Trigger.Huang@amd.com>
> wrote:
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > > -----Original Message-----
> > > From: Alex Deucher <alexdeucher@gmail.com>
> > > Sent: Friday, August 16, 2024 12:02 AM
> > > To: Huang, Trigger <Trigger.Huang@amd.com>
> > > Cc: amd-gfx@lists.freedesktop.org; Khatri, Sunil
> > > <Sunil.Khatri@amd.com>; Deucher, Alexander
> > > <Alexander.Deucher@amd.com>
> > > Subject: Re: [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter
> > >
> > > On Thu, Aug 15, 2024 at 7:39 AM <Trigger.Huang@amd.com> wrote:
> > > >
> > > > From: Trigger Huang <Trigger.Huang@amd.com>
> > > >
> > > > Add new separate parameter to control GPU coredump procedure. This
> > > > can be used to decouple the coredump procedure from gpu recovery
> > > > procedure
> > > >
> > > > Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
> > > >  2 files changed, 9 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > index 937de21a7142..4dd465ad14af 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > @@ -201,6 +201,7 @@ extern uint amdgpu_force_long_training;
> > > > extern int amdgpu_lbpw;  extern int amdgpu_compute_multipipe;
> > > > extern int amdgpu_gpu_recovery;
> > > > +extern int amdgpu_gpu_coredump;
> > > >  extern int amdgpu_emu_mode;
> > > >  extern uint amdgpu_smu_memory_pool_size;  extern int
> > > > amdgpu_smu_pptable_id; diff --git
> > > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > index b9529948f2b2..c5d357420236 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > @@ -178,6 +178,7 @@ uint amdgpu_force_long_training;  int
> > > amdgpu_lbpw
> > > > = -1;  int amdgpu_compute_multipipe = -1;  int amdgpu_gpu_recovery
> > > > = -1; /* auto */
> > > > +int amdgpu_gpu_coredump;
> > > >  int amdgpu_emu_mode;
> > > >  uint amdgpu_smu_memory_pool_size;  int amdgpu_smu_pptable_id =
> > > > -1; @@ -556,6 +557,13 @@
> module_param_named(compute_multipipe,
> > > > amdgpu_compute_multipipe, int, 0444);
> > > MODULE_PARM_DESC(gpu_recovery,
> > > > "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 =
> > > > auto)");  module_param_named(gpu_recovery, amdgpu_gpu_recovery,
> > > > int, 0444);
> > > >
> > > > +/**
> > > > + * DOC: gpu_coredump (int)
> > > > + * Set to enable GPU coredump mechanism (1 = enable, 0 = disable).
> > > > +The default is 0  */ MODULE_PARM_DESC(gpu_coredump, "Enable GPU
> > > > +coredump mechanism, (1 = enable, 0 = disable (default))");
> > > > +module_param_named(gpu_coredump, amdgpu_gpu_coredump, int,
> > > 0444);
> > >
> > > I don't think we need a separate parameter for this, although if we
> > > do, this would need to be enabled by default.  If it needs to be
> > > decoupled from reset, that's fine, but I don't see the need for a separate
> knob.
> > >
> > > Alex
> >
> > Hi Alex,
> > It is fine to enable it by default
> > There are several application scenarios that I can think of.
> >         1, Customer may need to do the core dump with gpu_recovery disabled.
> This can be used for GPU hang debug
> >         2, Customer may need to disable the core dump with gpu_recovery
> enabled. This can be used for quick GPU recovery, especially for some
> lightweight hangs that can be processed by soft recovery or per ring reset.
> >         3, Customer may need to enable the core dump with gpu_recovery
> enabled. This can be used for GPU recovery but record the core dump for
> further check in stress test or system health check.
> > It seems not easy to support all the scenarios by only using
> amdgpu_gpu_coredump, right?
>
> We always want devcoredump enabled by default for full adapter resets,
> otherwise it kind of defeats the purpose of the feature.  Do we want
> devcoredumps for cases where we can recover via soft recovery or per queue
> reset?  If we mainly care about full adapter reset then we can do something
> like:
>
> 1. in amdgpu_job_timedout(), we can do:
> if (!amdgpu_gpu_recovery)
>     amdgpu_dev_coredump()
> between per ring reset and full adapter reset.  That way it won't get called for
> soft recovery or per queue reset.
>
> 2. leave the current dev_coredump code in place for the case when recovery is
> enabled.
>
> If we want it for both soft-recovery and queue reset and full adapter reset,
> then we just unconditionally call it in amdgpu_job_timedout().  If the finer
> grained resets don't work, we'll get two dumps, but I think that's probably ok.

OK, that sounds reasonable. Let me drop the new parameter in the next version of the patch.
And I think we want it for all three (soft recovery + queue reset + full adapter reset) in the job timeout scenario; my only concern is that we need to do the coredump immediately after a job timeout to get a closer representation of the GPU's error status.
For other cases/scenarios, I will leave the current logic unchanged in the new patch.

Thanks,
Trigger
>
> Alex
>
> >
> > Regards,
> > Trigger
> > >
> > > > +
> > > >  /**
> > > >   * DOC: emu_mode (int)
> > > >   * Set value 1 to enable emulation mode. This is only needed when
> > > > running
> > > on an emulator. The default is 0 (disabled).
> > > > --
> > > > 2.34.1
> > > >

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-08-19  9:29 UTC | newest]

Thread overview: 11+ messages
2024-08-15 11:38 [PATCH 0/3] Improve the dev coredump Trigger.Huang
2024-08-15 11:38 ` [PATCH 1/3] drm/amdgpu: Add gpu_coredump parameter Trigger.Huang
2024-08-15 16:01   ` Alex Deucher
2024-08-16  6:36     ` Huang, Trigger
2024-08-16  6:39       ` Huang, Trigger
2024-08-16 13:43       ` Alex Deucher
2024-08-19  9:29         ` Huang, Trigger
2024-08-15 11:38 ` [PATCH 2/3] drm/amdgpu: introduce new API for GPU core dump Trigger.Huang
2024-08-15 11:38 ` [PATCH 3/3] drm/amdgpu: Change the timing of doing coredump Trigger.Huang
2024-08-15 16:09   ` Alex Deucher
2024-08-16  6:37     ` Huang, Trigger
