From: "Khatri, Sunil" <sunil.khatri@amd.com>
To: Trigger.Huang@amd.com, amd-gfx@lists.freedesktop.org
Cc: alexander.deucher@amd.com
Subject: Re: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo
Date: Mon, 19 Aug 2024 16:00:46 +0530 [thread overview]
Message-ID: <a0978549-9bd3-e985-76eb-f50115f76bf4@amd.com> (raw)
In-Reply-To: <20240819095331.460721-3-Trigger.Huang@amd.com>
On 8/19/2024 3:23 PM, Trigger.Huang@amd.com wrote:
> From: Trigger Huang <Trigger.Huang@amd.com>
>
> Do the coredump immediately after a job timeout to get a closer
> representation of GPU's error status.
>
> V2: This will skip printing vram_lost as the GPU reset is not
> happened yet (Alex)
>
> V3: Unconditionally call the core dump as we care about all the reset
> functions(soft-recovery and queue reset and full adapter reset, Alex)
>
> Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 62 +++++++++++++++++++++++++
> 1 file changed, 62 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index c6a1783fc9ef..ebbb1434073e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -30,6 +30,61 @@
> #include "amdgpu.h"
> #include "amdgpu_trace.h"
> #include "amdgpu_reset.h"
> +#include "amdgpu_dev_coredump.h"
> +#include "amdgpu_xgmi.h"
> +
> +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
> + struct amdgpu_job *job)
> +{
> + int i;
> +
> + dev_info(adev->dev, "Dumping IP State\n");
> + for (i = 0; i < adev->num_ip_blocks; i++) {
> + if (adev->ip_blocks[i].version->funcs->dump_ip_state)
> + adev->ip_blocks[i].version->funcs
> + ->dump_ip_state((void *)adev);
> + dev_info(adev->dev, "Dumping IP State Completed\n");
> + }
> +
> + amdgpu_coredump(adev, true, false, job);
> +}
> +
> +static void amdgpu_job_core_dump(struct amdgpu_device *adev,
> + struct amdgpu_job *job)
> +{
> + struct list_head device_list, *device_list_handle = NULL;
> + struct amdgpu_device *tmp_adev = NULL;
> + struct amdgpu_hive_info *hive = NULL;
> +
> + if (!amdgpu_sriov_vf(adev))
> + hive = amdgpu_get_xgmi_hive(adev);
> + if (hive)
> + mutex_lock(&hive->hive_lock);
> + /*
> + * Reuse the logic in amdgpu_device_gpu_recover() to build list of
> + * devices for code dump
> + */
> + INIT_LIST_HEAD(&device_list);
> + if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
> + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
> + list_add_tail(&tmp_adev->reset_list, &device_list);
> + if (!list_is_first(&adev->reset_list, &device_list))
> + list_rotate_to_front(&adev->reset_list, &device_list);
> + device_list_handle = &device_list;
> + } else {
> + list_add_tail(&adev->reset_list, &device_list);
> + device_list_handle = &device_list;
> + }
> +
> + /* Do the coredump for each device */
> + list_for_each_entry(tmp_adev, device_list_handle, reset_list)
> + amdgpu_job_do_core_dump(tmp_adev, job);
> +
> + if (hive) {
> + mutex_unlock(&hive->hive_lock);
> + amdgpu_put_xgmi_hive(hive);
> + }
> +}
>
> static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> {
> @@ -48,6 +103,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> return DRM_GPU_SCHED_STAT_ENODEV;
> }
>
> + amdgpu_job_core_dump(adev, job);
The philosophy of hang recovery is to let the hardware and software try to
recover. Here we attempt a soft recovery first, and I think we should wait
for the soft recovery and do the dump only if it fails; that is exactly
what the current implementation does.
Also, we need to make sure that the tasks already in the queue are put on
hold and their sync points are signalled before we dump.
Please check what steps are taken before the dump in the current
implementation.
Regards
Sunil Khatri
>
> adev->job_hang = true;
>
> @@ -101,6 +157,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> reset_context.src = AMDGPU_RESET_SRC_JOB;
> clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>
> + /*
> + * To avoid an unnecessary extra coredump, as we have already
> + * got the very close representation of GPU's error status
> + */
> + set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
> +
> r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
> if (r)
> dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);
2024-08-19 9:53 [PATCH 0/2] Improve the dev coredump Trigger.Huang
2024-08-19 9:53 ` [PATCH 1/2] drm/amdgpu: skip printing vram_lost if needed Trigger.Huang
2024-08-19 9:53 ` [PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo Trigger.Huang
2024-08-19 10:30 ` Khatri, Sunil [this message]
2024-08-20 7:30 ` Huang, Trigger
2024-08-20 14:06 ` Alex Deucher
2024-08-20 15:07 ` Khatri, Sunil
2024-08-20 15:29 ` Alex Deucher
2024-08-20 15:31 ` Khatri, Sunil
2024-08-20 16:01 ` Alex Deucher
2024-08-20 16:54 ` Khatri, Sunil
2024-08-21 8:19 ` Huang, Trigger