From: <Trigger.Huang@amd.com>
To: <amd-gfx@lists.freedesktop.org>
Cc: <sunil.khatri@amd.com>, <alexander.deucher@amd.com>,
Trigger Huang <Trigger.Huang@amd.com>
Subject: [PATCH 0/4] Improve the dev coredump
Date: Fri, 16 Aug 2024 15:54:43 +0800
Message-ID: <20240816075447.442983-1-Trigger.Huang@amd.com>
From: Trigger Huang <Trigger.Huang@amd.com>
The current dev coredump implementation cannot always satisfy customer requirements, for two reasons:
1, The dev coredump is controlled by gpu_recovery. Consider the following scenarios:
1), The customer needs a core dump with gpu_recovery disabled. This is useful for debugging GPU hangs.
2), The customer needs to disable the core dump with gpu_recovery enabled. This allows quick GPU recovery, especially for lightweight hangs that can be handled by soft recovery or per-ring reset.
3), The customer needs the core dump enabled with gpu_recovery enabled. The GPU is recovered, and the core dump is recorded for later analysis, for example in a stress test or system health check.
It is not easy to support all of these scenarios with amdgpu_gpu_recovery alone.
2, When a job timeout happens, the GPU status is dumped only after many other operations, such as soft_reset. The concern is that the dumped status may no longer be close to the GPU's actual error state.
This series therefore introduces the following changes:
1, Add a new parameter, gpu_coredump, to decouple the coredump from GPU reset.
2, Do the coredump immediately after a job timeout.
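The intended behaviour can be sketched as a small standalone C model. This is not the code from the patches: struct hang_config, struct hang_plan and plan_job_timeout() are hypothetical names used only for illustration, both parameters are simplified to booleans, and "recovery" stands for whichever of soft recovery, per-ring reset or full reset the driver would choose. The sketch only shows that the two settings are evaluated independently and that the dump, when enabled, is planned before any recovery step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two module parameters (simplified to bool). */
struct hang_config {
	bool gpu_recovery;	/* recover the GPU after a hang */
	bool gpu_coredump;	/* capture a dev coredump after a hang */
};

enum hang_step {
	STEP_COREDUMP,		/* snapshot the GPU state */
	STEP_RECOVERY,		/* soft recovery, per-ring reset or full reset */
};

#define MAX_STEPS 2

/* Ordered list of actions to take for one job timeout. */
struct hang_plan {
	enum hang_step steps[MAX_STEPS];
	size_t count;
};

static void plan_push(struct hang_plan *plan, enum hang_step step)
{
	assert(plan->count < MAX_STEPS);
	plan->steps[plan->count++] = step;
}

/*
 * Build the plan for a job timeout. The coredump decision depends only on
 * gpu_coredump, and the dump is queued first so that it reflects the state
 * at the time of the hang rather than the state after recovery.
 */
static struct hang_plan plan_job_timeout(struct hang_config cfg)
{
	struct hang_plan plan = { .count = 0 };

	if (cfg.gpu_coredump)
		plan_push(&plan, STEP_COREDUMP);
	if (cfg.gpu_recovery)
		plan_push(&plan, STEP_RECOVERY);

	return plan;
}
```

In this model, scenario 1) yields a plan containing only the dump, scenario 2) only the recovery, and scenario 3) the dump followed by the recovery.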
Trigger Huang (4):
drm/amdgpu: Add gpu_coredump parameter
drm/amdgpu: Use gpu_coredump to control core dump
drm/amdgpu: skip printing vram_lost if needed
drm/amdgpu: Do core dump immediately when job tmo
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
.../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 19 +++---
.../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.h | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 64 +++++++++++++++++++
6 files changed, 89 insertions(+), 15 deletions(-)
--
2.34.1