AMD-GFX Archive on lore.kernel.org
From: "Khatri, Sunil" <sunil.khatri@amd.com>
To: "Huang, Trigger" <Trigger.Huang@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>
Subject: Re: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo
Date: Tue, 20 Aug 2024 21:01:24 +0530	[thread overview]
Message-ID: <cc16efe1-5de5-20bb-8556-143a121de6d0@amd.com> (raw)
In-Reply-To: <SA1PR12MB7442E8D1DB91114A51FC9EB0FE8D2@SA1PR12MB7442.namprd12.prod.outlook.com>


On 8/20/2024 1:00 PM, Huang, Trigger wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: Khatri, Sunil <Sunil.Khatri@amd.com>
>> Sent: Monday, August 19, 2024 6:31 PM
>> To: Huang, Trigger <Trigger.Huang@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>
>> Subject: Re: [PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo
>>
>>
>> On 8/19/2024 3:23 PM, Trigger.Huang@amd.com wrote:
>>> From: Trigger Huang <Trigger.Huang@amd.com>
>>>
>>> Do the coredump immediately after a job timeout to get a closer
>>> representation of the GPU's error status.
>>>
>>> V2: Skip printing vram_lost, as the GPU reset has not happened
>>> yet (Alex)
>>>
>>> V3: Unconditionally call the core dump, as we care about all the reset
>>> functions (soft recovery, queue reset, and full adapter reset, Alex)
>>>
>>> Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 62 +++++++++++++++++++++++++
>>>    1 file changed, 62 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index c6a1783fc9ef..ebbb1434073e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -30,6 +30,61 @@
>>>    #include "amdgpu.h"
>>>    #include "amdgpu_trace.h"
>>>    #include "amdgpu_reset.h"
>>> +#include "amdgpu_dev_coredump.h"
>>> +#include "amdgpu_xgmi.h"
>>> +
>>> +static void amdgpu_job_do_core_dump(struct amdgpu_device *adev,
>>> +                               struct amdgpu_job *job)
>>> +{
>>> +   int i;
>>> +
>>> +   dev_info(adev->dev, "Dumping IP State\n");
>>> +   for (i = 0; i < adev->num_ip_blocks; i++) {
>>> +           if (adev->ip_blocks[i].version->funcs->dump_ip_state)
>>> +                   adev->ip_blocks[i].version->funcs
>>> +                           ->dump_ip_state((void *)adev);
>>> +   }
>>> +   dev_info(adev->dev, "Dumping IP State Completed\n");
>>> +
>>> +   amdgpu_coredump(adev, true, false, job);
>>> +}
>>> +
>>> +static void amdgpu_job_core_dump(struct amdgpu_device *adev,
>>> +                            struct amdgpu_job *job)
>>> +{
>>> +   struct list_head device_list, *device_list_handle =  NULL;
>>> +   struct amdgpu_device *tmp_adev = NULL;
>>> +   struct amdgpu_hive_info *hive = NULL;
>>> +
>>> +   if (!amdgpu_sriov_vf(adev))
>>> +           hive = amdgpu_get_xgmi_hive(adev);
>>> +   if (hive)
>>> +           mutex_lock(&hive->hive_lock);
>>> +   /*
>>> +    * Reuse the logic in amdgpu_device_gpu_recover() to build the list
>>> +    * of devices for the coredump
>>> +    */
>>> +   INIT_LIST_HEAD(&device_list);
>>> +   if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
>>> +           list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
>>> +                   list_add_tail(&tmp_adev->reset_list, &device_list);
>>> +           if (!list_is_first(&adev->reset_list, &device_list))
>>> +                   list_rotate_to_front(&adev->reset_list, &device_list);
>>> +           device_list_handle = &device_list;
>>> +   } else {
>>> +           list_add_tail(&adev->reset_list, &device_list);
>>> +           device_list_handle = &device_list;
>>> +   }
>>> +
>>> +   /* Do the coredump for each device */
>>> +   list_for_each_entry(tmp_adev, device_list_handle, reset_list)
>>> +           amdgpu_job_do_core_dump(tmp_adev, job);
>>> +
>>> +   if (hive) {
>>> +           mutex_unlock(&hive->hive_lock);
>>> +           amdgpu_put_xgmi_hive(hive);
>>> +   }
>>> +}
>>>
>>>    static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>    {
>>> @@ -48,6 +103,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>              return DRM_GPU_SCHED_STAT_ENODEV;
>>>      }
>>>
>>> +   amdgpu_job_core_dump(adev, job);
>> The philosophy of hang recovery is to let the hardware and software try to
>> recover. Here we attempt a soft recovery first, and I think we should wait
>> for the soft recovery; only if it fails do we do the dump, which is exactly
>> what we do today.
> Hi Sunil,
> Thanks for the suggestion; that's reasonable. But my concern is that after a soft recovery the GPU's status may change (take gfx9 for example: it will try to kill the currently hung wave).
> Actually, in most cases a real shader hang cannot be resolved through soft recovery, and at that moment we need a very close dump/snapshot/representation of the GPU's current error status.
> It is just like a scandump: when we do a scandump for a shader hang, we disable gpu_recovery so that no soft recovery/per-queue reset/HW reset happens before the scandump, right?
> On most products there is no scandump interface, so the core dump is even more important for debugging GPU hang issues.
>
> Regards,
> Trigger
>
>> Also, we need to make sure that the tasks already in the queue are put on
>> hold and that their sync points are signalled before we dump. Please check
>> which steps are taken before we dump in the current implementation.
> Do you mean something like:
>          drm_sched_wqueue_stop(&ring->sched);
>          amdgpu_fence_driver_force_completion(ring); // Since no GPU reset has happened, is it reasonable to call it here?
>          amdgpu_job_core_dump(adev, job);
Before a hard reset we do all of those things. But I guess we may not need
them if we capture before the soft reset itself; you would need to test it
to see whether the dumped values are correct or not.
Also, apart from the hardware state we dump a lot of other information, such
as the ring buffers. If jobs are still being submitted, we might not get the
right data, since the ring might be in use and being consumed/filled at that
time; that is why stopping the scheduler helps. But if the soft reset is
successful, we do not want to do any of that.

So here is what I think, but Alex, please say whether it makes sense.
If recovery is disabled: capture the IP dump before the soft reset. (This
gives a close register state, but the ring buffer has to be treated as
in-use, since the scheduler is still running.)
If recovery is enabled: capture the IP dump as in the current
implementation, which makes sure the drm scheduler is stopped and the fence
timeouts are handled first.

The print_ip_state function pointer could be called to capture the dump
whenever it is needed, in both of the cases above. Right now, print is
called only when the dump is actually read out, i.e. when the data file
generated by devcoredump is read.

Regards
Sunil Khatri


> Regards,
> Trigger
>
>> Regards
>>
>> Sunil khatri
>>
>>>      adev->job_hang = true;
>>>
>>> @@ -101,6 +157,12 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>              reset_context.src = AMDGPU_RESET_SRC_JOB;
>>>              clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>>>
>>> +           /*
>>> +            * Avoid an unnecessary extra coredump, as we have already
>>> +            * captured a very close representation of the GPU's error
>>> +            * status above.
>>> +            */
>>> +           set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
>>> +
>>> +           r = amdgpu_device_gpu_recover(ring->adev, job, &reset_context);
>>>              if (r)
>>>                      dev_err(adev->dev, "GPU Recovery Failed: %d\n", r);

Thread overview: 12+ messages
2024-08-19  9:53 [PATCH 0/2] Improve the dev coredump Trigger.Huang
2024-08-19  9:53 ` [PATCH 1/2] drm/amdgpu: skip printing vram_lost if needed Trigger.Huang
2024-08-19  9:53 ` [PATCH 2/2] drm/amdgpu: Do core dump immediately when job tmo Trigger.Huang
2024-08-19 10:30   ` Khatri, Sunil
2024-08-20  7:30     ` Huang, Trigger
2024-08-20 14:06       ` Alex Deucher
2024-08-20 15:07         ` Khatri, Sunil
2024-08-20 15:29           ` Alex Deucher
2024-08-20 15:31       ` Khatri, Sunil [this message]
2024-08-20 16:01         ` Alex Deucher
2024-08-20 16:54           ` Khatri, Sunil
2024-08-21  8:19             ` Huang, Trigger
