From: "Christian König" <christian.koenig@amd.com>
To: "Khatri, Sunil" <sukhatri@amd.com>,
Sunil Khatri <sunil.khatri@amd.com>,
Alex Deucher <alexander.deucher@amd.com>,
Shashank Sharma <shashank.sharma@amd.com>
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org, Mukul Joshi <mukul.joshi@amd.com>,
Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>,
"Sharma, Shashank" <Shashank.Sharma@amd.com>
Subject: Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump
Date: Thu, 7 Mar 2024 13:40:08 +0100
Message-ID: <f3b6d600-e8f2-48cf-b19b-ddb28e9bcbee@amd.com>
In-Reply-To: <bd6a70dc-d710-498e-b4ed-35c6106cd898@amd.com>
On 07.03.24 at 09:37, Khatri, Sunil wrote:
>
> On 3/7/2024 1:47 PM, Christian König wrote:
>> On 06.03.24 at 19:19, Sunil Khatri wrote:
>>> Add page fault information to the devcoredump.
>>>
>>> Output of devcoredump:
>>> **** AMDGPU Device Coredump ****
>>> version: 1
>>> kernel: 6.7.0-amd-staging-drm-next
>>> module: amdgpu
>>> time: 29.725011811
>>> process_name: soft_recovery_p PID: 1720
>>>
>>> Ring timed out details
>>> IP Type: 0 Ring Name: gfx_0.0.0
>>>
>>> [gfxhub] Page fault observed for GPU family:143
>>> Faulty page starting at address 0x0000000000000000
>>> Protection fault status register:0x301031
>>>
>>> VRAM is lost due to GPU reset!
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++++++++++++++-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
>>> 2 files changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> index 147100c27c2d..d7fea6cdf2f9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t
>>> offset, size_t count,
>>> coredump->ring->name);
>>> }
>>> + if (coredump->fault_info.status) {
>>> + struct amdgpu_vm_fault_info *fault_info =
>>> &coredump->fault_info;
>>> +
>>> + drm_printf(&p, "\n[%s] Page fault observed for GPU
>>> family:%d\n",
>>> + fault_info->vmhub ? "mmhub" : "gfxhub",
>>> + coredump->adev->family);
>>> + drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
>>> + fault_info->addr);
>>> + drm_printf(&p, "Protection fault status register:0x%x\n",
>>> + fault_info->status);
>>> + }
>>> +
>>> if (coredump->reset_vram_lost)
>>> - drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>> + drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>>> if (coredump->adev->reset_info.num_regs) {
>>> drm_printf(&p, "AMDGPU register dumps:\nOffset:
>>> Value:\n");
>>> @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device
>>> *adev, bool vram_lost,
>>> if (job) {
>>> s_job = &job->base;
>>> coredump->ring = to_amdgpu_ring(s_job->sched);
>>> + coredump->fault_info = job->vm->fault_info;
>>
>> That's illegal. The VM pointer might already be stale at this point.
>>
>> I think you need to add the fault info of the last fault globally in
>> the VRAM manager or move this to the process info Shashank is working
>> on.
> Are you saying that during the reset or otherwise, a VM which is part
> of this job could have been freed, and we might have a NULL
> dereference or invalid reference? So far, with the resets and
> page faults that I have created using the same app which we
> are using for IH overflow, I have only ever gotten a valid VM.
>
> Assuming amdgpu_vm is freed for this job or stale, are you
> suggesting to update this information in adev->vm_manager along with
> the existing per-VM fault_info, or only in vm_manager?
Good question. Having it both in the VM as well as the VM manager sounds
like the simplest option for now.
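
To illustrate the "both places" option, here is a minimal userspace sketch. It is not code from the patch or from amdgpu: every sketch_* type and function is a hypothetical stand-in, and the fields only mirror the addr, status and vmhub members used in the quoted hunk. The idea is that the fault path, where the VM is known to be valid, stores the fault by value both per VM and device-wide, and the coredump path copies only from the device-wide record, so it never dereferences a VM pointer that may be stale.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the real structures; all names are assumptions. */
struct sketch_fault_info {
	uint64_t addr;   /* faulting address */
	uint32_t status; /* protection fault status register value */
	uint32_t vmhub;  /* 0 = gfxhub, non-zero = mmhub */
};

struct sketch_vm {
	struct sketch_fault_info fault_info; /* per-VM copy */
};

struct sketch_device {
	struct sketch_fault_info last_fault; /* device-wide copy of the latest fault */
};

struct sketch_coredump {
	struct sketch_fault_info fault_info; /* snapshot taken at reset time */
};

/* Fault path: the VM is still valid here, so record the fault in both places. */
void sketch_record_fault(struct sketch_device *dev, struct sketch_vm *vm,
			 uint64_t addr, uint32_t status, uint32_t vmhub)
{
	struct sketch_fault_info info = {
		.addr = addr,
		.status = status,
		.vmhub = vmhub,
	};

	if (vm)
		vm->fault_info = info;
	dev->last_fault = info;
}

/* Reset path: copy by value from the device; no VM pointer is touched. */
void sketch_fill_coredump(struct sketch_coredump *cd,
			  const struct sketch_device *dev)
{
	cd->fault_info = dev->last_fault;
}
```

A real driver would also need whatever locking protects the device-wide record against concurrent faults; the sketch leaves that out.
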
Regards,
Christian.
>>
>> Regards,
>> Christian.
>>
>>> }
>>> coredump->adev = adev;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> index 60522963aaca..3197955264f9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> @@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
>>> struct timespec64 reset_time;
>>> bool reset_vram_lost;
>>> struct amdgpu_ring *ring;
>>> + struct amdgpu_vm_fault_info fault_info;
>>> };
>>> #endif
>>