* [PATCH v2 0/2] Add pagefault support for devcoredump
@ 2024-03-07 20:50 Sunil Khatri
2024-03-07 20:50 ` [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager Sunil Khatri
2024-03-07 20:50 ` [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump Sunil Khatri
0 siblings, 2 replies; 6+ messages in thread
From: Sunil Khatri @ 2024-03-07 20:50 UTC (permalink / raw)
To: Alex Deucher, Christian König, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
Arunpravin Paneer Selvam, Sunil Khatri
Add page fault information to the devcoredump, taken from the global
amdgpu_device object.

Sunil Khatri (2):
drm/amdgpu: add recent pagefault info in vm_manager
drm/amdgpu: add vm fault information to devcoredump
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 ++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 ++
3 files changed, 23 insertions(+), 1 deletion(-)
--
2.34.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager
2024-03-07 20:50 [PATCH v2 0/2] Add pagefault support for devcoredump Sunil Khatri
@ 2024-03-07 20:50 ` Sunil Khatri
2024-03-07 20:50 ` [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump Sunil Khatri
1 sibling, 0 replies; 6+ messages in thread
From: Sunil Khatri @ 2024-03-07 20:50 UTC (permalink / raw)
To: Alex Deucher, Christian König, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
Arunpravin Paneer Selvam, Sunil Khatri
Currently page fault information is stored per VM, and a VM
could be freed or stale during a reset. Also store the page
fault information in vm_manager, which is global to all VMs
and remains valid across resets.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 ++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 ++
2 files changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4299ce386322..81fb3465e197 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2924,6 +2924,14 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
if (vm && status) {
vm->fault_info.addr = addr;
vm->fault_info.status = status;
+ /*
+ * Update the fault information globally for later usage
+ * when vm could be stale or freed.
+ */
+ adev->vm_manager.fault_info.addr = addr;
+ adev->vm_manager.fault_info.vmhub = vmhub;
+ adev->vm_manager.fault_info.status = status;
+
if (AMDGPU_IS_GFXHUB(vmhub)) {
vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 047ec1930d12..8efa8422f4f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -422,6 +422,8 @@ struct amdgpu_vm_manager {
* look up VM of a page fault
*/
struct xarray pasids;
+ /* Global registration of recent page fault information */
+ struct amdgpu_vm_fault_info fault_info;
};
struct amdgpu_bo_va_mapping;
--
2.34.1
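
The lifetime argument behind this patch can be illustrated outside the kernel. The following is a minimal user-space sketch, not driver code: struct fault_info, struct vm, struct device, update_fault_cache() and fault_after_vm_teardown() are hypothetical stand-ins for the amdgpu structures, and the vmhub encoding done by the real per-VM path is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct fault_info {
	uint64_t addr;
	uint32_t status;
	uint32_t vmhub;
};

struct vm {
	struct fault_info fault_info;	/* lifetime tied to the VM */
};

struct device {
	struct fault_info global_fault;	/* models adev->vm_manager.fault_info */
};

/* Record a fault in the per-VM cache and mirror it into the device. */
static void update_fault_cache(struct device *dev, struct vm *vm,
			       uint64_t addr, uint32_t status, uint32_t vmhub)
{
	assert(dev != NULL);

	/* As in the patch, nothing is recorded without a VM and a status. */
	if (!vm || !status)
		return;

	vm->fault_info.addr = addr;
	vm->fault_info.status = status;
	vm->fault_info.vmhub = vmhub;

	dev->global_fault.addr = addr;
	dev->global_fault.status = status;
	dev->global_fault.vmhub = vmhub;
}

/* Simulate a fault followed by VM teardown; return the surviving copy. */
static struct fault_info fault_after_vm_teardown(uint64_t addr, uint32_t status)
{
	struct device dev = { { 0, 0, 0 } };
	struct vm *vm = calloc(1, sizeof(*vm));

	update_fault_cache(&dev, vm, addr, status, 0);
	free(vm);			/* the per-VM copy is gone from here on */

	return dev.global_fault;
}
```

Calling fault_after_vm_teardown(0x1000, 0x301031) returns the address and status that were recorded before the VM was freed, which is the property the devcoredump path relies on.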
* [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
2024-03-07 20:50 [PATCH v2 0/2] Add pagefault support for devcoredump Sunil Khatri
2024-03-07 20:50 ` [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager Sunil Khatri
@ 2024-03-07 20:50 ` Sunil Khatri
2024-03-08 9:09 ` Christian König
1 sibling, 1 reply; 6+ messages in thread
From: Sunil Khatri @ 2024-03-07 20:50 UTC (permalink / raw)
To: Alex Deucher, Christian König, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
Arunpravin Paneer Selvam, Sunil Khatri
Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..8794a3c21176 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
coredump->ring->name);
}
+ if (coredump->adev) {
+ struct amdgpu_vm_fault_info *fault_info =
+ &coredump->adev->vm_manager.fault_info;
+
+ drm_printf(&p, "\n[%s] Page fault observed\n",
+ fault_info->vmhub ? "mmhub" : "gfxhub");
+ drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
+ fault_info->addr);
+ drm_printf(&p, "Protection fault status register: 0x%x\n",
+ fault_info->status);
+ }
+
if (coredump->reset_vram_lost)
- drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+ drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
if (coredump->adev->reset_info.num_regs) {
drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
--
2.34.1
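
To check the format strings against the sample output in the commit message, the page fault section can be reproduced in user space. This is only a sketch under assumptions: snprintf() stands in for drm_printf(), and struct fault_info and format_fault_section() are hypothetical names that do not exist in the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct amdgpu_vm_fault_info. */
struct fault_info {
	uint64_t addr;
	uint32_t status;
	uint32_t vmhub;
};

/*
 * Format the page fault section into buf using the same format strings
 * as the patch. Returns the number of characters stored in buf, which is
 * smaller than the full section if buf is too short.
 */
static size_t format_fault_section(char *buf, size_t len,
				   const struct fault_info *fi)
{
	assert(buf != NULL && fi != NULL && len > 0);

	snprintf(buf, len,
		 "\n[%s] Page fault observed\n"
		 "Faulty page starting at address: 0x%016llx\n"
		 "Protection fault status register: 0x%x\n",
		 fi->vmhub ? "mmhub" : "gfxhub",
		 (unsigned long long)fi->addr,
		 (unsigned int)fi->status);

	/* snprintf() always terminates when len > 0, so strlen() is safe. */
	return strlen(buf);
}
```

With addr 0, status 0x301031 and vmhub 0 this yields the three lines shown in the sample output, preceded by the blank separator line.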
* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
2024-03-07 20:50 ` [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump Sunil Khatri
@ 2024-03-08 9:09 ` Christian König
2024-03-08 9:16 ` Khatri, Sunil
0 siblings, 1 reply; 6+ messages in thread
From: Christian König @ 2024-03-08 9:09 UTC (permalink / raw)
To: Sunil Khatri, Alex Deucher, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
Arunpravin Paneer Selvam
On 07.03.24 at 21:50, Sunil Khatri wrote:
> Add page fault information to the devcoredump.
>
> Output of devcoredump:
> **** AMDGPU Device Coredump ****
> version: 1
> kernel: 6.7.0-amd-staging-drm-next
> module: amdgpu
> time: 29.725011811
> process_name: soft_recovery_p PID: 1720
>
> Ring timed out details
> IP Type: 0 Ring Name: gfx_0.0.0
>
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x301031
>
> VRAM is lost due to GPU reset!
>
> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 147100c27c2d..8794a3c21176 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
> coredump->ring->name);
> }
>
> + if (coredump->adev) {
> + struct amdgpu_vm_fault_info *fault_info =
> + &coredump->adev->vm_manager.fault_info;
> +
> + drm_printf(&p, "\n[%s] Page fault observed\n",
> + fault_info->vmhub ? "mmhub" : "gfxhub");
> + drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
> + fault_info->addr);
> + drm_printf(&p, "Protection fault status register: 0x%x\n",
> + fault_info->status);
> + }
> +
> if (coredump->reset_vram_lost)
> - drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> + drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
Why this additional new line?
Apart from that looks really good to me.
Regards,
Christian.
> if (coredump->adev->reset_info.num_regs) {
> drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
>
* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
2024-03-08 9:09 ` Christian König
@ 2024-03-08 9:16 ` Khatri, Sunil
2024-03-08 10:11 ` Christian König
0 siblings, 1 reply; 6+ messages in thread
From: Khatri, Sunil @ 2024-03-08 9:16 UTC (permalink / raw)
To: Christian König, Sunil Khatri, Alex Deucher, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
Arunpravin Paneer Selvam
On 3/8/2024 2:39 PM, Christian König wrote:
> On 07.03.24 at 21:50, Sunil Khatri wrote:
>> Add page fault information to the devcoredump.
>>
>> Output of devcoredump:
>> **** AMDGPU Device Coredump ****
>> version: 1
>> kernel: 6.7.0-amd-staging-drm-next
>> module: amdgpu
>> time: 29.725011811
>> process_name: soft_recovery_p PID: 1720
>>
>> Ring timed out details
>> IP Type: 0 Ring Name: gfx_0.0.0
>>
>> [gfxhub] Page fault observed
>> Faulty page starting at address: 0x0000000000000000
>> Protection fault status register: 0x301031
>>
>> VRAM is lost due to GPU reset!
>>
>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>> 1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index 147100c27c2d..8794a3c21176 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t
>> offset, size_t count,
>> coredump->ring->name);
>> }
>> + if (coredump->adev) {
>> + struct amdgpu_vm_fault_info *fault_info =
>> + &coredump->adev->vm_manager.fault_info;
>> +
>> + drm_printf(&p, "\n[%s] Page fault observed\n",
>> + fault_info->vmhub ? "mmhub" : "gfxhub");
>> + drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>> + fault_info->addr);
>> + drm_printf(&p, "Protection fault status register: 0x%x\n",
>> + fault_info->status);
>> + }
>> +
>> if (coredump->reset_vram_lost)
>> - drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>> + drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>
> Why this additional new line?
The intent is for the devcoredump to have its sections clearly
demarcated by a blank line; otherwise "VRAM is lost due to GPU reset!"
looks like part of the page fault information.

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Regards
Sunil
>
> Apart from that looks really good to me.
>
> Regards,
> Christian.
>
>> if (coredump->adev->reset_info.num_regs) {
>> drm_printf(&p, "AMDGPU register dumps:\nOffset:
>> Value:\n");
>
* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
2024-03-08 9:16 ` Khatri, Sunil
@ 2024-03-08 10:11 ` Christian König
0 siblings, 0 replies; 6+ messages in thread
From: Christian König @ 2024-03-08 10:11 UTC (permalink / raw)
To: Khatri, Sunil, Christian König, Sunil Khatri, Alex Deucher,
Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
Arunpravin Paneer Selvam
On 08.03.24 at 10:16, Khatri, Sunil wrote:
>
> On 3/8/2024 2:39 PM, Christian König wrote:
>> On 07.03.24 at 21:50, Sunil Khatri wrote:
>>> Add page fault information to the devcoredump.
>>>
>>> Output of devcoredump:
>>> **** AMDGPU Device Coredump ****
>>> version: 1
>>> kernel: 6.7.0-amd-staging-drm-next
>>> module: amdgpu
>>> time: 29.725011811
>>> process_name: soft_recovery_p PID: 1720
>>>
>>> Ring timed out details
>>> IP Type: 0 Ring Name: gfx_0.0.0
>>>
>>> [gfxhub] Page fault observed
>>> Faulty page starting at address: 0x0000000000000000
>>> Protection fault status register: 0x301031
>>>
>>> VRAM is lost due to GPU reset!
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>>> 1 file changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> index 147100c27c2d..8794a3c21176 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t
>>> offset, size_t count,
>>> coredump->ring->name);
>>> }
>>> + if (coredump->adev) {
>>> + struct amdgpu_vm_fault_info *fault_info =
>>> + &coredump->adev->vm_manager.fault_info;
>>> +
>>> + drm_printf(&p, "\n[%s] Page fault observed\n",
>>> + fault_info->vmhub ? "mmhub" : "gfxhub");
>>> + drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>>> + fault_info->addr);
>>> + drm_printf(&p, "Protection fault status register: 0x%x\n",
>>> + fault_info->status);
>>> + }
>>> +
>>> if (coredump->reset_vram_lost)
>>> - drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>> + drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>>
>> Why this additional new line?
> The intent is for the devcoredump to have its sections clearly
> demarcated by a blank line; otherwise "VRAM is lost due to GPU reset!"
> looks like part of the page fault information.
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x301031
>
> VRAM is lost due to GPU reset!
In that case I would print the newline regardless of whether VRAM is
lost. Otherwise you get:

Protection fault status register:...

VRAM is lost due to GPU reset!
AMDGPU register dumps:

in one case and:

Protection fault status register:...
AMDGPU register dumps:

in the other case, which breaks this sectioning quite a bit.

Regards,
Christian.
>
> Regards
> Sunil
>
>>
>> Apart from that looks really good to me.
>>
>> Regards,
>> Christian.
>>
>>> if (coredump->adev->reset_info.num_regs) {
>>> drm_printf(&p, "AMDGPU register dumps:\nOffset:
>>> Value:\n");
>>