public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/2] Add pagefault support for devcoredump
@ 2024-03-07 20:50 Sunil Khatri
  2024-03-07 20:50 ` [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager Sunil Khatri
  2024-03-07 20:50 ` [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump Sunil Khatri
  0 siblings, 2 replies; 6+ messages in thread
From: Sunil Khatri @ 2024-03-07 20:50 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Shashank Sharma
  Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
	Arunpravin Paneer Selvam, Sunil Khatri

Add page fault information to the devcoredump, taken from the global
vm_manager object in amdgpu_device.

Sunil Khatri (2):
  drm/amdgpu: add recent pagefault info in vm_manager
  drm/amdgpu: add vm fault information to devcoredump

 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    |  8 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h    |  2 ++
 3 files changed, 23 insertions(+), 1 deletion(-)

-- 
2.34.1



* [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager
  2024-03-07 20:50 [PATCH v2 0/2] Add pagefault support for devcoredump Sunil Khatri
@ 2024-03-07 20:50 ` Sunil Khatri
  2024-03-07 20:50 ` [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump Sunil Khatri
  1 sibling, 0 replies; 6+ messages in thread
From: Sunil Khatri @ 2024-03-07 20:50 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Shashank Sharma
  Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
	Arunpravin Paneer Selvam, Sunil Khatri

Currently page fault information is stored per
VM, which could be freed or stale during a
reset. Also store the page fault information in
the vm_manager, which is global to all VMs and
remains valid across resets.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4299ce386322..81fb3465e197 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2924,6 +2924,14 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
 	if (vm && status) {
 		vm->fault_info.addr = addr;
 		vm->fault_info.status = status;
+		/*
+		 * Update the fault information globally for later usage
+		 * when vm could be stale or freed.
+		 */
+		adev->vm_manager.fault_info.addr = addr;
+		adev->vm_manager.fault_info.vmhub = vmhub;
+		adev->vm_manager.fault_info.status = status;
+
 		if (AMDGPU_IS_GFXHUB(vmhub)) {
 			vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
 			vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 047ec1930d12..8efa8422f4f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -422,6 +422,8 @@ struct amdgpu_vm_manager {
 	 * look up VM of a page fault
 	 */
 	struct xarray				pasids;
+	/* Global registration of recent page fault information */
+	struct amdgpu_vm_fault_info	fault_info;
 };
 
 struct amdgpu_bo_va_mapping;
-- 
2.34.1
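
The caching scheme in patch 1/2 can be illustrated with a small
userspace sketch. It is not the amdgpu code: the sketch_* types and
function below are hypothetical stand-ins, and the vmhub encoding that
the driver applies to the per-VM copy is deliberately left out. The
point is only that the manager-level copy stays readable after the
per-VM object is gone.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fault record kept by the driver. */
struct sketch_fault_info {
	uint64_t addr;
	uint32_t status;
	uint32_t vmhub;
};

/* Per-VM state: may be freed or stale by the time a dump is read. */
struct sketch_vm {
	struct sketch_fault_info fault_info;
};

/* Manager-level state: lives as long as the device does. */
struct sketch_vm_manager {
	struct sketch_fault_info fault_info;
};

/*
 * Record a fault in both places. As in the patch, nothing is updated
 * unless a VM was found and the status register is non-zero.
 */
void sketch_update_fault_cache(struct sketch_vm_manager *mgr,
			       struct sketch_vm *vm, uint64_t addr,
			       uint32_t status, uint32_t vmhub)
{
	if (!vm || !status)
		return;

	vm->fault_info.addr = addr;
	vm->fault_info.status = status;
	vm->fault_info.vmhub = vmhub;

	/* Global copy for later use when the VM could be stale or freed. */
	mgr->fault_info.addr = addr;
	mgr->fault_info.status = status;
	mgr->fault_info.vmhub = vmhub;
}
```

Freeing the VM after a call to sketch_update_fault_cache() leaves
mgr->fault_info intact, which is the property the devcoredump reader in
patch 2/2 relies on.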



* [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
  2024-03-07 20:50 [PATCH v2 0/2] Add pagefault support for devcoredump Sunil Khatri
  2024-03-07 20:50 ` [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager Sunil Khatri
@ 2024-03-07 20:50 ` Sunil Khatri
  2024-03-08  9:09   ` Christian König
  1 sibling, 1 reply; 6+ messages in thread
From: Sunil Khatri @ 2024-03-07 20:50 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Shashank Sharma
  Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
	Arunpravin Paneer Selvam, Sunil Khatri

Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..8794a3c21176 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->ring->name);
 	}
 
+	if (coredump->adev) {
+		struct amdgpu_vm_fault_info *fault_info =
+			&coredump->adev->vm_manager.fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub");
+		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register: 0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
 
-- 
2.34.1



* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
  2024-03-07 20:50 ` [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump Sunil Khatri
@ 2024-03-08  9:09   ` Christian König
  2024-03-08  9:16     ` Khatri, Sunil
  0 siblings, 1 reply; 6+ messages in thread
From: Christian König @ 2024-03-08  9:09 UTC (permalink / raw)
  To: Sunil Khatri, Alex Deucher, Shashank Sharma
  Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
	Arunpravin Paneer Selvam

On 07.03.24 at 21:50, Sunil Khatri wrote:
> Add page fault information to the devcoredump.
>
> Output of devcoredump:
> **** AMDGPU Device Coredump ****
> version: 1
> kernel: 6.7.0-amd-staging-drm-next
> module: amdgpu
> time: 29.725011811
> process_name: soft_recovery_p PID: 1720
>
> Ring timed out details
> IP Type: 0 Ring Name: gfx_0.0.0
>
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x301031
>
> VRAM is lost due to GPU reset!
>
> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 147100c27c2d..8794a3c21176 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>   			   coredump->ring->name);
>   	}
>   
> +	if (coredump->adev) {
> +		struct amdgpu_vm_fault_info *fault_info =
> +			&coredump->adev->vm_manager.fault_info;
> +
> +		drm_printf(&p, "\n[%s] Page fault observed\n",
> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
> +		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
> +			   fault_info->addr);
> +		drm_printf(&p, "Protection fault status register: 0x%x\n",
> +			   fault_info->status);
> +	}
> +
>   	if (coredump->reset_vram_lost)
> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");

Why this additional new line?

Apart from that looks really good to me.

Regards,
Christian.

>   	if (coredump->adev->reset_info.num_regs) {
>   		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
>   



* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
  2024-03-08  9:09   ` Christian König
@ 2024-03-08  9:16     ` Khatri, Sunil
  2024-03-08 10:11       ` Christian König
  0 siblings, 1 reply; 6+ messages in thread
From: Khatri, Sunil @ 2024-03-08  9:16 UTC (permalink / raw)
  To: Christian König, Sunil Khatri, Alex Deucher, Shashank Sharma
  Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
	Arunpravin Paneer Selvam


On 3/8/2024 2:39 PM, Christian König wrote:
> On 07.03.24 at 21:50, Sunil Khatri wrote:
>> Add page fault information to the devcoredump.
>>
>> Output of devcoredump:
>> **** AMDGPU Device Coredump ****
>> version: 1
>> kernel: 6.7.0-amd-staging-drm-next
>> module: amdgpu
>> time: 29.725011811
>> process_name: soft_recovery_p PID: 1720
>>
>> Ring timed out details
>> IP Type: 0 Ring Name: gfx_0.0.0
>>
>> [gfxhub] Page fault observed
>> Faulty page starting at address: 0x0000000000000000
>> Protection fault status register: 0x301031
>>
>> VRAM is lost due to GPU reset!
>>
>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index 147100c27c2d..8794a3c21176 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
>> offset, size_t count,
>>                  coredump->ring->name);
>>       }
>>   +    if (coredump->adev) {
>> +        struct amdgpu_vm_fault_info *fault_info =
>> +            &coredump->adev->vm_manager.fault_info;
>> +
>> +        drm_printf(&p, "\n[%s] Page fault observed\n",
>> +               fault_info->vmhub ? "mmhub" : "gfxhub");
>> +        drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>> +               fault_info->addr);
>> +        drm_printf(&p, "Protection fault status register: 0x%x\n",
>> +               fault_info->status);
>> +    }
>> +
>>       if (coredump->reset_vram_lost)
>> -        drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>> +        drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>
> Why this additional new line?
The intent is for the devcoredump to have its sections clearly demarcated
with a new line; otherwise "VRAM is lost due to GPU reset!" seems to be
part of the page fault information.
[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Regards
Sunil

>
> Apart from that looks really good to me.
>
> Regards,
> Christian.
>
>>       if (coredump->adev->reset_info.num_regs) {
>>           drm_printf(&p, "AMDGPU register dumps:\nOffset:     
>> Value:\n");
>


* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
  2024-03-08  9:16     ` Khatri, Sunil
@ 2024-03-08 10:11       ` Christian König
  0 siblings, 0 replies; 6+ messages in thread
From: Christian König @ 2024-03-08 10:11 UTC (permalink / raw)
  To: Khatri, Sunil, Christian König, Sunil Khatri, Alex Deucher,
	Shashank Sharma
  Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi,
	Arunpravin Paneer Selvam

On 08.03.24 at 10:16, Khatri, Sunil wrote:
>
> On 3/8/2024 2:39 PM, Christian König wrote:
>> On 07.03.24 at 21:50, Sunil Khatri wrote:
>>> Add page fault information to the devcoredump.
>>>
>>> Output of devcoredump:
>>> **** AMDGPU Device Coredump ****
>>> version: 1
>>> kernel: 6.7.0-amd-staging-drm-next
>>> module: amdgpu
>>> time: 29.725011811
>>> process_name: soft_recovery_p PID: 1720
>>>
>>> Ring timed out details
>>> IP Type: 0 Ring Name: gfx_0.0.0
>>>
>>> [gfxhub] Page fault observed
>>> Faulty page starting at address: 0x0000000000000000
>>> Protection fault status register: 0x301031
>>>
>>> VRAM is lost due to GPU reset!
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> index 147100c27c2d..8794a3c21176 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
>>> offset, size_t count,
>>>                  coredump->ring->name);
>>>       }
>>>   +    if (coredump->adev) {
>>> +        struct amdgpu_vm_fault_info *fault_info =
>>> +            &coredump->adev->vm_manager.fault_info;
>>> +
>>> +        drm_printf(&p, "\n[%s] Page fault observed\n",
>>> +               fault_info->vmhub ? "mmhub" : "gfxhub");
>>> +        drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>>> +               fault_info->addr);
>>> +        drm_printf(&p, "Protection fault status register: 0x%x\n",
>>> +               fault_info->status);
>>> +    }
>>> +
>>>       if (coredump->reset_vram_lost)
>>> -        drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>> +        drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>>
>> Why this additional new line?
> The intent is for the devcoredump to have its sections clearly
> demarcated with a new line; otherwise "VRAM is lost due to GPU reset!"
> seems to be part of the page fault information.
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x301031
>
> VRAM is lost due to GPU reset!

In that case I would print the newline regardless of whether VRAM is
lost or not. Otherwise you get:

Protection fault status register:...

VRAM is lost due to GPU reset!
AMDGPU register dumps:

in one case and:

Protection fault status register:...
AMDGPU register dumps:

in the other case, which breaks this sectioning quite a bit.
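
The suggestion above could be sketched roughly as follows. This is a
userspace stand-in that formats into a buffer with vsnprintf() rather
than drm_printf(); struct sketch_dump, sketch_emit() and
sketch_format_dump() are hypothetical names, not taken from amdgpu. The
separator after the page fault section is printed unconditionally, so
the section boundary looks the same whether or not VRAM was lost.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the state the dump reader consults. */
struct sketch_dump {
	unsigned long long fault_addr;
	unsigned int fault_status;
	int vram_lost;
	int num_regs;
};

/* Append formatted text at offset off; returns the new offset. */
static size_t sketch_emit(char *buf, size_t size, size_t off,
			  const char *fmt, ...)
{
	va_list ap;
	int n;

	if (off >= size)
		return off;
	va_start(ap, fmt);
	n = vsnprintf(buf + off, size - off, fmt, ap);
	va_end(ap);
	if (n < 0)
		return off;
	if ((size_t)n >= size - off)
		return size - 1;	/* output was truncated */
	return off + (size_t)n;
}

size_t sketch_format_dump(char *buf, size_t size,
			  const struct sketch_dump *d)
{
	size_t off = 0;

	if (size)
		buf[0] = '\0';

	off = sketch_emit(buf, size, off,
			  "Faulty page starting at address: 0x%016llx\n",
			  d->fault_addr);
	off = sketch_emit(buf, size, off,
			  "Protection fault status register: 0x%x\n",
			  d->fault_status);

	/* Section separator, independent of the VRAM state. */
	off = sketch_emit(buf, size, off, "\n");

	if (d->vram_lost)
		off = sketch_emit(buf, size, off,
				  "VRAM is lost due to GPU reset!\n");
	if (d->num_regs)
		off = sketch_emit(buf, size, off,
				  "AMDGPU register dumps:\n");
	return off;
}
```

With this ordering, whichever section comes next is separated from the
fault section by one blank line in both cases.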

Regards,
Christian.

>
> Regards
> Sunil
>
>>
>> Apart from that looks really good to me.
>>
>> Regards,
>> Christian.
>>
>>>       if (coredump->adev->reset_info.num_regs) {
>>>           drm_printf(&p, "AMDGPU register dumps:\nOffset:     
>>> Value:\n");
>>


