* [PATCH v2 0/2] Add pagefault support for devcoredump
@ 2024-03-07 20:50 Sunil Khatri
  2024-03-07 20:50 ` [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager Sunil Khatri
  2024-03-07 20:50 ` [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump Sunil Khatri
  0 siblings, 2 replies; 6+ messages in thread
From: Sunil Khatri @ 2024-03-07 20:50 UTC
To: Alex Deucher, Christian König, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi, Arunpravin Paneer Selvam, Sunil Khatri

Add support of devcoredump from global object of amdgpu_device

Sunil Khatri (2):
  drm/amdgpu: add recent pagefault info in vm_manager
  drm/amdgpu: add vm fault information to devcoredump

 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    |  8 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h    |  2 ++
 3 files changed, 23 insertions(+), 1 deletion(-)

-- 
2.34.1

* [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager
From: Sunil Khatri @ 2024-03-07 20:50 UTC
To: Alex Deucher, Christian König, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi, Arunpravin Paneer Selvam, Sunil Khatri

Currently, page fault information is stored per VM, which could be
freed or stale during reset. Add the page fault information to the
vm_manager, which is a global space for VMs and remains valid across
resets.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 4299ce386322..81fb3465e197 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2924,6 +2924,14 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev,
 	if (vm && status) {
 		vm->fault_info.addr = addr;
 		vm->fault_info.status = status;
+		/*
+		 * Update the fault information globally for later usage
+		 * when vm could be stale or freed.
+		 */
+		adev->vm_manager.fault_info.addr = addr;
+		adev->vm_manager.fault_info.vmhub = vmhub;
+		adev->vm_manager.fault_info.status = status;
+
 		if (AMDGPU_IS_GFXHUB(vmhub)) {
 			vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX;
 			vm->fault_info.vmhub |=
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 047ec1930d12..8efa8422f4f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -422,6 +422,8 @@ struct amdgpu_vm_manager {
 	 * look up VM of a page fault
 	 */
 	struct xarray pasids;
+	/* Global registration of recent page fault information */
+	struct amdgpu_vm_fault_info fault_info;
 };
 
 struct amdgpu_bo_va_mapping;
-- 
2.34.1

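The motivation given in patch 1/2 (a per-VM record can be freed before the dump is read, while a device-wide record stays valid) can be illustrated with a small userspace sketch. This is not driver code: `struct fault_info`, `struct vm`, `struct vm_manager`, `struct device`, `update_fault_cache()` and `fault_then_teardown()` are hypothetical, simplified stand-ins, and the hub-type encoding that the real per-VM field uses is deliberately left out.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct fault_info {
	uint64_t addr;
	uint32_t status;
	uint32_t vmhub;
};

struct vm {
	struct fault_info fault_info;	/* per-VM copy, freed with the VM */
};

struct vm_manager {
	struct fault_info fault_info;	/* global copy, lives with the device */
};

struct device {
	struct vm_manager vm_manager;
};

/* Record a fault in both places, following the shape of the patch. */
static void update_fault_cache(struct device *dev, struct vm *vm,
			       uint64_t addr, uint32_t status, uint32_t vmhub)
{
	if (vm && status) {
		vm->fault_info.addr = addr;
		vm->fault_info.status = status;
		vm->fault_info.vmhub = vmhub;

		/* Global copy, readable after the VM is gone. */
		dev->vm_manager.fault_info.addr = addr;
		dev->vm_manager.fault_info.status = status;
		dev->vm_manager.fault_info.vmhub = vmhub;
	}
}

/* Simulate a fault followed by VM teardown; return the surviving record. */
static struct fault_info fault_then_teardown(struct device *dev,
					     uint64_t addr, uint32_t status,
					     uint32_t vmhub)
{
	struct vm *vm = calloc(1, sizeof(*vm));

	update_fault_cache(dev, vm, addr, status, vmhub);
	free(vm);	/* the per-VM copy must not be touched after this */
	return dev->vm_manager.fault_info;
}
```

The trade-off in this sketch is that the global slot holds only the most recent fault, so a later fault from any VM overwrites it.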
* [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
From: Sunil Khatri @ 2024-03-07 20:50 UTC
To: Alex Deucher, Christian König, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi, Arunpravin Paneer Selvam, Sunil Khatri

Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..8794a3c21176 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->ring->name);
 	}
 
+	if (coredump->adev) {
+		struct amdgpu_vm_fault_info *fault_info =
+			&coredump->adev->vm_manager.fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub");
+		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register: 0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
 
-- 
2.34.1

* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
From: Christian König @ 2024-03-08 9:09 UTC
To: Sunil Khatri, Alex Deucher, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi, Arunpravin Paneer Selvam

On 07.03.24 at 21:50, Sunil Khatri wrote:
> Add page fault information to the devcoredump.
>
> Output of devcoredump:
> **** AMDGPU Device Coredump ****
> version: 1
> kernel: 6.7.0-amd-staging-drm-next
> module: amdgpu
> time: 29.725011811
> process_name: soft_recovery_p PID: 1720
>
> Ring timed out details
> IP Type: 0 Ring Name: gfx_0.0.0
>
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x301031
>
> VRAM is lost due to GPU reset!
>
> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 147100c27c2d..8794a3c21176 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>  			   coredump->ring->name);
>  	}
>
> +	if (coredump->adev) {
> +		struct amdgpu_vm_fault_info *fault_info =
> +			&coredump->adev->vm_manager.fault_info;
> +
> +		drm_printf(&p, "\n[%s] Page fault observed\n",
> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
> +		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
> +			   fault_info->addr);
> +		drm_printf(&p, "Protection fault status register: 0x%x\n",
> +			   fault_info->status);
> +	}
> +
>  	if (coredump->reset_vram_lost)
> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");

Why this additional new line?

Apart from that, it looks really good to me.

Regards,
Christian.

>  	if (coredump->adev->reset_info.num_regs) {
>  		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
>

* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
From: Khatri, Sunil @ 2024-03-08 9:16 UTC
To: Christian König, Sunil Khatri, Alex Deucher, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi, Arunpravin Paneer Selvam

On 3/8/2024 2:39 PM, Christian König wrote:
> On 07.03.24 at 21:50, Sunil Khatri wrote:
>> Add page fault information to the devcoredump.
>>
>> Output of devcoredump:
>> **** AMDGPU Device Coredump ****
>> version: 1
>> kernel: 6.7.0-amd-staging-drm-next
>> module: amdgpu
>> time: 29.725011811
>> process_name: soft_recovery_p PID: 1720
>>
>> Ring timed out details
>> IP Type: 0 Ring Name: gfx_0.0.0
>>
>> [gfxhub] Page fault observed
>> Faulty page starting at address: 0x0000000000000000
>> Protection fault status register: 0x301031
>>
>> VRAM is lost due to GPU reset!
>>
>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index 147100c27c2d..8794a3c21176 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t
>> offset, size_t count,
>>  			   coredump->ring->name);
>>  	}
>> +	if (coredump->adev) {
>> +		struct amdgpu_vm_fault_info *fault_info =
>> +			&coredump->adev->vm_manager.fault_info;
>> +
>> +		drm_printf(&p, "\n[%s] Page fault observed\n",
>> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
>> +		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>> +			   fault_info->addr);
>> +		drm_printf(&p, "Protection fault status register: 0x%x\n",
>> +			   fault_info->status);
>> +	}
>> +
>>  	if (coredump->reset_vram_lost)
>> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>
> Why this additional new line?

The intent is for the devcoredump to have its sections clearly
demarcated with a new line; otherwise "VRAM is lost due to GPU reset!"
seems to be part of the page fault information.

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Regards
Sunil

>
> Apart from that, it looks really good to me.
>
> Regards,
> Christian.
>
>>  	if (coredump->adev->reset_info.num_regs) {
>>  		drm_printf(&p, "AMDGPU register dumps:\nOffset:
>> Value:\n");
>

* Re: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump
From: Christian König @ 2024-03-08 10:11 UTC
To: Khatri, Sunil, Christian König, Sunil Khatri, Alex Deucher, Shashank Sharma
Cc: amd-gfx, dri-devel, linux-kernel, Mukul Joshi, Arunpravin Paneer Selvam

On 08.03.24 at 10:16, Khatri, Sunil wrote:
>
> On 3/8/2024 2:39 PM, Christian König wrote:
>> On 07.03.24 at 21:50, Sunil Khatri wrote:
>>> Add page fault information to the devcoredump.
>>>
>>> Output of devcoredump:
>>> **** AMDGPU Device Coredump ****
>>> version: 1
>>> kernel: 6.7.0-amd-staging-drm-next
>>> module: amdgpu
>>> time: 29.725011811
>>> process_name: soft_recovery_p PID: 1720
>>>
>>> Ring timed out details
>>> IP Type: 0 Ring Name: gfx_0.0.0
>>>
>>> [gfxhub] Page fault observed
>>> Faulty page starting at address: 0x0000000000000000
>>> Protection fault status register: 0x301031
>>>
>>> VRAM is lost due to GPU reset!
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> index 147100c27c2d..8794a3c21176 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t
>>> offset, size_t count,
>>>  			   coredump->ring->name);
>>>  	}
>>> +	if (coredump->adev) {
>>> +		struct amdgpu_vm_fault_info *fault_info =
>>> +			&coredump->adev->vm_manager.fault_info;
>>> +
>>> +		drm_printf(&p, "\n[%s] Page fault observed\n",
>>> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
>>> +		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>>> +			   fault_info->addr);
>>> +		drm_printf(&p, "Protection fault status register: 0x%x\n",
>>> +			   fault_info->status);
>>> +	}
>>> +
>>>  	if (coredump->reset_vram_lost)
>>> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>>
>> Why this additional new line?
> The intent is for the devcoredump to have its sections clearly
> demarcated with a new line; otherwise "VRAM is lost due to GPU reset!"
> seems to be part of the page fault information.
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x301031
>
> VRAM is lost due to GPU reset!

In that case I would print the newline regardless of whether VRAM is
lost or not. Otherwise you get:

Protection fault status register:...

VRAM is lost due to GPU reset!
AMDGPU register dumps:

in one case and:

Protection fault status register:...
AMDGPU register dumps:

in the other case, which breaks this sectioning quite a bit.

Regards,
Christian.

>
> Regards
> Sunil
>
>>
>> Apart from that, it looks really good to me.
>>
>> Regards,
>> Christian.
>>
>>>  	if (coredump->adev->reset_info.num_regs) {
>>>  		drm_printf(&p, "AMDGPU register dumps:\nOffset:
>>> Value:\n");
>>

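Christian König's suggestion in the last reply of the thread (emit the separating newline whether or not VRAM was lost) can be sketched in userspace C as follows. This is only one possible reading of that suggestion, not the code of any later patch revision. `struct dump_state`, `emit()` and `format_dump()` are hypothetical names, and plain `vsnprintf()` stands in for the kernel's `drm_printf()`.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical snapshot of what the dump reads; not a driver struct. */
struct dump_state {
	unsigned long long fault_addr;
	unsigned int fault_status;
	bool mmhub;
	bool vram_lost;
	bool have_regs;
};

/* Append formatted text at offset off; return the new offset. */
static size_t emit(char *buf, size_t size, size_t off, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (off >= size)
		return off;
	va_start(ap, fmt);
	n = vsnprintf(buf + off, size - off, fmt, ap);
	va_end(ap);
	if (n < 0)
		return off;
	/* On truncation, park the offset on the terminating NUL. */
	return off + (size_t)n < size ? off + (size_t)n : size - 1;
}

static size_t format_dump(const struct dump_state *s, char *buf, size_t size)
{
	size_t off = 0;

	off = emit(buf, size, off, "\n[%s] Page fault observed\n",
		   s->mmhub ? "mmhub" : "gfxhub");
	off = emit(buf, size, off,
		   "Faulty page starting at address: 0x%016llx\n",
		   s->fault_addr);
	off = emit(buf, size, off,
		   "Protection fault status register: 0x%x\n",
		   s->fault_status);

	/* Separator is unconditional, so the fault section always ends
	 * with a blank line, whatever follows it. */
	off = emit(buf, size, off, "\n");
	if (s->vram_lost)
		off = emit(buf, size, off, "VRAM is lost due to GPU reset!\n");
	if (s->have_regs)
		off = emit(buf, size, off, "AMDGPU register dumps:\n");
	return off;
}
```

With the separator outside the `vram_lost` branch, the page fault section is followed by a blank line in both cases, so the two outputs Christian contrasts no longer differ in their sectioning.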