public inbox for linux-kernel@vger.kernel.org
From: "Christian König" <christian.koenig@amd.com>
To: Sunil Khatri <sunil.khatri@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Shashank Sharma <shashank.sharma@amd.com>
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, Mukul Joshi <mukul.joshi@amd.com>,
	Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Subject: Re: [PATCH] drm/amdgpu: add vm fault information to devcoredump
Date: Thu, 7 Mar 2024 09:17:05 +0100
Message-ID: <f61edcbe-938f-4c48-920e-64c8352e87f4@amd.com>
In-Reply-To: <20240306181937.3551648-2-sunil.khatri@amd.com>

On 06.03.24 at 19:19, Sunil Khatri wrote:
> Add page fault information to the devcoredump.
>
> Output of devcoredump:
> **** AMDGPU Device Coredump ****
> version: 1
> kernel: 6.7.0-amd-staging-drm-next
> module: amdgpu
> time: 29.725011811
> process_name: soft_recovery_p PID: 1720
>
> Ring timed out details
> IP Type: 0 Ring Name: gfx_0.0.0
>
> [gfxhub] Page fault observed for GPU family:143
> Faulty page starting at address 0x0000000000000000
> Protection fault status register:0x301031
>
> VRAM is lost due to GPU reset!
>
> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 ++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
>   2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 147100c27c2d..d7fea6cdf2f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>   			   coredump->ring->name);
>   	}
>   
> +	if (coredump->fault_info.status) {
> +		struct amdgpu_vm_fault_info *fault_info = &coredump->fault_info;
> +
> +		drm_printf(&p, "\n[%s] Page fault observed for GPU family:%d\n",
> +			   fault_info->vmhub ? "mmhub" : "gfxhub",
> +			   coredump->adev->family);
> +		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
> +			   fault_info->addr);
> +		drm_printf(&p, "Protection fault status register:0x%x\n",
> +			   fault_info->status);
> +	}
> +
>   	if (coredump->reset_vram_lost)
> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>   	if (coredump->adev->reset_info.num_regs) {
>   		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
>   
> @@ -253,6 +265,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>   	if (job) {
>   		s_job = &job->base;
>   		coredump->ring = to_amdgpu_ring(s_job->sched);
> +		coredump->fault_info = job->vm->fault_info;

That's illegal. The VM pointer might already be stale at this point, 
because the VM can be torn down before the reset handler runs.

I think you need to add the fault info of the last fault globally in the 
VRAM manager or move this to the process info Shashank is working on.
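
To illustrate the first option, here is a minimal userspace sketch, not the actual amdgpu code. All names in it (struct device, record_fault, take_coredump, demo) are hypothetical stand-ins; only the addr/status/vmhub fields mirror the patch. The idea is to copy the fault info by value into device-lifetime storage when the fault is handled, so the coredump path never dereferences a per-process VM pointer. A real driver would also need locking around the shared copy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct amdgpu_vm_fault_info. */
struct fault_info {
	uint64_t addr;
	uint32_t status;
	uint32_t vmhub;
};

/* Per-process VM: may be freed before the reset handler runs. */
struct vm {
	struct fault_info fault_info;
};

/* Device-lifetime state: assumed to outlive every VM. */
struct device {
	struct fault_info last_fault;
};

struct coredump {
	struct fault_info fault_info;
};

/* Fault-handler side: the VM is still valid here, so copy by value now. */
static void record_fault(struct device *dev, struct vm *vm,
			 uint64_t addr, uint32_t status, uint32_t vmhub)
{
	vm->fault_info = (struct fault_info){
		.addr = addr, .status = status, .vmhub = vmhub,
	};
	dev->last_fault = vm->fault_info;
}

/* Coredump side: reads only device-lifetime data, never a VM pointer. */
static void take_coredump(const struct device *dev, struct coredump *cd)
{
	cd->fault_info = dev->last_fault;
}

/* Usage sketch: fault, then VM teardown, then coredump. */
static struct coredump demo(void)
{
	struct device dev = {0};
	struct coredump cd = {0};
	struct vm *vm = malloc(sizeof(*vm));

	if (!vm)
		return cd;

	record_fault(&dev, vm, 0x1000, 0x301031, 0);
	assert(dev.last_fault.status == 0x301031);

	free(vm);			/* a saved vm pointer would dangle from here on */
	take_coredump(&dev, &cd);	/* still safe after the VM is gone */
	return cd;
}
```

The values passed to record_fault are made up for the example. The point is only the ordering: the copy happens while the VM is known to be alive, and the later reader touches nothing that the process can free.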

Regards,
Christian.

>   	}
>   
>   	coredump->adev = adev;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 60522963aaca..3197955264f9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -98,6 +98,7 @@ struct amdgpu_coredump_info {
>   	struct timespec64               reset_time;
>   	bool                            reset_vram_lost;
>   	struct amdgpu_ring			*ring;
> +	struct amdgpu_vm_fault_info	fault_info;
>   };
>   #endif
>   

