From: Raag Jadav <raag.jadav@intel.com>
To: "André Almeida" <andrealmeid@igalia.com>
Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
kernel-dev@igalia.com, amd-gfx@lists.freedesktop.org,
intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
"Alex Deucher" <alexander.deucher@amd.com>,
"'Christian König'" <christian.koenig@amd.com>,
siqueira@igalia.com, airlied@gmail.com, simona@ffwll.ch,
rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
lucas.demarchi@intel.com
Subject: Re: [PATCH 1/2] drm: Create an app info option for wedge events
Date: Fri, 28 Feb 2025 16:20:05 +0200 [thread overview]
Message-ID: <Z8HGFRGOYvyCCWWu@black.fi.intel.com> (raw)
In-Reply-To: <20250228121353.1442591-2-andrealmeid@igalia.com>
Cc: Lucas
On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> When a device get wedged, it might be caused by a guilty application.
> For userspace, knowing which app was the cause can be useful for some
> situations, like for implementing a policy, logs or for giving a chance
> for the compositor to let the user know what app caused the problem.
> This is an optional argument, when `PID=-1` there's no information about
> the app caused the problem, or if any app was involved during the hang.
>
> Sometimes just the PID isn't enough giving that the app might be already
> dead by the time userspace will try to check what was this PID's name,
> so to make the life easier also notify what's the app's name in the user
> event.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> drivers/gpu/drm/drm_drv.c | 16 +++++++++++++---
> drivers/gpu/drm/i915/gt/intel_reset.c | 3 ++-
> drivers/gpu/drm/xe/xe_device.c | 3 ++-
> include/drm/drm_device.h | 8 ++++++++
> include/drm/drm_drv.h | 3 ++-
> 7 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 24ba52d76045..00b9b87dafd8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> atomic_set(&adev->reset_domain->reset_res, r);
>
> if (!r)
> - drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>
> return r;
> }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index ef1b77f1e88f..3ed9cbcab1ad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> amdgpu_fence_driver_force_completion(ring);
> if (amdgpu_ring_sched_ready(ring))
> drm_sched_start(&ring->sched, 0);
> - drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> goto exit;
> }
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 17fc5dc708f4..48faafd82a99 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> * drm_dev_wedged_event - generate a device wedged uevent
> * @dev: DRM device
> * @method: method(s) to be used for recovery
> + * @info: optional information about the guilty app
> *
> * This generates a device wedged uevent for the DRM device specified by @dev.
> * Recovery @method\(s) of choice will be sent in the uevent environment as
> @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> *
> * Returns: 0 on success, negative error code otherwise.
> */
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> + struct drm_wedge_app_info *info)
> {
> const char *recovery = NULL;
> unsigned int len, opt;
> /* Event string length up to 28+ characters with available methods */
> - char event_string[32];
> - char *envp[] = { event_string, NULL };
> + char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN];
> + char *envp[] = { event_string, pid_string, comm_string, NULL };
>
> len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>
> @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> "but recovered through reset" : "needs recovery");
>
> + if (info) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> + } else {
> + snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> + snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> + }
This is not much use for wedge cases that needs recovery, since at that point
the userspace will need to clean house anyway.
Which leaves us with only 'none' case and perhaps the need for standardization
of "optional telemetry collection".
Thoughts?
Raag
next prev parent reply other threads:[~2025-02-28 14:20 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-28 12:13 [PATCH 0/2] drm: Create an app info option for wedge events André Almeida
2025-02-28 12:13 ` [PATCH 1/2] " André Almeida
2025-02-28 14:20 ` Raag Jadav [this message]
2025-02-28 21:54 ` André Almeida
2025-03-01 5:53 ` Raag Jadav
2025-03-10 21:27 ` André Almeida
2025-03-11 17:09 ` Raag Jadav
2025-03-12 10:06 ` Raag Jadav
2025-03-12 21:59 ` André Almeida
2025-03-13 11:07 ` Raag Jadav
2025-02-28 12:13 ` [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
2025-02-28 14:58 ` Raag Jadav
2025-02-28 21:49 ` André Almeida
2025-03-01 6:04 ` Raag Jadav
2025-03-10 21:53 ` André Almeida
2025-03-10 22:03 ` Alex Deucher
2025-03-11 17:13 ` Raag Jadav
2025-03-12 8:25 ` Christian König
2025-03-12 9:13 ` Raag Jadav
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z8HGFRGOYvyCCWWu@black.fi.intel.com \
--to=raag.jadav@intel.com \
--cc=airlied@gmail.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=andrealmeid@igalia.com \
--cc=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=intel-xe@lists.freedesktop.org \
--cc=jani.nikula@linux.intel.com \
--cc=kernel-dev@igalia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lucas.demarchi@intel.com \
--cc=rodrigo.vivi@intel.com \
--cc=simona@ffwll.ch \
--cc=siqueira@igalia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox