* [PATCH v2 0/3] drm: Create an app info option for wedge events
@ 2025-05-11 22:47 André Almeida
2025-05-11 22:47 ` [PATCH v2 1/3] " André Almeida
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: André Almeida @ 2025-05-11 22:47 UTC (permalink / raw)
To: Alex Deucher, Christian König, siqueira, airlied, simona,
Raag Jadav, rodrigo.vivi, jani.nikula, Xaver Hugl,
Pierre-Loup A . Griffais
Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
André Almeida
This patchset implements a request made by Xaver Hugl about wedge events:
"I'd really like to have the PID of the client that triggered the GPU
reset, so that we can kill it if multiple resets are triggered in a
row (or switch to software rendering if it's KWin itself) and show a
user-friendly notification about why their app(s) crashed, but that
can be added later."
From https://lore.kernel.org/dri-devel/CAFZQkGwJ4qgHV8WTp2=svJ_VXhb-+Y8_VNtKB=jLsk6DqMYp9w@mail.gmail.com/
For testing, I've used amdgpu's debug_mask options debug_disable_soft_recovery
and debug_disable_gpu_ring_reset to test both wedge event paths in the driver.
To trigger a ring timeout, I've used this app:
https://gitlab.freedesktop.org/andrealmeid/gpu-timeout
Thanks!
Changelog:
v2:
- Rebased on top of drm/drm-next
- Added new patch for documentation
André Almeida (3):
drm: Create an app info option for wedge events
drm/doc: Add a section about "App information" for the wedge API
drm/amdgpu: Make use of drm_wedge_app_info
Documentation/gpu/drm-uapi.rst | 15 +++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 +++++-
drivers/gpu/drm/drm_drv.c | 16 +++++++++++++---
drivers/gpu/drm/i915/gt/intel_reset.c | 3 ++-
drivers/gpu/drm/xe/xe_device.c | 3 ++-
include/drm/drm_device.h | 8 ++++++++
include/drm/drm_drv.h | 3 ++-
8 files changed, 64 insertions(+), 9 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v2 1/3] drm: Create an app info option for wedge events
2025-05-11 22:47 [PATCH v2 0/3] drm: Create an app info option for wedge events André Almeida
@ 2025-05-11 22:47 ` André Almeida
2025-05-12 6:08 ` Krzysztof Karas
2025-05-12 15:34 ` Rodrigo Vivi
2025-05-11 22:47 ` [PATCH v2 2/3] drm/doc: Add a section about "App information" for the wedge API André Almeida
2025-05-11 22:47 ` [PATCH v2 3/3] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
2 siblings, 2 replies; 9+ messages in thread
From: André Almeida @ 2025-05-11 22:47 UTC (permalink / raw)
To: Alex Deucher, Christian König, siqueira, airlied, simona,
Raag Jadav, rodrigo.vivi, jani.nikula, Xaver Hugl,
Pierre-Loup A . Griffais
Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
André Almeida
When a device get wedged, it might be caused by a guilty application.
For userspace, knowing which app was the cause can be useful for some
situations, like for implementing a policy, logs or for giving a chance
for the compositor to let the user know what app caused the problem.
This is an optional argument, when `PID=-1` there's no information about
the app caused the problem, or if any app was involved during the hang.
Sometimes just the PID isn't enough giving that the app might be already
dead by the time userspace will try to check what was this PID's name,
so to make the life easier also notify what's the app's name in the user
event.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
drivers/gpu/drm/drm_drv.c | 16 +++++++++++++---
drivers/gpu/drm/i915/gt/intel_reset.c | 3 ++-
drivers/gpu/drm/xe/xe_device.c | 3 ++-
include/drm/drm_device.h | 8 ++++++++
include/drm/drm_drv.h | 3 ++-
7 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7f354cd532dc..c8a51418d0e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6216,7 +6216,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
atomic_set(&adev->reset_domain->reset_res, r);
if (!r)
- drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+ drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
return r;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index acb21fc8b3ce..a47b2eb301e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -166,7 +166,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
if (amdgpu_ring_sched_ready(ring))
drm_sched_start(&ring->sched, 0);
dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
- drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+ drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
goto exit;
}
dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 3dc7acd56b1d..1816ef4251e7 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -542,6 +542,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
* drm_dev_wedged_event - generate a device wedged uevent
* @dev: DRM device
* @method: method(s) to be used for recovery
+ * @info: optional information about the guilty app
*
* This generates a device wedged uevent for the DRM device specified by @dev.
* Recovery @method\(s) of choice will be sent in the uevent environment as
@@ -554,13 +555,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
*
* Returns: 0 on success, negative error code otherwise.
*/
-int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
+ struct drm_wedge_app_info *info)
{
const char *recovery = NULL;
unsigned int len, opt;
/* Event string length up to 28+ characters with available methods */
- char event_string[32];
- char *envp[] = { event_string, NULL };
+ char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN];
+ char *envp[] = { event_string, pid_string, comm_string, NULL };
len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
@@ -582,6 +584,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
"but recovered through reset" : "needs recovery");
+ if (info) {
+ snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
+ snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
+ } else {
+ snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
+ snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
+ }
+
return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
}
EXPORT_SYMBOL(drm_dev_wedged_event);
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index dbdcfe130ad4..ba1d8fdc3c7b 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1448,7 +1448,8 @@ static void intel_gt_reset_global(struct intel_gt *gt,
kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
else
drm_dev_wedged_event(>->i915->drm,
- DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
+ DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
+ NULL);
}
/**
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index c02c4c4e9412..f329613e061f 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1168,7 +1168,8 @@ void xe_device_declare_wedged(struct xe_device *xe)
/* Notify userspace of wedged device */
drm_dev_wedged_event(&xe->drm,
- DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
+ DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
+ NULL);
}
for_each_gt(gt, xe, id)
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index e2f894f1b90a..b87401d5079e 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -30,6 +30,14 @@ struct pci_controller;
#define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver */
#define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus device + bind */
+/**
+ * struct drm_wedge_app_info - information about the guilty app of a wedge dev
+ */
+struct drm_wedge_app_info {
+ pid_t pid;
+ char *comm;
+};
+
/**
* enum switch_power_state - power state of drm device
*/
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index a43d707b5f36..8fc6412a6345 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -482,7 +482,8 @@ void drm_put_dev(struct drm_device *dev);
bool drm_dev_enter(struct drm_device *dev, int *idx);
void drm_dev_exit(int idx);
void drm_dev_unplug(struct drm_device *dev);
-int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
+ struct drm_wedge_app_info *info);
/**
* drm_dev_is_unplugged - is a DRM device unplugged
--
2.49.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v2 2/3] drm/doc: Add a section about "App information" for the wedge API
2025-05-11 22:47 [PATCH v2 0/3] drm: Create an app info option for wedge events André Almeida
2025-05-11 22:47 ` [PATCH v2 1/3] " André Almeida
@ 2025-05-11 22:47 ` André Almeida
2025-05-12 18:37 ` Raag Jadav
2025-05-11 22:47 ` [PATCH v2 3/3] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
2 siblings, 1 reply; 9+ messages in thread
From: André Almeida @ 2025-05-11 22:47 UTC (permalink / raw)
To: Alex Deucher, Christian König, siqueira, airlied, simona,
Raag Jadav, rodrigo.vivi, jani.nikula, Xaver Hugl,
Pierre-Loup A . Griffais
Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
André Almeida
Add a section about "App information" for the wedge API.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
Documentation/gpu/drm-uapi.rst | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 69f72e71a96e..826abe265a24 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -446,6 +446,21 @@ telemetry information (devcoredump, syslog). This is useful because the first
hang is usually the most critical one which can result in consequential hangs or
complete wedging.
+App information
+---------------
+
+The information about which application (if any) caused the device to get in the
+wedge state is useful for userspace if they want to notify the user about what
+happened (e.g. the compositor display a message to the user "The <app name>
+caused a graphical error and the system recovered") or to implement policies
+(e.g. the daemon may "ban" an app that keeps resetting the device). If the app
+information is not available, the uevent will display as ``PID=-1`` and
+``APP=none``. Otherwise, ``PID`` and ``APP`` will advertise about the guilty
+app.
+
+The reliability of this information is driver and hardware specific, and should
+be taken with a caution regarding it's precision.
+
Consumer prerequisites
----------------------
--
2.49.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v2 3/3] drm/amdgpu: Make use of drm_wedge_app_info
2025-05-11 22:47 [PATCH v2 0/3] drm: Create an app info option for wedge events André Almeida
2025-05-11 22:47 ` [PATCH v2 1/3] " André Almeida
2025-05-11 22:47 ` [PATCH v2 2/3] drm/doc: Add a section about "App information" for the wedge API André Almeida
@ 2025-05-11 22:47 ` André Almeida
2 siblings, 0 replies; 9+ messages in thread
From: André Almeida @ 2025-05-11 22:47 UTC (permalink / raw)
To: Alex Deucher, Christian König, siqueira, airlied, simona,
Raag Jadav, rodrigo.vivi, jani.nikula, Xaver Hugl,
Pierre-Loup A . Griffais
Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
André Almeida
To notify userspace about which app (if any) made the device get in a
wedge state, make use of drm_wedge_app_info parameter, filling it with
the app PID and name.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 +++++-
2 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c8a51418d0e7..e6d8f6d0ec47 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6215,8 +6215,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
atomic_set(&adev->reset_domain->reset_res, r);
- if (!r)
- drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
+ if (!r) {
+ struct drm_wedge_app_info aux, *info = NULL;
+
+ if (job) {
+ struct amdgpu_task_info *ti;
+
+ ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
+ if (ti) {
+ aux.pid = ti->pid;
+ aux.comm = ti->process_name;
+ info = &aux;
+ amdgpu_vm_put_task_info(ti);
+ }
+ }
+
+ drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, info);
+ }
return r;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index a47b2eb301e5..98efa3318ddb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -89,6 +89,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
{
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
struct amdgpu_job *job = to_amdgpu_job(s_job);
+ struct drm_wedge_app_info aux, *info = NULL;
struct amdgpu_task_info *ti;
struct amdgpu_device *adev = ring->adev;
int idx;
@@ -127,6 +128,9 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
dev_err(adev->dev,
"Process information: process %s pid %d thread %s pid %d\n",
ti->process_name, ti->tgid, ti->task_name, ti->pid);
+ aux.pid = ti->pid;
+ aux.comm = ti->process_name;
+ info = &aux;
amdgpu_vm_put_task_info(ti);
}
@@ -166,7 +170,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
if (amdgpu_ring_sched_ready(ring))
drm_sched_start(&ring->sched, 0);
dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
- drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
+ drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, info);
goto exit;
}
dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
--
2.49.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/3] drm: Create an app info option for wedge events
2025-05-11 22:47 ` [PATCH v2 1/3] " André Almeida
@ 2025-05-12 6:08 ` Krzysztof Karas
2025-05-12 13:40 ` André Almeida
2025-05-12 14:53 ` Christian König
2025-05-12 15:34 ` Rodrigo Vivi
1 sibling, 2 replies; 9+ messages in thread
From: Krzysztof Karas @ 2025-05-12 6:08 UTC (permalink / raw)
To: André Almeida
Cc: Alex Deucher, Christian König, siqueira, airlied, simona,
Raag Jadav, rodrigo.vivi, jani.nikula, Xaver Hugl,
Pierre-Loup A . Griffais, dri-devel, linux-kernel, kernel-dev,
amd-gfx, intel-xe, intel-gfx
Hi André,
[...]
> @@ -582,6 +584,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> "but recovered through reset" : "needs recovery");
>
> + if (info) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> + } else {
> + snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> + snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
I think using PID=-1 and APP=none might be misleading, because
something did cause the wedge if we landed here. You could use
"PID=unknown" and "APP=unknown" or ensure these arrays are
zeroed and fill them only if "info" is available:
- char *envp[] = { event_string, NULL };
+ char pid_string[15] = {}, comm_string[TASK_COMM_LEN] = {};
+ char *envp[] = { event_string, pid_string, comm_string, NULL };
[...]
+ if (info) {
+ snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
+ snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
+ }
Then, when printing the logs later you could check if they have
a value and only use them if they do (or handle that however
you would see fit :) ).
Best Regards,
Krzysztof
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/3] drm: Create an app info option for wedge events
2025-05-12 6:08 ` Krzysztof Karas
@ 2025-05-12 13:40 ` André Almeida
2025-05-12 14:53 ` Christian König
1 sibling, 0 replies; 9+ messages in thread
From: André Almeida @ 2025-05-12 13:40 UTC (permalink / raw)
To: Krzysztof Karas
Cc: Alex Deucher, Christian König, siqueira, airlied, simona,
Raag Jadav, rodrigo.vivi, jani.nikula, Xaver Hugl,
Pierre-Loup A . Griffais, dri-devel, linux-kernel, kernel-dev,
amd-gfx, intel-xe, intel-gfx
Hi Krzysztof,
Thanks for the feedback.
Em 12/05/2025 03:08, Krzysztof Karas escreveu:
> Hi André,
>
> [...]
>
>> @@ -582,6 +584,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
>> drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>> "but recovered through reset" : "needs recovery");
>>
>> + if (info) {
>> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
>> + snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
>> + } else {
>> + snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
>> + snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
>
> I think using PID=-1 and APP=none might be misleading, because
> something did cause the wedge if we landed here. You could use
> "PID=unknown" and "APP=unknown" or ensure these arrays are
> zeroed and fill them only if "info" is available:
>
> - char *envp[] = { event_string, NULL };
> + char pid_string[15] = {}, comm_string[TASK_COMM_LEN] = {};
> + char *envp[] = { event_string, pid_string, comm_string, NULL };
>
> [...]
>
> + if (info) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> + }
>
> Then, when printing the logs later you could check if they have
> a value and only use them if they do (or handle that however
> you would see fit :) ).
>
That works for me, I will address this in my v3.
Thanks!
> Best Regards,
> Krzysztof
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/3] drm: Create an app info option for wedge events
2025-05-12 6:08 ` Krzysztof Karas
2025-05-12 13:40 ` André Almeida
@ 2025-05-12 14:53 ` Christian König
1 sibling, 0 replies; 9+ messages in thread
From: Christian König @ 2025-05-12 14:53 UTC (permalink / raw)
To: Krzysztof Karas, André Almeida
Cc: Alex Deucher, siqueira, airlied, simona, Raag Jadav, rodrigo.vivi,
jani.nikula, Xaver Hugl, Pierre-Loup A . Griffais, dri-devel,
linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx
On 5/12/25 08:08, Krzysztof Karas wrote:
> Hi André,
>
> [...]
>
>> @@ -582,6 +584,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
>> drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>> "but recovered through reset" : "needs recovery");
>>
>> + if (info) {
>> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
>> + snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
>> + } else {
>> + snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
>> + snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
>
> I think using PID=-1 and APP=none might be misleading, because
> something did cause the wedge if we landed here.
Yeah, that certainly won't fly. 1 is a perfectly valid pid.
I would just set pid_string and comm_string to empty if info isn't available.
Regards,
Christian.
You could use
> "PID=unknown" and "APP=unknown" or ensure these arrays are
> zeroed and fill them only if "info" is available:
>
> - char *envp[] = { event_string, NULL };
> + char pid_string[15] = {}, comm_string[TASK_COMM_LEN] = {};
> + char *envp[] = { event_string, pid_string, comm_string, NULL };
>
> [...]
>
> + if (info) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> + }
>
> Then, when printing the logs later you could check if they have
> a value and only use them if they do (or handle that however
> you would see fit :) ).
>
> Best Regards,
> Krzysztof
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/3] drm: Create an app info option for wedge events
2025-05-11 22:47 ` [PATCH v2 1/3] " André Almeida
2025-05-12 6:08 ` Krzysztof Karas
@ 2025-05-12 15:34 ` Rodrigo Vivi
1 sibling, 0 replies; 9+ messages in thread
From: Rodrigo Vivi @ 2025-05-12 15:34 UTC (permalink / raw)
To: André Almeida
Cc: Alex Deucher, Christian König, siqueira, airlied, simona,
Raag Jadav, jani.nikula, Xaver Hugl, Pierre-Loup A . Griffais,
dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx
On Sun, May 11, 2025 at 07:47:43PM -0300, André Almeida wrote:
> When a device get wedged, it might be caused by a guilty application.
> For userspace, knowing which app was the cause can be useful for some
> situations, like for implementing a policy, logs or for giving a chance
> for the compositor to let the user know what app caused the problem.
> This is an optional argument, when `PID=-1` there's no information about
> the app caused the problem, or if any app was involved during the hang.
>
> Sometimes just the PID isn't enough giving that the app might be already
> dead by the time userspace will try to check what was this PID's name,
> so to make the life easier also notify what's the app's name in the user
> event.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> drivers/gpu/drm/drm_drv.c | 16 +++++++++++++---
> drivers/gpu/drm/i915/gt/intel_reset.c | 3 ++-
> drivers/gpu/drm/xe/xe_device.c | 3 ++-
> include/drm/drm_device.h | 8 ++++++++
> include/drm/drm_drv.h | 3 ++-
> 7 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7f354cd532dc..c8a51418d0e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6216,7 +6216,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> atomic_set(&adev->reset_domain->reset_res, r);
>
> if (!r)
> - drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>
> return r;
> }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index acb21fc8b3ce..a47b2eb301e5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -166,7 +166,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> if (amdgpu_ring_sched_ready(ring))
> drm_sched_start(&ring->sched, 0);
> dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> - drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> + drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> goto exit;
> }
> dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 3dc7acd56b1d..1816ef4251e7 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -542,6 +542,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> * drm_dev_wedged_event - generate a device wedged uevent
> * @dev: DRM device
> * @method: method(s) to be used for recovery
> + * @info: optional information about the guilty app
> *
> * This generates a device wedged uevent for the DRM device specified by @dev.
> * Recovery @method\(s) of choice will be sent in the uevent environment as
> @@ -554,13 +555,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> *
> * Returns: 0 on success, negative error code otherwise.
> */
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> + struct drm_wedge_app_info *info)
> {
> const char *recovery = NULL;
> unsigned int len, opt;
> /* Event string length up to 28+ characters with available methods */
> - char event_string[32];
> - char *envp[] = { event_string, NULL };
> + char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN];
> + char *envp[] = { event_string, pid_string, comm_string, NULL };
>
> len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>
> @@ -582,6 +584,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> "but recovered through reset" : "needs recovery");
>
> + if (info) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> + } else {
> + snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> + snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> + }
> +
> return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> }
> EXPORT_SYMBOL(drm_dev_wedged_event);
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index dbdcfe130ad4..ba1d8fdc3c7b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -1448,7 +1448,8 @@ static void intel_gt_reset_global(struct intel_gt *gt,
> kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
> else
> drm_dev_wedged_event(>->i915->drm,
> - DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
> + DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
> + NULL);
(For future) I believe we can later modify the gt_reset handles here
to get ctx->id when we are coming from reset paths with context and
NULL when we don't have pid.
Perhaps one possibility could be strings to make this flexible
'pid:%d'
'path:debugfs'
But just a brainstorm kind of idea... not a strong feeling or need.
> }
>
> /**
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index c02c4c4e9412..f329613e061f 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1168,7 +1168,8 @@ void xe_device_declare_wedged(struct xe_device *xe)
>
> /* Notify userspace of wedged device */
> drm_dev_wedged_event(&xe->drm,
> - DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
> + DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
> + NULL);
similarly here, for the paths where we come to the wedge declaration from
exeq_queue, we could get q->vm->xef->pid...
But again, no blocker from my side. For both i915 and Xe:
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> }
>
> for_each_gt(gt, xe, id)
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index e2f894f1b90a..b87401d5079e 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -30,6 +30,14 @@ struct pci_controller;
> #define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver */
> #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus device + bind */
>
> +/**
> + * struct drm_wedge_app_info - information about the guilty app of a wedge dev
> + */
> +struct drm_wedge_app_info {
> + pid_t pid;
> + char *comm;
> +};
> +
> /**
> * enum switch_power_state - power state of drm device
> */
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index a43d707b5f36..8fc6412a6345 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -482,7 +482,8 @@ void drm_put_dev(struct drm_device *dev);
> bool drm_dev_enter(struct drm_device *dev, int *idx);
> void drm_dev_exit(int idx);
> void drm_dev_unplug(struct drm_device *dev);
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> + struct drm_wedge_app_info *info);
>
> /**
> * drm_dev_is_unplugged - is a DRM device unplugged
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/3] drm/doc: Add a section about "App information" for the wedge API
2025-05-11 22:47 ` [PATCH v2 2/3] drm/doc: Add a section about "App information" for the wedge API André Almeida
@ 2025-05-12 18:37 ` Raag Jadav
0 siblings, 0 replies; 9+ messages in thread
From: Raag Jadav @ 2025-05-12 18:37 UTC (permalink / raw)
To: André Almeida
Cc: Alex Deucher, Christian König, siqueira, airlied, simona,
rodrigo.vivi, jani.nikula, Xaver Hugl, Pierre-Loup A . Griffais,
dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx
On Sun, May 11, 2025 at 07:47:44PM -0300, André Almeida wrote:
> Add a section about "App information" for the wedge API.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> Documentation/gpu/drm-uapi.rst | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 69f72e71a96e..826abe265a24 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -446,6 +446,21 @@ telemetry information (devcoredump, syslog). This is useful because the first
> hang is usually the most critical one which can result in consequential hangs or
> complete wedging.
>
> +App information
> +---------------
> +
> +The information about which application (if any) caused the device to get in the
I'm wondering if we should change the wording to "application involved in device
wedging", or can we guarantee it to be the cause?
My limited understanding is that we'd still need the full dump to find the cause,
if it's possible to also note here.
Raag
> +wedge state is useful for userspace if they want to notify the user about what
> +happened (e.g. the compositor display a message to the user "The <app name>
> +caused a graphical error and the system recovered") or to implement policies
> +(e.g. the daemon may "ban" an app that keeps resetting the device). If the app
> +information is not available, the uevent will display as ``PID=-1`` and
> +``APP=none``. Otherwise, ``PID`` and ``APP`` will advertise about the guilty
> +app.
> +
> +The reliability of this information is driver and hardware specific, and should
> +be taken with a caution regarding it's precision.
> +
> Consumer prerequisites
> ----------------------
>
> --
> 2.49.0
>
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-05-12 18:37 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-11 22:47 [PATCH v2 0/3] drm: Create an app info option for wedge events André Almeida
2025-05-11 22:47 ` [PATCH v2 1/3] " André Almeida
2025-05-12 6:08 ` Krzysztof Karas
2025-05-12 13:40 ` André Almeida
2025-05-12 14:53 ` Christian König
2025-05-12 15:34 ` Rodrigo Vivi
2025-05-11 22:47 ` [PATCH v2 2/3] drm/doc: Add a section about "App information" for the wedge API André Almeida
2025-05-12 18:37 ` Raag Jadav
2025-05-11 22:47 ` [PATCH v2 3/3] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).