public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/2] drm: Create an app info option for wedge events
@ 2025-02-28 12:13 André Almeida
  2025-02-28 12:13 ` [PATCH 1/2] " André Almeida
  2025-02-28 12:13 ` [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
  0 siblings, 2 replies; 19+ messages in thread
From: André Almeida @ 2025-02-28 12:13 UTC (permalink / raw)
  To: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx
  Cc: Alex Deucher, 'Christian König', siqueira, airlied,
	simona, Raag Jadav, rodrigo.vivi, jani.nikula, André Almeida

This patchset implements a request made by Xaver Hugl about wedge events:

"I'd really like to have the PID of the client that triggered the GPU
reset, so that we can kill it if multiple resets are triggered in a
row (or switch to software rendering if it's KWin itself) and show a
user-friendly notification about why their app(s) crashed, but that
can be added later."

From https://lore.kernel.org/dri-devel/CAFZQkGwJ4qgHV8WTp2=svJ_VXhb-+Y8_VNtKB=jLsk6DqMYp9w@mail.gmail.com/

This patchset depends on https://lore.kernel.org/dri-devel/20250225010221.537059-1-andrealmeid@igalia.com/

For testing, I've used amdgpu's debug_mask options debug_disable_soft_recovery
and debug_disable_gpu_ring_reset to test both wedge event paths in the driver.
To trigger a ring timeout, I used this app:
https://gitlab.freedesktop.org/andrealmeid/gpu-timeout

Thanks!

André Almeida (2):
  drm: Create an app info option for wedge events
  drm/amdgpu: Make use of drm_wedge_app_info

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ++++
 drivers/gpu/drm/drm_drv.c                  | 16 +++++++++++++---
 drivers/gpu/drm/i915/gt/intel_reset.c      |  3 ++-
 drivers/gpu/drm/xe/xe_device.c             |  3 ++-
 include/drm/drm_device.h                   |  8 ++++++++
 include/drm/drm_drv.h                      |  3 ++-
 7 files changed, 48 insertions(+), 8 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/2] drm: Create an app info option for wedge events
  2025-02-28 12:13 [PATCH 0/2] drm: Create an app info option for wedge events André Almeida
@ 2025-02-28 12:13 ` André Almeida
  2025-02-28 14:20   ` Raag Jadav
  2025-02-28 12:13 ` [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
  1 sibling, 1 reply; 19+ messages in thread
From: André Almeida @ 2025-02-28 12:13 UTC (permalink / raw)
  To: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx
  Cc: Alex Deucher, 'Christian König', siqueira, airlied,
	simona, Raag Jadav, rodrigo.vivi, jani.nikula, André Almeida

When a device gets wedged, it might have been caused by a guilty
application. For userspace, knowing which app was the cause can be
useful in some situations, like implementing a policy, writing logs, or
giving the compositor a chance to let the user know which app caused
the problem. This argument is optional: when `PID=-1`, there's no
information about which app caused the problem, or about whether any
app was involved in the hang.

Sometimes just the PID isn't enough, given that the app might already
be dead by the time userspace tries to look up the PID's name, so to
make life easier also include the app's name in the uevent.
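
For illustration, the resulting uevent environment could then look like
this for a hang attributed to a (hypothetical) app:

```
WEDGED=none
PID=1234
APP=gpu-timeout
```

and like this when no app information is available:

```
WEDGED=rebind,bus-reset
PID=-1
APP=none
```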

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
 drivers/gpu/drm/drm_drv.c                  | 16 +++++++++++++---
 drivers/gpu/drm/i915/gt/intel_reset.c      |  3 ++-
 drivers/gpu/drm/xe/xe_device.c             |  3 ++-
 include/drm/drm_device.h                   |  8 ++++++++
 include/drm/drm_drv.h                      |  3 ++-
 7 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 24ba52d76045..00b9b87dafd8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	atomic_set(&adev->reset_domain->reset_res, r);
 
 	if (!r)
-		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
 
 	return r;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index ef1b77f1e88f..3ed9cbcab1ad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 			amdgpu_fence_driver_force_completion(ring);
 			if (amdgpu_ring_sched_ready(ring))
 				drm_sched_start(&ring->sched, 0);
-			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
 			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
 			goto exit;
 		}
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 17fc5dc708f4..48faafd82a99 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
  * drm_dev_wedged_event - generate a device wedged uevent
  * @dev: DRM device
  * @method: method(s) to be used for recovery
+ * @info: optional information about the guilty app
  *
  * This generates a device wedged uevent for the DRM device specified by @dev.
  * Recovery @method\(s) of choice will be sent in the uevent environment as
@@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
  *
  * Returns: 0 on success, negative error code otherwise.
  */
-int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
+			 struct drm_wedge_app_info *info)
 {
 	const char *recovery = NULL;
 	unsigned int len, opt;
 	/* Event string length up to 28+ characters with available methods */
-	char event_string[32];
-	char *envp[] = { event_string, NULL };
+	char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN + 4];
+	char *envp[] = { event_string, pid_string, comm_string, NULL };
 
 	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
 
@@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
 	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
 		 "but recovered through reset" : "needs recovery");
 
+	if (info) {
+		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
+		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
+	} else {
+		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
+		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
+	}
+
 	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
 }
 EXPORT_SYMBOL(drm_dev_wedged_event);
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index d6dc12fd87c1..928b9f048b6a 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1424,7 +1424,8 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
 	else
 		drm_dev_wedged_event(&gt->i915->drm,
-				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
+				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
+				     NULL);
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index fc4a49f25c09..8a349c7daf24 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1037,7 +1037,8 @@ void xe_device_declare_wedged(struct xe_device *xe)
 
 		/* Notify userspace of wedged device */
 		drm_dev_wedged_event(&xe->drm,
-				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
+				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
+				     NULL);
 	}
 
 	for_each_gt(gt, xe, id)
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index 6ea54a578cda..8b9d614257e6 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -30,6 +30,14 @@ struct pci_controller;
 #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
 #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
 
+/**
+ * struct drm_wedge_app_info - information about the guilty app of a wedge dev
+ */
+struct drm_wedge_app_info {
+	pid_t pid;
+	char *comm;
+};
+
 /**
  * enum switch_power_state - power state of drm device
  */
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index a43d707b5f36..8fc6412a6345 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -482,7 +482,8 @@ void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
-int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
+			 struct drm_wedge_app_info *info);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
-- 
2.48.1



* [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-02-28 12:13 [PATCH 0/2] drm: Create an app info option for wedge events André Almeida
  2025-02-28 12:13 ` [PATCH 1/2] " André Almeida
@ 2025-02-28 12:13 ` André Almeida
  2025-02-28 14:58   ` Raag Jadav
  1 sibling, 1 reply; 19+ messages in thread
From: André Almeida @ 2025-02-28 12:13 UTC (permalink / raw)
  To: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx
  Cc: Alex Deucher, 'Christian König', siqueira, airlied,
	simona, Raag Jadav, rodrigo.vivi, jani.nikula, André Almeida

To notify userspace about which app (if any) made the device get into a
wedged state, make use of the drm_wedge_app_info parameter, filling it
with the app's PID and name.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 00b9b87dafd8..e06adf6f34fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	atomic_set(&adev->reset_domain->reset_res, r);
 
-	if (!r)
-		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
+	if (!r) {
+		struct drm_wedge_app_info aux, *info = NULL;
+
+		if (job) {
+			struct amdgpu_task_info *ti;
+
+			ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
+			if (ti) {
+				aux.pid = ti->pid;
+				aux.comm = ti->process_name;
+				info = &aux;
+				amdgpu_vm_put_task_info(ti);
+			}
+		}
+
+		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, info);
+	}
 
 	return r;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 3ed9cbcab1ad..8ab23182127e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -89,6 +89,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 {
 	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
 	struct amdgpu_job *job = to_amdgpu_job(s_job);
+	struct drm_wedge_app_info aux, *info = NULL;
 	struct amdgpu_task_info *ti;
 	struct amdgpu_device *adev = ring->adev;
 	int idx;
@@ -127,6 +128,9 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		dev_err(adev->dev,
 			"Process information: process %s pid %d thread %s pid %d\n",
 			ti->process_name, ti->tgid, ti->task_name, ti->pid);
+		aux.pid = ti->pid;
+		aux.comm = ti->process_name;
+		info = &aux;
 		amdgpu_vm_put_task_info(ti);
 	}
 
@@ -150,7 +154,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 			amdgpu_fence_driver_force_completion(ring);
 			if (amdgpu_ring_sched_ready(ring))
 				drm_sched_start(&ring->sched, 0);
-			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
+			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, info);
 			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
 			goto exit;
 		}
-- 
2.48.1



* Re: [PATCH 1/2] drm: Create an app info option for wedge events
  2025-02-28 12:13 ` [PATCH 1/2] " André Almeida
@ 2025-02-28 14:20   ` Raag Jadav
  2025-02-28 21:54     ` André Almeida
  0 siblings, 1 reply; 19+ messages in thread
From: Raag Jadav @ 2025-02-28 14:20 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, lucas.demarchi

Cc: Lucas

On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> When a device gets wedged, it might have been caused by a guilty
> application. For userspace, knowing which app was the cause can be
> useful in some situations, like implementing a policy, writing logs, or
> giving the compositor a chance to let the user know which app caused
> the problem. This argument is optional: when `PID=-1`, there's no
> information about which app caused the problem, or about whether any
> app was involved in the hang.
> 
> Sometimes just the PID isn't enough, given that the app might already
> be dead by the time userspace tries to look up the PID's name, so to
> make life easier also include the app's name in the uevent.
> 
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>  drivers/gpu/drm/drm_drv.c                  | 16 +++++++++++++---
>  drivers/gpu/drm/i915/gt/intel_reset.c      |  3 ++-
>  drivers/gpu/drm/xe/xe_device.c             |  3 ++-
>  include/drm/drm_device.h                   |  8 ++++++++
>  include/drm/drm_drv.h                      |  3 ++-
>  7 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 24ba52d76045..00b9b87dafd8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  	atomic_set(&adev->reset_domain->reset_res, r);
>  
>  	if (!r)
> -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> +		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>  
>  	return r;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index ef1b77f1e88f..3ed9cbcab1ad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>  			amdgpu_fence_driver_force_completion(ring);
>  			if (amdgpu_ring_sched_ready(ring))
>  				drm_sched_start(&ring->sched, 0);
> -			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> +			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>  			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
>  			goto exit;
>  		}
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 17fc5dc708f4..48faafd82a99 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>   * drm_dev_wedged_event - generate a device wedged uevent
>   * @dev: DRM device
>   * @method: method(s) to be used for recovery
> + * @info: optional information about the guilty app
>   *
>   * This generates a device wedged uevent for the DRM device specified by @dev.
>   * Recovery @method\(s) of choice will be sent in the uevent environment as
> @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>   *
>   * Returns: 0 on success, negative error code otherwise.
>   */
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> +			 struct drm_wedge_app_info *info)
>  {
>  	const char *recovery = NULL;
>  	unsigned int len, opt;
>  	/* Event string length up to 28+ characters with available methods */
> -	char event_string[32];
> -	char *envp[] = { event_string, NULL };
> +	char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN + 4];
> +	char *envp[] = { event_string, pid_string, comm_string, NULL };
>  
>  	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>  
> @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
>  	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>  		 "but recovered through reset" : "needs recovery");
>  
> +	if (info) {
> +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> +	} else {
> +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> +	}

This is not of much use for wedge cases that need recovery, since at that
point userspace will need to clean house anyway.

Which leaves us with only the 'none' case, and perhaps the need for
standardization of "optional telemetry collection".

Thoughts?

Raag


* Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-02-28 12:13 ` [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
@ 2025-02-28 14:58   ` Raag Jadav
  2025-02-28 21:49     ` André Almeida
  0 siblings, 1 reply; 19+ messages in thread
From: Raag Jadav @ 2025-02-28 14:58 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula

On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
> To notify userspace about which app (if any) made the device get into a
> wedged state, make use of the drm_wedge_app_info parameter, filling it
> with the app's PID and name.
> 
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 00b9b87dafd8..e06adf6f34fd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  
>  	atomic_set(&adev->reset_domain->reset_res, r);
>  
> -	if (!r)
> -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> +	if (!r) {
> +		struct drm_wedge_app_info aux, *info = NULL;
> +
> +		if (job) {
> +			struct amdgpu_task_info *ti;
> +
> +			ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> +			if (ti) {
> +				aux.pid = ti->pid;
> +				aux.comm = ti->process_name;
> +				info = &aux;
> +				amdgpu_vm_put_task_info(ti);
> +			}
> +		}

Is this guaranteed to be guilty app and not some scheduled worker?

Raag


* Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-02-28 14:58   ` Raag Jadav
@ 2025-02-28 21:49     ` André Almeida
  2025-03-01  6:04       ` Raag Jadav
  0 siblings, 1 reply; 19+ messages in thread
From: André Almeida @ 2025-02-28 21:49 UTC (permalink / raw)
  To: Raag Jadav
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula

Hi Raag,

On 2/28/25 11:58, Raag Jadav wrote:
> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
>> To notify userspace about which app (if any) made the device get into a
>> wedged state, make use of the drm_wedge_app_info parameter, filling it
>> with the app's PID and name.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
>>   2 files changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 00b9b87dafd8..e06adf6f34fd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>   
>>   	atomic_set(&adev->reset_domain->reset_res, r);
>>   
>> -	if (!r)
>> -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>> +	if (!r) {
>> +		struct drm_wedge_app_info aux, *info = NULL;
>> +
>> +		if (job) {
>> +			struct amdgpu_task_info *ti;
>> +
>> +			ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
>> +			if (ti) {
>> +				aux.pid = ti->pid;
>> +				aux.comm = ti->process_name;
>> +				info = &aux;
>> +				amdgpu_vm_put_task_info(ti);
>> +			}
>> +		}
> Is this guaranteed to be guilty app and not some scheduled worker?

This is how amdgpu decides which app is the guilty one earlier in the
code, as in the print:

     ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);

     "Process information: process %s pid %d thread %s pid %d\n"

So I think it's consistent with what the driver considers to be the
guilty process.

> Raag
>


* Re: [PATCH 1/2] drm: Create an app info option for wedge events
  2025-02-28 14:20   ` Raag Jadav
@ 2025-02-28 21:54     ` André Almeida
  2025-03-01  5:53       ` Raag Jadav
  0 siblings, 1 reply; 19+ messages in thread
From: André Almeida @ 2025-02-28 21:54 UTC (permalink / raw)
  To: Raag Jadav
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, lucas.demarchi

Hi Raag,

On 2/28/25 11:20, Raag Jadav wrote:
> Cc: Lucas
>
> On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
>> When a device gets wedged, it might have been caused by a guilty
>> application. For userspace, knowing which app was the cause can be
>> useful in some situations, like implementing a policy, writing logs, or
>> giving the compositor a chance to let the user know which app caused
>> the problem. This argument is optional: when `PID=-1`, there's no
>> information about which app caused the problem, or about whether any
>> app was involved in the hang.
>>
>> Sometimes just the PID isn't enough, given that the app might already
>> be dead by the time userspace tries to look up the PID's name, so to
>> make life easier also include the app's name in the uevent.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>   drivers/gpu/drm/drm_drv.c                  | 16 +++++++++++++---
>>   drivers/gpu/drm/i915/gt/intel_reset.c      |  3 ++-
>>   drivers/gpu/drm/xe/xe_device.c             |  3 ++-
>>   include/drm/drm_device.h                   |  8 ++++++++
>>   include/drm/drm_drv.h                      |  3 ++-
>>   7 files changed, 29 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 24ba52d76045..00b9b87dafd8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>   	atomic_set(&adev->reset_domain->reset_res, r);
>>   
>>   	if (!r)
>> -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>> +		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>>   
>>   	return r;
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index ef1b77f1e88f..3ed9cbcab1ad 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   			amdgpu_fence_driver_force_completion(ring);
>>   			if (amdgpu_ring_sched_ready(ring))
>>   				drm_sched_start(&ring->sched, 0);
>> -			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>> +			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>>   			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
>>   			goto exit;
>>   		}
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 17fc5dc708f4..48faafd82a99 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>    * drm_dev_wedged_event - generate a device wedged uevent
>>    * @dev: DRM device
>>    * @method: method(s) to be used for recovery
>> + * @info: optional information about the guilty app
>>    *
>>    * This generates a device wedged uevent for the DRM device specified by @dev.
>>    * Recovery @method\(s) of choice will be sent in the uevent environment as
>> @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>    *
>>    * Returns: 0 on success, negative error code otherwise.
>>    */
>> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
>> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
>> +			 struct drm_wedge_app_info *info)
>>   {
>>   	const char *recovery = NULL;
>>   	unsigned int len, opt;
>>   	/* Event string length up to 28+ characters with available methods */
>> -	char event_string[32];
>> -	char *envp[] = { event_string, NULL };
>> +	char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN + 4];
>> +	char *envp[] = { event_string, pid_string, comm_string, NULL };
>>   
>>   	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>>   
>> @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
>>   	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>>   		 "but recovered through reset" : "needs recovery");
>>   
>> +	if (info) {
>> +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
>> +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
>> +	} else {
>> +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
>> +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
>> +	}
> This is not of much use for wedge cases that need recovery, since at that
> point userspace will need to clean house anyway.
>
> Which leaves us with only the 'none' case, and perhaps the need for
> standardization of "optional telemetry collection".
>
> Thoughts?

I had the feeling that 'none' was already meant to be used for that. Do
you think we should move to another name? Given that we haven't reached
the merge window yet, we could potentially change that name without much
damage.

> Raag
>


* Re: [PATCH 1/2] drm: Create an app info option for wedge events
  2025-02-28 21:54     ` André Almeida
@ 2025-03-01  5:53       ` Raag Jadav
  2025-03-10 21:27         ` André Almeida
  0 siblings, 1 reply; 19+ messages in thread
From: Raag Jadav @ 2025-03-01  5:53 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, lucas.demarchi

On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 2/28/25 11:20, Raag Jadav wrote:
> > Cc: Lucas
> > 
> > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > When a device gets wedged, it might have been caused by a guilty
> > > application. For userspace, knowing which app was the cause can be
> > > useful in some situations, like implementing a policy, writing logs, or
> > > giving the compositor a chance to let the user know which app caused
> > > the problem. This argument is optional: when `PID=-1`, there's no
> > > information about which app caused the problem, or about whether any
> > > app was involved in the hang.
> > > 
> > > Sometimes just the PID isn't enough, given that the app might already
> > > be dead by the time userspace tries to look up the PID's name, so to
> > > make life easier also include the app's name in the uevent.
> > > 
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
> > >   drivers/gpu/drm/drm_drv.c                  | 16 +++++++++++++---
> > >   drivers/gpu/drm/i915/gt/intel_reset.c      |  3 ++-
> > >   drivers/gpu/drm/xe/xe_device.c             |  3 ++-
> > >   include/drm/drm_device.h                   |  8 ++++++++
> > >   include/drm/drm_drv.h                      |  3 ++-
> > >   7 files changed, 29 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 24ba52d76045..00b9b87dafd8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > >   	atomic_set(&adev->reset_domain->reset_res, r);
> > >   	if (!r)
> > > -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > +		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >   	return r;
> > >   }
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > index ef1b77f1e88f..3ed9cbcab1ad 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> > >   			amdgpu_fence_driver_force_completion(ring);
> > >   			if (amdgpu_ring_sched_ready(ring))
> > >   				drm_sched_start(&ring->sched, 0);
> > > -			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > +			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >   			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> > >   			goto exit;
> > >   		}
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index 17fc5dc708f4..48faafd82a99 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >    * drm_dev_wedged_event - generate a device wedged uevent
> > >    * @dev: DRM device
> > >    * @method: method(s) to be used for recovery
> > > + * @info: optional information about the guilty app
> > >    *
> > >    * This generates a device wedged uevent for the DRM device specified by @dev.
> > >    * Recovery @method\(s) of choice will be sent in the uevent environment as
> > > @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >    *
> > >    * Returns: 0 on success, negative error code otherwise.
> > >    */
> > > -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> > > +			 struct drm_wedge_app_info *info)
> > >   {
> > >   	const char *recovery = NULL;
> > >   	unsigned int len, opt;
> > >   	/* Event string length up to 28+ characters with available methods */
> > > -	char event_string[32];
> > > -	char *envp[] = { event_string, NULL };
> > > +	char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN];
> > > +	char *envp[] = { event_string, pid_string, comm_string, NULL };
> > >   	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > >   	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> > >   		 "but recovered through reset" : "needs recovery");
> > > +	if (info) {
> > > +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> > > +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> > > +	} else {
> > > +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> > > +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> > > +	}
> > This is not much use for wedge cases that needs recovery, since at that point
> > the userspace will need to clean house anyway.
> > 
> > Which leaves us with only 'none' case and perhaps the need for standardization
> > of "optional telemetry collection".
> > 
> > Thoughts?
> 
> I had the feeling that 'none' was already meant to be used for that. Do you
> think we should move to another naming? Given that we didn't reach the merge
> window yet we could potentially change that name without much damage.

No, I meant thoughts on possible telemetry data that the drivers might
consider useful for userspace (along with the PID) and that can be presented
in a vendor-agnostic manner (just like the wedged event).

Raag

^ permalink raw reply	[flat|nested] 19+ messages in thread
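[An aside for userspace consumers of the event discussed above: the patch builds the uevent environment as NUL-separated KEY=VALUE strings (WEDGED=, plus the new PID= and APP=). Below is a minimal sketch of extracting one key from such a payload — the netlink/udev listener itself is omitted, and the example payload is an illustrative assumption, not captured output.]

```c
#include <assert.h>
#include <string.h>

/*
 * Find "KEY=" in a NUL-separated uevent payload of length len and
 * return a pointer to its value, or NULL if the key is absent.
 * Key names (WEDGED, PID, APP) follow the patch quoted above; this
 * is a userspace sketch, not part of the kernel series.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen = strlen(key);
	const char *p = buf, *end = buf + len;

	while (p < end) {
		if (!strncmp(p, key, klen) && p[klen] == '=')
			return p + klen + 1;
		p += strlen(p) + 1;	/* advance to the next KEY=VALUE */
	}
	return NULL;
}
```

[For example, on a payload such as `"WEDGED=none\0PID=1234\0APP=gpu-timeout\0"`, `uevent_get(buf, len, "PID")` would yield `"1234"`, or `"-1"` when no app information was attached by the driver.]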

* Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-02-28 21:49     ` André Almeida
@ 2025-03-01  6:04       ` Raag Jadav
  2025-03-10 21:53         ` André Almeida
  0 siblings, 1 reply; 19+ messages in thread
From: Raag Jadav @ 2025-03-01  6:04 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula

On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 2/28/25 11:58, Raag Jadav wrote:
> > On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
> > > To notify userspace about which app (if any) made the device get in a
> > > wedge state, make use of drm_wedge_app_info parameter, filling it with
> > > the app PID and name.
> > > 
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
> > >   2 files changed, 22 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 00b9b87dafd8..e06adf6f34fd 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > >   	atomic_set(&adev->reset_domain->reset_res, r);
> > > -	if (!r)
> > > -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > > +	if (!r) {
> > > +		struct drm_wedge_app_info aux, *info = NULL;
> > > +
> > > +		if (job) {
> > > +			struct amdgpu_task_info *ti;
> > > +
> > > +			ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> > > +			if (ti) {
> > > +				aux.pid = ti->pid;
> > > +				aux.comm = ti->process_name;
> > > +				info = &aux;
> > > +				amdgpu_vm_put_task_info(ti);
> > > +			}
> > > +		}
> > Is this guaranteed to be guilty app and not some scheduled worker?
> 
> This is how amdgpu decides which app is the guilty one earlier in the code
> as in the print:
> 
>     ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
> 
>     "Process information: process %s pid %d thread %s pid %d\n"
> 
> So I think it's consistent with what the driver thinks it's the guilty
> process.

Sure, but with something like app_info we're kind of hinting to userspace
that an application was _indeed_ involved with the reset. Is that also guaranteed?

Is it possible that an application needlessly suffers from a false positive
scenario (reset due to other factors)?

Raag


* Re: [PATCH 1/2] drm: Create an app info option for wedge events
  2025-03-01  5:53       ` Raag Jadav
@ 2025-03-10 21:27         ` André Almeida
  2025-03-11 17:09           ` Raag Jadav
  0 siblings, 1 reply; 19+ messages in thread
From: André Almeida @ 2025-03-10 21:27 UTC (permalink / raw)
  To: Raag Jadav
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, lucas.demarchi, Xaver Hugl

Em 01/03/2025 02:53, Raag Jadav escreveu:
> On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
>> Hi Raag,
>>
>> On 2/28/25 11:20, Raag Jadav wrote:
>>> Cc: Lucas
>>>
>>> On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
>>>> When a device get wedged, it might be caused by a guilty application.
>>>> For userspace, knowing which app was the cause can be useful for some
>>>> situations, like for implementing a policy, logs or for giving a chance
>>>> for the compositor to let the user know what app caused the problem.
>>>> This is an optional argument, when `PID=-1` there's no information about
>>>> the app caused the problem, or if any app was involved during the hang.
>>>>
>>>> Sometimes just the PID isn't enough giving that the app might be already
>>>> dead by the time userspace will try to check what was this PID's name,
>>>> so to make the life easier also notify what's the app's name in the user
>>>> event.
>>>>
>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>

[...]

>>>>    	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>>>> @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
>>>>    	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>>>>    		 "but recovered through reset" : "needs recovery");
>>>> +	if (info) {
>>>> +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
>>>> +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
>>>> +	} else {
>>>> +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
>>>> +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
>>>> +	}
>>> This is not much use for wedge cases that needs recovery, since at that point
>>> the userspace will need to clean house anyway.
>>>
>>> Which leaves us with only 'none' case and perhaps the need for standardization
>>> of "optional telemetry collection".
>>>
>>> Thoughts?
>>
>> I had the feeling that 'none' was already meant to be used for that. Do you
>> think we should move to another naming? Given that we didn't reach the merge
>> window yet we could potentially change that name without much damage.
> 
> No, I meant thoughts on possible telemetry data that the drivers might
> think is useful for userspace (along with PID) and can be presented in
> a vendor agnostic manner (just like wedged event).

I'm not sure I agree that this will only be used for telemetry and for the 
`none` use case. As stated by Xaver, there's a use case for knowing which 
app caused the device to get wedged (like switching to software rendering) 
and for displaying something to the user after the recovery is done (e.g. 
"The game <app name> stopped working and Plasma has reset").
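[The compositor-side policy described here — kill an app that triggers repeated resets, fall back to software rendering when the compositor itself is reported guilty — could look roughly like the sketch below. The thresholds and the `on_wedge_event()` helper are hypothetical illustrations, not KWin code.]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy knobs and tracking table; a toy sketch only. */
#define RESET_WINDOW_SEC 60.0
#define MAX_RESETS 3
#define MAX_TRACKED 64

struct reset_entry {
	int pid;
	int count;
	double first_ts;
};

static struct reset_entry table[MAX_TRACKED];

/*
 * Decide what a compositor with PID own_pid should do when a wedge
 * event reports guilty pid at time now. Returns a policy string.
 */
static const char *on_wedge_event(int own_pid, int pid, double now)
{
	int i, free_slot = -1;

	if (pid < 0)
		return "notify-only";		/* PID=-1: no guilty app reported */
	if (pid == own_pid)
		return "software-rendering";	/* compositor itself is guilty */

	for (i = 0; i < MAX_TRACKED; i++) {
		if (table[i].count && table[i].pid == pid) {
			if (now - table[i].first_ts > RESET_WINDOW_SEC) {
				/* window expired, restart the count */
				table[i].count = 1;
				table[i].first_ts = now;
				return "notify-user";
			}
			if (++table[i].count >= MAX_RESETS)
				return "kill";
			return "notify-user";
		}
		if (!table[i].count && free_slot < 0)
			free_slot = i;
	}
	if (free_slot >= 0)
		table[free_slot] = (struct reset_entry){ pid, 1, now };
	return "notify-user";
}
```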


* Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-03-01  6:04       ` Raag Jadav
@ 2025-03-10 21:53         ` André Almeida
  2025-03-10 22:03           ` Alex Deucher
  0 siblings, 1 reply; 19+ messages in thread
From: André Almeida @ 2025-03-10 21:53 UTC (permalink / raw)
  To: Raag Jadav
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, Xaver Hugl

Em 01/03/2025 03:04, Raag Jadav escreveu:
> On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
>> Hi Raag,
>>
>> On 2/28/25 11:58, Raag Jadav wrote:
>>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
>>>> To notify userspace about which app (if any) made the device get in a
>>>> wedge state, make use of drm_wedge_app_info parameter, filling it with
>>>> the app PID and name.
>>>>
>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
>>>>    2 files changed, 22 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 00b9b87dafd8..e06adf6f34fd 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>    	atomic_set(&adev->reset_domain->reset_res, r);
>>>> -	if (!r)
>>>> -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>>>> +	if (!r) {
>>>> +		struct drm_wedge_app_info aux, *info = NULL;
>>>> +
>>>> +		if (job) {
>>>> +			struct amdgpu_task_info *ti;
>>>> +
>>>> +			ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
>>>> +			if (ti) {
>>>> +				aux.pid = ti->pid;
>>>> +				aux.comm = ti->process_name;
>>>> +				info = &aux;
>>>> +				amdgpu_vm_put_task_info(ti);
>>>> +			}
>>>> +		}
>>> Is this guaranteed to be guilty app and not some scheduled worker?
>>
>> This is how amdgpu decides which app is the guilty one earlier in the code
>> as in the print:
>>
>>      ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
>>
>>      "Process information: process %s pid %d thread %s pid %d\n"
>>
>> So I think it's consistent with what the driver thinks it's the guilty
>> process.
> 
> Sure, but with something like app_info we're kind of hinting to userspace
> that an application was _indeed_ involved with reset. Is that also guaranteed?
> 
> Is it possible that an application needlessly suffers from a false positive
> scenario (reset due to other factors)?
> 

I asked Alex Deucher on IRC about that and yes, there's a chance that 
this is a false positive. However, in the majority of cases this is the 
right app, the one that caused the hang. This is what amdgpu is doing for 
GL robustness as well as for devcoredump, so it's very consistent with how 
amdgpu deals with this scenario, even if the mechanism is still not perfect.


* Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-03-10 21:53         ` André Almeida
@ 2025-03-10 22:03           ` Alex Deucher
  2025-03-11 17:13             ` Raag Jadav
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2025-03-10 22:03 UTC (permalink / raw)
  To: André Almeida
  Cc: Raag Jadav, dri-devel, linux-kernel, kernel-dev, amd-gfx,
	intel-xe, intel-gfx, Alex Deucher, Christian König, siqueira,
	airlied, simona, rodrigo.vivi, jani.nikula, Xaver Hugl

On Mon, Mar 10, 2025 at 5:54 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Em 01/03/2025 03:04, Raag Jadav escreveu:
> > On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
> >> Hi Raag,
> >>
> >> On 2/28/25 11:58, Raag Jadav wrote:
> >>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
> >>>> To notify userspace about which app (if any) made the device get in a
> >>>> wedge state, make use of drm_wedge_app_info parameter, filling it with
> >>>> the app PID and name.
> >>>>
> >>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
> >>>>    2 files changed, 22 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> index 00b9b87dafd8..e06adf6f34fd 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>>>            atomic_set(&adev->reset_domain->reset_res, r);
> >>>> -  if (!r)
> >>>> -          drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> >>>> +  if (!r) {
> >>>> +          struct drm_wedge_app_info aux, *info = NULL;
> >>>> +
> >>>> +          if (job) {
> >>>> +                  struct amdgpu_task_info *ti;
> >>>> +
> >>>> +                  ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> >>>> +                  if (ti) {
> >>>> +                          aux.pid = ti->pid;
> >>>> +                          aux.comm = ti->process_name;
> >>>> +                          info = &aux;
> >>>> +                          amdgpu_vm_put_task_info(ti);
> >>>> +                  }
> >>>> +          }
> >>> Is this guaranteed to be guilty app and not some scheduled worker?
> >>
> >> This is how amdgpu decides which app is the guilty one earlier in the code
> >> as in the print:
> >>
> >>      ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
> >>
> >>      "Process information: process %s pid %d thread %s pid %d\n"
> >>
> >> So I think it's consistent with what the driver thinks it's the guilty
> >> process.
> >
> > Sure, but with something like app_info we're kind of hinting to userspace
> > that an application was _indeed_ involved with reset. Is that also guaranteed?
> >
> > Is it possible that an application needlessly suffers from a false positive
> > scenario (reset due to other factors)?
> >
>
> I asked Alex Deucher in IRC about that and yes, there's a chance that
> this is a false positive. However, for the majority of cases this is the
> right app that caused the hang. This is what amdgpu is doing for GL
> robustness as well and devcoredump, so it's very consistent with how
> amdgpu deals with this scenario even if the mechanism is still not perfect.

It's usually the guilty one, but it's not guaranteed.  For example,
say you have a ROCm user queue and a gfx job submitted to a kernel
queue.  The actual guilty job may be the ROCm user queue, but the
driver may not detect that the ROCm queue was hung until some other
event (e.g., memory pressure).  However, the timer for the gfx job may
timeout before that happens on the ROCm queue so in that case the gfx
job would be incorrectly considered guilty.

Alex


* Re: [PATCH 1/2] drm: Create an app info option for wedge events
  2025-03-10 21:27         ` André Almeida
@ 2025-03-11 17:09           ` Raag Jadav
  2025-03-12 10:06             ` Raag Jadav
  0 siblings, 1 reply; 19+ messages in thread
From: Raag Jadav @ 2025-03-11 17:09 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, lucas.demarchi, Xaver Hugl

On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
> Em 01/03/2025 02:53, Raag Jadav escreveu:
> > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> > > Hi Raag,
> > > 
> > > On 2/28/25 11:20, Raag Jadav wrote:
> > > > Cc: Lucas
> > > > 
> > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > > > When a device get wedged, it might be caused by a guilty application.
> > > > > For userspace, knowing which app was the cause can be useful for some
> > > > > situations, like for implementing a policy, logs or for giving a chance
> > > > > for the compositor to let the user know what app caused the problem.
> > > > > This is an optional argument, when `PID=-1` there's no information about
> > > > > the app caused the problem, or if any app was involved during the hang.
> > > > > 
> > > > > Sometimes just the PID isn't enough giving that the app might be already
> > > > > dead by the time userspace will try to check what was this PID's name,
> > > > > so to make the life easier also notify what's the app's name in the user
> > > > > event.
> > > > > 
> > > > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> 
> [...]
> 
> > > > >    	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > > >    	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> > > > >    		 "but recovered through reset" : "needs recovery");
> > > > > +	if (info) {
> > > > > +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> > > > > +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> > > > > +	} else {
> > > > > +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> > > > > +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> > > > > +	}
> > > > This is not much use for wedge cases that needs recovery, since at that point
> > > > the userspace will need to clean house anyway.
> > > > 
> > > > Which leaves us with only 'none' case and perhaps the need for standardization
> > > > of "optional telemetry collection".
> > > > 
> > > > Thoughts?
> > > 
> > > I had the feeling that 'none' was already meant to be used for that. Do you
> > > think we should move to another naming? Given that we didn't reach the merge
> > > window yet we could potentially change that name without much damage.
> > 
> > No, I meant thoughts on possible telemetry data that the drivers might
> > think is useful for userspace (along with PID) and can be presented in
> > a vendor agnostic manner (just like wedged event).
> 
> I'm not if I agree that this will only be used for telemetry and for the
> `none` use case. As stated by Xaver, there's use case to know which app
> caused the device to get wedged (like switching to software rendering) and
> to display something for the user after the recovery is done (e.g. "The game
> <app name> stopped working and Plasma has reset").

Sure, but since this information is already available in the coredump, I was
hoping to have something like a standardized DRM-level coredump with both
vendor-specific and vendor-agnostic sections, which the drivers can (and
hopefully will transition to) use in conjunction with the wedged event to
provide wider telemetry, and which is useful for all wedge cases.

Would that serve the use case here?

Raag


* Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-03-10 22:03           ` Alex Deucher
@ 2025-03-11 17:13             ` Raag Jadav
  2025-03-12  8:25               ` Christian König
  0 siblings, 1 reply; 19+ messages in thread
From: Raag Jadav @ 2025-03-11 17:13 UTC (permalink / raw)
  To: Alex Deucher
  Cc: André Almeida, dri-devel, linux-kernel, kernel-dev, amd-gfx,
	intel-xe, intel-gfx, Alex Deucher, Christian König, siqueira,
	airlied, simona, rodrigo.vivi, jani.nikula, Xaver Hugl

On Mon, Mar 10, 2025 at 06:03:27PM -0400, Alex Deucher wrote:
> On Mon, Mar 10, 2025 at 5:54 PM André Almeida <andrealmeid@igalia.com> wrote:
> >
> > Em 01/03/2025 03:04, Raag Jadav escreveu:
> > > On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
> > >> Hi Raag,
> > >>
> > >> On 2/28/25 11:58, Raag Jadav wrote:
> > >>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
> > >>>> To notify userspace about which app (if any) made the device get in a
> > >>>> wedge state, make use of drm_wedge_app_info parameter, filling it with
> > >>>> the app PID and name.
> > >>>>
> > >>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > >>>> ---
> > >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
> > >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
> > >>>>    2 files changed, 22 insertions(+), 3 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>>> index 00b9b87dafd8..e06adf6f34fd 100644
> > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > >>>>            atomic_set(&adev->reset_domain->reset_res, r);
> > >>>> -  if (!r)
> > >>>> -          drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >>>> +  if (!r) {
> > >>>> +          struct drm_wedge_app_info aux, *info = NULL;
> > >>>> +
> > >>>> +          if (job) {
> > >>>> +                  struct amdgpu_task_info *ti;
> > >>>> +
> > >>>> +                  ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> > >>>> +                  if (ti) {
> > >>>> +                          aux.pid = ti->pid;
> > >>>> +                          aux.comm = ti->process_name;
> > >>>> +                          info = &aux;
> > >>>> +                          amdgpu_vm_put_task_info(ti);
> > >>>> +                  }
> > >>>> +          }
> > >>> Is this guaranteed to be guilty app and not some scheduled worker?
> > >>
> > >> This is how amdgpu decides which app is the guilty one earlier in the code
> > >> as in the print:
> > >>
> > >>      ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
> > >>
> > >>      "Process information: process %s pid %d thread %s pid %d\n"
> > >>
> > >> So I think it's consistent with what the driver thinks it's the guilty
> > >> process.
> > >
> > > Sure, but with something like app_info we're kind of hinting to userspace
> > > that an application was _indeed_ involved with reset. Is that also guaranteed?
> > >
> > > Is it possible that an application needlessly suffers from a false positive
> > > scenario (reset due to other factors)?
> > >
> >
> > I asked Alex Deucher in IRC about that and yes, there's a chance that
> > this is a false positive. However, for the majority of cases this is the
> > right app that caused the hang. This is what amdgpu is doing for GL
> > robustness as well and devcoredump, so it's very consistent with how
> > amdgpu deals with this scenario even if the mechanism is still not perfect.
> 
> It's usually the guilty one, but it's not guaranteed.  For example,
> say you have a ROCm user queue and a gfx job submitted to a kernel
> queue.  The actual guilty job may be the ROCm user queue, but the
> driver may not detect that the ROCm queue was hung until some other
> event (e.g., memory pressure).  However, the timer for the gfx job may
> timeout before that happens on the ROCm queue so in that case the gfx
> job would be incorrectly considered guilty.

So it boils down to the chances of that happening and whether they're
significant enough to open the door to API abuse.

Considering this accuracy is AMD specific, it's still an open question
how other drivers are, or will be, managing it.

Raag


* Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-03-11 17:13             ` Raag Jadav
@ 2025-03-12  8:25               ` Christian König
  2025-03-12  9:13                 ` Raag Jadav
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2025-03-12  8:25 UTC (permalink / raw)
  To: Raag Jadav, Alex Deucher
  Cc: André Almeida, dri-devel, linux-kernel, kernel-dev, amd-gfx,
	intel-xe, intel-gfx, Alex Deucher, siqueira, airlied, simona,
	rodrigo.vivi, jani.nikula, Xaver Hugl

Am 11.03.25 um 18:13 schrieb Raag Jadav:
> On Mon, Mar 10, 2025 at 06:03:27PM -0400, Alex Deucher wrote:
>> On Mon, Mar 10, 2025 at 5:54 PM André Almeida <andrealmeid@igalia.com> wrote:
>>> Em 01/03/2025 03:04, Raag Jadav escreveu:
>>>> On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
>>>>> Hi Raag,
>>>>>
>>>>> On 2/28/25 11:58, Raag Jadav wrote:
>>>>>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
>>>>>>> To notify userspace about which app (if any) made the device get in a
>>>>>>> wedge state, make use of drm_wedge_app_info parameter, filling it with
>>>>>>> the app PID and name.
>>>>>>>
>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
>>>>>>>    2 files changed, 22 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 00b9b87dafd8..e06adf6f34fd 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>            atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>> -  if (!r)
>>>>>>> -          drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>>>>>>> +  if (!r) {
>>>>>>> +          struct drm_wedge_app_info aux, *info = NULL;
>>>>>>> +
>>>>>>> +          if (job) {
>>>>>>> +                  struct amdgpu_task_info *ti;
>>>>>>> +
>>>>>>> +                  ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
>>>>>>> +                  if (ti) {
>>>>>>> +                          aux.pid = ti->pid;
>>>>>>> +                          aux.comm = ti->process_name;
>>>>>>> +                          info = &aux;
>>>>>>> +                          amdgpu_vm_put_task_info(ti);
>>>>>>> +                  }
>>>>>>> +          }
>>>>>> Is this guaranteed to be guilty app and not some scheduled worker?
>>>>> This is how amdgpu decides which app is the guilty one earlier in the code
>>>>> as in the print:
>>>>>
>>>>>      ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
>>>>>
>>>>>      "Process information: process %s pid %d thread %s pid %d\n"
>>>>>
>>>>> So I think it's consistent with what the driver thinks it's the guilty
>>>>> process.
>>>> Sure, but with something like app_info we're kind of hinting to userspace
>>>> that an application was _indeed_ involved with reset. Is that also guaranteed?
>>>>
>>>> Is it possible that an application needlessly suffers from a false positive
>>>> scenario (reset due to other factors)?
>>>>
>>> I asked Alex Deucher in IRC about that and yes, there's a chance that
>>> this is a false positive. However, for the majority of cases this is the
>>> right app that caused the hang. This is what amdgpu is doing for GL
>>> robustness as well and devcoredump, so it's very consistent with how
>>> amdgpu deals with this scenario even if the mechanism is still not perfect.
>> It's usually the guilty one, but it's not guaranteed.  For example,
>> say you have a ROCm user queue and a gfx job submitted to a kernel
>> queue.  The actual guilty job may be the ROCm user queue, but the
>> driver may not detect that the ROCm queue was hung until some other
>> event (e.g., memory pressure).  However, the timer for the gfx job may
>> timeout before that happens on the ROCm queue so in that case the gfx
>> job would be incorrectly considered guilty.
> So it boils down to what are the chances of that happening and whether
> it's significant enough to open the door for API abuse.

We are also working on an enforce_isolation parameter for amdgpu which only allows one application at a time to use the hardware, at the cost of performance.

The problem is simply that when you don't allow multiple applications to use the HW at the same time, you also don't get full utilization of the different HW blocks.

It can also be that a crash only happens because two applications do something at the same time which is not supposed to happen at the same time. Those issues are then usually fixed by firmware updates, but they are really, really hard to debug.

I don't see much potential for abuse here, since you can't easily control from userspace when a lockup happens. And the AMD Linux team has recently made quite a few requests to the HW/FW engineers which should improve the situation on future HW generations.

So I think we should probably just document that the reliability of this information is driver- and hardware-specific and should be taken with a grain of salt, and call it a day.

Regards,
Christian.

>
> Considering this is amd specific accuracy, it's still an open question
> how other drivers are/will be managing it.
>
> Raag



* Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info
  2025-03-12  8:25               ` Christian König
@ 2025-03-12  9:13                 ` Raag Jadav
  0 siblings, 0 replies; 19+ messages in thread
From: Raag Jadav @ 2025-03-12  9:13 UTC (permalink / raw)
  To: Christian König
  Cc: Alex Deucher, André Almeida, dri-devel, linux-kernel,
	kernel-dev, amd-gfx, intel-xe, intel-gfx, Alex Deucher, siqueira,
	airlied, simona, rodrigo.vivi, jani.nikula, Xaver Hugl

On Wed, Mar 12, 2025 at 09:25:08AM +0100, Christian König wrote:
>Am 11.03.25 um 18:13 schrieb Raag Jadav:
>> On Mon, Mar 10, 2025 at 06:03:27PM -0400, Alex Deucher wrote:
>>> On Mon, Mar 10, 2025 at 5:54 PM André Almeida <andrealmeid@igalia.com> wrote:
>>>> Em 01/03/2025 03:04, Raag Jadav escreveu:
>>>>> On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
>>>>>> Hi Raag,
>>>>>>
>>>>>> On 2/28/25 11:58, Raag Jadav wrote:
>>>>>>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
>>>>>>>> To notify userspace about which app (if any) made the device get in a
>>>>>>>> wedge state, make use of drm_wedge_app_info parameter, filling it with
>>>>>>>> the app PID and name.
>>>>>>>>
>>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>>>>> ---
>>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
>>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
>>>>>>>>    2 files changed, 22 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 00b9b87dafd8..e06adf6f34fd 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>            atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>> -  if (!r)
>>>>>>>> -          drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>>>>>>>> +  if (!r) {
>>>>>>>> +          struct drm_wedge_app_info aux, *info = NULL;
>>>>>>>> +
>>>>>>>> +          if (job) {
>>>>>>>> +                  struct amdgpu_task_info *ti;
>>>>>>>> +
>>>>>>>> +                  ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
>>>>>>>> +                  if (ti) {
>>>>>>>> +                          aux.pid = ti->pid;
>>>>>>>> +                          aux.comm = ti->process_name;
>>>>>>>> +                          info = &aux;
>>>>>>>> +                          amdgpu_vm_put_task_info(ti);
>>>>>>>> +                  }
>>>>>>>> +          }
>>>>>>> Is this guaranteed to be guilty app and not some scheduled worker?
>>>>>> This is how amdgpu decides which app is the guilty one earlier in the code
>>>>>> as in the print:
>>>>>>
>>>>>>      ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
>>>>>>
>>>>>>      "Process information: process %s pid %d thread %s pid %d\n"
>>>>>>
>>>>>> So I think it's consistent with what the driver thinks it's the guilty
>>>>>> process.
>>>>> Sure, but with something like app_info we're kind of hinting to userspace
>>>>> that an application was _indeed_ involved with reset. Is that also guaranteed?
>>>>>
>>>>> Is it possible that an application needlessly suffers from a false positive
>>>>> scenario (reset due to other factors)?
>>>>>
>>>> I asked Alex Deucher in IRC about that and yes, there's a chance that
>>>> this is a false positive. However, for the majority of cases this is the
>>>> right app that caused the hang. This is what amdgpu is doing for GL
>>>> robustness as well and devcoredump, so it's very consistent with how
>>>> amdgpu deals with this scenario even if the mechanism is still not perfect.
>>> It's usually the guilty one, but it's not guaranteed.  For example,
>>> say you have a ROCm user queue and a gfx job submitted to a kernel
>>> queue.  The actual guilty job may be the ROCm user queue, but the
>>> driver may not detect that the ROCm queue was hung until some other
>>> event (e.g., memory pressure).  However, the timer for the gfx job may
>>> timeout before that happens on the ROCm queue so in that case the gfx
>>> job would be incorrectly considered guilty.
>> So it boils down to what the chances are of that happening and whether
>> it's significant enough to open the door for API abuse.
> 
> We are also working on an enforce_isolation parameter for amdgpu which only allows one application at a time to use the hardware, at the cost of performance.
> 
> The problem is simply that when you don't allow multiple applications to use the HW at the same time, you also don't get full utilization of the different HW blocks.
> 
> It can also be that a crash only happens because two applications do something concurrently which is not supposed to happen at the same time. Those issues are then usually fixed by firmware updates, but are really hard to debug.
> 
> I don't see much potential for abuse here since you can't easily control from userspace when a lockup happens. And the AMD Linux team has recently made quite a few requests to the HW/FW engineers which should improve the situation on future HW generations.
> 
> So I think we should probably just document that the reliability of this information is driver- and hardware-specific and should be taken with a grain of salt, and call it a day.

Sounds reasonable. The new event string will need its own chapter anyway.

Raag

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] drm: Create an app info option for wedge events
  2025-03-11 17:09           ` Raag Jadav
@ 2025-03-12 10:06             ` Raag Jadav
  2025-03-12 21:59               ` André Almeida
  0 siblings, 1 reply; 19+ messages in thread
From: Raag Jadav @ 2025-03-12 10:06 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, lucas.demarchi, Xaver Hugl

On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:
> On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
> > Em 01/03/2025 02:53, Raag Jadav escreveu:
> > > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> > > > Hi Raag,
> > > > 
> > > > On 2/28/25 11:20, Raag Jadav wrote:
> > > > > Cc: Lucas
> > > > > 
> > > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > > > > When a device gets wedged, it might be caused by a guilty application.
> > > > > > For userspace, knowing which app was the cause can be useful in some
> > > > > > situations, like implementing a policy, logging, or giving the
> > > > > > compositor a chance to let the user know which app caused the problem.
> > > > > > This is an optional argument; when `PID=-1` there's no information
> > > > > > about which app caused the problem, or whether any app was involved in
> > > > > > the hang.
> > > > > > 
> > > > > > Sometimes just the PID isn't enough, given that the app might already
> > > > > > be dead by the time userspace tries to check that PID's name, so to
> > > > > > make life easier also notify userspace of the app's name in the event.
> > > > > > 
> > > > > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > 
> > [...]
> > 
> > > > > >    	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > > > >    	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> > > > > >    		 "but recovered through reset" : "needs recovery");
> > > > > > +	if (info) {
> > > > > > +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> > > > > > +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> > > > > > +	} else {
> > > > > > +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> > > > > > +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> > > > > > +	}
> > > > > This is not of much use for wedge cases that need recovery, since at
> > > > > that point userspace will need to clean house anyway.
> > > > > 
> > > > > Which leaves us with only the 'none' case and perhaps the need for
> > > > > standardization of "optional telemetry collection".
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > I had the feeling that 'none' was already meant to be used for that. Do you
> > > > think we should move to another naming? Given that we didn't reach the merge
> > > > window yet we could potentially change that name without much damage.
> > > 
> > > No, I meant thoughts on possible telemetry data that the drivers might
> > > think is useful for userspace (along with PID) and can be presented in
> > > a vendor agnostic manner (just like wedged event).
> > 
> > I'm not sure I agree that this will only be used for telemetry and for the
> > `none` use case. As stated by Xaver, there's a use case for knowing which
> > app caused the device to get wedged (like switching to software rendering)
> > and for displaying something for the user after the recovery is done (e.g.
> > "The game <app name> stopped working and Plasma has reset").
> 
> Sure, but since this information is already available in coredump, I was
> hoping to have something like a standardized DRM level coredump with both
> vendor specific and agnostic sections, which the drivers can use (and
> hopefully transition to) in conjunction with the wedged event to provide
> wider telemetry that is useful for all wedge cases.

This is more useful because:

1. It gives drivers an opportunity to present the telemetry that they are
   interested in without needing to add a new event string (like PID or APP)
   for their case.

2. When we consider wedging as a usecase, there's a lot more that goes
   into it than an application that might be behaving strangely. So wider
   telemetry is what I would hope to look at in such a scenario.

Raag

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] drm: Create an app info option for wedge events
  2025-03-12 10:06             ` Raag Jadav
@ 2025-03-12 21:59               ` André Almeida
  2025-03-13 11:07                 ` Raag Jadav
  0 siblings, 1 reply; 19+ messages in thread
From: André Almeida @ 2025-03-12 21:59 UTC (permalink / raw)
  To: Raag Jadav
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, lucas.demarchi, Xaver Hugl,
	Pierre-Loup Griffais

Em 12/03/2025 07:06, Raag Jadav escreveu:
> On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:
>> On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
>>> Em 01/03/2025 02:53, Raag Jadav escreveu:
>>>> On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
>>>>> Hi Raag,
>>>>>
>>>>> On 2/28/25 11:20, Raag Jadav wrote:
>>>>>> Cc: Lucas
>>>>>>
>>>>>> On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
>>>>>>> When a device gets wedged, it might be caused by a guilty application.
>>>>>>> For userspace, knowing which app was the cause can be useful in some
>>>>>>> situations, like implementing a policy, logging, or giving the
>>>>>>> compositor a chance to let the user know which app caused the problem.
>>>>>>> This is an optional argument; when `PID=-1` there's no information
>>>>>>> about which app caused the problem, or whether any app was involved in
>>>>>>> the hang.
>>>>>>>
>>>>>>> Sometimes just the PID isn't enough, given that the app might already
>>>>>>> be dead by the time userspace tries to check that PID's name, so to
>>>>>>> make life easier also notify userspace of the app's name in the event.
>>>>>>>
>>>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>
>>> [...]
>>>
>>>>>>>     	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>>>>>>> @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
>>>>>>>     	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>>>>>>>     		 "but recovered through reset" : "needs recovery");
>>>>>>> +	if (info) {
>>>>>>> +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
>>>>>>> +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
>>>>>>> +	} else {
>>>>>>> +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
>>>>>>> +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
>>>>>>> +	}
>>>>>> This is not of much use for wedge cases that need recovery, since at
>>>>>> that point userspace will need to clean house anyway.
>>>>>>
>>>>>> Which leaves us with only the 'none' case and perhaps the need for
>>>>>> standardization of "optional telemetry collection".
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> I had the feeling that 'none' was already meant to be used for that. Do you
>>>>> think we should move to another naming? Given that we didn't reach the merge
>>>>> window yet we could potentially change that name without much damage.
>>>>
>>>> No, I meant thoughts on possible telemetry data that the drivers might
>>>> think is useful for userspace (along with PID) and can be presented in
>>>> a vendor agnostic manner (just like wedged event).
>>>
>>> I'm not sure I agree that this will only be used for telemetry and for the
>>> `none` use case. As stated by Xaver, there's a use case for knowing which
>>> app caused the device to get wedged (like switching to software rendering)
>>> and for displaying something for the user after the recovery is done (e.g.
>>> "The game <app name> stopped working and Plasma has reset").
>>
>> Sure, but since this information is already available in coredump, I was
>> hoping to have something like a standardized DRM level coredump with both
>> vendor specific and agnostic sections, which the drivers can (and hopefully
>> transition to) use in conjunction with wedged event to provide wider
>> telemetry and is useful for all wedge cases.
> 
> This is more useful because,
> 
> 1. It gives drivers an opportunity to present the telemetry that they are
>     interested in without needing to add a new event string (like PID or APP)
>     for their case.
> 
> 2. When we consider wedging as a usecase, there's a lot more that goes
>     into it than an application that might be behaving strangely. So a wider
>     telemetry is what I would hope to look at in such a scenario.
> 

I agree that coredump is the way to go for telemetry; we already have 
the name and PID of the guilty app there, along with more information 
about the GPU state. But I don't think it should be consumed like 
uAPI. Even if we wire up some common DRM code for that, I don't think 
we can guarantee its stability as we can for an uevent. coredump can 
also be disabled, and by default it is only accessible by root.

So I think that coredump is really good after the fact, when the user 
is willing to go ahead and report a bug somewhere. But for the goal of 
notifying the compositor, the same uevent that the compositor is 
already listening to will have everything it needs to deal with this 
reset.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] drm: Create an app info option for wedge events
  2025-03-12 21:59               ` André Almeida
@ 2025-03-13 11:07                 ` Raag Jadav
  0 siblings, 0 replies; 19+ messages in thread
From: Raag Jadav @ 2025-03-13 11:07 UTC (permalink / raw)
  To: André Almeida
  Cc: dri-devel, linux-kernel, kernel-dev, amd-gfx, intel-xe, intel-gfx,
	Alex Deucher, 'Christian König', siqueira, airlied,
	simona, rodrigo.vivi, jani.nikula, lucas.demarchi, Xaver Hugl,
	Pierre-Loup Griffais

On Wed, Mar 12, 2025 at 06:59:33PM -0300, André Almeida wrote:
> Em 12/03/2025 07:06, Raag Jadav escreveu:
> > On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:
> > > On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
> > > > Em 01/03/2025 02:53, Raag Jadav escreveu:
> > > > > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> > > > > > Hi Raag,
> > > > > > 
> > > > > > On 2/28/25 11:20, Raag Jadav wrote:
> > > > > > > Cc: Lucas
> > > > > > > 
> > > > > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > > > > > > When a device gets wedged, it might be caused by a guilty application.
> > > > > > > > For userspace, knowing which app was the cause can be useful in some
> > > > > > > > situations, like implementing a policy, logging, or giving the
> > > > > > > > compositor a chance to let the user know which app caused the problem.
> > > > > > > > This is an optional argument; when `PID=-1` there's no information
> > > > > > > > about which app caused the problem, or whether any app was involved in
> > > > > > > > the hang.
> > > > > > > > 
> > > > > > > > Sometimes just the PID isn't enough, given that the app might already
> > > > > > > > be dead by the time userspace tries to check that PID's name, so to
> > > > > > > > make life easier also notify userspace of the app's name in the event.
> > > > > > > > 
> > > > > > > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > > 
> > > > [...]
> > > > 
> > > > > > > >     	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > > > > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > > > > > >     	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> > > > > > > >     		 "but recovered through reset" : "needs recovery");
> > > > > > > > +	if (info) {
> > > > > > > > +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> > > > > > > > +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> > > > > > > > +	} else {
> > > > > > > > +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> > > > > > > > +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> > > > > > > > +	}
> > > > > > > This is not of much use for wedge cases that need recovery, since at
> > > > > > > that point userspace will need to clean house anyway.
> > > > > > > 
> > > > > > > Which leaves us with only the 'none' case and perhaps the need for
> > > > > > > standardization of "optional telemetry collection".
> > > > > > > 
> > > > > > > Thoughts?
> > > > > > 
> > > > > > I had the feeling that 'none' was already meant to be used for that. Do you
> > > > > > think we should move to another naming? Given that we didn't reach the merge
> > > > > > window yet we could potentially change that name without much damage.
> > > > > 
> > > > > No, I meant thoughts on possible telemetry data that the drivers might
> > > > > think is useful for userspace (along with PID) and can be presented in
> > > > > a vendor agnostic manner (just like wedged event).
> > > > 
> > > > I'm not sure I agree that this will only be used for telemetry and for the
> > > > `none` use case. As stated by Xaver, there's a use case for knowing which
> > > > app caused the device to get wedged (like switching to software rendering)
> > > > and for displaying something for the user after the recovery is done (e.g.
> > > > "The game <app name> stopped working and Plasma has reset").
> > > 
> > > Sure, but since this information is already available in coredump, I was
> > > hoping to have something like a standardized DRM level coredump with both
> > > vendor specific and agnostic sections, which the drivers can (and hopefully
> > > transition to) use in conjunction with wedged event to provide wider
> > > telemetry and is useful for all wedge cases.
> > 
> > This is more useful because,
> > 
> > 1. It gives drivers an opportunity to present the telemetry that they are
> >     interested in without needing to add a new event string (like PID or APP)
> >     for their case.
> > 
> > 2. When we consider wedging as a usecase, there's a lot more that goes
> >     into it than an application that might be behaving strangely. So a wider
> >     telemetry is what I would hope to look at in such a scenario.
> > 
> 
> I agree that coredump is the way to go for telemetry; we already have the
> name and PID of the guilty app there, along with more information about the
> GPU state. But I don't think it should be consumed like uAPI. Even if we
> wire up some common DRM code for that, I don't think we can guarantee its
> stability as we can for an uevent. coredump can also be disabled, and by
> default it is only accessible by root.

Hm, this made me curious about how a PID belonging to a specific user will
be dealt with in a multi-user scenario. I know it's not a common scenario
for this usecase, but setting the rules to avoid side effects (or could
there be any?) might be a good idea.

> So I think that coredump is really good after the fact, when the user is
> willing to go ahead and report a bug somewhere. But for the goal of
> notifying the compositor, the same uevent that the compositor is already
> listening to will have everything it needs to deal with this reset.

I agree that having to deal with coredump would be cumbersome for this
usecase. Although I'm still leaning towards the idea that we should leave
room for new usecases that will probably want to expose new data, and
having to add a new string each time might not be the best approach.

But that's just my opinion, and we should probably wait for wider feedback.

Raag

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-03-13 11:08 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-28 12:13 [PATCH 0/2] drm: Create an app info option for wedge events André Almeida
2025-02-28 12:13 ` [PATCH 1/2] " André Almeida
2025-02-28 14:20   ` Raag Jadav
2025-02-28 21:54     ` André Almeida
2025-03-01  5:53       ` Raag Jadav
2025-03-10 21:27         ` André Almeida
2025-03-11 17:09           ` Raag Jadav
2025-03-12 10:06             ` Raag Jadav
2025-03-12 21:59               ` André Almeida
2025-03-13 11:07                 ` Raag Jadav
2025-02-28 12:13 ` [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
2025-02-28 14:58   ` Raag Jadav
2025-02-28 21:49     ` André Almeida
2025-03-01  6:04       ` Raag Jadav
2025-03-10 21:53         ` André Almeida
2025-03-10 22:03           ` Alex Deucher
2025-03-11 17:13             ` Raag Jadav
2025-03-12  8:25               ` Christian König
2025-03-12  9:13                 ` Raag Jadav

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox