Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Raag Jadav <raag.jadav@intel.com>
To: "André Almeida" <andrealmeid@igalia.com>
Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
	kernel-dev@igalia.com, amd-gfx@lists.freedesktop.org,
	intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"'Christian König'" <christian.koenig@amd.com>,
	siqueira@igalia.com, airlied@gmail.com, simona@ffwll.ch,
	rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
	lucas.demarchi@intel.com
Subject: Re: [PATCH 1/2] drm: Create an app info option for wedge events
Date: Sat, 1 Mar 2025 07:53:06 +0200	[thread overview]
Message-ID: <Z8KgwswQQyGxhsR1@black.fi.intel.com> (raw)
In-Reply-To: <58763d8e-46a1-4753-9401-987fb3dac50b@igalia.com>

On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 2/28/25 11:20, Raag Jadav wrote:
> > Cc: Lucas
> > 
> > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > When a device get wedged, it might be caused by a guilty application.
> > > For userspace, knowing which app was the cause can be useful for some
> > > situations, like for implementing a policy, logs or for giving a chance
> > > for the compositor to let the user know what app caused the problem.
> > > This is an optional argument, when `PID=-1` there's no information about
> > > the app caused the problem, or if any app was involved during the hang.
> > > 
> > > Sometimes just the PID isn't enough giving that the app might be already
> > > dead by the time userspace will try to check what was this PID's name,
> > > so to make the life easier also notify what's the app's name in the user
> > > event.
> > > 
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
> > >   drivers/gpu/drm/drm_drv.c                  | 16 +++++++++++++---
> > >   drivers/gpu/drm/i915/gt/intel_reset.c      |  3 ++-
> > >   drivers/gpu/drm/xe/xe_device.c             |  3 ++-
> > >   include/drm/drm_device.h                   |  8 ++++++++
> > >   include/drm/drm_drv.h                      |  3 ++-
> > >   7 files changed, 29 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 24ba52d76045..00b9b87dafd8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > >   	atomic_set(&adev->reset_domain->reset_res, r);
> > >   	if (!r)
> > > -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > +		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >   	return r;
> > >   }
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > index ef1b77f1e88f..3ed9cbcab1ad 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> > >   			amdgpu_fence_driver_force_completion(ring);
> > >   			if (amdgpu_ring_sched_ready(ring))
> > >   				drm_sched_start(&ring->sched, 0);
> > > -			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > +			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >   			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> > >   			goto exit;
> > >   		}
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index 17fc5dc708f4..48faafd82a99 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >    * drm_dev_wedged_event - generate a device wedged uevent
> > >    * @dev: DRM device
> > >    * @method: method(s) to be used for recovery
> > > + * @info: optional information about the guilty app
> > >    *
> > >    * This generates a device wedged uevent for the DRM device specified by @dev.
> > >    * Recovery @method\(s) of choice will be sent in the uevent environment as
> > > @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >    *
> > >    * Returns: 0 on success, negative error code otherwise.
> > >    */
> > > -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> > > +			 struct drm_wedge_app_info *info)
> > >   {
> > >   	const char *recovery = NULL;
> > >   	unsigned int len, opt;
> > >   	/* Event string length up to 28+ characters with available methods */
> > > -	char event_string[32];
> > > -	char *envp[] = { event_string, NULL };
> > > +	char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN];
> > > +	char *envp[] = { event_string, pid_string, comm_string, NULL };
> > >   	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > >   	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> > >   		 "but recovered through reset" : "needs recovery");
> > > +	if (info) {
> > > +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> > > +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> > > +	} else {
> > > +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> > > +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> > > +	}
> > This is not much use for wedge cases that needs recovery, since at that point
> > the userspace will need to clean house anyway.
> > 
> > Which leaves us with only 'none' case and perhaps the need for standardization
> > of "optional telemetry collection".
> > 
> > Thoughts?
> 
> I had the feeling that 'none' was already meant to be used for that. Do you
> think we should move to another naming? Given that we didn't reach the merge
> window yet we could potentially change that name without much damage.

No, I meant thoughts on possible telemetry data that the drivers might
think is useful for userspace (along with PID) and can be presented in
a vendor agnostic manner (just like wedged event).

Raag

  reply	other threads:[~2025-03-01  5:53 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-28 12:13 [PATCH 0/2] drm: Create an app info option for wedge events André Almeida
2025-02-28 12:13 ` [PATCH 1/2] " André Almeida
2025-02-28 14:20   ` Raag Jadav
2025-02-28 21:54     ` André Almeida
2025-03-01  5:53       ` Raag Jadav [this message]
2025-03-10 21:27         ` André Almeida
2025-03-11 17:09           ` Raag Jadav
2025-03-12 10:06             ` Raag Jadav
2025-03-12 21:59               ` André Almeida
2025-03-13 11:07                 ` Raag Jadav
2025-02-28 12:13 ` [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info André Almeida
2025-02-28 14:58   ` Raag Jadav
2025-02-28 21:49     ` André Almeida
2025-03-01  6:04       ` Raag Jadav
2025-03-10 21:53         ` André Almeida
2025-03-10 22:03           ` Alex Deucher
2025-03-11 17:13             ` Raag Jadav
2025-03-12  8:25               ` Christian König
2025-03-12  9:13                 ` Raag Jadav
2025-02-28 13:54 ` ✗ CI.Patch_applied: failure for drm: Create an app info option for wedge events Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z8KgwswQQyGxhsR1@black.fi.intel.com \
    --to=raag.jadav@intel.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrealmeid@igalia.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=jani.nikula@linux.intel.com \
    --cc=kernel-dev@igalia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lucas.demarchi@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=simona@ffwll.ch \
    --cc=siqueira@igalia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox