Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Simona Vetter <simona.vetter@ffwll.ch>
To: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: "Raag Jadav" <raag.jadav@intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Simona Vetter" <simona.vetter@ffwll.ch>,
	"Riana Tauro" <riana.tauro@intel.com>,
	intel-xe@lists.freedesktop.org, anshuman.gupta@intel.com,
	lucas.demarchi@intel.com, aravind.iddamsetty@linux.intel.com,
	umesh.nerlige.ramappa@intel.com, frank.scarbrough@intel.com,
	sk.anirban@intel.com, "André Almeida" <andrealmeid@igalia.com>,
	"David Airlie" <airlied@gmail.com>,
	dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
Date: Thu, 10 Jul 2025 11:01:36 +0200	[thread overview]
Message-ID: <aG-BcFN6M9BtjB2j@phenom.ffwll.local> (raw)
In-Reply-To: <aG6eNcygPshsSlC8@intel.com>

On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > On 09.07.25 15:41, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > >> Certain errors can cause the device to be wedged and may
> > > >> require a vendor specific recovery method to restore normal
> > > >> operation.
> > > >>
> > > >> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > >> must provide additional recovery documentation if this method
> > > >> is used.
> > > >>
> > > >> v2: fix documentation (Raag)
> > > >>
> > > >> Cc: André Almeida <andrealmeid@igalia.com>
> > > >> Cc: Christian König <christian.koenig@amd.com>
> > > >> Cc: David Airlie <airlied@gmail.com>
> > > >> Cc: <dri-devel@lists.freedesktop.org>
> > > >> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > >> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > > 
> > > > I'm not really understanding what this is useful for, maybe concrete
> > > > example in the form of driver code that uses this, and some tool or
> > > > documentation steps that should be taken for recovery?
> 
> The case here is when FW underneath identified something badly corrupted on
> FW land and decided that only a firmware-flashing could solve the day and
> raise interrupt to the driver. At that point we want to wedge, but immediately
> hint the admin the recommended action.
> 
> > > 
> > > The recovery method for this particular case is to flash in a new firmware.
> > > 
> > > > The issues I'm seeing here is that eventually we'll get different
> > > > vendor-specific recovery steps, and maybe even on the same device, and
> > > > that leads us to an enumeration issue. Since it's just a string and an
> > > > enum I think it'd be better to just allocate a new one every time there's
> > > > a new strange recovery method instead of this opaque approach.
> > > 
> > > That is exactly the opposite of what we discussed so far.

Sorry, I missed that context.

> > > The original idea was to add a firmware-flush recovery method which
> > > looked a bit wage since it didn't give any information on what to do
> > > exactly.
> > > 
> > > That's why I suggested to add a more generic vendor-specific event
> > > with refers to the documentation and system log to see what actually
> > > needs to be done.
> > > 
> > > Otherwise we would end up with events like firmware-flash, update FW
> > > image A, update FW image B, FW version mismatch etc....

Yeah, that's kinda what I expect to happen, and we have enough numbers for
this all to not be an issue.

> > Agree. Any newly allocated method that is specific to a vendor is going to
> > be opaque anyway, since it can't be generic for all drivers. This just helps
> > reduce the noise in DRM core.
> > 
> > And yes, there could be different vendor-specific cases for the same driver
> > and the driver should be able to provide the means to distinguish between
> > them.
> 
> Sim, what's your take on this then?
> 
> Should we get back to the original idea of firmware-flash?

Maybe intel-firmware-flash or something, meaning prefix with the vendor?

The reason I think it should be specific is because I'm assuming you want
to script this. And if you have a big fleet with different vendors, then
"vendor-specific" doesn't tell you enough. But if it's something like
$vendor-$magic_step then it does become scriptable, and we do have have a
place to put some documentation on what you should do instead.

If the point of this interface isn't that it's scriptable, then I'm not
sure why it needs to be an uevent?

I guess if you all want to stick with vendor-specific then I think that's
ok with me too, but the docs should at least explain how to figure out
from the uevent which vendor you're on with a small example. What I'm
worried is that if we have this on multiple drivers userspace will
otherwise make a complete mess and might want to run the wrong recovery
steps.

I think ideally, no matter what, we'd have a concrete driver patch which
then also comes with the documentation for what exactly you're supposed to
do as something you can script. And not just this stand-alone patch here.

Cheers, Sima
> 
> > 
> > Raag
> > 
> > > >> ---
> > > >>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > >>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > > >>  include/drm/drm_device.h       | 4 ++++
> > > >>  3 files changed, 11 insertions(+), 4 deletions(-)
> > > >>
> > > >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > >> index 263e5a97c080..c33070bdb347 100644
> > > >> --- a/Documentation/gpu/drm-uapi.rst
> > > >> +++ b/Documentation/gpu/drm-uapi.rst
> > > >> @@ -421,10 +421,10 @@ Recovery
> > > >>  Current implementation defines three recovery methods, out of which, drivers
> > > >>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > > >>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > >> -more side-effects. If driver is unsure about recovery or method is unknown
> > > >> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > >> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > >> -will be sent instead.
> > > >> +more side-effects. If recovery method is specific to vendor
> > > >> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > >> +specific documentation for further recovery steps. If driver is unsure about
> > > >> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > >>  
> > > >>  Userspace consumers can parse this event and attempt recovery as per the
> > > >>  following expectations.
> > > >> @@ -435,6 +435,7 @@ following expectations.
> > > >>      none            optional telemetry collection
> > > >>      rebind          unbind + bind driver
> > > >>      bus-reset       unbind + bus reset/re-enumeration + bind
> > > >> +    vendor-specific vendor specific recovery method
> > > >>      unknown         consumer policy
> > > >>      =============== ========================================
> > > >>  
> > > >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > >> index cdd591b11488..0ac723a46a91 100644
> > > >> --- a/drivers/gpu/drm/drm_drv.c
> > > >> +++ b/drivers/gpu/drm/drm_drv.c
> > > >> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > >>  		return "rebind";
> > > >>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > >>  		return "bus-reset";
> > > >> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > >> +		return "vendor-specific";
> > > >>  	default:
> > > >>  		return NULL;
> > > >>  	}
> > > >> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > >> index 08b3b2467c4c..08a087f149ff 100644
> > > >> --- a/include/drm/drm_device.h
> > > >> +++ b/include/drm/drm_device.h
> > > >> @@ -26,10 +26,14 @@ struct pci_controller;
> > > >>   * Recovery methods for wedged device in order of less to more side-effects.
> > > >>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > >>   * use any one, multiple (or'd) or none depending on their needs.
> > > >> + *
> > > >> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > >> + * details.
> > > >>   */
> > > >>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > >>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > >>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > >> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > >>  
> > > >>  /**
> > > >>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > >> -- 
> > > >> 2.47.1
> > > >>
> > > > 
> > > 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

next prev parent reply	other threads:[~2025-07-10  9:01 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-09 11:20 [PATCH v4 0/9] Handle Firmware reported Hardware Errors Riana Tauro
2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
2025-07-09 13:41   ` Simona Vetter
2025-07-09 14:09     ` Christian König
2025-07-09 14:18       ` Raag Jadav
2025-07-09 16:52         ` Rodrigo Vivi
2025-07-10  9:01           ` Simona Vetter [this message]
2025-07-10  9:37             ` Christian König
2025-07-10 10:24               ` Raag Jadav
2025-07-10 19:00                 ` Rodrigo Vivi
2025-07-10 21:46                   ` Raag Jadav
2025-07-11  5:17                     ` Riana Tauro
2025-07-11  6:08                       ` Raag Jadav
2025-07-11  8:56                   ` Simona Vetter
2025-07-11  8:59               ` Simona Vetter
2025-07-14  5:27                 ` Riana Tauro
2025-07-14 12:33                   ` Simona Vetter
2025-07-09 14:46     ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 2/9] drm/xe: Set GT as wedged before sending " Riana Tauro
2025-07-09 17:26   ` Matthew Brost
2025-07-09 11:20 ` [PATCH v4 3/9] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-07-09 11:20 ` [PATCH v4 4/9] drm/xe/xe_survivability: Refactor survivability mode Riana Tauro
2025-07-09 11:20 ` [PATCH v4 5/9] drm/xe/xe_survivability: Add support for Runtime " Riana Tauro
2025-07-09 23:44   ` Umesh Nerlige Ramappa
2025-07-10  5:59     ` Riana Tauro
2025-07-10 17:12       ` Umesh Nerlige Ramappa
2025-07-11  5:23         ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 6/9] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
2025-07-11  5:39   ` Raag Jadav
2025-07-11  6:09     ` Riana Tauro
2025-07-12  5:45       ` Raag Jadav
2025-07-14  9:04         ` Riana Tauro
2025-07-09 11:20 ` [PATCH v4 7/9] drm/xe: Add support to handle hardware errors Riana Tauro
2025-07-10 21:09   ` Umesh Nerlige Ramappa
2025-07-11  5:35     ` Riana Tauro
2025-07-11 17:34       ` Umesh Nerlige Ramappa
2025-07-09 11:20 ` [PATCH v4 8/9] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-07-11  0:36   ` Umesh Nerlige Ramappa
2025-07-11  5:46     ` Riana Tauro
2025-07-11 17:38       ` Umesh Nerlige Ramappa
2025-07-09 11:20 ` [PATCH v4 9/9] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
2025-07-11 17:41   ` Umesh Nerlige Ramappa
2025-07-14  7:07     ` Riana Tauro
2025-07-09 12:28 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev4) Patchwork
2025-07-09 12:30 ` ✓ CI.KUnit: success " Patchwork
2025-07-09 12:44 ` ✗ CI.checksparse: warning " Patchwork
2025-07-09 13:06 ` ✓ Xe.CI.BAT: success " Patchwork
2025-07-09 15:02 ` ✗ Xe.CI.Full: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aG-BcFN6M9BtjB2j@phenom.ffwll.local \
    --to=simona.vetter@ffwll.ch \
    --cc=airlied@gmail.com \
    --cc=andrealmeid@igalia.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=frank.scarbrough@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=raag.jadav@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=sk.anirban@intel.com \
    --cc=umesh.nerlige.ramappa@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox