dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: "Maxime Ripard" <mripard@kernel.org>,
	maarten.lankhorst@linux.intel.com, tzimmermann@suse.de,
	intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	anshuman.gupta@intel.com, lucas.demarchi@intel.com,
	aravind.iddamsetty@linux.intel.com, raag.jadav@intel.com,
	umesh.nerlige.ramappa@intel.com, frank.scarbrough@intel.com,
	sk.anirban@intel.com, simona.vetter@ffwll.ch,
	"André Almeida" <andrealmeid@igalia.com>,
	"Christian König" <christian.koenig@amd.com>,
	"David Airlie" <airlied@gmail.com>
Subject: Re: [PATCH v8 02/10] drm: Add a vendor-specific recovery method to drm device wedged uevent
Date: Thu, 21 Aug 2025 11:17:13 -0400	[thread overview]
Message-ID: <aKc4eSmYqhSWgwMI@intel.com> (raw)
In-Reply-To: <20ee25be-90cf-4a00-8ffa-0b22c7a1b493@intel.com>

On Thu, Aug 21, 2025 at 08:01:39PM +0530, Riana Tauro wrote:
> Hi Maxime
> 
> This patch has the changes suggested wrt to documentation in v7. Have added
> whatever Rodrigo suggested in the doc. Please let us know if the changes are
> okay and if the patch can be merged.

Maxime already told he is okay with the changes:
https://lore.kernel.org/dri-devel/5hkngbuzoryldvjrtjwalxhosdhtweeinpjpyguzltjmee7mpu@vw44iwczytw5/

> 
> This needs a drm-misc maintainer ack to go ahead.

But we still need some formal ack here to move ahead with this patch indeed.

> 
> Thanks
> Riana
> 
> On 8/17/2025 9:55 PM, Rodrigo Vivi wrote:
> > On Thu, Aug 14, 2025 at 05:44:32PM +0530, Riana Tauro wrote:
> > > Address the need for a recovery method (firmware flash on Firmware errors)
> > > introduced in the later patches of Xe KMD.
> > > Whenever XE KMD detects a firmware error, a firmware flash is required to
> > > recover the device to normal operation.
> > > 
> > > The initial proposal to use 'firmware-flash' as a recovery method was
> > > not applicable to other drivers and could cause multiple recovery
> > > methods specific to vendors to be added.
> > > To address this a more generic 'vendor-specific' method is introduced,
> > > guiding users to refer to vendor specific documentation and system logs
> > > for detailed vendor specific recovery procedure.
> > > 
> > > Add a recovery method 'WEDGED=vendor-specific' for such errors.
> > > Vendors must provide additional recovery documentation if this method
> > > is used.
> > > 
> > > It is the responsibility of the consumer to refer to the correct vendor
> > > specific documentation and usecase before attempting a recovery.
> > > 
> > > For example: If driver is XE KMD, the consumer must refer
> > > to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
> > > 
> > > v2: fix documentation (Raag)
> > > v3: add more details to commit message (Sima, Rodrigo, Raag)
> > >      add an example script to the documentation (Raag)
> > > v4: use consistent naming (Raag)
> > > v5: fix commit message
> > > v6: add more documentation
> > > 
> > > Cc: André Almeida <andrealmeid@igalia.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: David Airlie <airlied@gmail.com>
> > > Cc: Simona Vetter <simona.vetter@ffwll.ch>
> > 
> > Cc: Maxime Ripard <mripard@kernel.org>
> > 
> > Folks, is it clear now? can we move ahead and get this through drm-xe-next?
> > 
> > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > ---
> > >   Documentation/gpu/drm-uapi.rst | 47 +++++++++++++++++++++++++++++-----
> > >   drivers/gpu/drm/drm_drv.c      |  2 ++
> > >   include/drm/drm_device.h       |  4 +++
> > >   3 files changed, 46 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 843facf01b2d..669a6b9da0b2 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -418,13 +418,12 @@ needed.
> > >   Recovery
> > >   --------
> > > -Current implementation defines three recovery methods, out of which, drivers
> > > +Current implementation defines four recovery methods, out of which, drivers
> > >   can use any one, multiple or none. Method(s) of choice will be sent in the
> > >   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > -more side-effects. If driver is unsure about recovery or method is unknown
> > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > -will be sent instead.
> > > +more side-effects. See the section `Vendor Specific Recovery`_
> > > +for ``WEDGED=vendor-specific``. If driver is unsure about recovery or
> > > +method is unknown, ``WEDGED=unknown`` will be sent instead.
> > >   Userspace consumers can parse this event and attempt recovery as per the
> > >   following expectations.
> > > @@ -435,6 +434,7 @@ following expectations.
> > >       none            optional telemetry collection
> > >       rebind          unbind + bind driver
> > >       bus-reset       unbind + bus reset/re-enumeration + bind
> > > +    vendor-specific vendor specific recovery method
> > >       unknown         consumer policy
> > >       =============== ========================================
> > > @@ -446,6 +446,35 @@ telemetry information (devcoredump, syslog). This is useful because the first
> > >   hang is usually the most critical one which can result in consequential hangs or
> > >   complete wedging.
> > > +
> > > +Vendor Specific Recovery
> > > +------------------------
> > > +
> > > +When ``WEDGED=vendor-specific`` is sent, it indicates that the device requires
> > > +a recovery procedure specific to the hardware vendor and is not one of the
> > > +standardized approaches.
> > > +
> > > +``WEDGED=vendor-specific`` may be used to indicate different cases within a
> > > +single vendor driver, each requiring a distinct recovery procedure.
> > > +In such scenarios, the vendor driver must provide comprehensive documentation
> > > +that describes each case, includes additional hints to identify specific case and
> > > +outlines the corresponding recovery procedures. The documentation includes:
> > > +
> > > +Case - A list of all cases that sends the ``WEDGED=vendor-specific`` recovery method.
> > > +
> > > +Hints - Additional Information to assist the userspace consumer in identifying and
> > > +differentiating between different cases. This can be exposed through sysfs, debugfs,
> > > +traces, dmesg etc.
> > > +
> > > +Recovery Procedure - Clear instructions and guidance for recovering each case.
> > > +This may include userspace scripts, tools needed for the recovery procedure.
> > > +
> > > +It is the responsibility of the admin/userspace consumer to identify the case and
> > > +verify additional identification hints before attempting a recovery procedure.
> > > +
> > > +Example: If the device uses the Xe driver, then userspace consumer should refer to
> > > +:ref:`Xe Device Wedging <xe-device-wedging>` for the detailed documentation.
> > > +
> > >   Task information
> > >   ----------------
> > > @@ -472,8 +501,12 @@ erroring out, all device memory should be unmapped and file descriptors should
> > >   be closed to prevent leaks or undefined behaviour. The idea here is to clear the
> > >   device of all user context beforehand and set the stage for a clean recovery.
> > > -Example
> > > --------
> > > +For ``WEDGED=vendor-specific`` recovery method, it is the responsibility of the
> > > +consumer to check the driver documentation and the usecase before attempting
> > > +a recovery.
> > > +
> > > +Example - rebind
> > > +----------------
> > >   Udev rule::
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index cdd591b11488..0ac723a46a91 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >   		return "rebind";
> > >   	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > >   		return "bus-reset";
> > > +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > +		return "vendor-specific";
> > >   	default:
> > >   		return NULL;
> > >   	}
> > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > index a33aedd5e9ec..59fd3f4d5995 100644
> > > --- a/include/drm/drm_device.h
> > > +++ b/include/drm/drm_device.h
> > > @@ -26,10 +26,14 @@ struct pci_controller;
> > >    * Recovery methods for wedged device in order of less to more side-effects.
> > >    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > >    * use any one, multiple (or'd) or none depending on their needs.
> > > + *
> > > + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > + * details.
> > >    */
> > >   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > >   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > >   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > >   /**
> > >    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > -- 
> > > 2.47.1
> > > 
> 

  reply	other threads:[~2025-08-21 15:17 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-14 12:14 [PATCH v8 00/10] Handle Firmware reported Hardware Errors Riana Tauro
2025-08-14 12:14 ` [PATCH v8 01/10] drm/xe: Add documentation for Xe Device Wedging Riana Tauro
2025-08-17 16:22   ` Rodrigo Vivi
2025-08-14 12:14 ` [PATCH v8 02/10] drm: Add a vendor-specific recovery method to drm device wedged uevent Riana Tauro
2025-08-17 16:25   ` Rodrigo Vivi
2025-08-21 14:31     ` Riana Tauro
2025-08-21 15:17       ` Rodrigo Vivi [this message]
2025-08-25 15:49         ` Maxime Ripard
2025-08-14 12:14 ` [PATCH v8 03/10] drm/xe: Set GT as wedged before sending " Riana Tauro
2025-08-14 12:14 ` [PATCH v8 04/10] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-08-14 12:14 ` [PATCH v8 05/10] drm/xe/xe_survivability: Refactor survivability mode Riana Tauro
2025-08-14 12:14 ` [PATCH v8 06/10] drm/xe/xe_survivability: Add support for Runtime " Riana Tauro
2025-08-14 12:14 ` [PATCH v8 07/10] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
2025-08-14 12:14 ` [PATCH v8 08/10] drm/xe: Add support to handle hardware errors Riana Tauro
2025-08-14 12:14 ` [PATCH v8 09/10] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-08-14 12:14 ` [PATCH v8 10/10] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aKc4eSmYqhSWgwMI@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=airlied@gmail.com \
    --cc=andrealmeid@igalia.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=frank.scarbrough@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=mripard@kernel.org \
    --cc=raag.jadav@intel.com \
    --cc=riana.tauro@intel.com \
    --cc=simona.vetter@ffwll.ch \
    --cc=sk.anirban@intel.com \
    --cc=tzimmermann@suse.de \
    --cc=umesh.nerlige.ramappa@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).