From: Raag Jadav <raag.jadav@intel.com>
To: Riana Tauro <riana.tauro@intel.com>
Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
anshuman.gupta@intel.com, rodrigo.vivi@intel.com,
lucas.demarchi@intel.com, aravind.iddamsetty@linux.intel.com,
umesh.nerlige.ramappa@intel.com, frank.scarbrough@intel.com,
sk.anirban@intel.com, simona.vetter@ffwll.ch, airlied@gmail.com,
"André Almeida" <andrealmeid@igalia.com>,
"Christian König" <christian.koenig@amd.com>
Subject: Re: [PATCH v5 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
Date: Sun, 20 Jul 2025 14:47:32 +0300
Message-ID: <aHzXVKj_ZbIi6vg9@black.fi.intel.com>
In-Reply-To: <20250715104730.2109506-2-riana.tauro@intel.com>
On Tue, Jul 15, 2025 at 04:17:21PM +0530, Riana Tauro wrote:
> This patch addresses the need for a recovery method (firmware-update
> on Firmware errors) introduced in the later patches of Xe KMD. Whenever
> XE KMD detects a firmware error, a drm wedged uevent needs to be sent
> to the system administrator/userspace to trigger a firmware update.
>
> The initial proposal to use 'firmware-flash' as a recovery method was
> not applicable to other drivers and could cause multiple recovery
> methods specific to vendors to be added.
> To address this a more generic 'vendor-specific' method is introduced,
> guiding users to refer to vendor specific documentation and system logs
> for detailed vendor specific recovery mechanism.
>
> Add a recovery method 'WEDGED=vendor-specific' for such errors.
> Vendors must provide additional recovery documentation if this method
> is used.
>
> It is the responsibility of the consumer to refer to the correct vendor
> specific documentation and usecase before attempting a recovery.
>
> For example: If driver is XE KMD, the consumer must refer
> to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'
>
> Recovery script contributed by Raag
Thank you, and a full stop would make it even more awesome ;)
> v2: fix documentation (Raag)
> v3: add more details to commit message (Sima, Rodrigo, Raag)
> add an example to the documentation (by Raag)
>
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Co-developed-by: Raag Jadav <raag.jadav@intel.com>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
> Documentation/gpu/drm-uapi.rst | 41 +++++++++++++++++++++++++++++-----
> drivers/gpu/drm/drm_drv.c | 2 ++
> include/drm/drm_device.h | 4 ++++
> 3 files changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 843facf01b2d..a1a5a4de68ea 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -421,10 +421,13 @@ Recovery
> Current implementation defines three recovery methods, out of which, drivers
> can use any one, multiple or none. Method(s) of choice will be sent in the
> uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> -more side-effects. If driver is unsure about recovery or method is unknown
> -(like soft/hard system reboot, firmware flashing, physical device replacement
> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> -will be sent instead.
> +more side-effects. If recovery method is specific to vendor
> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> +specific documentation for further recovery steps. As an example if the driver
We'd want to be consistent with terminology, so it's either 'steps' ...
> +is 'xe' then the documentation of 'Device Wedging' of xe driver needs to be
> +referred for the recovery mechanism.
... or 'mechanism'. Personally I'd use 'procedure' and make it consistent
with the example, but up to you.
> +If driver is unsure about recovery or method is unknown, ``WEDGED=unknown``
> +will be sent instead
And again, punctuation please!
Raag
> Userspace consumers can parse this event and attempt recovery as per the
> following expectations.
> @@ -435,6 +438,7 @@ following expectations.
> none optional telemetry collection
> rebind unbind + bind driver
> bus-reset unbind + bus reset/re-enumeration + bind
> + vendor-specific vendor specific recovery method
> unknown consumer policy
> =============== ========================================
>
> @@ -472,8 +476,12 @@ erroring out, all device memory should be unmapped and file descriptors should
> be closed to prevent leaks or undefined behaviour. The idea here is to clear the
> device of all user context beforehand and set the stage for a clean recovery.
>
> -Example
> --------
> +For ``WEDGED=vendor-specific`` recovery method, it is the responsibility of the
> +consumer to check the driver documentation and the usecase before attempting
> +a recovery.
> +
> +Example - rebind
> +----------------
>
> Udev rule::
>
> @@ -491,6 +499,27 @@ Recovery script::
> echo -n $DEVICE > $DRIVER/unbind
> echo -n $DEVICE > $DRIVER/bind
>
> +Example - vendor-specific
> +-------------------------
> +
> +Udev rule::
> +
> + SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific", DEVPATH=="*/drm/card[0-9]",
> + RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}"
> +
> +Recovery script::
> +
> + #!/bin/sh
> +
> + DEVPATH=$(readlink -f /sys/$1/device)
> + DRIVERPATH=$(readlink -f $DEVPATH/driver)
> + DRIVER=$(basename $DRIVERPATH)
> +
> + if [ "$DRIVER" = "xe" ]; then
> + # Refer XE documentation and check usecase and recovery procedure
> + fi
> +
> +
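One note on the example script quoted above: in POSIX sh a `then` branch
that contains only a comment is a syntax error, so the script will not run
as written. Below is a minimal sketch of a runnable variant, not part of
the patch. The names SYSFS_ROOT and wedged_driver are hypothetical and
exist only so the lookup can be exercised outside of /sys. The logger(1)
calls are placeholders for whatever procedure the vendor documentation
actually prescribes.

```shell
#!/bin/sh
# Sketch only: SYSFS_ROOT and wedged_driver are hypothetical names used
# for illustration; they are not defined by the patch or by udev.

SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Print the name of the driver bound to the device behind a DRM card
# DEVPATH (as passed by the udev rule via $env{DEVPATH}).
wedged_driver() {
	dev=$(readlink -f "$SYSFS_ROOT/$1/device") || return 1
	# Bail out if no driver is currently bound to the device.
	[ -e "$dev/driver" ] || return 1
	drv=$(readlink -f "$dev/driver") || return 1
	basename "$drv"
}

# Only act when invoked with a DEVPATH argument.
if [ -n "${1:-}" ]; then
	DRIVER=$(wedged_driver "$1") || exit 1
	case "$DRIVER" in
	xe)
		# Placeholder: follow the xe 'Device Wedging' documentation here.
		logger -t drm-wedged "xe: vendor-specific recovery requested for $1"
		;;
	*)
		logger -t drm-wedged "$DRIVER: no vendor-specific recovery known"
		;;
	esac
fi
```

Using a `case` on the driver name keeps one script usable for several
vendors, and every branch holds at least one command so the shell accepts
it.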
> Customization
> -------------
>
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index cdd591b11488..0ac723a46a91 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> return "rebind";
> case DRM_WEDGE_RECOVERY_BUS_RESET:
> return "bus-reset";
> + case DRM_WEDGE_RECOVERY_VENDOR:
> + return "vendor-specific";
> default:
> return NULL;
> }
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index a33aedd5e9ec..59fd3f4d5995 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -26,10 +26,14 @@ struct pci_controller;
> * Recovery methods for wedged device in order of less to more side-effects.
> * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> * use any one, multiple (or'd) or none depending on their needs.
> + *
> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> + * details.
> */
> #define DRM_WEDGE_RECOVERY_NONE BIT(0) /* optional telemetry collection */
> #define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver */
> #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus device + bind */
> +#define DRM_WEDGE_RECOVERY_VENDOR BIT(3) /* vendor specific recovery method */
>
> /**
> * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> --
> 2.47.1
>