[PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent

dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
       [not found] <20250709112024.1053710-1-riana.tauro@intel.com>
@ 2025-07-09 11:20 ` Riana Tauro
  2025-07-09 13:41   ` Simona Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Riana Tauro @ 2025-07-09 11:20 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida,
	Christian König, David Airlie, dri-devel

Certain errors can cause the device to be wedged and may
require a vendor specific recovery method to restore normal
operation.

Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
must provide additional recovery documentation if this method
is used.

v2: fix documentation (Raag)

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: <dri-devel@lists.freedesktop.org>
Suggested-by: Raag Jadav <raag.jadav@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/drm-uapi.rst | 9 +++++----
 drivers/gpu/drm/drm_drv.c      | 2 ++
 include/drm/drm_device.h       | 4 ++++
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 263e5a97c080..c33070bdb347 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -421,10 +421,10 @@ Recovery
 Current implementation defines three recovery methods, out of which, drivers
 can use any one, multiple or none. Method(s) of choice will be sent in the
 uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
-more side-effects. If driver is unsure about recovery or method is unknown
-(like soft/hard system reboot, firmware flashing, physical device replacement
-or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
-will be sent instead.
+more side-effects. If recovery method is specific to vendor
+``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
+specific documentation for further recovery steps. If driver is unsure about
+recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
 
 Userspace consumers can parse this event and attempt recovery as per the
 following expectations.
@@ -435,6 +435,7 @@ following expectations.
     none            optional telemetry collection
     rebind          unbind + bind driver
     bus-reset       unbind + bus reset/re-enumeration + bind
+    vendor-specific vendor specific recovery method
     unknown         consumer policy
     =============== ========================================
 
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index cdd591b11488..0ac723a46a91 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
 		return "rebind";
 	case DRM_WEDGE_RECOVERY_BUS_RESET:
 		return "bus-reset";
+	case DRM_WEDGE_RECOVERY_VENDOR:
+		return "vendor-specific";
 	default:
 		return NULL;
 	}
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index 08b3b2467c4c..08a087f149ff 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -26,10 +26,14 @@ struct pci_controller;
  * Recovery methods for wedged device in order of less to more side-effects.
  * To be used with drm_dev_wedged_event() as recovery @method. Callers can
  * use any one, multiple (or'd) or none depending on their needs.
+ *
+ * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
+ * details.
  */
 #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
 #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
 #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
+#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
 
 /**
  * struct drm_wedge_task_info - information about the guilty task of a wedge dev
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
@ 2025-07-09 13:41   ` Simona Vetter
  2025-07-09 14:09     ` Christian König
  2025-07-09 14:46     ` Riana Tauro
  0 siblings, 2 replies; 17+ messages in thread
From: Simona Vetter @ 2025-07-09 13:41 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida,
	Christian König, David Airlie, dri-devel

On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> Certain errors can cause the device to be wedged and may
> require a vendor specific recovery method to restore normal
> operation.
> 
> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> must provide additional recovery documentation if this method
> is used.
> 
> v2: fix documentation (Raag)
> 
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>

I'm not really understanding what this is useful for, maybe concrete
example in the form of driver code that uses this, and some tool or
documentation steps that should be taken for recovery?

The issues I'm seeing here is that eventually we'll get different
vendor-specific recovery steps, and maybe even on the same device, and
that leads us to an enumeration issue. Since it's just a string and an
enum I think it'd be better to just allocate a new one every time there's
a new strange recovery method instead of this opaque approach.

Cheers, Sima

> ---
>  Documentation/gpu/drm-uapi.rst | 9 +++++----
>  drivers/gpu/drm/drm_drv.c      | 2 ++
>  include/drm/drm_device.h       | 4 ++++
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 263e5a97c080..c33070bdb347 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -421,10 +421,10 @@ Recovery
>  Current implementation defines three recovery methods, out of which, drivers
>  can use any one, multiple or none. Method(s) of choice will be sent in the
>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> -more side-effects. If driver is unsure about recovery or method is unknown
> -(like soft/hard system reboot, firmware flashing, physical device replacement
> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> -will be sent instead.
> +more side-effects. If recovery method is specific to vendor
> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> +specific documentation for further recovery steps. If driver is unsure about
> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>  
>  Userspace consumers can parse this event and attempt recovery as per the
>  following expectations.
> @@ -435,6 +435,7 @@ following expectations.
>      none            optional telemetry collection
>      rebind          unbind + bind driver
>      bus-reset       unbind + bus reset/re-enumeration + bind
> +    vendor-specific vendor specific recovery method
>      unknown         consumer policy
>      =============== ========================================
>  
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index cdd591b11488..0ac723a46a91 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>  		return "rebind";
>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
>  		return "bus-reset";
> +	case DRM_WEDGE_RECOVERY_VENDOR:
> +		return "vendor-specific";
>  	default:
>  		return NULL;
>  	}
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index 08b3b2467c4c..08a087f149ff 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -26,10 +26,14 @@ struct pci_controller;
>   * Recovery methods for wedged device in order of less to more side-effects.
>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>   * use any one, multiple (or'd) or none depending on their needs.
> + *
> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> + * details.
>   */
>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>  
>  /**
>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> -- 
> 2.47.1
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 13:41   ` Simona Vetter
@ 2025-07-09 14:09     ` Christian König
  2025-07-09 14:18       ` Raag Jadav
  2025-07-09 14:46     ` Riana Tauro
  1 sibling, 1 reply; 17+ messages in thread
From: Christian König @ 2025-07-09 14:09 UTC (permalink / raw)
  To: Simona Vetter, Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida, David Airlie,
	dri-devel

On 09.07.25 15:41, Simona Vetter wrote:
> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>> Certain errors can cause the device to be wedged and may
>> require a vendor specific recovery method to restore normal
>> operation.
>>
>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>> must provide additional recovery documentation if this method
>> is used.
>>
>> v2: fix documentation (Raag)
>>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: <dri-devel@lists.freedesktop.org>
>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> 
> I'm not really understanding what this is useful for, maybe concrete
> example in the form of driver code that uses this, and some tool or
> documentation steps that should be taken for recovery?

The recovery method for this particular case is to flash in a new firmware.

> The issues I'm seeing here is that eventually we'll get different
> vendor-specific recovery steps, and maybe even on the same device, and
> that leads us to an enumeration issue. Since it's just a string and an
> enum I think it'd be better to just allocate a new one every time there's
> a new strange recovery method instead of this opaque approach.

That is exactly the opposite of what we discussed so far.

The original idea was to add a firmware-flush recovery method which looked a bit wage since it didn't give any information on what to do exactly.

That's why I suggested to add a more generic vendor-specific event with refers to the documentation and system log to see what actually needs to be done.

Otherwise we would end up with events like firmware-flash, update FW image A, update FW image B, FW version mismatch etc....

Regards,
Christian.

> 
> Cheers, Sima
> 
>> ---
>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
>>  drivers/gpu/drm/drm_drv.c      | 2 ++
>>  include/drm/drm_device.h       | 4 ++++
>>  3 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 263e5a97c080..c33070bdb347 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -421,10 +421,10 @@ Recovery
>>  Current implementation defines three recovery methods, out of which, drivers
>>  can use any one, multiple or none. Method(s) of choice will be sent in the
>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>> -more side-effects. If driver is unsure about recovery or method is unknown
>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>> -will be sent instead.
>> +more side-effects. If recovery method is specific to vendor
>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>> +specific documentation for further recovery steps. If driver is unsure about
>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>  
>>  Userspace consumers can parse this event and attempt recovery as per the
>>  following expectations.
>> @@ -435,6 +435,7 @@ following expectations.
>>      none            optional telemetry collection
>>      rebind          unbind + bind driver
>>      bus-reset       unbind + bus reset/re-enumeration + bind
>> +    vendor-specific vendor specific recovery method
>>      unknown         consumer policy
>>      =============== ========================================
>>  
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index cdd591b11488..0ac723a46a91 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>  		return "rebind";
>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>  		return "bus-reset";
>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>> +		return "vendor-specific";
>>  	default:
>>  		return NULL;
>>  	}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index 08b3b2467c4c..08a087f149ff 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -26,10 +26,14 @@ struct pci_controller;
>>   * Recovery methods for wedged device in order of less to more side-effects.
>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>   * use any one, multiple (or'd) or none depending on their needs.
>> + *
>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>> + * details.
>>   */
>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>  
>>  /**
>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>> -- 
>> 2.47.1
>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 14:09     ` Christian König
@ 2025-07-09 14:18       ` Raag Jadav
  2025-07-09 16:52         ` Rodrigo Vivi
  0 siblings, 1 reply; 17+ messages in thread
From: Raag Jadav @ 2025-07-09 14:18 UTC (permalink / raw)
  To: Christian König
  Cc: Simona Vetter, Riana Tauro, intel-xe, anshuman.gupta,
	rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> On 09.07.25 15:41, Simona Vetter wrote:
> > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> >> Certain errors can cause the device to be wedged and may
> >> require a vendor specific recovery method to restore normal
> >> operation.
> >>
> >> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> >> must provide additional recovery documentation if this method
> >> is used.
> >>
> >> v2: fix documentation (Raag)
> >>
> >> Cc: André Almeida <andrealmeid@igalia.com>
> >> Cc: Christian König <christian.koenig@amd.com>
> >> Cc: David Airlie <airlied@gmail.com>
> >> Cc: <dri-devel@lists.freedesktop.org>
> >> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> >> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > 
> > I'm not really understanding what this is useful for, maybe concrete
> > example in the form of driver code that uses this, and some tool or
> > documentation steps that should be taken for recovery?
> 
> The recovery method for this particular case is to flash in a new firmware.
> 
> > The issues I'm seeing here is that eventually we'll get different
> > vendor-specific recovery steps, and maybe even on the same device, and
> > that leads us to an enumeration issue. Since it's just a string and an
> > enum I think it'd be better to just allocate a new one every time there's
> > a new strange recovery method instead of this opaque approach.
> 
> That is exactly the opposite of what we discussed so far.
> 
> The original idea was to add a firmware-flush recovery method which looked a bit wage since it didn't give any information on what to do exactly.
> 
> That's why I suggested to add a more generic vendor-specific event with refers to the documentation and system log to see what actually needs to be done.
> 
> Otherwise we would end up with events like firmware-flash, update FW image A, update FW image B, FW version mismatch etc....

Agree. Any newly allocated method that is specific to a vendor is going to
be opaque anyway, since it can't be generic for all drivers. This just helps
reduce the noise in DRM core.

And yes, there could be different vendor-specific cases for the same driver
and the driver should be able to provide the means to distinguish between
them.

Raag

> >> ---
> >>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> >>  drivers/gpu/drm/drm_drv.c      | 2 ++
> >>  include/drm/drm_device.h       | 4 ++++
> >>  3 files changed, 11 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >> index 263e5a97c080..c33070bdb347 100644
> >> --- a/Documentation/gpu/drm-uapi.rst
> >> +++ b/Documentation/gpu/drm-uapi.rst
> >> @@ -421,10 +421,10 @@ Recovery
> >>  Current implementation defines three recovery methods, out of which, drivers
> >>  can use any one, multiple or none. Method(s) of choice will be sent in the
> >>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> >> -more side-effects. If driver is unsure about recovery or method is unknown
> >> -(like soft/hard system reboot, firmware flashing, physical device replacement
> >> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> >> -will be sent instead.
> >> +more side-effects. If recovery method is specific to vendor
> >> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> >> +specific documentation for further recovery steps. If driver is unsure about
> >> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> >>  
> >>  Userspace consumers can parse this event and attempt recovery as per the
> >>  following expectations.
> >> @@ -435,6 +435,7 @@ following expectations.
> >>      none            optional telemetry collection
> >>      rebind          unbind + bind driver
> >>      bus-reset       unbind + bus reset/re-enumeration + bind
> >> +    vendor-specific vendor specific recovery method
> >>      unknown         consumer policy
> >>      =============== ========================================
> >>  
> >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> >> index cdd591b11488..0ac723a46a91 100644
> >> --- a/drivers/gpu/drm/drm_drv.c
> >> +++ b/drivers/gpu/drm/drm_drv.c
> >> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> >>  		return "rebind";
> >>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> >>  		return "bus-reset";
> >> +	case DRM_WEDGE_RECOVERY_VENDOR:
> >> +		return "vendor-specific";
> >>  	default:
> >>  		return NULL;
> >>  	}
> >> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> >> index 08b3b2467c4c..08a087f149ff 100644
> >> --- a/include/drm/drm_device.h
> >> +++ b/include/drm/drm_device.h
> >> @@ -26,10 +26,14 @@ struct pci_controller;
> >>   * Recovery methods for wedged device in order of less to more side-effects.
> >>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> >>   * use any one, multiple (or'd) or none depending on their needs.
> >> + *
> >> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> >> + * details.
> >>   */
> >>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> >>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> >>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> >> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> >>  
> >>  /**
> >>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> >> -- 
> >> 2.47.1
> >>
> > 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 13:41   ` Simona Vetter
  2025-07-09 14:09     ` Christian König
@ 2025-07-09 14:46     ` Riana Tauro
  1 sibling, 0 replies; 17+ messages in thread
From: Riana Tauro @ 2025-07-09 14:46 UTC (permalink / raw)
  To: Simona Vetter
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida,
	Christian König, David Airlie, dri-devel

Hi Sima

On 7/9/2025 7:11 PM, Simona Vetter wrote:
> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>> Certain errors can cause the device to be wedged and may
>> require a vendor specific recovery method to restore normal
>> operation.
>>
>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>> must provide additional recovery documentation if this method
>> is used.
>>
>> v2: fix documentation (Raag)
>>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: <dri-devel@lists.freedesktop.org>
>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> 
> I'm not really understanding what this is useful for, maybe concrete
> example in the form of driver code that uses this, and some tool or
> documentation steps that should be taken for recovery?

example and documentation for vendor specific recovery are part of the 
same series.
patchwork link: https://patchwork.freedesktop.org/series/149756/

fwupd tool will be using this. This was the initial PR raised.
It is yet to be updated to use 'vendor-specific'

PR: https://github.com/fwupd/fwupd/pull/8922

> 
> The issues I'm seeing here is that eventually we'll get different
> vendor-specific recovery steps, and maybe even on the same device, and
> that leads us to an enumeration issue. Since it's just a string and an
> enum I think it'd be better to just allocate a new one every time there's
> a new strange recovery method instead of this opaque approach.

It started as a specific macro and string but based on review comments 
it was changed to generic macro.

Thanks
Riana

> 
> Cheers, Sima
> 
>> ---
>>   Documentation/gpu/drm-uapi.rst | 9 +++++----
>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>   include/drm/drm_device.h       | 4 ++++
>>   3 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 263e5a97c080..c33070bdb347 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -421,10 +421,10 @@ Recovery
>>   Current implementation defines three recovery methods, out of which, drivers
>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>> -more side-effects. If driver is unsure about recovery or method is unknown
>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>> -will be sent instead.
>> +more side-effects. If recovery method is specific to vendor
>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>> +specific documentation for further recovery steps. If driver is unsure about
>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>   
>>   Userspace consumers can parse this event and attempt recovery as per the
>>   following expectations.
>> @@ -435,6 +435,7 @@ following expectations.
>>       none            optional telemetry collection
>>       rebind          unbind + bind driver
>>       bus-reset       unbind + bus reset/re-enumeration + bind
>> +    vendor-specific vendor specific recovery method
>>       unknown         consumer policy
>>       =============== ========================================
>>   
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index cdd591b11488..0ac723a46a91 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>   		return "rebind";
>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>   		return "bus-reset";
>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>> +		return "vendor-specific";
>>   	default:
>>   		return NULL;
>>   	}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index 08b3b2467c4c..08a087f149ff 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -26,10 +26,14 @@ struct pci_controller;
>>    * Recovery methods for wedged device in order of less to more side-effects.
>>    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>    * use any one, multiple (or'd) or none depending on their needs.
>> + *
>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>> + * details.
>>    */
>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>   
>>   /**
>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>> -- 
>> 2.47.1
>>
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 14:18       ` Raag Jadav
@ 2025-07-09 16:52         ` Rodrigo Vivi
  2025-07-10  9:01           ` Simona Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Rodrigo Vivi @ 2025-07-09 16:52 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Christian König, Simona Vetter, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > On 09.07.25 15:41, Simona Vetter wrote:
> > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > >> Certain errors can cause the device to be wedged and may
> > >> require a vendor specific recovery method to restore normal
> > >> operation.
> > >>
> > >> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > >> must provide additional recovery documentation if this method
> > >> is used.
> > >>
> > >> v2: fix documentation (Raag)
> > >>
> > >> Cc: André Almeida <andrealmeid@igalia.com>
> > >> Cc: Christian König <christian.koenig@amd.com>
> > >> Cc: David Airlie <airlied@gmail.com>
> > >> Cc: <dri-devel@lists.freedesktop.org>
> > >> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > >> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > 
> > > I'm not really understanding what this is useful for, maybe concrete
> > > example in the form of driver code that uses this, and some tool or
> > > documentation steps that should be taken for recovery?

The case here is when FW underneath identified something badly corrupted on
FW land and decided that only a firmware-flashing could solve the day and
raise interrupt to the driver. At that point we want to wedge, but immediately
hint the admin the recommended action.

> > 
> > The recovery method for this particular case is to flash in a new firmware.
> > 
> > > The issues I'm seeing here is that eventually we'll get different
> > > vendor-specific recovery steps, and maybe even on the same device, and
> > > that leads us to an enumeration issue. Since it's just a string and an
> > > enum I think it'd be better to just allocate a new one every time there's
> > > a new strange recovery method instead of this opaque approach.
> > 
> > That is exactly the opposite of what we discussed so far.
> > 
> > The original idea was to add a firmware-flush recovery method which looked a bit wage since it didn't give any information on what to do exactly.
> > 
> > That's why I suggested to add a more generic vendor-specific event with refers to the documentation and system log to see what actually needs to be done.
> > 
> > Otherwise we would end up with events like firmware-flash, update FW image A, update FW image B, FW version mismatch etc....
> 
> Agree. Any newly allocated method that is specific to a vendor is going to
> be opaque anyway, since it can't be generic for all drivers. This just helps
> reduce the noise in DRM core.
> 
> And yes, there could be different vendor-specific cases for the same driver
> and the driver should be able to provide the means to distinguish between
> them.

Sim, what's your take on this then?

Should we get back to the original idea of firmware-flash?

> 
> Raag
> 
> > >> ---
> > >>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > >>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > >>  include/drm/drm_device.h       | 4 ++++
> > >>  3 files changed, 11 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > >> index 263e5a97c080..c33070bdb347 100644
> > >> --- a/Documentation/gpu/drm-uapi.rst
> > >> +++ b/Documentation/gpu/drm-uapi.rst
> > >> @@ -421,10 +421,10 @@ Recovery
> > >>  Current implementation defines three recovery methods, out of which, drivers
> > >>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > >>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > >> -more side-effects. If driver is unsure about recovery or method is unknown
> > >> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > >> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > >> -will be sent instead.
> > >> +more side-effects. If recovery method is specific to vendor
> > >> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > >> +specific documentation for further recovery steps. If driver is unsure about
> > >> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > >>  
> > >>  Userspace consumers can parse this event and attempt recovery as per the
> > >>  following expectations.
> > >> @@ -435,6 +435,7 @@ following expectations.
> > >>      none            optional telemetry collection
> > >>      rebind          unbind + bind driver
> > >>      bus-reset       unbind + bus reset/re-enumeration + bind
> > >> +    vendor-specific vendor specific recovery method
> > >>      unknown         consumer policy
> > >>      =============== ========================================
> > >>  
> > >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > >> index cdd591b11488..0ac723a46a91 100644
> > >> --- a/drivers/gpu/drm/drm_drv.c
> > >> +++ b/drivers/gpu/drm/drm_drv.c
> > >> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >>  		return "rebind";
> > >>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > >>  		return "bus-reset";
> > >> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > >> +		return "vendor-specific";
> > >>  	default:
> > >>  		return NULL;
> > >>  	}
> > >> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > >> index 08b3b2467c4c..08a087f149ff 100644
> > >> --- a/include/drm/drm_device.h
> > >> +++ b/include/drm/drm_device.h
> > >> @@ -26,10 +26,14 @@ struct pci_controller;
> > >>   * Recovery methods for wedged device in order of less to more side-effects.
> > >>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > >>   * use any one, multiple (or'd) or none depending on their needs.
> > >> + *
> > >> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > >> + * details.
> > >>   */
> > >>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > >>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > >>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > >> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > >>  
> > >>  /**
> > >>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > >> -- 
> > >> 2.47.1
> > >>
> > > 
> > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-09 16:52         ` Rodrigo Vivi
@ 2025-07-10  9:01           ` Simona Vetter
  2025-07-10  9:37             ` Christian König
  0 siblings, 1 reply; 17+ messages in thread
From: Simona Vetter @ 2025-07-10  9:01 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Raag Jadav, Christian König, Simona Vetter, Riana Tauro,
	intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > On 09.07.25 15:41, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > >> Certain errors can cause the device to be wedged and may
> > > >> require a vendor specific recovery method to restore normal
> > > >> operation.
> > > >>
> > > >> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > >> must provide additional recovery documentation if this method
> > > >> is used.
> > > >>
> > > >> v2: fix documentation (Raag)
> > > >>
> > > >> Cc: André Almeida <andrealmeid@igalia.com>
> > > >> Cc: Christian König <christian.koenig@amd.com>
> > > >> Cc: David Airlie <airlied@gmail.com>
> > > >> Cc: <dri-devel@lists.freedesktop.org>
> > > >> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > >> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > > 
> > > > I'm not really understanding what this is useful for, maybe concrete
> > > > example in the form of driver code that uses this, and some tool or
> > > > documentation steps that should be taken for recovery?
> 
> The case here is when FW underneath identified something badly corrupted on
> FW land and decided that only a firmware-flashing could solve the day and
> raise interrupt to the driver. At that point we want to wedge, but immediately
> hint the admin the recommended action.
> 
> > > 
> > > The recovery method for this particular case is to flash in a new firmware.
> > > 
> > > > The issues I'm seeing here is that eventually we'll get different
> > > > vendor-specific recovery steps, and maybe even on the same device, and
> > > > that leads us to an enumeration issue. Since it's just a string and an
> > > > enum I think it'd be better to just allocate a new one every time there's
> > > > a new strange recovery method instead of this opaque approach.
> > > 
> > > That is exactly the opposite of what we discussed so far.

Sorry, I missed that context.

> > > The original idea was to add a firmware-flush recovery method which
> > > looked a bit wage since it didn't give any information on what to do
> > > exactly.
> > > 
> > > That's why I suggested to add a more generic vendor-specific event
> > > with refers to the documentation and system log to see what actually
> > > needs to be done.
> > > 
> > > Otherwise we would end up with events like firmware-flash, update FW
> > > image A, update FW image B, FW version mismatch etc....

Yeah, that's kinda what I expect to happen, and we have enough numbers for
this all to not be an issue.

> > Agree. Any newly allocated method that is specific to a vendor is going to
> > be opaque anyway, since it can't be generic for all drivers. This just helps
> > reduce the noise in DRM core.
> > 
> > And yes, there could be different vendor-specific cases for the same driver
> > and the driver should be able to provide the means to distinguish between
> > them.
> 
> Sim, what's your take on this then?
> 
> Should we get back to the original idea of firmware-flash?

Maybe intel-firmware-flash or something, meaning prefix with the vendor?

The reason I think it should be specific is because I'm assuming you want
to script this. And if you have a big fleet with different vendors, then
"vendor-specific" doesn't tell you enough. But if it's something like
$vendor-$magic_step then it does become scriptable, and we do have have a
place to put some documentation on what you should do instead.

If the point of this interface isn't that it's scriptable, then I'm not
sure why it needs to be an uevent?

I guess if you all want to stick with vendor-specific then I think that's
ok with me too, but the docs should at least explain how to figure out
from the uevent which vendor you're on with a small example. What I'm
worried is that if we have this on multiple drivers userspace will
otherwise make a complete mess and might want to run the wrong recovery
steps.

I think ideally, no matter what, we'd have a concrete driver patch which
then also comes with the documentation for what exactly you're supposed to
do as something you can script. And not just this stand-alone patch here.

Cheers, Sima
> 
> > 
> > Raag
> > 
> > > >> ---
> > > >>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > >>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > > >>  include/drm/drm_device.h       | 4 ++++
> > > >>  3 files changed, 11 insertions(+), 4 deletions(-)
> > > >>
> > > >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > >> index 263e5a97c080..c33070bdb347 100644
> > > >> --- a/Documentation/gpu/drm-uapi.rst
> > > >> +++ b/Documentation/gpu/drm-uapi.rst
> > > >> @@ -421,10 +421,10 @@ Recovery
> > > >>  Current implementation defines three recovery methods, out of which, drivers
> > > >>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > > >>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > >> -more side-effects. If driver is unsure about recovery or method is unknown
> > > >> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > >> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > >> -will be sent instead.
> > > >> +more side-effects. If recovery method is specific to vendor
> > > >> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > >> +specific documentation for further recovery steps. If driver is unsure about
> > > >> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > >>  
> > > >>  Userspace consumers can parse this event and attempt recovery as per the
> > > >>  following expectations.
> > > >> @@ -435,6 +435,7 @@ following expectations.
> > > >>      none            optional telemetry collection
> > > >>      rebind          unbind + bind driver
> > > >>      bus-reset       unbind + bus reset/re-enumeration + bind
> > > >> +    vendor-specific vendor specific recovery method
> > > >>      unknown         consumer policy
> > > >>      =============== ========================================
> > > >>  
> > > >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > >> index cdd591b11488..0ac723a46a91 100644
> > > >> --- a/drivers/gpu/drm/drm_drv.c
> > > >> +++ b/drivers/gpu/drm/drm_drv.c
> > > >> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > >>  		return "rebind";
> > > >>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > >>  		return "bus-reset";
> > > >> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > >> +		return "vendor-specific";
> > > >>  	default:
> > > >>  		return NULL;
> > > >>  	}
> > > >> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > >> index 08b3b2467c4c..08a087f149ff 100644
> > > >> --- a/include/drm/drm_device.h
> > > >> +++ b/include/drm/drm_device.h
> > > >> @@ -26,10 +26,14 @@ struct pci_controller;
> > > >>   * Recovery methods for wedged device in order of less to more side-effects.
> > > >>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > >>   * use any one, multiple (or'd) or none depending on their needs.
> > > >> + *
> > > >> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > >> + * details.
> > > >>   */
> > > >>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > >>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > >>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > >> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > >>  
> > > >>  /**
> > > >>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > >> -- 
> > > >> 2.47.1
> > > >>
> > > > 
> > > 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10  9:01           ` Simona Vetter
@ 2025-07-10  9:37             ` Christian König
  2025-07-10 10:24               ` Raag Jadav
  2025-07-11  8:59               ` Simona Vetter
  0 siblings, 2 replies; 17+ messages in thread
From: Christian König @ 2025-07-10  9:37 UTC (permalink / raw)
  To: Simona Vetter, Rodrigo Vivi
  Cc: Raag Jadav, Riana Tauro, intel-xe, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban, André Almeida, David Airlie, dri-devel

On 10.07.25 11:01, Simona Vetter wrote:
> On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
>> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
>>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
>>>> On 09.07.25 15:41, Simona Vetter wrote:
>>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>>>>>> Certain errors can cause the device to be wedged and may
>>>>>> require a vendor specific recovery method to restore normal
>>>>>> operation.
>>>>>>
>>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>>>>>> must provide additional recovery documentation if this method
>>>>>> is used.
>>>>>>
>>>>>> v2: fix documentation (Raag)
>>>>>>
>>>>>> Cc: André Almeida <andrealmeid@igalia.com>
>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>> Cc: <dri-devel@lists.freedesktop.org>
>>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>>>
>>>>> I'm not really understanding what this is useful for, maybe concrete
>>>>> example in the form of driver code that uses this, and some tool or
>>>>> documentation steps that should be taken for recovery?
>>
>> The case here is when FW underneath identified something badly corrupted on
>> FW land and decided that only a firmware-flashing could solve the day and
>> raise interrupt to the driver. At that point we want to wedge, but immediately
>> hint the admin the recommended action.
>>
>>>>
>>>> The recovery method for this particular case is to flash in a new firmware.
>>>>
>>>>> The issues I'm seeing here is that eventually we'll get different
>>>>> vendor-specific recovery steps, and maybe even on the same device, and
>>>>> that leads us to an enumeration issue. Since it's just a string and an
>>>>> enum I think it'd be better to just allocate a new one every time there's
>>>>> a new strange recovery method instead of this opaque approach.
>>>>
>>>> That is exactly the opposite of what we discussed so far.
> 
> Sorry, I missed that context.
> 
>>>> The original idea was to add a firmware-flush recovery method which
>>>> looked a bit wage since it didn't give any information on what to do
>>>> exactly.
>>>>
>>>> That's why I suggested to add a more generic vendor-specific event
>>>> with refers to the documentation and system log to see what actually
>>>> needs to be done.
>>>>
>>>> Otherwise we would end up with events like firmware-flash, update FW
>>>> image A, update FW image B, FW version mismatch etc....
> 
> Yeah, that's kinda what I expect to happen, and we have enough numbers for
> this all to not be an issue.
> 
>>> Agree. Any newly allocated method that is specific to a vendor is going to
>>> be opaque anyway, since it can't be generic for all drivers. This just helps
>>> reduce the noise in DRM core.
>>>
>>> And yes, there could be different vendor-specific cases for the same driver
>>> and the driver should be able to provide the means to distinguish between
>>> them.
>>
>> Sim, what's your take on this then?
>>
>> Should we get back to the original idea of firmware-flash?
> 
> Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> 
> The reason I think it should be specific is because I'm assuming you want
> to script this. And if you have a big fleet with different vendors, then
> "vendor-specific" doesn't tell you enough. But if it's something like
> $vendor-$magic_step then it does become scriptable, and we do have have a
> place to put some documentation on what you should do instead.
> 
> If the point of this interface isn't that it's scriptable, then I'm not
> sure why it needs to be an uevent?

You should probably read up on the previous discussion, cause that is exactly what I asked as well :)

And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.

In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.

Regards,
Christian.

> I guess if you all want to stick with vendor-specific then I think that's
> ok with me too, but the docs should at least explain how to figure out
> from the uevent which vendor you're on with a small example. What I'm
> worried is that if we have this on multiple drivers userspace will
> otherwise make a complete mess and might want to run the wrong recovery
> steps.
> 
> I think ideally, no matter what, we'd have a concrete driver patch which
> then also comes with the documentation for what exactly you're supposed to
> do as something you can script. And not just this stand-alone patch here.
> 
> Cheers, Sima
>>
>>>
>>> Raag
>>>
>>>>>> ---
>>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
>>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
>>>>>>  include/drm/drm_device.h       | 4 ++++
>>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>>>>> index 263e5a97c080..c33070bdb347 100644
>>>>>> --- a/Documentation/gpu/drm-uapi.rst
>>>>>> +++ b/Documentation/gpu/drm-uapi.rst
>>>>>> @@ -421,10 +421,10 @@ Recovery
>>>>>>  Current implementation defines three recovery methods, out of which, drivers
>>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
>>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
>>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>>>>> -will be sent instead.
>>>>>> +more side-effects. If recovery method is specific to vendor
>>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>>>>>> +specific documentation for further recovery steps. If driver is unsure about
>>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>>>>>  
>>>>>>  Userspace consumers can parse this event and attempt recovery as per the
>>>>>>  following expectations.
>>>>>> @@ -435,6 +435,7 @@ following expectations.
>>>>>>      none            optional telemetry collection
>>>>>>      rebind          unbind + bind driver
>>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
>>>>>> +    vendor-specific vendor specific recovery method
>>>>>>      unknown         consumer policy
>>>>>>      =============== ========================================
>>>>>>  
>>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>>>> index cdd591b11488..0ac723a46a91 100644
>>>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>>>>>  		return "rebind";
>>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>>>>>  		return "bus-reset";
>>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>>>>>> +		return "vendor-specific";
>>>>>>  	default:
>>>>>>  		return NULL;
>>>>>>  	}
>>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>>>>> index 08b3b2467c4c..08a087f149ff 100644
>>>>>> --- a/include/drm/drm_device.h
>>>>>> +++ b/include/drm/drm_device.h
>>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
>>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
>>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>>>>>   * use any one, multiple (or'd) or none depending on their needs.
>>>>>> + *
>>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>>>>>> + * details.
>>>>>>   */
>>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>>>>>  
>>>>>>  /**
>>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>>>>>> -- 
>>>>>> 2.47.1
>>>>>>
>>>>>
>>>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10  9:37             ` Christian König
@ 2025-07-10 10:24               ` Raag Jadav
  2025-07-10 19:00                 ` Rodrigo Vivi
  2025-07-11  8:59               ` Simona Vetter
  1 sibling, 1 reply; 17+ messages in thread
From: Raag Jadav @ 2025-07-10 10:24 UTC (permalink / raw)
  To: Christian König
  Cc: Simona Vetter, Rodrigo Vivi, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> On 10.07.25 11:01, Simona Vetter wrote:
> > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> >>>> On 09.07.25 15:41, Simona Vetter wrote:
> >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> >>>>>> Certain errors can cause the device to be wedged and may
> >>>>>> require a vendor specific recovery method to restore normal
> >>>>>> operation.
> >>>>>>
> >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> >>>>>> must provide additional recovery documentation if this method
> >>>>>> is used.
> >>>>>>
> >>>>>> v2: fix documentation (Raag)
> >>>>>>
> >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> >>>>>> Cc: Christian König <christian.koenig@amd.com>
> >>>>>> Cc: David Airlie <airlied@gmail.com>
> >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> >>>>>
> >>>>> I'm not really understanding what this is useful for, maybe concrete
> >>>>> example in the form of driver code that uses this, and some tool or
> >>>>> documentation steps that should be taken for recovery?
> >>
> >> The case here is when FW underneath identified something badly corrupted on
> >> FW land and decided that only a firmware-flashing could solve the day and
> >> raise interrupt to the driver. At that point we want to wedge, but immediately
> >> hint the admin the recommended action.
> >>
> >>>>
> >>>> The recovery method for this particular case is to flash in a new firmware.
> >>>>
> >>>>> The issues I'm seeing here is that eventually we'll get different
> >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> >>>>> that leads us to an enumeration issue. Since it's just a string and an
> >>>>> enum I think it'd be better to just allocate a new one every time there's
> >>>>> a new strange recovery method instead of this opaque approach.
> >>>>
> >>>> That is exactly the opposite of what we discussed so far.
> > 
> > Sorry, I missed that context.
> > 
> >>>> The original idea was to add a firmware-flush recovery method which
> >>>> looked a bit wage since it didn't give any information on what to do
> >>>> exactly.
> >>>>
> >>>> That's why I suggested to add a more generic vendor-specific event
> >>>> with refers to the documentation and system log to see what actually
> >>>> needs to be done.
> >>>>
> >>>> Otherwise we would end up with events like firmware-flash, update FW
> >>>> image A, update FW image B, FW version mismatch etc....
> > 
> > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > this all to not be an issue.
> > 
> >>> Agree. Any newly allocated method that is specific to a vendor is going to
> >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> >>> reduce the noise in DRM core.
> >>>
> >>> And yes, there could be different vendor-specific cases for the same driver
> >>> and the driver should be able to provide the means to distinguish between
> >>> them.
> >>
> >> Sim, what's your take on this then?
> >>
> >> Should we get back to the original idea of firmware-flash?
> > 
> > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > 
> > The reason I think it should be specific is because I'm assuming you want
> > to script this. And if you have a big fleet with different vendors, then
> > "vendor-specific" doesn't tell you enough. But if it's something like
> > $vendor-$magic_step then it does become scriptable, and we do have have a
> > place to put some documentation on what you should do instead.
> > 
> > If the point of this interface isn't that it's scriptable, then I'm not
> > sure why it needs to be an uevent?
> 
> You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> 
> And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
> 
> In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.

Yes, and since the recovery procedure is defined and known to the consumer,
it can potentially be automated (atleast for non-firmware cases).

> > I guess if you all want to stick with vendor-specific then I think that's
> > ok with me too, but the docs should at least explain how to figure out
> > from the uevent which vendor you're on with a small example. What I'm
> > worried is that if we have this on multiple drivers userspace will
> > otherwise make a complete mess and might want to run the wrong recovery
> > steps.

The device id along with driver can be identified from uevent (probably
available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
already knows if the device fits the criteria for recovery.

> > I think ideally, no matter what, we'd have a concrete driver patch which
> > then also comes with the documentation for what exactly you're supposed to
> > do as something you can script. And not just this stand-alone patch here.

Perhaps the rest of the series didn't make it to dri-devel, which will answer
most of the above.

Raag

> >>>>>> ---
> >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> >>>>>>  include/drm/drm_device.h       | 4 ++++
> >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >>>>>> index 263e5a97c080..c33070bdb347 100644
> >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> >>>>>> @@ -421,10 +421,10 @@ Recovery
> >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> >>>>>> -will be sent instead.
> >>>>>> +more side-effects. If recovery method is specific to vendor
> >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> >>>>>>  
> >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> >>>>>>  following expectations.
> >>>>>> @@ -435,6 +435,7 @@ following expectations.
> >>>>>>      none            optional telemetry collection
> >>>>>>      rebind          unbind + bind driver
> >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> >>>>>> +    vendor-specific vendor specific recovery method
> >>>>>>      unknown         consumer policy
> >>>>>>      =============== ========================================
> >>>>>>  
> >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> >>>>>> index cdd591b11488..0ac723a46a91 100644
> >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> >>>>>>  		return "rebind";
> >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> >>>>>>  		return "bus-reset";
> >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> >>>>>> +		return "vendor-specific";
> >>>>>>  	default:
> >>>>>>  		return NULL;
> >>>>>>  	}
> >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> >>>>>> --- a/include/drm/drm_device.h
> >>>>>> +++ b/include/drm/drm_device.h
> >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> >>>>>> + *
> >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> >>>>>> + * details.
> >>>>>>   */
> >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> >>>>>>  
> >>>>>>  /**
> >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> >>>>>> -- 
> >>>>>> 2.47.1
> >>>>>>
> >>>>>
> >>>>
> > 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10 10:24               ` Raag Jadav
@ 2025-07-10 19:00                 ` Rodrigo Vivi
  2025-07-10 21:46                   ` Raag Jadav
  2025-07-11  8:56                   ` Simona Vetter
  0 siblings, 2 replies; 17+ messages in thread
From: Rodrigo Vivi @ 2025-07-10 19:00 UTC (permalink / raw)
  To: Raag Jadav
  Cc: Christian König, Simona Vetter, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
> On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > On 10.07.25 11:01, Simona Vetter wrote:
> > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > >>>> On 09.07.25 15:41, Simona Vetter wrote:
> > >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > >>>>>> Certain errors can cause the device to be wedged and may
> > >>>>>> require a vendor specific recovery method to restore normal
> > >>>>>> operation.
> > >>>>>>
> > >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > >>>>>> must provide additional recovery documentation if this method
> > >>>>>> is used.
> > >>>>>>
> > >>>>>> v2: fix documentation (Raag)
> > >>>>>>
> > >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> > >>>>>> Cc: Christian König <christian.koenig@amd.com>
> > >>>>>> Cc: David Airlie <airlied@gmail.com>
> > >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> > >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > >>>>>
> > >>>>> I'm not really understanding what this is useful for, maybe concrete
> > >>>>> example in the form of driver code that uses this, and some tool or
> > >>>>> documentation steps that should be taken for recovery?
> > >>
> > >> The case here is when FW underneath identified something badly corrupted on
> > >> FW land and decided that only a firmware-flashing could solve the day and
> > >> raise interrupt to the driver. At that point we want to wedge, but immediately
> > >> hint the admin the recommended action.
> > >>
> > >>>>
> > >>>> The recovery method for this particular case is to flash in a new firmware.
> > >>>>
> > >>>>> The issues I'm seeing here is that eventually we'll get different
> > >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> > >>>>> that leads us to an enumeration issue. Since it's just a string and an
> > >>>>> enum I think it'd be better to just allocate a new one every time there's
> > >>>>> a new strange recovery method instead of this opaque approach.
> > >>>>
> > >>>> That is exactly the opposite of what we discussed so far.
> > > 
> > > Sorry, I missed that context.
> > > 
> > >>>> The original idea was to add a firmware-flush recovery method which
> > >>>> looked a bit wage since it didn't give any information on what to do
> > >>>> exactly.
> > >>>>
> > >>>> That's why I suggested to add a more generic vendor-specific event
> > >>>> with refers to the documentation and system log to see what actually
> > >>>> needs to be done.
> > >>>>
> > >>>> Otherwise we would end up with events like firmware-flash, update FW
> > >>>> image A, update FW image B, FW version mismatch etc....
> > > 
> > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > this all to not be an issue.
> > > 
> > >>> Agree. Any newly allocated method that is specific to a vendor is going to
> > >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> > >>> reduce the noise in DRM core.
> > >>>
> > >>> And yes, there could be different vendor-specific cases for the same driver
> > >>> and the driver should be able to provide the means to distinguish between
> > >>> them.
> > >>
> > >> Sim, what's your take on this then?
> > >>
> > >> Should we get back to the original idea of firmware-flash?
> > > 
> > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > 
> > > The reason I think it should be specific is because I'm assuming you want
> > > to script this. And if you have a big fleet with different vendors, then
> > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > $vendor-$magic_step then it does become scriptable, and we do have have a
> > > place to put some documentation on what you should do instead.
> > > 
> > > If the point of this interface isn't that it's scriptable, then I'm not
> > > sure why it needs to be an uevent?
> > 
> > You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> > 
> > And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.

I also don't like the idea or even the thought of scripting something like
a firmware-flash. But only to fail with a better pin point to make admin
lives easier with a notification.

> > 
> > In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.

Exactly, the hardware and firmware are giving the indication of what should be
done. It is not 'unknown'.

> 
> Yes, and since the recovery procedure is defined and known to the consumer,
> it can potentially be automated (atleast for non-firmware cases).
> 
> > > I guess if you all want to stick with vendor-specific then I think that's

Well, I would honestly prefer a direct firmware-flash, but if that is not
usable by other vendors and there's a push back on that, let's go with
the vendor-specific then.

> > > ok with me too, but the docs should at least explain how to figure out
> > > from the uevent which vendor you're on with a small example. What I'm
> > > worried is that if we have this on multiple drivers userspace will
> > > otherwise make a complete mess and might want to run the wrong recovery
> > > steps.
> 
> The device id along with driver can be identified from uevent (probably
> available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
> already knows if the device fits the criteria for recovery.
> 
> > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > then also comes with the documentation for what exactly you're supposed to
> > > do as something you can script. And not just this stand-alone patch here.
> 
> Perhaps the rest of the series didn't make it to dri-devel, which will answer
> most of the above.

Riana, could you please try to provide a bit more documentation like Sima
asked and re-send the entire series to dri-devel?

Thanks,
Rodrigo.

> 
> Raag
> 
> > >>>>>> ---
> > >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > >>>>>>  include/drm/drm_device.h       | 4 ++++
> > >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> > >>>>>>
> > >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > >>>>>> index 263e5a97c080..c33070bdb347 100644
> > >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> > >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> > >>>>>> @@ -421,10 +421,10 @@ Recovery
> > >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> > >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> > >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > >>>>>> -will be sent instead.
> > >>>>>> +more side-effects. If recovery method is specific to vendor
> > >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> > >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > >>>>>>  
> > >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> > >>>>>>  following expectations.
> > >>>>>> @@ -435,6 +435,7 @@ following expectations.
> > >>>>>>      none            optional telemetry collection
> > >>>>>>      rebind          unbind + bind driver
> > >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> > >>>>>> +    vendor-specific vendor specific recovery method
> > >>>>>>      unknown         consumer policy
> > >>>>>>      =============== ========================================
> > >>>>>>  
> > >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > >>>>>> index cdd591b11488..0ac723a46a91 100644
> > >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> > >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> > >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >>>>>>  		return "rebind";
> > >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > >>>>>>  		return "bus-reset";
> > >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > >>>>>> +		return "vendor-specific";
> > >>>>>>  	default:
> > >>>>>>  		return NULL;
> > >>>>>>  	}
> > >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> > >>>>>> --- a/include/drm/drm_device.h
> > >>>>>> +++ b/include/drm/drm_device.h
> > >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> > >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> > >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> > >>>>>> + *
> > >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > >>>>>> + * details.
> > >>>>>>   */
> > >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > >>>>>>  
> > >>>>>>  /**
> > >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > >>>>>> -- 
> > >>>>>> 2.47.1
> > >>>>>>
> > >>>>>
> > >>>>
> > > 
> > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10 19:00                 ` Rodrigo Vivi
@ 2025-07-10 21:46                   ` Raag Jadav
  2025-07-11  5:17                     ` Riana Tauro
  2025-07-11  8:56                   ` Simona Vetter
  1 sibling, 1 reply; 17+ messages in thread
From: Raag Jadav @ 2025-07-10 21:46 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Christian König, Simona Vetter, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 03:00:06PM -0400, Rodrigo Vivi wrote:
> On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
> > On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > > On 10.07.25 11:01, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > > >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > > >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > >>>> On 09.07.25 15:41, Simona Vetter wrote:
> > > >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > >>>>>> Certain errors can cause the device to be wedged and may
> > > >>>>>> require a vendor specific recovery method to restore normal
> > > >>>>>> operation.
> > > >>>>>>
> > > >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > >>>>>> must provide additional recovery documentation if this method
> > > >>>>>> is used.
> > > >>>>>>
> > > >>>>>> v2: fix documentation (Raag)
> > > >>>>>>
> > > >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> > > >>>>>> Cc: Christian König <christian.koenig@amd.com>
> > > >>>>>> Cc: David Airlie <airlied@gmail.com>
> > > >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> > > >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > >>>>>
> > > >>>>> I'm not really understanding what this is useful for, maybe concrete
> > > >>>>> example in the form of driver code that uses this, and some tool or
> > > >>>>> documentation steps that should be taken for recovery?
> > > >>
> > > >> The case here is when FW underneath identified something badly corrupted on
> > > >> FW land and decided that only a firmware-flashing could solve the day and
> > > >> raise interrupt to the driver. At that point we want to wedge, but immediately
> > > >> hint the admin the recommended action.
> > > >>
> > > >>>>
> > > >>>> The recovery method for this particular case is to flash in a new firmware.
> > > >>>>
> > > >>>>> The issues I'm seeing here is that eventually we'll get different
> > > >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> > > >>>>> that leads us to an enumeration issue. Since it's just a string and an
> > > >>>>> enum I think it'd be better to just allocate a new one every time there's
> > > >>>>> a new strange recovery method instead of this opaque approach.
> > > >>>>
> > > >>>> That is exactly the opposite of what we discussed so far.
> > > > 
> > > > Sorry, I missed that context.
> > > > 
> > > >>>> The original idea was to add a firmware-flush recovery method which
> > > >>>> looked a bit wage since it didn't give any information on what to do
> > > >>>> exactly.
> > > >>>>
> > > >>>> That's why I suggested to add a more generic vendor-specific event
> > > >>>> with refers to the documentation and system log to see what actually
> > > >>>> needs to be done.
> > > >>>>
> > > >>>> Otherwise we would end up with events like firmware-flash, update FW
> > > >>>> image A, update FW image B, FW version mismatch etc....
> > > > 
> > > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > > this all to not be an issue.
> > > > 
> > > >>> Agree. Any newly allocated method that is specific to a vendor is going to
> > > >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> > > >>> reduce the noise in DRM core.
> > > >>>
> > > >>> And yes, there could be different vendor-specific cases for the same driver
> > > >>> and the driver should be able to provide the means to distinguish between
> > > >>> them.
> > > >>
> > > >> Sim, what's your take on this then?
> > > >>
> > > >> Should we get back to the original idea of firmware-flash?
> > > > 
> > > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > > 
> > > > The reason I think it should be specific is because I'm assuming you want
> > > > to script this. And if you have a big fleet with different vendors, then
> > > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > > $vendor-$magic_step then it does become scriptable, and we do have have a
> > > > place to put some documentation on what you should do instead.
> > > > 
> > > > If the point of this interface isn't that it's scriptable, then I'm not
> > > > sure why it needs to be an uevent?
> > > 
> > > You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> > > 
> > > And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
> 
> I also don't like the idea or even the thought of scripting something like
> a firmware-flash. But only to fail with a better pin point to make admin
> lives easier with a notification.
> 
> > > 
> > > In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
> 
> Exactly, the hardware and firmware are giving the indication of what should be
> done. It is not 'unknown'.
> 
> > 
> > Yes, and since the recovery procedure is defined and known to the consumer,
> > it can potentially be automated (atleast for non-firmware cases).
> > 
> > > > I guess if you all want to stick with vendor-specific then I think that's
> 
> Well, I would honestly prefer a direct firmware-flash, but if that is not
> usable by other vendors and there's a push back on that, let's go with
> the vendor-specific then.

I think the procedure for firmware-flash is vendor specific, so the wedged event
alone is not sufficient either way. The consumer will need more guidance from
vendor documentation.

With vendor-specific method, the driver has the opportunity to cover as many
cases as it wants without having to create a new method everytime, and face the
same dilemma of being vendor agnostic.

> > > > ok with me too, but the docs should at least explain how to figure out
> > > > from the uevent which vendor you're on with a small example. What I'm
> > > > worried is that if we have this on multiple drivers userspace will
> > > > otherwise make a complete mess and might want to run the wrong recovery
> > > > steps.
> > 
> > The device id along with driver can be identified from uevent (probably
> > available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
> > already knows if the device fits the criteria for recovery.
> > 
> > > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > > then also comes with the documentation for what exactly you're supposed to
> > > > do as something you can script. And not just this stand-alone patch here.
> > 
> > Perhaps the rest of the series didn't make it to dri-devel, which will answer
> > most of the above.
> 
> Riana, could you please try to provide a bit more documentation like Sima
> asked and re-send the entire series to dri-devel?

With the ideas in this thread also documented so that we don't end up repeating
the same discussion.

Raag

> > > >>>>>> ---
> > > >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > > >>>>>>  include/drm/drm_device.h       | 4 ++++
> > > >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> > > >>>>>>
> > > >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > >>>>>> index 263e5a97c080..c33070bdb347 100644
> > > >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> > > >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> > > >>>>>> @@ -421,10 +421,10 @@ Recovery
> > > >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> > > >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > > >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> > > >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > >>>>>> -will be sent instead.
> > > >>>>>> +more side-effects. If recovery method is specific to vendor
> > > >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> > > >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > >>>>>>  
> > > >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> > > >>>>>>  following expectations.
> > > >>>>>> @@ -435,6 +435,7 @@ following expectations.
> > > >>>>>>      none            optional telemetry collection
> > > >>>>>>      rebind          unbind + bind driver
> > > >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> > > >>>>>> +    vendor-specific vendor specific recovery method
> > > >>>>>>      unknown         consumer policy
> > > >>>>>>      =============== ========================================
> > > >>>>>>  
> > > >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > >>>>>> index cdd591b11488..0ac723a46a91 100644
> > > >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> > > >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> > > >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > >>>>>>  		return "rebind";
> > > >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > >>>>>>  		return "bus-reset";
> > > >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > >>>>>> +		return "vendor-specific";
> > > >>>>>>  	default:
> > > >>>>>>  		return NULL;
> > > >>>>>>  	}
> > > >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> > > >>>>>> --- a/include/drm/drm_device.h
> > > >>>>>> +++ b/include/drm/drm_device.h
> > > >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> > > >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> > > >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> > > >>>>>> + *
> > > >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > >>>>>> + * details.
> > > >>>>>>   */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > >>>>>>  
> > > >>>>>>  /**
> > > >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > >>>>>> -- 
> > > >>>>>> 2.47.1
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > > 
> > > 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10 21:46                   ` Raag Jadav
@ 2025-07-11  5:17                     ` Riana Tauro
  2025-07-11  6:08                       ` Raag Jadav
  0 siblings, 1 reply; 17+ messages in thread
From: Riana Tauro @ 2025-07-11  5:17 UTC (permalink / raw)
  To: Raag Jadav, Rodrigo Vivi
  Cc: Christian König, Simona Vetter, intel-xe, anshuman.gupta,
	lucas.demarchi, aravind.iddamsetty, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida, David Airlie,
	dri-devel



On 7/11/2025 3:16 AM, Raag Jadav wrote:
> On Thu, Jul 10, 2025 at 03:00:06PM -0400, Rodrigo Vivi wrote:
>> On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
>>> On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
>>>> On 10.07.25 11:01, Simona Vetter wrote:
>>>>> On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
>>>>>> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
>>>>>>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
>>>>>>>> On 09.07.25 15:41, Simona Vetter wrote:
>>>>>>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>>>>>>>>>> Certain errors can cause the device to be wedged and may
>>>>>>>>>> require a vendor specific recovery method to restore normal
>>>>>>>>>> operation.
>>>>>>>>>>
>>>>>>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>>>>>>>>>> must provide additional recovery documentation if this method
>>>>>>>>>> is used.
>>>>>>>>>>
>>>>>>>>>> v2: fix documentation (Raag)
>>>>>>>>>>
>>>>>>>>>> Cc: André Almeida <andrealmeid@igalia.com>
>>>>>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>> Cc: <dri-devel@lists.freedesktop.org>
>>>>>>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>>>>>>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>>>>>>>
>>>>>>>>> I'm not really understanding what this is useful for, maybe concrete
>>>>>>>>> example in the form of driver code that uses this, and some tool or
>>>>>>>>> documentation steps that should be taken for recovery?
>>>>>>
>>>>>> The case here is when FW underneath identified something badly corrupted on
>>>>>> FW land and decided that only a firmware-flashing could solve the day and
>>>>>> raise interrupt to the driver. At that point we want to wedge, but immediately
>>>>>> hint the admin the recommended action.
>>>>>>
>>>>>>>>
>>>>>>>> The recovery method for this particular case is to flash in a new firmware.
>>>>>>>>
>>>>>>>>> The issues I'm seeing here is that eventually we'll get different
>>>>>>>>> vendor-specific recovery steps, and maybe even on the same device, and
>>>>>>>>> that leads us to an enumeration issue. Since it's just a string and an
>>>>>>>>> enum I think it'd be better to just allocate a new one every time there's
>>>>>>>>> a new strange recovery method instead of this opaque approach.
>>>>>>>>
>>>>>>>> That is exactly the opposite of what we discussed so far.
>>>>>
>>>>> Sorry, I missed that context.
>>>>>
>>>>>>>> The original idea was to add a firmware-flush recovery method which
>>>>>>>> looked a bit wage since it didn't give any information on what to do
>>>>>>>> exactly.
>>>>>>>>
>>>>>>>> That's why I suggested to add a more generic vendor-specific event
>>>>>>>> with refers to the documentation and system log to see what actually
>>>>>>>> needs to be done.
>>>>>>>>
>>>>>>>> Otherwise we would end up with events like firmware-flash, update FW
>>>>>>>> image A, update FW image B, FW version mismatch etc....
>>>>>
>>>>> Yeah, that's kinda what I expect to happen, and we have enough numbers for
>>>>> this all to not be an issue.
>>>>>
>>>>>>> Agree. Any newly allocated method that is specific to a vendor is going to
>>>>>>> be opaque anyway, since it can't be generic for all drivers. This just helps
>>>>>>> reduce the noise in DRM core.
>>>>>>>
>>>>>>> And yes, there could be different vendor-specific cases for the same driver
>>>>>>> and the driver should be able to provide the means to distinguish between
>>>>>>> them.
>>>>>>
>>>>>> Sim, what's your take on this then?
>>>>>>
>>>>>> Should we get back to the original idea of firmware-flash?
>>>>>
>>>>> Maybe intel-firmware-flash or something, meaning prefix with the vendor?
>>>>>
>>>>> The reason I think it should be specific is because I'm assuming you want
>>>>> to script this. And if you have a big fleet with different vendors, then
>>>>> "vendor-specific" doesn't tell you enough. But if it's something like
>>>>> $vendor-$magic_step then it does become scriptable, and we do have have a
>>>>> place to put some documentation on what you should do instead.
>>>>>
>>>>> If the point of this interface isn't that it's scriptable, then I'm not
>>>>> sure why it needs to be an uevent?
>>>>
>>>> You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
>>>>
>>>> And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
>>
>> I also don't like the idea or even the thought of scripting something like
>> a firmware-flash. But only to fail with a better pin point to make admin
>> lives easier with a notification.
>>
>>>>
>>>> In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
>>
>> Exactly, the hardware and firmware are giving the indication of what should be
>> done. It is not 'unknown'.
>>
>>>
>>> Yes, and since the recovery procedure is defined and known to the consumer,
>>> it can potentially be automated (atleast for non-firmware cases).
>>>
>>>>> I guess if you all want to stick with vendor-specific then I think that's
>>
>> Well, I would honestly prefer a direct firmware-flash, but if that is not
>> usable by other vendors and there's a push back on that, let's go with
>> the vendor-specific then.
> 
> I think the procedure for firmware-flash is vendor specific, so the wedged event
> alone is not sufficient either way. The consumer will need more guidance from
> vendor documentation.

Procedure of firmware-flash is vendor specific, but the term 
'firmware-flash' is still generic. The patch doesn't mention any vendor 
specific firmware or procedure. The push back was for the number of 
macros that can be added for other operations.


> 
> With vendor-specific method, the driver has the opportunity to cover as many
> cases as it wants without having to create a new method everytime, and face the
> same dilemma of being vendor agnostic.
> 
>>>>> ok with me too, but the docs should at least explain how to figure out
>>>>> from the uevent which vendor you're on with a small example. What I'm
>>>>> worried is that if we have this on multiple drivers userspace will
>>>>> otherwise make a complete mess and might want to run the wrong recovery
>>>>> steps.
>>>
>>> The device id along with driver can be identified from uevent (probably
>>> available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
>>> already knows if the device fits the criteria for recovery.
>>>
>>>>> I think ideally, no matter what, we'd have a concrete driver patch which
>>>>> then also comes with the documentation for what exactly you're supposed to
>>>>> do as something you can script. And not just this stand-alone patch here.
>>>
>>> Perhaps the rest of the series didn't make it to dri-devel, which will answer
>>> most of the above.
>>
>> Riana, could you please try to provide a bit more documentation like Sima
>> asked and re-send the entire series to dri-devel?

Sure will send the entire series to dri-devel. The documentation is 
present in the series.

> 
> With the ideas in this thread also documented so that we don't end up repeating
> the same discussion.
It is mentioned in cover letter but i didn't send it to dri-devel. will 
add more details

Thanks
Riana

> 
> Raag
> 
>>>>>>>>>> ---
>>>>>>>>>>   Documentation/gpu/drm-uapi.rst | 9 +++++----
>>>>>>>>>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>>>>>>>>>   include/drm/drm_device.h       | 4 ++++
>>>>>>>>>>   3 files changed, 11 insertions(+), 4 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>>>>>>>>> index 263e5a97c080..c33070bdb347 100644
>>>>>>>>>> --- a/Documentation/gpu/drm-uapi.rst
>>>>>>>>>> +++ b/Documentation/gpu/drm-uapi.rst
>>>>>>>>>> @@ -421,10 +421,10 @@ Recovery
>>>>>>>>>>   Current implementation defines three recovery methods, out of which, drivers
>>>>>>>>>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>>>>>>>>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>>>>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
>>>>>>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>>>>>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>>>>>>>>> -will be sent instead.
>>>>>>>>>> +more side-effects. If recovery method is specific to vendor
>>>>>>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>>>>>>>>>> +specific documentation for further recovery steps. If driver is unsure about
>>>>>>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>>>>>>>>>   
>>>>>>>>>>   Userspace consumers can parse this event and attempt recovery as per the
>>>>>>>>>>   following expectations.
>>>>>>>>>> @@ -435,6 +435,7 @@ following expectations.
>>>>>>>>>>       none            optional telemetry collection
>>>>>>>>>>       rebind          unbind + bind driver
>>>>>>>>>>       bus-reset       unbind + bus reset/re-enumeration + bind
>>>>>>>>>> +    vendor-specific vendor specific recovery method
>>>>>>>>>>       unknown         consumer policy
>>>>>>>>>>       =============== ========================================
>>>>>>>>>>   
>>>>>>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>>>>>>>> index cdd591b11488..0ac723a46a91 100644
>>>>>>>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>>>>>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>>>>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>>>>>>>>>   		return "rebind";
>>>>>>>>>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>>>>>>>>>   		return "bus-reset";
>>>>>>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>>>>>>>>>> +		return "vendor-specific";
>>>>>>>>>>   	default:
>>>>>>>>>>   		return NULL;
>>>>>>>>>>   	}
>>>>>>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>>>>>>>>> index 08b3b2467c4c..08a087f149ff 100644
>>>>>>>>>> --- a/include/drm/drm_device.h
>>>>>>>>>> +++ b/include/drm/drm_device.h
>>>>>>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
>>>>>>>>>>    * Recovery methods for wedged device in order of less to more side-effects.
>>>>>>>>>>    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>>>>>>>>>    * use any one, multiple (or'd) or none depending on their needs.
>>>>>>>>>> + *
>>>>>>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>>>>>>>>>> + * details.
>>>>>>>>>>    */
>>>>>>>>>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>>>>>>>>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>>>>>>>>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>>>>>>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>>>>>>>>>   
>>>>>>>>>>   /**
>>>>>>>>>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>>>>>>>>>> -- 
>>>>>>>>>> 2.47.1
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-11  5:17                     ` Riana Tauro
@ 2025-07-11  6:08                       ` Raag Jadav
  0 siblings, 0 replies; 17+ messages in thread
From: Raag Jadav @ 2025-07-11  6:08 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Rodrigo Vivi, Christian König, Simona Vetter, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Fri, Jul 11, 2025 at 10:47:39AM +0530, Riana Tauro wrote:
> On 7/11/2025 3:16 AM, Raag Jadav wrote:
> > On Thu, Jul 10, 2025 at 03:00:06PM -0400, Rodrigo Vivi wrote:
> > > On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
> > > > On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > > > > On 10.07.25 11:01, Simona Vetter wrote:
> > > > > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > > > > > > On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > > > > > > > On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > > > > > > > On 09.07.25 15:41, Simona Vetter wrote:
> > > > > > > > > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > > > > > > > > > Certain errors can cause the device to be wedged and may
> > > > > > > > > > > require a vendor specific recovery method to restore normal
> > > > > > > > > > > operation.
> > > > > > > > > > > 
> > > > > > > > > > > Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > > > > > > > > > must provide additional recovery documentation if this method
> > > > > > > > > > > is used.
> > > > > > > > > > > 
> > > > > > > > > > > v2: fix documentation (Raag)
> > > > > > > > > > > 
> > > > > > > > > > > Cc: André Almeida <andrealmeid@igalia.com>
> > > > > > > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > > > > > > Cc: David Airlie <airlied@gmail.com>
> > > > > > > > > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > > > > > > > > Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > > > > > > > > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > > > > > > > > 
> > > > > > > > > > I'm not really understanding what this is useful for, maybe concrete
> > > > > > > > > > example in the form of driver code that uses this, and some tool or
> > > > > > > > > > documentation steps that should be taken for recovery?
> > > > > > > 
> > > > > > > The case here is when FW underneath identified something badly corrupted on
> > > > > > > FW land and decided that only a firmware-flashing could solve the day and
> > > > > > > raise interrupt to the driver. At that point we want to wedge, but immediately
> > > > > > > hint the admin the recommended action.
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > The recovery method for this particular case is to flash in a new firmware.
> > > > > > > > > 
> > > > > > > > > > The issues I'm seeing here is that eventually we'll get different
> > > > > > > > > > vendor-specific recovery steps, and maybe even on the same device, and
> > > > > > > > > > that leads us to an enumeration issue. Since it's just a string and an
> > > > > > > > > > enum I think it'd be better to just allocate a new one every time there's
> > > > > > > > > > a new strange recovery method instead of this opaque approach.
> > > > > > > > > 
> > > > > > > > > That is exactly the opposite of what we discussed so far.
> > > > > > 
> > > > > > Sorry, I missed that context.
> > > > > > 
> > > > > > > > > The original idea was to add a firmware-flush recovery method which
> > > > > > > > > looked a bit wage since it didn't give any information on what to do
> > > > > > > > > exactly.
> > > > > > > > > 
> > > > > > > > > That's why I suggested to add a more generic vendor-specific event
> > > > > > > > > with refers to the documentation and system log to see what actually
> > > > > > > > > needs to be done.
> > > > > > > > > 
> > > > > > > > > Otherwise we would end up with events like firmware-flash, update FW
> > > > > > > > > image A, update FW image B, FW version mismatch etc....
> > > > > > 
> > > > > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > > > > this all to not be an issue.
> > > > > > 
> > > > > > > > Agree. Any newly allocated method that is specific to a vendor is going to
> > > > > > > > be opaque anyway, since it can't be generic for all drivers. This just helps
> > > > > > > > reduce the noise in DRM core.
> > > > > > > > 
> > > > > > > > And yes, there could be different vendor-specific cases for the same driver
> > > > > > > > and the driver should be able to provide the means to distinguish between
> > > > > > > > them.
> > > > > > > 
> > > > > > > Sim, what's your take on this then?
> > > > > > > 
> > > > > > > Should we get back to the original idea of firmware-flash?
> > > > > > 
> > > > > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > > > > 
> > > > > > The reason I think it should be specific is because I'm assuming you want
> > > > > > to script this. And if you have a big fleet with different vendors, then
> > > > > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > > > > $vendor-$magic_step then it does become scriptable, and we do have have a
> > > > > > place to put some documentation on what you should do instead.
> > > > > > 
> > > > > > If the point of this interface isn't that it's scriptable, then I'm not
> > > > > > sure why it needs to be an uevent?
> > > > > 
> > > > > You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> > > > > 
> > > > > And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
> > > 
> > > I also don't like the idea or even the thought of scripting something like
> > > a firmware-flash. But only to fail with a better pin point to make admin
> > > lives easier with a notification.
> > > 
> > > > > 
> > > > > In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
> > > 
> > > Exactly, the hardware and firmware are giving the indication of what should be
> > > done. It is not 'unknown'.
> > > 
> > > > 
> > > > Yes, and since the recovery procedure is defined and known to the consumer,
> > > > it can potentially be automated (atleast for non-firmware cases).
> > > > 
> > > > > > I guess if you all want to stick with vendor-specific then I think that's
> > > 
> > > Well, I would honestly prefer a direct firmware-flash, but if that is not
> > > usable by other vendors and there's a push back on that, let's go with
> > > the vendor-specific then.
> > 
> > I think the procedure for firmware-flash is vendor specific, so the wedged event
> > alone is not sufficient either way. The consumer will need more guidance from
> > vendor documentation.
> 
> Procedure of firmware-flash is vendor specific, but the term
> 'firmware-flash' is still generic. The patch doesn't mention any vendor
> specific firmware or procedure. The push back was for the number of macros
> that can be added for other operations.

A common procedure for the methods is what makes them agnostic and usable
for all drivers. Otherwise it's pretty much a chaos for the consumer.

> > With vendor-specific method, the driver has the opportunity to cover as many
> > cases as it wants without having to create a new method everytime, and face the
> > same dilemma of being vendor agnostic.
> > 
> > > > > > ok with me too, but the docs should at least explain how to figure out
> > > > > > from the uevent which vendor you're on with a small example. What I'm
> > > > > > worried is that if we have this on multiple drivers userspace will
> > > > > > otherwise make a complete mess and might want to run the wrong recovery
> > > > > > steps.
> > > > 
> > > > The device id along with driver can be identified from uevent (probably
> > > > available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
> > > > already knows if the device fits the criteria for recovery.
> > > > 
> > > > > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > > > > then also comes with the documentation for what exactly you're supposed to
> > > > > > do as something you can script. And not just this stand-alone patch here.
> > > > 
> > > > Perhaps the rest of the series didn't make it to dri-devel, which will answer
> > > > most of the above.
> > > 
> > > Riana, could you please try to provide a bit more documentation like Sima
> > > asked and re-send the entire series to dri-devel?
> 
> Sure will send the entire series to dri-devel. The documentation is present
> in the series.
> 
> > 
> > With the ideas in this thread also documented so that we don't end up repeating
> > the same discussion.
> It is mentioned in cover letter but i didn't send it to dri-devel. will add
> more details

Thank you.

Raag

> > > > > > > > > > > ---
> > > > > > > > > > >   Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > > > > > > > > >   drivers/gpu/drm/drm_drv.c      | 2 ++
> > > > > > > > > > >   include/drm/drm_device.h       | 4 ++++
> > > > > > > > > > >   3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > > > > > > > > > index 263e5a97c080..c33070bdb347 100644
> > > > > > > > > > > --- a/Documentation/gpu/drm-uapi.rst
> > > > > > > > > > > +++ b/Documentation/gpu/drm-uapi.rst
> > > > > > > > > > > @@ -421,10 +421,10 @@ Recovery
> > > > > > > > > > >   Current implementation defines three recovery methods, out of which, drivers
> > > > > > > > > > >   can use any one, multiple or none. Method(s) of choice will be sent in the
> > > > > > > > > > >   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > > > > > > > > > -more side-effects. If driver is unsure about recovery or method is unknown
> > > > > > > > > > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > > > > > > > > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > > > > > > > > > -will be sent instead.
> > > > > > > > > > > +more side-effects. If recovery method is specific to vendor
> > > > > > > > > > > +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > > > > > > > > > +specific documentation for further recovery steps. If driver is unsure about
> > > > > > > > > > > +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > > > > > > > > >   Userspace consumers can parse this event and attempt recovery as per the
> > > > > > > > > > >   following expectations.
> > > > > > > > > > > @@ -435,6 +435,7 @@ following expectations.
> > > > > > > > > > >       none            optional telemetry collection
> > > > > > > > > > >       rebind          unbind + bind driver
> > > > > > > > > > >       bus-reset       unbind + bus reset/re-enumeration + bind
> > > > > > > > > > > +    vendor-specific vendor specific recovery method
> > > > > > > > > > >       unknown         consumer policy
> > > > > > > > > > >       =============== ========================================
> > > > > > > > > > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > > > > > > > > > index cdd591b11488..0ac723a46a91 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/drm_drv.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/drm_drv.c
> > > > > > > > > > > @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > > > > > > > > >   		return "rebind";
> > > > > > > > > > >   	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > > > > > > > > >   		return "bus-reset";
> > > > > > > > > > > +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > > > > > > > > > +		return "vendor-specific";
> > > > > > > > > > >   	default:
> > > > > > > > > > >   		return NULL;
> > > > > > > > > > >   	}
> > > > > > > > > > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > > > > > > > > > index 08b3b2467c4c..08a087f149ff 100644
> > > > > > > > > > > --- a/include/drm/drm_device.h
> > > > > > > > > > > +++ b/include/drm/drm_device.h
> > > > > > > > > > > @@ -26,10 +26,14 @@ struct pci_controller;
> > > > > > > > > > >    * Recovery methods for wedged device in order of less to more side-effects.
> > > > > > > > > > >    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > > > > > > > > >    * use any one, multiple (or'd) or none depending on their needs.
> > > > > > > > > > > + *
> > > > > > > > > > > + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > > > > > > > > > + * details.
> > > > > > > > > > >    */
> > > > > > > > > > >   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > > > > > > > > >   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > > > > > > > > >   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > > > > > > > > > +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > > > > > > > > >   /**
> > > > > > > > > > >    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > > > > > > > > > -- 
> > > > > > > > > > > 2.47.1
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > 
> > > > > 
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10 19:00                 ` Rodrigo Vivi
  2025-07-10 21:46                   ` Raag Jadav
@ 2025-07-11  8:56                   ` Simona Vetter
  1 sibling, 0 replies; 17+ messages in thread
From: Simona Vetter @ 2025-07-11  8:56 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Raag Jadav, Christian König, Simona Vetter, Riana Tauro,
	intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 03:00:06PM -0400, Rodrigo Vivi wrote:
> On Thu, Jul 10, 2025 at 01:24:52PM +0300, Raag Jadav wrote:
> > On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > > On 10.07.25 11:01, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > > >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > > >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > >>>> On 09.07.25 15:41, Simona Vetter wrote:
> > > >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > >>>>>> Certain errors can cause the device to be wedged and may
> > > >>>>>> require a vendor specific recovery method to restore normal
> > > >>>>>> operation.
> > > >>>>>>
> > > >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > >>>>>> must provide additional recovery documentation if this method
> > > >>>>>> is used.
> > > >>>>>>
> > > >>>>>> v2: fix documentation (Raag)
> > > >>>>>>
> > > >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> > > >>>>>> Cc: Christian König <christian.koenig@amd.com>
> > > >>>>>> Cc: David Airlie <airlied@gmail.com>
> > > >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> > > >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > >>>>>
> > > >>>>> I'm not really understanding what this is useful for, maybe concrete
> > > >>>>> example in the form of driver code that uses this, and some tool or
> > > >>>>> documentation steps that should be taken for recovery?
> > > >>
> > > >> The case here is when FW underneath identified something badly corrupted on
> > > >> FW land and decided that only a firmware-flashing could solve the day and
> > > >> raise interrupt to the driver. At that point we want to wedge, but immediately
> > > >> hint the admin the recommended action.
> > > >>
> > > >>>>
> > > >>>> The recovery method for this particular case is to flash in a new firmware.
> > > >>>>
> > > >>>>> The issues I'm seeing here is that eventually we'll get different
> > > >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> > > >>>>> that leads us to an enumeration issue. Since it's just a string and an
> > > >>>>> enum I think it'd be better to just allocate a new one every time there's
> > > >>>>> a new strange recovery method instead of this opaque approach.
> > > >>>>
> > > >>>> That is exactly the opposite of what we discussed so far.
> > > > 
> > > > Sorry, I missed that context.
> > > > 
> > > >>>> The original idea was to add a firmware-flush recovery method which
> > > >>>> looked a bit wage since it didn't give any information on what to do
> > > >>>> exactly.
> > > >>>>
> > > >>>> That's why I suggested to add a more generic vendor-specific event
> > > >>>> with refers to the documentation and system log to see what actually
> > > >>>> needs to be done.
> > > >>>>
> > > >>>> Otherwise we would end up with events like firmware-flash, update FW
> > > >>>> image A, update FW image B, FW version mismatch etc....
> > > > 
> > > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > > this all to not be an issue.
> > > > 
> > > >>> Agree. Any newly allocated method that is specific to a vendor is going to
> > > >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> > > >>> reduce the noise in DRM core.
> > > >>>
> > > >>> And yes, there could be different vendor-specific cases for the same driver
> > > >>> and the driver should be able to provide the means to distinguish between
> > > >>> them.
> > > >>
> > > >> Sim, what's your take on this then?
> > > >>
> > > >> Should we get back to the original idea of firmware-flash?
> > > > 
> > > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > > 
> > > > The reason I think it should be specific is because I'm assuming you want
> > > > to script this. And if you have a big fleet with different vendors, then
> > > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > > $vendor-$magic_step then it does become scriptable, and we do have have a
> > > > place to put some documentation on what you should do instead.
> > > > 
> > > > If the point of this interface isn't that it's scriptable, then I'm not
> > > > sure why it needs to be an uevent?
> > > 
> > > You should probably read up on the previous discussion, cause that is exactly what I asked as well :)
> > > 
> > > And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example.
> 
> I also don't like the idea or even the thought of scripting something like
> a firmware-flash. But only to fail with a better pin point to make admin
> lives easier with a notification.
> 
> > > 
> > > In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it.
> 
> Exactly, the hardware and firmware are giving the indication of what should be
> done. It is not 'unknown'.
> 
> > 
> > Yes, and since the recovery procedure is defined and known to the consumer,
> > it can potentially be automated (atleast for non-firmware cases).
> > 
> > > > I guess if you all want to stick with vendor-specific then I think that's
> 
> Well, I would honestly prefer a direct firmware-flash, but if that is not
> usable by other vendors and there's a push back on that, let's go with
> the vendor-specific then.
> 
> > > > ok with me too, but the docs should at least explain how to figure out
> > > > from the uevent which vendor you're on with a small example. What I'm
> > > > worried is that if we have this on multiple drivers userspace will
> > > > otherwise make a complete mess and might want to run the wrong recovery
> > > > steps.
> > 
> > The device id along with driver can be identified from uevent (probably
> > available inside DEVPATH somewhere) to distinguish the vendor. So the consumer
> > already knows if the device fits the criteria for recovery.
> > 
> > > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > > then also comes with the documentation for what exactly you're supposed to
> > > > do as something you can script. And not just this stand-alone patch here.
> > 
> > Perhaps the rest of the series didn't make it to dri-devel, which will answer
> > most of the above.
> 
> Riana, could you please try to provide a bit more documentation like Sima
> asked and re-send the entire series to dri-devel?

Yeah all the stuff discussed here needs to be included in the commit
message.

Also ideally the patch series would add a user in xe, and the xe patch
would then also include documentation on how that firmware flash should
happen. Also I guess the xe patch I guess should explain how backwards
compatibility should work if there's ever a need for yet-another recovery
method. Asking userspace to filter on specific pciid sounds brittle, I
think worst case we'll just add vendor-specific-1 or something :-)

Which is why I still think we should just enum them all, but I'm ok with
just properly documenting the reasoning behind this all.

Cheers, Sima

> 
> Thanks,
> Rodrigo.
> 
> > 
> > Raag
> > 
> > > >>>>>> ---
> > > >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> > > >>>>>>  include/drm/drm_device.h       | 4 ++++
> > > >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> > > >>>>>>
> > > >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > >>>>>> index 263e5a97c080..c33070bdb347 100644
> > > >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> > > >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> > > >>>>>> @@ -421,10 +421,10 @@ Recovery
> > > >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> > > >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> > > >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> > > >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > >>>>>> -will be sent instead.
> > > >>>>>> +more side-effects. If recovery method is specific to vendor
> > > >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> > > >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > >>>>>>  
> > > >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> > > >>>>>>  following expectations.
> > > >>>>>> @@ -435,6 +435,7 @@ following expectations.
> > > >>>>>>      none            optional telemetry collection
> > > >>>>>>      rebind          unbind + bind driver
> > > >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> > > >>>>>> +    vendor-specific vendor specific recovery method
> > > >>>>>>      unknown         consumer policy
> > > >>>>>>      =============== ========================================
> > > >>>>>>  
> > > >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > >>>>>> index cdd591b11488..0ac723a46a91 100644
> > > >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> > > >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> > > >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > >>>>>>  		return "rebind";
> > > >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > >>>>>>  		return "bus-reset";
> > > >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > >>>>>> +		return "vendor-specific";
> > > >>>>>>  	default:
> > > >>>>>>  		return NULL;
> > > >>>>>>  	}
> > > >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> > > >>>>>> --- a/include/drm/drm_device.h
> > > >>>>>> +++ b/include/drm/drm_device.h
> > > >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> > > >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> > > >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> > > >>>>>> + *
> > > >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > >>>>>> + * details.
> > > >>>>>>   */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > >>>>>>  
> > > >>>>>>  /**
> > > >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > >>>>>> -- 
> > > >>>>>> 2.47.1
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > > 
> > > 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-10  9:37             ` Christian König
  2025-07-10 10:24               ` Raag Jadav
@ 2025-07-11  8:59               ` Simona Vetter
  2025-07-14  5:27                 ` Riana Tauro
  1 sibling, 1 reply; 17+ messages in thread
From: Simona Vetter @ 2025-07-11  8:59 UTC (permalink / raw)
  To: Christian König
  Cc: Simona Vetter, Rodrigo Vivi, Raag Jadav, Riana Tauro, intel-xe,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> On 10.07.25 11:01, Simona Vetter wrote:
> > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> >>>> On 09.07.25 15:41, Simona Vetter wrote:
> >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> >>>>>> Certain errors can cause the device to be wedged and may
> >>>>>> require a vendor specific recovery method to restore normal
> >>>>>> operation.
> >>>>>>
> >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> >>>>>> must provide additional recovery documentation if this method
> >>>>>> is used.
> >>>>>>
> >>>>>> v2: fix documentation (Raag)
> >>>>>>
> >>>>>> Cc: André Almeida <andrealmeid@igalia.com>
> >>>>>> Cc: Christian König <christian.koenig@amd.com>
> >>>>>> Cc: David Airlie <airlied@gmail.com>
> >>>>>> Cc: <dri-devel@lists.freedesktop.org>
> >>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> >>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> >>>>>
> >>>>> I'm not really understanding what this is useful for, maybe concrete
> >>>>> example in the form of driver code that uses this, and some tool or
> >>>>> documentation steps that should be taken for recovery?
> >>
> >> The case here is when FW underneath identified something badly corrupted on
> >> FW land and decided that only a firmware-flashing could solve the day and
> >> raise interrupt to the driver. At that point we want to wedge, but immediately
> >> hint the admin the recommended action.
> >>
> >>>>
> >>>> The recovery method for this particular case is to flash in a new firmware.
> >>>>
> >>>>> The issues I'm seeing here is that eventually we'll get different
> >>>>> vendor-specific recovery steps, and maybe even on the same device, and
> >>>>> that leads us to an enumeration issue. Since it's just a string and an
> >>>>> enum I think it'd be better to just allocate a new one every time there's
> >>>>> a new strange recovery method instead of this opaque approach.
> >>>>
> >>>> That is exactly the opposite of what we discussed so far.
> > 
> > Sorry, I missed that context.
> > 
> >>>> The original idea was to add a firmware-flush recovery method which
> >>>> looked a bit wage since it didn't give any information on what to do
> >>>> exactly.
> >>>>
> >>>> That's why I suggested to add a more generic vendor-specific event
> >>>> with refers to the documentation and system log to see what actually
> >>>> needs to be done.
> >>>>
> >>>> Otherwise we would end up with events like firmware-flash, update FW
> >>>> image A, update FW image B, FW version mismatch etc....
> > 
> > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > this all to not be an issue.
> > 
> >>> Agree. Any newly allocated method that is specific to a vendor is going to
> >>> be opaque anyway, since it can't be generic for all drivers. This just helps
> >>> reduce the noise in DRM core.
> >>>
> >>> And yes, there could be different vendor-specific cases for the same driver
> >>> and the driver should be able to provide the means to distinguish between
> >>> them.
> >>
> >> Sim, what's your take on this then?
> >>
> >> Should we get back to the original idea of firmware-flash?
> > 
> > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > 
> > The reason I think it should be specific is because I'm assuming you want
> > to script this. And if you have a big fleet with different vendors, then
> > "vendor-specific" doesn't tell you enough. But if it's something like
> > $vendor-$magic_step then it does become scriptable, and we do have have a
> > place to put some documentation on what you should do instead.
> > 
> > If the point of this interface isn't that it's scriptable, then I'm not
> > sure why it needs to be an uevent?
> 
> You should probably read up on the previous discussion, cause that is
> exactly what I asked as well :)
> 
> And no, it should *not* be scripted. That would be a bit brave for a
> firmware update where you should absolutely not power down the system
> for example.

I guess if we clearly state that this is for manual recovery only, or for
cases where you exactly know what you're doing (fleet-specific scripts
instead of generic distros), I guess this very opaque code makes sense.

But we should clearly document then that doing anything scripted here is
very much "you get to keep the pieces", and definitely don't try to do
something fancy generic.

Which without documentation is just really confusing when some of the
other error codes clearly look like they're meant to facilitate scripted
recovery.

> In my understanding the new value "vendor-specific" basically means it
> is a known issue with a documented solution, while "unknown" means the
> driver has no idea how to solve it.

I think that's another detail which should be documented clearly.
-Sima
> 
> Regards,
> Christian.
> 
> > I guess if you all want to stick with vendor-specific then I think that's
> > ok with me too, but the docs should at least explain how to figure out
> > from the uevent which vendor you're on with a small example. What I'm
> > worried is that if we have this on multiple drivers userspace will
> > otherwise make a complete mess and might want to run the wrong recovery
> > steps.
> > 
> > I think ideally, no matter what, we'd have a concrete driver patch which
> > then also comes with the documentation for what exactly you're supposed to
> > do as something you can script. And not just this stand-alone patch here.
> > 
> > Cheers, Sima
> >>
> >>>
> >>> Raag
> >>>
> >>>>>> ---
> >>>>>>  Documentation/gpu/drm-uapi.rst | 9 +++++----
> >>>>>>  drivers/gpu/drm/drm_drv.c      | 2 ++
> >>>>>>  include/drm/drm_device.h       | 4 ++++
> >>>>>>  3 files changed, 11 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >>>>>> index 263e5a97c080..c33070bdb347 100644
> >>>>>> --- a/Documentation/gpu/drm-uapi.rst
> >>>>>> +++ b/Documentation/gpu/drm-uapi.rst
> >>>>>> @@ -421,10 +421,10 @@ Recovery
> >>>>>>  Current implementation defines three recovery methods, out of which, drivers
> >>>>>>  can use any one, multiple or none. Method(s) of choice will be sent in the
> >>>>>>  uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> >>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
> >>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> >>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> >>>>>> -will be sent instead.
> >>>>>> +more side-effects. If recovery method is specific to vendor
> >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> >>>>>> +specific documentation for further recovery steps. If driver is unsure about
> >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> >>>>>>  
> >>>>>>  Userspace consumers can parse this event and attempt recovery as per the
> >>>>>>  following expectations.
> >>>>>> @@ -435,6 +435,7 @@ following expectations.
> >>>>>>      none            optional telemetry collection
> >>>>>>      rebind          unbind + bind driver
> >>>>>>      bus-reset       unbind + bus reset/re-enumeration + bind
> >>>>>> +    vendor-specific vendor specific recovery method
> >>>>>>      unknown         consumer policy
> >>>>>>      =============== ========================================
> >>>>>>  
> >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> >>>>>> index cdd591b11488..0ac723a46a91 100644
> >>>>>> --- a/drivers/gpu/drm/drm_drv.c
> >>>>>> +++ b/drivers/gpu/drm/drm_drv.c
> >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> >>>>>>  		return "rebind";
> >>>>>>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
> >>>>>>  		return "bus-reset";
> >>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
> >>>>>> +		return "vendor-specific";
> >>>>>>  	default:
> >>>>>>  		return NULL;
> >>>>>>  	}
> >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> >>>>>> index 08b3b2467c4c..08a087f149ff 100644
> >>>>>> --- a/include/drm/drm_device.h
> >>>>>> +++ b/include/drm/drm_device.h
> >>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
> >>>>>>   * Recovery methods for wedged device in order of less to more side-effects.
> >>>>>>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> >>>>>>   * use any one, multiple (or'd) or none depending on their needs.
> >>>>>> + *
> >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> >>>>>> + * details.
> >>>>>>   */
> >>>>>>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> >>>>>>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> >>>>>>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> >>>>>>  
> >>>>>>  /**
> >>>>>>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> >>>>>> -- 
> >>>>>> 2.47.1
> >>>>>>
> >>>>>
> >>>>
> > 
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-11  8:59               ` Simona Vetter
@ 2025-07-14  5:27                 ` Riana Tauro
  2025-07-14 12:33                   ` Simona Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Riana Tauro @ 2025-07-14  5:27 UTC (permalink / raw)
  To: Simona Vetter, Christian König, Rodrigo Vivi, Raag Jadav
  Cc: intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel



On 7/11/2025 2:29 PM, Simona Vetter wrote:
> On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
>> On 10.07.25 11:01, Simona Vetter wrote:
>>> On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
>>>> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
>>>>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
>>>>>> On 09.07.25 15:41, Simona Vetter wrote:
>>>>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
>>>>>>>> Certain errors can cause the device to be wedged and may
>>>>>>>> require a vendor specific recovery method to restore normal
>>>>>>>> operation.
>>>>>>>>
>>>>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>>>>>>>> must provide additional recovery documentation if this method
>>>>>>>> is used.
>>>>>>>>
>>>>>>>> v2: fix documentation (Raag)
>>>>>>>>
>>>>>>>> Cc: André Almeida <andrealmeid@igalia.com>
>>>>>>>> Cc: Christian König <christian.koenig@amd.com>
>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>> Cc: <dri-devel@lists.freedesktop.org>
>>>>>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>>>>>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>>>>>
>>>>>>> I'm not really understanding what this is useful for, maybe concrete
>>>>>>> example in the form of driver code that uses this, and some tool or
>>>>>>> documentation steps that should be taken for recovery?
>>>>
>>>> The case here is when FW underneath identified something badly corrupted on
>>>> FW land and decided that only a firmware-flashing could solve the day and
>>>> raise interrupt to the driver. At that point we want to wedge, but immediately
>>>> hint the admin the recommended action.
>>>>
>>>>>>
>>>>>> The recovery method for this particular case is to flash in a new firmware.
>>>>>>
>>>>>>> The issues I'm seeing here is that eventually we'll get different
>>>>>>> vendor-specific recovery steps, and maybe even on the same device, and
>>>>>>> that leads us to an enumeration issue. Since it's just a string and an
>>>>>>> enum I think it'd be better to just allocate a new one every time there's
>>>>>>> a new strange recovery method instead of this opaque approach.
>>>>>>
>>>>>> That is exactly the opposite of what we discussed so far.
>>>
>>> Sorry, I missed that context.
>>>
>>>>>> The original idea was to add a firmware-flush recovery method which
>>>>>> looked a bit wage since it didn't give any information on what to do
>>>>>> exactly.
>>>>>>
>>>>>> That's why I suggested to add a more generic vendor-specific event
>>>>>> with refers to the documentation and system log to see what actually
>>>>>> needs to be done.
>>>>>>
>>>>>> Otherwise we would end up with events like firmware-flash, update FW
>>>>>> image A, update FW image B, FW version mismatch etc....
>>>
>>> Yeah, that's kinda what I expect to happen, and we have enough numbers for
>>> this all to not be an issue.
>>>
>>>>> Agree. Any newly allocated method that is specific to a vendor is going to
>>>>> be opaque anyway, since it can't be generic for all drivers. This just helps
>>>>> reduce the noise in DRM core.
>>>>>
>>>>> And yes, there could be different vendor-specific cases for the same driver
>>>>> and the driver should be able to provide the means to distinguish between
>>>>> them.
>>>>
>>>> Sim, what's your take on this then?
>>>>
>>>> Should we get back to the original idea of firmware-flash?
>>>
>>> Maybe intel-firmware-flash or something, meaning prefix with the vendor?
>>>
>>> The reason I think it should be specific is because I'm assuming you want
>>> to script this. And if you have a big fleet with different vendors, then
>>> "vendor-specific" doesn't tell you enough. But if it's something like
>>> $vendor-$magic_step then it does become scriptable, and we do have have a
>>> place to put some documentation on what you should do instead.
>>>
>>> If the point of this interface isn't that it's scriptable, then I'm not
>>> sure why it needs to be an uevent?
>>
>> You should probably read up on the previous discussion, cause that is
>> exactly what I asked as well :)
>>
>> And no, it should *not* be scripted. That would be a bit brave for a
>> firmware update where you should absolutely not power down the system
>> for example.
> 
> I guess if we clearly state that this is for manual recovery only, or for
> cases where you exactly know what you're doing (fleet-specific scripts
> instead of generic distros), I guess this very opaque code makes sense.
> 
> But we should clearly document then that doing anything scripted here is
> very much "you get to keep the pieces", and definitely don't try to do
> something fancy generic.


The documentation is part of the series but was sent only to intel-xe 
mailing list. Will re-send the entire series to dri-devel

https://lore.kernel.org/intel-xe/aHH2XGuOvz8bSlax@black.fi.intel.com/T/#m883269cf0b1f6891ecc9c24d3d45325f46d56572


> 
> Which without documentation is just really confusing when some of the
> other error codes clearly look like they're meant to facilitate scripted
> recovery.
> 

To get consensus on the patch, is 'vendor-specific' okay or is it better 
to have 'firmware-flash' with additional event parameter 'vendor' if 
number of macros is not a concern?

Thanks
Riana
>> In my understanding the new value "vendor-specific" basically means it
>> is a known issue with a documented solution, while "unknown" means the
>> driver has no idea how to solve it.
> 
> I think that's another detail which should be documented clearly.
> -Sima
>>
>> Regards,
>> Christian.
>>
>>> I guess if you all want to stick with vendor-specific then I think that's
>>> ok with me too, but the docs should at least explain how to figure out
>>> from the uevent which vendor you're on with a small example. What I'm
>>> worried is that if we have this on multiple drivers userspace will
>>> otherwise make a complete mess and might want to run the wrong recovery
>>> steps.
>>>
>>> I think ideally, no matter what, we'd have a concrete driver patch which
>>> then also comes with the documentation for what exactly you're supposed to
>>> do as something you can script. And not just this stand-alone patch here.
>>>
>>> Cheers, Sima
>>>>
>>>>>
>>>>> Raag
>>>>>
>>>>>>>> ---
>>>>>>>>   Documentation/gpu/drm-uapi.rst | 9 +++++----
>>>>>>>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>>>>>>>   include/drm/drm_device.h       | 4 ++++
>>>>>>>>   3 files changed, 11 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>>>>>>> index 263e5a97c080..c33070bdb347 100644
>>>>>>>> --- a/Documentation/gpu/drm-uapi.rst
>>>>>>>> +++ b/Documentation/gpu/drm-uapi.rst
>>>>>>>> @@ -421,10 +421,10 @@ Recovery
>>>>>>>>   Current implementation defines three recovery methods, out of which, drivers
>>>>>>>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>>>>>>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
>>>>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>>>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>>>>>>> -will be sent instead.
>>>>>>>> +more side-effects. If recovery method is specific to vendor
>>>>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>>>>>>>> +specific documentation for further recovery steps. If driver is unsure about
>>>>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
>>>>>>>>   
>>>>>>>>   Userspace consumers can parse this event and attempt recovery as per the
>>>>>>>>   following expectations.
>>>>>>>> @@ -435,6 +435,7 @@ following expectations.
>>>>>>>>       none            optional telemetry collection
>>>>>>>>       rebind          unbind + bind driver
>>>>>>>>       bus-reset       unbind + bus reset/re-enumeration + bind
>>>>>>>> +    vendor-specific vendor specific recovery method
>>>>>>>>       unknown         consumer policy
>>>>>>>>       =============== ========================================
>>>>>>>>   
>>>>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>>>>>> index cdd591b11488..0ac723a46a91 100644
>>>>>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>>>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>>>>>>>   		return "rebind";
>>>>>>>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>>>>>>>   		return "bus-reset";
>>>>>>>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>>>>>>>> +		return "vendor-specific";
>>>>>>>>   	default:
>>>>>>>>   		return NULL;
>>>>>>>>   	}
>>>>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>>>>>>> index 08b3b2467c4c..08a087f149ff 100644
>>>>>>>> --- a/include/drm/drm_device.h
>>>>>>>> +++ b/include/drm/drm_device.h
>>>>>>>> @@ -26,10 +26,14 @@ struct pci_controller;
>>>>>>>>    * Recovery methods for wedged device in order of less to more side-effects.
>>>>>>>>    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>>>>>>>    * use any one, multiple (or'd) or none depending on their needs.
>>>>>>>> + *
>>>>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>>>>>>>> + * details.
>>>>>>>>    */
>>>>>>>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>>>>>>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>>>>>>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>>>>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>>>>>>>   
>>>>>>>>   /**
>>>>>>>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>>>>>>>> -- 
>>>>>>>> 2.47.1
>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-14  5:27                 ` Riana Tauro
@ 2025-07-14 12:33                   ` Simona Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Simona Vetter @ 2025-07-14 12:33 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Simona Vetter, Christian König, Rodrigo Vivi, Raag Jadav,
	intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban,
	André Almeida, David Airlie, dri-devel

On Mon, Jul 14, 2025 at 10:57:48AM +0530, Riana Tauro wrote:
> 
> 
> On 7/11/2025 2:29 PM, Simona Vetter wrote:
> > On Thu, Jul 10, 2025 at 11:37:14AM +0200, Christian König wrote:
> > > On 10.07.25 11:01, Simona Vetter wrote:
> > > > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote:
> > > > > On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote:
> > > > > > On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote:
> > > > > > > On 09.07.25 15:41, Simona Vetter wrote:
> > > > > > > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote:
> > > > > > > > > Certain errors can cause the device to be wedged and may
> > > > > > > > > require a vendor specific recovery method to restore normal
> > > > > > > > > operation.
> > > > > > > > > 
> > > > > > > > > Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > > > > > > > must provide additional recovery documentation if this method
> > > > > > > > > is used.
> > > > > > > > > 
> > > > > > > > > v2: fix documentation (Raag)
> > > > > > > > > 
> > > > > > > > > Cc: André Almeida <andrealmeid@igalia.com>
> > > > > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > > > > Cc: David Airlie <airlied@gmail.com>
> > > > > > > > > Cc: <dri-devel@lists.freedesktop.org>
> > > > > > > > > Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > > > > > > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > > > > > > 
> > > > > > > > I'm not really understanding what this is useful for, maybe concrete
> > > > > > > > example in the form of driver code that uses this, and some tool or
> > > > > > > > documentation steps that should be taken for recovery?
> > > > > 
> > > > > The case here is when FW underneath identified something badly corrupted on
> > > > > FW land and decided that only a firmware-flashing could solve the day and
> > > > > raise interrupt to the driver. At that point we want to wedge, but immediately
> > > > > hint the admin the recommended action.
> > > > > 
> > > > > > > 
> > > > > > > The recovery method for this particular case is to flash in a new firmware.
> > > > > > > 
> > > > > > > > The issues I'm seeing here is that eventually we'll get different
> > > > > > > > vendor-specific recovery steps, and maybe even on the same device, and
> > > > > > > > that leads us to an enumeration issue. Since it's just a string and an
> > > > > > > > enum I think it'd be better to just allocate a new one every time there's
> > > > > > > > a new strange recovery method instead of this opaque approach.
> > > > > > > 
> > > > > > > That is exactly the opposite of what we discussed so far.
> > > > 
> > > > Sorry, I missed that context.
> > > > 
> > > > > > > The original idea was to add a firmware-flush recovery method which
> > > > > > > looked a bit wage since it didn't give any information on what to do
> > > > > > > exactly.
> > > > > > > 
> > > > > > > That's why I suggested to add a more generic vendor-specific event
> > > > > > > with refers to the documentation and system log to see what actually
> > > > > > > needs to be done.
> > > > > > > 
> > > > > > > Otherwise we would end up with events like firmware-flash, update FW
> > > > > > > image A, update FW image B, FW version mismatch etc....
> > > > 
> > > > Yeah, that's kinda what I expect to happen, and we have enough numbers for
> > > > this all to not be an issue.
> > > > 
> > > > > > Agree. Any newly allocated method that is specific to a vendor is going to
> > > > > > be opaque anyway, since it can't be generic for all drivers. This just helps
> > > > > > reduce the noise in DRM core.
> > > > > > 
> > > > > > And yes, there could be different vendor-specific cases for the same driver
> > > > > > and the driver should be able to provide the means to distinguish between
> > > > > > them.
> > > > > 
> > > > > Sim, what's your take on this then?
> > > > > 
> > > > > Should we get back to the original idea of firmware-flash?
> > > > 
> > > > Maybe intel-firmware-flash or something, meaning prefix with the vendor?
> > > > 
> > > > The reason I think it should be specific is because I'm assuming you want
> > > > to script this. And if you have a big fleet with different vendors, then
> > > > "vendor-specific" doesn't tell you enough. But if it's something like
> > > > $vendor-$magic_step then it does become scriptable, and we do have have a
> > > > place to put some documentation on what you should do instead.
> > > > 
> > > > If the point of this interface isn't that it's scriptable, then I'm not
> > > > sure why it needs to be an uevent?
> > > 
> > > You should probably read up on the previous discussion, cause that is
> > > exactly what I asked as well :)
> > > 
> > > And no, it should *not* be scripted. That would be a bit brave for a
> > > firmware update where you should absolutely not power down the system
> > > for example.
> > 
> > I guess if we clearly state that this is for manual recovery only, or for
> > cases where you exactly know what you're doing (fleet-specific scripts
> > instead of generic distros), I guess this very opaque code makes sense.
> > 
> > But we should clearly document then that doing anything scripted here is
> > very much "you get to keep the pieces", and definitely don't try to do
> > something fancy generic.
> 
> 
> The documentation is part of the series but was sent only to intel-xe
> mailing list. Will re-send the entire series to dri-devel
> 
> https://lore.kernel.org/intel-xe/aHH2XGuOvz8bSlax@black.fi.intel.com/T/#m883269cf0b1f6891ecc9c24d3d45325f46d56572

Duh, missed that, but yes definitely send the entire series to all mailing
lists. Especially when adding new drm features like this one does.

> > Which without documentation is just really confusing when some of the
> > other error codes clearly look like they're meant to facilitate scripted
> > recovery.
> > 
> 
> To get consensus on the patch, is 'vendor-specific' okay or is it better to
> have 'firmware-flash' with additional event parameter 'vendor' if number of
> macros is not a concern?

I'll refrain from a vote.
-Sima

> 
> Thanks
> Riana
> > > In my understanding the new value "vendor-specific" basically means it
> > > is a known issue with a documented solution, while "unknown" means the
> > > driver has no idea how to solve it.
> > 
> > I think that's another detail which should be documented clearly.
> > -Sima
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > I guess if you all want to stick with vendor-specific then I think that's
> > > > ok with me too, but the docs should at least explain how to figure out
> > > > from the uevent which vendor you're on with a small example. What I'm
> > > > worried is that if we have this on multiple drivers userspace will
> > > > otherwise make a complete mess and might want to run the wrong recovery
> > > > steps.
> > > > 
> > > > I think ideally, no matter what, we'd have a concrete driver patch which
> > > > then also comes with the documentation for what exactly you're supposed to
> > > > do as something you can script. And not just this stand-alone patch here.
> > > > 
> > > > Cheers, Sima
> > > > > 
> > > > > > 
> > > > > > Raag
> > > > > > 
> > > > > > > > > ---
> > > > > > > > >   Documentation/gpu/drm-uapi.rst | 9 +++++----
> > > > > > > > >   drivers/gpu/drm/drm_drv.c      | 2 ++
> > > > > > > > >   include/drm/drm_device.h       | 4 ++++
> > > > > > > > >   3 files changed, 11 insertions(+), 4 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > > > > > > > index 263e5a97c080..c33070bdb347 100644
> > > > > > > > > --- a/Documentation/gpu/drm-uapi.rst
> > > > > > > > > +++ b/Documentation/gpu/drm-uapi.rst
> > > > > > > > > @@ -421,10 +421,10 @@ Recovery
> > > > > > > > >   Current implementation defines three recovery methods, out of which, drivers
> > > > > > > > >   can use any one, multiple or none. Method(s) of choice will be sent in the
> > > > > > > > >   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > > > > > > > -more side-effects. If driver is unsure about recovery or method is unknown
> > > > > > > > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > > > > > > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > > > > > > > -will be sent instead.
> > > > > > > > > +more side-effects. If recovery method is specific to vendor
> > > > > > > > > +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > > > > > > > +specific documentation for further recovery steps. If driver is unsure about
> > > > > > > > > +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead
> > > > > > > > >   Userspace consumers can parse this event and attempt recovery as per the
> > > > > > > > >   following expectations.
> > > > > > > > > @@ -435,6 +435,7 @@ following expectations.
> > > > > > > > >       none            optional telemetry collection
> > > > > > > > >       rebind          unbind + bind driver
> > > > > > > > >       bus-reset       unbind + bus reset/re-enumeration + bind
> > > > > > > > > +    vendor-specific vendor specific recovery method
> > > > > > > > >       unknown         consumer policy
> > > > > > > > >       =============== ========================================
> > > > > > > > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > > > > > > > index cdd591b11488..0ac723a46a91 100644
> > > > > > > > > --- a/drivers/gpu/drm/drm_drv.c
> > > > > > > > > +++ b/drivers/gpu/drm/drm_drv.c
> > > > > > > > > @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > > > > > > >   		return "rebind";
> > > > > > > > >   	case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > > > > > > >   		return "bus-reset";
> > > > > > > > > +	case DRM_WEDGE_RECOVERY_VENDOR:
> > > > > > > > > +		return "vendor-specific";
> > > > > > > > >   	default:
> > > > > > > > >   		return NULL;
> > > > > > > > >   	}
> > > > > > > > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > > > > > > > index 08b3b2467c4c..08a087f149ff 100644
> > > > > > > > > --- a/include/drm/drm_device.h
> > > > > > > > > +++ b/include/drm/drm_device.h
> > > > > > > > > @@ -26,10 +26,14 @@ struct pci_controller;
> > > > > > > > >    * Recovery methods for wedged device in order of less to more side-effects.
> > > > > > > > >    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > > > > > > >    * use any one, multiple (or'd) or none depending on their needs.
> > > > > > > > > + *
> > > > > > > > > + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > > > > > > > + * details.
> > > > > > > > >    */
> > > > > > > > >   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
> > > > > > > > >   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
> > > > > > > > >   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> > > > > > > > > +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
> > > > > > > > >   /**
> > > > > > > > >    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > > > > > > > -- 
> > > > > > > > > 2.47.1
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > 
> > > 
> > 
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-07-14 12:33 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20250709112024.1053710-1-riana.tauro@intel.com>
2025-07-09 11:20 ` [PATCH v4 1/9] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
2025-07-09 13:41   ` Simona Vetter
2025-07-09 14:09     ` Christian König
2025-07-09 14:18       ` Raag Jadav
2025-07-09 16:52         ` Rodrigo Vivi
2025-07-10  9:01           ` Simona Vetter
2025-07-10  9:37             ` Christian König
2025-07-10 10:24               ` Raag Jadav
2025-07-10 19:00                 ` Rodrigo Vivi
2025-07-10 21:46                   ` Raag Jadav
2025-07-11  5:17                     ` Riana Tauro
2025-07-11  6:08                       ` Raag Jadav
2025-07-11  8:56                   ` Simona Vetter
2025-07-11  8:59               ` Simona Vetter
2025-07-14  5:27                 ` Riana Tauro
2025-07-14 12:33                   ` Simona Vetter
2025-07-09 14:46     ` Riana Tauro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).