Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent

dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
       [not found]   ` <a2bfb8be-35bc-4db9-9352-02eab1ae0881@amd.com>
@ 2025-06-24 14:03     ` Riana Tauro
  2025-06-24 14:23       ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Riana Tauro @ 2025-06-24 14:03 UTC (permalink / raw)
  To: Christian König, intel-xe, dri-devel
  Cc: anshuman.gupta, rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, umesh.nerlige.ramappa, frank.scarbrough,
	André Almeida

Hi Christian

On 6/24/2025 5:56 PM, Christian König wrote:
> On 23.06.25 12:01, Riana Tauro wrote:
>> A device is declared wedged when it is non-recoverable from
>> the driver context.
> 
> Well, not quite.

i took this from the below document. Should it be changed?

https://www.kernel.org/doc/html/v6.16-rc3/gpu/drm-uapi.html#device-wedging

> 
>> Some firmware errors can also cause
>> the device to enter this state and the only method to recover
>> from this would be to do a firmware flash
> 
> What? What exactly do you mean with firmware flash here?
> 
> Usually that means updating the firmware, but I don't see how this will bring you out of a wedge state?

It means updating the firmware.

Series:  https://patchwork.freedesktop.org/series/149756/

In this xe kmd series, there are few firmware errors that cause the card 
to be non-functional. The device is declared wedged and a firmware-flash 
action is sent.

There is corresponding fwupd PR in work that uses this uevent to trigger 
a firmware flash

fwupd PR: https://github.com/fwupd/fwupd/pull/8944/

Thanks
Riana

> 
> Where is the rest of the series?
> 
> Regards,
> Christian.
> 
>> v2: modify documentation (Raag, Rodrigo)
>>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   Documentation/gpu/drm-uapi.rst | 6 +++---
>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>   include/drm/drm_device.h       | 1 +
>>   3 files changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 263e5a97c080..cd2481458755 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>   more side-effects. If driver is unsure about recovery or method is unknown
>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>> -will be sent instead.
>> +(like soft/hard system reboot, physical device replacement or any other procedure
>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
>>   
>>   Userspace consumers can parse this event and attempt recovery as per the
>>   following expectations.
>> @@ -435,6 +434,7 @@ following expectations.
>>       none            optional telemetry collection
>>       rebind          unbind + bind driver
>>       bus-reset       unbind + bus reset/re-enumeration + bind
>> +    firmware-flash  firmware flash
>>       unknown         consumer policy
>>       =============== ========================================
>>   
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 02556363e918..5f3bbe01c207 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>   		return "rebind";
>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>   		return "bus-reset";
>> +	case DRM_WEDGE_RECOVERY_FW_FLASH:
>> +		return "firmware-flash";
>>   	default:
>>   		return NULL;
>>   	}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index 08b3b2467c4c..9d57c8882d93 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -30,6 +30,7 @@ struct pci_controller;
>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>> +#define DRM_WEDGE_RECOVERY_FW_FLASH	BIT(3)  /* firmware flash */
>>   
>>   /**
>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-06-24 14:03     ` [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent Riana Tauro
@ 2025-06-24 14:23       ` Christian König
  2025-06-24 21:36         ` Rodrigo Vivi
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2025-06-24 14:23 UTC (permalink / raw)
  To: Riana Tauro, intel-xe, dri-devel
  Cc: anshuman.gupta, rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, umesh.nerlige.ramappa, frank.scarbrough,
	André Almeida, David Airlie

On 24.06.25 16:03, Riana Tauro wrote:
> Hi Christian
> 
> On 6/24/2025 5:56 PM, Christian König wrote:
>> On 23.06.25 12:01, Riana Tauro wrote:
>>> A device is declared wedged when it is non-recoverable from
>>> the driver context.
>>
>> Well, not quite.
> 
> i took this from the below document. Should it be changed?

The wedge event basically meant that something unexpected happened during the lifetime of the the device (crash, hang whatever).

It can be that the device recovered on it's own and nothing needs to be done (the none case in the documentation) and the event is just send for telemetry collection.

But the usual case is to trigger a bus reset, rebing or even reboot.

> https://www.kernel.org/doc/html/v6.16-rc3/gpu/drm-uapi.html#device-wedging
> 
>>
>>> Some firmware errors can also cause
>>> the device to enter this state and the only method to recover
>>> from this would be to do a firmware flash
>>
>> What? What exactly do you mean with firmware flash here?
>>
>> Usually that means updating the firmware, but I don't see how this will bring you out of a wedge state?
> 
> It means updating the firmware.
> 
> Series:  https://patchwork.freedesktop.org/series/149756/
> 
> In this xe kmd series, there are few firmware errors that cause the card to be non-functional. The device is declared wedged and a firmware-flash action is sent.

Ok, so let me recap that just to make sure that I did understood that correctly.

You find that the firmware flashed into the device is buggy and then raise a wedge event to automatically trigger a firmware update?

Why not fail to load the driver in the first place? Or at least print a big warning into the system log?

I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.

I think it's rather brave to do this automatically. Are you sure we don't talk past each other on the meaning of the wedge event?

Thanks,
Christian.

> 
> There is corresponding fwupd PR in work that uses this uevent to trigger a firmware flash
> 
> fwupd PR: https://github.com/fwupd/fwupd/pull/8944/
> 
> Thanks
> Riana
> 
>>
>> Where is the rest of the series?
>>
>> Regards,
>> Christian.
>>
>>> v2: modify documentation (Raag, Rodrigo)
>>>
>>> Cc: André Almeida <andrealmeid@igalia.com>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>> ---
>>>   Documentation/gpu/drm-uapi.rst | 6 +++---
>>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>>   include/drm/drm_device.h       | 1 +
>>>   3 files changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>> index 263e5a97c080..cd2481458755 100644
>>> --- a/Documentation/gpu/drm-uapi.rst
>>> +++ b/Documentation/gpu/drm-uapi.rst
>>> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
>>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>   more side-effects. If driver is unsure about recovery or method is unknown
>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>> -will be sent instead.
>>> +(like soft/hard system reboot, physical device replacement or any other procedure
>>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
>>>     Userspace consumers can parse this event and attempt recovery as per the
>>>   following expectations.
>>> @@ -435,6 +434,7 @@ following expectations.
>>>       none            optional telemetry collection
>>>       rebind          unbind + bind driver
>>>       bus-reset       unbind + bus reset/re-enumeration + bind
>>> +    firmware-flash  firmware flash
>>>       unknown         consumer policy
>>>       =============== ========================================
>>>   diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>> index 02556363e918..5f3bbe01c207 100644
>>> --- a/drivers/gpu/drm/drm_drv.c
>>> +++ b/drivers/gpu/drm/drm_drv.c
>>> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>>           return "rebind";
>>>       case DRM_WEDGE_RECOVERY_BUS_RESET:
>>>           return "bus-reset";
>>> +    case DRM_WEDGE_RECOVERY_FW_FLASH:
>>> +        return "firmware-flash";
>>>       default:
>>>           return NULL;
>>>       }
>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>> index 08b3b2467c4c..9d57c8882d93 100644
>>> --- a/include/drm/drm_device.h
>>> +++ b/include/drm/drm_device.h
>>> @@ -30,6 +30,7 @@ struct pci_controller;
>>>   #define DRM_WEDGE_RECOVERY_NONE        BIT(0)    /* optional telemetry collection */
>>>   #define DRM_WEDGE_RECOVERY_REBIND    BIT(1)    /* unbind + bind driver */
>>>   #define DRM_WEDGE_RECOVERY_BUS_RESET    BIT(2)    /* unbind + reset bus device + bind */
>>> +#define DRM_WEDGE_RECOVERY_FW_FLASH    BIT(3)  /* firmware flash */
>>>     /**
>>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>>
> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-06-24 14:23       ` Christian König
@ 2025-06-24 21:36         ` Rodrigo Vivi
  2025-06-27 21:38           ` Rodrigo Vivi
  0 siblings, 1 reply; 13+ messages in thread
From: Rodrigo Vivi @ 2025-06-24 21:36 UTC (permalink / raw)
  To: Christian König
  Cc: Riana Tauro, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, André Almeida, David Airlie

On Tue, Jun 24, 2025 at 04:23:31PM +0200, Christian König wrote:
> On 24.06.25 16:03, Riana Tauro wrote:
> > Hi Christian
> > 
> > On 6/24/2025 5:56 PM, Christian König wrote:
> >> On 23.06.25 12:01, Riana Tauro wrote:
> >>> A device is declared wedged when it is non-recoverable from
> >>> the driver context.
> >>
> >> Well, not quite.
> > 
> > i took this from the below document. Should it be changed?
> 
> The wedge event basically meant that something unexpected happened during the lifetime of the the device (crash, hang whatever).
> 
> It can be that the device recovered on it's own and nothing needs to be done (the none case in the documentation) and the event is just send for telemetry collection.
> 
> But the usual case is to trigger a bus reset, rebing or even reboot.
> 
> > https://www.kernel.org/doc/html/v6.16-rc3/gpu/drm-uapi.html#device-wedging
> > 
> >>
> >>> Some firmware errors can also cause
> >>> the device to enter this state and the only method to recover
> >>> from this would be to do a firmware flash
> >>
> >> What? What exactly do you mean with firmware flash here?
> >>
> >> Usually that means updating the firmware, but I don't see how this will bring you out of a wedge state?
> > 
> > It means updating the firmware.
> > 
> > Series:  https://patchwork.freedesktop.org/series/149756/
> > 
> > In this xe kmd series, there are few firmware errors that cause the card to be non-functional. The device is declared wedged and a firmware-flash action is sent.
> 
> Ok, so let me recap that just to make sure that I did understood that correctly.
> 
> You find that the firmware flashed into the device is buggy and then raise a wedge event to automatically trigger a firmware update?
> 
> Why not fail to load the driver in the first place?

We already have that in place. If during the probe the fw machinery underneath
identified something is so bad that it needs to be flashed we boot in the
'survivability mode'. The device is not discoverable for any gpu command
submission or memory management, but only fw flashing is possible on that
mode.

This is on top of that. If the fw machinery had a bad unrecoverable error
and decided that fw updating is needed.

> Or at least print a big warning into the system log?
> 
> I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.
> 
> I think it's rather brave to do this automatically. Are you sure we don't talk past each other on the meaning of the wedge event?

The goal is not to do that automatically, but raise the uevent to the admin
with enough information that they can decide for the right correctable
action.

Thanks,
Rodrigo.

> 
> Thanks,
> Christian.
> 
> > 
> > There is corresponding fwupd PR in work that uses this uevent to trigger a firmware flash
> > 
> > fwupd PR: https://github.com/fwupd/fwupd/pull/8944/
> > 
> > Thanks
> > Riana
> > 
> >>
> >> Where is the rest of the series?
> >>
> >> Regards,
> >> Christian.
> >>
> >>> v2: modify documentation (Raag, Rodrigo)
> >>>
> >>> Cc: André Almeida <andrealmeid@igalia.com>
> >>> Cc: Christian König <christian.koenig@amd.com>
> >>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> >>> ---
> >>>   Documentation/gpu/drm-uapi.rst | 6 +++---
> >>>   drivers/gpu/drm/drm_drv.c      | 2 ++
> >>>   include/drm/drm_device.h       | 1 +
> >>>   3 files changed, 6 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >>> index 263e5a97c080..cd2481458755 100644
> >>> --- a/Documentation/gpu/drm-uapi.rst
> >>> +++ b/Documentation/gpu/drm-uapi.rst
> >>> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
> >>>   can use any one, multiple or none. Method(s) of choice will be sent in the
> >>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> >>>   more side-effects. If driver is unsure about recovery or method is unknown
> >>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> >>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> >>> -will be sent instead.
> >>> +(like soft/hard system reboot, physical device replacement or any other procedure
> >>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
> >>>     Userspace consumers can parse this event and attempt recovery as per the
> >>>   following expectations.
> >>> @@ -435,6 +434,7 @@ following expectations.
> >>>       none            optional telemetry collection
> >>>       rebind          unbind + bind driver
> >>>       bus-reset       unbind + bus reset/re-enumeration + bind
> >>> +    firmware-flash  firmware flash
> >>>       unknown         consumer policy
> >>>       =============== ========================================
> >>>   diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> >>> index 02556363e918..5f3bbe01c207 100644
> >>> --- a/drivers/gpu/drm/drm_drv.c
> >>> +++ b/drivers/gpu/drm/drm_drv.c
> >>> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> >>>           return "rebind";
> >>>       case DRM_WEDGE_RECOVERY_BUS_RESET:
> >>>           return "bus-reset";
> >>> +    case DRM_WEDGE_RECOVERY_FW_FLASH:
> >>> +        return "firmware-flash";
> >>>       default:
> >>>           return NULL;
> >>>       }
> >>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> >>> index 08b3b2467c4c..9d57c8882d93 100644
> >>> --- a/include/drm/drm_device.h
> >>> +++ b/include/drm/drm_device.h
> >>> @@ -30,6 +30,7 @@ struct pci_controller;
> >>>   #define DRM_WEDGE_RECOVERY_NONE        BIT(0)    /* optional telemetry collection */
> >>>   #define DRM_WEDGE_RECOVERY_REBIND    BIT(1)    /* unbind + bind driver */
> >>>   #define DRM_WEDGE_RECOVERY_BUS_RESET    BIT(2)    /* unbind + reset bus device + bind */
> >>> +#define DRM_WEDGE_RECOVERY_FW_FLASH    BIT(3)  /* firmware flash */
> >>>     /**
> >>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> >>
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-06-24 21:36         ` Rodrigo Vivi
@ 2025-06-27 21:38           ` Rodrigo Vivi
  2025-06-30  8:29             ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Rodrigo Vivi @ 2025-06-27 21:38 UTC (permalink / raw)
  To: Christian König
  Cc: Riana Tauro, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, André Almeida, David Airlie

On Tue, Jun 24, 2025 at 05:36:29PM -0400, Rodrigo Vivi wrote:
> On Tue, Jun 24, 2025 at 04:23:31PM +0200, Christian König wrote:
> > On 24.06.25 16:03, Riana Tauro wrote:
> > > Hi Christian
> > > 
> > > On 6/24/2025 5:56 PM, Christian König wrote:
> > >> On 23.06.25 12:01, Riana Tauro wrote:
> > >>> A device is declared wedged when it is non-recoverable from
> > >>> the driver context.
> > >>
> > >> Well, not quite.
> > > 
> > > i took this from the below document. Should it be changed?
> > 
> > The wedge event basically meant that something unexpected happened during the lifetime of the the device (crash, hang whatever).
> > 
> > It can be that the device recovered on it's own and nothing needs to be done (the none case in the documentation) and the event is just send for telemetry collection.
> > 
> > But the usual case is to trigger a bus reset, rebing or even reboot.
> > 
> > > https://www.kernel.org/doc/html/v6.16-rc3/gpu/drm-uapi.html#device-wedging
> > > 
> > >>
> > >>> Some firmware errors can also cause
> > >>> the device to enter this state and the only method to recover
> > >>> from this would be to do a firmware flash
> > >>
> > >> What? What exactly do you mean with firmware flash here?
> > >>
> > >> Usually that means updating the firmware, but I don't see how this will bring you out of a wedge state?
> > > 
> > > It means updating the firmware.
> > > 
> > > Series:  https://patchwork.freedesktop.org/series/149756/
> > > 
> > > In this xe kmd series, there are few firmware errors that cause the card to be non-functional. The device is declared wedged and a firmware-flash action is sent.
> > 
> > Ok, so let me recap that just to make sure that I did understood that correctly.
> > 
> > You find that the firmware flashed into the device is buggy and then raise a wedge event to automatically trigger a firmware update?
> > 
> > Why not fail to load the driver in the first place?
> 
> We already have that in place. If during the probe the fw machinery underneath
> identified something is so bad that it needs to be flashed we boot in the
> 'survivability mode'. The device is not discoverable for any gpu command
> submission or memory management, but only fw flashing is possible on that
> mode.
> 
> This is on top of that. If the fw machinery had a bad unrecoverable error
> and decided that fw updating is needed.
> 
> > Or at least print a big warning into the system log?
> > 
> > I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.
> > 
> > I think it's rather brave to do this automatically. Are you sure we don't talk past each other on the meaning of the wedge event?
> 
> The goal is not to do that automatically, but raise the uevent to the admin
> with enough information that they can decide for the right correctable
> action.

Christian, Andre, any concerns with this still?

> 
> Thanks,
> Rodrigo.
> 
> > 
> > Thanks,
> > Christian.
> > 
> > > 
> > > There is corresponding fwupd PR in work that uses this uevent to trigger a firmware flash
> > > 
> > > fwupd PR: https://github.com/fwupd/fwupd/pull/8944/
> > > 
> > > Thanks
> > > Riana
> > > 
> > >>
> > >> Where is the rest of the series?
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >>> v2: modify documentation (Raag, Rodrigo)
> > >>>
> > >>> Cc: André Almeida <andrealmeid@igalia.com>
> > >>> Cc: Christian König <christian.koenig@amd.com>
> > >>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > >>> ---
> > >>>   Documentation/gpu/drm-uapi.rst | 6 +++---
> > >>>   drivers/gpu/drm/drm_drv.c      | 2 ++
> > >>>   include/drm/drm_device.h       | 1 +
> > >>>   3 files changed, 6 insertions(+), 3 deletions(-)
> > >>>
> > >>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > >>> index 263e5a97c080..cd2481458755 100644
> > >>> --- a/Documentation/gpu/drm-uapi.rst
> > >>> +++ b/Documentation/gpu/drm-uapi.rst
> > >>> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
> > >>>   can use any one, multiple or none. Method(s) of choice will be sent in the
> > >>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > >>>   more side-effects. If driver is unsure about recovery or method is unknown
> > >>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> > >>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > >>> -will be sent instead.
> > >>> +(like soft/hard system reboot, physical device replacement or any other procedure
> > >>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
> > >>>     Userspace consumers can parse this event and attempt recovery as per the
> > >>>   following expectations.
> > >>> @@ -435,6 +434,7 @@ following expectations.
> > >>>       none            optional telemetry collection
> > >>>       rebind          unbind + bind driver
> > >>>       bus-reset       unbind + bus reset/re-enumeration + bind
> > >>> +    firmware-flash  firmware flash
> > >>>       unknown         consumer policy
> > >>>       =============== ========================================
> > >>>   diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > >>> index 02556363e918..5f3bbe01c207 100644
> > >>> --- a/drivers/gpu/drm/drm_drv.c
> > >>> +++ b/drivers/gpu/drm/drm_drv.c
> > >>> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >>>           return "rebind";
> > >>>       case DRM_WEDGE_RECOVERY_BUS_RESET:
> > >>>           return "bus-reset";
> > >>> +    case DRM_WEDGE_RECOVERY_FW_FLASH:
> > >>> +        return "firmware-flash";
> > >>>       default:
> > >>>           return NULL;
> > >>>       }
> > >>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > >>> index 08b3b2467c4c..9d57c8882d93 100644
> > >>> --- a/include/drm/drm_device.h
> > >>> +++ b/include/drm/drm_device.h
> > >>> @@ -30,6 +30,7 @@ struct pci_controller;
> > >>>   #define DRM_WEDGE_RECOVERY_NONE        BIT(0)    /* optional telemetry collection */
> > >>>   #define DRM_WEDGE_RECOVERY_REBIND    BIT(1)    /* unbind + bind driver */
> > >>>   #define DRM_WEDGE_RECOVERY_BUS_RESET    BIT(2)    /* unbind + reset bus device + bind */
> > >>> +#define DRM_WEDGE_RECOVERY_FW_FLASH    BIT(3)  /* firmware flash */
> > >>>     /**
> > >>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > >>
> > > 
> > > 
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-06-27 21:38           ` Rodrigo Vivi
@ 2025-06-30  8:29             ` Christian König
  2025-06-30 17:33               ` Rodrigo Vivi
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2025-06-30  8:29 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: Riana Tauro, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, André Almeida, David Airlie

On 27.06.25 23:38, Rodrigo Vivi wrote:
>>> Or at least print a big warning into the system log?
>>>
>>> I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.
>>>
>>> I think it's rather brave to do this automatically. Are you sure we don't talk past each other on the meaning of the wedge event?
>>
>> The goal is not to do that automatically, but raise the uevent to the admin
>> with enough information that they can decide for the right correctable
>> action.
> 
> Christian, Andre, any concerns with this still?

Well, that sounds not quite the correct use case for wedge events.

See the wedge event is made for automation. For example to allow a process supervising containers get the device working again and re-start the container which used it or gather crash log etc .....

When you want to notify the system administrator which manual intervention is necessary then I would just write that into the system log and raise a device event with WEDGED=unknown.

What we could potentially do is to separate between WEDGED=unknown and WEDGED=manual, e.g. between driver has no idea what to do and driver printed useful info into the system log.

But creating an event with WEDGED=firmware-flash just sounds to specific, when we go down that route we might soon have WEDGE=change-bios-setting, WEDGE=....

Regards,
Christian.

> 
>>
>> Thanks,
>> Rodrigo.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-06-30  8:29             ` Christian König
@ 2025-06-30 17:33               ` Rodrigo Vivi
  2025-07-01 11:37                 ` Riana Tauro
  0 siblings, 1 reply; 13+ messages in thread
From: Rodrigo Vivi @ 2025-06-30 17:33 UTC (permalink / raw)
  To: Christian König
  Cc: Riana Tauro, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, André Almeida, David Airlie

On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
> On 27.06.25 23:38, Rodrigo Vivi wrote:
> >>> Or at least print a big warning into the system log?
> >>>
> >>> I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.
> >>>
> >>> I think it's rather brave to do this automatically. Are you sure we don't talk past each other on the meaning of the wedge event?
> >>
> >> The goal is not to do that automatically, but raise the uevent to the admin
> >> with enough information that they can decide for the right correctable
> >> action.
> > 
> > Christian, Andre, any concerns with this still?
> 
> Well, that sounds not quite the correct use case for wedge events.
> 
> See the wedge event is made for automation.

I respectfully disagree with this statement.

The wedged state in i915 and xe, then ported to drm, was never just about
automation. Of course, the unbind + flr + rebind is one that driver cannot
do by itself, hence needs automation. But wedge cases were also very useful
in other situations like keeping the device in the failure stage for debuging
(without automation) or keeping other critical things up like display with SW
rendering (again, nothing about automation).

> For example to allow a process supervising containers get the device working again and re-start the container which used it or gather crash log etc .....
> 
> When you want to notify the system administrator which manual intervention is necessary then I would just write that into the system log and raise a device event with WEDGED=unknown.
> 
> What we could potentially do is to separate between WEDGED=unknown and WEDGED=manual, e.g. between driver has no idea what to do and driver printed useful info into the system log.

Well, you are right here. Even our official documentation in drm-uapi.rst
already tells that firmware flashing should be a case for 'unknown'.
Let's go with that then. And use other hints like logs and sysfs so, Admin
has a better information of what to do.

> 
> But creating an event with WEDGED=firmware-flash just sounds to specific, when we go down that route we might soon have WEDGE=change-bios-setting, WEDGE=....

Well, I agree that we shouldn't explode the options exponentially here.

Although I believe that firmware flashing should be a common case in many
case and could be a candidate for another indication.

But let's move on with WEDGE='unknown' for this case.

Thanks,
Rodrigo.

> 
> Regards,
> Christian.
> 
> > 
> >>
> >> Thanks,
> >> Rodrigo.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-06-30 17:33               ` Rodrigo Vivi
@ 2025-07-01 11:37                 ` Riana Tauro
  2025-07-01 11:41                   ` Riana Tauro
  0 siblings, 1 reply; 13+ messages in thread
From: Riana Tauro @ 2025-07-01 11:37 UTC (permalink / raw)
  To: Rodrigo Vivi, Christian König
  Cc: intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, André Almeida, David Airlie

Hi Rodrigo/Christian

On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>> Or at least print a big warning into the system log?
>>>>>
>>>>> I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.
>>>>>
>>>>> I think it's rather brave to do this automatically. Are you sure we don't talk past each other on the meaning of the wedge event?
>>>>
>>>> The goal is not to do that automatically, but raise the uevent to the admin
>>>> with enough information that they can decide for the right correctable
>>>> action.
>>>
>>> Christian, Andre, any concerns with this still?
>>
>> Well, that sounds not quite the correct use case for wedge events.
>>
>> See the wedge event is made for automation.
> 
> I respectfully disagree with this statement.
> 
> The wedged state in i915 and xe, then ported to drm, was never just about
> automation. Of course, the unbind + flr + rebind is one that driver cannot
> do by itself, hence needs automation. But wedge cases were also very useful
> in other situations like keeping the device in the failure stage for debuging
> (without automation) or keeping other critical things up like display with SW
> rendering (again, nothing about automation).
> 
>> For example to allow a process supervising containers get the device working again and re-start the container which used it or gather crash log etc .....
>>
>> When you want to notify the system administrator which manual intervention is necessary then I would just write that into the system log and raise a device event with WEDGED=unknown.
>>
>> What we could potentially do is to separate between WEDGED=unknown and WEDGED=manual, e.g. between driver has no idea what to do and driver printed useful info into the system log.
> 
> Well, you are right here. Even our official documentation in drm-uapi.rst
> already tells that firmware flashing should be a case for 'unknown'.

I had added specific method since we know firmware flash will recover 
the error.  Sure will change it.

In the current code, there is no recovery method named "unknown" even 
though the document mentions it

https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/drm_drv.c#L534

Since we are adding something new, can it be "manual" instead of unknown?


Thanks
Riana

> Let's go with that then. And use other hints like logs and sysfs so, Admin
> has a better information of what to do.
> 
>>
>> But creating an event with WEDGED=firmware-flash just sounds to specific, when we go down that route we might soon have WEDGE=change-bios-setting, WEDGE=....
> 
> Well, I agree that we shouldn't explode the options exponentially here.
> 
> Although I believe that firmware flashing should be a common case in many
> case and could be a candidate for another indication.
> 
> But let's move on with WEDGE='unknown' for this case.
> 
> Thanks,
> Rodrigo.
> 
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Thanks,
>>>> Rodrigo.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-07-01 11:37                 ` Riana Tauro
@ 2025-07-01 11:41                   ` Riana Tauro
  2025-07-01 14:23                     ` Raag Jadav
  0 siblings, 1 reply; 13+ messages in thread
From: Riana Tauro @ 2025-07-01 11:41 UTC (permalink / raw)
  To: Rodrigo Vivi, Christian König
  Cc: intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, André Almeida, David Airlie



On 7/1/2025 5:07 PM, Riana Tauro wrote:
> Hi Rodrigo/Christian
> 
> On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
>> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>>> Or at least print a big warning into the system log?
>>>>>>
>>>>>> I mean a firmware update is usually something which the system 
>>>>>> administrator triggers very explicitly because when it fails for 
>>>>>> some reason (e.g. unexpected reset, power outage or whatever) it 
>>>>>> can sometimes brick the HW.
>>>>>>
>>>>>> I think it's rather brave to do this automatically. Are you sure 
>>>>>> we don't talk past each other on the meaning of the wedge event?
>>>>>
>>>>> The goal is not to do that automatically, but raise the uevent to 
>>>>> the admin
>>>>> with enough information that they can decide for the right correctable
>>>>> action.
>>>>
>>>> Christian, Andre, any concerns with this still?
>>>
>>> Well, that sounds not quite the correct use case for wedge events.
>>>
>>> See the wedge event is made for automation.
>>
>> I respectfully disagree with this statement.
>>
>> The wedged state in i915 and xe, then ported to drm, was never just about
>> automation. Of course, the unbind + flr + rebind is one that driver 
>> cannot
>> do by itself, hence needs automation. But wedge cases were also very 
>> useful
>> in other situations like keeping the device in the failure stage for 
>> debuging
>> (without automation) or keeping other critical things up like display 
>> with SW
>> rendering (again, nothing about automation).
>>
>>> For example to allow a process supervising containers get the device 
>>> working again and re-start the container which used it or gather 
>>> crash log etc .....
>>>
>>> When you want to notify the system administrator which manual 
>>> intervention is necessary then I would just write that into the 
>>> system log and raise a device event with WEDGED=unknown.
>>>
>>> What we could potentially do is to separate between WEDGED=unknown 
>>> and WEDGED=manual, e.g. between driver has no idea what to do and 
>>> driver printed useful info into the system log.
>>
>> Well, you are right here. Even our official documentation in drm-uapi.rst
>> already tells that firmware flashing should be a case for 'unknown'.
> 
> I had added specific method since we know firmware flash will recover 
> the error.  Sure will change it.
> 
> In the current code, there is no recovery method named "unknown" even 
> though the document mentions it
> 
> https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/ 
> drm_drv.c#L534
> 
> Since we are adding something new, can it be "manual" instead of unknown?

Okay missed it. It's in the drm_dev_wedged_event function. Will use unknown

> 
> 
> Thanks
> Riana
> 
>> Let's go with that then. And use other hints like logs and sysfs so, 
>> Admin
>> has a better information of what to do.
>>
>>>
>>> But creating an event with WEDGED=firmware-flash just sounds to 
>>> specific, when we go down that route we might soon have WEDGE=change- 
>>> bios-setting, WEDGE=....
>>
>> Well, I agree that we shouldn't explode the options exponentially here.
>>
>> Although I believe that firmware flashing should be a common case in many
>> case and could be a candidate for another indication.
>>
>> But let's move on with WEDGE='unknown' for this case.
>>
>> Thanks,
>> Rodrigo.
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Rodrigo.
> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-07-01 11:41                   ` Riana Tauro
@ 2025-07-01 14:23                     ` Raag Jadav
  2025-07-01 14:35                       ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Raag Jadav @ 2025-07-01 14:23 UTC (permalink / raw)
  To: Riana Tauro
  Cc: Rodrigo Vivi, Christian König, intel-xe, dri-devel,
	anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, André Almeida,
	David Airlie

On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
> On 7/1/2025 5:07 PM, Riana Tauro wrote:
> > On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
> > > On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
> > > > On 27.06.25 23:38, Rodrigo Vivi wrote:
> > > > > > > Or at least print a big warning into the system log?
> > > > > > > 
> > > > > > > I mean a firmware update is usually something which
> > > > > > > the system administrator triggers very explicitly
> > > > > > > because when it fails for some reason (e.g.
> > > > > > > unexpected reset, power outage or whatever) it can
> > > > > > > sometimes brick the HW.
> > > > > > > 
> > > > > > > I think it's rather brave to do this automatically.
> > > > > > > Are you sure we don't talk past each other on the
> > > > > > > meaning of the wedge event?
> > > > > > 
> > > > > > The goal is not to do that automatically, but raise the
> > > > > > uevent to the admin
> > > > > > with enough information that they can decide for the right correctable
> > > > > > action.
> > > > > 
> > > > > Christian, Andre, any concerns with this still?
> > > > 
> > > > Well, that sounds not quite the correct use case for wedge events.
> > > > 
> > > > See the wedge event is made for automation.
> > > 
> > > I respectfully disagree with this statement.
> > > 
> > > The wedged state in i915 and xe, then ported to drm, was never just about
> > > automation. Of course, the unbind + flr + rebind is one that driver
> > > cannot
> > > do by itself, hence needs automation. But wedge cases were also very
> > > useful
> > > in other situations like keeping the device in the failure stage for
> > > debuging
> > > (without automation) or keeping other critical things up like
> > > display with SW
> > > rendering (again, nothing about automation).
> > > 
> > > > For example to allow a process supervising containers get the
> > > > device working again and re-start the container which used it or
> > > > gather crash log etc .....
> > > > 
> > > > When you want to notify the system administrator which manual
> > > > intervention is necessary then I would just write that into the
> > > > system log and raise a device event with WEDGED=unknown.
> > > > 
> > > > What we could potentially do is to separate between
> > > > WEDGED=unknown and WEDGED=manual, e.g. between driver has no
> > > > idea what to do and driver printed useful info into the system
> > > > log.
> > > 
> > > Well, you are right here. Even our official documentation in drm-uapi.rst
> > > already tells that firmware flashing should be a case for 'unknown'.
> > 
> > I had added specific method since we know firmware flash will recover
> > the error.  Sure will change it.
> > 
> > In the current code, there is no recovery method named "unknown" even
> > though the document mentions it
> > 
> > https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/
> > drm_drv.c#L534
> > 
> > Since we are adding something new, can it be "manual" instead of unknown?
> 
> Okay missed it. It's in the drm_dev_wedged_event function. Will use unknown
> > 
> > > Let's go with that then. And use other hints like logs and sysfs so,
> > > Admin
> > > has a better information of what to do.
> > > 
> > > > But creating an event with WEDGED=firmware-flash just sounds to
> > > > specific, when we go down that route we might soon have
> > > > WEDGE=change- bios-setting, WEDGE=....
> > > 
> > > Well, I agree that we shouldn't explode the options exponentially here.
> > > 
> > > Although I believe that firmware flashing should be a common case in many
> > > case and could be a candidate for another indication.
> > > 
> > > But let's move on with WEDGE='unknown' for this case.

I understand that WEDGED=firmware-flash can't be handled in a generic way
for all drivers but it is simply not as same as WEDGED=unknown since the
driver knows something specific needs to be done here.

I'm wondering if we could add a WEDGED=vendor-specific method for such
cases?

Chris? André?

Raag

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-07-01 14:23                     ` Raag Jadav
@ 2025-07-01 14:35                       ` Christian König
  2025-07-01 16:02                         ` Raag Jadav
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2025-07-01 14:35 UTC (permalink / raw)
  To: Raag Jadav, Riana Tauro
  Cc: Rodrigo Vivi, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	André Almeida, David Airlie

On 01.07.25 16:23, Raag Jadav wrote:
> On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
>> On 7/1/2025 5:07 PM, Riana Tauro wrote:
>>> On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
>>>> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>>>>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>>>>> Or at least print a big warning into the system log?
>>>>>>>>
>>>>>>>> I mean a firmware update is usually something which
>>>>>>>> the system administrator triggers very explicitly
>>>>>>>> because when it fails for some reason (e.g.
>>>>>>>> unexpected reset, power outage or whatever) it can
>>>>>>>> sometimes brick the HW.
>>>>>>>>
>>>>>>>> I think it's rather brave to do this automatically.
>>>>>>>> Are you sure we don't talk past each other on the
>>>>>>>> meaning of the wedge event?
>>>>>>>
>>>>>>> The goal is not to do that automatically, but raise the
>>>>>>> uevent to the admin
>>>>>>> with enough information that they can decide for the right correctable
>>>>>>> action.
>>>>>>
>>>>>> Christian, Andre, any concerns with this still?
>>>>>
>>>>> Well, that sounds not quite the correct use case for wedge events.
>>>>>
>>>>> See the wedge event is made for automation.
>>>>
>>>> I respectfully disagree with this statement.
>>>>
>>>> The wedged state in i915 and xe, then ported to drm, was never just about
>>>> automation. Of course, the unbind + flr + rebind is one that driver
>>>> cannot
>>>> do by itself, hence needs automation. But wedge cases were also very
>>>> useful
>>>> in other situations like keeping the device in the failure stage for
>>>> debuging
>>>> (without automation) or keeping other critical things up like
>>>> display with SW
>>>> rendering (again, nothing about automation).

Granted, automation is probably not the right term.

What I wanted to say is that the wedge event should not replace information in the system log.

>>>>
>>>>> For example to allow a process supervising containers get the
>>>>> device working again and re-start the container which used it or
>>>>> gather crash log etc .....
>>>>>
>>>>> When you want to notify the system administrator which manual
>>>>> intervention is necessary then I would just write that into the
>>>>> system log and raise a device event with WEDGED=unknown.
>>>>>
>>>>> What we could potentially do is to separate between
>>>>> WEDGED=unknown and WEDGED=manual, e.g. between driver has no
>>>>> idea what to do and driver printed useful info into the system
>>>>> log.
>>>>
>>>> Well, you are right here. Even our official documentation in drm-uapi.rst
>>>> already tells that firmware flashing should be a case for 'unknown'.
>>>
>>> I had added specific method since we know firmware flash will recover
>>> the error.  Sure will change it.
>>>
>>> In the current code, there is no recovery method named "unknown" even
>>> though the document mentions it
>>>
>>> https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/
>>> drm_drv.c#L534
>>>
>>> Since we are adding something new, can it be "manual" instead of unknown?
>>
>> Okay missed it. It's in the drm_dev_wedged_event function. Will use unknown
>>>
>>>> Let's go with that then. And use other hints like logs and sysfs so,
>>>> Admin
>>>> has a better information of what to do.
>>>>
>>>>> But creating an event with WEDGED=firmware-flash just sounds to
>>>>> specific, when we go down that route we might soon have
>>>>> WEDGE=change- bios-setting, WEDGE=....
>>>>
>>>> Well, I agree that we shouldn't explode the options exponentially here.
>>>>
>>>> Although I believe that firmware flashing should be a common case in many
>>>> case and could be a candidate for another indication.
>>>>
>>>> But let's move on with WEDGE='unknown' for this case.
> 
> I understand that WEDGED=firmware-flash can't be handled in a generic way
> for all drivers but it is simply not as same as WEDGED=unknown since the
> driver knows something specific needs to be done here.
> 
> I'm wondering if we could add a WEDGED=vendor-specific method for such
> cases?

Works for me as well.

My main concern was that we should not start to invent specific wedge events for all kind of different problems.

On the other hand what's the additional value to distinct between unknown and vendor-specific?

In other words even if the necessary handling is unknown to the wedge event, the administrator could and should still examine the logs to see what to do.

Regards,
Christian.

> 
> Chris? André?
> 
> Raag


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-07-01 14:35                       ` Christian König
@ 2025-07-01 16:02                         ` Raag Jadav
  2025-07-01 16:44                           ` Riana Tauro
  0 siblings, 1 reply; 13+ messages in thread
From: Raag Jadav @ 2025-07-01 16:02 UTC (permalink / raw)
  To: Christian König
  Cc: Riana Tauro, Rodrigo Vivi, intel-xe, dri-devel, anshuman.gupta,
	lucas.demarchi, aravind.iddamsetty, umesh.nerlige.ramappa,
	frank.scarbrough, André Almeida, David Airlie

On Tue, Jul 01, 2025 at 04:35:42PM +0200, Christian König wrote:
>On 01.07.25 16:23, Raag Jadav wrote:
>> On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
>>> On 7/1/2025 5:07 PM, Riana Tauro wrote:
>>>> On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
>>>>> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>>>>>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>>>>>> Or at least print a big warning into the system log?
>>>>>>>>>
>>>>>>>>> I mean a firmware update is usually something which
>>>>>>>>> the system administrator triggers very explicitly
>>>>>>>>> because when it fails for some reason (e.g.
>>>>>>>>> unexpected reset, power outage or whatever) it can
>>>>>>>>> sometimes brick the HW.
>>>>>>>>>
>>>>>>>>> I think it's rather brave to do this automatically.
>>>>>>>>> Are you sure we don't talk past each other on the
>>>>>>>>> meaning of the wedge event?
>>>>>>>>
>>>>>>>> The goal is not to do that automatically, but raise the
>>>>>>>> uevent to the admin
>>>>>>>> with enough information that they can decide for the right correctable
>>>>>>>> action.
>>>>>>>
>>>>>>> Christian, Andre, any concerns with this still?
>>>>>>
>>>>>> Well, that sounds not quite the correct use case for wedge events.
>>>>>>
>>>>>> See the wedge event is made for automation.
>>>>>
>>>>> I respectfully disagree with this statement.
>>>>>
>>>>> The wedged state in i915 and xe, then ported to drm, was never just about
>>>>> automation. Of course, the unbind + flr + rebind is one that driver
>>>>> cannot
>>>>> do by itself, hence needs automation. But wedge cases were also very
>>>>> useful
>>>>> in other situations like keeping the device in the failure stage for
>>>>> debuging
>>>>> (without automation) or keeping other critical things up like
>>>>> display with SW
>>>>> rendering (again, nothing about automation).
>
> Granted, automation is probably not the right term.
>
> What I wanted to say is that the wedge event should not replace information in the system log.
>
>>>>>
>>>>>> For example to allow a process supervising containers get the
>>>>>> device working again and re-start the container which used it or
>>>>>> gather crash log etc .....
>>>>>>
>>>>>> When you want to notify the system administrator which manual
>>>>>> intervention is necessary then I would just write that into the
>>>>>> system log and raise a device event with WEDGED=unknown.
>>>>>>
>>>>>> What we could potentially do is to separate between
>>>>>> WEDGED=unknown and WEDGED=manual, e.g. between driver has no
>>>>>> idea what to do and driver printed useful info into the system
>>>>>> log.
>>>>>
>>>>> Well, you are right here. Even our official documentation in drm-uapi.rst
>>>>> already tells that firmware flashing should be a case for 'unknown'.
>>>>
>>>> I had added specific method since we know firmware flash will recover
>>>> the error.  Sure will change it.
>>>>
>>>> In the current code, there is no recovery method named "unknown" even
>>>> though the document mentions it
>>>>
>>>> https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/
>>>> drm_drv.c#L534
>>>>
>>>> Since we are adding something new, can it be "manual" instead of unknown?
>>>
>>> Okay missed it. It's in the drm_dev_wedged_event function. Will use unknown
>>>>
>>>>> Let's go with that then. And use other hints like logs and sysfs so,
>>>>> Admin
>>>>> has a better information of what to do.
>>>>>
>>>>>> But creating an event with WEDGED=firmware-flash just sounds to
>>>>>> specific, when we go down that route we might soon have
>>>>>> WEDGE=change- bios-setting, WEDGE=....
>>>>>
>>>>> Well, I agree that we shouldn't explode the options exponentially here.
>>>>>
>>>>> Although I believe that firmware flashing should be a common case in many
>>>>> case and could be a candidate for another indication.
>>>>>
>>>>> But let's move on with WEDGE='unknown' for this case.
>> 
>> I understand that WEDGED=firmware-flash can't be handled in a generic way
>> for all drivers but it is simply not as same as WEDGED=unknown since the
>> driver knows something specific needs to be done here.
>> 
>> I'm wondering if we could add a WEDGED=vendor-specific method for such
>> cases?
>
> Works for me as well.
>
> My main concern was that we should not start to invent specific wedge events for all kind of different problems.
>
> On the other hand what's the additional value to distinct between unknown and vendor-specific?
>
> In other words even if the necessary handling is unknown to the wedge event, the administrator could and should still examine the logs to see what to do.

They're somewhat similar except the consumer can execute vendor specific
triggers (look at some sys/proc entries or logs) based on device id that
the consumer is already familiar with as defined by the vendor, and could
potentially be automated.

Unknown is basically "I'm clueless and good luck with your investigation".

So the distinction is whether the driver is able to provide definition for
its vendor specific cases and how well documented they are.

Raag

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-07-01 16:02                         ` Raag Jadav
@ 2025-07-01 16:44                           ` Riana Tauro
  2025-07-01 17:15                             ` André Almeida
  0 siblings, 1 reply; 13+ messages in thread
From: Riana Tauro @ 2025-07-01 16:44 UTC (permalink / raw)
  To: Raag Jadav, Christian König
  Cc: Rodrigo Vivi, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	André Almeida, David Airlie



On 7/1/2025 9:32 PM, Raag Jadav wrote:
> On Tue, Jul 01, 2025 at 04:35:42PM +0200, Christian König wrote:
>> On 01.07.25 16:23, Raag Jadav wrote:
>>> On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
>>>> On 7/1/2025 5:07 PM, Riana Tauro wrote:
>>>>> On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
>>>>>> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>>>>>>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>>>>>>> Or at least print a big warning into the system log?
>>>>>>>>>>
>>>>>>>>>> I mean a firmware update is usually something which
>>>>>>>>>> the system administrator triggers very explicitly
>>>>>>>>>> because when it fails for some reason (e.g.
>>>>>>>>>> unexpected reset, power outage or whatever) it can
>>>>>>>>>> sometimes brick the HW.
>>>>>>>>>>
>>>>>>>>>> I think it's rather brave to do this automatically.
>>>>>>>>>> Are you sure we don't talk past each other on the
>>>>>>>>>> meaning of the wedge event?
>>>>>>>>>
>>>>>>>>> The goal is not to do that automatically, but raise the
>>>>>>>>> uevent to the admin
>>>>>>>>> with enough information that they can decide for the right correctable
>>>>>>>>> action.
>>>>>>>>
>>>>>>>> Christian, Andre, any concerns with this still?
>>>>>>>
>>>>>>> Well, that sounds not quite the correct use case for wedge events.
>>>>>>>
>>>>>>> See the wedge event is made for automation.
>>>>>>
>>>>>> I respectfully disagree with this statement.
>>>>>>
>>>>>> The wedged state in i915 and xe, then ported to drm, was never just about
>>>>>> automation. Of course, the unbind + flr + rebind is one that driver
>>>>>> cannot
>>>>>> do by itself, hence needs automation. But wedge cases were also very
>>>>>> useful
>>>>>> in other situations like keeping the device in the failure stage for
>>>>>> debuging
>>>>>> (without automation) or keeping other critical things up like
>>>>>> display with SW
>>>>>> rendering (again, nothing about automation).
>>
>> Granted, automation is probably not the right term.
>>
>> What I wanted to say is that the wedge event should not replace information in the system log.
>>
>>>>>>
>>>>>>> For example to allow a process supervising containers get the
>>>>>>> device working again and re-start the container which used it or
>>>>>>> gather crash log etc .....
>>>>>>>
>>>>>>> When you want to notify the system administrator which manual
>>>>>>> intervention is necessary then I would just write that into the
>>>>>>> system log and raise a device event with WEDGED=unknown.
>>>>>>>
>>>>>>> What we could potentially do is to separate between
>>>>>>> WEDGED=unknown and WEDGED=manual, e.g. between driver has no
>>>>>>> idea what to do and driver printed useful info into the system
>>>>>>> log.
>>>>>>
>>>>>> Well, you are right here. Even our official documentation in drm-uapi.rst
>>>>>> already tells that firmware flashing should be a case for 'unknown'.
>>>>>
>>>>> I had added specific method since we know firmware flash will recover
>>>>> the error.  Sure will change it.
>>>>>
>>>>> In the current code, there is no recovery method named "unknown" even
>>>>> though the document mentions it
>>>>>
>>>>> https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/
>>>>> drm_drv.c#L534
>>>>>
>>>>> Since we are adding something new, can it be "manual" instead of unknown?
>>>>
>>>> Okay missed it. It's in the drm_dev_wedged_event function. Will use unknown
>>>>>
>>>>>> Let's go with that then. And use other hints like logs and sysfs so,
>>>>>> Admin
>>>>>> has a better information of what to do.
>>>>>>
>>>>>>> But creating an event with WEDGED=firmware-flash just sounds to
>>>>>>> specific, when we go down that route we might soon have
>>>>>>> WEDGE=change- bios-setting, WEDGE=....
>>>>>>
>>>>>> Well, I agree that we shouldn't explode the options exponentially here.
>>>>>>
>>>>>> Although I believe that firmware flashing should be a common case in many
>>>>>> case and could be a candidate for another indication.
>>>>>>
>>>>>> But let's move on with WEDGE='unknown' for this case.
>>>
>>> I understand that WEDGED=firmware-flash can't be handled in a generic way
>>> for all drivers but it is simply not as same as WEDGED=unknown since the
>>> driver knows something specific needs to be done here.
>>>
>>> I'm wondering if we could add a WEDGED=vendor-specific method for such
>>> cases?
>>
>> Works for me as well.
>>
>> My main concern was that we should not start to invent specific wedge events for all kind of different problems.
>>
>> On the other hand what's the additional value to distinct between unknown and vendor-specific?
>>
>> In other words even if the necessary handling is unknown to the wedge event, the administrator could and should still examine the logs to see what to do.
> 
> They're somewhat similar except the consumer can execute vendor specific
> triggers (look at some sys/proc entries or logs) based on device id that
> the consumer is already familiar with as defined by the vendor, and could
> potentially be automated.
> 
> Unknown is basically "I'm clueless and good luck with your investigation".
> 
> So the distinction is whether the driver is able to provide definition for
> its vendor specific cases and how well documented they are.

Agree with Raag. We know which recovery method works here. Rather than 
using 'unknown', 'manual/vendor' macro seems better with vendor specific 
documentation for recovery.

Thanks
Riana

> 
> Raag



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
  2025-07-01 16:44                           ` Riana Tauro
@ 2025-07-01 17:15                             ` André Almeida
  0 siblings, 0 replies; 13+ messages in thread
From: André Almeida @ 2025-07-01 17:15 UTC (permalink / raw)
  To: Riana Tauro, Raag Jadav, Christian König
  Cc: Rodrigo Vivi, intel-xe, dri-devel, anshuman.gupta, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	David Airlie

Em 01/07/2025 13:44, Riana Tauro escreveu:
> 
> 
> On 7/1/2025 9:32 PM, Raag Jadav wrote:
>> On Tue, Jul 01, 2025 at 04:35:42PM +0200, Christian König wrote:
>>> On 01.07.25 16:23, Raag Jadav wrote:
>>>> On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
>>>>> On 7/1/2025 5:07 PM, Riana Tauro wrote:
>>>>>> On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
>>>>>>> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>>>>>>>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>>>>>>>> Or at least print a big warning into the system log?
>>>>>>>>>>>
>>>>>>>>>>> I mean a firmware update is usually something which
>>>>>>>>>>> the system administrator triggers very explicitly
>>>>>>>>>>> because when it fails for some reason (e.g.
>>>>>>>>>>> unexpected reset, power outage or whatever) it can
>>>>>>>>>>> sometimes brick the HW.
>>>>>>>>>>>
>>>>>>>>>>> I think it's rather brave to do this automatically.
>>>>>>>>>>> Are you sure we don't talk past each other on the
>>>>>>>>>>> meaning of the wedge event?
>>>>>>>>>>
>>>>>>>>>> The goal is not to do that automatically, but raise the
>>>>>>>>>> uevent to the admin
>>>>>>>>>> with enough information that they can decide for the right 
>>>>>>>>>> correctable
>>>>>>>>>> action.
>>>>>>>>>
>>>>>>>>> Christian, Andre, any concerns with this still?
>>>>>>>>
>>>>>>>> Well, that sounds not quite the correct use case for wedge events.
>>>>>>>>
>>>>>>>> See the wedge event is made for automation.
>>>>>>>
>>>>>>> I respectfully disagree with this statement.
>>>>>>>
>>>>>>> The wedged state in i915 and xe, then ported to drm, was never 
>>>>>>> just about
>>>>>>> automation. Of course, the unbind + flr + rebind is one that driver
>>>>>>> cannot
>>>>>>> do by itself, hence needs automation. But wedge cases were also very
>>>>>>> useful
>>>>>>> in other situations like keeping the device in the failure stage for
>>>>>>> debuging
>>>>>>> (without automation) or keeping other critical things up like
>>>>>>> display with SW
>>>>>>> rendering (again, nothing about automation).
>>>
>>> Granted, automation is probably not the right term.
>>>
>>> What I wanted to say is that the wedge event should not replace 
>>> information in the system log.
>>>
>>>>>>>
>>>>>>>> For example to allow a process supervising containers get the
>>>>>>>> device working again and re-start the container which used it or
>>>>>>>> gather crash log etc .....
>>>>>>>>
>>>>>>>> When you want to notify the system administrator which manual
>>>>>>>> intervention is necessary then I would just write that into the
>>>>>>>> system log and raise a device event with WEDGED=unknown.
>>>>>>>>
>>>>>>>> What we could potentially do is to separate between
>>>>>>>> WEDGED=unknown and WEDGED=manual, e.g. between driver has no
>>>>>>>> idea what to do and driver printed useful info into the system
>>>>>>>> log.
>>>>>>>
>>>>>>> Well, you are right here. Even our official documentation in drm- 
>>>>>>> uapi.rst
>>>>>>> already tells that firmware flashing should be a case for 'unknown'.
>>>>>>
>>>>>> I had added specific method since we know firmware flash will recover
>>>>>> the error.  Sure will change it.
>>>>>>
>>>>>> In the current code, there is no recovery method named "unknown" even
>>>>>> though the document mentions it
>>>>>>
>>>>>> https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/
>>>>>> drm_drv.c#L534
>>>>>>
>>>>>> Since we are adding something new, can it be "manual" instead of 
>>>>>> unknown?
>>>>>
>>>>> Okay missed it. It's in the drm_dev_wedged_event function. Will use 
>>>>> unknown
>>>>>>
>>>>>>> Let's go with that then. And use other hints like logs and sysfs so,
>>>>>>> Admin
>>>>>>> has a better information of what to do.
>>>>>>>
>>>>>>>> But creating an event with WEDGED=firmware-flash just sounds to
>>>>>>>> specific, when we go down that route we might soon have
>>>>>>>> WEDGE=change- bios-setting, WEDGE=....
>>>>>>>
>>>>>>> Well, I agree that we shouldn't explode the options exponentially 
>>>>>>> here.
>>>>>>>
>>>>>>> Although I believe that firmware flashing should be a common case 
>>>>>>> in many
>>>>>>> case and could be a candidate for another indication.
>>>>>>>
>>>>>>> But let's move on with WEDGE='unknown' for this case.
>>>>
>>>> I understand that WEDGED=firmware-flash can't be handled in a 
>>>> generic way
>>>> for all drivers but it is simply not as same as WEDGED=unknown since 
>>>> the
>>>> driver knows something specific needs to be done here.
>>>>
>>>> I'm wondering if we could add a WEDGED=vendor-specific method for such
>>>> cases?
>>>
>>> Works for me as well.
>>>
>>> My main concern was that we should not start to invent specific wedge 
>>> events for all kind of different problems.
>>>
>>> On the other hand what's the additional value to distinct between 
>>> unknown and vendor-specific?
>>>
>>> In other words even if the necessary handling is unknown to the wedge 
>>> event, the administrator could and should still examine the logs to 
>>> see what to do.
>>
>> They're somewhat similar except the consumer can execute vendor specific
>> triggers (look at some sys/proc entries or logs) based on device id that
>> the consumer is already familiar with as defined by the vendor, and could
>> potentially be automated.
>>
>> Unknown is basically "I'm clueless and good luck with your 
>> investigation".
>>
>> So the distinction is whether the driver is able to provide definition 
>> for
>> its vendor specific cases and how well documented they are.
> 
> Agree with Raag. We know which recovery method works here. Rather than 
> using 'unknown', 'manual/vendor' macro seems better with vendor specific 
> documentation for recovery.
> 

That makes sense for me as well, thanks!

> Thanks
> Riana
> 
>>
>> Raag
> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-07-01 17:16 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20250623100109.1086459-1-riana.tauro@intel.com>
     [not found] ` <20250623100109.1086459-2-riana.tauro@intel.com>
     [not found]   ` <a2bfb8be-35bc-4db9-9352-02eab1ae0881@amd.com>
2025-06-24 14:03     ` [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent Riana Tauro
2025-06-24 14:23       ` Christian König
2025-06-24 21:36         ` Rodrigo Vivi
2025-06-27 21:38           ` Rodrigo Vivi
2025-06-30  8:29             ` Christian König
2025-06-30 17:33               ` Rodrigo Vivi
2025-07-01 11:37                 ` Riana Tauro
2025-07-01 11:41                   ` Riana Tauro
2025-07-01 14:23                     ` Raag Jadav
2025-07-01 14:35                       ` Christian König
2025-07-01 16:02                         ` Raag Jadav
2025-07-01 16:44                           ` Riana Tauro
2025-07-01 17:15                             ` André Almeida

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).