[PATCH v3 0/7] Handle Firmware reported Hardware Errors


* [PATCH v3 0/7] Handle Firmware reported Hardware Errors
@ 2025-07-02 14:11 Riana Tauro
  2025-07-02 14:11 ` [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 14:11 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add support to handle firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device
as an MSI interrupt.

The Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of firmware error, the error cause is written to the HEC
Firmware error register.

On encountering such CSC firmware errors, the graphics device is
wedged and the only way to recover from these errors is a firmware flash.

Add a vendor-specific recovery method to drm device wedged uevent.
The device will enter runtime survivability mode and send a drm device
wedged uevent to notify userspace when a firmware flash is required.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[754.709341] change   /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
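
For illustration only, a userspace consumer could turn the WEDGED property
shown above into a set of flags along the following lines. This is a minimal
sketch and not part of the series: parse_wedged(), the wedged_method enum and
its values are hypothetical names, and treating an unrecognised token as
"unknown" is an assumption of the sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side flags; these are not kernel or libudev API. */
enum wedged_method {
	WEDGED_NONE            = 1 << 0,
	WEDGED_REBIND          = 1 << 1,
	WEDGED_BUS_RESET       = 1 << 2,
	WEDGED_VENDOR_SPECIFIC = 1 << 3,
	WEDGED_UNKNOWN         = 1 << 4,
};

/* Method strings as listed in the drm-uapi table touched by patch 1. */
static const struct {
	const char *name;
	unsigned int flag;
} wedged_methods[] = {
	{ "none",            WEDGED_NONE },
	{ "rebind",          WEDGED_REBIND },
	{ "bus-reset",       WEDGED_BUS_RESET },
	{ "vendor-specific", WEDGED_VENDOR_SPECIFIC },
	{ "unknown",         WEDGED_UNKNOWN },
};

/*
 * Parse one uevent property line such as "WEDGED=rebind,bus-reset".
 * Returns a mask of wedged_method flags, or 0 if the line is not a
 * WEDGED property.
 */
static unsigned int parse_wedged(const char *prop)
{
	static const char prefix[] = "WEDGED=";
	const size_t prefix_len = sizeof(prefix) - 1;
	unsigned int flags = 0;
	const char *p;

	assert(prop != NULL);
	if (strncmp(prop, prefix, prefix_len) != 0)
		return 0;

	p = prop + prefix_len;
	while (*p) {
		size_t len = strcspn(p, ",");	/* length of the current token */
		unsigned int flag = WEDGED_UNKNOWN;	/* assumption: unrecognised means unknown */
		size_t i;

		for (i = 0; i < sizeof(wedged_methods) / sizeof(wedged_methods[0]); i++) {
			if (strlen(wedged_methods[i].name) == len &&
			    strncmp(p, wedged_methods[i].name, len) == 0) {
				flag = wedged_methods[i].flag;
				break;
			}
		}
		flags |= flag;

		p += len;
		if (*p == ',')
			p++;
	}

	return flags;
}
```

A consumer that sees WEDGED_VENDOR_SPECIFIC in the result would then fall back
to the vendor documentation instead of attempting a rebind or bus reset.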

Bspec: 50875, 53073, 53074, 53075, 53076

IGT: https://patchwork.freedesktop.org/patch/660122/

Rev2: add a fault injection for csc errors
      fix review comments

Rev3: add a vendor-specific recovery method
      add support for runtime survivability mode
      enable runtime survivability mode when csc errors are reported


Riana Tauro (7):
  drm: Add a vendor-specific recovery method to device wedged uevent
  drm/xe: Set GT as wedged before sending wedged uevent
  drm/xe/xe_survivability: Add support for Runtime survivability mode
  drm/xe/doc: Document device wedged and runtime survivability
  drm/xe: Add support to handle hardware errors
  drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  drm/xe/xe_hw_error: Add fault injection to trigger csc error handler

 Documentation/gpu/drm-uapi.rst                |   5 +-
 Documentation/gpu/xe/index.rst                |   1 +
 Documentation/gpu/xe/xe_device.rst            |  10 +
 Documentation/gpu/xe/xe_pcode.rst             |   6 +-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   1 +
 drivers/gpu/drm/xe/regs/xe_gsc_regs.h         |   2 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h    |  20 ++
 drivers/gpu/drm/xe/regs/xe_irq_regs.h         |   1 +
 drivers/gpu/drm/xe/xe_debugfs.c               |   2 +
 drivers/gpu/drm/xe/xe_device.c                |  39 +++-
 drivers/gpu/drm/xe/xe_device_types.h          |   3 +
 drivers/gpu/drm/xe/xe_hw_error.c              | 187 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h              |  15 ++
 drivers/gpu/drm/xe/xe_irq.c                   |   4 +
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  57 +++++-
 drivers/gpu/drm/xe/xe_survivability_mode.h    |   4 +-
 .../gpu/drm/xe/xe_survivability_mode_types.h  |   8 +
 include/drm/drm_device.h                      |   4 +
 19 files changed, 350 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_device.rst
 create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

-- 
2.47.1



* [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
@ 2025-07-02 14:11 ` Riana Tauro
  2025-07-03  4:06   ` Raag Jadav
  2025-07-02 14:11 ` [PATCH v3 2/7] drm/xe: Set GT as wedged before sending " Riana Tauro
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 14:11 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, André Almeida,
	Christian König, David Airlie, dri-devel

Certain errors can cause the device to be wedged and may
require a vendor specific recovery method to restore normal
operation.

Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
must provide additional recovery documentation if this method
is used.

Cc: André Almeida <andrealmeid@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: <dri-devel@lists.freedesktop.org>
Suggested-by: Raag Jadav <raag.jadav@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/drm-uapi.rst | 5 ++++-
 drivers/gpu/drm/drm_drv.c      | 2 ++
 include/drm/drm_device.h       | 4 ++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 263e5a97c080..1ea835a3fc66 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -424,7 +424,9 @@ uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
 more side-effects. If driver is unsure about recovery or method is unknown
 (like soft/hard system reboot, firmware flashing, physical device replacement
 or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
-will be sent instead.
+will be sent instead. If recovery method is specific to vendor
+``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
+specific documentation for further recovery steps.
 
 Userspace consumers can parse this event and attempt recovery as per the
 following expectations.
@@ -435,6 +437,7 @@ following expectations.
     none            optional telemetry collection
     rebind          unbind + bind driver
     bus-reset       unbind + bus reset/re-enumeration + bind
+    vendor-specific vendor specific recovery method
     unknown         consumer policy
     =============== ========================================
 
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 02556363e918..c72e5c67479d 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
 		return "rebind";
 	case DRM_WEDGE_RECOVERY_BUS_RESET:
 		return "bus-reset";
+	case DRM_WEDGE_RECOVERY_VENDOR:
+		return "vendor-specific";
 	default:
 		return NULL;
 	}
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index 08b3b2467c4c..40a4caaa6313 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -26,10 +26,14 @@ struct pci_controller;
  * Recovery methods for wedged device in order of less to more side-effects.
  * To be used with drm_dev_wedged_event() as recovery @method. Callers can
  * use any one, multiple (or'd) or none depending on their needs.
+ *
+ * If DRM_WEDGE_RECOVERY_VENDOR method is used, vendors must provide additional
+ * documentation outlining further recovery steps.
  */
 #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
 #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
 #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
+#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
 
 /**
  * struct drm_wedge_task_info - information about the guilty task of a wedge dev
-- 
2.47.1
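
To make the relation between the recovery bitmask and the uevent string
concrete, here is a standalone userspace sketch. It is not the code in
drm_drv.c: the DRM_WEDGE_RECOVERY_* values are local copies of the defines in
the diff above, while BIT(), wedge_recovery_name() and format_wedged() are
hypothetical helpers written for this example. The "unknown" fallback follows
the wording in drm-uapi.rst.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Local copies of the values from the patch, for illustration only. */
#define BIT(n)				(1UL << (n))
#define DRM_WEDGE_RECOVERY_NONE		BIT(0)
#define DRM_WEDGE_RECOVERY_REBIND	BIT(1)
#define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)
#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)

/* Hypothetical helper: map a single recovery bit to its uevent string. */
static const char *wedge_recovery_name(unsigned long opt)
{
	switch (opt) {
	case DRM_WEDGE_RECOVERY_NONE:
		return "none";
	case DRM_WEDGE_RECOVERY_REBIND:
		return "rebind";
	case DRM_WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	case DRM_WEDGE_RECOVERY_VENDOR:
		return "vendor-specific";
	default:
		return NULL;
	}
}

/*
 * Hypothetical helper: build "WEDGED=<method1>[,..,<methodN>]" from a mask,
 * lowest bit first, i.e. in order of less to more side-effects.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int format_wedged(unsigned long method, char *buf, size_t size)
{
	const size_t prefix_len = sizeof("WEDGED=") - 1;
	unsigned int bit;
	size_t len;
	int n;

	assert(buf != NULL);
	n = snprintf(buf, size, "WEDGED=");
	if (n < 0 || (size_t)n >= size)
		return -1;
	len = (size_t)n;

	for (bit = 0; bit < 4; bit++) {
		const char *name = wedge_recovery_name(method & BIT(bit));

		if (!name)
			continue;	/* bit not set in the mask */

		n = snprintf(buf + len, size - len, "%s%s",
			     len > prefix_len ? "," : "", name);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}

	/* No known method was requested: report "unknown". */
	if (len == prefix_len) {
		n = snprintf(buf + len, size - len, "unknown");
		if (n < 0 || (size_t)n >= size - len)
			return -1;
	}

	return 0;
}
```

With this sketch, a mask of DRM_WEDGE_RECOVERY_VENDOR alone yields the
"WEDGED=vendor-specific" string seen in the cover letter's udevadm output.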



* Re: [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-02 14:11 ` [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
@ 2025-07-03  4:06   ` Raag Jadav
  2025-07-03  5:20     ` Riana Tauro
  0 siblings, 1 reply; 36+ messages in thread
From: Raag Jadav @ 2025-07-03  4:06 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban, André Almeida, Christian König,
	David Airlie, dri-devel

On Wed, Jul 02, 2025 at 07:41:11PM +0530, Riana Tauro wrote:
> Certain errors can cause the device to be wedged and may
> require a vendor specific recovery method to restore normal
> operation.
> 
> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> must provide additional recovery documentation if this method
> is used.
> 
> Cc: André Almeida <andrealmeid@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  Documentation/gpu/drm-uapi.rst | 5 ++++-
>  drivers/gpu/drm/drm_drv.c      | 2 ++
>  include/drm/drm_device.h       | 4 ++++
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 263e5a97c080..1ea835a3fc66 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -424,7 +424,9 @@ uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>  more side-effects. If driver is unsure about recovery or method is unknown
>  (like soft/hard system reboot, firmware flashing, physical device replacement
>  or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``

We may also want to remove the examples for unknown method so that we
don't confuse users in case any of it overlaps with vendor-specific.

> -will be sent instead.
> +will be sent instead. If recovery method is specific to vendor
> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> +specific documentation for further recovery steps.
>  
>  Userspace consumers can parse this event and attempt recovery as per the
>  following expectations.
> @@ -435,6 +437,7 @@ following expectations.
>      none            optional telemetry collection
>      rebind          unbind + bind driver
>      bus-reset       unbind + bus reset/re-enumeration + bind
> +    vendor-specific vendor specific recovery method
>      unknown         consumer policy
>      =============== ========================================
>  
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 02556363e918..c72e5c67479d 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>  		return "rebind";
>  	case DRM_WEDGE_RECOVERY_BUS_RESET:
>  		return "bus-reset";
> +	case DRM_WEDGE_RECOVERY_VENDOR:
> +		return "vendor-specific";
>  	default:
>  		return NULL;
>  	}
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index 08b3b2467c4c..40a4caaa6313 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -26,10 +26,14 @@ struct pci_controller;
>   * Recovery methods for wedged device in order of less to more side-effects.
>   * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>   * use any one, multiple (or'd) or none depending on their needs.
> + *
> + * If DRM_WEDGE_RECOVERY_VENDOR method is used, vendors must provide additional
> + * documentation outlining further recovery steps.

The original documentation is sufficient so let's not duplicate specific
cases here.

Raag

>   */
>  #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>  #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>  #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>  
>  /**
>   * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> -- 
> 2.47.1
> 


* Re: [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-03  4:06   ` Raag Jadav
@ 2025-07-03  5:20     ` Riana Tauro
  2025-07-03  6:40       ` Raag Jadav
  0 siblings, 1 reply; 36+ messages in thread
From: Riana Tauro @ 2025-07-03  5:20 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban, André Almeida, Christian König,
	David Airlie, dri-devel



On 7/3/2025 9:36 AM, Raag Jadav wrote:
> On Wed, Jul 02, 2025 at 07:41:11PM +0530, Riana Tauro wrote:
>> Certain errors can cause the device to be wedged and may
>> require a vendor specific recovery method to restore normal
>> operation.
>>
>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>> must provide additional recovery documentation if this method
>> is used.
>>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: <dri-devel@lists.freedesktop.org>
>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   Documentation/gpu/drm-uapi.rst | 5 ++++-
>>   drivers/gpu/drm/drm_drv.c      | 2 ++
>>   include/drm/drm_device.h       | 4 ++++
>>   3 files changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 263e5a97c080..1ea835a3fc66 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -424,7 +424,9 @@ uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>   more side-effects. If driver is unsure about recovery or method is unknown
>>   (like soft/hard system reboot, firmware flashing, physical device replacement
>>   or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> 
> We may also want to remove the examples for unknown method so that we
> don't confuse users in case any of it overlaps with vendor-specific.

Okay will remove this

> 
>> -will be sent instead.
>> +will be sent instead. If recovery method is specific to vendor
>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>> +specific documentation for further recovery steps.
>>   
>>   Userspace consumers can parse this event and attempt recovery as per the
>>   following expectations.
>> @@ -435,6 +437,7 @@ following expectations.
>>       none            optional telemetry collection
>>       rebind          unbind + bind driver
>>       bus-reset       unbind + bus reset/re-enumeration + bind
>> +    vendor-specific vendor specific recovery method
>>       unknown         consumer policy
>>       =============== ========================================
>>   
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 02556363e918..c72e5c67479d 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>   		return "rebind";
>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>   		return "bus-reset";
>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>> +		return "vendor-specific";
>>   	default:
>>   		return NULL;
>>   	}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index 08b3b2467c4c..40a4caaa6313 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -26,10 +26,14 @@ struct pci_controller;
>>    * Recovery methods for wedged device in order of less to more side-effects.
>>    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>    * use any one, multiple (or'd) or none depending on their needs.
>> + *
>> + * If DRM_WEDGE_RECOVERY_VENDOR method is used, vendors must provide additional
>> + * documentation outlining further recovery steps.
> 
> The original documentation is sufficient so let's not duplicate specific
> cases here.

Added it here so anyone checking the code directly is aware.

Riana
>
> Raag
> 
>>    */
>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>   
>>   /**
>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>> -- 
>> 2.47.1
>>




* Re: [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-03  5:20     ` Riana Tauro
@ 2025-07-03  6:40       ` Raag Jadav
  2025-07-03  6:50         ` Riana Tauro
  0 siblings, 1 reply; 36+ messages in thread
From: Raag Jadav @ 2025-07-03  6:40 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban, André Almeida, Christian König,
	David Airlie, dri-devel

On Thu, Jul 03, 2025 at 10:50:53AM +0530, Riana Tauro wrote:
> On 7/3/2025 9:36 AM, Raag Jadav wrote:
> > On Wed, Jul 02, 2025 at 07:41:11PM +0530, Riana Tauro wrote:
> > > Certain errors can cause the device to be wedged and may
> > > require a vendor specific recovery method to restore normal
> > > operation.
> > > 
> > > Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
> > > must provide additional recovery documentation if this method
> > > is used.

...

> > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > index 08b3b2467c4c..40a4caaa6313 100644
> > > --- a/include/drm/drm_device.h
> > > +++ b/include/drm/drm_device.h
> > > @@ -26,10 +26,14 @@ struct pci_controller;
> > >    * Recovery methods for wedged device in order of less to more side-effects.
> > >    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > >    * use any one, multiple (or'd) or none depending on their needs.
> > > + *
> > > + * If DRM_WEDGE_RECOVERY_VENDOR method is used, vendors must provide additional
> > > + * documentation outlining further recovery steps.
> > 
> > The original documentation is sufficient so let's not duplicate specific
> > cases here.
> 
> Added it here so anyone checking the code directly is aware.

Then a reference to uapi doc would be more useful.

Raag


* Re: [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent
  2025-07-03  6:40       ` Raag Jadav
@ 2025-07-03  6:50         ` Riana Tauro
  0 siblings, 0 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-03  6:50 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban, André Almeida, Christian König,
	David Airlie, dri-devel



On 7/3/2025 12:10 PM, Raag Jadav wrote:
> On Thu, Jul 03, 2025 at 10:50:53AM +0530, Riana Tauro wrote:
>> On 7/3/2025 9:36 AM, Raag Jadav wrote:
>>> On Wed, Jul 02, 2025 at 07:41:11PM +0530, Riana Tauro wrote:
>>>> Certain errors can cause the device to be wedged and may
>>>> require a vendor specific recovery method to restore normal
>>>> operation.
>>>>
>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors
>>>> must provide additional recovery documentation if this method
>>>> is used.
> 
> ...
> 
>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>>> index 08b3b2467c4c..40a4caaa6313 100644
>>>> --- a/include/drm/drm_device.h
>>>> +++ b/include/drm/drm_device.h
>>>> @@ -26,10 +26,14 @@ struct pci_controller;
>>>>     * Recovery methods for wedged device in order of less to more side-effects.
>>>>     * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>>>     * use any one, multiple (or'd) or none depending on their needs.
>>>> + *
>>>> + * If DRM_WEDGE_RECOVERY_VENDOR method is used, vendors must provide additional
>>>> + * documentation outlining further recovery steps.
>>>
>>> The original documentation is sufficient so let's not duplicate specific
>>> cases here.
>>
>> Added it here so anyone checking the code directly is aware.
> 
> Then a reference to uapi doc would be more useful.

That should be okay. Will do that

Thanks
Riana

> 
> Raag



* [PATCH v3 2/7] drm/xe: Set GT as wedged before sending wedged uevent
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
  2025-07-02 14:11 ` [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
@ 2025-07-02 14:11 ` Riana Tauro
  2025-07-02 21:41   ` Rodrigo Vivi
  2025-07-03  4:18   ` Raag Jadav
  2025-07-02 14:11 ` [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode Riana Tauro
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 14:11 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Userspace should be notified after setting the device as wedged.
Re-order function calls to set gt wedged before sending uevent.

Suggested-by: Raag Jadav <raag.jadav@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 0b73cb72bad1..4a38486dccc8 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1123,8 +1123,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a module
+ * This is a final state that can only be cleared with the recovery method
+ * specified in the drm wedged uevent. The default recovery method is
  * re-probe (unbind + bind).
+ *
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
  * failure or guc loading failure. Userspace will be notified of this state
@@ -1151,6 +1153,9 @@ void xe_device_declare_wedged(struct xe_device *xe)
 		return;
 	}
 
+	for_each_gt(gt, xe, id)
+		xe_gt_declare_wedged(gt);
+
 	if (!atomic_xchg(&xe->wedged.flag, 1)) {
 		xe->needs_flr_on_fini = true;
 		drm_err(&xe->drm,
@@ -1164,7 +1169,4 @@ void xe_device_declare_wedged(struct xe_device *xe)
 				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
 				     NULL);
 	}
-
-	for_each_gt(gt, xe, id)
-		xe_gt_declare_wedged(gt);
 }
-- 
2.47.1



* Re: [PATCH v3 2/7] drm/xe: Set GT as wedged before sending wedged uevent
  2025-07-02 14:11 ` [PATCH v3 2/7] drm/xe: Set GT as wedged before sending " Riana Tauro
@ 2025-07-02 21:41   ` Rodrigo Vivi
  2025-07-03  4:18   ` Raag Jadav
  1 sibling, 0 replies; 36+ messages in thread
From: Rodrigo Vivi @ 2025-07-02 21:41 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, umesh.nerlige.ramappa, frank.scarbrough, sk.anirban

On Wed, Jul 02, 2025 at 07:41:12PM +0530, Riana Tauro wrote:
> Userspace should be notified after setting the device as wedged.
> Re-order function calls to set gt wedged before sending uevent.
> 
> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

> ---
>  drivers/gpu/drm/xe/xe_device.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 0b73cb72bad1..4a38486dccc8 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1123,8 +1123,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>   * xe_device_declare_wedged - Declare device wedged
>   * @xe: xe device instance
>   *
> - * This is a final state that can only be cleared with a module
> + * This is a final state that can only be cleared with the recovery method
> + * specified in the drm wedged uevent. The default recovery method is
>   * re-probe (unbind + bind).
> + *
>   * In this state every IOCTL will be blocked so the GT cannot be used.
>   * In general it will be called upon any critical error such as gt reset
>   * failure or guc loading failure. Userspace will be notified of this state
> @@ -1151,6 +1153,9 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  		return;
>  	}
>  
> +	for_each_gt(gt, xe, id)
> +		xe_gt_declare_wedged(gt);
> +
>  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>  		xe->needs_flr_on_fini = true;
>  		drm_err(&xe->drm,
> @@ -1164,7 +1169,4 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
>  				     NULL);
>  	}
> -
> -	for_each_gt(gt, xe, id)
> -		xe_gt_declare_wedged(gt);
>  }
> -- 
> 2.47.1
> 


* Re: [PATCH v3 2/7] drm/xe: Set GT as wedged before sending wedged uevent
  2025-07-02 14:11 ` [PATCH v3 2/7] drm/xe: Set GT as wedged before sending " Riana Tauro
  2025-07-02 21:41   ` Rodrigo Vivi
@ 2025-07-03  4:18   ` Raag Jadav
  2025-07-03  5:18     ` Riana Tauro
  1 sibling, 1 reply; 36+ messages in thread
From: Raag Jadav @ 2025-07-03  4:18 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban

On Wed, Jul 02, 2025 at 07:41:12PM +0530, Riana Tauro wrote:
> Userspace should be notified after setting the device as wedged.
> Re-order function calls to set gt wedged before sending uevent.
> 
> Suggested-by: Raag Jadav <raag.jadav@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 0b73cb72bad1..4a38486dccc8 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1123,8 +1123,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>   * xe_device_declare_wedged - Declare device wedged
>   * @xe: xe device instance
>   *
> - * This is a final state that can only be cleared with a module
> + * This is a final state that can only be cleared with the recovery method
> + * specified in the drm wedged uevent. The default recovery method is
>   * re-probe (unbind + bind).
> + *
>   * In this state every IOCTL will be blocked so the GT cannot be used.
>   * In general it will be called upon any critical error such as gt reset
>   * failure or guc loading failure. Userspace will be notified of this state
> @@ -1151,6 +1153,9 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  		return;
>  	}
>  
> +	for_each_gt(gt, xe, id)
> +		xe_gt_declare_wedged(gt);

This is changing GuC CT state and can race with ioctls, so I think
the sequence should be

 	if (!atomic_xchg(&xe->wedged.flag, 1)) {
		...
	}

	for_each_gt(gt, xe, id)
		xe_gt_declare_wedged(gt);

	if (xe_device_wedged())
		drm_dev_wedged_event();

Raag
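
The ordering proposed above can be modelled outside the driver roughly as
follows. This is only a sketch under stated assumptions: struct fake_device
and the fake_*() functions are hypothetical stand-ins for xe_device,
xe_gt_declare_wedged() and drm_dev_wedged_event(), and sending the event only
from the first caller is a choice made for this example, not something the
mail above specifies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_MAX_GT 2

/* Hypothetical stand-in for the driver state involved in wedging. */
struct fake_device {
	atomic_int wedged;		/* models xe->wedged.flag */
	bool gt_wedged[FAKE_MAX_GT];	/* models the per-GT wedged state */
	int events_sent;		/* number of uevents "sent" */
	bool gts_wedged_at_event;	/* were all GTs wedged when notifying? */
};

/* Models the uevent: record whether every GT was already wedged. */
static void fake_send_wedged_event(struct fake_device *dev)
{
	bool all = true;
	int i;

	for (i = 0; i < FAKE_MAX_GT; i++)
		all = all && dev->gt_wedged[i];

	dev->gts_wedged_at_event = all;
	dev->events_sent++;
}

static void fake_declare_wedged(struct fake_device *dev)
{
	bool first;
	int i;

	assert(dev != NULL);

	/* 1. Set the device flag first so that new ioctls are rejected. */
	first = !atomic_exchange(&dev->wedged, 1);

	/* 2. Only then touch per-GT state, which must not race with ioctls. */
	for (i = 0; i < FAKE_MAX_GT; i++)
		dev->gt_wedged[i] = true;

	/* 3. Notify userspace last (assumption: only the first caller does). */
	if (first)
		fake_send_wedged_event(dev);
}
```

The point of the model is that by the time userspace is notified, both the
device flag and the per-GT state are already in their final wedged state.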


* Re: [PATCH v3 2/7] drm/xe: Set GT as wedged before sending wedged uevent
  2025-07-03  4:18   ` Raag Jadav
@ 2025-07-03  5:18     ` Riana Tauro
  2025-07-03  6:45       ` Raag Jadav
  0 siblings, 1 reply; 36+ messages in thread
From: Riana Tauro @ 2025-07-03  5:18 UTC (permalink / raw)
  To: Raag Jadav
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban

Hi Raag

On 7/3/2025 9:48 AM, Raag Jadav wrote:
> On Wed, Jul 02, 2025 at 07:41:12PM +0530, Riana Tauro wrote:
>> Userspace should be notified after setting the device as wedged.
>> Re-order function calls to set gt wedged before sending uevent.
>>
>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device.c | 10 ++++++----
>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 0b73cb72bad1..4a38486dccc8 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -1123,8 +1123,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>    * xe_device_declare_wedged - Declare device wedged
>>    * @xe: xe device instance
>>    *
>> - * This is a final state that can only be cleared with a module
>> + * This is a final state that can only be cleared with the recovery method
>> + * specified in the drm wedged uevent. The default recovery method is
>>    * re-probe (unbind + bind).
>> + *
>>    * In this state every IOCTL will be blocked so the GT cannot be used.
>>    * In general it will be called upon any critical error such as gt reset
>>    * failure or guc loading failure. Userspace will be notified of this state
>> @@ -1151,6 +1153,9 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   		return;
>>   	}
>>   
>> +	for_each_gt(gt, xe, id)
>> +		xe_gt_declare_wedged(gt);
> 
> This is changing GuC CT state and can race with ioctls, so I think
> the sequence should be
> 

Then isn't the previous flow better? The ioctls are blocked anyway
before sending the uevent.

Thanks
Riana

>   	if (!atomic_xchg(&xe->wedged.flag, 1)) {
> 		...
> 	}
> 
> 	for_each_gt(gt, xe, id)
> 		xe_gt_declare_wedged(gt);
> 
> 	if (xe_device_wedged())
> 		drm_dev_wedged_event();
> 
> Raag



* Re: [PATCH v3 2/7] drm/xe: Set GT as wedged before sending wedged uevent
  2025-07-03  5:18     ` Riana Tauro
@ 2025-07-03  6:45       ` Raag Jadav
  2025-07-07  6:44         ` Riana Tauro
  0 siblings, 1 reply; 36+ messages in thread
From: Raag Jadav @ 2025-07-03  6:45 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban

On Thu, Jul 03, 2025 at 10:48:06AM +0530, Riana Tauro wrote:
> On 7/3/2025 9:48 AM, Raag Jadav wrote:
> > On Wed, Jul 02, 2025 at 07:41:12PM +0530, Riana Tauro wrote:
> > > Userspace should be notified after setting the device as wedged.
> > > Re-order function calls to set gt wedged before sending uevent.
> > > 
> > > Suggested-by: Raag Jadav <raag.jadav@intel.com>
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_device.c | 10 ++++++----
> > >   1 file changed, 6 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > > index 0b73cb72bad1..4a38486dccc8 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -1123,8 +1123,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> > >    * xe_device_declare_wedged - Declare device wedged
> > >    * @xe: xe device instance
> > >    *
> > > - * This is a final state that can only be cleared with a module
> > > + * This is a final state that can only be cleared with the recovery method
> > > + * specified in the drm wedged uevent. The default recovery method is
> > >    * re-probe (unbind + bind).
> > > + *
> > >    * In this state every IOCTL will be blocked so the GT cannot be used.
> > >    * In general it will be called upon any critical error such as gt reset
> > >    * failure or guc loading failure. Userspace will be notified of this state
> > > @@ -1151,6 +1153,9 @@ void xe_device_declare_wedged(struct xe_device *xe)
> > >   		return;
> > >   	}
> > > +	for_each_gt(gt, xe, id)
> > > +		xe_gt_declare_wedged(gt);
> > 
> > This is changing GuC CT state and can race with ioctls, so I think
> > the sequence should be
> > 
> 
> Then isn't the previous flow better. The ioctls are blocked anyway before
> sending uevent.

Yes, the idea was to move the event call and not xe_gt_declare_wedged().

https://lore.kernel.org/intel-xe/aEMFcBSWL_jPMYKa@black.fi.intel.com

Raag
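
For illustration, the ordering discussed in this subthread (set the flag once,
mark every GT wedged, and only then notify userspace) can be modelled in plain
userspace C. This is a minimal sketch and not the driver code: struct
model_device, model_declare_wedged() and the counters are hypothetical
stand-ins for xe_device, xe_device_declare_wedged() and
drm_dev_wedged_event().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_MAX_GT 2	/* hypothetical GT count, for the model only */

/* Hypothetical stand-in for struct xe_device. */
struct model_device {
	atomic_int wedged_flag;
	bool gt_wedged[MODEL_MAX_GT];
	int first_time_setups;	/* how often the one-time path ran */
	int events_sent;	/* stands in for drm_dev_wedged_event() */
};

static void model_declare_wedged(struct model_device *dev)
{
	/* Only the first caller sees the old value 0 and does one-time work. */
	if (!atomic_exchange(&dev->wedged_flag, 1))
		dev->first_time_setups++;

	/* Mark every GT wedged before userspace is told anything. */
	for (int id = 0; id < MODEL_MAX_GT; id++)
		dev->gt_wedged[id] = true;

	/* Notify last, so a listener never observes a half-wedged device. */
	if (atomic_load(&dev->wedged_flag))
		dev->events_sent++;
}
```

In this model a second call repeats the GT marking and the notification but
not the one-time setup, which mirrors the sequence quoted above.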

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 2/7] drm/xe: Set GT as wedged before sending wedged uevent
  2025-07-03  6:45       ` Raag Jadav
@ 2025-07-07  6:44         ` Riana Tauro
  0 siblings, 0 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-07  6:44 UTC (permalink / raw)
  To: Raag Jadav, rodrigo.vivi, Matthew Brost
  Cc: intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban



On 7/3/2025 12:15 PM, Raag Jadav wrote:
> On Thu, Jul 03, 2025 at 10:48:06AM +0530, Riana Tauro wrote:
>> On 7/3/2025 9:48 AM, Raag Jadav wrote:
>>> On Wed, Jul 02, 2025 at 07:41:12PM +0530, Riana Tauro wrote:
>>>> Userspace should be notified after setting the device as wedged.
>>>> Re-order function calls to set gt wedged before sending uevent.
>>>>
>>>> Suggested-by: Raag Jadav <raag.jadav@intel.com>
>>>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_device.c | 10 ++++++----
>>>>    1 file changed, 6 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>>>> index 0b73cb72bad1..4a38486dccc8 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device.c
>>>> +++ b/drivers/gpu/drm/xe/xe_device.c
>>>> @@ -1123,8 +1123,10 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>>>     * xe_device_declare_wedged - Declare device wedged
>>>>     * @xe: xe device instance
>>>>     *
>>>> - * This is a final state that can only be cleared with a module
>>>> + * This is a final state that can only be cleared with the recovery method
>>>> + * specified in the drm wedged uevent. The default recovery method is
>>>>     * re-probe (unbind + bind).
>>>> + *
>>>>     * In this state every IOCTL will be blocked so the GT cannot be used.
>>>>     * In general it will be called upon any critical error such as gt reset
>>>>     * failure or guc loading failure. Userspace will be notified of this state
>>>> @@ -1151,6 +1153,9 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>>>    		return;
>>>>    	}
>>>> +	for_each_gt(gt, xe, id)
>>>> +		xe_gt_declare_wedged(gt);
>>>
>>> This is changing GuC CT state and can race with ioctls, so I think
>>> the sequence should be
>>>
>>
>> Then isn't the previous flow better. The ioctls are blocked anyway before
>> sending uevent.
> 
> Yes, the idea was to move the event call and not xe_gt_declare_wedged().
> 
> https://lore.kernel.org/intel-xe/aEMFcBSWL_jPMYKa@black.fi.intel.com

Is there any reason that xe_gt_declare_wedged() is not inside the if block?

Can it be moved inside the if instead?

Thanks
Riana



> 
> Raag



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
  2025-07-02 14:11 ` [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
  2025-07-02 14:11 ` [PATCH v3 2/7] drm/xe: Set GT as wedged before sending " Riana Tauro
@ 2025-07-02 14:11 ` Riana Tauro
  2025-07-02 21:40   ` Rodrigo Vivi
                     ` (2 more replies)
  2025-07-02 14:11 ` [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
                   ` (8 subsequent siblings)
  11 siblings, 3 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 14:11 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Certain runtime firmware errors can cause the device to be wedged
requiring a firmware flash to restore normal operation.
Runtime Survivability Mode indicates that a firmware flash is necessary to
recover the device.

The sysfs file below indicates that the device is in survivability mode:

/sys/bus/pci/devices/<device>/survivability_mode

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c                |  2 +-
 drivers/gpu/drm/xe/xe_survivability_mode.c    | 26 ++++++++++++++++---
 drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 ++-
 .../gpu/drm/xe/xe_survivability_mode_types.h  |  8 ++++++
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 4a38486dccc8..5defa54ccd26 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device *xe)
 		 * possible, but still return the previous error for error
 		 * propagation
 		 */
-		err = xe_survivability_mode_enable(xe);
+		err = xe_survivability_mode_enable(xe, XE_SURVIVABILITY_TYPE_BOOT);
 		if (err)
 			return err;
 
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
index 1f710b3fc599..e1adcb33c9b0 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
@@ -129,7 +129,10 @@ static ssize_t survivability_mode_show(struct device *dev,
 	struct xe_survivability_info *info = survivability->info;
 	int index = 0, count = 0;
 
-	for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
+	count += sysfs_emit_at(buff, count, "Survivability mode: %s\n",
+			       survivability->type ? "Runtime" : "Boot");
+
+	for (index = 0; survivability->boot_status && index < MAX_SCRATCH_MMIO; index++) {
 		if (info[index].reg)
 			count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
 					       info[index].reg, info[index].value);
@@ -169,6 +172,10 @@ static int enable_survivability_mode(struct pci_dev *pdev)
 	if (ret)
 		return ret;
 
+	/* Only create sysfs for runtime survivability mode */
+	if (xe_survivability_mode_is_runtime(xe))
+		return 0;
+
 	/* Make sure xe_heci_gsc_init() knows about survivability mode */
 	survivability->mode = true;
 
@@ -189,6 +196,17 @@ static int enable_survivability_mode(struct pci_dev *pdev)
 	return 0;
 }
 
+/**
+ * xe_survivability_mode_is_runtime - check if survivability mode is runtime
+ * @xe: xe device instance
+ *
+ * Returns true if in runtime survivability mode, false otherwise
+ */
+bool xe_survivability_mode_is_runtime(struct xe_device *xe)
+{
+	return xe->survivability.type == XE_SURVIVABILITY_TYPE_RUNTIME;
+}
+
 /**
  * xe_survivability_mode_is_enabled - check if survivability mode is enabled
  * @xe: xe device instance
@@ -251,16 +269,18 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
  * Return: 0 if survivability mode is enabled or not requested; negative error
  * code otherwise.
  */
-int xe_survivability_mode_enable(struct xe_device *xe)
+int xe_survivability_mode_enable(struct xe_device *xe, const enum xe_survivability_type type)
 {
 	struct xe_survivability *survivability = &xe->survivability;
 	struct xe_survivability_info *info;
 	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
 
-	if (!xe_survivability_mode_is_requested(xe))
+	if (!xe_survivability_mode_is_requested(xe) &&
+	    type != XE_SURVIVABILITY_TYPE_RUNTIME)
 		return 0;
 
 	survivability->size = MAX_SCRATCH_MMIO;
+	survivability->type = type;
 
 	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
 			    GFP_KERNEL);
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
index 02231c2bf008..559d1e99b03a 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.h
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
@@ -9,9 +9,11 @@
 #include <linux/types.h>
 
 struct xe_device;
+enum xe_survivability_type;
 
-int xe_survivability_mode_enable(struct xe_device *xe);
+int xe_survivability_mode_enable(struct xe_device *xe, const enum xe_survivability_type);
 bool xe_survivability_mode_is_enabled(struct xe_device *xe);
+bool xe_survivability_mode_is_runtime(struct xe_device *xe);
 bool xe_survivability_mode_is_requested(struct xe_device *xe);
 
 #endif /* _XE_SURVIVABILITY_MODE_H_ */
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
index 19d433e253df..01f07d9c4124 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
+++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
@@ -9,6 +9,11 @@
 #include <linux/limits.h>
 #include <linux/types.h>
 
+enum xe_survivability_type {
+	XE_SURVIVABILITY_TYPE_BOOT,
+	XE_SURVIVABILITY_TYPE_RUNTIME,
+};
+
 struct xe_survivability_info {
 	char name[NAME_MAX];
 	u32 reg;
@@ -30,6 +35,9 @@ struct xe_survivability {
 
 	/** @mode: boolean to indicate survivability mode */
 	bool mode;
+
+	/** @type: survivability mode type (boot or runtime) */
+	enum xe_survivability_type type;
 };
 
 #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-02 14:11 ` [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode Riana Tauro
@ 2025-07-02 21:40   ` Rodrigo Vivi
  2025-07-03  5:16     ` Riana Tauro
  2025-07-02 23:33   ` kernel test robot
  2025-07-09 18:04   ` Summers, Stuart
  2 siblings, 1 reply; 36+ messages in thread
From: Rodrigo Vivi @ 2025-07-02 21:40 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, umesh.nerlige.ramappa, frank.scarbrough, sk.anirban

On Wed, Jul 02, 2025 at 07:41:13PM +0530, Riana Tauro wrote:
> Certain runtime firmware errors can cause the device to be wedged
> requiring a firmware flash to restore normal operation.
> Runtime Survivability Mode indicates that a firmware flash is necessary to
> recover the device.
> 
> The below sysfs is an indication that device is in survivability mode
> 
> /sys/bus/pci/devices/<device>/survivability_mode
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c                |  2 +-
>  drivers/gpu/drm/xe/xe_survivability_mode.c    | 26 ++++++++++++++++---
>  drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 ++-
>  .../gpu/drm/xe/xe_survivability_mode_types.h  |  8 ++++++
>  4 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 4a38486dccc8..5defa54ccd26 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device *xe)
>  		 * possible, but still return the previous error for error
>  		 * propagation
>  		 */
> -		err = xe_survivability_mode_enable(xe);
> +		err = xe_survivability_mode_enable(xe, XE_SURVIVABILITY_TYPE_BOOT);
>  		if (err)
>  			return err;
>  
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index 1f710b3fc599..e1adcb33c9b0 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -129,7 +129,10 @@ static ssize_t survivability_mode_show(struct device *dev,
>  	struct xe_survivability_info *info = survivability->info;
>  	int index = 0, count = 0;
>  
> -	for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
> +	count += sysfs_emit_at(buff, count, "Survivability mode: %s\n",
> +			       survivability->type ? "Runtime" : "Boot");
> +
> +	for (index = 0; survivability->boot_status && index < MAX_SCRATCH_MMIO; index++) {
>  		if (info[index].reg)
>  			count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
>  					       info[index].reg, info[index].value);
> @@ -169,6 +172,10 @@ static int enable_survivability_mode(struct pci_dev *pdev)
>  	if (ret)
>  		return ret;
>  
> +	/* Only create sysfs for runtime survivability mode */
> +	if (xe_survivability_mode_is_runtime(xe))
> +		return 0;

I'm doubly confused here:
the comment says to only create sysfs when runtime, but then you return if runtime?
Why create it only in runtime mode? Or why skip here?


> +
>  	/* Make sure xe_heci_gsc_init() knows about survivability mode */
>  	survivability->mode = true;
>  
> @@ -189,6 +196,17 @@ static int enable_survivability_mode(struct pci_dev *pdev)
>  	return 0;
>  }
>  
> +/**
> + * xe_survivability_mode_is_runtime - check if survivability mode is runtime
> + * @xe: xe device instance
> + *
> + * Returns true if in runtime survivability mode, false otherwise
> + */
> +bool xe_survivability_mode_is_runtime(struct xe_device *xe)
> +{
> +	return xe->survivability.type == XE_SURVIVABILITY_TYPE_RUNTIME;
> +}
> +
>  /**
>   * xe_survivability_mode_is_enabled - check if survivability mode is enabled
>   * @xe: xe device instance
> @@ -251,16 +269,18 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
>   * Return: 0 if survivability mode is enabled or not requested; negative error
>   * code otherwise.
>   */
> -int xe_survivability_mode_enable(struct xe_device *xe)
> +int xe_survivability_mode_enable(struct xe_device *xe, const enum xe_survivability_type type)
>  {
>  	struct xe_survivability *survivability = &xe->survivability;
>  	struct xe_survivability_info *info;
>  	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>  
> -	if (!xe_survivability_mode_is_requested(xe))
> +	if (!xe_survivability_mode_is_requested(xe) &&
> +	    type != XE_SURVIVABILITY_TYPE_RUNTIME)

With this, the function name and its reasoning above are incorrect:
"xe_survivability_mode_enable - Initialize and enable the survivability mode"

No, this function is not doing that anymore. Rather, it is getting the log
from fw about the boot survivability, or at least initializing the
struct for that. It probably deserves a refactor with some better
naming of the purpose.

>  		return 0;
>  
>  	survivability->size = MAX_SCRATCH_MMIO;
> +	survivability->type = type;
>  
>  	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
>  			    GFP_KERNEL);
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
> index 02231c2bf008..559d1e99b03a 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
> @@ -9,9 +9,11 @@
>  #include <linux/types.h>
>  
>  struct xe_device;
> +enum xe_survivability_type;
>  
> -int xe_survivability_mode_enable(struct xe_device *xe);
> +int xe_survivability_mode_enable(struct xe_device *xe, const enum xe_survivability_type);
>  bool xe_survivability_mode_is_enabled(struct xe_device *xe);
> +bool xe_survivability_mode_is_runtime(struct xe_device *xe);
>  bool xe_survivability_mode_is_requested(struct xe_device *xe);
>  
>  #endif /* _XE_SURVIVABILITY_MODE_H_ */
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> index 19d433e253df..01f07d9c4124 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> @@ -9,6 +9,11 @@
>  #include <linux/limits.h>
>  #include <linux/types.h>
>  
> +enum xe_survivability_type {
> +	XE_SURVIVABILITY_TYPE_BOOT,
> +	XE_SURVIVABILITY_TYPE_RUNTIME,
> +};
> +
>  struct xe_survivability_info {
>  	char name[NAME_MAX];
>  	u32 reg;
> @@ -30,6 +35,9 @@ struct xe_survivability {
>  
>  	/** @mode: boolean to indicate survivability mode */
>  	bool mode;
> +
> +	/** @type: survivability mode type (boot or runtime) */
> +	enum xe_survivability_type type;
>  };
>  
>  #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-02 21:40   ` Rodrigo Vivi
@ 2025-07-03  5:16     ` Riana Tauro
  0 siblings, 0 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-03  5:16 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, umesh.nerlige.ramappa, frank.scarbrough, sk.anirban

Hi Rodrigo

On 7/3/2025 3:10 AM, Rodrigo Vivi wrote:
> On Wed, Jul 02, 2025 at 07:41:13PM +0530, Riana Tauro wrote:
>> Certain runtime firmware errors can cause the device to be wedged
>> requiring a firmware flash to restore normal operation.
>> Runtime Survivability Mode indicates that a firmware flash is necessary to
>> recover the device.
>>
>> The below sysfs is an indication that device is in survivability mode
>>
>> /sys/bus/pci/devices/<device>/survivability_mode
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device.c                |  2 +-
>>   drivers/gpu/drm/xe/xe_survivability_mode.c    | 26 ++++++++++++++++---
>>   drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 ++-
>>   .../gpu/drm/xe/xe_survivability_mode_types.h  |  8 ++++++
>>   4 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 4a38486dccc8..5defa54ccd26 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device *xe)
>>   		 * possible, but still return the previous error for error
>>   		 * propagation
>>   		 */
>> -		err = xe_survivability_mode_enable(xe);
>> +		err = xe_survivability_mode_enable(xe, XE_SURVIVABILITY_TYPE_BOOT);
>>   		if (err)
>>   			return err;
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> index 1f710b3fc599..e1adcb33c9b0 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -129,7 +129,10 @@ static ssize_t survivability_mode_show(struct device *dev,
>>   	struct xe_survivability_info *info = survivability->info;
>>   	int index = 0, count = 0;
>>   
>> -	for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
>> +	count += sysfs_emit_at(buff, count, "Survivability mode: %s\n",
>> +			       survivability->type ? "Runtime" : "Boot");
>> +
>> +	for (index = 0; survivability->boot_status && index < MAX_SCRATCH_MMIO; index++) {
>>   		if (info[index].reg)
>>   			count += sysfs_emit_at(buff, count, "%s: 0x%x - 0x%x\n", info[index].name,
>>   					       info[index].reg, info[index].value);
>> @@ -169,6 +172,10 @@ static int enable_survivability_mode(struct pci_dev *pdev)
>>   	if (ret)
>>   		return ret;
>>   
>> +	/* Only create sysfs for runtime survivability mode */
>> +	if (xe_survivability_mode_is_runtime(xe))
>> +		return 0;
> 
> I'm double confused here:
> only create when runtime, but then you return if runtime?
> why to only create on runtime mode? or why to skip here?

Maybe I need to reword the comment. Runtime survivability doesn't need
to initialize heci or vsec again. That is applicable only to boot
survivability. That's why the skip.

We need only the sysfs for runtime survivability.

Runtime survivability can also be entered on a pcode failure while resuming
from d3cold or on critical firmware errors. The firmware errors don't need
fw register logs but pcode failures do. Separating boot and runtime
would cause duplication, so I created a single function with an enum.
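
As a rough illustration of the split described above, the sketch below models
which steps each survivability type goes through. It is only a userspace
model under stated assumptions: the surv_* names and the two step flags are
hypothetical, and the real driver performs these steps in
xe_survivability_mode_enable() and enable_survivability_mode().

```c
#include <stdbool.h>

/* Hypothetical mirror of enum xe_survivability_type. */
enum surv_type {
	SURV_TYPE_BOOT,
	SURV_TYPE_RUNTIME,
};

/* Hypothetical record of which setup steps were carried out. */
struct surv_steps {
	bool sysfs_created;
	bool heci_vsec_setup;
};

/* Boot mode must be requested by firmware; runtime mode always proceeds. */
static bool surv_should_enable(bool boot_requested, enum surv_type type)
{
	return boot_requested || type == SURV_TYPE_RUNTIME;
}

static struct surv_steps surv_enable(enum surv_type type)
{
	struct surv_steps steps = { .sysfs_created = true };

	/* Runtime mode stops here: heci and vsec are already initialized. */
	if (type == SURV_TYPE_RUNTIME)
		return steps;

	steps.heci_vsec_setup = true;
	return steps;
}
```

Keeping one entry point with a type argument avoids duplicating the sysfs
setup for the two modes, at the cost of an early return in the shared path.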

> 
> 
>> +
>>   	/* Make sure xe_heci_gsc_init() knows about survivability mode */
>>   	survivability->mode = true;
>>   
>> @@ -189,6 +196,17 @@ static int enable_survivability_mode(struct pci_dev *pdev)
>>   	return 0;
>>   }
>>   
>> +/**
>> + * xe_survivability_mode_is_runtime - check if survivability mode is runtime
>> + * @xe: xe device instance
>> + *
>> + * Returns true if in runtime survivability mode, false otherwise
>> + */
>> +bool xe_survivability_mode_is_runtime(struct xe_device *xe)
>> +{
>> +	return xe->survivability.type == XE_SURVIVABILITY_TYPE_RUNTIME;
>> +}
>> +
>>   /**
>>    * xe_survivability_mode_is_enabled - check if survivability mode is enabled
>>    * @xe: xe device instance
>> @@ -251,16 +269,18 @@ bool xe_survivability_mode_is_requested(struct xe_device *xe)
>>    * Return: 0 if survivability mode is enabled or not requested; negative error
>>    * code otherwise.
>>    */
>> -int xe_survivability_mode_enable(struct xe_device *xe)
>> +int xe_survivability_mode_enable(struct xe_device *xe, const enum xe_survivability_type type)
>>   {
>>   	struct xe_survivability *survivability = &xe->survivability;
>>   	struct xe_survivability_info *info;
>>   	struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>   
>> -	if (!xe_survivability_mode_is_requested(xe))
>> +	if (!xe_survivability_mode_is_requested(xe) &&
>> +	    type != XE_SURVIVABILITY_TYPE_RUNTIME)
> 
> with this, the function name and its reasoning above is incorrect.
> "xe_survivability_mode_enable - Initialize and enable the survivability mode"
> 
> no, this function is not doing that anymore. Rather it is getting log
> from fw about the boot survivability, or at least initializing the
> struct for that. It probably deserves a refactor with some better
> naming on the purpose.
> 

This was named init in the initial series and renamed to enable in
later fixes.

Having two functions, init and enable, seems unnecessary since we are
doing this only in error scenarios. It would be something like the below
if we have two functions:

int ret = xe_survivability_init(xe); /* all sysfs init and logs */
if (!ret) {
	err = xe_survivability_enable(xe, BOOT);
	if (err)
		return err;
}

Should I refactor internally instead?

Thanks
Riana

>>   		return 0;
>>   
>>   	survivability->size = MAX_SCRATCH_MMIO;
>> +	survivability->type = type;
>>   
>>   	info = devm_kcalloc(xe->drm.dev, survivability->size, sizeof(*info),
>>   			    GFP_KERNEL);
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> index 02231c2bf008..559d1e99b03a 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> @@ -9,9 +9,11 @@
>>   #include <linux/types.h>
>>   
>>   struct xe_device;
>> +enum xe_survivability_type;
>>   
>> -int xe_survivability_mode_enable(struct xe_device *xe);
>> +int xe_survivability_mode_enable(struct xe_device *xe, const enum xe_survivability_type);
>>   bool xe_survivability_mode_is_enabled(struct xe_device *xe);
>> +bool xe_survivability_mode_is_runtime(struct xe_device *xe);
>>   bool xe_survivability_mode_is_requested(struct xe_device *xe);
>>   
>>   #endif /* _XE_SURVIVABILITY_MODE_H_ */
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> index 19d433e253df..01f07d9c4124 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> @@ -9,6 +9,11 @@
>>   #include <linux/limits.h>
>>   #include <linux/types.h>
>>   
>> +enum xe_survivability_type {
>> +	XE_SURVIVABILITY_TYPE_BOOT,
>> +	XE_SURVIVABILITY_TYPE_RUNTIME,
>> +};
>> +
>>   struct xe_survivability_info {
>>   	char name[NAME_MAX];
>>   	u32 reg;
>> @@ -30,6 +35,9 @@ struct xe_survivability {
>>   
>>   	/** @mode: boolean to indicate survivability mode */
>>   	bool mode;
>> +
>> +	/** @type: survivability mode type (boot or runtime) */
>> +	enum xe_survivability_type type;
>>   };
>>   
>>   #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
>> -- 
>> 2.47.1
>>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-02 14:11 ` [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode Riana Tauro
  2025-07-02 21:40   ` Rodrigo Vivi
@ 2025-07-02 23:33   ` kernel test robot
  2025-07-09 18:04   ` Summers, Stuart
  2 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2025-07-02 23:33 UTC (permalink / raw)
  To: Riana Tauro, intel-xe
  Cc: oe-kbuild-all, riana.tauro, anshuman.gupta, rodrigo.vivi,
	lucas.demarchi, aravind.iddamsetty, raag.jadav,
	umesh.nerlige.ramappa, frank.scarbrough, sk.anirban

Hi Riana,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-xe/drm-xe-next]
[also build test WARNING on linus/master v6.16-rc4 next-20250702]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Riana-Tauro/drm-Add-a-vendor-specific-recovery-method-to-device-wedged-uevent/20250703-014925
base:   https://gitlab.freedesktop.org/drm/xe/kernel.git drm-xe-next
patch link:    https://lore.kernel.org/r/20250702141118.3564242-4-riana.tauro%40intel.com
patch subject: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
config: arc-randconfig-002-20250703 (https://download.01.org/0day-ci/archive/20250703/202507030724.ANdFmYRE-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 12.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250703/202507030724.ANdFmYRE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507030724.ANdFmYRE-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> Warning: drivers/gpu/drm/xe/xe_survivability_mode.c:272 function parameter 'type' not described in 'xe_survivability_mode_enable'
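
A kernel-doc warning of this kind is normally resolved by adding a line for
the new parameter to the function's comment block. The fragment below is only
a sketch of what that could look like: the summary and Return lines are taken
from the text quoted in this thread, while the @xe and @type wording is an
assumption borrowed from the neighbouring comments in the patch.

```c
/* Same enumerators and order as in the patch. */
enum xe_survivability_type {
	XE_SURVIVABILITY_TYPE_BOOT,
	XE_SURVIVABILITY_TYPE_RUNTIME,
};

struct xe_device;

/**
 * xe_survivability_mode_enable - Initialize and enable the survivability mode
 * @xe: xe device instance
 * @type: survivability mode type (boot or runtime)
 *
 * Return: 0 if survivability mode is enabled or not requested; negative error
 * code otherwise.
 */
int xe_survivability_mode_enable(struct xe_device *xe,
				 const enum xe_survivability_type type);
```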

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-02 14:11 ` [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode Riana Tauro
  2025-07-02 21:40   ` Rodrigo Vivi
  2025-07-02 23:33   ` kernel test robot
@ 2025-07-09 18:04   ` Summers, Stuart
  2025-07-10  5:27     ` Riana Tauro
  2 siblings, 1 reply; 36+ messages in thread
From: Summers, Stuart @ 2025-07-09 18:04 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Tauro,  Riana
  Cc: Jadav, Raag, Anirban, Sk, Vivi, Rodrigo, Scarbrough, Frank,
	aravind.iddamsetty@linux.intel.com, Gupta, Anshuman,
	Nerlige Ramappa, Umesh, De Marchi, Lucas

On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
> Certain runtime firmware errors can cause the device to be wedged
> requiring a firmware flash to restore normal operation.
> Runtime Survivability Mode indicates that a firmware flash is
> necessary to
> recover the device.

I'm not understanding why we need to overload survivability mode here
in the case of a CSC (or other hardware error) failure. I see there is
some vsec initialization that happens there and GSC initialization
(need to look further, but presumably this puts GSC in a survivability
state also?). But we already have the vendor-specific wedge. Do we
really need the extra hook to survivability mode, which was really built
as a boot time config?

Thanks,
Stuart

> 
> The below sysfs is an indication that device is in survivability mode
> 
> /sys/bus/pci/devices/<device>/survivability_mode
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c                |  2 +-
>  drivers/gpu/drm/xe/xe_survivability_mode.c    | 26 ++++++++++++++++-
> --
>  drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 ++-
>  .../gpu/drm/xe/xe_survivability_mode_types.h  |  8 ++++++
>  4 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c
> b/drivers/gpu/drm/xe/xe_device.c
> index 4a38486dccc8..5defa54ccd26 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device *xe)
>                  * possible, but still return the previous error for
> error
>                  * propagation
>                  */
> -               err = xe_survivability_mode_enable(xe);
> +               err = xe_survivability_mode_enable(xe,
> XE_SURVIVABILITY_TYPE_BOOT);
>                 if (err)
>                         return err;
>  
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c
> b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index 1f710b3fc599..e1adcb33c9b0 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -129,7 +129,10 @@ static ssize_t survivability_mode_show(struct
> device *dev,
>         struct xe_survivability_info *info = survivability->info;
>         int index = 0, count = 0;
>  
> -       for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
> +       count += sysfs_emit_at(buff, count, "Survivability mode:
> %s\n",
> +                              survivability->type ? "Runtime" :
> "Boot");
> +
> +       for (index = 0; survivability->boot_status && index <
> MAX_SCRATCH_MMIO; index++) {
>                 if (info[index].reg)
>                         count += sysfs_emit_at(buff, count, "%s: 0x%x
> - 0x%x\n", info[index].name,
>                                                info[index].reg,
> info[index].value);
> @@ -169,6 +172,10 @@ static int enable_survivability_mode(struct
> pci_dev *pdev)
>         if (ret)
>                 return ret;
>  
> +       /* Only create sysfs for runtime survivability mode */
> +       if (xe_survivability_mode_is_runtime(xe))
> +               return 0;
> +
>         /* Make sure xe_heci_gsc_init() knows about survivability
> mode */
>         survivability->mode = true;
>  
> @@ -189,6 +196,17 @@ static int enable_survivability_mode(struct
> pci_dev *pdev)
>         return 0;
>  }
>  
> +/**
> + * xe_survivability_mode_is_runtime - check if survivability mode is
> runtime
> + * @xe: xe device instance
> + *
> + * Returns true if in runtime survivability mode, false otherwise
> + */
> +bool xe_survivability_mode_is_runtime(struct xe_device *xe)
> +{
> +       return xe->survivability.type ==
> XE_SURVIVABILITY_TYPE_RUNTIME;
> +}
> +
>  /**
>   * xe_survivability_mode_is_enabled - check if survivability mode is
> enabled
>   * @xe: xe device instance
> @@ -251,16 +269,18 @@ bool xe_survivability_mode_is_requested(struct
> xe_device *xe)
>   * Return: 0 if survivability mode is enabled or not requested;
> negative error
>   * code otherwise.
>   */
> -int xe_survivability_mode_enable(struct xe_device *xe)
> +int xe_survivability_mode_enable(struct xe_device *xe, const enum
> xe_survivability_type type)
>  {
>         struct xe_survivability *survivability = &xe->survivability;
>         struct xe_survivability_info *info;
>         struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>  
> -       if (!xe_survivability_mode_is_requested(xe))
> +       if (!xe_survivability_mode_is_requested(xe) &&
> +           type != XE_SURVIVABILITY_TYPE_RUNTIME)
>                 return 0;
>  
>         survivability->size = MAX_SCRATCH_MMIO;
> +       survivability->type = type;
>  
>         info = devm_kcalloc(xe->drm.dev, survivability->size,
> sizeof(*info),
>                             GFP_KERNEL);
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h
> b/drivers/gpu/drm/xe/xe_survivability_mode.h
> index 02231c2bf008..559d1e99b03a 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
> @@ -9,9 +9,11 @@
>  #include <linux/types.h>
>  
>  struct xe_device;
> +enum xe_survivability_type;
>  
> -int xe_survivability_mode_enable(struct xe_device *xe);
> +int xe_survivability_mode_enable(struct xe_device *xe, const enum
> xe_survivability_type);
>  bool xe_survivability_mode_is_enabled(struct xe_device *xe);
> +bool xe_survivability_mode_is_runtime(struct xe_device *xe);
>  bool xe_survivability_mode_is_requested(struct xe_device *xe);
>  
>  #endif /* _XE_SURVIVABILITY_MODE_H_ */
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> index 19d433e253df..01f07d9c4124 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> @@ -9,6 +9,11 @@
>  #include <linux/limits.h>
>  #include <linux/types.h>
>  
> +enum xe_survivability_type {
> +       XE_SURVIVABILITY_TYPE_BOOT,
> +       XE_SURVIVABILITY_TYPE_RUNTIME,
> +};
> +
>  struct xe_survivability_info {
>         char name[NAME_MAX];
>         u32 reg;
> @@ -30,6 +35,9 @@ struct xe_survivability {
>  
>         /** @mode: boolean to indicate survivability mode */
>         bool mode;
> +
> +       /** @type: survivability mode type (boot or runtime) */
> +       enum xe_survivability_type type;
>  };
>  
>  #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */



* Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-09 18:04   ` Summers, Stuart
@ 2025-07-10  5:27     ` Riana Tauro
  2025-07-15 17:30       ` Summers, Stuart
  0 siblings, 1 reply; 36+ messages in thread
From: Riana Tauro @ 2025-07-10  5:27 UTC (permalink / raw)
  To: Summers, Stuart, intel-xe@lists.freedesktop.org
  Cc: Jadav, Raag, Anirban, Sk, Vivi, Rodrigo, Scarbrough, Frank,
	aravind.iddamsetty@linux.intel.com, Gupta, Anshuman,
	Nerlige Ramappa, Umesh, De Marchi, Lucas

Hi Stuart

On 7/9/2025 11:34 PM, Summers, Stuart wrote:
> On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
>> Certain runtime firmware errors can cause the device to be wedged
>> requiring a firmware flash to restore normal operation.
>> Runtime Survivability Mode indicates that a firmware flash is
>> necessary to
>> recover the device.
> 
> I'm not understanding why we need to overload survivability mode here
> in the case of a CSC (or other hardware error) failure. I see there is
> some vesc initialization that happens there and GSC initialization
> (need to look further, but presumably this puts GSC in a survivability
> state also?). But we already have the vendor specific wedge. Do we
> really need the extra hook to survivability mode which was really built
> as a boot time config.

vendor-specific without a reason is vague and could be reused for a
different action in the future. There needs to be an indication that this
wedged uevent means a firmware flash is needed, hence the survivability
mode sysfs.

This patch will be extended further to handle d3cold resume pcode
failures, which will send a similar wedged event and use survivability
mode.

Thanks
Riana
>
> Thanks,
> Stuart
> 
>>
>> The below sysfs is an indication that device is in survivability mode
>>
>> /sys/bus/pci/devices/<device>/surivability_mode
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_device.c                |  2 +-
>>   drivers/gpu/drm/xe/xe_survivability_mode.c    | 26 ++++++++++++++++-
>> --
>>   drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 ++-
>>   .../gpu/drm/xe/xe_survivability_mode_types.h  |  8 ++++++
>>   4 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>> b/drivers/gpu/drm/xe/xe_device.c
>> index 4a38486dccc8..5defa54ccd26 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device *xe)
>>                   * possible, but still return the previous error for
>> error
>>                   * propagation
>>                   */
>> -               err = xe_survivability_mode_enable(xe);
>> +               err = xe_survivability_mode_enable(xe,
>> XE_SURVIVABILITY_TYPE_BOOT);
>>                  if (err)
>>                          return err;
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> index 1f710b3fc599..e1adcb33c9b0 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
>> @@ -129,7 +129,10 @@ static ssize_t survivability_mode_show(struct
>> device *dev,
>>          struct xe_survivability_info *info = survivability->info;
>>          int index = 0, count = 0;
>>   
>> -       for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
>> +       count += sysfs_emit_at(buff, count, "Survivability mode:
>> %s\n",
>> +                              survivability->type ? "Runtime" :
>> "Boot");
>> +
>> +       for (index = 0; survivability->boot_status && index <
>> MAX_SCRATCH_MMIO; index++) {
>>                  if (info[index].reg)
>>                          count += sysfs_emit_at(buff, count, "%s: 0x%x
>> - 0x%x\n", info[index].name,
>>                                                 info[index].reg,
>> info[index].value);
>> @@ -169,6 +172,10 @@ static int enable_survivability_mode(struct
>> pci_dev *pdev)
>>          if (ret)
>>                  return ret;
>>   
>> +       /* Only create sysfs for runtime survivability mode */
>> +       if (xe_survivability_mode_is_runtime(xe))
>> +               return 0;
>> +
>>          /* Make sure xe_heci_gsc_init() knows about survivability
>> mode */
>>          survivability->mode = true;
>>   
>> @@ -189,6 +196,17 @@ static int enable_survivability_mode(struct
>> pci_dev *pdev)
>>          return 0;
>>   }
>>   
>> +/**
>> + * xe_survivability_mode_is_runtime - check if survivability mode is
>> runtime
>> + * @xe: xe device instance
>> + *
>> + * Returns true if in runtime survivability mode, false otherwise
>> + */
>> +bool xe_survivability_mode_is_runtime(struct xe_device *xe)
>> +{
>> +       return xe->survivability.type ==
>> XE_SURVIVABILITY_TYPE_RUNTIME;
>> +}
>> +
>>   /**
>>    * xe_survivability_mode_is_enabled - check if survivability mode is
>> enabled
>>    * @xe: xe device instance
>> @@ -251,16 +269,18 @@ bool xe_survivability_mode_is_requested(struct
>> xe_device *xe)
>>    * Return: 0 if survivability mode is enabled or not requested;
>> negative error
>>    * code otherwise.
>>    */
>> -int xe_survivability_mode_enable(struct xe_device *xe)
>> +int xe_survivability_mode_enable(struct xe_device *xe, const enum
>> xe_survivability_type type)
>>   {
>>          struct xe_survivability *survivability = &xe->survivability;
>>          struct xe_survivability_info *info;
>>          struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
>>   
>> -       if (!xe_survivability_mode_is_requested(xe))
>> +       if (!xe_survivability_mode_is_requested(xe) &&
>> +           type != XE_SURVIVABILITY_TYPE_RUNTIME)
>>                  return 0;
>>   
>>          survivability->size = MAX_SCRATCH_MMIO;
>> +       survivability->type = type;
>>   
>>          info = devm_kcalloc(xe->drm.dev, survivability->size,
>> sizeof(*info),
>>                              GFP_KERNEL);
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h
>> b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> index 02231c2bf008..559d1e99b03a 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
>> @@ -9,9 +9,11 @@
>>   #include <linux/types.h>
>>   
>>   struct xe_device;
>> +enum xe_survivability_type;
>>   
>> -int xe_survivability_mode_enable(struct xe_device *xe);
>> +int xe_survivability_mode_enable(struct xe_device *xe, const enum
>> xe_survivability_type);
>>   bool xe_survivability_mode_is_enabled(struct xe_device *xe);
>> +bool xe_survivability_mode_is_runtime(struct xe_device *xe);
>>   bool xe_survivability_mode_is_requested(struct xe_device *xe);
>>   
>>   #endif /* _XE_SURVIVABILITY_MODE_H_ */
>> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> index 19d433e253df..01f07d9c4124 100644
>> --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
>> @@ -9,6 +9,11 @@
>>   #include <linux/limits.h>
>>   #include <linux/types.h>
>>   
>> +enum xe_survivability_type {
>> +       XE_SURVIVABILITY_TYPE_BOOT,
>> +       XE_SURVIVABILITY_TYPE_RUNTIME,
>> +};
>> +
>>   struct xe_survivability_info {
>>          char name[NAME_MAX];
>>          u32 reg;
>> @@ -30,6 +35,9 @@ struct xe_survivability {
>>   
>>          /** @mode: boolean to indicate survivability mode */
>>          bool mode;
>> +
>> +       /** @type: survivability mode type (boot or runtime) */
>> +       enum xe_survivability_type type;
>>   };
>>   
>>   #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
> 





* Re: [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode
  2025-07-10  5:27     ` Riana Tauro
@ 2025-07-15 17:30       ` Summers, Stuart
  0 siblings, 0 replies; 36+ messages in thread
From: Summers, Stuart @ 2025-07-15 17:30 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Tauro,  Riana
  Cc: Anirban, Sk, Jadav, Raag, Vivi, Rodrigo, Scarbrough, Frank,
	aravind.iddamsetty@linux.intel.com, Gupta, Anshuman,
	De Marchi, Lucas, Nerlige Ramappa, Umesh

On Thu, 2025-07-10 at 10:57 +0530, Riana Tauro wrote:
> Hi Stuart
> 
> On 7/9/2025 11:34 PM, Summers, Stuart wrote:
> > On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
> > > Certain runtime firmware errors can cause the device to be wedged
> > > requiring a firmware flash to restore normal operation.
> > > Runtime Survivability Mode indicates that a firmware flash is
> > > necessary to
> > > recover the device.
> > 
> > I'm not understanding why we need to overload survivability mode
> > here
> > in the case of a CSC (or other hardware error) failure. I see there
> > is
> > some vesc initialization that happens there and GSC initialization
> > (need to look further, but presumably this puts GSC in a
> > survivability
> > state also?). But we already have the vendor specific wedge. Do we
> > really need the extra hook to survivability mode which was really
> > built
> > as a boot time config.
> 
> vendor-specific without a reason is vague and could be reused for a 
> different action in the future. There needs to be a indication that
> this 
> wedged uevent indicates firmware flash. So the survivability mode
> sysfs
> 
> This patch will further be extended to handle d3cold resume pcode 
> failures which will send a similar wedged event and survivability
> mode

I know you have the new series up, but just coming back to confirm...
ack from me here. And just for my understanding, basically the idea is
we send the uevent, then set the "state" of the driver/hardware by
triggering survivability runtime mode. The user can then use this sysfs
state to determine that a recovery of some kind is needed - in this
case firmware flash and... re-enumeration? soft reset? or just a driver
reload?

Thanks,
Stuart
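
For illustration only, the flow summarized above (uevent first, then the
sysfs state to tell that a firmware flash is needed) could be sketched on
the userspace side roughly as below. This is not part of the series: the
enum and pick_recovery() are hypothetical names, and the only values taken
from this thread are the WEDGED=vendor-specific property and the "Runtime"
survivability type.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical recovery decisions for a userspace agent (not from the series). */
enum recovery_action {
	RECOVERY_NONE,           /* event carries no WEDGED= property */
	RECOVERY_GENERIC,        /* follow the generic method named in WEDGED= */
	RECOVERY_FIRMWARE_FLASH, /* vendor-specific wedge + runtime survivability */
	RECOVERY_UNKNOWN,        /* vendor-specific wedge with no known reason */
};

/*
 * wedged: value of the WEDGED= uevent property, or NULL if absent.
 * runtime_survivability: true if the survivability mode sysfs reports "Runtime".
 */
static enum recovery_action pick_recovery(const char *wedged,
					  bool runtime_survivability)
{
	if (!wedged)
		return RECOVERY_NONE;

	/* vendor-specific alone is ambiguous; the sysfs state supplies the reason */
	if (strcmp(wedged, "vendor-specific") == 0)
		return runtime_survivability ? RECOVERY_FIRMWARE_FLASH
					     : RECOVERY_UNKNOWN;

	return RECOVERY_GENERIC;
}
```

The point of the two inputs is the one made earlier in the thread:
vendor-specific by itself does not say what to do, so the decision is only
"firmware flash" when the survivability state confirms it.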

> 
> Thanks
> Riana>
> > Thanks,
> > Stuart
> > 
> > > 
> > > The below sysfs is an indication that device is in survivability
> > > mode
> > > 
> > > /sys/bus/pci/devices/<device>/surivability_mode
> > > 
> > > Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_device.c                |  2 +-
> > >   drivers/gpu/drm/xe/xe_survivability_mode.c    | 26
> > > ++++++++++++++++-
> > > --
> > >   drivers/gpu/drm/xe/xe_survivability_mode.h    |  4 ++-
> > >   .../gpu/drm/xe/xe_survivability_mode_types.h  |  8 ++++++
> > >   4 files changed, 35 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > b/drivers/gpu/drm/xe/xe_device.c
> > > index 4a38486dccc8..5defa54ccd26 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -716,7 +716,7 @@ int xe_device_probe_early(struct xe_device
> > > *xe)
> > >                   * possible, but still return the previous error
> > > for
> > > error
> > >                   * propagation
> > >                   */
> > > -               err = xe_survivability_mode_enable(xe);
> > > +               err = xe_survivability_mode_enable(xe,
> > > XE_SURVIVABILITY_TYPE_BOOT);
> > >                  if (err)
> > >                          return err;
> > >   
> > > diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > b/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > index 1f710b3fc599..e1adcb33c9b0 100644
> > > --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> > > @@ -129,7 +129,10 @@ static ssize_t
> > > survivability_mode_show(struct
> > > device *dev,
> > >          struct xe_survivability_info *info = survivability-
> > > >info;
> > >          int index = 0, count = 0;
> > >   
> > > -       for (index = 0; index < MAX_SCRATCH_MMIO; index++) {
> > > +       count += sysfs_emit_at(buff, count, "Survivability mode:
> > > %s\n",
> > > +                              survivability->type ? "Runtime" :
> > > "Boot");
> > > +
> > > +       for (index = 0; survivability->boot_status && index <
> > > MAX_SCRATCH_MMIO; index++) {
> > >                  if (info[index].reg)
> > >                          count += sysfs_emit_at(buff, count, "%s:
> > > 0x%x
> > > - 0x%x\n", info[index].name,
> > >                                                 info[index].reg,
> > > info[index].value);
> > > @@ -169,6 +172,10 @@ static int enable_survivability_mode(struct
> > > pci_dev *pdev)
> > >          if (ret)
> > >                  return ret;
> > >   
> > > +       /* Only create sysfs for runtime survivability mode */
> > > +       if (xe_survivability_mode_is_runtime(xe))
> > > +               return 0;
> > > +
> > >          /* Make sure xe_heci_gsc_init() knows about
> > > survivability
> > > mode */
> > >          survivability->mode = true;
> > >   
> > > @@ -189,6 +196,17 @@ static int enable_survivability_mode(struct
> > > pci_dev *pdev)
> > >          return 0;
> > >   }
> > >   
> > > +/**
> > > + * xe_survivability_mode_is_runtime - check if survivability
> > > mode is
> > > runtime
> > > + * @xe: xe device instance
> > > + *
> > > + * Returns true if in runtime survivability mode, false
> > > otherwise
> > > + */
> > > +bool xe_survivability_mode_is_runtime(struct xe_device *xe)
> > > +{
> > > +       return xe->survivability.type ==
> > > XE_SURVIVABILITY_TYPE_RUNTIME;
> > > +}
> > > +
> > >   /**
> > >    * xe_survivability_mode_is_enabled - check if survivability
> > > mode is
> > > enabled
> > >    * @xe: xe device instance
> > > @@ -251,16 +269,18 @@ bool
> > > xe_survivability_mode_is_requested(struct
> > > xe_device *xe)
> > >    * Return: 0 if survivability mode is enabled or not requested;
> > > negative error
> > >    * code otherwise.
> > >    */
> > > -int xe_survivability_mode_enable(struct xe_device *xe)
> > > +int xe_survivability_mode_enable(struct xe_device *xe, const
> > > enum
> > > xe_survivability_type type)
> > >   {
> > >          struct xe_survivability *survivability = &xe-
> > > >survivability;
> > >          struct xe_survivability_info *info;
> > >          struct pci_dev *pdev = to_pci_dev(xe->drm.dev);
> > >   
> > > -       if (!xe_survivability_mode_is_requested(xe))
> > > +       if (!xe_survivability_mode_is_requested(xe) &&
> > > +           type != XE_SURVIVABILITY_TYPE_RUNTIME)
> > >                  return 0;
> > >   
> > >          survivability->size = MAX_SCRATCH_MMIO;
> > > +       survivability->type = type;
> > >   
> > >          info = devm_kcalloc(xe->drm.dev, survivability->size,
> > > sizeof(*info),
> > >                              GFP_KERNEL);
> > > diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.h
> > > b/drivers/gpu/drm/xe/xe_survivability_mode.h
> > > index 02231c2bf008..559d1e99b03a 100644
> > > --- a/drivers/gpu/drm/xe/xe_survivability_mode.h
> > > +++ b/drivers/gpu/drm/xe/xe_survivability_mode.h
> > > @@ -9,9 +9,11 @@
> > >   #include <linux/types.h>
> > >   
> > >   struct xe_device;
> > > +enum xe_survivability_type;
> > >   
> > > -int xe_survivability_mode_enable(struct xe_device *xe);
> > > +int xe_survivability_mode_enable(struct xe_device *xe, const
> > > enum
> > > xe_survivability_type);
> > >   bool xe_survivability_mode_is_enabled(struct xe_device *xe);
> > > +bool xe_survivability_mode_is_runtime(struct xe_device *xe);
> > >   bool xe_survivability_mode_is_requested(struct xe_device *xe);
> > >   
> > >   #endif /* _XE_SURVIVABILITY_MODE_H_ */
> > > diff --git a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> > > b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> > > index 19d433e253df..01f07d9c4124 100644
> > > --- a/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_survivability_mode_types.h
> > > @@ -9,6 +9,11 @@
> > >   #include <linux/limits.h>
> > >   #include <linux/types.h>
> > >   
> > > +enum xe_survivability_type {
> > > +       XE_SURVIVABILITY_TYPE_BOOT,
> > > +       XE_SURVIVABILITY_TYPE_RUNTIME,
> > > +};
> > > +
> > >   struct xe_survivability_info {
> > >          char name[NAME_MAX];
> > >          u32 reg;
> > > @@ -30,6 +35,9 @@ struct xe_survivability {
> > >   
> > >          /** @mode: boolean to indicate survivability mode */
> > >          bool mode;
> > > +
> > > +       /** @type: survivability mode type (boot or runtime) */
> > > +       enum xe_survivability_type type;
> > >   };
> > >   
> > >   #endif /* _XE_SURVIVABILITY_MODE_TYPES_H_ */
> > 
> 
> 
> 



* [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (2 preceding siblings ...)
  2025-07-02 14:11 ` [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode Riana Tauro
@ 2025-07-02 14:11 ` Riana Tauro
  2025-07-02 13:55   ` Riana Tauro
  2025-07-03  7:19   ` Raag Jadav
  2025-07-02 14:11 ` [PATCH v3 5/7] drm/xe: Add support to handle hardware errors Riana Tauro
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 14:11 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add documentation for vendor specific device wedged recovery method
and runtime survivability.

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 Documentation/gpu/xe/index.rst             |  1 +
 Documentation/gpu/xe/xe_device.rst         | 10 +++++++
 Documentation/gpu/xe/xe_pcode.rst          |  6 +++--
 drivers/gpu/drm/xe/xe_device.c             | 16 +++++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c | 31 +++++++++++++++++-----
 5 files changed, 56 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe_device.rst

diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index 42ba6c263cd0..88b22fad880e 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -25,5 +25,6 @@ DG2, etc is provided to prototype the driver.
    xe_tile
    xe_debugging
    xe_devcoredump
+   xe_device
    xe-drm-usage-stats.rst
    xe_configfs
diff --git a/Documentation/gpu/xe/xe_device.rst b/Documentation/gpu/xe/xe_device.rst
new file mode 100644
index 000000000000..f9b962169919
--- /dev/null
+++ b/Documentation/gpu/xe/xe_device.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+
+.. _xe-device-wedging:
+
+==================
+Xe Device Wedging
+==================
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_device.c
+   :doc: Device Wedging
diff --git a/Documentation/gpu/xe/xe_pcode.rst b/Documentation/gpu/xe/xe_pcode.rst
index 5937ef3599b0..2a43601123cb 100644
--- a/Documentation/gpu/xe/xe_pcode.rst
+++ b/Documentation/gpu/xe/xe_pcode.rst
@@ -13,9 +13,11 @@ Internal API
 .. kernel-doc:: drivers/gpu/drm/xe/xe_pcode.c
    :internal:
 
+.. _xe-survivability-mode:
+
 ==================
-Boot Survivability
+Survivability Mode
 ==================
 
 .. kernel-doc:: drivers/gpu/drm/xe/xe_survivability_mode.c
-   :doc: Xe Boot Survivability
+   :doc: Survivability Mode
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 5defa54ccd26..d6b680abc3ae 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1119,6 +1119,22 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
 	xe_pm_runtime_put(xe);
 }
 
+/**
+ * DOC: Device Wedging
+ *
+ * The Xe driver uses the device wedged uevent as documented in Documentation/gpu/drm-uapi.rst.
+ *
+ * When the device is in a wedged state, every IOCTL is blocked and the GT cannot be
+ * used. Certain critical errors, such as GT reset failures and firmware failures, can
+ * cause the device to be wedged. The default recovery mechanism for a wedged state
+ * is re-probe (unbind + bind).
+ *
+ * However, CSC firmware errors require a firmware flash to restore normal device
+ * operation. Since firmware flash is a vendor-specific action, the ``WEDGED=vendor-specific``
+ * recovery method along with :ref:`runtime survivability mode <xe-survivability-mode>`
+ * is used to notify userspace.
+ */
+
 /**
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
index e1adcb33c9b0..0dc8fd77a9f4 100644
--- a/drivers/gpu/drm/xe/xe_survivability_mode.c
+++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
@@ -21,15 +21,18 @@
 #define MAX_SCRATCH_MMIO 8
 
 /**
- * DOC: Xe Boot Survivability
+ * DOC: Survivability Mode
  *
- * Boot Survivability is a software based workflow for recovering a system in a failed boot state
+ * Survivability Mode is a software based workflow for recovering a system in a failed boot state
  * Here system recoverability is concerned with recovering the firmware responsible for boot.
  *
- * This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware
- * to be flashed through mei and collect telemetry. The driver's probe flow is modified
- * such that it enters survivability mode when pcode initialization is incomplete and boot status
- * denotes a failure.
+ * Boot Survivability
+ * ===================
+ *
+ * Boot Survivability is implemented by loading the driver with bare minimum (no drm card) to allow
+ * the firmware to be flashed through mei and collect telemetry. The driver's probe flow is
+ * modified such that it enters survivability mode when pcode initialization is incomplete and boot
+ * status denotes a failure.
  *
  * Survivability mode can also be entered manually using the survivability mode attribute available
  * through configfs which is beneficial in several usecases. It can be used to address scenarios
@@ -55,6 +58,22 @@
  *	Provides history of previous failures
  * Auxiliary Information
  *	Certain failures may have information in addition to postcode information
+ *
+ * Runtime Survivability
+ * =====================
+ *
+ * Certain runtime firmware errors can cause the device to enter a non-recoverable state
+ * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
+ * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
+ * is indicated by the presence of survivability mode sysfs::
+ *
+ *	/sys/bus/pci/devices/<device>/surivability_mode
+ *
+ * The survivability mode sysfs provides information about the type of survivability mode.
+ *
+ * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
+ * survivability mode. The user can then initiate a firmware flash to restore the device to
+ * normal operation.
  */
 
 static u32 aux_history_offset(u32 reg_value)
-- 
2.47.1
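
As a rough illustration of how userspace might consume the sysfs described
in the documentation above, the sketch below parses the first line emitted
by survivability_mode_show() in patch 3/7 of this revision
("Survivability mode: Runtime" or "Survivability mode: Boot"). The function
and enum names are hypothetical, and the line format is an assumption taken
from this revision only; it may change in later revisions.

```c
#include <string.h>

/* Illustrative userspace view of the type; SURV_UNKNOWN is not in the series. */
enum surv_type { SURV_BOOT, SURV_RUNTIME, SURV_UNKNOWN };

/* buf: NUL-terminated contents read from the survivability mode sysfs file. */
static enum surv_type parse_survivability_type(const char *buf)
{
	static const char prefix[] = "Survivability mode: ";
	const char *val;

	if (!buf || strncmp(buf, prefix, sizeof(prefix) - 1) != 0)
		return SURV_UNKNOWN;

	/* Skip the fixed prefix and compare the value up to the newline */
	val = buf + sizeof(prefix) - 1;
	if (strncmp(val, "Runtime\n", 8) == 0)
		return SURV_RUNTIME;
	if (strncmp(val, "Boot\n", 5) == 0)
		return SURV_BOOT;

	return SURV_UNKNOWN;
}
```

Only the first line is inspected, so any register dump lines that follow it
in boot survivability mode are ignored.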



* Re: [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability
  2025-07-02 14:11 ` [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
@ 2025-07-02 13:55   ` Riana Tauro
  2025-07-03  7:19   ` Raag Jadav
  1 sibling, 0 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 13:55 UTC (permalink / raw)
  To: intel-xe
  Cc: anshuman.gupta, rodrigo.vivi, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, umesh.nerlige.ramappa, frank.scarbrough, sk.anirban



On 7/2/2025 7:41 PM, Riana Tauro wrote:
> Add documentation for vendor specific device wedged recovery method
> and runtime survivability.
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>   Documentation/gpu/xe/index.rst             |  1 +
>   Documentation/gpu/xe/xe_device.rst         | 10 +++++++
>   Documentation/gpu/xe/xe_pcode.rst          |  6 +++--
>   drivers/gpu/drm/xe/xe_device.c             | 16 +++++++++++
>   drivers/gpu/drm/xe/xe_survivability_mode.c | 31 +++++++++++++++++-----
>   5 files changed, 56 insertions(+), 8 deletions(-)
>   create mode 100644 Documentation/gpu/xe/xe_device.rst
> 
> diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
> index 42ba6c263cd0..88b22fad880e 100644
> --- a/Documentation/gpu/xe/index.rst
> +++ b/Documentation/gpu/xe/index.rst
> @@ -25,5 +25,6 @@ DG2, etc is provided to prototype the driver.
>      xe_tile
>      xe_debugging
>      xe_devcoredump
> +   xe_device
>      xe-drm-usage-stats.rst
>      xe_configfs
> diff --git a/Documentation/gpu/xe/xe_device.rst b/Documentation/gpu/xe/xe_device.rst
> new file mode 100644
> index 000000000000..f9b962169919
> --- /dev/null
> +++ b/Documentation/gpu/xe/xe_device.rst
> @@ -0,0 +1,10 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT)
> +
> +.. _xe-device-wedging:
> +
> +==================
> +Xe Device Wedging
> +==================
> +
> +.. kernel-doc:: drivers/gpu/drm/xe/xe_device.c
> +   :doc: Device Wedging
> diff --git a/Documentation/gpu/xe/xe_pcode.rst b/Documentation/gpu/xe/xe_pcode.rst
> index 5937ef3599b0..2a43601123cb 100644
> --- a/Documentation/gpu/xe/xe_pcode.rst
> +++ b/Documentation/gpu/xe/xe_pcode.rst
> @@ -13,9 +13,11 @@ Internal API
>   .. kernel-doc:: drivers/gpu/drm/xe/xe_pcode.c
>      :internal:
>   
> +.. _xe-survivability-mode:
> +
>   ==================
> -Boot Survivability
> +Survivability Mode
>   ==================
>   
>   .. kernel-doc:: drivers/gpu/drm/xe/xe_survivability_mode.c
> -   :doc: Xe Boot Survivability
> +   :doc: Survivability Mode
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 5defa54ccd26..d6b680abc3ae 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1119,6 +1119,22 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>   	xe_pm_runtime_put(xe);
>   }
>   
> +/**
> + * DOC: Device Wedging
> + *
> + * Xe driver uses device wedged uevent as documented in Documentation/gpu/drm-uapi.rst.
> + *
> + * When device is in wedged state, every IOCTL will be blocked and GT cannot be
> + * used. Certain critical errors like gt reset failure, firmware failures can cause
> + * the device to be wedged. The default recovery mechanism for a wedged state
> + * is re-probe (unbind + bind)
> + *
> + * However, CSC firmware errors require a firmware flash to restore normal device
> + * operation. Since firmware flash is a vendor-specific action ``WEDGED=vendor-specific``
> + * recovery method along with :ref:`runtime survivability mode <xe-survivability-mode>`
> + * is used to notify userspace.
> + */
> +
>   /**
>    * xe_device_declare_wedged - Declare device wedged
>    * @xe: xe device instance
> diff --git a/drivers/gpu/drm/xe/xe_survivability_mode.c b/drivers/gpu/drm/xe/xe_survivability_mode.c
> index e1adcb33c9b0..0dc8fd77a9f4 100644
> --- a/drivers/gpu/drm/xe/xe_survivability_mode.c
> +++ b/drivers/gpu/drm/xe/xe_survivability_mode.c
> @@ -21,15 +21,18 @@
>   #define MAX_SCRATCH_MMIO 8
>   
>   /**
> - * DOC: Xe Boot Survivability
> + * DOC: Survivability Mode
>    *
> - * Boot Survivability is a software based workflow for recovering a system in a failed boot state
> + * Survivability Mode is a software based workflow for recovering a system in a failed boot state
>    * Here system recoverability is concerned with recovering the firmware responsible for boot.
>    *
> - * This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware
> - * to be flashed through mei and collect telemetry. The driver's probe flow is modified
> - * such that it enters survivability mode when pcode initialization is incomplete and boot status
> - * denotes a failure.
> + * Boot Survivability
> + * ===================
> + *
> + * Boot Survivability is implemented by loading the driver with bare minimum (no drm card) to allow
> + * the firmware to be flashed through mei and collect telemetry. The driver's probe flow is
> + * modified such that it enters survivability mode when pcode initialization is incomplete and boot
> + * status denotes a failure.
>    *
>    * Survivability mode can also be entered manually using the survivability mode attribute available
>    * through configfs which is beneficial in several usecases. It can be used to address scenarios
> @@ -55,6 +58,22 @@
>    *	Provides history of previous failures
>    * Auxiliary Information
>    *	Certain failures may have information in addition to postcode information
> + *
> + * Runtime Survivability
> + * =====================
> + *
> + * Certain runtime firmware errors can cause the device to enter a non-recoverable state
> + * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
> + * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
> + * is indicated by the presence of survivability mode sysfs::
> + *
> + *	/sys/bus/pci/devices/<device>/surivability_mode

typo. Will fix in next rev

> + *
> + * Survivability mode sysfs provides information about the type of survivability mode.
> + *
> + * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
> + * survivability mode. User can then initiate a firmware flash to restore device to normal
> + * operation.
>    */
>   
>   static u32 aux_history_offset(u32 reg_value)



* Re: [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability
  2025-07-02 14:11 ` [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
  2025-07-02 13:55   ` Riana Tauro
@ 2025-07-03  7:19   ` Raag Jadav
  1 sibling, 0 replies; 36+ messages in thread
From: Raag Jadav @ 2025-07-03  7:19 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, umesh.nerlige.ramappa, frank.scarbrough,
	sk.anirban

On Wed, Jul 02, 2025 at 07:41:14PM +0530, Riana Tauro wrote:
> Add documentation for vendor specific device wedged recovery method
> and runtime survivability.

...

> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 5defa54ccd26..d6b680abc3ae 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1119,6 +1119,22 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>  	xe_pm_runtime_put(xe);
>  }
>  
> +/**
> + * DOC: Device Wedging
> + *
> + * Xe driver uses device wedged uevent as documented in Documentation/gpu/drm-uapi.rst.
> + *
> + * When device is in wedged state, every IOCTL will be blocked and GT cannot be
> + * used. Certain critical errors like gt reset failure, firmware failures can cause
> + * the device to be wedged. The default recovery mechanism for a wedged state
> + * is re-probe (unbind + bind)
> + *
> + * However, CSC firmware errors require a firmware flash to restore normal device
> + * operation. Since firmware flash is a vendor-specific action ``WEDGED=vendor-specific``
> + * recovery method along with :ref:`runtime survivability mode <xe-survivability-mode>`
> + * is used to notify userspace.

I think a bit more context about the expectation from the user would
be useful.

Raag

^ permalink raw reply	[flat|nested] 36+ messages in thread
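
On the question of what is expected from the user, a minimal sketch of a
userspace consumer may help. It maps the WEDGED= uevent property to a
recovery action. Everything named here (pick_recovery(), list_has(), the
enum) is hypothetical and not part of this series; treating
"vendor-specific" as "firmware flash required" is an xe-specific assumption
taken from the documentation quoted above. The values "rebind" and
"bus-reset" follow Documentation/gpu/drm-uapi.rst.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical action codes for a userspace consumer (not from the series). */
enum recovery_action {
	RECOVERY_UNKNOWN = 0,
	RECOVERY_REBIND,
	RECOVERY_BUS_RESET,
	RECOVERY_FIRMWARE_FLASH,
};

/* Return 1 if the comma-separated list contains exactly this token. */
int list_has(const char *list, const char *token)
{
	size_t len;

	assert(list && token);
	len = strlen(token);

	while (*list) {
		const char *end = strchr(list, ',');
		size_t n = end ? (size_t)(end - list) : strlen(list);

		if (n == len && strncmp(list, token, len) == 0)
			return 1;
		if (!end)
			break;
		list = end + 1;
	}
	return 0;
}

/*
 * Map one uevent property string, e.g. "WEDGED=rebind,bus-reset", to an
 * action. "vendor-specific" is checked first because, for xe, it is
 * assumed to mean that a firmware flash is required; otherwise rebind is
 * preferred over bus-reset.
 */
enum recovery_action pick_recovery(const char *prop)
{
	static const char key[] = "WEDGED=";
	const char *methods;

	if (!prop || strncmp(prop, key, sizeof(key) - 1) != 0)
		return RECOVERY_UNKNOWN;
	methods = prop + sizeof(key) - 1;

	if (list_has(methods, "vendor-specific"))
		return RECOVERY_FIRMWARE_FLASH;
	if (list_has(methods, "rebind"))
		return RECOVERY_REBIND;
	if (list_has(methods, "bus-reset"))
		return RECOVERY_BUS_RESET;
	return RECOVERY_UNKNOWN;
}
```

A real consumer would presumably also check for the survivability mode
sysfs attribute before starting a flash, since "vendor-specific" alone
does not say which vendor action is needed.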

* [PATCH v3 5/7] drm/xe: Add support to handle hardware errors
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (3 preceding siblings ...)
  2025-07-02 14:11 ` [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
@ 2025-07-02 14:11 ` Riana Tauro
  2025-07-09 17:27   ` Summers, Stuart
  2025-07-02 14:11 ` [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 14:11 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban, Himal Prasad Ghimiray

The Gfx device reports two classes of errors: uncorrectable and
correctable. Depending on severity, uncorrectable errors are
further classified as non-fatal and fatal.

Correctable and non-fatal errors are reported as MSIs, and bits in
the Master Interrupt Register indicate the class of the error.
The source of the error is then read from the Device Error Source
Register. Fatal errors are reported as PCIe errors.
When a PCIe error is asserted, the OS performs a device warm reset,
which causes the driver to reload. The error registers are sticky
and their values are maintained through a warm reset.

Add basic support to handle these errors.

Bspec: 50875, 53073, 53074, 53075, 53076

Co-developed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/Makefile                |   1 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  15 +++
 drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
 drivers/gpu/drm/xe/xe_hw_error.c           | 108 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h           |  15 +++
 drivers/gpu/drm/xe/xe_irq.c                |   4 +
 6 files changed, 144 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 1d97e5b63f4e..fea8ee3b0785 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -73,6 +73,7 @@ xe-y += xe_bb.o \
 	xe_hw_engine.o \
 	xe_hw_engine_class_sysfs.o \
 	xe_hw_engine_group.o \
+	xe_hw_error.o \
 	xe_hw_fence.o \
 	xe_irq.o \
 	xe_lrc.o \
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
new file mode 100644
index 000000000000..ed9b81fb28a0
--- /dev/null
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_HW_ERROR_REGS_H_
+#define _XE_HW_ERROR_REGS_H_
+
+#define DEV_ERR_STAT_NONFATAL			0x100178
+#define DEV_ERR_STAT_CORRECTABLE		0x10017c
+#define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
+								  DEV_ERR_STAT_CORRECTABLE, \
+								  DEV_ERR_STAT_NONFATAL))
+
+#endif
diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
index f0ecfcac4003..2758b64cec9e 100644
--- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
@@ -18,6 +18,7 @@
 #define GFX_MSTR_IRQ				XE_REG(0x190010, XE_REG_OPTION_VF)
 #define   MASTER_IRQ				REG_BIT(31)
 #define   GU_MISC_IRQ				REG_BIT(29)
+#define   ERROR_IRQ(x)				REG_BIT(26 + (x))
 #define   DISPLAY_IRQ				REG_BIT(16)
 #define   GT_DW_IRQ(x)				REG_BIT(x)
 
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
new file mode 100644
index 000000000000..0f2590839900
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include "regs/xe_hw_error_regs.h"
+#include "regs/xe_irq_regs.h"
+
+#include "xe_device.h"
+#include "xe_hw_error.h"
+#include "xe_mmio.h"
+
+/* Error categories reported by hardware */
+enum hardware_error {
+	HARDWARE_ERROR_CORRECTABLE = 0,
+	HARDWARE_ERROR_NONFATAL = 1,
+	HARDWARE_ERROR_FATAL = 2,
+	HARDWARE_ERROR_MAX,
+};
+
+static const char *hw_error_to_str(const enum hardware_error hw_err)
+{
+	switch (hw_err) {
+	case HARDWARE_ERROR_CORRECTABLE:
+		return "CORRECTABLE";
+	case HARDWARE_ERROR_NONFATAL:
+		return "NONFATAL";
+	case HARDWARE_ERROR_FATAL:
+		return "FATAL";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hw_error_to_str(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	unsigned long flags;
+	u32 err_src;
+
+	if (xe->info.platform != XE_BATTLEMAGE)
+		return;
+
+	spin_lock_irqsave(&xe->irq.lock, flags);
+	err_src = xe_mmio_read32(&tile->mmio, DEV_ERR_STAT_REG(hw_err));
+	if (!err_src) {
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported DEV_ERR_STAT_%s blank!\n",
+				    tile->id, hw_err_str);
+		goto unlock;
+	}
+
+	/* TODO: Process errors per source */
+
+	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
+
+unlock:
+	spin_unlock_irqrestore(&xe->irq.lock, flags);
+}
+
+/**
+ * xe_hw_error_irq_handler - irq handling for hw errors
+ * @tile: tile instance
+ * @master_ctl: value read from master interrupt register
+ *
+ * Xe platforms add three error bits to the master interrupt register to support error handling.
+ * These three bits are used to convey the class of error FATAL, NONFATAL, or CORRECTABLE.
+ * To process the interrupt, determine the source of error by reading the Device Error Source
+ * Register that corresponds to the class of error being serviced.
+ */
+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
+{
+	enum hardware_error hw_err;
+
+	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
+		if (master_ctl & ERROR_IRQ(hw_err))
+			hw_error_source_handler(tile, hw_err);
+}
+
+/*
+ * Process hardware errors during boot
+ */
+static void process_hw_errors(struct xe_device *xe)
+{
+	struct xe_tile *tile;
+	u32 master_ctl;
+	u8 id;
+
+	for_each_tile(tile, xe, id) {
+		master_ctl = xe_mmio_read32(&tile->mmio, GFX_MSTR_IRQ);
+		xe_hw_error_irq_handler(tile, master_ctl);
+		xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ, master_ctl);
+	}
+}
+
+/**
+ * xe_hw_error_init - Initialize hw errors
+ * @xe: xe device instance
+ *
+ * Initialize and process hw errors
+ */
+void xe_hw_error_init(struct xe_device *xe)
+{
+	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
+		return;
+
+	process_hw_errors(xe);
+}
diff --git a/drivers/gpu/drm/xe/xe_hw_error.h b/drivers/gpu/drm/xe/xe_hw_error.h
new file mode 100644
index 000000000000..d86e28c5180c
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_hw_error.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+#ifndef XE_HW_ERROR_H_
+#define XE_HW_ERROR_H_
+
+#include <linux/types.h>
+
+struct xe_tile;
+struct xe_device;
+
+void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl);
+void xe_hw_error_init(struct xe_device *xe);
+#endif
diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 5362d3174b06..24ccf3bec52c 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -18,6 +18,7 @@
 #include "xe_gt.h"
 #include "xe_guc.h"
 #include "xe_hw_engine.h"
+#include "xe_hw_error.h"
 #include "xe_memirq.h"
 #include "xe_mmio.h"
 #include "xe_pxp.h"
@@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
 		xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
 
 		gt_irq_handler(tile, master_ctl, intr_dw, identity);
+		xe_hw_error_irq_handler(tile, master_ctl);
 
 		/*
 		 * Display interrupts (including display backlight operations
@@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
 	int nvec = 1;
 	int err;
 
+	xe_hw_error_init(xe);
+
 	xe_irq_reset(xe);
 
 	if (xe_device_has_msix(xe)) {
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread
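
For readers following the register selection in the patch above, the
dispatch can be sketched in plain C outside the kernel. The macro values
are copied from the patch; dev_err_stat_offset() and pending_error_regs()
are hypothetical helpers written for this illustration only. The
arithmetic mirrors the kernel's _PICK_EVEN(), which evaluates to
a + index * (b - a). Note that the offset for the fatal class (0x100174)
simply falls out of that arithmetic and has not been checked against
Bspec here.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirrored from the patch; the helpers below are illustrative only. */
#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c
#define ERROR_IRQ(x)			(UINT32_C(1) << (26 + (x)))

enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
	HARDWARE_ERROR_MAX,
};

/* Same arithmetic as _PICK_EVEN(): a + index * (b - a), stride is -4. */
uint32_t dev_err_stat_offset(enum hardware_error hw_err)
{
	int32_t stride = DEV_ERR_STAT_NONFATAL - DEV_ERR_STAT_CORRECTABLE;

	assert(hw_err < HARDWARE_ERROR_MAX);
	return (uint32_t)(DEV_ERR_STAT_CORRECTABLE + (int32_t)hw_err * stride);
}

/*
 * Collect the status register offsets that would be read for a given
 * master interrupt value. Returns the number of offsets written to out[].
 */
int pending_error_regs(uint32_t master_ctl, uint32_t out[HARDWARE_ERROR_MAX])
{
	int n = 0;

	for (int hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
		if (master_ctl & ERROR_IRQ(hw_err))
			out[n++] = dev_err_stat_offset((enum hardware_error)hw_err);
	return n;
}
```

With bits 26 and 27 set in the master interrupt value, the sketch yields
the correctable and non-fatal status offsets, in that order.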

* Re: [PATCH v3 5/7] drm/xe: Add support to handle hardware errors
  2025-07-02 14:11 ` [PATCH v3 5/7] drm/xe: Add support to handle hardware errors Riana Tauro
@ 2025-07-09 17:27   ` Summers, Stuart
  2025-07-10  5:54     ` Riana Tauro
  0 siblings, 1 reply; 36+ messages in thread
From: Summers, Stuart @ 2025-07-09 17:27 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Tauro,  Riana
  Cc: Jadav, Raag, Anirban, Sk, Vivi, Rodrigo, Scarbrough, Frank,
	Ghimiray, Himal Prasad, aravind.iddamsetty@linux.intel.com,
	Gupta, Anshuman, Nerlige Ramappa, Umesh, De Marchi, Lucas

On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
> Gfx device reports two classes of errors: uncorrectable and
> correctable. Depending on the severity uncorrectable errors are
> further classified as non fatal and fatal
> 
> Correctable and non-fatal errors are reported as MSI's and bits in
> the Master Interrupt Register indicate the class of the error.
> The source of the error is then read from the Device Error Source
> Register. Fatal errors are reported as PCIe errors
> When a PCIe error is asserted, the OS will perform a device warm
> reset
> which causes the driver to reload. The error registers are sticky
> and the values are maintained through a warm reset
> 
> Add basic support to handle these errors
> 
> Bspec: 50875, 53073, 53074, 53075, 53076
> 
> Co-developed-by: Himal Prasad Ghimiray
> <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Himal Prasad Ghimiray
> <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile                |   1 +
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  15 +++
>  drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
>  drivers/gpu/drm/xe/xe_hw_error.c           | 108
> +++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_hw_error.h           |  15 +++
>  drivers/gpu/drm/xe/xe_irq.c                |   4 +
>  6 files changed, 144 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>  create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
>  create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index 1d97e5b63f4e..fea8ee3b0785 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -73,6 +73,7 @@ xe-y += xe_bb.o \
>         xe_hw_engine.o \
>         xe_hw_engine_class_sysfs.o \
>         xe_hw_engine_group.o \
> +       xe_hw_error.o \
>         xe_hw_fence.o \
>         xe_irq.o \
>         xe_lrc.o \
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> new file mode 100644
> index 000000000000..ed9b81fb28a0
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_HW_ERROR_REGS_H_
> +#define _XE_HW_ERROR_REGS_H_
> +
> +#define DEV_ERR_STAT_NONFATAL                  0x100178
> +#define DEV_ERR_STAT_CORRECTABLE               0x10017c
> +#define
> DEV_ERR_STAT_REG(x)                    XE_REG(_PICK_EVEN((x), \
> +                                                                
> DEV_ERR_STAT_CORRECTABLE, \
> +                                                                
> DEV_ERR_STAT_NONFATAL))
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
> b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
> index f0ecfcac4003..2758b64cec9e 100644
> --- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
> @@ -18,6 +18,7 @@
>  #define GFX_MSTR_IRQ                           XE_REG(0x190010,
> XE_REG_OPTION_VF)
>  #define   MASTER_IRQ                           REG_BIT(31)
>  #define   GU_MISC_IRQ                          REG_BIT(29)
> +#define   ERROR_IRQ(x)                         REG_BIT(26 + (x))
>  #define   DISPLAY_IRQ                          REG_BIT(16)
>  #define   GT_DW_IRQ(x)                         REG_BIT(x)
>  
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c
> b/drivers/gpu/drm/xe/xe_hw_error.c
> new file mode 100644
> index 000000000000..0f2590839900
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include "regs/xe_hw_error_regs.h"
> +#include "regs/xe_irq_regs.h"
> +
> +#include "xe_device.h"
> +#include "xe_hw_error.h"
> +#include "xe_mmio.h"
> +
> +/* Error categories reported by hardware */
> +enum hardware_error {
> +       HARDWARE_ERROR_CORRECTABLE = 0,
> +       HARDWARE_ERROR_NONFATAL = 1,
> +       HARDWARE_ERROR_FATAL = 2,
> +       HARDWARE_ERROR_MAX,
> +};
> +
> +static const char *hw_error_to_str(const enum hardware_error hw_err)
> +{
> +       switch (hw_err) {
> +       case HARDWARE_ERROR_CORRECTABLE:
> +               return "CORRECTABLE";
> +       case HARDWARE_ERROR_NONFATAL:
> +               return "NONFATAL";
> +       case HARDWARE_ERROR_FATAL:
> +               return "FATAL";
> +       default:
> +               return "UNKNOWN";
> +       }
> +}
> +
> +static void hw_error_source_handler(struct xe_tile *tile, const enum
> hardware_error hw_err)
> +{
> +       const char *hw_err_str = hw_error_to_str(hw_err);
> +       struct xe_device *xe = tile_to_xe(tile);
> +       unsigned long flags;
> +       u32 err_src;
> +
> +       if (xe->info.platform != XE_BATTLEMAGE)

Why is this only on BMG? I see these same bits available on other
platforms, e.g. LNL.

> +               return;
> +
> +       spin_lock_irqsave(&xe->irq.lock, flags);
> +       err_src = xe_mmio_read32(&tile->mmio,
> DEV_ERR_STAT_REG(hw_err));
> +       if (!err_src) {
> +               drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported
> DEV_ERR_STAT_%s blank!\n",
> +                                   tile->id, hw_err_str);
> +               goto unlock;
> +       }
> +
> +       /* TODO: Process errrors per source */

Should at least print the bits out on the initial implementation?

> +
> +       xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err),
> err_src);
> +
> +unlock:
> +       spin_unlock_irqrestore(&xe->irq.lock, flags);
> +}
> +
> +/**
> + * xe_hw_error_irq_handler - irq handling for hw errors
> + * @tile: tile instance
> + * @master_ctl: value read from master interrupt register
> + *
> + * Xe platforms add three error bits to the master interrupt
> register to support error handling.
> + * These three bits are used to convey the class of error FATAL,
> NONFATAL, or CORRECTABLE.
> + * To process the interrupt, determine the source of error by
> reading the Device Error Source
> + * Register that corresponds to the class of error being serviced.
> + */
> +void xe_hw_error_irq_handler(struct xe_tile *tile, const u32
> master_ctl)
> +{
> +       enum hardware_error hw_err;
> +
> +       for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
> +               if (master_ctl & ERROR_IRQ(hw_err))
> +                       hw_error_source_handler(tile, hw_err);
> +}
> +
> +/*
> + * Process hardware errors during boot
> + */
> +static void process_hw_errors(struct xe_device *xe)
> +{
> +       struct xe_tile *tile;
> +       u32 master_ctl;
> +       u8 id;
> +
> +       for_each_tile(tile, xe, id) {
> +               master_ctl = xe_mmio_read32(&tile->mmio,
> GFX_MSTR_IRQ);
> +               xe_hw_error_irq_handler(tile, master_ctl);
> +               xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ,
> master_ctl);
> +       }
> +}
> +
> +/**
> + * xe_hw_error_init - Initialize hw errors
> + * @xe: xe device instance
> + *
> + * Initialize and process hw errors
> + */
> +void xe_hw_error_init(struct xe_device *xe)
> +{
> +       if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))

Again, why skipping integrated? It seems like this might also be viable
for, for instance, LNL? Of course some of the bits might make less
sense for those platforms if they are PCIe-specific. But at least
printing the register on an error seems interesting.

Thanks,
Stuart

> +               return;
> +
> +       process_hw_errors(xe);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.h
> b/drivers/gpu/drm/xe/xe_hw_error.h
> new file mode 100644
> index 000000000000..d86e28c5180c
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_hw_error.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +#ifndef XE_HW_ERROR_H_
> +#define XE_HW_ERROR_H_
> +
> +#include <linux/types.h>
> +
> +struct xe_tile;
> +struct xe_device;
> +
> +void xe_hw_error_irq_handler(struct xe_tile *tile, const u32
> master_ctl);
> +void xe_hw_error_init(struct xe_device *xe);
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_irq.c
> b/drivers/gpu/drm/xe/xe_irq.c
> index 5362d3174b06..24ccf3bec52c 100644
> --- a/drivers/gpu/drm/xe/xe_irq.c
> +++ b/drivers/gpu/drm/xe/xe_irq.c
> @@ -18,6 +18,7 @@
>  #include "xe_gt.h"
>  #include "xe_guc.h"
>  #include "xe_hw_engine.h"
> +#include "xe_hw_error.h"
>  #include "xe_memirq.h"
>  #include "xe_mmio.h"
>  #include "xe_pxp.h"
> @@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, void
> *arg)
>                 xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>  
>                 gt_irq_handler(tile, master_ctl, intr_dw, identity);
> +               xe_hw_error_irq_handler(tile, master_ctl);
>  
>                 /*
>                  * Display interrupts (including display backlight
> operations
> @@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
>         int nvec = 1;
>         int err;
>  
> +       xe_hw_error_init(xe);
> +
>         xe_irq_reset(xe);
>  
>         if (xe_device_has_msix(xe)) {


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 5/7] drm/xe: Add support to handle hardware errors
  2025-07-09 17:27   ` Summers, Stuart
@ 2025-07-10  5:54     ` Riana Tauro
  0 siblings, 0 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-10  5:54 UTC (permalink / raw)
  To: Summers, Stuart, intel-xe@lists.freedesktop.org
  Cc: Jadav, Raag, Anirban, Sk, Vivi, Rodrigo, Scarbrough, Frank,
	Ghimiray, Himal Prasad, aravind.iddamsetty@linux.intel.com,
	Gupta, Anshuman, Nerlige Ramappa, Umesh, De Marchi, Lucas


Hi Stuart

On 7/9/2025 10:57 PM, Summers, Stuart wrote:
> On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
>> Gfx device reports two classes of errors: uncorrectable and
>> correctable. Depending on the severity uncorrectable errors are
>> further classified as non fatal and fatal
>>
>> Correctable and non-fatal errors are reported as MSI's and bits in
>> the Master Interrupt Register indicate the class of the error.
>> The source of the error is then read from the Device Error Source
>> Register. Fatal errors are reported as PCIe errors
>> When a PCIe error is asserted, the OS will perform a device warm
>> reset
>> which causes the driver to reload. The error registers are sticky
>> and the values are maintained through a warm reset
>>
>> Add basic support to handle these errors
>>
>> Bspec: 50875, 53073, 53074, 53075, 53076
>>
>> Co-developed-by: Himal Prasad Ghimiray
>> <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Himal Prasad Ghimiray
>> <himal.prasad.ghimiray@intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/Makefile                |   1 +
>>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  15 +++
>>   drivers/gpu/drm/xe/regs/xe_irq_regs.h      |   1 +
>>   drivers/gpu/drm/xe/xe_hw_error.c           | 108
>> +++++++++++++++++++++
>>   drivers/gpu/drm/xe/xe_hw_error.h           |  15 +++
>>   drivers/gpu/drm/xe/xe_irq.c                |   4 +
>>   6 files changed, 144 insertions(+)
>>   create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>>   create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
>>
>> diff --git a/drivers/gpu/drm/xe/Makefile
>> b/drivers/gpu/drm/xe/Makefile
>> index 1d97e5b63f4e..fea8ee3b0785 100644
>> --- a/drivers/gpu/drm/xe/Makefile
>> +++ b/drivers/gpu/drm/xe/Makefile
>> @@ -73,6 +73,7 @@ xe-y += xe_bb.o \
>>          xe_hw_engine.o \
>>          xe_hw_engine_class_sysfs.o \
>>          xe_hw_engine_group.o \
>> +       xe_hw_error.o \
>>          xe_hw_fence.o \
>>          xe_irq.o \
>>          xe_lrc.o \
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> new file mode 100644
>> index 000000000000..ed9b81fb28a0
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -0,0 +1,15 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_HW_ERROR_REGS_H_
>> +#define _XE_HW_ERROR_REGS_H_
>> +
>> +#define DEV_ERR_STAT_NONFATAL                  0x100178
>> +#define DEV_ERR_STAT_CORRECTABLE               0x10017c
>> +#define
>> DEV_ERR_STAT_REG(x)                    XE_REG(_PICK_EVEN((x), \
>> +
>> DEV_ERR_STAT_CORRECTABLE, \
>> +
>> DEV_ERR_STAT_NONFATAL))
>> +
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>> b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>> index f0ecfcac4003..2758b64cec9e 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_irq_regs.h
>> @@ -18,6 +18,7 @@
>>   #define GFX_MSTR_IRQ                           XE_REG(0x190010,
>> XE_REG_OPTION_VF)
>>   #define   MASTER_IRQ                           REG_BIT(31)
>>   #define   GU_MISC_IRQ                          REG_BIT(29)
>> +#define   ERROR_IRQ(x)                         REG_BIT(26 + (x))
>>   #define   DISPLAY_IRQ                          REG_BIT(16)
>>   #define   GT_DW_IRQ(x)                         REG_BIT(x)
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c
>> b/drivers/gpu/drm/xe/xe_hw_error.c
>> new file mode 100644
>> index 000000000000..0f2590839900
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -0,0 +1,108 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +
>> +#include "regs/xe_hw_error_regs.h"
>> +#include "regs/xe_irq_regs.h"
>> +
>> +#include "xe_device.h"
>> +#include "xe_hw_error.h"
>> +#include "xe_mmio.h"
>> +
>> +/* Error categories reported by hardware */
>> +enum hardware_error {
>> +       HARDWARE_ERROR_CORRECTABLE = 0,
>> +       HARDWARE_ERROR_NONFATAL = 1,
>> +       HARDWARE_ERROR_FATAL = 2,
>> +       HARDWARE_ERROR_MAX,
>> +};
>> +
>> +static const char *hw_error_to_str(const enum hardware_error hw_err)
>> +{
>> +       switch (hw_err) {
>> +       case HARDWARE_ERROR_CORRECTABLE:
>> +               return "CORRECTABLE";
>> +       case HARDWARE_ERROR_NONFATAL:
>> +               return "NONFATAL";
>> +       case HARDWARE_ERROR_FATAL:
>> +               return "FATAL";
>> +       default:
>> +               return "UNKNOWN";
>> +       }
>> +}
>> +
>> +static void hw_error_source_handler(struct xe_tile *tile, const enum
>> hardware_error hw_err)
>> +{
>> +       const char *hw_err_str = hw_error_to_str(hw_err);
>> +       struct xe_device *xe = tile_to_xe(tile);
>> +       unsigned long flags;
>> +       u32 err_src;
>> +
>> +       if (xe->info.platform != XE_BATTLEMAGE)
> 
> Why is this only on BMG? I see these same bits available on other
> platforms, e.g. LNL.
> 
>> +               return;
>> +
>> +       spin_lock_irqsave(&xe->irq.lock, flags);
>> +       err_src = xe_mmio_read32(&tile->mmio,
>> DEV_ERR_STAT_REG(hw_err));
>> +       if (!err_src) {
>> +               drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported
>> DEV_ERR_STAT_%s blank!\n",
>> +                                   tile->id, hw_err_str);
>> +               goto unlock;
>> +       }
>> +
>> +       /* TODO: Process errrors per source */
> 
> Should at least print the bits out on the initial implementation?

This patch is taken from
https://patchwork.freedesktop.org/series/125373/ which was not merged
due to the absence of an upstream consumer.
Himal/Aravind can provide more details.

I have taken a single patch from it into this series to add support for
CSC errors, as they have a recovery mechanism and an upstream consumer,
i.e. fwupd.

The processing of all the bits according to source should be a separate
series. I can retain the TODO and remove the BMG check.

> 
>> +
>> +       xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err),
>> err_src);
>> +
>> +unlock:
>> +       spin_unlock_irqrestore(&xe->irq.lock, flags);
>> +}
>> +
>> +/**
>> + * xe_hw_error_irq_handler - irq handling for hw errors
>> + * @tile: tile instance
>> + * @master_ctl: value read from master interrupt register
>> + *
>> + * Xe platforms add three error bits to the master interrupt
>> register to support error handling.
>> + * These three bits are used to convey the class of error FATAL,
>> NONFATAL, or CORRECTABLE.
>> + * To process the interrupt, determine the source of error by
>> reading the Device Error Source
>> + * Register that corresponds to the class of error being serviced.
>> + */
>> +void xe_hw_error_irq_handler(struct xe_tile *tile, const u32
>> master_ctl)
>> +{
>> +       enum hardware_error hw_err;
>> +
>> +       for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
>> +               if (master_ctl & ERROR_IRQ(hw_err))
>> +                       hw_error_source_handler(tile, hw_err);
>> +}
>> +
>> +/*
>> + * Process hardware errors during boot
>> + */
>> +static void process_hw_errors(struct xe_device *xe)
>> +{
>> +       struct xe_tile *tile;
>> +       u32 master_ctl;
>> +       u8 id;
>> +
>> +       for_each_tile(tile, xe, id) {
>> +               master_ctl = xe_mmio_read32(&tile->mmio,
>> GFX_MSTR_IRQ);
>> +               xe_hw_error_irq_handler(tile, master_ctl);
>> +               xe_mmio_write32(&tile->mmio, GFX_MSTR_IRQ,
>> master_ctl);
>> +       }
>> +}
>> +
>> +/**
>> + * xe_hw_error_init - Initialize hw errors
>> + * @xe: xe device instance
>> + *
>> + * Initialize and process hw errors
>> + */
>> +void xe_hw_error_init(struct xe_device *xe)
>> +{
>> +       if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
> 
> Again, why skipping integrated? It seems like this might also be viable
> for, for instance, LNL? Of course some of the bits might make less
> sense for those platforms if they are PCIe-specific. But at least
> printing the register on an error seems interesting.

Are you suggesting printing the raw value in a drm_err log?
I could add that and remove the dgfx check, but if it is about processing
the individual sources and keeping count, then that should be a
different series.

Thanks
Riana

> 
> Thanks,
> Stuart
> 
>> +               return;
>> +
>> +       process_hw_errors(xe);
>> +}
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.h
>> b/drivers/gpu/drm/xe/xe_hw_error.h
>> new file mode 100644
>> index 000000000000..d86e28c5180c
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.h
>> @@ -0,0 +1,15 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2025 Intel Corporation
>> + */
>> +#ifndef XE_HW_ERROR_H_
>> +#define XE_HW_ERROR_H_
>> +
>> +#include <linux/types.h>
>> +
>> +struct xe_tile;
>> +struct xe_device;
>> +
>> +void xe_hw_error_irq_handler(struct xe_tile *tile, const u32
>> master_ctl);
>> +void xe_hw_error_init(struct xe_device *xe);
>> +#endif
>> diff --git a/drivers/gpu/drm/xe/xe_irq.c
>> b/drivers/gpu/drm/xe/xe_irq.c
>> index 5362d3174b06..24ccf3bec52c 100644
>> --- a/drivers/gpu/drm/xe/xe_irq.c
>> +++ b/drivers/gpu/drm/xe/xe_irq.c
>> @@ -18,6 +18,7 @@
>>   #include "xe_gt.h"
>>   #include "xe_guc.h"
>>   #include "xe_hw_engine.h"
>> +#include "xe_hw_error.h"
>>   #include "xe_memirq.h"
>>   #include "xe_mmio.h"
>>   #include "xe_pxp.h"
>> @@ -466,6 +467,7 @@ static irqreturn_t dg1_irq_handler(int irq, void
>> *arg)
>>                  xe_mmio_write32(mmio, GFX_MSTR_IRQ, master_ctl);
>>   
>>                  gt_irq_handler(tile, master_ctl, intr_dw, identity);
>> +               xe_hw_error_irq_handler(tile, master_ctl);
>>   
>>                  /*
>>                   * Display interrupts (including display backlight
>> operations
>> @@ -753,6 +755,8 @@ int xe_irq_install(struct xe_device *xe)
>>          int nvec = 1;
>>          int err;
>>   
>> +       xe_hw_error_init(xe);
>> +
>>          xe_irq_reset(xe);
>>   
>>          if (xe_device_has_msix(xe)) {
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread
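
To make the "print the bits" suggestion from this exchange concrete, here
is a rough userspace sketch of decoding a raw error-source value into the
indices of its set bits. format_set_bits() is a hypothetical helper, not
something proposed in the thread; in the driver the equivalent would
presumably be a drm_err-style log of err_src, possibly walking the bits
with for_each_set_bit().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the indices of the set bits in err_src into
 * buf as a comma-separated list (e.g. "0,17"). Returns the string length,
 * or -1 if buf is too small.
 */
int format_set_bits(uint32_t err_src, char *buf, size_t size)
{
	size_t used = 0;

	assert(buf && size > 0);
	buf[0] = '\0';

	for (unsigned int bit = 0; bit < 32; bit++) {
		int n;

		if (!(err_src & (UINT32_C(1) << bit)))
			continue;
		n = snprintf(buf + used, size - used, "%s%u",
			     used ? "," : "", bit);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

For example, a value with bits 0 and 17 set is rendered as "0,17"; bit 17
is the position that patch 6/7 defines as XE_CSC_ERROR.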

* [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (4 preceding siblings ...)
  2025-07-02 14:11 ` [PATCH v3 5/7] drm/xe: Add support to handle hardware errors Riana Tauro
@ 2025-07-02 14:11 ` Riana Tauro
  2025-07-02 21:35   ` Rodrigo Vivi
  2025-07-09 17:57   ` Summers, Stuart
  2025-07-02 14:11 ` [PATCH v3 7/7] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 14:11 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add support to handle CSC firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device
as an MSI.

The Device Source control registers indicate the source of the error as
CSC. The HEC error status register indicates that the error is firmware
reported. Depending on the type of error, the error cause is written to
the HEC Firmware error register.

On encountering such CSC firmware errors, the graphics device is
non-recoverable from driver context. The only way to recover from these
errors is a firmware flash. The device is then wedged and userspace is
notified with a drm uevent.

v2: use vendor recovery method with
    runtime survivability (Christian, Rodrigo, Raag)

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
 drivers/gpu/drm/xe/xe_device.c             | 11 +++-
 drivers/gpu/drm/xe/xe_device_types.h       |  3 +
 drivers/gpu/drm/xe/xe_hw_error.c           | 70 +++++++++++++++++++++-
 5 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
index 9b66cc972a63..180be82672ab 100644
--- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
@@ -13,6 +13,8 @@
 
 /* Definitions of GSC H/W registers, bits, etc */
 
+#define BMG_GSC_HECI1_BASE	0x373000
+
 #define MTL_GSC_HECI1_BASE	0x00116000
 #define MTL_GSC_HECI2_BASE	0x00117000
 
diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
index ed9b81fb28a0..c146b9ef44eb 100644
--- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
+++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
@@ -6,10 +6,15 @@
 #ifndef _XE_HW_ERROR_REGS_H_
 #define _XE_HW_ERROR_REGS_H_
 
+#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
+#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
+
+#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
+
 #define DEV_ERR_STAT_NONFATAL			0x100178
 #define DEV_ERR_STAT_CORRECTABLE		0x10017c
 #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
 								  DEV_ERR_STAT_CORRECTABLE, \
 								  DEV_ERR_STAT_NONFATAL))
-
+#define   XE_CSC_ERROR				BIT(17)
 #endif
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index d6b680abc3ae..fbc50cebfc11 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1154,6 +1154,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
  */
 void xe_device_declare_wedged(struct xe_device *xe)
 {
+	unsigned long recovery_method;
 	struct xe_gt *gt;
 	u8 id;
 
@@ -1169,6 +1170,12 @@ void xe_device_declare_wedged(struct xe_device *xe)
 		return;
 	}
 
+	/* Default recovery method */
+	recovery_method = DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET;
+
+	if (xe_survivability_mode_is_runtime(xe))
+		recovery_method = DRM_WEDGE_RECOVERY_VENDOR;
+
 	for_each_gt(gt, xe, id)
 		xe_gt_declare_wedged(gt);
 
@@ -1181,8 +1188,6 @@ void xe_device_declare_wedged(struct xe_device *xe)
 			dev_name(xe->drm.dev));
 
 		/* Notify userspace of wedged device */
-		drm_dev_wedged_event(&xe->drm,
-				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
-				     NULL);
+		drm_dev_wedged_event(&xe->drm, recovery_method, NULL);
 	}
 }
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 7e4f6d846af6..5daf5ba6bf51 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -241,6 +241,9 @@ struct xe_tile {
 	/** @memirq: Memory Based Interrupts. */
 	struct xe_memirq memirq;
 
+	/** @csc_hw_error_work: worker to report CSC HW errors */
+	struct work_struct csc_hw_error_work;
+
 	/** @pcode: tile's PCODE */
 	struct {
 		/** @pcode.lock: protecting tile's PCODE mailbox data */
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 0f2590839900..73c788fd0dee 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,12 +3,16 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include "regs/xe_gsc_regs.h"
 #include "regs/xe_hw_error_regs.h"
 #include "regs/xe_irq_regs.h"
 
 #include "xe_device.h"
 #include "xe_hw_error.h"
 #include "xe_mmio.h"
+#include "xe_survivability_mode.h"
+
+#define  HEC_UNCORR_FW_ERR_BITS 4
 
 /* Error categories reported by hardware */
 enum hardware_error {
@@ -18,6 +22,13 @@ enum hardware_error {
 	HARDWARE_ERROR_MAX,
 };
 
+static const char * const hec_uncorrected_fw_errors[] = {
+	"Fatal",
+	"CSE Disabled",
+	"FD Corruption",
+	"Data Corruption"
+};
+
 static const char *hw_error_to_str(const enum hardware_error hw_err)
 {
 	switch (hw_err) {
@@ -32,6 +43,58 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
 	}
 }
 
+static void csc_hw_error_work(struct work_struct *work)
+{
+	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
+	struct xe_device *xe = tile_to_xe(tile);
+	int ret;
+
+	ret = xe_survivability_mode_enable(xe, XE_SURVIVABILITY_TYPE_RUNTIME);
+	if (ret)
+		drm_err(&xe->drm, "Failed to enable runtime survivability mode\n");
+
+	xe_device_declare_wedged(xe);
+}
+
+static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
+{
+	const char *hw_err_str = hw_error_to_str(hw_err);
+	struct xe_device *xe = tile_to_xe(tile);
+	struct xe_mmio *mmio = &tile->mmio;
+	u32 base, err_bit, err_src;
+	unsigned long fw_err;
+
+	if (xe->info.platform != XE_BATTLEMAGE)
+		return;
+
+	/* Not supported in BMG */
+	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
+		return;
+
+	base = BMG_GSC_HECI1_BASE;
+	lockdep_assert_held(&xe->irq.lock);
+	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
+	if (!err_src) {
+		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
+				    tile->id, hw_err_str);
+		return;
+	}
+
+	if (err_src & UNCORR_FW_REPORTED_ERR) {
+		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
+		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
+			drm_err_ratelimited(&xe->drm, HW_ERR
+					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
+					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
+					     err_bit);
+
+			schedule_work(&tile->csc_hw_error_work);
+		}
+	}
+
+	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
+}
+
 static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
 	const char *hw_err_str = hw_error_to_str(hw_err);
@@ -50,7 +113,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
 		goto unlock;
 	}
 
-	/* TODO: Process errrors per source */
+	if (err_src & XE_CSC_ERROR)
+		csc_hw_error_handler(tile, hw_err);
 
 	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
 
@@ -101,8 +165,12 @@ static void process_hw_errors(struct xe_device *xe)
  */
 void xe_hw_error_init(struct xe_device *xe)
 {
+	struct xe_tile *tile = xe_device_get_root_tile(xe);
+
 	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
 		return;
 
+	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
+
 	process_hw_errors(xe);
 }
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-02 14:11 ` [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
@ 2025-07-02 21:35   ` Rodrigo Vivi
  2025-07-03  5:28     ` Riana Tauro
  2025-07-09 17:57   ` Summers, Stuart
  1 sibling, 1 reply; 36+ messages in thread
From: Rodrigo Vivi @ 2025-07-02 21:35 UTC (permalink / raw)
  To: Riana Tauro
  Cc: intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, umesh.nerlige.ramappa, frank.scarbrough, sk.anirban

On Wed, Jul 02, 2025 at 07:41:16PM +0530, Riana Tauro wrote:
> Add support to handle CSC firmware reported errors. When CSC firmware
> errors are encountered, an error interrupt is received by the GFX device
> as an MSI interrupt.
> 
> The Device Source control registers indicate the source of the error as
> CSC. The HEC error status register indicates that the error is firmware
> reported. Depending on the type of error, the error cause is written to
> the HEC Firmware error register.
> 
> On encountering such CSC firmware errors, the graphics device is
> non-recoverable from driver context. The only way to recover from these
> errors is a firmware flash. The device is then wedged and userspace is
> notified with a drm uevent.
> 
> v2: use vendor recovery method with
>     runtime survivability (Christian, Rodrigo, Raag)
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
>  drivers/gpu/drm/xe/xe_device.c             | 11 +++-
>  drivers/gpu/drm/xe/xe_device_types.h       |  3 +
>  drivers/gpu/drm/xe/xe_hw_error.c           | 70 +++++++++++++++++++++-
>  5 files changed, 88 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> index 9b66cc972a63..180be82672ab 100644
> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> @@ -13,6 +13,8 @@
>  
>  /* Definitions of GSC H/W registers, bits, etc */
>  
> +#define BMG_GSC_HECI1_BASE	0x373000
> +
>  #define MTL_GSC_HECI1_BASE	0x00116000
>  #define MTL_GSC_HECI2_BASE	0x00117000
>  
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> index ed9b81fb28a0..c146b9ef44eb 100644
> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -6,10 +6,15 @@
>  #ifndef _XE_HW_ERROR_REGS_H_
>  #define _XE_HW_ERROR_REGS_H_
>  
> +#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
> +#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
> +
> +#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
> +
>  #define DEV_ERR_STAT_NONFATAL			0x100178
>  #define DEV_ERR_STAT_CORRECTABLE		0x10017c
>  #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>  								  DEV_ERR_STAT_CORRECTABLE, \
>  								  DEV_ERR_STAT_NONFATAL))
> -
> +#define   XE_CSC_ERROR				BIT(17)
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index d6b680abc3ae..fbc50cebfc11 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1154,6 +1154,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>   */
>  void xe_device_declare_wedged(struct xe_device *xe)
>  {
> +	unsigned long recovery_method;
>  	struct xe_gt *gt;
>  	u8 id;
>  
> @@ -1169,6 +1170,12 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  		return;
>  	}
>  
> +	/* Default recovery method */
> +	recovery_method = DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET;
> +
> +	if (xe_survivability_mode_is_runtime(xe))
> +		recovery_method = DRM_WEDGE_RECOVERY_VENDOR;

What about passing DRM_WEDGE_RECOVERY_VENDOR as an argument to this function?

Then, from the survivability mode you call:
xe_device_declare_wedged(xe, DRM_WEDGE_RECOVERY_VENDOR)
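
A rough userspace model of that suggestion, where the caller chooses the recovery method instead of the function querying survivability state, might look like the sketch below. Everything here is an assumption for illustration: the SKETCH_* flag values are placeholders and not the real DRM_WEDGE_RECOVERY_* definitions, sketch_declare_wedged() is hypothetical, and the method names follow the WEDGED= strings that the wedged uevent carries (the cover letter shows "vendor-specific").

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Placeholder flag values; the real definitions live in the DRM core. */
#define SKETCH_RECOVERY_REBIND     (1ul << 0)
#define SKETCH_RECOVERY_BUS_RESET  (1ul << 1)
#define SKETCH_RECOVERY_VENDOR     (1ul << 2)

/* Caller-chosen default, mirroring the two flags used by the patch. */
#define SKETCH_RECOVERY_DEFAULT \
	(SKETCH_RECOVERY_REBIND | SKETCH_RECOVERY_BUS_RESET)

/*
 * Hypothetical stand-in for a declare-wedged function that takes the
 * recovery method as a parameter. It only formats the uevent string.
 */
static void sketch_declare_wedged(unsigned long method, char *event, size_t len)
{
	static const struct {
		unsigned long flag;
		const char *name;
	} names[] = {
		{ SKETCH_RECOVERY_REBIND, "rebind" },
		{ SKETCH_RECOVERY_BUS_RESET, "bus-reset" },
		{ SKETCH_RECOVERY_VENDOR, "vendor-specific" },
	};
	size_t used;
	int first = 1;

	assert(event != NULL && len > 0);

	used = (size_t)snprintf(event, len, "WEDGED=");
	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(method & names[i].flag))
			continue;
		if (used >= len)
			break; /* output truncated, stop appending */
		used += (size_t)snprintf(event + used, len - used, "%s%s",
					 first ? "" : ",", names[i].name);
		first = 0;
	}
}
```

With this shape, ordinary callers would pass SKETCH_RECOVERY_DEFAULT and only the survivability path would pass SKETCH_RECOVERY_VENDOR; the trade-off is that every existing call site has to be updated.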

> +
>  	for_each_gt(gt, xe, id)
>  		xe_gt_declare_wedged(gt);
>  
> @@ -1181,8 +1188,6 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  			dev_name(xe->drm.dev));
>  
>  		/* Notify userspace of wedged device */
> -		drm_dev_wedged_event(&xe->drm,
> -				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
> -				     NULL);
> +		drm_dev_wedged_event(&xe->drm, recovery_method, NULL);
>  	}
>  }
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 7e4f6d846af6..5daf5ba6bf51 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -241,6 +241,9 @@ struct xe_tile {
>  	/** @memirq: Memory Based Interrupts. */
>  	struct xe_memirq memirq;
>  
> +	/** @csc_hw_error_work: worker to report CSC HW errors */
> +	struct work_struct csc_hw_error_work;
> +
>  	/** @pcode: tile's PCODE */
>  	struct {
>  		/** @pcode.lock: protecting tile's PCODE mailbox data */
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index 0f2590839900..73c788fd0dee 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -3,12 +3,16 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include "regs/xe_gsc_regs.h"
>  #include "regs/xe_hw_error_regs.h"
>  #include "regs/xe_irq_regs.h"
>  
>  #include "xe_device.h"
>  #include "xe_hw_error.h"
>  #include "xe_mmio.h"
> +#include "xe_survivability_mode.h"
> +
> +#define  HEC_UNCORR_FW_ERR_BITS 4
>  
>  /* Error categories reported by hardware */
>  enum hardware_error {
> @@ -18,6 +22,13 @@ enum hardware_error {
>  	HARDWARE_ERROR_MAX,
>  };
>  
> +static const char * const hec_uncorrected_fw_errors[] = {
> +	"Fatal",
> +	"CSE Disabled",
> +	"FD Corruption",
> +	"Data Corruption"
> +};
> +
>  static const char *hw_error_to_str(const enum hardware_error hw_err)
>  {
>  	switch (hw_err) {
> @@ -32,6 +43,58 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
>  	}
>  }
>  
> +static void csc_hw_error_work(struct work_struct *work)
> +{
> +	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	int ret;
> +
> +	ret = xe_survivability_mode_enable(xe, XE_SURVIVABILITY_TYPE_RUNTIME);
> +	if (ret)
> +		drm_err(&xe->drm, "Failed to enable runtime survivability mode\n");

This could simply call a function xe_survivability_mode_runtime(xe), which
declares the device wedged with vendor specific reason.

> +
> +	xe_device_declare_wedged(xe);
> +}
> +
> +static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> +{
> +	const char *hw_err_str = hw_error_to_str(hw_err);
> +	struct xe_device *xe = tile_to_xe(tile);
> +	struct xe_mmio *mmio = &tile->mmio;
> +	u32 base, err_bit, err_src;
> +	unsigned long fw_err;
> +
> +	if (xe->info.platform != XE_BATTLEMAGE)
> +		return;
> +
> +	/* Not supported in BMG */
> +	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
> +		return;
> +
> +	base = BMG_GSC_HECI1_BASE;
> +	lockdep_assert_held(&xe->irq.lock);
> +	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
> +	if (!err_src) {
> +		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
> +				    tile->id, hw_err_str);
> +		return;
> +	}
> +
> +	if (err_src & UNCORR_FW_REPORTED_ERR) {
> +		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
> +		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
> +			drm_err_ratelimited(&xe->drm, HW_ERR
> +					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
> +					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
> +					     err_bit);
> +
> +			schedule_work(&tile->csc_hw_error_work);
> +		}
> +	}
> +
> +	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
> +}
> +
>  static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>  {
>  	const char *hw_err_str = hw_error_to_str(hw_err);
> @@ -50,7 +113,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>  		goto unlock;
>  	}
>  
> -	/* TODO: Process errrors per source */
> +	if (err_src & XE_CSC_ERROR)
> +		csc_hw_error_handler(tile, hw_err);
>  
>  	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>  
> @@ -101,8 +165,12 @@ static void process_hw_errors(struct xe_device *xe)
>   */
>  void xe_hw_error_init(struct xe_device *xe)
>  {
> +	struct xe_tile *tile = xe_device_get_root_tile(xe);
> +
>  	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>  		return;
>  
> +	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
> +
>  	process_hw_errors(xe);
>  }
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-02 21:35   ` Rodrigo Vivi
@ 2025-07-03  5:28     ` Riana Tauro
  0 siblings, 0 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-03  5:28 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: intel-xe, anshuman.gupta, lucas.demarchi, aravind.iddamsetty,
	raag.jadav, umesh.nerlige.ramappa, frank.scarbrough, sk.anirban

Hi Rodrigo

On 7/3/2025 3:05 AM, Rodrigo Vivi wrote:
> On Wed, Jul 02, 2025 at 07:41:16PM +0530, Riana Tauro wrote:
>> Add support to handle CSC firmware reported errors. When CSC firmware
>> errors are encountered, an error interrupt is received by the GFX device
>> as an MSI interrupt.
>>
>> The Device Source control registers indicate the source of the error as
>> CSC. The HEC error status register indicates that the error is firmware
>> reported. Depending on the type of error, the error cause is written to
>> the HEC Firmware error register.
>>
>> On encountering such CSC firmware errors, the graphics device is
>> non-recoverable from driver context. The only way to recover from these
>> errors is a firmware flash. The device is then wedged and userspace is
>> notified with a drm uevent.
>>
>> v2: use vendor recovery method with
>>      runtime survivability (Christian, Rodrigo, Raag)
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
>>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
>>   drivers/gpu/drm/xe/xe_device.c             | 11 +++-
>>   drivers/gpu/drm/xe/xe_device_types.h       |  3 +
>>   drivers/gpu/drm/xe/xe_hw_error.c           | 70 +++++++++++++++++++++-
>>   5 files changed, 88 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> index 9b66cc972a63..180be82672ab 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> @@ -13,6 +13,8 @@
>>   
>>   /* Definitions of GSC H/W registers, bits, etc */
>>   
>> +#define BMG_GSC_HECI1_BASE	0x373000
>> +
>>   #define MTL_GSC_HECI1_BASE	0x00116000
>>   #define MTL_GSC_HECI2_BASE	0x00117000
>>   
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> index ed9b81fb28a0..c146b9ef44eb 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -6,10 +6,15 @@
>>   #ifndef _XE_HW_ERROR_REGS_H_
>>   #define _XE_HW_ERROR_REGS_H_
>>   
>> +#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base) + 0x118)
>> +#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>> +
>> +#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base) + 0x124)
>> +
>>   #define DEV_ERR_STAT_NONFATAL			0x100178
>>   #define DEV_ERR_STAT_CORRECTABLE		0x10017c
>>   #define DEV_ERR_STAT_REG(x)			XE_REG(_PICK_EVEN((x), \
>>   								  DEV_ERR_STAT_CORRECTABLE, \
>>   								  DEV_ERR_STAT_NONFATAL))
>> -
>> +#define   XE_CSC_ERROR				BIT(17)
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index d6b680abc3ae..fbc50cebfc11 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -1154,6 +1154,7 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>>    */
>>   void xe_device_declare_wedged(struct xe_device *xe)
>>   {
>> +	unsigned long recovery_method;
>>   	struct xe_gt *gt;
>>   	u8 id;
>>   
>> @@ -1169,6 +1170,12 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   		return;
>>   	}
>>   
>> +	/* Default recovery method */
>> +	recovery_method = DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET;
>> +
>> +	if (xe_survivability_mode_is_runtime(xe))
>> +		recovery_method = DRM_WEDGE_RECOVERY_VENDOR;
> 
> What about passing DRM_WEDGE_RECOVERY_VENDOR as an argument to this function?
> 
> Then, from the survivability mode you call:
> xe_device_declare_wedged(xe, DRM_WEDGE_RECOVERY_VENDOR)

The default method is used in most cases, which is why I didn't add a
parameter.

How about retaining this patch if the method is different from the default?
https://patchwork.freedesktop.org/patch/660131/?series=149756&rev=2

> 
>> +
>>   	for_each_gt(gt, xe, id)
>>   		xe_gt_declare_wedged(gt);
>>   
>> @@ -1181,8 +1188,6 @@ void xe_device_declare_wedged(struct xe_device *xe)
>>   			dev_name(xe->drm.dev));
>>   
>>   		/* Notify userspace of wedged device */
>> -		drm_dev_wedged_event(&xe->drm,
>> -				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET,
>> -				     NULL);
>> +		drm_dev_wedged_event(&xe->drm, recovery_method, NULL);
>>   	}
>>   }
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index 7e4f6d846af6..5daf5ba6bf51 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -241,6 +241,9 @@ struct xe_tile {
>>   	/** @memirq: Memory Based Interrupts. */
>>   	struct xe_memirq memirq;
>>   
>> +	/** @csc_hw_error_work: worker to report CSC HW errors */
>> +	struct work_struct csc_hw_error_work;
>> +
>>   	/** @pcode: tile's PCODE */
>>   	struct {
>>   		/** @pcode.lock: protecting tile's PCODE mailbox data */
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 0f2590839900..73c788fd0dee 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,12 +3,16 @@
>>    * Copyright © 2025 Intel Corporation
>>    */
>>   
>> +#include "regs/xe_gsc_regs.h"
>>   #include "regs/xe_hw_error_regs.h"
>>   #include "regs/xe_irq_regs.h"
>>   
>>   #include "xe_device.h"
>>   #include "xe_hw_error.h"
>>   #include "xe_mmio.h"
>> +#include "xe_survivability_mode.h"
>> +
>> +#define  HEC_UNCORR_FW_ERR_BITS 4
>>   
>>   /* Error categories reported by hardware */
>>   enum hardware_error {
>> @@ -18,6 +22,13 @@ enum hardware_error {
>>   	HARDWARE_ERROR_MAX,
>>   };
>>   
>> +static const char * const hec_uncorrected_fw_errors[] = {
>> +	"Fatal",
>> +	"CSE Disabled",
>> +	"FD Corruption",
>> +	"Data Corruption"
>> +};
>> +
>>   static const char *hw_error_to_str(const enum hardware_error hw_err)
>>   {
>>   	switch (hw_err) {
>> @@ -32,6 +43,58 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
>>   	}
>>   }
>>   
>> +static void csc_hw_error_work(struct work_struct *work)
>> +{
>> +	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	int ret;
>> +
>> +	ret = xe_survivability_mode_enable(xe, XE_SURVIVABILITY_TYPE_RUNTIME);
>> +	if (ret)
>> +		drm_err(&xe->drm, "Failed to enable runtime survivability mode\n");
> 
> This could simply call a function xe_survivability_mode_runtime(xe), which
> declares the device wedged with vendor specific reason.

Will do this based on the decision in
[3/7] Add support for Runtime survivability mode

Thanks
Riana
> 
>> +
>> +	xe_device_declare_wedged(xe);
>> +}
>> +
>> +static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>> +{
>> +	const char *hw_err_str = hw_error_to_str(hw_err);
>> +	struct xe_device *xe = tile_to_xe(tile);
>> +	struct xe_mmio *mmio = &tile->mmio;
>> +	u32 base, err_bit, err_src;
>> +	unsigned long fw_err;
>> +
>> +	if (xe->info.platform != XE_BATTLEMAGE)
>> +		return;
>> +
>> +	/* Not supported in BMG */
>> +	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>> +		return;
>> +
>> +	base = BMG_GSC_HECI1_BASE;
>> +	lockdep_assert_held(&xe->irq.lock);
>> +	err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>> +	if (!err_src) {
>> +		drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported HEC_ERR_STATUS_%s blank\n",
>> +				    tile->id, hw_err_str);
>> +		return;
>> +	}
>> +
>> +	if (err_src & UNCORR_FW_REPORTED_ERR) {
>> +		fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
>> +		for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
>> +			drm_err_ratelimited(&xe->drm, HW_ERR
>> +					    "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
>> +					     hw_err_str, hec_uncorrected_fw_errors[err_bit],
>> +					     err_bit);
>> +
>> +			schedule_work(&tile->csc_hw_error_work);
>> +		}
>> +	}
>> +
>> +	xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>> +}
>> +
>>   static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>   {
>>   	const char *hw_err_str = hw_error_to_str(hw_err);
>> @@ -50,7 +113,8 @@ static void hw_error_source_handler(struct xe_tile *tile, const enum hardware_er
>>   		goto unlock;
>>   	}
>>   
>> -	/* TODO: Process errrors per source */
>> +	if (err_src & XE_CSC_ERROR)
>> +		csc_hw_error_handler(tile, hw_err);
>>   
>>   	xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err), err_src);
>>   
>> @@ -101,8 +165,12 @@ static void process_hw_errors(struct xe_device *xe)
>>    */
>>   void xe_hw_error_init(struct xe_device *xe)
>>   {
>> +	struct xe_tile *tile = xe_device_get_root_tile(xe);
>> +
>>   	if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>>   		return;
>>   
>> +	INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>> +
>>   	process_hw_errors(xe);
>>   }
>> -- 
>> 2.47.1
>>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-02 14:11 ` [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
  2025-07-02 21:35   ` Rodrigo Vivi
@ 2025-07-09 17:57   ` Summers, Stuart
  2025-07-10  5:38     ` Riana Tauro
  1 sibling, 1 reply; 36+ messages in thread
From: Summers, Stuart @ 2025-07-09 17:57 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Tauro,  Riana
  Cc: Jadav, Raag, Anirban, Sk, Vivi, Rodrigo, Scarbrough, Frank,
	aravind.iddamsetty@linux.intel.com, Gupta, Anshuman,
	Nerlige Ramappa, Umesh, De Marchi, Lucas

On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
> Add support to handle CSC firmware reported errors. When CSC firmware
> errors are encountered, an error interrupt is received by the GFX device
> as an MSI interrupt.
> 
> The Device Source control registers indicate the source of the error as
> CSC. The HEC error status register indicates that the error is firmware
> reported. Depending on the type of error, the error cause is written to
> the HEC Firmware error register.
> 
> On encountering such CSC firmware errors, the graphics device is
> non-recoverable from driver context. The only way to recover from these
> errors is a firmware flash. The device is then wedged and userspace is
> notified with a drm uevent.
> 
> v2: use vendor recovery method with
>     runtime survivability (Christian, Rodrigo, Raag)
> 
> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> ---
>  drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
>  drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
>  drivers/gpu/drm/xe/xe_device.c             | 11 +++-
>  drivers/gpu/drm/xe/xe_device_types.h       |  3 +
>  drivers/gpu/drm/xe/xe_hw_error.c           | 70
> +++++++++++++++++++++-
>  5 files changed, 88 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> index 9b66cc972a63..180be82672ab 100644
> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
> @@ -13,6 +13,8 @@
>  
>  /* Definitions of GSC H/W registers, bits, etc */
>  
> +#define BMG_GSC_HECI1_BASE     0x373000
> +
>  #define MTL_GSC_HECI1_BASE     0x00116000
>  #define MTL_GSC_HECI2_BASE     0x00117000
>  
> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> index ed9b81fb28a0..c146b9ef44eb 100644
> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
> @@ -6,10 +6,15 @@
>  #ifndef _XE_HW_ERROR_REGS_H_
>  #define _XE_HW_ERROR_REGS_H_
>  
> +#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base)
> + 0x118)
> +#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
> +
> +#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base)
> + 0x124)
> +
>  #define DEV_ERR_STAT_NONFATAL                  0x100178
>  #define DEV_ERR_STAT_CORRECTABLE               0x10017c
>  #define
> DEV_ERR_STAT_REG(x)                    XE_REG(_PICK_EVEN((x), \
>                                                                  
> DEV_ERR_STAT_CORRECTABLE, \
>                                                                  
> DEV_ERR_STAT_NONFATAL))
> -
> +#define   XE_CSC_ERROR                         BIT(17)
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_device.c
> b/drivers/gpu/drm/xe/xe_device.c
> index d6b680abc3ae..fbc50cebfc11 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1154,6 +1154,7 @@ static void xe_device_wedged_fini(struct
> drm_device *drm, void *arg)
>   */
>  void xe_device_declare_wedged(struct xe_device *xe)
>  {
> +       unsigned long recovery_method;
>         struct xe_gt *gt;
>         u8 id;
>  
> @@ -1169,6 +1170,12 @@ void xe_device_declare_wedged(struct xe_device
> *xe)
>                 return;
>         }
>  
> +       /* Default recovery method */
> +       recovery_method = DRM_WEDGE_RECOVERY_REBIND |
> DRM_WEDGE_RECOVERY_BUS_RESET;
> +
> +       if (xe_survivability_mode_is_runtime(xe))
> +               recovery_method = DRM_WEDGE_RECOVERY_VENDOR;
> +
>         for_each_gt(gt, xe, id)
>                 xe_gt_declare_wedged(gt);
>  
> @@ -1181,8 +1188,6 @@ void xe_device_declare_wedged(struct xe_device
> *xe)
>                         dev_name(xe->drm.dev));
>  
>                 /* Notify userspace of wedged device */
> -               drm_dev_wedged_event(&xe->drm,
> -                                    DRM_WEDGE_RECOVERY_REBIND |
> DRM_WEDGE_RECOVERY_BUS_RESET,
> -                                    NULL);
> +               drm_dev_wedged_event(&xe->drm, recovery_method,
> NULL);
>         }
>  }
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> index 7e4f6d846af6..5daf5ba6bf51 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -241,6 +241,9 @@ struct xe_tile {
>         /** @memirq: Memory Based Interrupts. */
>         struct xe_memirq memirq;
>  
> +       /** @csc_hw_error_work: worker to report CSC HW errors */
> +       struct work_struct csc_hw_error_work;
> +
>         /** @pcode: tile's PCODE */
>         struct {
>                 /** @pcode.lock: protecting tile's PCODE mailbox data
> */
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c
> b/drivers/gpu/drm/xe/xe_hw_error.c
> index 0f2590839900..73c788fd0dee 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -3,12 +3,16 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include "regs/xe_gsc_regs.h"
>  #include "regs/xe_hw_error_regs.h"
>  #include "regs/xe_irq_regs.h"
>  
>  #include "xe_device.h"
>  #include "xe_hw_error.h"
>  #include "xe_mmio.h"
> +#include "xe_survivability_mode.h"
> +
> +#define  HEC_UNCORR_FW_ERR_BITS 4
>  
>  /* Error categories reported by hardware */
>  enum hardware_error {
> @@ -18,6 +22,13 @@ enum hardware_error {
>         HARDWARE_ERROR_MAX,
>  };
>  
> +static const char * const hec_uncorrected_fw_errors[] = {
> +       "Fatal",
> +       "CSE Disabled",
> +       "FD Corruption",
> +       "Data Corruption"
> +};
> +
>  static const char *hw_error_to_str(const enum hardware_error hw_err)
>  {
>         switch (hw_err) {
> @@ -32,6 +43,58 @@ static const char *hw_error_to_str(const enum
> hardware_error hw_err)
>         }
>  }
>  
> +static void csc_hw_error_work(struct work_struct *work)
> +{
> +       struct xe_tile *tile = container_of(work, typeof(*tile),
> csc_hw_error_work);
> +       struct xe_device *xe = tile_to_xe(tile);
> +       int ret;
> +
> +       ret = xe_survivability_mode_enable(xe,
> XE_SURVIVABILITY_TYPE_RUNTIME);
> +       if (ret)
> +               drm_err(&xe->drm, "Failed to enable runtime
> survivability mode\n");
> +
> +       xe_device_declare_wedged(xe);
> +}
> +
> +static void csc_hw_error_handler(struct xe_tile *tile, const enum
> hardware_error hw_err)
> +{
> +       const char *hw_err_str = hw_error_to_str(hw_err);
> +       struct xe_device *xe = tile_to_xe(tile);
> +       struct xe_mmio *mmio = &tile->mmio;
> +       u32 base, err_bit, err_src;
> +       unsigned long fw_err;
> +
> +       if (xe->info.platform != XE_BATTLEMAGE)
> +               return;
> +
> +       /* Not supported in BMG */
> +       if (hw_err == HARDWARE_ERROR_CORRECTABLE)
> +               return;

Again, here and above, why are we specifically limiting this to BMG?

Thanks,
Stuart

> +
> +       base = BMG_GSC_HECI1_BASE;
> +       lockdep_assert_held(&xe->irq.lock);
> +       err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
> +       if (!err_src) {
> +               drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported
> HEC_ERR_STATUS_%s blank\n",
> +                                   tile->id, hw_err_str);
> +               return;
> +       }
> +
> +       if (err_src & UNCORR_FW_REPORTED_ERR) {
> +               fw_err = xe_mmio_read32(mmio,
> HEC_UNCORR_FW_ERR_DW0(base));
> +               for_each_set_bit(err_bit, &fw_err,
> HEC_UNCORR_FW_ERR_BITS) {
> +                       drm_err_ratelimited(&xe->drm, HW_ERR
> +                                           "%s: HEC Uncorrected FW
> %s error reported, bit[%d] is set\n",
> +                                            hw_err_str,
> hec_uncorrected_fw_errors[err_bit],
> +                                            err_bit);
> +
> +                       schedule_work(&tile->csc_hw_error_work);
> +               }
> +       }
> +
> +       xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
> +}
> +
>  static void hw_error_source_handler(struct xe_tile *tile, const enum
> hardware_error hw_err)
>  {
>         const char *hw_err_str = hw_error_to_str(hw_err);
> @@ -50,7 +113,8 @@ static void hw_error_source_handler(struct xe_tile
> *tile, const enum hardware_er
>                 goto unlock;
>         }
>  
> -       /* TODO: Process errrors per source */

I still think we should have a print here to show the errors we
received, especially since CSC isn't the only bit here. We're only
implementing recovery support for that case.

Thanks,
Stuart

> +       if (err_src & XE_CSC_ERROR)
> +               csc_hw_error_handler(tile, hw_err);
>  
>         xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err),
> err_src);
>  
> @@ -101,8 +165,12 @@ static void process_hw_errors(struct xe_device
> *xe)
>   */
>  void xe_hw_error_init(struct xe_device *xe)
>  {
> +       struct xe_tile *tile = xe_device_get_root_tile(xe);
> +
>         if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>                 return;
>  
> +       INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
> +
>         process_hw_errors(xe);
>  }
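
One way the print suggested in the review above might look, as a user-space sketch rather than driver code. `report_error_sources()` and `csc_error_pending()` are hypothetical helpers, `printf()` stands in for `drm_err_ratelimited()`, and only the position of `XE_CSC_ERROR` (bit 17) is taken from the patch:

```c
#include <stdint.h>
#include <stdio.h>

/* Position of XE_CSC_ERROR in the device error status value (BIT(17) in the patch). */
#define XE_CSC_ERROR_BIT 17

/* Hypothetical helper: nonzero when the CSC source bit is set in err_src. */
static int csc_error_pending(uint32_t err_src)
{
	return (err_src >> XE_CSC_ERROR_BIT) & 1;
}

/*
 * Hypothetical helper: log every set bit of err_src and return how many
 * bits were set. Only the CSC bit has a handler in this series; every
 * other bit is reported and otherwise left alone.
 */
static int report_error_sources(uint32_t err_src)
{
	int count = 0;

	for (unsigned int bit = 0; bit < 32; bit++) {
		if (!(err_src & (UINT32_C(1) << bit)))
			continue;
		count++;
		if (bit == XE_CSC_ERROR_BIT)
			printf("error source bit[%u] set: CSC (handled)\n", bit);
		else
			printf("error source bit[%u] set: no handler yet\n", bit);
	}
	return count;
}
```

With this shape the log shows every reported source, while recovery is still attempted only when `csc_error_pending()` is true.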



* Re: [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  2025-07-09 17:57   ` Summers, Stuart
@ 2025-07-10  5:38     ` Riana Tauro
  0 siblings, 0 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-10  5:38 UTC (permalink / raw)
  To: Summers, Stuart, intel-xe@lists.freedesktop.org
  Cc: Jadav, Raag, Anirban, Sk, Vivi, Rodrigo, Scarbrough, Frank,
	aravind.iddamsetty@linux.intel.com, Gupta, Anshuman,
	Nerlige Ramappa, Umesh, De Marchi, Lucas

Hi Stuart

On 7/9/2025 11:27 PM, Summers, Stuart wrote:
> On Wed, 2025-07-02 at 19:41 +0530, Riana Tauro wrote:
>> Add support to handle CSC firmware reported errors. When CSC firmware
>> errors are encountered, an error interrupt is received by the GFX
>> device as an MSI interrupt.
>>
>> The Device Source control registers indicate the source of the error
>> as CSC. The HEC error status register indicates that the error is
>> firmware reported. Depending on the type of error, the error cause is
>> written to the HEC Firmware error register.
>>
>> On encountering such CSC firmware errors, the graphics device is
>> non-recoverable from driver context. The only way to recover from
>> these errors is a firmware flash. The device is then wedged and
>> userspace is notified with a drm uevent.
>>
>> v2: use vendor recovery method with
>>      runtime survivability (Christian, Rodrigo, Raag)
>>
>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
>> ---
>>   drivers/gpu/drm/xe/regs/xe_gsc_regs.h      |  2 +
>>   drivers/gpu/drm/xe/regs/xe_hw_error_regs.h |  7 ++-
>>   drivers/gpu/drm/xe/xe_device.c             | 11 +++-
>>   drivers/gpu/drm/xe/xe_device_types.h       |  3 +
>>   drivers/gpu/drm/xe/xe_hw_error.c           | 70
>> +++++++++++++++++++++-
>>   5 files changed, 88 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> index 9b66cc972a63..180be82672ab 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_gsc_regs.h
>> @@ -13,6 +13,8 @@
>>   
>>   /* Definitions of GSC H/W registers, bits, etc */
>>   
>> +#define BMG_GSC_HECI1_BASE     0x373000
>> +
>>   #define MTL_GSC_HECI1_BASE     0x00116000
>>   #define MTL_GSC_HECI2_BASE     0x00117000
>>   
>> diff --git a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> index ed9b81fb28a0..c146b9ef44eb 100644
>> --- a/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> +++ b/drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
>> @@ -6,10 +6,15 @@
>>   #ifndef _XE_HW_ERROR_REGS_H_
>>   #define _XE_HW_ERROR_REGS_H_
>>   
>> +#define HEC_UNCORR_ERR_STATUS(base)                    XE_REG((base)
>> + 0x118)
>> +#define    UNCORR_FW_REPORTED_ERR                      BIT(6)
>> +
>> +#define HEC_UNCORR_FW_ERR_DW0(base)                    XE_REG((base)
>> + 0x124)
>> +
>>   #define DEV_ERR_STAT_NONFATAL                  0x100178
>>   #define DEV_ERR_STAT_CORRECTABLE               0x10017c
>>   #define
>> DEV_ERR_STAT_REG(x)                    XE_REG(_PICK_EVEN((x), \
>>                                                                   
>> DEV_ERR_STAT_CORRECTABLE, \
>>                                                                   
>> DEV_ERR_STAT_NONFATAL))
>> -
>> +#define   XE_CSC_ERROR                         BIT(17)
>>   #endif
>> diff --git a/drivers/gpu/drm/xe/xe_device.c
>> b/drivers/gpu/drm/xe/xe_device.c
>> index d6b680abc3ae..fbc50cebfc11 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -1154,6 +1154,7 @@ static void xe_device_wedged_fini(struct
>> drm_device *drm, void *arg)
>>    */
>>   void xe_device_declare_wedged(struct xe_device *xe)
>>   {
>> +       unsigned long recovery_method;
>>          struct xe_gt *gt;
>>          u8 id;
>>   
>> @@ -1169,6 +1170,12 @@ void xe_device_declare_wedged(struct xe_device
>> *xe)
>>                  return;
>>          }
>>   
>> +       /* Default recovery method */
>> +       recovery_method = DRM_WEDGE_RECOVERY_REBIND |
>> DRM_WEDGE_RECOVERY_BUS_RESET;
>> +
>> +       if (xe_survivability_mode_is_runtime(xe))
>> +               recovery_method = DRM_WEDGE_RECOVERY_VENDOR;
>> +
>>          for_each_gt(gt, xe, id)
>>                  xe_gt_declare_wedged(gt);
>>   
>> @@ -1181,8 +1188,6 @@ void xe_device_declare_wedged(struct xe_device
>> *xe)
>>                          dev_name(xe->drm.dev));
>>   
>>                  /* Notify userspace of wedged device */
>> -               drm_dev_wedged_event(&xe->drm,
>> -                                    DRM_WEDGE_RECOVERY_REBIND |
>> DRM_WEDGE_RECOVERY_BUS_RESET,
>> -                                    NULL);
>> +               drm_dev_wedged_event(&xe->drm, recovery_method,
>> NULL);
>>          }
>>   }
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h
>> b/drivers/gpu/drm/xe/xe_device_types.h
>> index 7e4f6d846af6..5daf5ba6bf51 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -241,6 +241,9 @@ struct xe_tile {
>>          /** @memirq: Memory Based Interrupts. */
>>          struct xe_memirq memirq;
>>   
>> +       /** @csc_hw_error_work: worker to report CSC HW errors */
>> +       struct work_struct csc_hw_error_work;
>> +
>>          /** @pcode: tile's PCODE */
>>          struct {
>>                  /** @pcode.lock: protecting tile's PCODE mailbox data
>> */
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c
>> b/drivers/gpu/drm/xe/xe_hw_error.c
>> index 0f2590839900..73c788fd0dee 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -3,12 +3,16 @@
>>    * Copyright © 2025 Intel Corporation
>>    */
>>   
>> +#include "regs/xe_gsc_regs.h"
>>   #include "regs/xe_hw_error_regs.h"
>>   #include "regs/xe_irq_regs.h"
>>   
>>   #include "xe_device.h"
>>   #include "xe_hw_error.h"
>>   #include "xe_mmio.h"
>> +#include "xe_survivability_mode.h"
>> +
>> +#define  HEC_UNCORR_FW_ERR_BITS 4
>>   
>>   /* Error categories reported by hardware */
>>   enum hardware_error {
>> @@ -18,6 +22,13 @@ enum hardware_error {
>>          HARDWARE_ERROR_MAX,
>>   };
>>   
>> +static const char * const hec_uncorrected_fw_errors[] = {
>> +       "Fatal",
>> +       "CSE Disabled",
>> +       "FD Corruption",
>> +       "Data Corruption"
>> +};
>> +
>>   static const char *hw_error_to_str(const enum hardware_error hw_err)
>>   {
>>          switch (hw_err) {
>> @@ -32,6 +43,58 @@ static const char *hw_error_to_str(const enum
>> hardware_error hw_err)
>>          }
>>   }
>>   
>> +static void csc_hw_error_work(struct work_struct *work)
>> +{
>> +       struct xe_tile *tile = container_of(work, typeof(*tile),
>> csc_hw_error_work);
>> +       struct xe_device *xe = tile_to_xe(tile);
>> +       int ret;
>> +
>> +       ret = xe_survivability_mode_enable(xe,
>> XE_SURVIVABILITY_TYPE_RUNTIME);
>> +       if (ret)
>> +               drm_err(&xe->drm, "Failed to enable runtime
>> survivability mode\n");
>> +
>> +       xe_device_declare_wedged(xe);
>> +}
>> +
>> +static void csc_hw_error_handler(struct xe_tile *tile, const enum
>> hardware_error hw_err)
>> +{
>> +       const char *hw_err_str = hw_error_to_str(hw_err);
>> +       struct xe_device *xe = tile_to_xe(tile);
>> +       struct xe_mmio *mmio = &tile->mmio;
>> +       u32 base, err_bit, err_src;
>> +       unsigned long fw_err;
>> +
>> +       if (xe->info.platform != XE_BATTLEMAGE)
>> +               return;
>> +
>> +       /* Not supported in BMG */
>> +       if (hw_err == HARDWARE_ERROR_CORRECTABLE)
>> +               return;
> 
> Again, here and above, why are we specifically limiting this to BMG?

This is the CSC error handler, and the CSC error bit is present only
from BMG onwards. The HECI base used here is also specific to BMG,
hence the platform check.
CSC on BMG doesn't support correctable errors.

Thanks
Riana

> 
> Thanks,
> Stuart
> 
>> +
>> +       base = BMG_GSC_HECI1_BASE;
>> +       lockdep_assert_held(&xe->irq.lock);
>> +       err_src = xe_mmio_read32(mmio, HEC_UNCORR_ERR_STATUS(base));
>> +       if (!err_src) {
>> +               drm_err_ratelimited(&xe->drm, HW_ERR "Tile%d reported
>> HEC_ERR_STATUS_%s blank\n",
>> +                                   tile->id, hw_err_str);
>> +               return;
>> +       }
>> +
>> +       if (err_src & UNCORR_FW_REPORTED_ERR) {
>> +               fw_err = xe_mmio_read32(mmio,
>> HEC_UNCORR_FW_ERR_DW0(base));
>> +               for_each_set_bit(err_bit, &fw_err,
>> HEC_UNCORR_FW_ERR_BITS) {
>> +                       drm_err_ratelimited(&xe->drm, HW_ERR
>> +                                           "%s: HEC Uncorrected FW
>> %s error reported, bit[%d] is set\n",
>> +                                            hw_err_str,
>> hec_uncorrected_fw_errors[err_bit],
>> +                                            err_bit);
>> +
>> +                       schedule_work(&tile->csc_hw_error_work);
>> +               }
>> +       }
>> +
>> +       xe_mmio_write32(mmio, HEC_UNCORR_ERR_STATUS(base), err_src);
>> +}
>> +
>>   static void hw_error_source_handler(struct xe_tile *tile, const enum
>> hardware_error hw_err)
>>   {
>>          const char *hw_err_str = hw_error_to_str(hw_err);
>> @@ -50,7 +113,8 @@ static void hw_error_source_handler(struct xe_tile
>> *tile, const enum hardware_er
>>                  goto unlock;
>>          }
>>   
>> -       /* TODO: Process errrors per source */
> 
> I still think we should have a print here to show the errors we
> received, especially since CSC isn't the only bit here. We're only
> implementing recovery support for that case.
> 
> Thanks,
> Stuart
> 
>> +       if (err_src & XE_CSC_ERROR)
>> +               csc_hw_error_handler(tile, hw_err);
>>   
>>          xe_mmio_write32(&tile->mmio, DEV_ERR_STAT_REG(hw_err),
>> err_src);
>>   
>> @@ -101,8 +165,12 @@ static void process_hw_errors(struct xe_device
>> *xe)
>>    */
>>   void xe_hw_error_init(struct xe_device *xe)
>>   {
>> +       struct xe_tile *tile = xe_device_get_root_tile(xe);
>> +
>>          if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
>>                  return;
>>   
>> +       INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
>> +
>>          process_hw_errors(xe);
>>   }
> 
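
The gating described in this reply, and the bit decode it protects, can be sketched in user space. The enum values and the helpers `csc_error_is_handled()`, `hec_fw_error_name()` and `hec_count_fw_errors()` are hypothetical; only the BMG-only check, the skipped correctable case, the four strings and the 4-bit decode width come from the patch:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the driver's enums; names and values are illustrative only. */
enum platform { PLATFORM_OTHER, PLATFORM_BATTLEMAGE };
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
};

/* Mirrors HEC_UNCORR_FW_ERR_BITS: only bits 0..3 are decoded. */
#define HEC_UNCORR_FW_ERR_BITS 4

/* Same strings, in the same order, as hec_uncorrected_fw_errors[] in the patch. */
static const char *const hec_uncorrected_fw_errors[HEC_UNCORR_FW_ERR_BITS] = {
	"Fatal",
	"CSE Disabled",
	"FD Corruption",
	"Data Corruption",
};

/* The CSC bit and HECI base are BMG-specific, and CSC reports no correctable errors. */
static bool csc_error_is_handled(enum platform p, enum hardware_error hw_err)
{
	if (p != PLATFORM_BATTLEMAGE)
		return false;
	if (hw_err == HARDWARE_ERROR_CORRECTABLE)
		return false;
	return true;
}

/* Name for one firmware error bit, or NULL when the bit is outside the decoded range. */
static const char *hec_fw_error_name(unsigned int bit)
{
	if (bit >= HEC_UNCORR_FW_ERR_BITS)
		return NULL;
	return hec_uncorrected_fw_errors[bit];
}

/* Count decoded bits in a raw register value; bits 4 and above are ignored. */
static int hec_count_fw_errors(uint32_t fw_err)
{
	int count = 0;

	for (unsigned int bit = 0; bit < HEC_UNCORR_FW_ERR_BITS; bit++)
		if (fw_err & (UINT32_C(1) << bit))
			count++;
	return count;
}
```

Limiting the loop to `HEC_UNCORR_FW_ERR_BITS` keeps every index inside the string table, which is the same property the bounded `for_each_set_bit()` gives the handler.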




* [PATCH v3 7/7] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (5 preceding siblings ...)
  2025-07-02 14:11 ` [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
@ 2025-07-02 14:11 ` Riana Tauro
  2025-07-02 15:53 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev3) Patchwork
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Riana Tauro @ 2025-07-02 14:11 UTC (permalink / raw)
  To: intel-xe
  Cc: riana.tauro, anshuman.gupta, rodrigo.vivi, lucas.demarchi,
	aravind.iddamsetty, raag.jadav, umesh.nerlige.ramappa,
	frank.scarbrough, sk.anirban

Add a debugfs fault handler to trigger the CSC error handler, which
wedges the device and sends a drm uevent.

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
---
 drivers/gpu/drm/xe/xe_debugfs.c  |  2 ++
 drivers/gpu/drm/xe/xe_hw_error.c | 11 +++++++++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
index d83cd6ed3fa8..134610437aea 100644
--- a/drivers/gpu/drm/xe/xe_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_debugfs.c
@@ -29,6 +29,7 @@
 #endif
 
 DECLARE_FAULT_ATTR(gt_reset_failure);
+DECLARE_FAULT_ATTR(inject_csc_hw_error);
 
 static struct xe_device *node_to_xe(struct drm_info_node *node)
 {
@@ -273,4 +274,5 @@ void xe_debugfs_register(struct xe_device *xe)
 	xe_pxp_debugfs_register(xe->pxp);
 
 	fault_create_debugfs_attr("fail_gt_reset", root, &gt_reset_failure);
+	fault_create_debugfs_attr("inject_csc_hw_error", root, &inject_csc_hw_error);
 }
diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index 73c788fd0dee..6595ffbebfb5 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -3,6 +3,8 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include <linux/fault-inject.h>
+
 #include "regs/xe_gsc_regs.h"
 #include "regs/xe_hw_error_regs.h"
 #include "regs/xe_irq_regs.h"
@@ -13,6 +15,7 @@
 #include "xe_survivability_mode.h"
 
 #define  HEC_UNCORR_FW_ERR_BITS 4
+extern struct fault_attr inject_csc_hw_error;
 
 /* Error categories reported by hardware */
 enum hardware_error {
@@ -43,6 +46,11 @@ static const char *hw_error_to_str(const enum hardware_error hw_err)
 	}
 }
 
+static bool fault_inject_csc_hw_error(void)
+{
+	return should_fail(&inject_csc_hw_error, 1);
+}
+
 static void csc_hw_error_work(struct work_struct *work)
 {
 	struct xe_tile *tile = container_of(work, typeof(*tile), csc_hw_error_work);
@@ -136,6 +144,9 @@ void xe_hw_error_irq_handler(struct xe_tile *tile, const u32 master_ctl)
 {
 	enum hardware_error hw_err;
 
+	if (fault_inject_csc_hw_error())
+		schedule_work(&tile->csc_hw_error_work);
+
 	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++)
 		if (master_ctl & ERROR_IRQ(hw_err))
 			hw_error_source_handler(tile, hw_err);
-- 
2.47.1



* ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev3)
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (6 preceding siblings ...)
  2025-07-02 14:11 ` [PATCH v3 7/7] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
@ 2025-07-02 15:53 ` Patchwork
  2025-07-02 15:54 ` ✓ CI.KUnit: success " Patchwork
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Patchwork @ 2025-07-02 15:53 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev3)
URL   : https://patchwork.freedesktop.org/series/149756/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
f8ff75ae1d2127635239b134695774ed4045d05b
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit b6d209eeae23719e57dade2159ca0bf74f6ce576
Author: Riana Tauro <riana.tauro@intel.com>
Date:   Wed Jul 2 19:41:17 2025 +0530

    drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
    
    Add a debugfs fault handler to trigger csc error handler that
    wedges the device and sends drm uevent
    
    Signed-off-by: Riana Tauro <riana.tauro@intel.com>
+ /mt/dim checkpatch 94631e6b7f655b1922e3227970cd6180200694af drm-intel
d6f006bbaf81 drm: Add a vendor-specific recovery method to device wedged uevent
99b7de8257f3 drm/xe: Set GT as wedged before sending wedged uevent
a02aa14cb53b drm/xe/xe_survivability: Add support for Runtime survivability mode
cec389e71032 drm/xe/doc: Document device wedged and runtime survivability
-:23: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#23: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 97 lines checked
0c5097c5c2d9 drm/xe: Add support to handle hardware errors
-:39: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#39: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 174 lines checked
227cced4bec0 drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
b6d209eeae23 drm/xe/xe_hw_error: Add fault injection to trigger csc error handler




* ✓ CI.KUnit: success for Handle Firmware reported Hardware Errors (rev3)
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (7 preceding siblings ...)
  2025-07-02 15:53 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev3) Patchwork
@ 2025-07-02 15:54 ` Patchwork
  2025-07-02 16:17 ` ✗ CI.checksparse: warning " Patchwork
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Patchwork @ 2025-07-02 15:54 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev3)
URL   : https://patchwork.freedesktop.org/series/149756/
State : success

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[15:53:05] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[15:53:09] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[15:53:38] Starting KUnit Kernel (1/1)...
[15:53:38] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[15:53:38] ================== guc_buf (11 subtests) ===================
[15:53:38] [PASSED] test_smallest
[15:53:38] [PASSED] test_largest
[15:53:38] [PASSED] test_granular
[15:53:38] [PASSED] test_unique
[15:53:38] [PASSED] test_overlap
[15:53:38] [PASSED] test_reusable
[15:53:38] [PASSED] test_too_big
[15:53:38] [PASSED] test_flush
[15:53:38] [PASSED] test_lookup
[15:53:38] [PASSED] test_data
[15:53:38] [PASSED] test_class
[15:53:38] ===================== [PASSED] guc_buf =====================
[15:53:38] =================== guc_dbm (7 subtests) ===================
[15:53:38] [PASSED] test_empty
[15:53:38] [PASSED] test_default
[15:53:38] ======================== test_size  ========================
[15:53:38] [PASSED] 4
[15:53:38] [PASSED] 8
[15:53:38] [PASSED] 32
[15:53:38] [PASSED] 256
[15:53:38] ==================== [PASSED] test_size ====================
[15:53:38] ======================= test_reuse  ========================
[15:53:38] [PASSED] 4
[15:53:38] [PASSED] 8
[15:53:38] [PASSED] 32
[15:53:38] [PASSED] 256
[15:53:38] =================== [PASSED] test_reuse ====================
[15:53:38] =================== test_range_overlap  ====================
[15:53:38] [PASSED] 4
[15:53:38] [PASSED] 8
[15:53:38] [PASSED] 32
[15:53:38] [PASSED] 256
[15:53:38] =============== [PASSED] test_range_overlap ================
[15:53:38] =================== test_range_compact  ====================
[15:53:38] [PASSED] 4
[15:53:38] [PASSED] 8
[15:53:38] [PASSED] 32
[15:53:38] [PASSED] 256
[15:53:38] =============== [PASSED] test_range_compact ================
[15:53:38] ==================== test_range_spare  =====================
[15:53:38] [PASSED] 4
[15:53:38] [PASSED] 8
[15:53:38] [PASSED] 32
[15:53:38] [PASSED] 256
[15:53:38] ================ [PASSED] test_range_spare =================
[15:53:38] ===================== [PASSED] guc_dbm =====================
[15:53:38] =================== guc_idm (6 subtests) ===================
[15:53:38] [PASSED] bad_init
[15:53:38] [PASSED] no_init
[15:53:38] [PASSED] init_fini
[15:53:38] [PASSED] check_used
[15:53:38] [PASSED] check_quota
[15:53:38] [PASSED] check_all
[15:53:38] ===================== [PASSED] guc_idm =====================
[15:53:38] ================== no_relay (3 subtests) ===================
[15:53:38] [PASSED] xe_drops_guc2pf_if_not_ready
[15:53:38] [PASSED] xe_drops_guc2vf_if_not_ready
[15:53:38] [PASSED] xe_rejects_send_if_not_ready
[15:53:38] ==================== [PASSED] no_relay =====================
[15:53:38] ================== pf_relay (14 subtests) ==================
[15:53:38] [PASSED] pf_rejects_guc2pf_too_short
[15:53:38] [PASSED] pf_rejects_guc2pf_too_long
[15:53:38] [PASSED] pf_rejects_guc2pf_no_payload
[15:53:38] [PASSED] pf_fails_no_payload
[15:53:38] [PASSED] pf_fails_bad_origin
[15:53:38] [PASSED] pf_fails_bad_type
[15:53:38] [PASSED] pf_txn_reports_error
[15:53:38] [PASSED] pf_txn_sends_pf2guc
[15:53:38] [PASSED] pf_sends_pf2guc
[15:53:38] [SKIPPED] pf_loopback_nop
[15:53:38] [SKIPPED] pf_loopback_echo
[15:53:38] [SKIPPED] pf_loopback_fail
[15:53:38] [SKIPPED] pf_loopback_busy
[15:53:38] [SKIPPED] pf_loopback_retry
[15:53:38] ==================== [PASSED] pf_relay =====================
[15:53:38] ================== vf_relay (3 subtests) ===================
[15:53:38] [PASSED] vf_rejects_guc2vf_too_short
[15:53:38] [PASSED] vf_rejects_guc2vf_too_long
[15:53:38] [PASSED] vf_rejects_guc2vf_no_payload
[15:53:38] ==================== [PASSED] vf_relay =====================
[15:53:38] ================= pf_service (11 subtests) =================
[15:53:38] [PASSED] pf_negotiate_any
[15:53:38] [PASSED] pf_negotiate_base_match
[15:53:38] [PASSED] pf_negotiate_base_newer
[15:53:38] [PASSED] pf_negotiate_base_next
[15:53:38] [SKIPPED] pf_negotiate_base_older
[15:53:38] [PASSED] pf_negotiate_base_prev
[15:53:38] [PASSED] pf_negotiate_latest_match
[15:53:38] [PASSED] pf_negotiate_latest_newer
[15:53:38] [PASSED] pf_negotiate_latest_next
[15:53:38] [SKIPPED] pf_negotiate_latest_older
[15:53:38] [SKIPPED] pf_negotiate_latest_prev
[15:53:38] =================== [PASSED] pf_service ====================
[15:53:38] ===================== lmtt (1 subtest) =====================
[15:53:38] ======================== test_ops  =========================
[15:53:38] [PASSED] 2-level
[15:53:38] [PASSED] multi-level
[15:53:38] ==================== [PASSED] test_ops =====================
[15:53:38] ====================== [PASSED] lmtt =======================
[15:53:38] =================== xe_mocs (2 subtests) ===================
[15:53:38] ================ xe_live_mocs_kernel_kunit  ================
[15:53:38] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[15:53:38] ================ xe_live_mocs_reset_kunit  =================
[15:53:38] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[15:53:38] ==================== [SKIPPED] xe_mocs =====================
[15:53:38] ================= xe_migrate (2 subtests) ==================
[15:53:38] ================= xe_migrate_sanity_kunit  =================
[15:53:38] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[15:53:38] ================== xe_validate_ccs_kunit  ==================
[15:53:38] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[15:53:38] =================== [SKIPPED] xe_migrate ===================
[15:53:38] ================== xe_dma_buf (1 subtest) ==================
[15:53:38] ==================== xe_dma_buf_kunit  =====================
[15:53:38] ================ [SKIPPED] xe_dma_buf_kunit ================
[15:53:38] =================== [SKIPPED] xe_dma_buf ===================
[15:53:38] ================= xe_bo_shrink (1 subtest) =================
[15:53:38] =================== xe_bo_shrink_kunit  ====================
[15:53:38] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[15:53:38] ================== [SKIPPED] xe_bo_shrink ==================
[15:53:38] ==================== xe_bo (2 subtests) ====================
[15:53:38] ================== xe_ccs_migrate_kunit  ===================
[15:53:38] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[15:53:38] ==================== xe_bo_evict_kunit  ====================
[15:53:38] =============== [SKIPPED] xe_bo_evict_kunit ================
[15:53:38] ===================== [SKIPPED] xe_bo ======================
[15:53:38] ==================== args (11 subtests) ====================
[15:53:38] [PASSED] count_args_test
[15:53:38] [PASSED] call_args_example
[15:53:38] [PASSED] call_args_test
[15:53:38] [PASSED] drop_first_arg_example
[15:53:38] [PASSED] drop_first_arg_test
[15:53:38] [PASSED] first_arg_example
[15:53:38] [PASSED] first_arg_test
[15:53:38] [PASSED] last_arg_example
[15:53:38] [PASSED] last_arg_test
[15:53:38] [PASSED] pick_arg_example
[15:53:38] [PASSED] sep_comma_example
[15:53:38] ====================== [PASSED] args =======================
[15:53:38] =================== xe_pci (2 subtests) ====================
[15:53:38] ==================== check_graphics_ip  ====================
[15:53:38] [PASSED] 12.70 Xe_LPG
[15:53:38] [PASSED] 12.71 Xe_LPG
[15:53:38] [PASSED] 12.74 Xe_LPG+
[15:53:38] [PASSED] 20.01 Xe2_HPG
[15:53:38] [PASSED] 20.02 Xe2_HPG
[15:53:38] [PASSED] 20.04 Xe2_LPG
[15:53:38] [PASSED] 30.00 Xe3_LPG
[15:53:38] [PASSED] 30.01 Xe3_LPG
[15:53:38] [PASSED] 30.03 Xe3_LPG
[15:53:38] ================ [PASSED] check_graphics_ip ================
[15:53:38] ===================== check_media_ip  ======================
[15:53:38] [PASSED] 13.00 Xe_LPM+
[15:53:38] [PASSED] 13.01 Xe2_HPM
[15:53:38] [PASSED] 20.00 Xe2_LPM
[15:53:38] [PASSED] 30.00 Xe3_LPM
[15:53:38] [PASSED] 30.02 Xe3_LPM
stty: 'standard input': Inappropriate ioctl for device
[15:53:38] ================= [PASSED] check_media_ip ==================
[15:53:38] ===================== [PASSED] xe_pci ======================
[15:53:38] =================== xe_rtp (2 subtests) ====================
[15:53:38] =============== xe_rtp_process_to_sr_tests  ================
[15:53:38] [PASSED] coalesce-same-reg
[15:53:38] [PASSED] no-match-no-add
[15:53:38] [PASSED] match-or
[15:53:38] [PASSED] match-or-xfail
[15:53:38] [PASSED] no-match-no-add-multiple-rules
[15:53:38] [PASSED] two-regs-two-entries
[15:53:38] [PASSED] clr-one-set-other
[15:53:38] [PASSED] set-field
[15:53:38] [PASSED] conflict-duplicate
[15:53:38] [PASSED] conflict-not-disjoint
[15:53:38] [PASSED] conflict-reg-type
[15:53:38] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[15:53:38] ================== xe_rtp_process_tests  ===================
[15:53:38] [PASSED] active1
[15:53:38] [PASSED] active2
[15:53:38] [PASSED] active-inactive
[15:53:38] [PASSED] inactive-active
[15:53:38] [PASSED] inactive-1st_or_active-inactive
[15:53:38] [PASSED] inactive-2nd_or_active-inactive
[15:53:38] [PASSED] inactive-last_or_active-inactive
[15:53:38] [PASSED] inactive-no_or_active-inactive
[15:53:38] ============== [PASSED] xe_rtp_process_tests ===============
[15:53:38] ===================== [PASSED] xe_rtp ======================
[15:53:38] ==================== xe_wa (1 subtest) =====================
[15:53:38] ======================== xe_wa_gt  =========================
[15:53:38] [PASSED] TIGERLAKE (B0)
[15:53:38] [PASSED] DG1 (A0)
[15:53:38] [PASSED] DG1 (B0)
[15:53:38] [PASSED] ALDERLAKE_S (A0)
[15:53:38] [PASSED] ALDERLAKE_S (B0)
[15:53:38] [PASSED] ALDERLAKE_S (C0)
[15:53:38] [PASSED] ALDERLAKE_S (D0)
[15:53:38] [PASSED] ALDERLAKE_P (A0)
[15:53:38] [PASSED] ALDERLAKE_P (B0)
[15:53:38] [PASSED] ALDERLAKE_P (C0)
[15:53:38] [PASSED] ALDERLAKE_S_RPLS (D0)
[15:53:38] [PASSED] ALDERLAKE_P_RPLU (E0)
[15:53:38] [PASSED] DG2_G10 (C0)
[15:53:38] [PASSED] DG2_G11 (B1)
[15:53:38] [PASSED] DG2_G12 (A1)
[15:53:38] [PASSED] METEORLAKE (g:A0, m:A0)
[15:53:38] [PASSED] METEORLAKE (g:A0, m:A0)
[15:53:38] [PASSED] METEORLAKE (g:A0, m:A0)
[15:53:38] [PASSED] LUNARLAKE (g:A0, m:A0)
[15:53:38] [PASSED] LUNARLAKE (g:B0, m:A0)
[15:53:38] [PASSED] BATTLEMAGE (g:A0, m:A1)
[15:53:38] ==================== [PASSED] xe_wa_gt =====================
[15:53:38] ====================== [PASSED] xe_wa ======================
[15:53:38] ============================================================
[15:53:38] Testing complete. Ran 145 tests: passed: 129, skipped: 16
[15:53:38] Elapsed time: 33.390s total, 4.288s configuring, 28.785s building, 0.306s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[15:53:38] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[15:53:40] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[15:54:03] Starting KUnit Kernel (1/1)...
[15:54:03] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[15:54:03] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[15:54:03] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[15:54:03] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[15:54:03] =========== drm_validate_clone_mode (2 subtests) ===========
[15:54:03] ============== drm_test_check_in_clone_mode  ===============
[15:54:03] [PASSED] in_clone_mode
[15:54:03] [PASSED] not_in_clone_mode
[15:54:03] ========== [PASSED] drm_test_check_in_clone_mode ===========
[15:54:03] =============== drm_test_check_valid_clones  ===============
[15:54:03] [PASSED] not_in_clone_mode
[15:54:03] [PASSED] valid_clone
[15:54:03] [PASSED] invalid_clone
[15:54:03] =========== [PASSED] drm_test_check_valid_clones ===========
[15:54:03] ============= [PASSED] drm_validate_clone_mode =============
[15:54:03] ============= drm_validate_modeset (1 subtest) =============
[15:54:03] [PASSED] drm_test_check_connector_changed_modeset
[15:54:03] ============== [PASSED] drm_validate_modeset ===============
[15:54:03] ====== drm_test_bridge_get_current_state (2 subtests) ======
[15:54:03] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[15:54:03] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[15:54:03] ======== [PASSED] drm_test_bridge_get_current_state ========
[15:54:03] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[15:54:03] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[15:54:03] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[15:54:03] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[15:54:03] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[15:54:03] ============== drm_bridge_alloc (2 subtests) ===============
[15:54:03] [PASSED] drm_test_drm_bridge_alloc_basic
[15:54:03] [PASSED] drm_test_drm_bridge_alloc_get_put
[15:54:03] ================ [PASSED] drm_bridge_alloc =================
[15:54:03] ================== drm_buddy (7 subtests) ==================
[15:54:03] [PASSED] drm_test_buddy_alloc_limit
[15:54:03] [PASSED] drm_test_buddy_alloc_optimistic
[15:54:03] [PASSED] drm_test_buddy_alloc_pessimistic
[15:54:03] [PASSED] drm_test_buddy_alloc_pathological
[15:54:03] [PASSED] drm_test_buddy_alloc_contiguous
[15:54:03] [PASSED] drm_test_buddy_alloc_clear
[15:54:03] [PASSED] drm_test_buddy_alloc_range_bias
[15:54:03] ==================== [PASSED] drm_buddy ====================
[15:54:03] ============= drm_cmdline_parser (40 subtests) =============
[15:54:03] [PASSED] drm_test_cmdline_force_d_only
[15:54:03] [PASSED] drm_test_cmdline_force_D_only_dvi
[15:54:03] [PASSED] drm_test_cmdline_force_D_only_hdmi
[15:54:03] [PASSED] drm_test_cmdline_force_D_only_not_digital
[15:54:03] [PASSED] drm_test_cmdline_force_e_only
[15:54:03] [PASSED] drm_test_cmdline_res
[15:54:03] [PASSED] drm_test_cmdline_res_vesa
[15:54:03] [PASSED] drm_test_cmdline_res_vesa_rblank
[15:54:03] [PASSED] drm_test_cmdline_res_rblank
[15:54:03] [PASSED] drm_test_cmdline_res_bpp
[15:54:03] [PASSED] drm_test_cmdline_res_refresh
[15:54:03] [PASSED] drm_test_cmdline_res_bpp_refresh
[15:54:03] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[15:54:03] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[15:54:03] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[15:54:03] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[15:54:03] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[15:54:03] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[15:54:03] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[15:54:03] [PASSED] drm_test_cmdline_res_margins_force_on
[15:54:03] [PASSED] drm_test_cmdline_res_vesa_margins
[15:54:03] [PASSED] drm_test_cmdline_name
[15:54:03] [PASSED] drm_test_cmdline_name_bpp
[15:54:03] [PASSED] drm_test_cmdline_name_option
[15:54:03] [PASSED] drm_test_cmdline_name_bpp_option
[15:54:03] [PASSED] drm_test_cmdline_rotate_0
[15:54:03] [PASSED] drm_test_cmdline_rotate_90
[15:54:03] [PASSED] drm_test_cmdline_rotate_180
[15:54:03] [PASSED] drm_test_cmdline_rotate_270
[15:54:03] [PASSED] drm_test_cmdline_hmirror
[15:54:03] [PASSED] drm_test_cmdline_vmirror
[15:54:03] [PASSED] drm_test_cmdline_margin_options
[15:54:03] [PASSED] drm_test_cmdline_multiple_options
[15:54:03] [PASSED] drm_test_cmdline_bpp_extra_and_option
[15:54:03] [PASSED] drm_test_cmdline_extra_and_option
[15:54:03] [PASSED] drm_test_cmdline_freestanding_options
[15:54:03] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[15:54:03] [PASSED] drm_test_cmdline_panel_orientation
[15:54:03] ================ drm_test_cmdline_invalid  =================
[15:54:03] [PASSED] margin_only
[15:54:03] [PASSED] interlace_only
[15:54:03] [PASSED] res_missing_x
[15:54:03] [PASSED] res_missing_y
[15:54:03] [PASSED] res_bad_y
[15:54:03] [PASSED] res_missing_y_bpp
[15:54:03] [PASSED] res_bad_bpp
[15:54:03] [PASSED] res_bad_refresh
[15:54:03] [PASSED] res_bpp_refresh_force_on_off
[15:54:03] [PASSED] res_invalid_mode
[15:54:03] [PASSED] res_bpp_wrong_place_mode
[15:54:03] [PASSED] name_bpp_refresh
[15:54:03] [PASSED] name_refresh
[15:54:03] [PASSED] name_refresh_wrong_mode
[15:54:03] [PASSED] name_refresh_invalid_mode
[15:54:03] [PASSED] rotate_multiple
[15:54:03] [PASSED] rotate_invalid_val
[15:54:03] [PASSED] rotate_truncated
[15:54:03] [PASSED] invalid_option
[15:54:03] [PASSED] invalid_tv_option
[15:54:03] [PASSED] truncated_tv_option
[15:54:03] ============ [PASSED] drm_test_cmdline_invalid =============
[15:54:03] =============== drm_test_cmdline_tv_options  ===============
[15:54:03] [PASSED] NTSC
[15:54:03] [PASSED] NTSC_443
[15:54:03] [PASSED] NTSC_J
[15:54:03] [PASSED] PAL
[15:54:03] [PASSED] PAL_M
[15:54:03] [PASSED] PAL_N
[15:54:03] [PASSED] SECAM
[15:54:03] [PASSED] MONO_525
[15:54:03] [PASSED] MONO_625
[15:54:03] =========== [PASSED] drm_test_cmdline_tv_options ===========
[15:54:03] =============== [PASSED] drm_cmdline_parser ================
[15:54:03] ========== drmm_connector_hdmi_init (20 subtests) ==========
[15:54:03] [PASSED] drm_test_connector_hdmi_init_valid
[15:54:03] [PASSED] drm_test_connector_hdmi_init_bpc_8
[15:54:03] [PASSED] drm_test_connector_hdmi_init_bpc_10
[15:54:03] [PASSED] drm_test_connector_hdmi_init_bpc_12
[15:54:03] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[15:54:03] [PASSED] drm_test_connector_hdmi_init_bpc_null
[15:54:03] [PASSED] drm_test_connector_hdmi_init_formats_empty
[15:54:03] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[15:54:03] === drm_test_connector_hdmi_init_formats_yuv420_allowed  ===
[15:54:03] [PASSED] supported_formats=0x9 yuv420_allowed=1
[15:54:03] [PASSED] supported_formats=0x9 yuv420_allowed=0
[15:54:03] [PASSED] supported_formats=0x3 yuv420_allowed=1
[15:54:03] [PASSED] supported_formats=0x3 yuv420_allowed=0
[15:54:03] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[15:54:03] [PASSED] drm_test_connector_hdmi_init_null_ddc
[15:54:03] [PASSED] drm_test_connector_hdmi_init_null_product
[15:54:03] [PASSED] drm_test_connector_hdmi_init_null_vendor
[15:54:03] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[15:54:03] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[15:54:03] [PASSED] drm_test_connector_hdmi_init_product_valid
[15:54:03] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[15:54:03] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[15:54:03] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[15:54:03] ========= drm_test_connector_hdmi_init_type_valid  =========
[15:54:03] [PASSED] HDMI-A
[15:54:03] [PASSED] HDMI-B
[15:54:03] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[15:54:03] ======== drm_test_connector_hdmi_init_type_invalid  ========
[15:54:03] [PASSED] Unknown
[15:54:03] [PASSED] VGA
[15:54:03] [PASSED] DVI-I
[15:54:03] [PASSED] DVI-D
[15:54:03] [PASSED] DVI-A
[15:54:03] [PASSED] Composite
[15:54:03] [PASSED] SVIDEO
[15:54:03] [PASSED] LVDS
[15:54:03] [PASSED] Component
[15:54:03] [PASSED] DIN
[15:54:03] [PASSED] DP
[15:54:03] [PASSED] TV
[15:54:03] [PASSED] eDP
[15:54:03] [PASSED] Virtual
[15:54:03] [PASSED] DSI
[15:54:03] [PASSED] DPI
[15:54:03] [PASSED] Writeback
[15:54:03] [PASSED] SPI
[15:54:03] [PASSED] USB
[15:54:03] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[15:54:03] ============ [PASSED] drmm_connector_hdmi_init =============
[15:54:03] ============= drmm_connector_init (3 subtests) =============
[15:54:03] [PASSED] drm_test_drmm_connector_init
[15:54:03] [PASSED] drm_test_drmm_connector_init_null_ddc
[15:54:03] ========= drm_test_drmm_connector_init_type_valid  =========
[15:54:03] [PASSED] Unknown
[15:54:03] [PASSED] VGA
[15:54:03] [PASSED] DVI-I
[15:54:03] [PASSED] DVI-D
[15:54:03] [PASSED] DVI-A
[15:54:03] [PASSED] Composite
[15:54:03] [PASSED] SVIDEO
[15:54:03] [PASSED] LVDS
[15:54:03] [PASSED] Component
[15:54:03] [PASSED] DIN
[15:54:03] [PASSED] DP
[15:54:03] [PASSED] HDMI-A
[15:54:03] [PASSED] HDMI-B
[15:54:03] [PASSED] TV
[15:54:03] [PASSED] eDP
[15:54:03] [PASSED] Virtual
[15:54:03] [PASSED] DSI
[15:54:03] [PASSED] DPI
[15:54:03] [PASSED] Writeback
[15:54:03] [PASSED] SPI
[15:54:03] [PASSED] USB
[15:54:03] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[15:54:03] =============== [PASSED] drmm_connector_init ===============
[15:54:03] ========= drm_connector_dynamic_init (6 subtests) ==========
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_init
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_init_properties
[15:54:03] ===== drm_test_drm_connector_dynamic_init_type_valid  ======
[15:54:03] [PASSED] Unknown
[15:54:03] [PASSED] VGA
[15:54:03] [PASSED] DVI-I
[15:54:03] [PASSED] DVI-D
[15:54:03] [PASSED] DVI-A
[15:54:03] [PASSED] Composite
[15:54:03] [PASSED] SVIDEO
[15:54:03] [PASSED] LVDS
[15:54:03] [PASSED] Component
[15:54:03] [PASSED] DIN
[15:54:03] [PASSED] DP
[15:54:03] [PASSED] HDMI-A
[15:54:03] [PASSED] HDMI-B
[15:54:03] [PASSED] TV
[15:54:03] [PASSED] eDP
[15:54:03] [PASSED] Virtual
[15:54:03] [PASSED] DSI
[15:54:03] [PASSED] DPI
[15:54:03] [PASSED] Writeback
[15:54:03] [PASSED] SPI
[15:54:03] [PASSED] USB
[15:54:03] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[15:54:03] ======== drm_test_drm_connector_dynamic_init_name  =========
[15:54:03] [PASSED] Unknown
[15:54:03] [PASSED] VGA
[15:54:03] [PASSED] DVI-I
[15:54:03] [PASSED] DVI-D
[15:54:03] [PASSED] DVI-A
[15:54:03] [PASSED] Composite
[15:54:03] [PASSED] SVIDEO
[15:54:03] [PASSED] LVDS
[15:54:03] [PASSED] Component
[15:54:03] [PASSED] DIN
[15:54:03] [PASSED] DP
[15:54:03] [PASSED] HDMI-A
[15:54:03] [PASSED] HDMI-B
[15:54:03] [PASSED] TV
[15:54:03] [PASSED] eDP
[15:54:03] [PASSED] Virtual
[15:54:03] [PASSED] DSI
[15:54:03] [PASSED] DPI
[15:54:03] [PASSED] Writeback
[15:54:03] [PASSED] SPI
[15:54:03] [PASSED] USB
[15:54:03] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[15:54:03] =========== [PASSED] drm_connector_dynamic_init ============
[15:54:03] ==== drm_connector_dynamic_register_early (4 subtests) =====
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[15:54:03] ====== [PASSED] drm_connector_dynamic_register_early =======
[15:54:03] ======= drm_connector_dynamic_register (7 subtests) ========
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[15:54:03] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[15:54:03] ========= [PASSED] drm_connector_dynamic_register ==========
[15:54:03] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[15:54:03] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[15:54:03] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[15:54:03] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[15:54:03] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[15:54:03] ========== drm_test_get_tv_mode_from_name_valid  ===========
[15:54:03] [PASSED] NTSC
[15:54:03] [PASSED] NTSC-443
[15:54:03] [PASSED] NTSC-J
[15:54:03] [PASSED] PAL
[15:54:03] [PASSED] PAL-M
[15:54:03] [PASSED] PAL-N
[15:54:03] [PASSED] SECAM
[15:54:03] [PASSED] Mono
[15:54:03] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[15:54:03] [PASSED] drm_test_get_tv_mode_from_name_truncated
[15:54:03] ============ [PASSED] drm_get_tv_mode_from_name ============
[15:54:03] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[15:54:03] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[15:54:03] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[15:54:03] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[15:54:03] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[15:54:03] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[15:54:03] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[15:54:03] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid  =
[15:54:03] [PASSED] VIC 96
[15:54:03] [PASSED] VIC 97
[15:54:03] [PASSED] VIC 101
[15:54:03] [PASSED] VIC 102
[15:54:03] [PASSED] VIC 106
[15:54:03] [PASSED] VIC 107
[15:54:03] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[15:54:03] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[15:54:03] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[15:54:03] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[15:54:03] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[15:54:03] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[15:54:03] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[15:54:03] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[15:54:03] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name  ====
[15:54:03] [PASSED] Automatic
[15:54:03] [PASSED] Full
[15:54:03] [PASSED] Limited 16:235
[15:54:03] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[15:54:03] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[15:54:03] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[15:54:03] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[15:54:03] === drm_test_drm_hdmi_connector_get_output_format_name  ====
[15:54:03] [PASSED] RGB
[15:54:03] [PASSED] YUV 4:2:0
[15:54:03] [PASSED] YUV 4:2:2
[15:54:03] [PASSED] YUV 4:4:4
[15:54:03] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[15:54:03] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[15:54:03] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[15:54:03] ============= drm_damage_helper (21 subtests) ==============
[15:54:03] [PASSED] drm_test_damage_iter_no_damage
[15:54:03] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[15:54:03] [PASSED] drm_test_damage_iter_no_damage_src_moved
[15:54:03] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[15:54:03] [PASSED] drm_test_damage_iter_no_damage_not_visible
[15:54:03] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[15:54:03] [PASSED] drm_test_damage_iter_no_damage_no_fb
[15:54:03] [PASSED] drm_test_damage_iter_simple_damage
[15:54:03] [PASSED] drm_test_damage_iter_single_damage
[15:54:03] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[15:54:03] [PASSED] drm_test_damage_iter_single_damage_outside_src
[15:54:03] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[15:54:03] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[15:54:03] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[15:54:03] [PASSED] drm_test_damage_iter_single_damage_src_moved
[15:54:03] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[15:54:03] [PASSED] drm_test_damage_iter_damage
[15:54:03] [PASSED] drm_test_damage_iter_damage_one_intersect
[15:54:03] [PASSED] drm_test_damage_iter_damage_one_outside
[15:54:03] [PASSED] drm_test_damage_iter_damage_src_moved
[15:54:03] [PASSED] drm_test_damage_iter_damage_not_visible
[15:54:03] ================ [PASSED] drm_damage_helper ================
[15:54:03] ============== drm_dp_mst_helper (3 subtests) ==============
[15:54:03] ============== drm_test_dp_mst_calc_pbn_mode  ==============
[15:54:03] [PASSED] Clock 154000 BPP 30 DSC disabled
[15:54:03] [PASSED] Clock 234000 BPP 30 DSC disabled
[15:54:03] [PASSED] Clock 297000 BPP 24 DSC disabled
[15:54:03] [PASSED] Clock 332880 BPP 24 DSC enabled
[15:54:03] [PASSED] Clock 324540 BPP 24 DSC enabled
[15:54:03] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[15:54:03] ============== drm_test_dp_mst_calc_pbn_div  ===============
[15:54:03] [PASSED] Link rate 2000000 lane count 4
[15:54:03] [PASSED] Link rate 2000000 lane count 2
[15:54:03] [PASSED] Link rate 2000000 lane count 1
[15:54:03] [PASSED] Link rate 1350000 lane count 4
[15:54:03] [PASSED] Link rate 1350000 lane count 2
[15:54:03] [PASSED] Link rate 1350000 lane count 1
[15:54:03] [PASSED] Link rate 1000000 lane count 4
[15:54:03] [PASSED] Link rate 1000000 lane count 2
[15:54:03] [PASSED] Link rate 1000000 lane count 1
[15:54:03] [PASSED] Link rate 810000 lane count 4
[15:54:03] [PASSED] Link rate 810000 lane count 2
[15:54:03] [PASSED] Link rate 810000 lane count 1
[15:54:03] [PASSED] Link rate 540000 lane count 4
[15:54:03] [PASSED] Link rate 540000 lane count 2
[15:54:03] [PASSED] Link rate 540000 lane count 1
[15:54:03] [PASSED] Link rate 270000 lane count 4
[15:54:03] [PASSED] Link rate 270000 lane count 2
[15:54:03] [PASSED] Link rate 270000 lane count 1
[15:54:03] [PASSED] Link rate 162000 lane count 4
[15:54:03] [PASSED] Link rate 162000 lane count 2
[15:54:03] [PASSED] Link rate 162000 lane count 1
[15:54:03] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[15:54:03] ========= drm_test_dp_mst_sideband_msg_req_decode  =========
[15:54:03] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[15:54:03] [PASSED] DP_POWER_UP_PHY with port number
[15:54:03] [PASSED] DP_POWER_DOWN_PHY with port number
[15:54:03] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[15:54:03] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[15:54:03] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[15:54:03] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[15:54:03] [PASSED] DP_QUERY_PAYLOAD with port number
[15:54:03] [PASSED] DP_QUERY_PAYLOAD with VCPI
[15:54:03] [PASSED] DP_REMOTE_DPCD_READ with port number
[15:54:03] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[15:54:03] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[15:54:03] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[15:54:03] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[15:54:03] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[15:54:03] [PASSED] DP_REMOTE_I2C_READ with port number
[15:54:03] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[15:54:03] [PASSED] DP_REMOTE_I2C_READ with transactions array
[15:54:03] [PASSED] DP_REMOTE_I2C_WRITE with port number
[15:54:03] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[15:54:03] [PASSED] DP_REMOTE_I2C_WRITE with data array
[15:54:03] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[15:54:03] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[15:54:03] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[15:54:03] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[15:54:03] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[15:54:03] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[15:54:03] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[15:54:03] ================ [PASSED] drm_dp_mst_helper ================
[15:54:03] ================== drm_exec (7 subtests) ===================
[15:54:03] [PASSED] sanitycheck
[15:54:03] [PASSED] test_lock
[15:54:03] [PASSED] test_lock_unlock
[15:54:03] [PASSED] test_duplicates
[15:54:03] [PASSED] test_prepare
[15:54:03] [PASSED] test_prepare_array
[15:54:03] [PASSED] test_multiple_loops
[15:54:03] ==================== [PASSED] drm_exec =====================
[15:54:03] =========== drm_format_helper_test (17 subtests) ===========
[15:54:03] ============== drm_test_fb_xrgb8888_to_gray8  ==============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[15:54:03] ============= drm_test_fb_xrgb8888_to_rgb332  ==============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[15:54:03] ============= drm_test_fb_xrgb8888_to_rgb565  ==============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[15:54:03] ============ drm_test_fb_xrgb8888_to_xrgb1555  =============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[15:54:03] ============ drm_test_fb_xrgb8888_to_argb1555  =============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[15:54:03] ============ drm_test_fb_xrgb8888_to_rgba5551  =============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[15:54:03] ============= drm_test_fb_xrgb8888_to_rgb888  ==============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[15:54:03] ============= drm_test_fb_xrgb8888_to_bgr888  ==============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[15:54:03] ============ drm_test_fb_xrgb8888_to_argb8888  =============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[15:54:03] =========== drm_test_fb_xrgb8888_to_xrgb2101010  ===========
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[15:54:03] =========== drm_test_fb_xrgb8888_to_argb2101010  ===========
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[15:54:03] ============== drm_test_fb_xrgb8888_to_mono  ===============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[15:54:03] ==================== drm_test_fb_swab  =====================
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ================ [PASSED] drm_test_fb_swab =================
[15:54:03] ============ drm_test_fb_xrgb8888_to_xbgr8888  =============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[15:54:03] ============ drm_test_fb_xrgb8888_to_abgr8888  =============
[15:54:03] [PASSED] single_pixel_source_buffer
[15:54:03] [PASSED] single_pixel_clip_rectangle
[15:54:03] [PASSED] well_known_colors
[15:54:03] [PASSED] destination_pitch
[15:54:03] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[15:54:03] ================= drm_test_fb_clip_offset  =================
[15:54:03] [PASSED] pass through
[15:54:03] [PASSED] horizontal offset
[15:54:03] [PASSED] vertical offset
[15:54:03] [PASSED] horizontal and vertical offset
[15:54:03] [PASSED] horizontal offset (custom pitch)
[15:54:03] [PASSED] vertical offset (custom pitch)
[15:54:03] [PASSED] horizontal and vertical offset (custom pitch)
[15:54:03] ============= [PASSED] drm_test_fb_clip_offset =============
[15:54:03] =================== drm_test_fb_memcpy  ====================
[15:54:03] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[15:54:03] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[15:54:03] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[15:54:03] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[15:54:03] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[15:54:03] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[15:54:03] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[15:54:03] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[15:54:03] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[15:54:03] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[15:54:03] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[15:54:03] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[15:54:03] =============== [PASSED] drm_test_fb_memcpy ================
[15:54:03] ============= [PASSED] drm_format_helper_test ==============
[15:54:03] ================= drm_format (18 subtests) =================
[15:54:03] [PASSED] drm_test_format_block_width_invalid
[15:54:03] [PASSED] drm_test_format_block_width_one_plane
[15:54:03] [PASSED] drm_test_format_block_width_two_plane
[15:54:03] [PASSED] drm_test_format_block_width_three_plane
[15:54:03] [PASSED] drm_test_format_block_width_tiled
[15:54:03] [PASSED] drm_test_format_block_height_invalid
[15:54:03] [PASSED] drm_test_format_block_height_one_plane
[15:54:03] [PASSED] drm_test_format_block_height_two_plane
[15:54:03] [PASSED] drm_test_format_block_height_three_plane
[15:54:03] [PASSED] drm_test_format_block_height_tiled
[15:54:03] [PASSED] drm_test_format_min_pitch_invalid
[15:54:03] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[15:54:03] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[15:54:03] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[15:54:03] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[15:54:03] [PASSED] drm_test_format_min_pitch_two_plane
[15:54:03] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[15:54:03] [PASSED] drm_test_format_min_pitch_tiled
[15:54:03] =================== [PASSED] drm_format ====================
[15:54:03] ============== drm_framebuffer (10 subtests) ===============
[15:54:03] ========== drm_test_framebuffer_check_src_coords  ==========
[15:54:03] [PASSED] Success: source fits into fb
[15:54:03] [PASSED] Fail: overflowing fb with x-axis coordinate
[15:54:03] [PASSED] Fail: overflowing fb with y-axis coordinate
[15:54:03] [PASSED] Fail: overflowing fb with source width
[15:54:03] [PASSED] Fail: overflowing fb with source height
[15:54:03] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[15:54:03] [PASSED] drm_test_framebuffer_cleanup
[15:54:03] =============== drm_test_framebuffer_create  ===============
[15:54:03] [PASSED] ABGR8888 normal sizes
[15:54:03] [PASSED] ABGR8888 max sizes
[15:54:03] [PASSED] ABGR8888 pitch greater than min required
[15:54:03] [PASSED] ABGR8888 pitch less than min required
[15:54:03] [PASSED] ABGR8888 Invalid width
[15:54:03] [PASSED] ABGR8888 Invalid buffer handle
[15:54:03] [PASSED] No pixel format
[15:54:03] [PASSED] ABGR8888 Width 0
[15:54:03] [PASSED] ABGR8888 Height 0
[15:54:03] [PASSED] ABGR8888 Out of bound height * pitch combination
[15:54:03] [PASSED] ABGR8888 Large buffer offset
[15:54:03] [PASSED] ABGR8888 Buffer offset for inexistent plane
[15:54:03] [PASSED] ABGR8888 Invalid flag
[15:54:03] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[15:54:03] [PASSED] ABGR8888 Valid buffer modifier
[15:54:03] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[15:54:03] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[15:54:03] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[15:54:03] [PASSED] NV12 Normal sizes
[15:54:03] [PASSED] NV12 Max sizes
[15:54:03] [PASSED] NV12 Invalid pitch
[15:54:03] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[15:54:03] [PASSED] NV12 different  modifier per-plane
[15:54:03] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[15:54:03] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[15:54:03] [PASSED] NV12 Modifier for inexistent plane
[15:54:03] [PASSED] NV12 Handle for inexistent plane
[15:54:03] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[15:54:03] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[15:54:03] [PASSED] YVU420 Normal sizes
[15:54:03] [PASSED] YVU420 Max sizes
[15:54:03] [PASSED] YVU420 Invalid pitch
[15:54:03] [PASSED] YVU420 Different pitches
[15:54:03] [PASSED] YVU420 Different buffer offsets/pitches
[15:54:03] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[15:54:03] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[15:54:03] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[15:54:03] [PASSED] YVU420 Valid modifier
[15:54:03] [PASSED] YVU420 Different modifiers per plane
[15:54:03] [PASSED] YVU420 Modifier for inexistent plane
[15:54:03] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[15:54:03] [PASSED] X0L2 Normal sizes
[15:54:03] [PASSED] X0L2 Max sizes
[15:54:03] [PASSED] X0L2 Invalid pitch
[15:54:03] [PASSED] X0L2 Pitch greater than minimum required
[15:54:03] [PASSED] X0L2 Handle for inexistent plane
[15:54:03] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[15:54:03] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[15:54:03] [PASSED] X0L2 Valid modifier
[15:54:03] [PASSED] X0L2 Modifier for inexistent plane
[15:54:03] =========== [PASSED] drm_test_framebuffer_create ===========
[15:54:03] [PASSED] drm_test_framebuffer_free
[15:54:03] [PASSED] drm_test_framebuffer_init
[15:54:03] [PASSED] drm_test_framebuffer_init_bad_format
[15:54:03] [PASSED] drm_test_framebuffer_init_dev_mismatch
[15:54:03] [PASSED] drm_test_framebuffer_lookup
[15:54:03] [PASSED] drm_test_framebuffer_lookup_inexistent
[15:54:03] [PASSED] drm_test_framebuffer_modifiers_not_supported
[15:54:03] ================= [PASSED] drm_framebuffer =================
[15:54:03] ================ drm_gem_shmem (8 subtests) ================
[15:54:03] [PASSED] drm_gem_shmem_test_obj_create
[15:54:03] [PASSED] drm_gem_shmem_test_obj_create_private
[15:54:03] [PASSED] drm_gem_shmem_test_pin_pages
[15:54:03] [PASSED] drm_gem_shmem_test_vmap
[15:54:03] [PASSED] drm_gem_shmem_test_get_pages_sgt
[15:54:03] [PASSED] drm_gem_shmem_test_get_sg_table
[15:54:03] [PASSED] drm_gem_shmem_test_madvise
[15:54:03] [PASSED] drm_gem_shmem_test_purge
[15:54:03] ================== [PASSED] drm_gem_shmem ==================
[15:54:03] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[15:54:03] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420  =======
[15:54:03] [PASSED] Automatic
[15:54:03] [PASSED] Full
[15:54:03] [PASSED] Limited 16:235
[15:54:03] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[15:54:03] [PASSED] drm_test_check_disable_connector
[15:54:03] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[15:54:03] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[15:54:03] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[15:54:03] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[15:54:03] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[15:54:03] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[15:54:03] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[15:54:03] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[15:54:03] [PASSED] drm_test_check_output_bpc_dvi
[15:54:03] [PASSED] drm_test_check_output_bpc_format_vic_1
[15:54:03] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[15:54:03] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[15:54:03] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[15:54:03] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[15:54:03] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[15:54:03] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[15:54:03] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[15:54:03] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[15:54:03] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[15:54:03] [PASSED] drm_test_check_broadcast_rgb_value
[15:54:03] [PASSED] drm_test_check_bpc_8_value
[15:54:03] [PASSED] drm_test_check_bpc_10_value
[15:54:03] [PASSED] drm_test_check_bpc_12_value
[15:54:03] [PASSED] drm_test_check_format_value
[15:54:03] [PASSED] drm_test_check_tmds_char_value
[15:54:03] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[15:54:03] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[15:54:03] [PASSED] drm_test_check_mode_valid
[15:54:03] [PASSED] drm_test_check_mode_valid_reject
[15:54:03] [PASSED] drm_test_check_mode_valid_reject_rate
[15:54:03] [PASSED] drm_test_check_mode_valid_reject_max_clock
[15:54:03] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[15:54:03] ================= drm_managed (2 subtests) =================
[15:54:03] [PASSED] drm_test_managed_release_action
[15:54:03] [PASSED] drm_test_managed_run_action
[15:54:03] =================== [PASSED] drm_managed ===================
[15:54:03] =================== drm_mm (6 subtests) ====================
[15:54:03] [PASSED] drm_test_mm_init
[15:54:03] [PASSED] drm_test_mm_debug
[15:54:03] [PASSED] drm_test_mm_align32
[15:54:03] [PASSED] drm_test_mm_align64
[15:54:03] [PASSED] drm_test_mm_lowest
[15:54:03] [PASSED] drm_test_mm_highest
[15:54:03] ===================== [PASSED] drm_mm ======================
[15:54:03] ============= drm_modes_analog_tv (5 subtests) =============
[15:54:03] [PASSED] drm_test_modes_analog_tv_mono_576i
[15:54:03] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[15:54:03] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[15:54:03] [PASSED] drm_test_modes_analog_tv_pal_576i
[15:54:03] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[15:54:03] =============== [PASSED] drm_modes_analog_tv ===============
[15:54:03] ============== drm_plane_helper (2 subtests) ===============
[15:54:03] =============== drm_test_check_plane_state  ================
[15:54:03] [PASSED] clipping_simple
[15:54:03] [PASSED] clipping_rotate_reflect
[15:54:03] [PASSED] positioning_simple
[15:54:03] [PASSED] upscaling
[15:54:03] [PASSED] downscaling
[15:54:03] [PASSED] rounding1
[15:54:03] [PASSED] rounding2
[15:54:03] [PASSED] rounding3
[15:54:03] [PASSED] rounding4
[15:54:03] =========== [PASSED] drm_test_check_plane_state ============
[15:54:03] =========== drm_test_check_invalid_plane_state  ============
[15:54:03] [PASSED] positioning_invalid
[15:54:03] [PASSED] upscaling_invalid
[15:54:03] [PASSED] downscaling_invalid
[15:54:03] ======= [PASSED] drm_test_check_invalid_plane_state ========
[15:54:03] ================ [PASSED] drm_plane_helper =================
[15:54:03] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[15:54:03] ====== drm_test_connector_helper_tv_get_modes_check  =======
[15:54:03] [PASSED] None
[15:54:03] [PASSED] PAL
[15:54:03] [PASSED] NTSC
[15:54:03] [PASSED] Both, NTSC Default
[15:54:03] [PASSED] Both, PAL Default
[15:54:03] [PASSED] Both, NTSC Default, with PAL on command-line
[15:54:03] [PASSED] Both, PAL Default, with NTSC on command-line
[15:54:03] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[15:54:03] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[15:54:03] ================== drm_rect (9 subtests) ===================
[15:54:03] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[15:54:03] [PASSED] drm_test_rect_clip_scaled_not_clipped
[15:54:03] [PASSED] drm_test_rect_clip_scaled_clipped
[15:54:03] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[15:54:03] ================= drm_test_rect_intersect  =================
[15:54:03] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[15:54:03] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[15:54:03] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[15:54:03] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[15:54:03] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[15:54:03] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[15:54:03] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[15:54:03] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[15:54:03] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[15:54:03] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[15:54:03] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[15:54:03] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[15:54:03] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[15:54:03] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[15:54:03] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[15:54:03] ============= [PASSED] drm_test_rect_intersect =============
[15:54:03] ================ drm_test_rect_calc_hscale  ================
[15:54:03] [PASSED] normal use
[15:54:03] [PASSED] out of max range
[15:54:03] [PASSED] out of min range
[15:54:03] [PASSED] zero dst
[15:54:03] [PASSED] negative src
[15:54:03] [PASSED] negative dst
[15:54:03] ============ [PASSED] drm_test_rect_calc_hscale ============
[15:54:03] ================ drm_test_rect_calc_vscale  ================
[15:54:03] [PASSED] normal use
[15:54:03] [PASSED] out of max range
[15:54:03] [PASSED] out of min range
[15:54:03] [PASSED] zero dst
[15:54:03] [PASSED] negative src
[15:54:03] [PASSED] negative dst
[15:54:03] ============ [PASSED] drm_test_rect_calc_vscale ============
[15:54:03] ================== drm_test_rect_rotate  ===================
[15:54:03] [PASSED] reflect-x
[15:54:03] [PASSED] reflect-y
[15:54:03] [PASSED] rotate-0
[15:54:03] [PASSED] rotate-90
[15:54:03] [PASSED] rotate-180
[15:54:03] [PASSED] rotate-270
[15:54:03] ============== [PASSED] drm_test_rect_rotate ===============
[15:54:03] ================ drm_test_rect_rotate_inv  =================
[15:54:03] [PASSED] reflect-x
[15:54:03] [PASSED] reflect-y
[15:54:03] [PASSED] rotate-0
[15:54:03] [PASSED] rotate-90
[15:54:03] [PASSED] rotate-180
[15:54:03] [PASSED] rotate-270
[15:54:03] ============ [PASSED] drm_test_rect_rotate_inv =============
[15:54:03] ==================== [PASSED] drm_rect =====================
[15:54:03] ============ drm_sysfb_modeset_test (1 subtest) ============
[15:54:03] ============ drm_test_sysfb_build_fourcc_list  =============
[15:54:03] [PASSED] no native formats
[15:54:03] [PASSED] XRGB8888 as native format
[15:54:03] [PASSED] remove duplicates
[15:54:03] [PASSED] convert alpha formats
[15:54:03] [PASSED] random formats
[15:54:03] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[15:54:03] ============= [PASSED] drm_sysfb_modeset_test ==============
[15:54:03] ============================================================
[15:54:03] Testing complete. Ran 616 tests: passed: 616
[15:54:03] Elapsed time: 24.854s total, 1.579s configuring, 23.001s building, 0.242s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[15:54:03] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[15:54:05] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[15:54:13] Starting KUnit Kernel (1/1)...
[15:54:13] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[15:54:13] ================= ttm_device (5 subtests) ==================
[15:54:13] [PASSED] ttm_device_init_basic
[15:54:13] [PASSED] ttm_device_init_multiple
[15:54:13] [PASSED] ttm_device_fini_basic
[15:54:13] [PASSED] ttm_device_init_no_vma_man
[15:54:13] ================== ttm_device_init_pools  ==================
[15:54:13] [PASSED] No DMA allocations, no DMA32 required
[15:54:13] [PASSED] DMA allocations, DMA32 required
[15:54:13] [PASSED] No DMA allocations, DMA32 required
[15:54:13] [PASSED] DMA allocations, no DMA32 required
[15:54:13] ============== [PASSED] ttm_device_init_pools ==============
[15:54:13] =================== [PASSED] ttm_device ====================
[15:54:13] ================== ttm_pool (8 subtests) ===================
[15:54:13] ================== ttm_pool_alloc_basic  ===================
[15:54:13] [PASSED] One page
[15:54:13] [PASSED] More than one page
[15:54:13] [PASSED] Above the allocation limit
[15:54:13] [PASSED] One page, with coherent DMA mappings enabled
[15:54:13] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[15:54:13] ============== [PASSED] ttm_pool_alloc_basic ===============
[15:54:13] ============== ttm_pool_alloc_basic_dma_addr  ==============
[15:54:13] [PASSED] One page
[15:54:13] [PASSED] More than one page
[15:54:13] [PASSED] Above the allocation limit
[15:54:13] [PASSED] One page, with coherent DMA mappings enabled
[15:54:13] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[15:54:13] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[15:54:13] [PASSED] ttm_pool_alloc_order_caching_match
[15:54:13] [PASSED] ttm_pool_alloc_caching_mismatch
[15:54:13] [PASSED] ttm_pool_alloc_order_mismatch
[15:54:13] [PASSED] ttm_pool_free_dma_alloc
[15:54:13] [PASSED] ttm_pool_free_no_dma_alloc
[15:54:13] [PASSED] ttm_pool_fini_basic
[15:54:13] ==================== [PASSED] ttm_pool =====================
[15:54:13] ================ ttm_resource (8 subtests) =================
[15:54:13] ================= ttm_resource_init_basic  =================
[15:54:13] [PASSED] Init resource in TTM_PL_SYSTEM
[15:54:13] [PASSED] Init resource in TTM_PL_VRAM
[15:54:13] [PASSED] Init resource in a private placement
[15:54:13] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[15:54:13] ============= [PASSED] ttm_resource_init_basic =============
[15:54:13] [PASSED] ttm_resource_init_pinned
[15:54:13] [PASSED] ttm_resource_fini_basic
[15:54:13] [PASSED] ttm_resource_manager_init_basic
[15:54:13] [PASSED] ttm_resource_manager_usage_basic
[15:54:13] [PASSED] ttm_resource_manager_set_used_basic
[15:54:13] [PASSED] ttm_sys_man_alloc_basic
[15:54:13] [PASSED] ttm_sys_man_free_basic
[15:54:13] ================== [PASSED] ttm_resource ===================
[15:54:13] =================== ttm_tt (15 subtests) ===================
[15:54:13] ==================== ttm_tt_init_basic  ====================
[15:54:13] [PASSED] Page-aligned size
[15:54:13] [PASSED] Extra pages requested
[15:54:13] ================ [PASSED] ttm_tt_init_basic ================
[15:54:13] [PASSED] ttm_tt_init_misaligned
[15:54:13] [PASSED] ttm_tt_fini_basic
[15:54:13] [PASSED] ttm_tt_fini_sg
[15:54:13] [PASSED] ttm_tt_fini_shmem
[15:54:13] [PASSED] ttm_tt_create_basic
[15:54:13] [PASSED] ttm_tt_create_invalid_bo_type
[15:54:13] [PASSED] ttm_tt_create_ttm_exists
[15:54:13] [PASSED] ttm_tt_create_failed
[15:54:13] [PASSED] ttm_tt_destroy_basic
[15:54:13] [PASSED] ttm_tt_populate_null_ttm
[15:54:13] [PASSED] ttm_tt_populate_populated_ttm
[15:54:13] [PASSED] ttm_tt_unpopulate_basic
[15:54:13] [PASSED] ttm_tt_unpopulate_empty_ttm
[15:54:13] [PASSED] ttm_tt_swapin_basic
[15:54:13] ===================== [PASSED] ttm_tt ======================
[15:54:13] =================== ttm_bo (14 subtests) ===================
[15:54:13] =========== ttm_bo_reserve_optimistic_no_ticket  ===========
[15:54:13] [PASSED] Cannot be interrupted and sleeps
[15:54:13] [PASSED] Cannot be interrupted, locks straight away
[15:54:13] [PASSED] Can be interrupted, sleeps
[15:54:13] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[15:54:13] [PASSED] ttm_bo_reserve_locked_no_sleep
[15:54:13] [PASSED] ttm_bo_reserve_no_wait_ticket
[15:54:13] [PASSED] ttm_bo_reserve_double_resv
[15:54:13] [PASSED] ttm_bo_reserve_interrupted
[15:54:13] [PASSED] ttm_bo_reserve_deadlock
[15:54:13] [PASSED] ttm_bo_unreserve_basic
[15:54:13] [PASSED] ttm_bo_unreserve_pinned
[15:54:13] [PASSED] ttm_bo_unreserve_bulk
[15:54:13] [PASSED] ttm_bo_put_basic
[15:54:13] [PASSED] ttm_bo_put_shared_resv
[15:54:13] [PASSED] ttm_bo_pin_basic
[15:54:13] [PASSED] ttm_bo_pin_unpin_resource
[15:54:13] [PASSED] ttm_bo_multiple_pin_one_unpin
[15:54:13] ===================== [PASSED] ttm_bo ======================
[15:54:13] ============== ttm_bo_validate (22 subtests) ===============
[15:54:13] ============== ttm_bo_init_reserved_sys_man  ===============
[15:54:13] [PASSED] Buffer object for userspace
[15:54:13] [PASSED] Kernel buffer object
[15:54:13] [PASSED] Shared buffer object
[15:54:13] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[15:54:13] ============== ttm_bo_init_reserved_mock_man  ==============
[15:54:13] [PASSED] Buffer object for userspace
[15:54:13] [PASSED] Kernel buffer object
[15:54:13] [PASSED] Shared buffer object
[15:54:13] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[15:54:13] [PASSED] ttm_bo_init_reserved_resv
[15:54:13] ================== ttm_bo_validate_basic  ==================
[15:54:13] [PASSED] Buffer object for userspace
[15:54:13] [PASSED] Kernel buffer object
[15:54:13] [PASSED] Shared buffer object
[15:54:13] ============== [PASSED] ttm_bo_validate_basic ==============
[15:54:13] [PASSED] ttm_bo_validate_invalid_placement
[15:54:13] ============= ttm_bo_validate_same_placement  ==============
[15:54:13] [PASSED] System manager
[15:54:13] [PASSED] VRAM manager
[15:54:13] ========= [PASSED] ttm_bo_validate_same_placement ==========
[15:54:13] [PASSED] ttm_bo_validate_failed_alloc
[15:54:13] [PASSED] ttm_bo_validate_pinned
[15:54:13] [PASSED] ttm_bo_validate_busy_placement
[15:54:13] ================ ttm_bo_validate_multihop  =================
[15:54:13] [PASSED] Buffer object for userspace
[15:54:13] [PASSED] Kernel buffer object
[15:54:13] [PASSED] Shared buffer object
[15:54:13] ============ [PASSED] ttm_bo_validate_multihop =============
[15:54:13] ========== ttm_bo_validate_no_placement_signaled  ==========
[15:54:13] [PASSED] Buffer object in system domain, no page vector
[15:54:13] [PASSED] Buffer object in system domain with an existing page vector
[15:54:13] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[15:54:13] ======== ttm_bo_validate_no_placement_not_signaled  ========
[15:54:13] [PASSED] Buffer object for userspace
[15:54:13] [PASSED] Kernel buffer object
[15:54:13] [PASSED] Shared buffer object
[15:54:13] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[15:54:13] [PASSED] ttm_bo_validate_move_fence_signaled
[15:54:13] ========= ttm_bo_validate_move_fence_not_signaled  =========
[15:54:13] [PASSED] Waits for GPU
[15:54:13] [PASSED] Tries to lock straight away
[15:54:13] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[15:54:13] [PASSED] ttm_bo_validate_swapout
[15:54:13] [PASSED] ttm_bo_validate_happy_evict
[15:54:13] [PASSED] ttm_bo_validate_all_pinned_evict
[15:54:13] [PASSED] ttm_bo_validate_allowed_only_evict
[15:54:13] [PASSED] ttm_bo_validate_deleted_evict
[15:54:13] [PASSED] ttm_bo_validate_busy_domain_evict
[15:54:13] [PASSED] ttm_bo_validate_evict_gutting
[15:54:13] [PASSED] ttm_bo_validate_recrusive_evict
[15:54:13] ================= [PASSED] ttm_bo_validate =================
[15:54:13] ============================================================
[15:54:13] Testing complete. Ran 102 tests: passed: 102
[15:54:13] Elapsed time: 10.129s total, 1.605s configuring, 7.907s building, 0.534s running

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel




* ✗ CI.checksparse: warning for Handle Firmware reported Hardware Errors (rev3)
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (8 preceding siblings ...)
  2025-07-02 15:54 ` ✓ CI.KUnit: success " Patchwork
@ 2025-07-02 16:17 ` Patchwork
  2025-07-02 16:39 ` ✓ Xe.CI.BAT: success " Patchwork
  2025-07-04  6:45 ` ✗ Xe.CI.Full: failure " Patchwork
  11 siblings, 0 replies; 36+ messages in thread
From: Patchwork @ 2025-07-02 16:17 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe

== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev3)
URL   : https://patchwork.freedesktop.org/series/149756/
State : warning

== Summary ==

+ trap cleanup EXIT
+ KERNEL=/kernel
+ MT=/root/linux/maintainer-tools
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools /root/linux/maintainer-tools
Cloning into '/root/linux/maintainer-tools'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ make -C /root/linux/maintainer-tools
make: Entering directory '/root/linux/maintainer-tools'
cc -O2 -g -Wextra -o remap-log remap-log.c
make: Leaving directory '/root/linux/maintainer-tools'
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ /root/linux/maintainer-tools/dim sparse --fast 94631e6b7f655b1922e3227970cd6180200694af
Sparse version: 0.6.4 (Ubuntu: 0.6.4-4ubuntu3)
Fast mode used, each commit won't be checked separately.
-
+drivers/gpu/drm/drm_drv.c:452:6: warning: context imbalance in 'drm_dev_enter' - different lock contexts for basic block
+drivers/gpu/drm/drm_drv.c: note: in included file (through include/linux/notifier.h, arch/x86/include/asm/uprobes.h, include/linux/uprobes.h, include/linux/mm_types.h, include/linux/mmzone.h, include/linux/gfp.h, ...):
+drivers/gpu/drm/drm_plane.c:213:24: warning: Using plain integer as NULL pointer
+drivers/gpu/drm/i915/display/intel_alpm.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_cdclk.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_ddi.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2019:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2032:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2032:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_display_types.h:2032:24: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/display/intel_hdcp.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_hotplug.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_pps.c: note: in included file:
+drivers/gpu/drm/i915/display/intel_psr.c: note: in included file:
+drivers/gpu/drm/i915/gt/intel_reset.c:1572:12: warning: context imbalance in '_intel_gt_reset_lock' - different lock contexts for basic block
+drivers/gpu/drm/i915/gt/intel_sseu.c:598:17: error: too long token expansion
+drivers/gpu/drm/i915/i915_active.c:1063:16: warning: context imbalance in '__i915_active_fence_set' - different lock contexts for basic block
+drivers/gpu/drm/i915/i915_drm_client.c:92:9: error: incompatible types in comparison expression (different address spaces):
+drivers/gpu/drm/i915/i915_drm_client.c:92:9: error: incompatible types in comparison expression (different address spaces):
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    expected struct list_head const *list
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    got struct list_head [noderef] __rcu *pos
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    struct list_head *
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    struct list_head *
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    struct list_head [noderef] __rcu *
+drivers/gpu/drm/i915/i915_drm_client.c:92:9:    struct list_head [noderef] __rcu *
+drivers/gpu/drm/i915/i915_drm_client.c:92:9: warning: incorrect type in argument 1 (different address spaces)
+drivers/gpu/drm/i915/i915_irq.c:492:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:492:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:500:16: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:500:16: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:505:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:505:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:505:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:543:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:543:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:551:16: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:551:16: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:556:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:556:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:556:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:600:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:600:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:603:15: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:603:15: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:607:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:607:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:614:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:614:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:614:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/i915_irq.c:614:9: warning: unreplaced symbol '<noident>'
+drivers/gpu/drm/i915/intel_uncore.c:1927:1: warning: context imbalance in 'fwtable_read8' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1928:1: warning: context imbalance in 'fwtable_read16' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1929:1: warning: context imbalance in 'fwtable_read32' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1930:1: warning: context imbalance in 'fwtable_read64' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1995:1: warning: context imbalance in 'gen6_write8' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1996:1: warning: context imbalance in 'gen6_write16' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:1997:1: warning: context imbalance in 'gen6_write32' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:2017:1: warning: context imbalance in 'fwtable_write8' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:2018:1: warning: context imbalance in 'fwtable_write16' - unexpected unlock
+drivers/gpu/drm/i915/intel_uncore.c:2019:1: warning: context imbalance in 'fwtable_write32' - unexpected unlock
+drivers/gpu/drm/i915/intel_wakeref.c:145:19: warning: context imbalance in 'wakeref_auto_timeout' - unexpected unlock
+drivers/gpu/drm/ttm/ttm_bo.c:1199:31: warning: symbol 'ttm_swap_ops' was not declared. Should it be static?
+drivers/gpu/drm/ttm/ttm_bo_util.c:329:38:    expected void *virtual
+drivers/gpu/drm/ttm/ttm_bo_util.c:329:38:    got void [noderef] __iomem *
+drivers/gpu/drm/ttm/ttm_bo_util.c:329:38: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/ttm/ttm_bo_util.c:332:38:    expected void *virtual
+drivers/gpu/drm/ttm/ttm_bo_util.c:332:38:    got void [noderef] __iomem *
+drivers/gpu/drm/ttm/ttm_bo_util.c:332:38: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/ttm/ttm_bo_util.c:335:38:    expected void *virtual
+drivers/gpu/drm/ttm/ttm_bo_util.c:335:38:    got void [noderef] __iomem *
+drivers/gpu/drm/ttm/ttm_bo_util.c:335:38: warning: incorrect type in assignment (different address spaces)
+drivers/gpu/drm/ttm/ttm_bo_util.c:468:28:    expected void volatile [noderef] __iomem *addr
+drivers/gpu/drm/ttm/ttm_bo_util.c:468:28:    got void *virtual
+drivers/gpu/drm/ttm/ttm_bo_util.c:468:28: warning: incorrect type in argument 1 (different address spaces)
+./include/linux/srcu.h:400:9: warning: context imbalance in 'drm_dev_exit' - unexpected unlock

+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel




* ✓ Xe.CI.BAT: success for Handle Firmware reported Hardware Errors (rev3)
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (9 preceding siblings ...)
  2025-07-02 16:17 ` ✗ CI.checksparse: warning " Patchwork
@ 2025-07-02 16:39 ` Patchwork
  2025-07-04  6:45 ` ✗ Xe.CI.Full: failure " Patchwork
  11 siblings, 0 replies; 36+ messages in thread
From: Patchwork @ 2025-07-02 16:39 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe


== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev3)
URL   : https://patchwork.freedesktop.org/series/149756/
State : success

== Summary ==

CI Bug Log - changes from xe-3335-e46fcd77ceacb06a5411cb77b7344d6b82e1ab72_BAT -> xe-pw-149756v3_BAT
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  

Participating hosts (8 -> 8)
------------------------------

  No changes in participating hosts


Changes
-------

  No changes found


Build changes
-------------

  * Linux: xe-3335-e46fcd77ceacb06a5411cb77b7344d6b82e1ab72 -> xe-pw-149756v3

  IGT_8434: 5185b9527673518a418d575c3f58b5554e27f111 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  xe-3335-e46fcd77ceacb06a5411cb77b7344d6b82e1ab72: e46fcd77ceacb06a5411cb77b7344d6b82e1ab72
  xe-pw-149756v3: 149756v3

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v3/index.html


* ✗ Xe.CI.Full: failure for Handle Firmware reported Hardware Errors (rev3)
  2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
                   ` (10 preceding siblings ...)
  2025-07-02 16:39 ` ✓ Xe.CI.BAT: success " Patchwork
@ 2025-07-04  6:45 ` Patchwork
  11 siblings, 0 replies; 36+ messages in thread
From: Patchwork @ 2025-07-04  6:45 UTC (permalink / raw)
  To: Riana Tauro; +Cc: intel-xe


== Series Details ==

Series: Handle Firmware reported Hardware Errors (rev3)
URL   : https://patchwork.freedesktop.org/series/149756/
State : failure

== Summary ==

ERROR: The runconfig 'xe-3335-e46fcd77ceacb06a5411cb77b7344d6b82e1ab72_FULL' does not exist in the database

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-149756v3/index.html


Thread overview: 36+ messages
2025-07-02 14:11 [PATCH v3 0/7] Handle Firmware reported Hardware Errors Riana Tauro
2025-07-02 14:11 ` [PATCH v3 1/7] drm: Add a vendor-specific recovery method to device wedged uevent Riana Tauro
2025-07-03  4:06   ` Raag Jadav
2025-07-03  5:20     ` Riana Tauro
2025-07-03  6:40       ` Raag Jadav
2025-07-03  6:50         ` Riana Tauro
2025-07-02 14:11 ` [PATCH v3 2/7] drm/xe: Set GT as wedged before sending " Riana Tauro
2025-07-02 21:41   ` Rodrigo Vivi
2025-07-03  4:18   ` Raag Jadav
2025-07-03  5:18     ` Riana Tauro
2025-07-03  6:45       ` Raag Jadav
2025-07-07  6:44         ` Riana Tauro
2025-07-02 14:11 ` [PATCH v3 3/7] drm/xe/xe_survivability: Add support for Runtime survivability mode Riana Tauro
2025-07-02 21:40   ` Rodrigo Vivi
2025-07-03  5:16     ` Riana Tauro
2025-07-02 23:33   ` kernel test robot
2025-07-09 18:04   ` Summers, Stuart
2025-07-10  5:27     ` Riana Tauro
2025-07-15 17:30       ` Summers, Stuart
2025-07-02 14:11 ` [PATCH v3 4/7] drm/xe/doc: Document device wedged and runtime survivability Riana Tauro
2025-07-02 13:55   ` Riana Tauro
2025-07-03  7:19   ` Raag Jadav
2025-07-02 14:11 ` [PATCH v3 5/7] drm/xe: Add support to handle hardware errors Riana Tauro
2025-07-09 17:27   ` Summers, Stuart
2025-07-10  5:54     ` Riana Tauro
2025-07-02 14:11 ` [PATCH v3 6/7] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-07-02 21:35   ` Rodrigo Vivi
2025-07-03  5:28     ` Riana Tauro
2025-07-09 17:57   ` Summers, Stuart
2025-07-10  5:38     ` Riana Tauro
2025-07-02 14:11 ` [PATCH v3 7/7] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
2025-07-02 15:53 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev3) Patchwork
2025-07-02 15:54 ` ✓ CI.KUnit: success " Patchwork
2025-07-02 16:17 ` ✗ CI.checksparse: warning " Patchwork
2025-07-02 16:39 ` ✓ Xe.CI.BAT: success " Patchwork
2025-07-04  6:45 ` ✗ Xe.CI.Full: failure " Patchwork
